In-Depth Exploration of GroundX Document Ingest
Introduction
In this tutorial, we’ll cover how to add, or “ingest”, your files to GroundX.
With our proprietary ingest pipeline, your files undergo three critical processes:
- First, an object detection model is applied to your document. This allows GroundX to understand the key components, visual information, and formatting.
- Next, a variety of fine-tuned VLMs convert your data into a grounded textual representation that LLMs can understand.
- Finally, your data is passed through a contextualization pipeline to bake in contextual metadata about your document.
Unlike other RAG solutions that require you to convert your files into plain text, GroundX is compatible with a wide variety of file formats out of the box, allowing you to expose your document data directly to an LLM without custom configuration.
More information about document parsing can be found in our guide on GroundX Ingest for Parsing. In this article, we’ll focus on the ingest pipeline in general: how to ingest local files, remotely hosted files, directories, and so on.
Getting started
API Key
- Go to the GroundX dashboard to get your API key.
- GroundX can be installed for Python via pip install groundx
- GroundX can be installed for NPM via npm i -s groundx
Before we begin, make sure you have the following information:
- The ID of the GroundX bucket in which you wish to store your file. If you don’t have a bucket, you can create one with the buckets.create endpoint or through the GroundX dashboard.
- The local path or public URL of the file you want to upload.
You may also want to prepare the following optional values:
- The file name you wish to give your file once it’s in the GroundX bucket. This can be the name of the file being uploaded, or some different name.
- The file type. The following file types are accepted:
Example:
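One way to collect the required and optional values before calling the SDK is sketched below. The helper `describe_file` is hypothetical and not part of GroundX, the bucket ID and URL are placeholders, and deriving the file type from the extension is an assumption made for illustration; check the result against the accepted file types.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse


def describe_file(bucket_id, file_path, file_name=None, file_type=None):
    """Collect the ingest values in one dict (hypothetical helper)."""
    # For URLs, look only at the path component so query strings are ignored.
    raw = urlparse(file_path).path if "://" in file_path else file_path
    path = PurePosixPath(raw)
    return {
        "bucket_id": bucket_id,
        "file_path": file_path,
        # Default to the name of the file being uploaded.
        "file_name": file_name or path.name,
        # Assumption: the extension, without the dot, names the file type.
        "file_type": file_type or path.suffix.lstrip(".").lower(),
    }


info = describe_file(1234, "https://example.com/reports/q3.PDF?download=1")
print(info["file_name"], info["file_type"])
```

Passing `file_name` or `file_type` explicitly overrides the derived defaults, which matches the case where the stored name should differ from the uploaded file's name.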
Ingesting Individual Files
Now that we have a GroundX bucket we can upload content to, we can explore how ingest functions in GroundX. The simplest way to ingest content into GroundX is by uploading files one at a time.
First, you’ll need to set up authentication with the GroundX client.
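A minimal sketch of client setup follows. It assumes the key is stored in an environment variable named `GROUNDX_API_KEY` (a name chosen here, not mandated by GroundX) and that the Python SDK exposes a `GroundX(api_key=...)` constructor; verify both against the SDK reference.

```python
import os


def load_api_key(env=os.environ, name="GROUNDX_API_KEY"):
    """Read the API key from the environment; the variable name is an assumption."""
    key = env.get(name, "").strip()
    if not key:
        raise RuntimeError(f"Set {name} to the API key from the GroundX dashboard")
    return key


def make_client(api_key):
    """Build the SDK client, assuming a `GroundX(api_key=...)` constructor."""
    # Imported lazily so this module loads even when the SDK is not installed.
    from groundx import GroundX

    return GroundX(api_key=api_key)


# Usage (requires `pip install groundx` and a real key):
# client = make_client(load_api_key())
```

Keeping the key out of source code avoids committing it to version control by accident.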
Once you’ve authenticated your client, you can ingest a document into GroundX via the ingest endpoint.
The file_path specified in the ingest endpoint can be either a local path or a public URL.
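The request described above might be assembled as in the sketch below. It assumes the Python SDK provides a `Document` model and a `client.ingest(documents=[...])` method with these keyword names; `document_fields` and `ingest_file` are hypothetical helpers, and the bucket ID and paths are placeholders.

```python
def document_fields(bucket_id, file_path, file_name=None, file_type=None):
    """Keyword arguments for one document; unset optional values are omitted."""
    fields = {"bucket_id": bucket_id, "file_path": file_path}
    if file_name is not None:
        fields["file_name"] = file_name
    if file_type is not None:
        fields["file_type"] = file_type
    return fields


def ingest_file(client, **kwargs):
    """Submit one file, assuming the SDK's `Document` model and `client.ingest`."""
    # Lazy import: only needed when a request is actually sent.
    from groundx import Document

    return client.ingest(documents=[Document(**document_fields(**kwargs))])


# The same call shape covers a local path and a public URL.
local = document_fields(1234, "/data/contracts/msa.pdf", file_type="pdf")
remote = document_fields(1234, "https://example.com/msa.pdf", file_name="msa.pdf")

# Usage with an authenticated client:
# response = ingest_file(client, bucket_id=1234, file_path="/data/contracts/msa.pdf")
```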
After making the request, you should receive a response with processId and status. This response indicates that GroundX is uploading or ingesting your file into the indicated bucket.
The processId can be polled to get the most up-to-date upload status via the documents.get_processing_status_by_id endpoint.
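Polling could be wrapped in a small loop like the one below. This is a sketch, not the SDK's own behavior: `wait_for_ingest` is a hypothetical helper, the set of terminal status strings is an assumption, and the attribute path in the commented usage should be checked against the actual response object.

```python
import time

# Assumption: these status strings mark the end of processing.
TERMINAL_STATUSES = {"complete", "error", "cancelled"}


def wait_for_ingest(get_status, interval=5.0, timeout=600.0, sleep=time.sleep):
    """Call `get_status()` until it returns a terminal status or time runs out."""
    waited = 0.0
    while True:
        status = get_status()
        if status in TERMINAL_STATUSES:
            return status
        if waited >= timeout:
            raise TimeoutError(f"ingest still '{status}' after {timeout} seconds")
        sleep(interval)
        waited += interval


# With the SDK, `get_status` might look like this (attribute names assumed):
# get_status = lambda: client.documents.get_processing_status_by_id(
#     process_id=process_id
# ).ingest.status
# final_status = wait_for_ingest(get_status)
```

Taking `get_status` and `sleep` as parameters keeps the loop independent of the SDK, so it can be exercised with a stub before any real upload.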
Ingesting Directories
If you’re using the Python SDK, you can use the ingest_directory method to ingest the contents of a directory into a particular bucket.
This function asynchronously batch-uploads all of the documents within a directory tree, starting from the top-level path specified. It renders a tqdm progress bar and automatically polls for updates on the batch currently being uploaded.
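Before uploading a whole tree, it can help to preview which files it contains. The sketch below does that and then delegates to the SDK; `list_files` and `ingest_tree` are hypothetical helpers, and the `ingest_directory(bucket_id=..., path=...)` keyword names are an assumption to verify against the SDK reference. The preview counts every regular file, which may differ from the set the SDK actually uploads.

```python
from pathlib import Path


def list_files(top_level_path):
    """Return every regular file under the directory tree, sorted by path."""
    root = Path(top_level_path)
    if not root.is_dir():
        raise NotADirectoryError(root)
    return sorted(p for p in root.rglob("*") if p.is_file())


def ingest_tree(client, bucket_id, top_level_path):
    """Preview the tree, then hand it to the SDK (call signature assumed)."""
    files = list_files(top_level_path)
    print(f"{len(files)} files found under {top_level_path}")
    return client.ingest_directory(bucket_id=bucket_id, path=str(top_level_path))


# Usage with an authenticated client:
# ingest_tree(client, bucket_id=1234, top_level_path="/data/contracts")
```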
Adding extra search data
GroundX automatically generates contextual search data for your files. However, you can add extra search data to take maximum advantage of GroundX’s search capabilities, help maintain document context in the search query responses, and add tags or notes indicating instructions on how to handle the search results.
Example:
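As an illustration, extra search data can be prepared as a plain dictionary. The helper `build_search_data` and every key and value below are hypothetical; the sketch assumes the data is attached to a document through a `search_data` argument and must be JSON-serializable.

```python
import json


def build_search_data(**fields):
    """Return a JSON-serializable dict of extra search data for one document."""
    data = {key: value for key, value in fields.items() if value is not None}
    json.dumps(data)  # raises TypeError if a value cannot be sent as JSON
    return data


search_data = build_search_data(
    department="legal",  # hypothetical tag
    document_context="2023 master services agreement",
    handling_note="Quote clauses verbatim when answering.",
    reviewer=None,  # unset values are dropped
)

# Assumed usage: Document(..., search_data=search_data)
```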
Final details
Processing time depends on the size of your files. For upload restrictions like file and batch size, see the prompting and integration guide.
Once ingest completes, your content is ready for search and for automated response generation to your queries, without the custom preprocessing that other RAG solutions typically require.