Quickstart

GroundX makes it simple and performant to integrate data from complex human documents into LLM-powered applications. This quickstart walks through the core GroundX workflow: how to upload complex documents for parsing, how those documents are stored, and how to query them for use in LLM-powered applications.

This guide covers the following:

  1. How to get your GroundX API Key
  2. Setting up GroundX
  3. How to create a new bucket
  4. How to use an existing bucket
  5. How to upload content to a bucket
  6. How to check the status of your upload
  7. How to search for content
  8. How to use your search results to augment an LLM (i.e. RAG)

Step 1: Getting Your API Key

Before you can use our APIs, you will need to create an account.

Log into the GroundX Dashboard and navigate to API Keys.

Copy your API Key and save it somewhere for use later in this tutorial.

Step 2: Setting Up GroundX

To use the Python or TypeScript SDK, install the relevant package from your shell. For Python:

    pip install groundx

The GroundX client can then be set up as follows (Python shown):

    from groundx import GroundX

    # GROUNDX_API_KEY holds the API key you copied in Step 1.
    client = GroundX(
        api_key=GROUNDX_API_KEY,
    )
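
The client setup assumes GROUNDX_API_KEY is already defined. One common way to supply it, sketched here as an assumption rather than a GroundX requirement, is to read the key from an environment variable so it stays out of source code. The helper name load_api_key and the environment variable name are illustrative and not part of the SDK.

```python
import os


def load_api_key(env_var: str = "GROUNDX_API_KEY") -> str:
    """Return the API key stored in the given environment variable.

    The variable name is an assumption for this sketch; use whatever
    name your deployment already defines.
    """
    api_key = os.environ.get(env_var, "").strip()
    if not api_key:
        raise RuntimeError(f"Set the {env_var} environment variable first.")
    return api_key


# Usage (requires the groundx package and a real key):
# from groundx import GroundX
# client = GroundX(api_key=load_api_key())
```

Failing early with a clear message is usually easier to debug than an authentication error from the first API call.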

If you’re using a language other than Python or TypeScript, or if you don’t wish to install the Python or TypeScript SDKs, you can use cURL to communicate with the API via HTTP directly.

Step 3: Creating a New Bucket

When you upload a document to GroundX:

  • The document is passed through a vision model to identify key components.
  • Each component is passed through a different pipeline, depending on whether it is textual, graphical, or tabular, to re-represent its data as LLM-friendly text.
  • Contextual information from other parts of the document is baked into that text, creating a context-rich, LLM-friendly representation of each section of the document.

These re-represented components of the document are called “Semantic Objects”, and can be thought of as objects which contain fully contextualized ideas within the document.

After these semantic objects are created, they’re stored within a bucket.

Buckets are queryable containers that store semantic objects. You upload files to a bucket, the semantic objects derived from those files are stored in that bucket, and you can then search the bucket for semantic objects relevant to a natural language query.

Thus, to use GroundX, it’s useful to first create a bucket. You can give a bucket a name, which does not necessarily have to be unique, and you’ll get back a unique bucket_id which can be used to upload documents to and search from that bucket.

The following code creates a new bucket and gets back the bucket_id which we’ll use in future steps:

    # Create a new bucket

    response = client.buckets.create(
        name="your_bucket_name",
    )

    # Keep the bucket_id for the upload and search steps
    bucket_id = response.bucket.bucket_id

Step 4: Using an Existing Bucket

You can list all existing buckets in your GroundX account via the following command:

    # Print a list of existing buckets

    # Here, the `list` function is not casting, but rather
    # is calling the API which lists all buckets.
    buckets = client.buckets.list()

    # The response object can be cast to a dictionary for
    # legibility.
    print(buckets.dict())
    # The value of `buckets.dict()` will resemble the following:
    # {
    #     "buckets": [
    #         {
    #             "bucketId": 1,
    #             "created": "2023-10-03T08:59:39Z",
    #             "fileCount": 1,
    #             "fileSize": "3.1GB",
    #             "name": "name",
    #             "updated": "2023-10-03T08:59:39Z"
    #         },
    #         ...
    #     ]
    # }

This code can be used to retrieve the bucketId from an existing bucket.
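
Because bucket names are not necessarily unique, looking up a bucket by name can return more than one ID. The following is a minimal sketch of that lookup over a dictionary shaped like the sample response above. The helper find_bucket_ids and the sample data are hypothetical, and the key names are assumed to match the sample.

```python
from typing import Any, Dict, List


def find_bucket_ids(listing: Dict[str, Any], name: str) -> List[int]:
    """Return the bucketId of every bucket whose name matches exactly."""
    return [
        bucket["bucketId"]
        for bucket in listing.get("buckets", [])
        if bucket.get("name") == name
    ]


# Sample data shaped like the response shown above.
sample_listing = {
    "buckets": [
        {"bucketId": 1, "name": "reports", "fileCount": 1},
        {"bucketId": 2, "name": "invoices", "fileCount": 4},
        {"bucketId": 3, "name": "reports", "fileCount": 0},
    ]
}

print(find_bucket_ids(sample_listing, "reports"))  # [1, 3]
```

With a live client, the dictionary returned by buckets.dict() could be passed as listing.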

Step 5: Upload Content

To upload content to a bucket, the Document Upload API can be used. This will allow you to upload complex documents in a variety of formats to a particular bucket. In the upload process, the documents will automatically be parsed and the final representation stored in the bucket will be a set of semantic objects.

There are two main ways to upload a document: from a locally hosted file, or from one that is publicly hosted behind some endpoint.

Uploading a locally hosted document can be done with the following code:

    # Upload a local document to GroundX
    from groundx import Document

    ingest = client.ingest(
        documents=[
            Document(
                bucket_id=bucket_id,
                file_name="my_file1.txt",
                file_path="/local/path/file1.txt",
                file_type="txt",
                search_data=dict(
                    key="value",
                ),
            )
        ]
    )

Uploading remotely hosted documents can be done with the following code:

    # Upload a remotely hosted document to GroundX
    from groundx import Document

    ingest = client.ingest(
        documents=[
            Document(
                bucket_id=bucket_id,
                file_name="my_file1.txt",
                file_path="https://my.source.url.com/file1.txt",
                file_type="txt",
                search_data=dict(
                    key="value",
                ),
            )
        ]
    )

See the Ingest API for more information on arguments.
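
When uploading many files, it can help to derive the Document arguments from each path or URL instead of typing them by hand. The sketch below does this with the standard library only. The helper document_args is hypothetical, it assumes POSIX-style local paths, and it assumes the file extension corresponds to a file type GroundX accepts; check the Ingest API for the supported values.

```python
from pathlib import PurePosixPath
from typing import Any, Dict
from urllib.parse import urlparse


def document_args(bucket_id: int, location: str) -> Dict[str, Any]:
    """Build keyword arguments for one Document from a path or URL."""
    parsed = urlparse(location)
    is_url = parsed.scheme in ("http", "https")
    # For URLs, take the name from the URL path so query strings are ignored.
    path = PurePosixPath(parsed.path if is_url else location)
    return {
        "bucket_id": bucket_id,
        "file_name": path.name,
        "file_path": location,
        # Assumption: the extension matches a file type GroundX accepts.
        "file_type": path.suffix.lstrip(".").lower(),
    }


# Usage (requires the groundx package):
# from groundx import Document
# doc = Document(**document_args(bucket_id, "/local/path/file1.txt"))
```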

If your request is successful, regardless of whether the upload is from remote or local documents, you will receive a response that looks something like this:

    {
      "ingest": {
        "processId": "<unique_system_generated_id>",
        "status": "<enumerated_status>"
      }
    }

The processId can be used to check the status of the upload.

Step 6: Check the Status of Your Upload

The following request can be used to query the status of your upload:

    ingest = client.documents.get_processing_status_by_id(
        process_id=ingest.ingest.process_id
    )

Be sure to use the processId from the previous step.

If your request is successful, you will receive a response that looks something like this:

    {
      "ingest": {
        "processId": "<unique_system_generated_id>",
        "progress": {
          "complete": {
            "documents": [
              {
                "documentId": "<unique_system_generated_id>",
                "fileName": "<given_file_name>",
                "fileSize": "<files_size_total>",
                "fileType": "<file_type>",
                "bucketId": <your_bucket_id>,
                "processId": "<unique_system_generated_id>",
                "sourceUrl": "<document_url>",
                "status": "<enumerated_status>"
              }
            ],
            "total": 1
          }
        },
        "status": "<enumerated_status>"
      }
    }

The value of status will be one of queued, processing, error, or complete. You can use it, for instance, to wait for a document to finish processing by polling at regular intervals.
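
A minimal polling sketch, assuming the four status values listed above. The helper wait_for_ingest, its parameters, and the default interval and timeout are hypothetical. It takes any callable that returns the current status string, so it does not depend on a particular SDK call.

```python
import time
from typing import Callable


def wait_for_ingest(
    fetch_status: Callable[[], str],
    interval_seconds: float = 5.0,
    timeout_seconds: float = 600.0,
) -> str:
    """Poll until the status is "complete" or "error", then return it."""
    deadline = time.monotonic() + timeout_seconds
    while True:
        status = fetch_status()
        if status in ("complete", "error"):
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError(
                f"Ingest still {status!r} after {timeout_seconds} seconds"
            )
        time.sleep(interval_seconds)


# Usage with a live client. Reading `.ingest.status` as an attribute is an
# assumption based on the response shape shown above:
# status = wait_for_ingest(
#     lambda: client.documents.get_processing_status_by_id(
#         process_id=process_id
#     ).ingest.status
# )
```

Returning on error as well as complete lets the caller decide how to handle a failed ingest instead of waiting for the timeout.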

Step 7: Search Your Content

Make the following request to search your ingested content:

    search_response = client.search.content(
        id=bucket_id,
        query=query,
    )

You can also use projectId or groupId in place of bucketId in your search query. These will allow you to search an entire project, a group of buckets, or an individual bucket, respectively. Replace query with the query you want to use to search your content.

If your request is successful, you will receive a response that looks something like this:

    {
      "search": {
        "count": <int_number_of_results>,
        "query": "<your_query>",
        "score": <float_highest_relevance_score_in_results>,
        "text": "<combined_text_of_search_results>",
        "nextToken": "<token_for_next_set_of_results>",
        "results": [
          {
            "documentId": "<unique_system_generated_id>",
            "score": <float_relevance_score_of_result>,
            "searchData": {
              <document_metadata>
            },
            "sourceUrl": "<source_document_url>",
            "suggestedText": "<rewritten_text_for_LLM_completions>",
            "text": "<original_text_of_result>"
          }
        ]
      }
    }

If you need to look up a projectId, groupId, or bucketId, you can find them in the GroundX Dashboard or by querying for them using the APIs.
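
The nextToken field in the search response indicates that results can span several pages. The sketch below gathers results across pages from dictionaries shaped like that response. The helper collect_results and the fetch_page callable are hypothetical; how the token is passed back in a search request is not covered in this guide, so that detail is left to the callable you supply.

```python
from typing import Any, Callable, Dict, List, Optional


def collect_results(
    fetch_page: Callable[[Optional[str]], Dict[str, Any]],
    max_pages: int = 10,
) -> List[Dict[str, Any]]:
    """Gather search results across pages by following nextToken."""
    results: List[Dict[str, Any]] = []
    token: Optional[str] = None
    for _ in range(max_pages):
        search = fetch_page(token)["search"]
        results.extend(search.get("results", []))
        token = search.get("nextToken")
        if not token:  # an empty or missing token means no further pages
            break
    return results
```

The max_pages cap is a safeguard against unbounded loops and can be raised if you expect long result sets.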

Step 8: Using Search Results to Augment an LLM

After the search completes, search.text can be used to provide context to a language model. We strongly recommend you use search.text for your LLM completions. We provide search.results in case you want to create your own context from the search results. If you choose to do this rather than use search.text, we strongly recommend you use search.results[n].suggestedText for your context.

First, unpack search.text from the response:

    llm_text = search_response.search.text
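
If you prefer to assemble your own context from search.results, one possible approach is to join the suggestedText of each result, optionally skipping low-scoring results and capping the total length. The helper build_context, the score threshold, and the character budget below are hypothetical choices for illustration; the scale of score values is not specified in this guide.

```python
from typing import Any, Dict, List


def build_context(
    results: List[Dict[str, Any]],
    min_score: float = 0.0,
    max_chars: int = 8000,
) -> str:
    """Join suggestedText from results that meet min_score.

    max_chars limits the combined length of the texts themselves;
    the blank-line separators are not counted.
    """
    parts: List[str] = []
    used = 0
    for result in results:
        text = result.get("suggestedText", "")
        if not text or result.get("score", 0.0) < min_score:
            continue
        if used + len(text) > max_chars:
            break
        parts.append(text)
        used += len(text)
    return "\n\n".join(parts)
```

The returned string can then be used in place of llm_text in the steps that follow.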

Once you have relevant text for your request, you will need to combine the text with instructions and submit them to OpenAI.

Here is an example of what a completion instruction could look like:

You are a helpful virtual assistant that answers questions
using the content below. Your task is to create detailed answers
to the questions by combining your understanding of the world
with the content provided below. Do not share links.

Combine your completion instructions with your curated GroundX search results:

    import openai

    # Note: openai.ChatCompletion is the interface of openai
    # package versions before 1.0.
    completion = openai.ChatCompletion.create(
        model=openaiModel,
        messages=[
            {
                "role": "system",
                "content": """%s
    ===
    %s
    ===
    """
                % (instruction, llm_text),
            },
            {"role": "user", "content": query},
        ],
    )

Replace openaiModel with your preferred model, instruction with your completion instructions, llm_text with your curated GroundX search results, and query with your query. This is effectively an implementation of “Retrieval Augmented Generation” (RAG), in which an augmented prompt consisting of a query and retrieved context about that query is passed to a language model for generation.
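
The prompt assembly can also be separated from the API call, which makes it easy to inspect or test the augmented prompt before sending it. This is a small sketch that builds the same two-message structure used in the completion request above; the function name build_messages is hypothetical.

```python
from typing import Dict, List


def build_messages(instruction: str, context: str, query: str) -> List[Dict[str, str]]:
    """Return chat messages with the retrieved context in the system prompt."""
    # Same layout as the completion example: the instruction, then the
    # context delimited by "===" lines.
    system_content = f"{instruction}\n===\n{context}\n===\n"
    return [
        {"role": "system", "content": system_content},
        {"role": "user", "content": query},
    ]
```

The returned list can be passed as the messages argument of the completion request.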

More information about integrating GroundX with Chat