Extract Data from Documents

Use GroundX when you want a document to come back as JSON your application can use. For example, a utility statement can return statement.account_number, statement.total_amount_due, and service.service_address.

This guide uses the GroundX Python SDK to create the workflow, upload a document, and read the extracted JSON.

What You’ll Do

  1. Write a YAML file that names the JSON keys you want back.
  2. Create a GroundX workflow from that YAML.
  3. Upload a document so GroundX can run the workflow.
  4. Call get_extract to read the extracted JSON.

1. Describe The JSON You Want Back

Create a YAML file. Top-level names such as statement and service become top-level objects in the returned JSON. Under each one, the fields key lists the values GroundX should extract.

statement.yaml
statement:
  prompt:
    instructions: Extract billing statement values from the document.
  fields:
    account_number:
      prompt:
        description: The utility account number printed on the statement.
        identifiers:
          - Account Number
          # Quoted so YAML does not treat " #" as the start of a comment.
          - "Account #"
        instructions: Return the account number exactly as printed.
        type: str
    due_date:
      prompt:
        description: The payment due date.
        identifiers:
          - Due Date
          - Payment Due
        instructions: Return the due date as YYYY-MM-DD.
        type: str
    total_amount_due:
      prompt:
        description: The final amount due for the statement.
        identifiers:
          - Total Amount Due
          - Amount Due
        instructions: Return only the numeric amount, without a currency symbol.
        type: float

service:
  prompt:
    instructions: Extract service location values from the statement.
  fields:
    service_address:
      prompt:
        description: The service address for the billed location.
        identifiers:
          - Service Address
          - Service Location
        instructions: Return the address exactly as printed.
        type: str

This YAML tells GroundX to return JSON like this:

{
  "statement": {
    "account_number": "123456789",
    "due_date": "2026-06-30",
    "total_amount_due": 128.55
  },
  "service": {
    "service_address": "100 Main St, Denver, CO 80202"
  }
}

Keep this file focused on the JSON your application needs. Use names your application will read, such as statement or service, not names that describe how extraction runs.
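The dotted names used at the top of this guide, such as statement.account_number, map directly onto the nested JSON shown above. The following is a minimal sketch of how application code might read a value by its dotted name. The get_path helper is hypothetical and not part of the GroundX SDK, and the result dictionary is just the sample JSON from this section.

```python
from typing import Any

# Sample result copied from the JSON shown in this section.
result = {
    "statement": {
        "account_number": "123456789",
        "due_date": "2026-06-30",
        "total_amount_due": 128.55,
    },
    "service": {
        "service_address": "100 Main St, Denver, CO 80202",
    },
}


def get_path(data: dict[str, Any], dotted: str, default: Any = None) -> Any:
    """Follow a dotted name such as 'statement.due_date' through nested dicts.

    Hypothetical helper for illustration; returns `default` when any key
    along the path is missing.
    """
    current: Any = data
    for key in dotted.split("."):
        if not isinstance(current, dict) or key not in current:
            return default
        current = current[key]
    return current


print(get_path(result, "statement.account_number"))
print(get_path(result, "service.service_address"))
print(get_path(result, "statement.late_fee"))  # not in the YAML, so None
```

Returning a default instead of raising keeps the calling code simple when a document does not contain one of the values.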

2. Turn The YAML Into Workflow Settings

Use prepare_extraction_yaml to check the YAML and produce the setting you pass as extract when you create the workflow.

prepare_workflow.py
from pathlib import Path

from groundx.extract import prepare_extraction_yaml

yaml_text = Path("statement.yaml").read_text()
prepared = prepare_extraction_yaml(yaml_text)

extract_settings = prepared.workflow_groups

You do not need to inspect prepared.workflow_groups in most applications. Pass it to GroundX as the workflow’s extract setting.

3. Create And Assign The Workflow

Create the workflow with the settings from the previous step. Then assign the workflow to the bucket where you will upload documents.

create_workflow.py
import os
from pathlib import Path

from groundx import GroundX
from groundx.extract import prepare_extraction_yaml

client = GroundX(api_key=os.environ["GROUNDX_API_KEY"])
bucket_id = 1234  # replace with the ID of your bucket

# Check the YAML and build the workflow's extract setting.
prepared = prepare_extraction_yaml(Path("statement.yaml").read_text())

workflow_response = client.workflows.create(
    name="statement extraction",
    extract=prepared.workflow_groups,
)

workflow_id = workflow_response.workflow.workflow_id
if workflow_id is None:
    raise RuntimeError("GroundX did not return a workflow ID")

# Assign the workflow to the bucket that will receive the documents.
client.workflows.add_to_id(id=bucket_id, workflow_id=workflow_id)

Use client.workflows.add_to_account(...) instead when the workflow should be the account default.

4. Upload A Document

Upload documents to the bucket that has the workflow assigned to it. Use process_level="full" so GroundX runs the workflow during ingest.

ingest_statement.py
import os
import time

from groundx import Document, GroundX

client = GroundX(api_key=os.environ["GROUNDX_API_KEY"])
bucket_id = 1234  # the bucket that has the workflow assigned

ingest_response = client.ingest(
    documents=[
        Document(
            bucket_id=bucket_id,
            file_name="statement.pdf",
            file_type="pdf",
            file_path="https://example.com/statement.pdf",
            process_level="full",  # run the workflow during ingest
        )
    ],
)

process_id = ingest_response.ingest.process_id

# Poll every 3 seconds until ingest reaches a final status.
while True:
    ingest_response = client.documents.get_processing_status_by_id(
        process_id=process_id,
    )
    status = ingest_response.ingest.status

    if status in {"complete", "error", "cancelled"}:
        break

    time.sleep(3)

if status != "complete":
    raise RuntimeError(f"GroundX ingest ended with status: {status}")
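The loop above polls until ingest reaches a final status, and it has no upper bound on how long it waits. The following is a minimal sketch of the same polling pattern with a timeout. The wait_for_status helper and its parameters are hypothetical and not part of the GroundX SDK. The status check is passed in as a function so the sketch runs without a network call, and the "queued" and "processing" values in the demo are placeholder non-final statuses.

```python
import time
from typing import Callable

# Final statuses, taken from the ingest script above.
TERMINAL_STATUSES = {"complete", "error", "cancelled"}


def wait_for_status(
    fetch_status: Callable[[], str],
    timeout_seconds: float = 600.0,
    poll_seconds: float = 3.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> str:
    """Call fetch_status until it returns a final status or time runs out.

    Hypothetical helper for illustration. `sleep` and `clock` are injectable
    so the function can be exercised without real waiting.
    """
    deadline = clock() + timeout_seconds
    while True:
        status = fetch_status()
        if status in TERMINAL_STATUSES:
            return status
        if clock() >= deadline:
            raise TimeoutError(
                f"ingest still '{status}' after {timeout_seconds} seconds"
            )
        sleep(poll_seconds)


# Demo with a fake status source instead of a real GroundX call.
fake_statuses = iter(["queued", "processing", "complete"])
final_status = wait_for_status(lambda: next(fake_statuses), sleep=lambda _s: None)
print(final_status)
```

With the SDK, fetch_status could wrap the call from the script above, for example lambda: client.documents.get_processing_status_by_id(process_id=process_id).ingest.status.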

5. Get The JSON Back

After ingest completes, find the processed document and request its extracted JSON.

read_extraction.py
import os

from groundx import GroundX

client = GroundX(api_key=os.environ["GROUNDX_API_KEY"])
bucket_id = 1234
process_id = "PROCESS_ID_FROM_INGEST"  # the process_id returned by client.ingest

# List completed documents in the bucket (up to 100 here).
lookup = client.documents.lookup(
    id=bucket_id,
    status="complete",
    n=100,
)

# Pick the document that was created by this ingest process.
document = next(
    (
        item
        for item in lookup.documents or []
        if item.process_id == process_id
    ),
    None,
)

if document is None:
    raise RuntimeError("Could not find the completed GroundX document")

extraction = client.documents.get_extract(document.document_id)
print(extraction)

The result uses the same names as statement.yaml.
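Because the names match the YAML, application code can convert the result into its own types. The following is a minimal sketch, assuming the extracted result is available as a plain nested dictionary shaped like the sample JSON in step 1. This guide does not state the exact return type of get_extract, so that conversion is left out. The Statement class and to_statement function are hypothetical names for illustration.

```python
from dataclasses import dataclass
from datetime import date
from decimal import Decimal
from typing import Any


@dataclass(frozen=True)
class Statement:
    """Application-side view of one extracted utility statement."""

    account_number: str
    due_date: date
    total_amount_due: Decimal
    service_address: str


def to_statement(extracted: dict[str, Any]) -> Statement:
    """Convert the nested extraction dict into typed application values.

    Assumes the keys defined in statement.yaml are present; a missing key
    raises KeyError so the caller notices an incomplete extraction.
    """
    statement = extracted["statement"]
    service = extracted["service"]
    return Statement(
        account_number=str(statement["account_number"]),
        # The YAML asks for YYYY-MM-DD, which date.fromisoformat accepts.
        due_date=date.fromisoformat(statement["due_date"]),
        # Go through str() so the float is not expanded to its binary value.
        total_amount_due=Decimal(str(statement["total_amount_due"])),
        service_address=str(service["service_address"]),
    )


# Sample data copied from the JSON in step 1, not a real API response.
extracted = {
    "statement": {
        "account_number": "123456789",
        "due_date": "2026-06-30",
        "total_amount_due": 128.55,
    },
    "service": {"service_address": "100 Main St, Denver, CO 80202"},
}

parsed = to_statement(extracted)
print(parsed.due_date.year, parsed.total_amount_due)
```

Parsing the date and amount at this boundary means a value that ignores the YAML instructions fails early, which is a useful signal when tuning prompts.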

6. Improve The Results

When a value is missing or wrong, change the smallest part of the YAML that explains the miss.

  • If one value is wrong, edit that value’s description, identifiers, or instructions.
  • If several statement values are wrong, improve the prompt under statement:.
  • If the document parsed poorly, inspect the document X-Ray before changing prompts.
  • If the returned JSON uses the wrong names, update the YAML before tuning prompts.

Then prepare the YAML again, update the GroundX workflow, ingest another document, and read the result again.
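The checklist above can be sketched as a small comparison between the values you expected and the values GroundX returned. This is a rough illustration only: the triage function is hypothetical and not part of the GroundX SDK, it treats "several" as more than one wrong value in a group, and both inputs are assumed to be plain nested dictionaries shaped like the sample JSON in step 1.

```python
from typing import Any

Groups = dict[str, dict[str, Any]]


def triage(expected: Groups, extracted: Groups) -> dict[str, str]:
    """Suggest which part of the YAML to revisit for each group.

    Hypothetical helper: groups whose values all match are omitted.
    """
    advice: dict[str, str] = {}
    for group, fields in expected.items():
        if group not in extracted:
            # Wrong or missing names come before any prompt tuning.
            advice[group] = "group missing: check the names in the YAML"
            continue
        wrong = sorted(
            name
            for name, value in fields.items()
            if extracted[group].get(name) != value
        )
        if len(wrong) == 1:
            advice[group] = f"edit the field prompt for {wrong[0]}"
        elif len(wrong) > 1:
            advice[group] = f"improve the group prompt ({', '.join(wrong)})"
    return advice


# Hand-written sample data, not real GroundX output.
expected = {
    "statement": {
        "account_number": "123456789",
        "due_date": "2026-06-30",
        "total_amount_due": 128.55,
    },
    "service": {"service_address": "100 Main St, Denver, CO 80202"},
}
extracted = {
    "statement": {
        "account_number": "123456789",
        "due_date": "2026-07-30",  # wrong value
        "total_amount_due": 128.55,
    },
    "service": {"service_address": "100 Main St, Denver, CO 80202"},
}

print(triage(expected, extracted))
```

A check like this does not cover the case where the document parsed poorly. That still calls for inspecting the document X-Ray first.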