Extract Data from Documents

Use GroundX when you want a document to come back as JSON your application can use. For example, a utility statement can return statement.account_number, statement.total_amount_due, and service.service_address.

This guide uses the GroundX Python SDK to create the workflow, upload a document, and read the extracted JSON.

If you’d rather have an agent draft and iterate on the YAML schema instead of writing it by hand, see Use GroundX With Your Agent; it walks through the same workflow with an agent doing the schema work.

Workflows are more general than extraction alone. The same mechanism can tune RAG output for a specific use case: inject custom prompts at any pipeline stage (document, section, or chunk), control chunk and section strategy, or route a step to a custom OpenAI-compatible LLM endpoint. This guide covers only the extraction-focused subset. The broader RAG-tuning capability isn’t yet documented as its own guide on this site.

What You’ll Do

Write a YAML file that names the JSON keys you want back.
Create a GroundX workflow from that YAML.
Upload a document so GroundX can run the workflow.
Call get_extract to read the extracted JSON.

1. Describe The JSON You Want Back

Create a YAML file. Top-level names such as statement and service become top-level objects in the returned JSON. Under each one, the fields key lists the values GroundX should extract.

statement.yaml

statement:
  prompt:
    instructions: Extract billing statement values from the document.
  fields:
    account_number:
      prompt:
        description: The utility account number printed on the statement.
        identifiers:
          - Account Number
          - Account #
        instructions: Return the account number exactly as printed.
        type: str
    due_date:
      prompt:
        description: The payment due date.
        identifiers:
          - Due Date
          - Payment Due
        instructions: Return the due date as YYYY-MM-DD.
        type: str
    total_amount_due:
      prompt:
        description: The final amount due for the statement.
        identifiers:
          - Total Amount Due
          - Amount Due
        instructions: Return only the numeric amount, without a currency symbol.
        type: float

service:
  prompt:
    instructions: Extract service location values from the statement.
  fields:
    service_address:
      prompt:
        description: The service address for the billed location.
        identifiers:
          - Service Address
          - Service Location
        instructions: Return the address exactly as printed.
        type: str

This YAML tells GroundX to return JSON like this:

{
  "statement": {
    "account_number": "123456789",
    "due_date": "2026-06-30",
    "total_amount_due": 128.55
  },
  "service": {
    "service_address": "100 Main St, Denver, CO 80202"
  }
}

Keep this file focused on the JSON your application needs. Use names your application will read, such as statement or service, not names that describe how extraction runs.
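
To make the name mapping concrete, here is a minimal sketch that lists the dotted JSON paths a schema asks for. It is not part of the GroundX SDK: expected_paths is a hypothetical helper, and schema is a hand-written Python mirror of statement.yaml reduced to its names, so the example runs without a YAML parser.

```python
from typing import Any


def expected_paths(schema: dict[str, Any]) -> list[str]:
    """List the dotted JSON paths (group.field) that a schema mapping asks for."""
    paths = []
    for group_name, group in schema.items():
        # Skip top-level keys that are not extraction groups (no "fields" mapping).
        fields = group.get("fields") if isinstance(group, dict) else None
        if not isinstance(fields, dict):
            continue
        for field_name in fields:
            paths.append(f"{group_name}.{field_name}")
    return paths


# Assumption: a hand-written mirror of statement.yaml, names only.
schema = {
    "statement": {
        "fields": {"account_number": {}, "due_date": {}, "total_amount_due": {}},
    },
    "service": {"fields": {"service_address": {}}},
}

for path in expected_paths(schema):
    print(path)
```

The printed paths are the keys your application would read from the returned JSON, which makes this a quick way to review naming before creating the workflow.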

2. Install The Extract Extra

Install the extract extra before using the extraction workflow helpers.

pip install "groundx[extract]"

Most applications can pass the YAML path directly to create or update. Load an extraction definition only when you need to inspect settings or reuse the same loaded definition across calls.

Custom workflow steps

For larger schemas, split extraction across named custom workflow steps. Each group uses workflow_step to choose the step that extracts it, and each field uses workflow_output_key to name the step output that maps back to the final JSON field. Keep each custom step to 20 fields or fewer. The platform’s hard limit is 30, but staying well under it improves extraction accuracy.

line-items.yaml

workflow:
  template:
    "{{LANGUAGE}}": English
    "{{LANGUAGE_UNKNOWN}}": ""
  custom_steps:
    - name: line_item_labels
      level: chunk
      kind: keys
      required_template_keys:
        - "{{LANGUAGE}}"

invoice:
  fields:
    invoice_number:
      prompt:
        description: The invoice number printed near the top of the document.
        identifiers:
          - Invoice Number
        instructions: Return the invoice number exactly as printed.
        type: str

line_items:
  workflow_step: line_item_labels
  fields:
    description:
      workflow_output_key: label
      prompt:
        description: The line-item description.
        identifiers:
          - Description
        instructions: Return the description for each line item.
        type: str

The returned JSON still uses the customer-facing YAML names, such as invoice.invoice_number and line_items[].description.
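
The 20-field guidance is easy to check before you create the workflow. The sketch below counts fields per workflow step in a schema mapping. It is an illustration, not SDK behavior: fields_per_step and oversized_steps are hypothetical helpers, schema is a hand-written mirror of line-items.yaml, and the "(default)" bucket is just a label chosen here for groups that do not name a workflow_step.

```python
from collections import Counter
from typing import Any

RECOMMENDED_MAX_FIELDS = 20  # guidance from this guide; the hard limit is 30


def fields_per_step(schema: dict[str, Any]) -> Counter:
    """Count fields per workflow step across all extraction groups."""
    counts: Counter = Counter()
    for group in schema.values():
        # Only mappings with a "fields" mapping are extraction groups.
        if not isinstance(group, dict) or not isinstance(group.get("fields"), dict):
            continue
        # Assumption: groups without workflow_step are bucketed as "(default)".
        step = group.get("workflow_step", "(default)")
        counts[step] += len(group["fields"])
    return counts


def oversized_steps(schema: dict[str, Any], limit: int = RECOMMENDED_MAX_FIELDS) -> list[str]:
    """Return the names of steps whose field count exceeds the limit."""
    return sorted(step for step, n in fields_per_step(schema).items() if n > limit)


# Assumption: a hand-written mirror of line-items.yaml, names only.
schema = {
    "workflow": {"custom_steps": [{"name": "line_item_labels"}]},
    "invoice": {"fields": {"invoice_number": {}}},
    "line_items": {
        "workflow_step": "line_item_labels",
        "fields": {"description": {"workflow_output_key": "label"}},
    },
}

print(dict(fields_per_step(schema)))
print(oversized_steps(schema))
```

If oversized_steps returns any names, consider splitting those groups across more custom steps.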

For chunk- or section-level custom steps such as line_item_labels above, get_extract can currently return a 404 in some hosted environments even though the workflow was attached before ingest and ran correctly. Structured output for these steps can surface first in the document’s X-Ray output (customChunkOutputs / customSectionOutputs) rather than in the document-level extract artifact that get_extract reads. A 404 in this case is known platform behavior, not a sign that the extraction failed or that you did something wrong.
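
One way to handle this in application code is to treat a 404 as "no document-level extract yet" and let every other error propagate. The sketch below is a hypothetical wrapper, not an SDK function. It assumes the error raised by the fetch call exposes the HTTP status as a status_code attribute; verify that against the SDK version you use. FakeNotFound and fake_fetch exist only so the example runs without a network call.

```python
from typing import Any, Callable, Optional


def get_extract_or_none(fetch: Callable[[str], Any], document_id: str) -> Optional[Any]:
    """Return the extract, or None when the fetch fails with HTTP 404."""
    try:
        return fetch(document_id)
    except Exception as exc:
        # Assumption: the raised error carries the HTTP status as `status_code`.
        if getattr(exc, "status_code", None) == 404:
            return None
        raise  # any other failure is a real error


class FakeNotFound(Exception):
    """Stand-in for an HTTP 404 error, used only for this demonstration."""

    status_code = 404


def fake_fetch(document_id: str) -> Any:
    raise FakeNotFound(document_id)


print(get_extract_or_none(fake_fetch, "doc-1"))  # prints: None
```

In an application you would pass client.documents.get_extract as fetch. When the helper returns None for a chunk- or section-level step, look in the document’s X-Ray output instead.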

3. Create And Assign The Workflow

Create the workflow from the YAML file. Then assign the workflow to the bucket where you will upload documents.

Use template for workflow-level prompt variables, such as {{LANGUAGE}} and {{LANGUAGE_UNKNOWN}}, that should be available while the workflow runs. Template values are strings.

Python

import os

from groundx import GroundX

client = GroundX(api_key=os.environ["GROUNDX_API_KEY"])
bucket_id = 1234

workflow_response = client.create_extraction_workflow(
    path="statement.yaml",
    name="statement extraction",
)

workflow_id = workflow_response.workflow.workflow_id
if workflow_id is None:
    raise RuntimeError("GroundX did not return a workflow ID")

client.workflows.add_to_id(id=bucket_id, workflow_id=workflow_id)

Use client.workflows.add_to_account(...) instead when the workflow should be the account default.

Update a workflow

Workflow updates send the full extraction settings, not a name-only patch. Pass the YAML path again, or pass a previously loaded definition when you intentionally want to reuse one.

Python

import os

from groundx import GroundX

client = GroundX(api_key=os.environ["GROUNDX_API_KEY"])

client.update_extraction_workflow(
    "workflow-id",
    path="statement.yaml",
    name="statement extraction",
)

Load from an existing workflow

Use this when you want to inspect or reuse the extraction definition that is already stored on a workflow.

Python

import os

from groundx import GroundX

client = GroundX(api_key=os.environ["GROUNDX_API_KEY"])

definition = client.load_extraction_definition(workflow_id="workflow-id")

See the SDK Method Reference at the end of this page for the full parameter list of load_extraction_definition, load_extraction_definition_from_yaml, load_extraction_definition_from_workflow, create_extraction_workflow, and update_extraction_workflow.

4. Upload A Document

Upload documents to the bucket that has the workflow assigned to it. Use process_level="full" so GroundX runs the workflow during ingest.

Python

import os
import time

from groundx import Document, GroundX

client = GroundX(api_key=os.environ["GROUNDX_API_KEY"])
bucket_id = 1234

ingest_response = client.ingest(
    documents=[
        Document(
            bucket_id=bucket_id,
            file_name="statement.pdf",
            file_type="pdf",
            file_path="https://example.com/statement.pdf",
            process_level="full",
        )
    ],
)

process_id = ingest_response.ingest.process_id

# Poll until ingest reaches a terminal status.
while True:
    ingest_response = client.documents.get_processing_status_by_id(
        process_id=process_id,
    )
    status = ingest_response.ingest.status

    if status in {"complete", "error", "cancelled"}:
        break

    time.sleep(3)

if status != "complete":
    raise RuntimeError(f"GroundX ingest ended with status: {status}")
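
The loop above polls until ingest reaches a terminal status, with no upper bound on how long it waits. A minimal sketch of the same loop with a deadline follows. wait_for_terminal_status is a hypothetical helper, not part of the SDK, and the default timeout is an arbitrary choice. It takes any zero-argument callable that returns the current status string, so it runs without a network call.

```python
import time
from typing import Callable

# Terminal statuses used by the polling loop in this guide.
TERMINAL_STATUSES = {"complete", "error", "cancelled"}


def wait_for_terminal_status(
    get_status: Callable[[], str],
    timeout_seconds: float = 600.0,
    interval_seconds: float = 3.0,
) -> str:
    """Poll get_status until it returns a terminal status or the deadline passes."""
    deadline = time.monotonic() + timeout_seconds
    while True:
        status = get_status()
        if status in TERMINAL_STATUSES:
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError(
                f"Ingest still '{status}' after {timeout_seconds} seconds"
            )
        time.sleep(interval_seconds)
```

With the client from the example above, the callable could be lambda: client.documents.get_processing_status_by_id(process_id=process_id).ingest.status. The helper uses time.monotonic so that changes to the system clock do not affect the deadline.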

5. Get The JSON Back

After ingest completes, find the processed document and request its extracted JSON.

Python

import os

from groundx import GroundX

client = GroundX(api_key=os.environ["GROUNDX_API_KEY"])
bucket_id = 1234
process_id = "PROCESS_ID_FROM_INGEST"

# List completed documents in the bucket.
lookup = client.documents.lookup(
    id=bucket_id,
    status="complete",
    n=100,
)

# Pick the document created by this ingest process.
document = next(
    (
        item
        for item in lookup.documents or []
        if item.process_id == process_id
    ),
    None,
)

if document is None:
    raise RuntimeError("Could not find the completed GroundX document")

extraction = client.documents.get_extract(document.document_id)
print(extraction)

The result uses the same names from statement.yaml.

6. Improve The Results

The first pass is rarely perfect. Reaching high accuracy is an iteration loop: compare the extracted JSON against the values you expect, then fix the smallest part of the YAML that explains each miss. Change one value at a time so you can tell what each edit did.
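
The comparison step can be sketched as a small diff over dotted paths. This assumes you have the extracted JSON as a plain Python dict, which may require converting whatever get_extract returns in your SDK version. flatten and find_misses are hypothetical helpers, and the actual values below are made up to show two kinds of miss.

```python
from typing import Any


def flatten(data: dict[str, Any], prefix: str = "") -> dict[str, Any]:
    """Flatten nested dicts into {"group.field": value}; lists are kept whole."""
    flat: dict[str, Any] = {}
    for key, value in data.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            flat.update(flatten(value, path))
        else:
            flat[path] = value
    return flat


def find_misses(expected: dict[str, Any], actual: dict[str, Any]) -> dict[str, tuple[Any, Any]]:
    """Map each mismatched path to (expected value, actual value or None)."""
    flat_actual = flatten(actual)
    return {
        path: (want, flat_actual.get(path))
        for path, want in flatten(expected).items()
        if flat_actual.get(path) != want
    }


expected = {
    "statement": {"account_number": "123456789", "due_date": "2026-06-30"},
    "service": {"service_address": "100 Main St, Denver, CO 80202"},
}
# Made-up result: wrong date format and a missing service address.
actual = {
    "statement": {"account_number": "123456789", "due_date": "06/30/2026"},
    "service": {},
}

for path, (want, got) in find_misses(expected, actual).items():
    print(f"{path}: expected {want!r}, got {got!r}")
```

Each reported path points at one YAML value to revisit, which fits the one-change-at-a-time loop.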

For a single wrong or missing value:

Edit that value’s description, identifiers, or instructions. Add the exact labels the document uses to identifiers, and add a negative example to instructions when the value is confused with a nearby one (for example, “Do not confuse the amount due with the previous balance”).
Ask for the value verbatim when formatting matters (“Return the amount exactly as printed, without a currency symbol”).
If the value is missing entirely, confirm the document was read correctly before tightening the prompt: inspect the document X-Ray. If the value is not in the parsed content at all, no prompt change will recover it; if it is there but not returned, broaden identifiers.

For a repeating group (line items, charges, transactions):

If rows are missing, instruct the group to capture every row, not just the first or the totals.
If subtotals or a summary line are being captured as rows, exclude them explicitly (“Do not include subtotal, tax, or total lines”).
If the same row appears twice, tell the group to return each row once.
When several fields in the group are wrong, improve the group-level prompt rather than each field individually.

If the returned JSON uses the wrong names, update the YAML names before tuning prompts. Then prepare the YAML again, update the GroundX workflow, ingest another document, and read the result again. Repeat until the values match what you expect.

SDK Method Reference

These helpers require the extract extra: pip install "groundx[extract]".

client.load_extraction_definition(…)

Loads one source and returns an extraction definition you can inspect or reuse with create or update.

workflow_id

string

Existing workflow ID to read.

path

string | Path

Path to a local YAML file. This is the normal application path.

yaml_text

string

YAML content as a string.

mapping

object

YAML content as a mapping. Use mapping_kind="workflow_extract" only when the mapping is an existing workflow extract value.

prepared

PreparedExtractionYaml

A prepared extraction YAML object from the advanced SDK preparation API.

mapping_kind

string

Use workflow_extract only when mapping is an existing workflow extract value.

request_options

RequestOptions

Request options for workflow-ID loading.

If workflow_id is set, the SDK loads from that workflow before considering YAML or prepared inputs. Otherwise pass exactly one of path, yaml_text, mapping, or prepared.

definition = client.load_extraction_definition(path="statement.yaml")
existing = client.load_extraction_definition(workflow_id="workflow-id")

Returns an ExtractionDefinition with the workflow extract settings and any workflow template, custom steps, output routes, and leaf fields from the selected source.
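
If your application builds these arguments dynamically, a pre-flight check in your own code can catch an ambiguous call before it reaches the SDK. The sketch below mirrors the documented rule (workflow_id wins, otherwise exactly one YAML source). It is not the SDK’s implementation, and choose_source is a hypothetical helper.

```python
from typing import Any, Optional

YAML_SOURCES = ("path", "yaml_text", "mapping", "prepared")


def choose_source(workflow_id: Optional[str] = None, **sources: Any) -> str:
    """Return the name of the source a load call would use, per the documented rule."""
    # workflow_id takes precedence over any YAML or prepared input.
    if workflow_id is not None:
        return "workflow_id"
    unknown = sorted(set(sources) - set(YAML_SOURCES))
    if unknown:
        raise TypeError(f"Unknown source argument(s): {', '.join(unknown)}")
    provided = [name for name in YAML_SOURCES if sources.get(name) is not None]
    if len(provided) != 1:
        raise ValueError(
            "Pass exactly one of path, yaml_text, mapping, or prepared; "
            f"got {provided or 'none'}"
        )
    return provided[0]


print(choose_source(path="statement.yaml"))
print(choose_source(workflow_id="workflow-id", path="statement.yaml"))
```

The first call reports path and the second reports workflow_id, matching the precedence described above.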

client.load_extraction_definition_from_yaml(…)

Explicit YAML-only alias for client.load_extraction_definition(...) when it is called with path=..., yaml_text=..., mapping=..., or prepared=....

definition = client.load_extraction_definition_from_yaml(path="statement.yaml")

client.load_extraction_definition_from_workflow(…)

Explicit workflow-only alias for client.load_extraction_definition(workflow_id=...).

workflow_id

string (required)

The workflow ID to read.

definition = client.load_extraction_definition_from_workflow("workflow-id")

Returns an ExtractionDefinition. If the stored workflow does not include the original authored YAML metadata, the returned definition is still reusable for create and update, but authored YAML inspection data is unavailable.

client.create_extraction_workflow(…)

Creates a workflow from an extraction definition or one YAML source.

definition

ExtractionDefinition

A definition returned by load_extraction_definition(...) or one of its explicit YAML/workflow aliases.

path

string | Path

Path to a local YAML file. This is the normal create path.

yaml_text

string

YAML content as a string.

mapping

object

YAML content as a mapping. Use mapping_kind="workflow_extract" only for an existing workflow extract value.

prepared

PreparedExtractionYaml

A prepared extraction YAML object from the advanced SDK preparation API.

mapping_kind

string

Use workflow_extract only when mapping is an existing workflow extract value.

name

string (required)

Workflow name.

request_options

RequestOptions

Request options forwarded to the workflow create call.

If definition is set, the SDK uses it before considering YAML or prepared inputs. Otherwise pass exactly one of path, yaml_text, mapping, or prepared.

bucket_id = 1234  # replace with the ID of the bucket that will receive documents

workflow_response = client.create_extraction_workflow(
    path="statement.yaml",
    name="statement extraction",
)
workflow_id = workflow_response.workflow.workflow_id
client.workflows.add_to_id(id=bucket_id, workflow_id=workflow_id)

Returns the normal workflow response. Assign the workflow to a bucket, group, or account after create.

client.update_extraction_workflow(…)

Updates an existing workflow from an extraction definition or one YAML source.

workflow_id

string (required)

The workflow ID to update.

definition

ExtractionDefinition

The full extraction definition to send.

path

string | Path

Path to a local YAML file. This is the normal update path.

yaml_text

string

YAML content as a string.

mapping

object

YAML content as a mapping. Use mapping_kind="workflow_extract" only for an existing workflow extract value.

prepared

PreparedExtractionYaml

A prepared extraction YAML object from the advanced SDK preparation API.

mapping_kind

string

Use workflow_extract only when mapping is an existing workflow extract value.

name

string

Optional workflow name to include with the full update.

request_options

RequestOptions

Request options forwarded to the workflow update call.

If definition is set, the SDK uses it before considering YAML or prepared inputs. Otherwise pass exactly one of path, yaml_text, mapping, or prepared. Update sends the full extraction workflow settings, not a name-only patch, so pass the YAML or definition again when custom settings should remain in effect.

client.update_extraction_workflow(
    "workflow-id",
    path="statement.yaml",
    name="statement extraction",
)

Returns the normal workflow response.