Document Types and Ingest Capacities
Supported File Types
There are four ways to upload content to GroundX via the API:
- Ingest Local: uploads individual files from your local filesystem.
- Ingest Remote: uploads individual files hosted on a remote URL.
- Ingest Directories: implemented in the Python SDK, this crawls a directory structure and uploads the files it finds. It is particularly convenient for large numbers of files because it manages batch uploading automatically.
- Crawl Website: crawls a website recursively and uploads the HTML to GroundX.
Of the endpoints which concern files, the currently supported file types are:
And the currently supported image types are:
Maximum File Size
If you are a free trial user, the per-file limits on document ingestion are:
- Using the Python SDK:
  - 25 MB per file (for ingest, ingest_directory, or ingest_remote)
  - 8 MB per file (for ingest_local)
- Using the TypeScript SDK:
  - 25 MB per file (for hosted files using either ingest or ingest_remote)
  - 8 MB per file (for local files using either ingest or ingest_local)
- Using APIs:
  - 25 MB per file (for hosted files using ingest_remote)
  - 8 MB per file (for local files using ingest_local)
If you are a subscription user, the per-file limits on document ingestion are:
- Using the Python SDK:
  - 50 MB per file (for ingest, ingest_directory, or ingest_remote)
  - 8 MB per file (for ingest_local)
- Using the TypeScript SDK:
  - 50 MB per file (for hosted files using either ingest or ingest_remote)
  - 8 MB per file (for local files using either ingest or ingest_local)
- Using APIs:
  - 50 MB per file (for hosted files using ingest_remote)
  - 8 MB per file (for local files using ingest_local)
Why the difference?
The Python SDK automatically uploads local files using the EyeLevel.ai file upload endpoint and temporary pre-signed URLs. The API and TypeScript SDK do not include this feature.
If you have files that are larger than the allowable file size limit, we recommend compressing assets (like images) within the file or dividing the file into several smaller files.
Ingest Directories has the same per-file ingest constraints as Ingest Local.
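As a rough illustration of the limits listed above, the sketch below checks a file size before upload. The key names ("trial", "subscription", "hosted", "local") and the helper functions are hypothetical and not part of any GroundX SDK, and it assumes 1 MB means 1,048,576 bytes, which this page does not specify.

```python
import os

MB = 1024 * 1024  # assumption: the documented limits use binary megabytes

# Per-file limits transcribed from the lists above, keyed by (plan, source).
# "hosted" is the 25/50 MB tier; "local" is the 8 MB ingest_local tier.
# These key names are illustrative only.
MAX_FILE_BYTES = {
    ("trial", "hosted"): 25 * MB,
    ("trial", "local"): 8 * MB,
    ("subscription", "hosted"): 50 * MB,
    ("subscription", "local"): 8 * MB,
}


def within_size_limit(size_bytes: int, plan: str, source: str) -> bool:
    """Return True if a file of size_bytes fits the limit for (plan, source)."""
    return size_bytes <= MAX_FILE_BYTES[(plan, source)]


def local_file_fits(path: str, plan: str) -> bool:
    """Check a file on disk against the 8 MB ingest_local limit."""
    return within_size_limit(os.path.getsize(path), plan, "local")
```

Running a check like this before calling an ingest function makes it easier to decide which files need to be compressed or split first.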
Maximum Concurrent Files
A maximum of 50 files can be ingested concurrently.
The Ingest Directories function in the Python SDK automatically handles batching uploads based on the batch_size parameter. Otherwise, batching can be done by using get_processing_status_by_id to wait for an uploaded batch to complete, as described in the Quickstart.
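The manual batching described above might be sketched as follows. The upload and wait_until_done callables are hypothetical placeholders: in practice they would wrap an ingest call and a polling loop around get_processing_status_by_id, whose exact signatures are not shown on this page.

```python
from typing import Callable, Iterator, List, Sequence

MAX_CONCURRENT_FILES = 50  # concurrency limit stated above


def batches(paths: Sequence[str],
            batch_size: int = MAX_CONCURRENT_FILES) -> Iterator[List[str]]:
    """Yield consecutive slices of paths, each at most batch_size long."""
    if not 1 <= batch_size <= MAX_CONCURRENT_FILES:
        raise ValueError("batch_size must be between 1 and 50")
    for start in range(0, len(paths), batch_size):
        yield list(paths[start:start + batch_size])


def ingest_in_batches(paths: Sequence[str],
                      upload: Callable[[List[str]], str],
                      wait_until_done: Callable[[str], None],
                      batch_size: int = MAX_CONCURRENT_FILES) -> int:
    """Upload one batch at a time, waiting for each to finish.

    upload(batch) is assumed to return a process ID, and
    wait_until_done(process_id) is assumed to block until that
    batch has completed. Both are supplied by the caller.
    Returns the number of batches submitted.
    """
    submitted = 0
    for batch in batches(paths, batch_size):
        process_id = upload(batch)
        wait_until_done(process_id)
        submitted += 1
    return submitted
```

Waiting for each batch before submitting the next keeps the number of in-flight files at or below the limit, at the cost of some idle time between batches.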
Document Type Specific Restrictions
PDF | PPTX | DOCX | HWP
Maximum Pages
For document types with pages, including PDF, PPTX, DOCX, and HWP, there is a maximum of 750 pages per document.
CSV | TSV | XLSX
Maximum Words
For document types without pages, including CSV, TSV, and XLSX, there is a maximum of 250,000 words for trial users and 500,000 words for subscription users.
Maximum Rows
For document types with rows, including CSV, TSV, and XLSX, there is a maximum of 1,500 rows (lines).
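A pre-flight check for delimited text files could look like the sketch below. It assumes a "word" is a whitespace-separated token and that the header line counts as a row; how GroundX actually counts either is not stated here, so treat the result as an estimate. The function names are illustrative.

```python
import csv
import io
from typing import Tuple

MAX_ROWS = 1_500
MAX_WORDS = {"trial": 250_000, "subscription": 500_000}


def count_rows_and_words(text: str, delimiter: str = ",") -> Tuple[int, int]:
    """Count parsed rows and whitespace-separated words in delimited text."""
    rows = 0
    words = 0
    for row in csv.reader(io.StringIO(text), delimiter=delimiter):
        rows += 1
        words += sum(len(cell.split()) for cell in row)
    return rows, words


def fits_tabular_limits(text: str, plan: str, delimiter: str = ",") -> bool:
    """Return True if text is within both the row and the word limit."""
    rows, words = count_rows_and_words(text, delimiter)
    return rows <= MAX_ROWS and words <= MAX_WORDS[plan]
```

For TSV input, pass delimiter="\t". XLSX files are not plain text and would need a spreadsheet library to read first.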
TXT
Maximum Words
For raw text files, there is a maximum of 250,000 words for trial users and 500,000 words for subscription users.
JSON
Maximum File Size
For JSON files, there is a maximum file size of 5 MB. This restriction is specific to JSON files and supersedes the file size restrictions described above.
Maximum Levels
For JSON files, there is a maximum of 20 levels of nesting for any JSON object. This refers to dictionaries or arrays that contain nested dictionaries or arrays.
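The two JSON limits above can be checked locally with a sketch like the following. It assumes the outermost dictionary or array counts as level 1 and that 5 MB means 5 × 1,048,576 bytes; neither convention is confirmed on this page, and the function names are illustrative.

```python
import json

MAX_JSON_BYTES = 5 * 1024 * 1024  # assumption: binary megabytes
MAX_JSON_LEVELS = 20


def nesting_depth(value) -> int:
    """Return the deepest level of dict/list nesting in a parsed JSON value.

    Scalars have depth 0; an empty or flat container has depth 1.
    Uses an explicit stack rather than recursion.
    """
    deepest = 0
    stack = [(value, 1)]
    while stack:
        node, depth = stack.pop()
        if isinstance(node, dict):
            children = list(node.values())
        elif isinstance(node, list):
            children = node
        else:
            continue  # scalars add no nesting
        deepest = max(deepest, depth)
        stack.extend((child, depth + 1) for child in children)
    return deepest


def fits_json_limits(raw: bytes) -> bool:
    """Return True if raw JSON bytes satisfy both the size and depth limits."""
    if len(raw) > MAX_JSON_BYTES:
        return False
    return nesting_depth(json.loads(raw)) <= MAX_JSON_LEVELS
```

A file that fails the depth check usually needs to be flattened or split into several shallower documents before ingestion.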
Visual X-Ray Support
Processed documents of every supported document type include an x-ray analysis that is accessible via API and can be downloaded in the dashboard.
Some document types do not go through the visual layout analysis pipeline and, therefore, are not viewable in the visual x-ray viewer in the dashboard. The document types that ARE NOT viewable in the x-ray viewer in the dashboard are:
- csv
- json
- tsv
- txt
- xlsx
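If an application needs to decide whether to link to the dashboard x-ray viewer, the list above can be encoded as a simple extension lookup. This is only a sketch: the function name is hypothetical, and it assumes the file is already one of the supported types, so an unsupported extension would also return True.

```python
from pathlib import Path

# Extensions listed above as not viewable in the dashboard x-ray viewer.
NOT_VIEWABLE_IN_XRAY_VIEWER = {"csv", "json", "tsv", "txt", "xlsx"}


def viewable_in_xray_viewer(filename: str) -> bool:
    """Return True unless the file's extension is on the not-viewable list."""
    extension = Path(filename).suffix.lower().lstrip(".")
    return extension not in NOT_VIEWABLE_IN_XRAY_VIEWER
```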
Website Crawling
The GroundX ingestion pipeline can also crawl and ingest the content from websites using the Crawl Website endpoint.
For trial users, a crawl supports a maximum of 500 pages and a maximum crawl depth of 5. For subscribers, a crawl supports a maximum of 2,000 pages and a maximum crawl depth of 8.
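One way to keep crawl requests within these limits is to clamp the requested values before sending them. The sketch below uses hypothetical names (max_pages, max_depth, clamp_crawl_settings); the actual parameter names of the Crawl Website endpoint are not given on this page.

```python
# Crawl limits transcribed from the paragraph above.
CRAWL_LIMITS = {
    "trial": {"max_pages": 500, "max_depth": 5},
    "subscription": {"max_pages": 2000, "max_depth": 8},
}


def clamp_crawl_settings(plan: str, pages: int, depth: int) -> dict:
    """Reduce requested pages and depth to the plan's maximums if needed."""
    limits = CRAWL_LIMITS[plan]
    return {
        "max_pages": min(pages, limits["max_pages"]),
        "max_depth": min(depth, limits["max_depth"]),
    }
```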