Document Types and Ingest Capacities

Supported File Types

There are four ways to upload content to GroundX via the API:

  • Ingest Local: uploads individual files from your local filesystem.
  • Ingest Remote: uploads individual files hosted on a remote URL.
  • Ingest Directories: implemented in the Python SDK, this automatically crawls a directory structure and uploads the files in batches. It is particularly convenient for uploading large numbers of files because it manages batching automatically.
  • Crawl Website: crawls a website recursively and uploads the HTML to GroundX.

For the endpoints that accept files, the currently supported file types are:

pdf, docx, pptx, xlsx, csv, tsv, json, txt, hwp

And the currently supported image types are:

bmp, gif (not animated), heif (or heic), ico, jpg (or jpeg), png, svg, tiff (or tif), webp
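A quick pre-upload check against the two lists above can catch unsupported files before they reach the API. The following is a minimal sketch, not part of the GroundX SDK: the helper name is_supported and the two constant names are assumptions made for illustration, and the check looks only at the file extension, so it cannot tell whether a GIF is animated.

```python
from pathlib import Path

# Extensions copied from the lists above, including the listed aliases
# (heic, jpeg, tif). The constant names are illustrative, not SDK names.
SUPPORTED_DOCUMENT_TYPES = {
    "pdf", "docx", "pptx", "xlsx", "csv", "tsv", "json", "txt", "hwp",
}
SUPPORTED_IMAGE_TYPES = {
    "bmp", "gif", "heif", "heic", "ico", "jpg", "jpeg",
    "png", "svg", "tiff", "tif", "webp",
}


def is_supported(path: str) -> bool:
    """Return True if the file extension appears in either list above.

    This inspects the extension only; it does not open the file, so it
    cannot detect an animated GIF or a mislabeled file.
    """
    ext = Path(path).suffix.lower().lstrip(".")
    return ext in SUPPORTED_DOCUMENT_TYPES or ext in SUPPORTED_IMAGE_TYPES
```

For example, is_supported("report.PDF") is true because the comparison is case-insensitive, while is_supported("archive.zip") is false.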

Maximum File Size

If you are a free trial user, restrictions on document ingestion per-file include:

If you are a subscription user, restrictions on document ingestion per-file include:

Why the difference?
The Python SDK will automatically upload local files using the EyeLevel.ai file upload endpoint and temporary pre-signed URLs. The API and TypeScript SDK do not include this feature.

If you have files that are larger than the allowable file size limit, we recommend compressing assets (like images) within the file or dividing the file into several smaller files.

Ingest Directories has the same per-file ingest constraints as Ingest Local.

Maximum Concurrent Files

A maximum of 50 files can be ingested concurrently.

The Ingest Directories function in the Python SDK automatically handles batching uploads based on the batch_size parameter. Otherwise, batching can be done by using get_processing_status_by_id to wait for an uploaded batch to complete, as described in the Quickstart.
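When batching by hand, the pattern is to split the file list into groups of at most 50, upload one group, and wait for it to finish before sending the next. The sketch below shows only that control flow under stated assumptions: upload_batch and is_complete are hypothetical callables that you supply (is_complete would typically wrap get_processing_status_by_id), and nothing here calls the GroundX SDK directly.

```python
import time
from typing import Callable, Iterator, List, Sequence

MAX_CONCURRENT_FILES = 50  # concurrent ingest limit stated above


def chunked(
    items: Sequence[str], batch_size: int = MAX_CONCURRENT_FILES
) -> Iterator[List[str]]:
    """Yield consecutive batches of at most batch_size items."""
    if not 1 <= batch_size <= MAX_CONCURRENT_FILES:
        raise ValueError(
            f"batch_size must be between 1 and {MAX_CONCURRENT_FILES}"
        )
    for start in range(0, len(items), batch_size):
        yield list(items[start:start + batch_size])


def ingest_in_batches(
    paths: Sequence[str],
    upload_batch: Callable[[List[str]], str],  # hypothetical: returns a process ID
    is_complete: Callable[[str], bool],        # hypothetical: polls ingest status
    batch_size: int = MAX_CONCURRENT_FILES,
    poll_seconds: float = 5.0,
) -> int:
    """Upload paths batch by batch, waiting for each batch to finish.

    Returns the number of batches submitted.
    """
    submitted = 0
    for batch in chunked(paths, batch_size):
        process_id = upload_batch(batch)
        # Block until the current batch is done so that no more than
        # batch_size files are in flight at any time.
        while not is_complete(process_id):
            time.sleep(poll_seconds)
        submitted += 1
    return submitted
```

Passing the two callables in keeps the sketch independent of any particular SDK version. A production version would likely also add a timeout and handle failed ingests.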

Document Type Specific Restrictions

PDF | PPTX | DOCX | HWP

Maximum Pages

For document types with pages (PDF, PPTX, DOCX, and HWP), there is a maximum of 750 pages.

CSV | TSV | XLSX

Maximum Words

For document types without pages (CSV, TSV, and XLSX), there is a maximum of 250,000 words for trial users and 500,000 words for subscription users.

Maximum Rows

For document types with rows (CSV, TSV, and XLSX), there is a maximum of 1,500 rows (lines).
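The word and row limits for delimited files can be estimated locally before upload. This is a rough sketch under several assumptions: the function name check_delimited_text is illustrative, a "word" is taken to be a whitespace-separated token, a "row" is a parsed CSV record including the header, and GroundX may count either differently. XLSX is not covered because reading it requires a third-party library.

```python
import csv
import io

MAX_ROWS = 1_500  # row limit stated above
MAX_WORDS = {"trial": 250_000, "subscription": 500_000}  # word limits stated above


def check_delimited_text(
    text: str, delimiter: str = ",", plan: str = "trial"
) -> dict:
    """Count rows and words in CSV/TSV text and compare them to the limits.

    Assumption: words are whitespace-separated tokens inside each cell,
    and every parsed record (header included) counts as one row.
    """
    rows = 0
    words = 0
    for row in csv.reader(io.StringIO(text), delimiter=delimiter):
        rows += 1
        words += sum(len(cell.split()) for cell in row)
    return {
        "rows": rows,
        "words": words,
        "within_limits": rows <= MAX_ROWS and words <= MAX_WORDS[plan],
    }
```

For a TSV file, pass delimiter="\t". Treat the result as an estimate, since the exact counting rules used during ingest are not specified here.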

TXT

Maximum Words

For raw text files, there is a maximum of 250,000 words for trial users and 500,000 words for subscription users.

JSON

Maximum File Size

For JSON files, there is a maximum file size of 5 MB. This restriction is specific to JSON files and supersedes the file size restrictions described above.

Maximum Levels

For JSON files, there is a maximum of 20 levels of nesting for any JSON object. This refers to dictionaries or arrays with nested dictionaries or arrays.
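Both JSON limits can be checked locally with the standard library. The sketch below is illustrative rather than authoritative: the names nesting_depth and check_json are made up for this example, it assumes 1 MB means 1024 × 1024 bytes, and it assumes the outermost dictionary or array counts as level 1. GroundX may measure either limit differently.

```python
import json

MAX_JSON_BYTES = 5 * 1024 * 1024  # 5 MB limit; assumes 1 MB = 1024 * 1024 bytes
MAX_JSON_LEVELS = 20              # nesting limit stated above


def nesting_depth(value) -> int:
    """Return the deepest level of nested dicts/lists in a parsed JSON value.

    Assumption: the outermost container is level 1 and a bare scalar is
    level 0. An explicit stack is used instead of recursion.
    """
    max_depth = 0
    stack = [(value, 1)]
    while stack:
        node, depth = stack.pop()
        if isinstance(node, dict):
            children = node.values()
        elif isinstance(node, list):
            children = node
        else:
            continue  # scalars add no nesting
        max_depth = max(max_depth, depth)
        stack.extend((child, depth + 1) for child in children)
    return max_depth


def check_json(raw: bytes) -> dict:
    """Compare the size and nesting depth of raw JSON bytes to the limits."""
    depth = nesting_depth(json.loads(raw))
    return {
        "bytes": len(raw),
        "levels": depth,
        "within_limits": len(raw) <= MAX_JSON_BYTES and depth <= MAX_JSON_LEVELS,
    }
```

Read the file in binary mode (for example with pathlib.Path.read_bytes) so that the byte count reflects the size on disk rather than the decoded string length.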

Visual X-Ray Support

Processed documents of every supported document type include an x-ray analysis that is accessible via API and can be downloaded in the dashboard.

Some document types do not go through the visual layout analysis pipeline and, therefore, are not viewable in the visual x-ray viewer in the dashboard. The document types that ARE NOT viewable in the x-ray viewer in the dashboard are:

  • csv
  • json
  • tsv
  • txt
  • xlsx

Website Crawling

The GroundX ingestion pipeline can also crawl and ingest the content from websites using the Crawl Website endpoint.

For trial users, a maximum of 500 pages is supported, with a maximum crawl depth of 5. For subscribers, a maximum of 2,000 pages is supported, with a maximum crawl depth of 8.

The crawler scrapes the page content from the source HTML and can sometimes be confused by the structure of the page.