Supported Document Types and Ingestion Restrictions

Supported File Types

The currently supported document types are:

  • pdf
  • docx
  • pptx
  • xlsx
  • csv
  • tsv
  • json
  • txt
  • hwp

The currently supported image types are:

  • bmp
  • gif (not animated)
  • heif (or heic)
  • ico
  • jpg (or jpeg)
  • png
  • svg
  • tiff (or tif)
  • webp
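As a rough illustration, the two lists above can be turned into a client-side pre-check before uploading. This is only a sketch: `classify_file` and the two set names are hypothetical helpers, not part of any SDK, and the check looks at the file extension only, so it cannot tell whether a gif is animated or whether the file contents match the extension.

```python
from pathlib import Path
from typing import Optional

# Extensions copied from the lists above, including the aliases heic, jpeg, and tif.
DOCUMENT_TYPES = {"pdf", "docx", "pptx", "xlsx", "csv", "tsv", "json", "txt", "hwp"}
IMAGE_TYPES = {
    "bmp", "gif", "heif", "heic", "ico", "jpg", "jpeg",
    "png", "svg", "tiff", "tif", "webp",
}


def classify_file(path: str) -> Optional[str]:
    """Return "document", "image", or None, based only on the file extension."""
    ext = Path(path).suffix.lower().lstrip(".")
    if ext in DOCUMENT_TYPES:
        return "document"
    if ext in IMAGE_TYPES:
        return "image"
    return None
```

A caller might skip or report any file for which `classify_file` returns `None` instead of sending it for ingestion.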

Restrictions

Maximum File Size

If you are a free trial user, restrictions on document ingestion include:

If you are a subscription user, restrictions on document ingestion include:

Why the difference?
The Python SDK automatically uploads local files using the EyeLevel.ai file upload endpoint and temporary pre-signed URLs. The API and TypeScript SDK do not include this feature.

Maximum Concurrent Files

A maximum of 50 files can be ingested concurrently.
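One way to stay under the concurrency limit is to split a large set of files into groups of at most 50 and submit one group at a time. The sketch below shows only the grouping step. `batches` and `MAX_CONCURRENT_FILES` are hypothetical names, and how each group is submitted and monitored is left to the caller.

```python
from typing import Iterator, List, Sequence

MAX_CONCURRENT_FILES = 50  # concurrency limit stated above


def batches(paths: Sequence[str], size: int = MAX_CONCURRENT_FILES) -> Iterator[List[str]]:
    """Yield consecutive slices of `paths`, each holding at most `size` items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(paths), size):
        yield list(paths[start:start + size])
```

With this approach, the next group would be submitted only after the previous group has finished processing.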

Document Type Restrictions

PDF | PPTX | DOCX | HWP

Maximum Pages

Document types with pages (PDF, PPTX, DOCX, and HWP) are limited to a maximum of 750 pages.

CSV | TSV | XLSX

Maximum Words

Document types without pages (CSV, TSV, and XLSX) are limited to a maximum of 375,000 words.

Maximum Rows

Document types with rows (CSV, TSV, and XLSX) are also limited to a maximum of 1,500 rows (lines).
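The word and row limits above can be approximated locally for CSV and TSV content before upload. This is a minimal sketch under stated assumptions: words are counted by splitting each cell on whitespace and every row (including a header row) counts toward the row limit, neither of which is confirmed by this page. `check_delimited_text` is a hypothetical helper, and XLSX files are not covered because reading them requires a third-party library.

```python
import csv
import io

MAX_WORDS = 375_000  # word limit stated above
MAX_ROWS = 1_500     # row limit stated above


def check_delimited_text(text: str, delimiter: str = ",") -> dict:
    """Count rows and whitespace-separated words in CSV/TSV text and compare to the limits."""
    rows = 0
    words = 0
    for row in csv.reader(io.StringIO(text), delimiter=delimiter):
        rows += 1
        # Assumption: a "word" is any whitespace-separated token inside a cell.
        words += sum(len(cell.split()) for cell in row)
    return {
        "rows": rows,
        "words": words,
        "within_limits": rows <= MAX_ROWS and words <= MAX_WORDS,
    }
```

For a TSV file, pass `delimiter="\t"`. The actual counts used during ingestion may differ from this estimate.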

TXT

Maximum Words

Raw text files are limited to a maximum of 375,000 words.

JSON

Maximum File Size

JSON files are limited to a maximum file size of 5 MB. This restriction is specific to JSON files and supersedes the file size restrictions described above.

Maximum Levels

JSON files may contain a maximum of 20 levels of nesting in any JSON object. Nesting refers to dictionaries or arrays that contain other dictionaries or arrays.
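The two JSON limits above can be checked locally with the standard library. This sketch makes two assumptions that this page does not confirm: 1 MB is taken to be 1,048,576 bytes, and a flat dictionary or array counts as one level of nesting. `nesting_depth` and `check_json` are hypothetical helpers.

```python
import json

MAX_JSON_BYTES = 5 * 1024 * 1024  # assumption: 1 MB = 1,048,576 bytes
MAX_JSON_DEPTH = 20               # nesting limit stated above


def nesting_depth(value) -> int:
    """Return the container nesting depth: 0 for a scalar, 1 for a flat dict or list."""
    depth = 0
    stack = [(value, 0)]  # iterative walk, so Python's recursion limit is not a factor here
    while stack:
        node, level = stack.pop()
        if isinstance(node, dict):
            children = list(node.values())
        elif isinstance(node, list):
            children = node
        else:
            continue  # scalars add no nesting
        depth = max(depth, level + 1)
        stack.extend((child, level + 1) for child in children)
    return depth


def check_json(raw: bytes) -> bool:
    """Return True if the raw JSON bytes satisfy both the size and the nesting limits."""
    if len(raw) > MAX_JSON_BYTES:
        return False
    return nesting_depth(json.loads(raw)) <= MAX_JSON_DEPTH
```

The size check runs first so that oversized input is rejected without being parsed.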

Visual X-Ray Support

Processed documents of every supported document type include an x-ray analysis that is accessible via API and can be downloaded in the dashboard.

Some document types do not go through the visual layout analysis pipeline and are therefore not viewable in the visual x-ray viewer in the dashboard. The document types that ARE NOT viewable in the x-ray viewer are:

  • csv
  • json
  • tsv
  • txt
  • xlsx

Additional Data Sources

The GroundX ingestion pipeline can also crawl and ingest the content from websites using the Crawl Website endpoint.

Because the crawler scrapes page content from the source HTML, it can sometimes be confused by the structure of the page.