Any document in.
Structured data out.

Pull typed, structured data from any document. Define the fields you need, upload your files, and get back usable JSON. For most of our customers, this is the core of their entire document pipeline.

The extraction step

Where documents become data

Extract is a core step in the document intelligence pipeline. After a document is parsed and, optionally, classified, Extract applies your schema to identify and pull structured fields from the content. It is what turns raw text into structured JSON you can actually use.

You define the fields, Extract fills them in

Nested objects, arrays, and enums all supported

Every value traced back to its source location

Confidence scores for every single data point

Output matches your schema exactly, every time

Missing fields return null, not errors

Works on any document that has been parsed
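Because missing fields come back as null rather than raising errors, downstream code can check for gaps explicitly. The snippet below is a minimal sketch of that check in Python. The result shape and field names (`vendor`, `line_items`, and so on) are hypothetical and chosen only for illustration.

```python
def null_paths(value, prefix=""):
    """Return the dotted paths of every null (None) value in a nested result."""
    paths = []
    if isinstance(value, dict):
        for key, child in value.items():
            paths += null_paths(child, f"{prefix}.{key}" if prefix else key)
    elif isinstance(value, list):
        for index, child in enumerate(value):
            paths += null_paths(child, f"{prefix}[{index}]")
    elif value is None:
        paths.append(prefix)
    return paths


# Hypothetical extraction result: every schema key is present, gaps are None.
result = {
    "vendor": {"name": "Acme", "tax_id": None},
    "line_items": [{"sku": "A1", "qty": None}],
    "total": 120.5,
}

missing = null_paths(result)
print(missing)
```

A pipeline could use a list like this to decide whether a record is complete enough to load or should be routed to review first.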

How it works

You give Extract a document and a schema. It reads the full content, understands the layout and structure, finds every field you asked for, and returns clean JSON. Not pattern matching. Not regex. It reads the document the way a person would, then gives you back exactly what you defined in your schema.

Supported types include PDF, image, DOC, scan, and XLS, among others.


Universal extraction

Any document. Any structure. Any language.

Extract handles printed text, cursive handwriting, tables, checkboxes, and nested layouts across 150+ languages and scripts. The same engine processes a German tax form, a Japanese sustainability report, and a photo of a handwritten medical note. No templates, no retraining, no per-language configuration. Consistent structured output every time.

Try it yourself

You control the extraction

Depth, schema, and output format

You choose how deep Extract goes. Standard runs a single pass with fast models. High runs stronger models and double-checks the output. Extended reads page by page with a sliding context window, building up results as it goes.
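To make the idea of a sliding context window concrete, here is a small Python sketch. It is an illustration of the general technique only, not a description of how Extract is implemented: the window size, the overlap, and the merge rule (keep the first non-null value seen) are all assumptions.

```python
def sliding_windows(pages, size=3, overlap=1):
    """Split pages into overlapping windows so context carries across pages."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("size must be positive and overlap smaller than size")
    step = size - overlap
    windows = []
    start = 0
    while start < len(pages):
        windows.append(pages[start:start + size])
        if start + size >= len(pages):
            break  # the last window already reaches the final page
        start += step
    return windows


def merge_partial(accumulated, partial):
    """Fill fields that are still null with values found in a later window."""
    for key, value in partial.items():
        if accumulated.get(key) is None and value is not None:
            accumulated[key] = value
    return accumulated


windows = sliding_windows(["p1", "p2", "p3", "p4", "p5"], size=3, overlap=1)

# Hypothetical per-window results, merged as the reader moves through the file.
running = {"grantor": None, "parcel_id": None}
merge_partial(running, {"grantor": "A. Jones", "parcel_id": None})
merge_partial(running, {"grantor": None, "parcel_id": "12-345"})
```

The overlap means a value that straddles a page boundary is seen whole by at least one window.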

Schema-defined extraction

Define field names, types, nesting, and values. Extract returns JSON that conforms to your schema exactly. Build schemas manually or with the AI studio.
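As an illustration of what a schema with nesting, arrays, and enums might look like, the sketch below uses a JSON-Schema-style dictionary in Python. The exact schema format Extract accepts is not specified here, so treat the structure and all field names as assumptions. The helper simply lists the leaf fields a schema asks for.

```python
# Hypothetical JSON-Schema-style definition; field names are illustrative only.
INVOICE_SCHEMA = {
    "type": "object",
    "properties": {
        "invoice_number": {"type": "string"},
        "currency": {"type": "string", "enum": ["USD", "EUR", "Other"]},
        "vendor": {
            "type": "object",
            "properties": {
                "name": {"type": "string"},
                "tax_id": {"type": "string"},
            },
        },
        "line_items": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "description": {"type": "string"},
                    "amount": {"type": "number"},
                },
            },
        },
    },
}


def field_paths(schema, prefix=""):
    """List every leaf field in the schema, in declaration order."""
    kind = schema.get("type")
    if kind == "object":
        paths = []
        for name, sub in schema.get("properties", {}).items():
            paths += field_paths(sub, f"{prefix}.{name}" if prefix else name)
        return paths
    if kind == "array":
        return field_paths(schema.get("items", {}), f"{prefix}[]")
    return [prefix]


paths = field_paths(INVOICE_SCHEMA)
print(paths)
```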

Schema-free extraction

Upload a document with no schema. Extract infers the structure, returns the data, and generates a reusable schema from the result for you to use on future documents.
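The idea of deriving a reusable schema from a result can be sketched in a few lines of Python. This is a simplified illustration that infers JSON-Schema-style types from one sample result. It is not the inference Extract performs, and the sample data is hypothetical.

```python
def infer_schema(value):
    """Infer a JSON-Schema-style type description from a sample value."""
    if isinstance(value, dict):
        return {
            "type": "object",
            "properties": {key: infer_schema(child) for key, child in value.items()},
        }
    if isinstance(value, list):
        # Assume a homogeneous array and describe it by its first element.
        return {"type": "array", "items": infer_schema(value[0]) if value else {}}
    if isinstance(value, bool):  # check bool before int: bool is a subclass of int
        return {"type": "boolean"}
    if isinstance(value, (int, float)):
        return {"type": "number"}
    if value is None:
        return {"type": "null"}
    return {"type": "string"}


# Hypothetical result from a schema-free run.
sample = {"patient": {"name": "R. Lee", "insured": True}, "visits": [{"copay": 25}]}
schema = infer_schema(sample)
```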

What you can extract

Extraction capabilities at a glance

Extraction

2500+ pages per document

Enum fields with constrained values and Other fallback

Array deduplication with index/page overrides

Default values applied from schema

Output keys sorted to match schema field order

Excel, CSV, and JSON export

Extraction docs
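Three of the capabilities above (enum fields with an Other fallback, defaults applied from the schema, and output keys sorted to schema order) can be pictured with a short Python sketch. The field-spec format and the rule that an out-of-list value maps to "Other" are assumptions made for illustration and may differ from the product's behavior.

```python
# Hypothetical field specs, listed in schema order.
FIELDS = [
    {"name": "invoice_number"},
    {"name": "currency", "enum": ["USD", "EUR", "Other"]},
    {"name": "status", "enum": ["paid", "unpaid"], "default": "unpaid"},
]


def normalize(raw, fields):
    """Apply defaults and enum constraints, emitting keys in schema order."""
    output = {}
    for spec in fields:
        value = raw.get(spec["name"])
        if value is None:
            value = spec.get("default")  # stays None when no default is set
        allowed = spec.get("enum")
        if value is not None and allowed and value not in allowed:
            # Assumed rule: fall back to "Other" when the enum offers it.
            value = "Other" if "Other" in allowed else None
        output[spec["name"]] = value
    return output


raw = {"status": None, "currency": "GBP", "invoice_number": "INV-7"}
clean = normalize(raw, FIELDS)
print(clean)
```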

Quality

Field-level confidence scores (high/medium/low)

Bounding box evidence for every extracted value

V3 agentic multi-pass extraction

3 effort levels (Standard, High, Extended)

Cross-page context preservation

Extraction effort docs

Processing

Batch uploads (100 docs per request)

Async processing with job polling

Smart result caching

Smart document splitting (Auto/Never/All)

3 display modes (Spatial, Sections, Image)

Post-processing with defaults and null-fill
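With a limit of 100 documents per request and asynchronous jobs, client code typically splits large uploads into batches and polls each job until it finishes. The Python sketch below shows that pattern. The status strings and the `fetch_status` callable are hypothetical stand-ins for whatever the API actually returns.

```python
import time


def chunked(items, size=100):
    """Split a list into consecutive batches of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]


def poll_until_done(fetch_status, interval=2.0, max_attempts=30, sleep=time.sleep):
    """Call fetch_status() until the job reaches a terminal state.

    The status names "completed" and "failed" are assumptions.
    """
    for _ in range(max_attempts):
        status = fetch_status()
        if status in ("completed", "failed"):
            return status
        sleep(interval)
    raise TimeoutError("job did not finish within the polling budget")


batches = chunked([f"doc-{n}" for n in range(250)])

# Simulated job that finishes on the third status check.
statuses = iter(["queued", "processing", "completed"])
final = poll_until_done(lambda: next(statuses), sleep=lambda _: None)
```

Where webhook notifications are available, they remove the need for polling; the loop above is the fallback for clients that cannot receive callbacks.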

Quick start

Integration

REST API (POST /v2/standardize/batch)

Webhook notifications on completion

JSON, CSV, and Excel export

Presigned URLs for raw originals

Workflow automation triggers

Integration docs
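As a rough sketch of calling the batch endpoint from Python, the snippet below builds (but does not send) a POST request to `/v2/standardize/batch` using only the standard library. Only the path comes from this page. The base URL, the `X-API-Key` header, and the payload field names are placeholders, so check the API docs for the real values.

```python
import json
import urllib.request

BASE_URL = "https://api.example.com"  # placeholder host, not the real endpoint


def build_batch_request(api_key, document_ids, schema_id):
    """Build a POST request for the batch endpoint without sending it."""
    # Payload keys are hypothetical; the real request body may differ.
    payload = {"documentIds": document_ids, "schemaId": schema_id}
    return urllib.request.Request(
        url=f"{BASE_URL}/v2/standardize/batch",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json", "X-API-Key": api_key},
        method="POST",
    )


request = build_batch_request("my-key", ["doc-1", "doc-2"], "schema-1")
# To send it: urllib.request.urlopen(request)
```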

Every script, every alphabet

Extract handles 150+ languages across every major script: Latin, Cyrillic, CJK, Arabic, Hebrew, Devanagari, Thai, and more. No per-language setup required.

Latin

64 languages

English

Spanish

Portuguese

French

Indonesian

German

Swahili

Vietnamese

Turkish

Italian

Malay

Hausa

Polish

Dutch

Romanian

Ẹ

Yoruba

Ị

Igbo

Zulu

Xhosa

Somali

Plus 44 more Latin-script languages

CJK

8 languages

简

Chinese (Simplified)

日

Japanese

한

Korean

繁

Chinese (Traditional)

粵

Cantonese

吳

Wu Chinese

閩

Min Chinese

客

Hakka

Arabic

14 languages

Arabic

Urdu

Persian

Pashto

Kurdish (Sorani)

Sindhi

Uyghur

Balochi

Kashmiri

Dari

Hazaragi

Punjabi (Shahmukhi)

Saraiki

Brahui

Devanagari

16 languages

ह

Hindi

म

Marathi

ने

Nepali

सं

Sanskrit

कों

Konkani

ब

Bodo

ड

Dogri

मै

Maithili

भो

Bhojpuri

अ

Awadhi

छ

Chhattisgarhi

म

Magahi

रा

Rajasthani

ह

Haryanvi

ग

Garhwali

कु

Kumaoni

South Asian

14 languages

ব

Bengali

ਪ

Punjabi (Gurmukhi)

తె

Telugu

த

Tamil

ગ

Gujarati

ಕ

Kannada

മ

Malayalam

ଓ

Odia

অ

Assamese

ස

Sinhala

ꠍ

Sylheti

ꯃ

Meitei

ᱥ

Santali

Dhivehi

Cyrillic

23 languages

Russian

Ukrainian

Uzbek (Cyrillic)

Kazakh

Serbian (Cyrillic)

Tajik

Bulgarian

Belarusian

Kyrgyz

Tatar

Mongolian (Cyrillic)

Macedonian

Bashkir

Chechen

Chuvash

Ossetian

Mari

Udmurt

Komi

Sakha

Plus 3 more Cyrillic-script languages

Southeast Asian

14 languages

ไ

Thai

မ

Myanmar

ខ

Khmer

ລ

Lao

བོད

Tibetan

རྫོང

Dzongkha

ၵ

Shan

က

Karen

မ

Mon

ᦟ

Tai Lue

ꦗ

Javanese

ᬩ

Balinese

ᮞ

Sundanese

ᨅ

Buginese

Greek

2 languages

Greek

ῶ

Ancient Greek

Hebrew

3 languages

Hebrew

Yiddish

Ladino

Other

14 languages

አ

Amharic

ት

Tigrinya

Armenian

ქ

Georgian

Ꮳ

Cherokee

ᐃ

Inuktitut

ᓀ

Cree

ᐅ

Ojibwe

ᠮ

Mongolian (Traditional)

Syriac

Thaana

N'Ko

ꕙ

Vai

ⵜ

Tifinagh

“We upload a 40-page batch of mixed checks and slips, and DocuPipe turns it into 22 structured donor records ready for federal upload. What used to take hours of data entry now takes minutes.”

Jennifer Bauer

Read the full story

The extraction engine

Verifiable output at production scale

Every extraction uses a schema to define exactly what data to pull. Every data point that comes back is backed by a confidence score and a bounding box pointing to where it was found in the source document.


Before extraction

Schemas: define what to extract

Build a schema through an AI-powered chat studio or define one manually via the API. The schema controls field names, types, nesting, enum values, and extraction guidelines. A good schema is the difference between useful output and noise.

AI chat studio for schema creation

Iterate with live extraction previews

Version and compare schemas over time


After extraction

Review: verify what was extracted

Every extracted value gets a confidence score and a bounding box showing where in the source document it was found. Click a field to see the evidence. Flag incorrect values, and share secure review links with stakeholders who do not have DocuPipe accounts.

Per-field confidence scores (high, medium, low)

Bounding box evidence on every value

Secure shareable review links
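Per-field confidence makes it straightforward to route only uncertain values to a human. The Python sketch below assumes a result shape in which each field carries a `value`, a `confidence` of high, medium, or low, and a `bbox`. That shape and the field names are hypothetical and used only to show the pattern.

```python
LEVELS = {"low": 0, "medium": 1, "high": 2}


def fields_needing_review(fields, minimum="high"):
    """Return the names of fields whose confidence is below `minimum`."""
    return [
        name
        for name, field in fields.items()
        if LEVELS[field["confidence"]] < LEVELS[minimum]
    ]


# Hypothetical extraction output; bbox is page plus [x0, y0, x1, y1].
extracted = {
    "donor_name": {"value": "J. Smith", "confidence": "high",
                   "bbox": {"page": 1, "box": [72, 96, 240, 112]}},
    "amount": {"value": 250.0, "confidence": "medium",
               "bbox": {"page": 1, "box": [400, 96, 470, 112]}},
    "check_date": {"value": None, "confidence": "low",
                   "bbox": {"page": 1, "box": [400, 40, 470, 56]}},
}

to_review = fields_needing_review(extracted)
only_low = fields_needing_review(extracted, minimum="medium")
```

A stricter threshold sends more fields to review; a looser one lets medium-confidence values pass straight through.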

Turn your first document into JSON today

Free tier included. No credit card required. First extraction in under 5 minutes.

Get started free Talk to sales

Transparent pricing

Simple pricing with no hidden fees

Pricing details

Start your integration

Get up and running with our API in minutes

API Docs
