Schema-Driven Governance: Forcing AI to Speak Your ERP's Language

Nitai Dean

Updated Mar 30th, 2026 · 12 min read

Table of Contents

  • The Danger of Unstructured LLM Outputs in Legacy Databases
  • The Shift from ETL to ELT: Raw Ingestion First
  • Schema-Driven Extraction in DocuPipe
  • Implementation Considerations
  • Conclusion
The invoice extraction worked perfectly in testing. In production, it crashes your SAP import on day one. The AI returned "invoice_date" but your database column is "InvoiceDate". Yesterday it was "2024-03-15", today it's "March 15, 2024". The amounts sometimes include currency symbols, sometimes don't. Your integration team spends more time fixing AI outputs than the old manual process took.
Large language models are probabilistic. Same input, different outputs. Field names vary. Date formats shift. Nested structures reorganize. For chatbots, this flexibility is a feature. For enterprise systems that demand exact schemas, it's catastrophic.
SAP expects dates in one format. Oracle expects another. Field names must match database columns exactly. Data types must satisfy validation rules. Any deviation causes import failures, data corruption, or silent errors that surface months later during audits.
This article examines why unstructured LLM outputs fail enterprise integration, how Schema-on-Read architecture addresses the problem, and how schema-driven extraction produces consistent outputs from probabilistic models.
For the broader context on enterprise document AI infrastructure, see the Enterprise Document AI Infrastructure hub article. For handling extraction failures and routing to human review, see The Human-in-the-Loop (HITL) Escalation Protocol.

The Danger of Unstructured LLM Outputs in Legacy Databases

Legacy enterprise systems were designed for predictable, structured data. They have rigid schemas, strict validation rules, and zero tolerance for variation. LLMs were designed for flexibility and natural language understanding. These design philosophies collide when LLM outputs must feed legacy systems.

Schema Mismatch Failures

Consider a simple invoice extraction. The LLM extracts these fields:
{
  "Invoice Date": "March 15, 2026",
  "Total": "$1,234.56",
  "Vendor": "Acme Corp."
}
The ERP system expects:
{
  "invoice_date": "2026-03-15",
  "total_amount": 1234.56,
  "vendor_id": "V-00847"
}
Every field has problems:
  • Field names use different conventions (title case vs. snake_case)
  • Date format is human-readable vs. ISO 8601
  • Amount includes currency symbol and is a string vs. a number
  • Vendor is a name vs. an internal identifier
Without transformation, this data cannot enter the ERP. With manual transformation, the process is error-prone and does not scale.
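To make the gap concrete, here is a minimal sketch of the hand-written glue code needed to turn the first payload into the second. The `VENDOR_IDS` lookup table and the `normalize_invoice` helper are hypothetical names used only for illustration; a real integration would resolve vendors against the ERP's vendor master.

```python
from datetime import datetime
from decimal import Decimal

# Hypothetical lookup table standing in for the ERP's vendor master.
VENDOR_IDS = {"acme corp": "V-00847"}


def normalize_invoice(raw: dict) -> dict:
    """Map free-form LLM output onto the names and types the ERP expects."""
    # "March 15, 2026" -> "2026-03-15" (assumes English month names)
    invoice_date = (
        datetime.strptime(raw["Invoice Date"], "%B %d, %Y").date().isoformat()
    )
    # "$1,234.56" -> Decimal("1234.56")
    total = Decimal(raw["Total"].replace("$", "").replace(",", ""))
    # "Acme Corp." -> "acme corp" -> "V-00847"
    vendor_key = raw["Vendor"].strip().rstrip(".").lower()
    return {
        "invoice_date": invoice_date,
        "total_amount": float(total),
        "vendor_id": VENDOR_IDS[vendor_key],
    }


result = normalize_invoice(
    {"Invoice Date": "March 15, 2026", "Total": "$1,234.56", "Vendor": "Acme Corp."}
)
```

Every line of this function encodes an assumption about what the model returned. If the date arrives in a different format or the key is spelled differently, the function raises, which is the brittleness the rest of this article tries to remove.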

Type Coercion Errors

LLMs emit text. Unless the output is constrained, nothing guarantees that the model distinguishes between:
  • The string "1234" and the integer 1234
  • The string "true" and the boolean true
  • The string "null" and the absence of a value
When downstream systems interpret these values, errors cascade:
  • String concatenation instead of numeric addition
  • String comparison instead of boolean logic
  • Missed null checks, because the string "null" is a real value rather than an absent one
These errors are insidious. They may not cause immediate failures. Instead, they corrupt data silently, creating problems discovered weeks or months later when reports do not reconcile or audits reveal inconsistencies.
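A short Python illustration of how quietly these mistakes pass. The variable names are illustrative only; the behavior shown is standard Python semantics, and other languages have their own equivalents.

```python
quantity = "1234"  # string returned by the model, not an integer
flag = "false"     # string, not a boolean
missing = "null"   # string, not an absent value

concatenated = quantity + "1"  # string concatenation, not addition
added = int(quantity) + 1      # numeric addition after explicit conversion
truthy = bool(flag)            # any non-empty string is truthy
is_none = missing is None      # the string "null" is not None
```

None of these lines raises an exception. `concatenated` is `"12341"` while `added` is `1235`, and the string `"false"` evaluates as true, so the error only shows up later in the data.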

Structural Variation

LLMs may reorganize output structure from one run to the next, depending on prompt phrasing and sampling randomness:
First extraction:
{
  "line_items": [
    {"product": "Widget A", "qty": 10, "price": 5.00},
    {"product": "Widget B", "qty": 5, "price": 10.00}
  ]
}
Second extraction of the same document:
{
  "products": ["Widget A", "Widget B"],
  "quantities": [10, 5],
  "prices": [5.00, 10.00]
}
Both representations contain the same information, but their structures are incompatible. Code expecting the first structure fails on the second. Integration pipelines break unpredictably.
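Without a schema, the consuming code has to anticipate each shape. The sketch below handles only the two shapes shown above; `to_line_items` is a hypothetical helper, and any third variation would still break it.

```python
def to_line_items(payload: dict) -> list[dict]:
    """Normalize the two example shapes into a list of row dictionaries."""
    if "line_items" in payload:
        return payload["line_items"]
    # Column-oriented shape: three parallel lists that must line up.
    columns = (payload["products"], payload["quantities"], payload["prices"])
    if len({len(column) for column in columns}) != 1:
        raise ValueError("column lengths differ")
    return [
        {"product": product, "qty": qty, "price": price}
        for product, qty, price in zip(*columns)
    ]


rows_first = to_line_items(
    {"line_items": [{"product": "Widget A", "qty": 10, "price": 5.00}]}
)
rows_second = to_line_items(
    {"products": ["Widget A", "Widget B"], "quantities": [10, 5], "prices": [5.00, 10.00]}
)
```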

The Shift from ETL to ELT: Raw Ingestion First

Traditional document processing follows an ETL pattern: Extract structure during ingestion, Transform into normalized formats, Load into target systems. This approach assumes you know the required structure before processing begins.
ETL fails for document AI because requirements change. The schema you need today is not the schema auditors will demand next year. Fields you ignored during initial extraction become critical. Document types you never anticipated appear in your workflow.

ELT Architecture for Documents

ELT (Extract, Load, Transform) reorders the steps: load the raw data first, then apply structure later, at query time.
For document AI, this means:
  1. Ingest the original document and preserve it completely
  2. Parse the document to extract text and layout information
  3. Store both the original and parsed representation
  4. Apply schemas when extracting structured data
  5. Re-extract with new schemas when requirements change
The critical difference is step 3. By storing parsed representations (not just extraction results), you can apply new schemas to historical documents without re-ingesting originals.
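The five steps can be sketched as a small in-memory store. Everything here is an illustrative assumption rather than DocuPipe's API: `DocumentStore`, `StoredDocument`, and `stub_extractor` are hypothetical names, and the stub simply reads "name: value" lines where a real system would call a model.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class StoredDocument:
    original: bytes    # step 1: the untouched source file
    parsed_text: str   # steps 2-3: parsed representation stored alongside it


@dataclass
class DocumentStore:
    # Hypothetical extractor signature: (parsed_text, schema) -> extracted fields
    extractor: Callable[[str, dict], dict]
    docs: dict[str, StoredDocument] = field(default_factory=dict)

    def ingest(self, doc_id: str, original: bytes, parse: Callable[[bytes], str]) -> None:
        self.docs[doc_id] = StoredDocument(original, parse(original))

    def extract(self, doc_id: str, schema: dict) -> dict:
        # Steps 4-5: the schema is applied at read time, so it can change later.
        return self.extractor(self.docs[doc_id].parsed_text, schema)


def stub_extractor(parsed_text: str, schema: dict) -> dict:
    """Stand-in for a model call: picks 'name: value' lines for the schema's fields."""
    pairs = dict(
        line.split(": ", 1) for line in parsed_text.splitlines() if ": " in line
    )
    return {name: pairs.get(name) for name in schema["properties"]}


store = DocumentStore(extractor=stub_extractor)
store.ingest(
    "inv-1",
    b"invoice_date: 2026-03-15\npo_number: PO-77",
    parse=lambda raw: raw.decode("utf-8"),
)

schema_v1 = {"properties": {"invoice_date": {"type": "string"}}}
schema_v2 = {"properties": {"invoice_date": {"type": "string"}, "po_number": {"type": "string"}}}

first = store.extract("inv-1", schema_v1)
later = store.extract("inv-1", schema_v2)  # new field, no re-ingestion
```

The second call pulls `po_number` from a document that was ingested before that field existed in any schema, which is the property step 3 buys you.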

Schema-on-Read Benefits

Schema-on-Read means structure is applied at extraction time, not ingestion time:
Requirement flexibility:
  • Add new fields to schemas without re-ingesting documents
  • Change field types as integration requirements evolve
  • Support multiple schemas for the same document type
Historical reprocessing:
  • Apply new schemas to documents processed years ago
  • Extract fields that were not in the original schema
  • Respond to audit requests for information not previously extracted
Schema versioning:
  • Maintain multiple schema versions simultaneously
  • Route documents to appropriate schema versions
  • Compare extraction results across schema versions
Experimentation:
  • Test new schemas against existing documents
  • Measure extraction quality before production deployment
  • A/B test schema variations
Organizations that adopted ETL patterns for document AI can find themselves stuck. They extracted specific fields years ago and discarded the rest. When auditors ask for additional information, they must reprocess the original documents (if they still have them) or admit the data is lost.

Schema-Driven Extraction in DocuPipe

[Figure: DocuPipe schema builder interface]
Schema-on-Read requires a mechanism to guide extraction consistently. DocuPipe uses schema definitions to tell the LLM exactly what fields to extract and how to format them.

JSON Schema Definitions

Every extraction operation uses a schema that specifies fields, types, and descriptions. The schema guides the LLM during extraction:
{
  "type": "object",
  "properties": {
    "invoice_date": {
      "type": "string",
      "description": "Invoice issue date in ISO 8601 format (YYYY-MM-DD)"
    },
    "due_date": {
      "type": "string",
      "description": "Payment due date in ISO 8601 format (YYYY-MM-DD)"
    },
    "total_amount": {
      "type": "number",
      "description": "Total invoice amount as a number without currency symbols"
    },
    "currency": {
      "type": "string",
      "enum": ["USD", "EUR", "GBP", "CAD", "Other"],
      "description": "Three-letter ISO 4217 currency code; use Other when the currency is not listed"
    },
    "line_items": {
      "type": "array",
      "items": {
        "type": "object",
        "properties": {
          "description": {"type": "string"},
          "quantity": {"type": "integer"},
          "unit_price": {"type": "number"},
          "total": {"type": "number"}
        }
      }
    }
  }
}
The schema serves as instructions to the model: extract these fields, use these formats, output this structure. Clear field descriptions improve extraction accuracy because the model understands what each field should contain.
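Because the schema is plain JSON, the same definition can also be used to check what comes back. The sketch below is a deliberately small checker covering only `type`, `enum`, `properties`, and `items`; it is not a full JSON Schema validator, and `check` is a hypothetical helper written for this example. Missing fields are skipped rather than reported.

```python
JSON_TYPES = {
    "string": str,
    "number": (int, float),
    "integer": int,
    "boolean": bool,
    "array": list,
    "object": dict,
}


def check(value, schema: dict, path: str = "$") -> list[str]:
    """Return mismatches between a value and a small subset of JSON Schema."""
    expected = schema.get("type")
    if expected:
        ok = isinstance(value, JSON_TYPES[expected])
        # bool is a subclass of int in Python, so reject it for numeric types.
        if expected in ("number", "integer") and isinstance(value, bool):
            ok = False
        if not ok:
            return [f"{path}: expected {expected}, got {type(value).__name__}"]
    errors = []
    if "enum" in schema and value not in schema["enum"]:
        errors.append(f"{path}: {value!r} not in enum")
    if isinstance(value, dict):
        for name, sub_schema in schema.get("properties", {}).items():
            if name in value:  # absent fields are not treated as errors here
                errors += check(value[name], sub_schema, f"{path}.{name}")
    if isinstance(value, list) and "items" in schema:
        for index, item in enumerate(value):
            errors += check(item, schema["items"], f"{path}[{index}]")
    return errors


# Trimmed copy of the invoice schema above.
schema = {
    "type": "object",
    "properties": {
        "total_amount": {"type": "number"},
        "currency": {"type": "string", "enum": ["USD", "EUR", "GBP", "CAD", "Other"]},
        "line_items": {
            "type": "array",
            "items": {"type": "object", "properties": {"quantity": {"type": "integer"}}},
        },
    },
}

good = {"total_amount": 1234.56, "currency": "USD", "line_items": [{"quantity": 10}]}
bad = {"total_amount": "$1,234.56", "currency": "usd", "line_items": [{"quantity": "10"}]}

good_errors = check(good, schema)
bad_errors = check(bad, schema)
```

For production use, an established JSON Schema validator for your language is the safer choice; this sketch only shows that the schema doubles as a machine-checkable contract.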

Enum Fields for Categorical Data

For fields with known possible values, enum types constrain outputs:
{
  "document_type": {
    "type": "string",
    "enum": ["Invoice", "Purchase Order", "Receipt", "Credit Memo", "Other"],
    "description": "Type of financial document"
  }
}
Best practice: include an "Other" value to catch documents that do not match expected categories. This prevents the model from forcing a bad match when none of the enum values fit.
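The same idea can be mirrored as a downstream guard. This is a sketch of one possible safety net in your own code, not something the article says DocuPipe does; `coerce_document_type` is a hypothetical helper.

```python
DOCUMENT_TYPES = ["Invoice", "Purchase Order", "Receipt", "Credit Memo", "Other"]


def coerce_document_type(value) -> str:
    """Return the canonical enum value, or "Other" for anything unrecognized."""
    if isinstance(value, str):
        by_lower = {item.lower(): item for item in DOCUMENT_TYPES}
        match = by_lower.get(value.strip().lower())
        if match:
            return match
    return "Other"


normalized = coerce_document_type("invoice ")
unknown = coerce_document_type("Bill of Lading")
absent = coerce_document_type(None)
```

Whether to tolerate casing and whitespace differences, as this sketch does, or to reject them outright is a policy decision for your integration.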

Schema Versioning

Schemas evolve as requirements change. DocuPipe supports versioned schemas:
  • Create new schema versions (invoice_v1, invoice_v2, invoice_v3)
  • Compare extraction results across versions
  • Maintain multiple active versions for different use cases
  • Roll back to previous versions if new schemas produce issues
This versioning enables gradual schema evolution without breaking existing workflows.

What Schemas Control vs. What They Do Not

Schemas guide extraction, but they have limits worth understanding:
Schemas control:
  • Field names in the output
  • Expected data types (string, number, array, object)
  • Enum values for categorical fields
  • Output structure and nesting
  • Field descriptions that help the model understand intent
Schemas do not enforce:
  • Required vs. optional fields (all fields are treated as optional by the extraction model)
  • Value constraints like minimum/maximum
  • Format validation (the model attempts to follow format hints in descriptions, but the result is not verified)
  • Cross-field validation rules
For enterprise integrations requiring strict validation, implement a validation layer in your downstream pipeline. DocuPipe provides the structured extraction; your systems can apply business rules, type coercion, and constraint checking as needed.

Building Validation into Your Pipeline

After DocuPipe extracts structured data, your integration layer can apply validation:
In your code:
  • Validate required fields are present
  • Check data types match expectations
  • Apply business rules (amounts within ranges, dates in sequence)
  • Transform formats as needed for target systems
Tools that help:
  • Pydantic for Python validation
  • JSON Schema validators for any language
  • Database constraints as a final gate
  • ETL tools with built-in validation
This separation keeps DocuPipe focused on extraction while giving you full control over validation logic specific to your business rules.
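A minimal sketch of such a validation layer, using only the Python standard library so it stays self-contained. The required-field list and the `MAX_AMOUNT` limit are hypothetical business rules chosen for illustration; a library such as Pydantic could express the same checks declaratively.

```python
from datetime import date

REQUIRED = ("invoice_date", "due_date", "total_amount", "currency")
MAX_AMOUNT = 1_000_000  # hypothetical business limit


def validate_invoice(data: dict) -> list[str]:
    """Return a list of validation errors; an empty list means the record passes."""
    errors = [
        f"missing required field: {name}"
        for name in REQUIRED
        if data.get(name) is None
    ]

    amount = data.get("total_amount")
    if amount is not None:
        if isinstance(amount, bool) or not isinstance(amount, (int, float)):
            errors.append("total_amount must be a number")
        elif not 0 <= amount <= MAX_AMOUNT:
            errors.append("total_amount out of range")

    dates = {}
    for name in ("invoice_date", "due_date"):
        raw = data.get(name)
        if raw is None:
            continue
        try:
            dates[name] = date.fromisoformat(raw)
        except (TypeError, ValueError):
            errors.append(f"{name} is not an ISO 8601 date")

    # Business rule: dates must be in sequence.
    if len(dates) == 2 and dates["due_date"] < dates["invoice_date"]:
        errors.append("due_date precedes invoice_date")
    return errors


ok = validate_invoice(
    {"invoice_date": "2026-03-15", "due_date": "2026-04-14",
     "total_amount": 1234.56, "currency": "USD"}
)
out_of_order = validate_invoice(
    {"invoice_date": "2026-03-15", "due_date": "2026-03-01",
     "total_amount": -5, "currency": "USD"}
)
incomplete = validate_invoice({"invoice_date": "March 15, 2026"})
```

Returning a list of errors instead of raising on the first one makes it easier to log every problem with a record before routing it to review.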

Implementation Considerations

Deploying schema-driven extraction effectively requires attention to schema design, versioning, and downstream integration.

Schema Design Best Practices

Write schemas that help the model extract accurately:
Clear field descriptions:
  • Describe what the field should contain, not just its name
  • Include format hints ("date in YYYY-MM-DD format")
  • Specify units when relevant ("amount in dollars, no currency symbol")
Appropriate field types:
  • Use number for amounts you need to calculate with
  • Use string for identifiers even if they look numeric
  • Use array for repeating elements like line items
  • Use object for nested structures like addresses
Enum design:
  • Include common values plus "Other" for edge cases
  • Keep enum lists focused (5-10 values work better than 50)
  • Use descriptive values ("Invoice" not "INV")

Versioning Strategies

As requirements change, evolve schemas carefully:
Additive versioning:
  • Add new fields without removing existing ones
  • Old integrations continue working
  • Simplest approach for forward compatibility
Breaking changes:
  • When you must change field names or types
  • Create a new schema version (v2)
  • Migrate integrations before deprecating old version
Parallel versions:
  • Run multiple versions for different use cases
  • Different downstream systems may need different structures
  • Version identifier in output enables routing
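Routing on a version identifier might look like the sketch below. The envelope shape (`schema_version` plus `data`), the v1 field name `total`, and the handler functions are all assumptions made for illustration; the article does not specify how the version appears in the output.

```python
def handle_v1(payload: dict) -> dict:
    # Hypothetical v1 schema used the field name "total".
    return {"date": payload["invoice_date"], "total": payload["total"]}


def handle_v2(payload: dict) -> dict:
    # Hypothetical v2 schema renamed it to "total_amount" (a breaking change).
    return {"date": payload["invoice_date"], "total": payload["total_amount"]}


HANDLERS = {"invoice_v1": handle_v1, "invoice_v2": handle_v2}


def route(extraction: dict) -> dict:
    """Dispatch an extraction to the handler for its schema version."""
    version = extraction.get("schema_version")  # hypothetical envelope field
    handler = HANDLERS.get(version)
    if handler is None:
        raise ValueError(f"no handler for schema version {version!r}")
    return handler(extraction["data"])


from_v1 = route(
    {"schema_version": "invoice_v1",
     "data": {"invoice_date": "2026-03-15", "total": 100.0}}
)
from_v2 = route(
    {"schema_version": "invoice_v2",
     "data": {"invoice_date": "2026-03-15", "total_amount": 100.0}}
)
```

Both versions produce the same downstream record, so integrations can migrate one at a time before the old version is retired.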

Integration Patterns

Extracted data flows to enterprise systems through standard patterns:
Direct database insertion:
  • Map JSON fields to database columns
  • Apply type conversion in your insertion logic
  • Use database constraints as a validation layer
API integration:
  • Transform extracted data to match API contracts
  • Handle missing optional fields gracefully
  • Log extraction results for debugging
Message queue delivery:
  • Publish extractions to queues for async processing
  • Use dead letter queues for failed validations
  • Enable retry logic for transient downstream failures
File-based exchange:
  • Serialize extractions to CSV, XML, or EDI as needed
  • Apply format transformations for partner requirements
  • Validate before transmission
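As one illustration of the direct-insertion pattern with database constraints as the last gate, the sketch below uses an in-memory SQLite table and a plain list as a stand-in for a dead letter queue. The table layout, constraint choices, and `load` helper are assumptions for this example, not a prescribed design.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE invoices (
        invoice_date TEXT NOT NULL
            CHECK (invoice_date GLOB '[0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9]'),
        total_amount REAL NOT NULL
            CHECK (typeof(total_amount) IN ('real', 'integer') AND total_amount >= 0),
        vendor_id TEXT NOT NULL
    )
    """
)

dead_letters = []  # stand-in for a dead letter queue


def load(rows: list[dict]) -> int:
    """Insert rows one by one; rejected rows go to dead_letters with the reason."""
    loaded = 0
    for row in rows:
        values = (row.get("invoice_date"), row.get("total_amount"), row.get("vendor_id"))
        try:
            with conn:  # commits on success, rolls back on error
                conn.execute("INSERT INTO invoices VALUES (?, ?, ?)", values)
            loaded += 1
        except sqlite3.IntegrityError as exc:
            dead_letters.append((row, str(exc)))
    return loaded


rows = [
    {"invoice_date": "2026-03-15", "total_amount": 1234.56, "vendor_id": "V-00847"},
    {"invoice_date": "March 15, 2026", "total_amount": 10.0, "vendor_id": "V-00847"},
    {"invoice_date": "2026-03-16", "total_amount": "$1,234.56", "vendor_id": "V-00847"},
]
loaded = load(rows)
```

Only the first row is stored. The other two are captured with the constraint error message, so they can be inspected or retried instead of silently corrupting the table.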

Conclusion

Enterprise document AI must produce consistent, structured outputs that legacy systems can consume. LLM variability is a challenge when data feeds ERPs, populates databases, or triggers automated workflows.
Schema-driven extraction addresses this by guiding the LLM with clear field definitions, types, and descriptions. The model extracts what you specify in the structure you specify. Combined with downstream validation in your integration layer, this produces reliable data pipelines.
The result is document AI that integrates with enterprise infrastructure rather than fighting it. Extracted data matches expected formats. Field structures align with database schemas. Your validation layer catches issues before they propagate.
For organizations building document AI pipelines, schema design and downstream validation work together. DocuPipe handles the extraction; you control the business rules.

LLMs are probabilistic and may produce different field names, date formats, or data structures for identical inputs. Enterprise systems like ERPs require exact schema matches and consistent data types. Schema-driven extraction guides the model to output specific fields in specific formats, reducing variability and making downstream integration reliable.

Schema-on-Read applies structure at extraction time rather than ingestion time. Documents are parsed and stored with their full content preserved, then schemas are applied when data is needed. This allows you to extract new fields from historical documents, support multiple schemas for the same document type, and respond to changing requirements without re-ingesting originals.

DocuPipe extracts data according to your schema but does not enforce validation rules like required fields or value constraints. All fields are treated as optional during extraction. For enterprise integrations requiring strict validation, implement a validation layer in your downstream pipeline using tools like Pydantic, JSON Schema validators, or database constraints.
