Dec 19th, 2025
Watch a video version of this post: Here
When I tell people my startup applies AI to documents, they tend to fall asleep before I finish the sentence. But pause for a moment: what percentage of the world's high-value knowledge work comes down to reading a document carefully, understanding its context, and extracting that information for some downstream process?
I'd say somewhere between 20% and 80%. It could be as simple as reading a lab report and extracting hemoglobin counts. It could be as complex as finding inconsistencies in a 2,000-page legal document. The humble task of reading and extracting from documents is, in fact, incredibly high-value work.
So naturally, everyone wants AI to do it. You upload a PDF to ChatGPT, Claude, or Gemini. You ask questions. It answers confidently. It feels like magic.
Then you ask something verifiable — like counting how many times a word appears — and it tells you "41 times." You ask again: "93 times." Again: "94 times." One time out of eleven, it admits it can't actually do that.
Welcome to the reality of AI and documents. The technology is transformative when it works. But there are cliffs on both sides, and falling off gets you spectacularly wrong results.
When we polled webinar attendees on whether they'd caught AI making stuff up or missing crucial info from documents, 86% said yes. If that sounds familiar, you're in the right place.
Here's a mental model that explains a lot about AI's current state: the jagged frontier.
Imagine a circle representing all human capabilities. You might expect AI to be a smaller circle inside it, growing uniformly larger over time. That's not what's happening. Instead, AI is this jagged, spiky shape — occasionally extending far beyond human capabilities, occasionally falling embarrassingly short.
AI capabilities: not a smooth circle, more like a sea urchin.
Take proofreading: AI is better than any human at catching grammatical errors across all languages simultaneously. No human proofreader covers every language — AI does. Coding: I use it every day as a developer, and it's a huge productivity multiplier. But the moment you go somewhere niche, or use a package the AI isn't familiar with, it starts to hallucinate and become unreliable.
So if proofreading is a wide hill you can't possibly fall off, and coding is slightly narrower, document processing sits on a pretty narrow peak. At its best, it's transformatively better than humans. An AI can go through a 2,000-page document and find needles in haystacks that no human would have the patience to find. But step to either side of that peak and there's a sheer cliff: fall off it, and you get insane mistakes that no human would ever make.
The goal isn't to wait for AI to smooth out (that might take decades). The goal is to figure out how to climb onto the peaks and stay there — to harness AI at stable points where you know it's going to work.
The good news: AI is brilliant. The bad news: everything else.
Let's be specific about what LLMs can do with documents.
The Good: LLMs are brilliant. They're smart and relentless. They can find all the contradictions in a 2,000-page document — something no human would ever buckle down and do. They won't get lazy the way you and I will.
The Bad: Whether you get a good result is extremely dependent on problem setup. How you prompt the LLM, how you let it ingest the document — these choices matter enormously. Occasionally brilliant isn't enough. Businesses don't need occasional brilliance; they need strong consistency.
The Ugly: If inputs aren't phrased correctly, you get hallucinations. Worse, you get what I call black box operations: results that are wrong, and you can't even figure out that they're wrong until disaster strikes downstream, because you have no way of knowing where a number or decision even came from.
Everything LLMs can do with documents — when they work.
Before diving into the technical details, let's ground this in reality. A huge percentage of high-value knowledge work comes down to reading documents carefully and extracting useful information:
Bank statements: A lending company ingesting 10,000 bank statements daily needs every transaction's date, amount, description, and category extracted correctly.
Lab reports: A hemodialysis clinic needs to extract hemoglobin levels from patient bloodwork before treatment. Getting this wrong has real consequences.
Invoices and utility bills: Understanding what was paid, when, and for what category.
Legal documents: A medical lawsuit might involve 2,000 pages of patient history where you need to prove that back pain existed before the accident, not after.
If you work at any business, you probably have an ongoing operational burden where documents pile up. Right now they're either being ignored (missed opportunities) or processed by humans who are getting sick and tired of it (expensive, and it burns people out). We could be freeing up those humans for more valuable work.
Let's run an experiment. I uploaded a PDF of "The Adventures of Sherlock Holmes" to ChatGPT — a squeaky clean, text-based PDF with no OCR issues. Asked it what the book was about. It gave me a great summary.
Impressive! ChatGPT nails the plot summary. Surely it read the whole book, right?
Most people stop here. They think: "I uploaded a PDF, it told me stuff about the PDF, therefore it read and understood the whole thing."
This is wrong.
I asked a verifiable question: "How many times does the word 'Sherlock' appear?" I ran this query eleven times. Here are some of the answers I got:
Same question, eleven tries. Pick a number, any number.
That last answer is the only honest one. And it reveals something important: GPT told me it has access to something called a "file search tool" — it's not actually reading the full document. It's looking at the document through a straw, fetching individual chunks based on queries.
Just because there's a UI that lets you upload a document and get authoritative-sounding answers doesn't mean the AI actually read your document.
Maybe you're thinking: "Okay, but Sherlock Holmes is a 100-page book. Everyone knows LLMs get worse with more context. A simple one-page table would work fine."
Let's test that.
A simple lab report table. How hard could it be?
Here's a lab report table — a single image. Not a PDF, not a 100-pager. One clean table. I asked GPT to transcribe it. It confidently declared: "Complete lossless transcription of the table."
It starts at 6.46. It looks right. You go down the list of numbers. Looks right. Looks right... wait. Where's June 22nd? The entire column is empty.
Uh, what?
Pull up the original. June 22nd has plenty of data. Circle it. Now it's looking quite bad.
What happened? GPT started transcribing beautifully, then began mis-assigning results to wrong columns, even though the visual alignment couldn't be easier. No human on planet Earth would fail to understand that these numbers belong to June 22nd. But GPT produced a horrendous result.
This is the kind of failure that destroys trust. It looks right at first glance. You only catch it if you check carefully.
Here's the fundamental issue: LLMs were trained on 10-20 trillion tokens (essentially words). They're incredibly good at understanding and generating text. That's their home turf.
When you upload a PDF, the services you're using make one of a few guesses about how to handle it:
Best case: They run OCR (Optical Character Recognition) to extract all the words, then figure out some hacky way to prompt GPT with that text.
Worst case: They use AI's multimodal capabilities, meaning the AI sees pixels instead of words. The model has trained on far less image data, and there's a huge distance between "1 million pixels" and "text answer." This is a hard machine learning problem, which produces leaky, unreliable results.
LLMs speak tokens, not pixels. That's the root of the problem.
The symptoms are predictable: miscounted words, missing columns, values assigned to the wrong field, and confident-sounding answers that only fall apart when you check them.
Here's the kicker: recently OpenAI's system prompt got leaked. There's lots of interesting stuff in there, but one instruction stood out. They explicitly tell their own AI: "You can't do OCR. Don't try. It's too hard. You're gonna fail. It takes too much time. It's too expensive."
OpenAI's own instructions to GPT: "Don't even try to read documents."
That's OpenAI telling GPT not to try reading documents properly.
Here's a delightful example of what goes wrong. "Vegetative electron microscopy" is a made-up term. It's nonsense. Yet it's been cited in 25 different academic articles.
How? AI misreading across a double-column format.
A nonsense term that made it into 25 academic papers. Thanks, AI.
When you have two columns, no human would read across them. But if you're training an LLM by OCR-ing documents and just reading left to right, the word "vegetative" ends up right next to "electron microscopy" — and the LLM learns this nonsense phrase.
OCR is the right idea — extract words instead of processing pixels. But once you have the raw OCR, you still need to figure out the reading order. Instead of left-to-right across the whole page, you need to read first column, then second column. This sounds obvious. Apparently the frontier labs forgot.
Enough doom and gloom. Here's how to get good results.
The pixels are the only thing you can trust. They tell the story that human eyes can read. You want to extract the words from those pixels with OCR, reconstruct the reading order a human would follow, and hand the LLM text it can actually work with.
For simple documents, the reading order might be trivial — maybe you always have a two-column structure, so you can handle that with a simple heuristic. You could vibe code something in an afternoon. But for invoices, forms, or complex formatting? It gets much trickier.
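For the two-column case, that afternoon hack might look something like the sketch below. It assumes your OCR tool gives you each word with its position; the (text, x, y) tuples are a stand-in for whatever your OCR actually returns.

```python
# Sketch: restore reading order for a fixed two-column page layout.
# Assumes OCR output as (text, x, y) tuples; real tools return richer
# structures, but the idea is the same.

def two_column_reading_order(words, page_width):
    """Read the left column top to bottom, then the right column."""
    midpoint = page_width / 2
    left = [w for w in words if w[1] < midpoint]
    right = [w for w in words if w[1] >= midpoint]
    # Sort each column by vertical position, then horizontal.
    # (A real version would group words into lines with a y-tolerance.)
    ordered = (sorted(left, key=lambda w: (w[2], w[1]))
               + sorted(right, key=lambda w: (w[2], w[1])))
    return " ".join(w[0] for w in ordered)
```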
Here's a practical tip: communicate with your LLM in Markdown.
The magic formula: Document → OCR → Markdown
Markdown preserves crucial breakpoints and structure. Tables render in a way both humans and LLMs can read. You can copy-paste a Markdown representation into an LLM and ask it to extract values, find outliers, whatever — and suddenly it works.
Remember that lab report that failed spectacularly? Same document, converted to Markdown, fed to the same LLM: suddenly June 22nd is full of correct data.
Same table, proper representation. Look — June 22nd exists!
When you prompt "extract all the stuff from this invoice," you're asking for trouble. The problem is unbounded, you're more likely to get mistakes, and the output structure varies each time.
Instead, define a schema: "We're extracting an invoice. There's a field called customer_name, a field called invoice_date, a field called service_render_date (which is different from invoice_date)..."
Tell the LLM exactly what you want. It's not a mind reader.
When you establish exactly what you're looking for, the LLM doesn't have to spend attention figuring out what to output. It can execute something more mechanical. The more mindless you make the task, the better.
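To make that concrete, here's a minimal sketch of a schema-guided extraction prompt. The field names come from the invoice example above, and call_llm is a hypothetical placeholder for whichever model API you use.

```python
import json

def call_llm(prompt: str) -> str:
    """Hypothetical stand-in -- swap in your actual LLM client call."""
    raise NotImplementedError

# Define exactly what we want, instead of "extract all the stuff".
INVOICE_SCHEMA = {
    "customer_name": "string",
    "invoice_date": "ISO date (YYYY-MM-DD)",
    "service_render_date": "ISO date -- NOT the same as invoice_date",
    "line_items": [{"description": "string", "amount": "number"}],
}

def extract_invoice(markdown_text: str) -> dict:
    prompt = (
        "You are extracting fields from an invoice.\n"
        f"Return JSON matching exactly this schema:\n{json.dumps(INVOICE_SCHEMA, indent=2)}\n"
        "If a field is not present in the document, return null for it.\n\n"
        f"Invoice (Markdown):\n{markdown_text}"
    )
    return json.loads(call_llm(prompt))
```

The exact format matters less than the principle: JSON Schema, Pydantic models, or a vendor's structured-output feature all work. What matters is that the model fills in blanks instead of deciding what the output should even look like.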
Here's a counterintuitive truth: the more tokens you input, the dumber the LLM gets.
Even though Gemini claims it can handle 1 million tokens, on a practical level, it gets dumber the more you feed it. A recent paper called "Context Length Alone Hurts LLM Performance" demonstrated this beautifully. They gave an LLM trivial math questions — it got them right. Then they asked the exact same questions with 25,000 spaces added. The LLM obviously knew the spaces were irrelevant. It still got dumber and made mistakes it would never otherwise make.
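If you want to see the effect yourself, the experiment is easy to reproduce in spirit:

```python
# Reproducing the spirit of the experiment: the same trivial question,
# with and without 25,000 irrelevant spaces prepended.
question = "What is 17 + 25?"
padded_question = " " * 25_000 + question

# Feed both to the same model (via whatever client you use) and compare.
# The padded version is measurably more error-prone, even though the
# model "knows" the spaces carry no information.
```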
Add 25,000 spaces to a math problem. Watch the AI get dumber.
The solution: break a single document comprehension problem into many independent tasks.
For a bank statement, process each page independently. Tell the LLM: "Here's all the text on this page. Extract all line items: date, amount, description." You get dramatically better performance — potentially 100% accuracy.
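A minimal sketch of that per-page decomposition, assuming each page has already been converted to Markdown and using a hypothetical call_llm placeholder:

```python
import json

def call_llm(prompt: str) -> str:
    """Hypothetical stand-in for your actual LLM client."""
    raise NotImplementedError

def extract_statement_line_items(pages_markdown: list[str]) -> list[dict]:
    """Process each bank-statement page independently, then merge the results."""
    all_items = []
    for page_number, page_text in enumerate(pages_markdown, start=1):
        prompt = (
            "Here is all the text on one page of a bank statement.\n"
            "Extract every line item as JSON: "
            '[{"date": ..., "amount": ..., "description": ...}]\n\n'
            f"{page_text}"
        )
        items = json.loads(call_llm(prompt))
        for item in items:
            item["source_page"] = page_number  # keep provenance for review
        all_items.extend(items)
    return all_items
```

Each call is small, independent, and easy to retry or verify, which is exactly what pushes the accuracy up.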
But real documents are messier. An insurance claim about an accident with five participants doesn't have clean page breaks. To extract information about driver #2, you might need to go back and read something about driver #1 — like the car they were both in. The report doesn't say it twice.
If you try to input a 100-page document and extract 10,000 line items in one shot, you're gonna get wrecked.
For complex documents, you need what we call "agentic flows" — the LLM traverses the document with the power to build up understanding, never looking at all 1,000 pages at once.
How to read a 1,000-page document: one careful bite at a time.
At DocuPipe, we use a cursor-based method: the LLM moves through the document a manageable piece at a time, building up understanding as it goes and carrying forward only what it needs for the next step.
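To give a feel for what a cursor-style pass can look like, here's a simplified sketch (not DocuPipe's actual implementation): the model walks through the document a few pages at a time, updating a running set of notes.

```python
def call_llm(prompt: str) -> str:
    """Hypothetical stand-in for your actual LLM client."""
    raise NotImplementedError

def cursor_pass(pages: list[str], question: str, window: int = 3) -> str:
    """Walk the document a few pages at a time, never loading it all at once."""
    notes = "Nothing read yet."
    for start in range(0, len(pages), window):
        chunk = pages[start:start + window]
        notes = call_llm(
            f"You are working on: {question}\n"
            f"Your notes from the document so far:\n{notes}\n\n"
            f"New pages ({start + 1}-{start + len(chunk)}):\n" + "\n\n".join(chunk) + "\n\n"
            "Update your notes with anything relevant to the question. Keep them short."
        )
    return call_llm(f"Using only these notes, answer: {question}\n\nNotes:\n{notes}")
```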
Our record so far? Over 1,600 different LLM calls to understand a single document. That's fine. The goal isn't minimizing compute — it's maximizing accuracy.
Yes, really. Sometimes one document needs a thousand LLM calls.
Here's an uncomfortable truth: sometimes you need humans.
For certain document types, you can hit 99.9% accuracy. Sometimes you can't. Sometimes compliance or legal reasons require human oversight regardless of accuracy. For those cases, build a process that puts humans in the loop efficiently.
Click a value, see where it came from. Human review at 10x speed.
Good engineering gets you from 80% to 99%. Sometimes only humans get you that last nine — from 99% to 99.9%. Whether you need that depends on the stakes. Starting hemodialysis based on hemoglobin counts? You need all the nines. Sending a patient to a hospital where the worst case is rejection for missing info? Different calculus.
The key is making human review fast. When reviewers can click on an extracted value and immediately see the source highlighted in the original document, they can verify and fix at 10x speed.
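One way to make that click-to-source review possible is to carry provenance with every extracted value. The record below is illustrative; the field names aren't from any particular product.

```python
from dataclasses import dataclass

@dataclass
class ExtractedValue:
    """One extracted field, with enough provenance to show its source on click."""
    field: str             # e.g. "hemoglobin"
    value: str             # e.g. "11.2 g/dL"
    page: int              # page of the original document
    bbox: tuple[float, float, float, float]  # (x0, y0, x1, y1) on that page
    confidence: float      # model/OCR confidence, used to prioritize review

# A reviewer clicks the value, the UI highlights bbox on `page`,
# and verification takes seconds instead of a re-read of the document.
```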
But there's another reason to keep humans in the loop: improvement.
Every time a human fixes a mistake, you've collected a labeled example — a document, an extraction, and the correct answer. With 50 documents (not 5,000), you can do what we call auto-tuning.
You don't even need to fine-tune the model and change its weights. Training can be as simple as prompting your model to reflect: "Here's the document, here's what you got wrong, here's what we wanted. What was unclear? Change your own instructions so you won't make this mistake again."
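Here's a sketch of that reflection loop. The prompt wording is illustrative, and call_llm is again a hypothetical placeholder.

```python
def call_llm(prompt: str) -> str:
    """Hypothetical stand-in for your actual LLM client."""
    raise NotImplementedError

def auto_tune(instructions: str, document_md: str, wrong: dict, correct: dict) -> str:
    """Ask the model to revise its own extraction instructions from one fixed mistake."""
    prompt = (
        "You extract data from documents using these instructions:\n"
        f"{instructions}\n\n"
        f"Document:\n{document_md}\n\n"
        f"Your extraction (wrong): {wrong}\n"
        f"Correct answer (from human review): {correct}\n\n"
        "What was unclear? Rewrite the instructions so you won't make this "
        "mistake again. Return only the revised instructions."
    )
    return call_llm(prompt)

# Run this over the ~50 human-corrected examples and the instructions
# gradually absorb everything the reviewers had to fix.
```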
Drake gets it.
This transforms document processing from an operational expense into an investment. Normally, document processing is a pit of fire: throw money at each document, repeat forever. But with this approach, every solved document makes the next one easier. Your reliability moves from 90% to 99% to 99.9% to full automation.
You're not burning money. You're building better models.
All the pieces, working together.
Real-world document processing needs all the pieces: trustworthy OCR, reading-order reconstruction, Markdown representations, explicit schemas, task decomposition and agentic flows for long documents, human review, and auto-tuning.
If building all this sounds like a lot of work (because it is), that's why we built DocuPipe. We've done the hard work of OCR, reading order, schema management, human review interfaces, and auto-tuning — so you can focus on your actual business problem.
But whether you use DocuPipe or build your own solution, the principles remain: give the model text it can actually read, tell it exactly what to extract, break big documents into small independent tasks, and keep humans in the loop until the accuracy earns your trust.
The promise of AI for documents is real. The path to getting there just requires understanding what's actually happening under the hood.
This post is adapted from a DocuPipe webinar. You can watch the full recording here: