Why You Can't Just Drag PDFs Into ChatGPT: A Primer on RAG and Why Most Search Systems Miss What Matters
“Just upload your documents to ChatGPT.”
It sounds so simple. And for casual use—skimming a contract, summarizing a report—it works well enough. But if you’ve ever tried to use AI to answer serious questions across a large set of documents, you’ve probably noticed something frustrating: it misses things. Important things. Things you know are in there.
That isn't a malfunction. It's how the technology is designed to work.
What Actually Happens When You “Upload” Documents
When you drag a PDF into ChatGPT (or Claude, or any similar tool), the AI doesn’t read your document the way you do. It can’t hold a 500-page report in its head and flip back and forth between sections. Instead, these systems use a technique called RAG—Retrieval Augmented Generation.
Here’s the basic idea: your documents get chopped into smaller pieces (called “chunks”), those chunks get indexed in a searchable database, and when you ask a question, the system retrieves the chunks that seem most relevant and feeds them to the AI along with your question.
The AI never sees your whole document. It sees a handful of snippets that a search algorithm decided were probably relevant.
This is why ChatGPT can read a 200-page PDF but still miss obvious information. The retrieval step—the search—is the bottleneck.[1]
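If you want to see the moving parts, here is a deliberately tiny sketch of that loop in Python. TF-IDF stands in for a real embedding model, the file name and chunk size are placeholders, and a production system would use a proper vector database, but the shape of the pipeline is the same.

```python
# Toy RAG pipeline: chunk -> index -> retrieve top-k -> build the prompt.
# TF-IDF is a stand-in for a real embedding model; "report.txt" is a placeholder.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def chunk(text, size=500):
    """Split a document into fixed-size character chunks."""
    return [text[i:i + size] for i in range(0, len(text), size)]

document = open("report.txt").read()
chunks = chunk(document)

vectorizer = TfidfVectorizer()
index = vectorizer.fit_transform(chunks)        # the "searchable database"

question = "What contaminants were detected at MW-1?"
scores = cosine_similarity(vectorizer.transform([question]), index)[0]

top_k = 5                                       # only these chunks ever reach the model
context = [chunks[i] for i in scores.argsort()[::-1][:top_k]]

prompt = "Answer using only this context:\n" + "\n---\n".join(context) + "\n\nQ: " + question
# `prompt` goes to the language model; everything outside `context` is invisible to it.
```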
The “Top-K” Problem
Most RAG systems use what’s called top-k retrieval. When you ask a question, the system searches your documents, ranks all the chunks by relevance, and returns the top 5, 10, or 20 results. Those results—and only those results—get passed to the AI.
This creates an obvious problem: what if the answer to your question exists in chunk #47?
Researchers have identified this as one of the fundamental failure points in RAG systems: “the answer to the question is in the document but did not rank highly enough to be returned to the user.”[2]
The information exists—the search algorithm just didn’t think it was important enough to include. And since you never see what got filtered out, you have no way of knowing what you missed.
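A toy example makes the cutoff concrete. The scores below are invented; the point is simply that a fixed k throws away everything past the cutoff, no matter how relevant it is.

```python
# Invented relevance scores for 100 chunks; the chunk that actually holds the
# answer (id 47) only earned a middling score from the retriever.
scores = {chunk_id: 1.0 / (chunk_id + 1) for chunk_id in range(100)}
answer_chunk = 47

top_k = 10
retrieved = sorted(scores, key=scores.get, reverse=True)[:top_k]

print(answer_chunk in retrieved)   # False: the answer exists, but the model never sees it
```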
How Search Actually Works (A Quick Primer)
To understand why this happens, it helps to know how document search works under the hood. Modern systems typically use two complementary approaches:
Keyword search (also called lexical or BM25 search) works like traditional search engines. It looks for documents containing the exact words you typed. If you search for “benzene contamination,” it finds chunks with those specific terms. This is precise but brittle—it won’t find a chunk that says “aromatic hydrocarbon pollution” even though it means the same thing.
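You can see that brittleness in a few lines with the open-source rank_bm25 package (the example chunks below are made up):

```python
# Lexical (BM25) scoring: exact terms match, synonyms score zero.
from rank_bm25 import BM25Okapi

chunks = [
    "benzene contamination detected in groundwater near MW-1",
    "aromatic hydrocarbon pollution observed at the northern boundary",
    "soil samples showed no detectable metals",
]
bm25 = BM25Okapi([c.split() for c in chunks])

print(bm25.get_scores("benzene contamination".split()))
# The second chunk scores 0.0 even though it describes the same problem.
```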
Vector search (also called semantic or embedding search) is more sophisticated. It converts your query and all document chunks into mathematical representations called embeddings—essentially, coordinates in a high-dimensional space where similar concepts end up near each other. Instead of matching keywords, it finds chunks that are conceptually similar to your question.
The math is surprisingly elegant: each chunk becomes a point in a space with hundreds or thousands of dimensions. Your query becomes another point. The system finds chunks that are “close” to your query using a measurement called cosine similarity—basically, the angle between two arrows pointing from the origin to those points. Smaller angle = more similar.[3]
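If you want the formula, it fits in one line of NumPy. The three-dimensional vectors below are invented for readability; real embeddings have far more dimensions.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two embedding vectors: closer to 1.0 = more similar."""
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Invented 3-dimensional "embeddings"; real ones have hundreds or thousands of dimensions.
query         = np.array([0.9, 0.1, 0.3])
benzene_chunk = np.array([0.8, 0.2, 0.4])   # conceptually close to the query
expiry_chunk  = np.array([0.1, 0.9, 0.1])   # unrelated topic

print(cosine_similarity(query, benzene_chunk))  # ~0.98: small angle, very similar
print(cosine_similarity(query, expiry_chunk))   # ~0.24: large angle, not similar
```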
Hybrid search combines both approaches, typically using a technique called Reciprocal Rank Fusion (RRF). If a chunk ranks highly in both keyword and semantic search, it gets a boost. If it ranks highly in one but not the other, it still has a chance. This catches cases where either method alone would miss something important.
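RRF itself is almost disappointingly simple: each chunk's fused score is the sum of 1/(k + rank) over every ranked list it appears in, with k conventionally set to 60. A minimal version, using made-up result lists:

```python
# Reciprocal Rank Fusion: sum 1/(k + rank) for each list a chunk appears in.
def rrf(rankings, k=60):
    fused = {}
    for ranked_ids in rankings:
        for rank, chunk_id in enumerate(ranked_ids, start=1):
            fused[chunk_id] = fused.get(chunk_id, 0.0) + 1.0 / (k + rank)
    return sorted(fused, key=fused.get, reverse=True)

keyword_ranking  = ["c7", "c2", "c9"]   # made-up result lists
semantic_ranking = ["c2", "c5", "c7"]

print(rrf([keyword_ranking, semantic_ranking]))
# ['c2', 'c7', 'c5', 'c9']: chunks ranked by both methods float to the top,
# but chunks found by only one method still make the list.
```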
Why “Top-K” Isn’t Enough for Serious Work
Here’s the fundamental tension: retrieval is expensive. Searching through embeddings, ranking results, and processing text all take time and compute. The more chunks you retrieve, the more it costs and the slower it runs.
So most systems make a practical tradeoff: retrieve a small number of “best” results and hope they contain what you need.
For simple questions, this works fine. “What’s the contract expiration date?” probably lives in one specific chunk, and a good search will find it.
But many real questions aren’t like that:
- “What contaminants were detected across all sampling events?”
- “How many times does this report mention regulatory violations?”
- “Summarize all the key findings from these 50 documents.”
These questions require comprehensive coverage, not best-match precision. You need to find every relevant chunk, not just the top 10. And top-k retrieval, by design, can’t do that.
As one systematic review noted: “Dense retrievers frequently return many documents that are semantically related but do not contain the required answer.”[4] The search finds things that seem relevant but aren’t, while missing things that actually matter.
The Consumer AI Upload Problem
This is why “just upload your PDFs” falls short for professional use cases.
Even paid consumer AI tools have hard constraints beyond basic RAG limitations. ChatGPT Plus caps files at 512MB and 2 million tokens.[5] Claude Pro limits files to 30MB each.[6] Both use text-only extraction for most PDFs—charts, diagrams, and tables rendered as images are often ignored. Complex formatting frequently breaks extraction entirely.
More fundamentally: you’re trusting a general-purpose retrieval system to understand what matters in your specific domain. It doesn’t know that a passing mention of “Phase I ESA” on page 47 might be critical context for understanding a contamination timeline. It just sees text and ranks by generic similarity.
What “Exhaustive Search” Actually Means
At Statvis, we built our document search around a different principle: when you need comprehensive coverage, you should actually get comprehensive coverage.
Our exhaustive search mode uses a two-pass approach. First, we identify every document in your collection that has any matches for your query—not just the top-ranked ones. Then we retrieve the best chunks from each of those documents, ensuring coverage across your entire document set.
The difference matters. Traditional top-k might return 10 chunks, all from the same 2-3 documents that happened to rank highest. Our exhaustive mode might return 50 chunks distributed across 15 documents—giving you the full picture instead of a biased sample.
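To make the idea concrete, here is an illustrative sketch of a two-pass retrieval loop. This is not our production code, and the `search` function it assumes (one that returns a score for every matching chunk, tagged with its source document) is hypothetical.

```python
# Illustrative two-pass retrieval; not Statvis's actual implementation.
# `search(query)` is assumed to yield (doc_id, chunk_id, score) for every matching chunk.
from collections import defaultdict

def exhaustive_retrieve(query, search, chunks_per_doc=4):
    # Pass 1: group every match by its source document, so no document with
    # any hit is dropped just because its chunks rank below a global cutoff.
    by_doc = defaultdict(list)
    for doc_id, chunk_id, score in search(query):
        by_doc[doc_id].append((score, chunk_id))

    # Pass 2: keep the best few chunks from each matching document,
    # spreading coverage across the collection instead of 2-3 dominant documents.
    results = []
    for doc_id, hits in by_doc.items():
        for score, chunk_id in sorted(hits, reverse=True)[:chunks_per_doc]:
            results.append((doc_id, chunk_id, score))
    return sorted(results, key=lambda r: r[2], reverse=True)
```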
We also use hybrid search (combining vector and keyword approaches) and reranking (a second pass that filters out chunks that matched superficially but aren’t actually relevant). The goal is finding everything that matters.
When You Need Precision vs. Coverage
Not every question requires exhaustive search. For specific, targeted queries—“What was the benzene concentration at MW-1?”—precision mode finds the best answer fast.
But for questions that span your document set, for counting and listing tasks, for building comprehensive summaries: you need a system that actually looks everywhere, not one that samples from the top and hopes for the best.
Modern language models are remarkably capable at synthesizing information and answering questions—if they’re given the right context. The limitation is retrieval: getting the right information in front of the model in the first place.
That’s the problem we built Statvis to solve: not just searching documents, but ensuring that when you ask a question, you get an answer based on everything relevant in your collection—with citations you can verify, pointing to the exact page and passage.
Because if there’s one thing worse than missing information, it’s not knowing you missed it.
Yes, we used AI to help write this blog post. No, we didn’t let it make up the citations. You can click every single one.
Footnotes
1. OpenAI Help Center. “File Uploads FAQ.” 2025.
2. Barnett, S. et al. “Seven Failure Points When Engineering a Retrieval Augmented Generation System.” arXiv. January 2024.
3. Anthropic. “Contextual Retrieval in AI Systems.” 2024.
4. Alsaad, A. et al. “A Systematic Review of Key Retrieval-Augmented Generation (RAG) Systems.” arXiv. July 2025.
5. OpenAI Help Center. “File Uploads FAQ.” 2025.
6. Anthropic Help Center. “Uploading Files to Claude.” 2025.