Your AI agent answers a customer question with complete confidence. The response cites specific numbers and references a product feature. It is also wrong — because the chunk it retrieved was an outdated draft that never should have entered the knowledge base.
Vector databases do not have quality gates. They compute similarity and return the closest matches. Whether content is accurate, current, or contextually appropriate is invisible to retrieval.
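To make this concrete, here is a minimal sketch of similarity-only retrieval. The index contents and vectors are invented for illustration; the point is that ranking is purely geometric, so the closest match wins even when it is an outdated draft.

```python
import math

def cosine(a, b):
    # Plain cosine similarity: the only signal the retriever has.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy index of (embedding, chunk text). The outdated draft happens to
# sit closest to the query vector, so it ranks first -- nothing in the
# scoring considers accuracy or freshness.
index = [
    ([0.9, 0.1, 0.0], "DRAFT (2023): Plan costs $49/mo"),
    ([0.7, 0.6, 0.1], "Current: Plan costs $79/mo"),
    ([0.1, 0.2, 0.9], "Unrelated onboarding doc"),
]

def top_k(query_vec, k=1):
    return sorted(index, key=lambda item: cosine(query_vec, item[0]),
                  reverse=True)[:k]

results = top_k([1.0, 0.0, 0.0])  # the draft chunk wins
```

There is no quality gate anywhere in that loop, which is exactly the problem.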
The Amplification Problem
Vector databases do not just pass through bad data — they amplify it. A poorly chunked document that loses context during ingestion produces an embedding representing surface-level language without qualifications or caveats.
The agent presents decontextualized information as complete fact. The qualifier that data applies only to Q3 2024 is gone. The caveat that numbers are preliminary was stripped during chunking. The retrieval succeeded by its own metrics — high similarity score, semantically related chunk — but the content was garbage.
Where Bad Context Originates
Quality problems start at ingestion. Most pipelines split documents into chunks, compute embeddings, and store them. Naive chunking — splitting on character count or paragraph boundaries — regularly separates facts from qualifying context.
A pricing table loses its header row. A conditional policy gets split so the condition is in one chunk and the policy in another. A multi-step process gets fragmented so steps are retrieved without preceding context.
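The conditional-policy failure above is easy to reproduce. This sketch uses fixed-width chunking (a deliberately naive strategy; the policy text is invented) to show a qualifying condition landing in a different chunk than the rule it governs.

```python
def naive_chunk(text, size=60):
    # Fixed-width splitting: no awareness of sentences or structure.
    return [text[i:i + size] for i in range(0, len(text), size)]

policy = ("Refunds are available within 30 days of purchase, "
          "provided the product is unopened and you retain the receipt.")

chunks = naive_chunk(policy)
# The refund rule lands in chunks[0]; its conditions ("unopened",
# "retain the receipt") land in chunks[1]. Retrieve only the first
# chunk and the policy looks unconditional.
```

An agent that retrieves `chunks[0]` alone will confidently promise a refund with no strings attached.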
Source quality is the other dimension. Knowledge bases accumulate drafts, deprecated documents, and conflicting versions. Without temporal validation at ingestion, all of it enters the index as equally valid.
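A temporal gate at ingestion can be as simple as checking status and age before anything is embedded. The field names (`status`, `updated`) and the one-year cutoff below are assumptions for illustration, not a standard schema.

```python
from datetime import date

def admit(doc, max_age_days=365, today=date(2025, 6, 1)):
    # Applied at ingestion: drafts and stale documents never reach the
    # index, so retrieval can't surface them later.
    if doc["status"] != "published":
        return False
    return (today - doc["updated"]).days <= max_age_days

docs = [
    {"id": "a", "status": "draft",     "updated": date(2025, 5, 1)},
    {"id": "b", "status": "published", "updated": date(2023, 1, 1)},
    {"id": "c", "status": "published", "updated": date(2025, 3, 1)},
]
admitted = [d["id"] for d in docs if admit(d)]
# Only "c" passes: "a" is a draft, "b" is stale.
```

The same gate is the natural place to reject conflicting versions: keep the newest published copy, drop the rest.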
Why Post-Retrieval Filtering Falls Short
Validation after retrieval — checking source dates, flagging low-confidence results — helps but has limits. Post-retrieval filtering cannot recover context lost during chunking. If the chunk is decontextualized, no downstream processing restores the missing qualification.
These filters also add latency. Validating every retrieved chunk is a second pass over each query's results, roughly doubling the work on the retrieval path. At scale, that is a meaningful cost, especially when the latency budget is already tight.
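Both limits show up in a sketch of the pattern. All helpers here are hypothetical stand-ins: the validator can only reject chunks, so a decontextualized chunk sails through because the missing qualifier is not there for it to see.

```python
def filtered_retrieve(retrieve, validate, query, k=20):
    # Second pass over every candidate: latency grows with k, and the
    # filter can only drop chunks -- it cannot restore a qualifier
    # that chunking already stripped out.
    candidates = retrieve(query, k)
    return [c for c in candidates if validate(c)]

# A decontextualized chunk passes the freshness check: the stripped
# "Q3 2024 only" qualifier is invisible to the validator.
chunk = {"text": "Revenue grew 40%", "source_date": "2025-01-10"}
is_fresh = lambda c: c["source_date"] >= "2024-06-01"
result = filtered_retrieve(lambda q, k: [chunk], is_fresh, "revenue growth")
# result still contains the misleading chunk
```

The filter did its job; the damage was done upstream.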
Context-Preserving Ingestion
The alternative is solving quality at the source. Context-preserving pipelines maintain source attribution, entity resolution, temporal markers, and structural awareness when processing documents.
A chunk retains knowledge of its origin, creation date, document section, and relationships to other chunks. Source-aware parsing understands that a Slack message, Notion page, and PDF contract have different structures requiring different chunking strategies.
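One way to represent such a chunk is a record that carries its provenance alongside its text. The field names below are illustrative assumptions, not a fixed schema; the point is that the metadata travels with the chunk into the index.

```python
from dataclasses import dataclass, field

@dataclass
class Chunk:
    text: str
    source: str        # origin document, e.g. a file path or URL
    created: str       # ISO date of the source content
    section: str       # structural position within the document
    neighbors: list = field(default_factory=list)  # ids of related chunks

c = Chunk(
    text="Enterprise plan: $79/user/mo",
    source="pricing.pdf",
    created="2024-10-01",
    section="Pricing > Enterprise",
    neighbors=["pricing-001", "pricing-003"],
)
```

With `created` and `section` attached, freshness checks and context expansion become metadata lookups instead of guesswork.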
Frequently Asked Questions
How do I audit my vector database content?
Sample retrieved results for representative queries and check them against the source documents. Look for chunks that lost qualifying context, outdated information that still surfaces, and duplicates. Even a small sample usually exposes the common failure modes.
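This audit loop can be scripted. Everything below is a hypothetical sketch: `retrieve`, `source_of`, and `is_consistent` are stand-ins you would wire to your own stack, and the consistency check here is deliberately crude (it only flags a dropped "preliminary" qualifier).

```python
def audit(queries, retrieve, source_of, is_consistent):
    # For each representative query, compare top results against their
    # source documents and record chunks that no longer match.
    failures = []
    for q in queries:
        for chunk in retrieve(q, k=3):
            if not is_consistent(chunk, source_of(chunk["source"])):
                failures.append((q, chunk["source"]))
    return failures

# Toy fixtures: the indexed chunk dropped the source's qualifier.
chunks = {"doc1": {"text": "Plan costs $49/mo", "source": "doc1"}}
sources = {"doc1": "As of Q3 2024 (preliminary), plan costs $49/mo"}

def is_consistent(chunk, source_text):
    # Flag chunks whose source carries a qualifier the chunk dropped.
    return "preliminary" not in source_text or "preliminary" in chunk["text"]

failures = audit(["pricing"], lambda q, k=3: [chunks["doc1"]],
                 lambda s: sources[s], is_consistent)
# failures records the query and offending source document
```

In practice the consistency check is the hard part; even a human spot-check over the sampled pairs beats no audit at all.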
Is context-preserving chunking more expensive?
Upfront, yes. But teams spend less time debugging wrong answers and building post-retrieval filters, so total cost of ownership is typically lower.
Conclusion
The quality of what goes in determines what comes out. Similarity search cannot distinguish between accurate, well-contextualized content and stale fragments. This is one of the most dangerous limitations of vector databases — retrieval that succeeds technically while failing practically. Production agents need ingestion that preserves context and tracks provenance, because once information is lost during chunking, no retrieval optimization brings it back.