Engineering

The Latency Trap: Vector Search at Scale

Your vector search runs at 15ms on a million documents. Six months later, the corpus has grown to fifty million and that same query takes 180ms. Add the reranker and you are at 350ms. Add the LLM call and the user waits three seconds.

This is the latency trap. Vector search appears fast at prototype scale, then degrades steadily as the index grows.

Why Vector Search Slows Down

Computing similarity requires measuring the distance between a query vector and candidates in the index. Exact nearest neighbor search compares the query against every stored vector, so cost scales linearly with corpus size, which is impractical for large datasets.
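The linear cost is easy to see in code. A toy NumPy sketch of exact search (an illustration, not a production index):

```python
import numpy as np

def exact_search(query, index, k=5):
    # Exact nearest neighbor: compute the distance from the query to
    # every vector in the index, then keep the k smallest. Per-query
    # cost is O(n * d), growing linearly with corpus size n.
    dists = np.linalg.norm(index - query, axis=1)
    order = np.argsort(dists)
    return order[:k], dists[order[:k]]

rng = np.random.default_rng(0)
index = rng.standard_normal((100_000, 128)).astype(np.float32)
query = rng.standard_normal(128).astype(np.float32)
ids, dists = exact_search(query, index)
```

Doubling the index doubles the work of the `np.linalg.norm` line, which is exactly the scaling problem described above.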

Approximate algorithms like HNSW and IVF-PQ trade accuracy for speed by partitioning the index. But even these degrade as indexes grow. Larger indexes require more partitions, deeper traversals, or wider search beams — each adding latency.
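The partition-and-probe idea behind IVF can be sketched in a few lines. This is a deliberately crude version (random centroids instead of trained k-means, no product quantization) just to show where the speed/accuracy knob lives:

```python
import numpy as np

def build_ivf(index, n_lists=16, seed=0):
    # Crude IVF: pick random vectors as partition centroids and assign
    # each vector to its nearest one (real systems train k-means).
    rng = np.random.default_rng(seed)
    centroids = index[rng.choice(len(index), n_lists, replace=False)]
    d2 = ((index[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return centroids, np.argmin(d2, axis=1)

def ivf_search(query, index, centroids, assign, k=5, nprobe=2):
    # Probe only the nprobe partitions closest to the query. A wider
    # beam (larger nprobe) recovers recall but adds latency, which is
    # the accuracy/speed tradeoff described above.
    lists = np.argsort(np.linalg.norm(centroids - query, axis=1))[:nprobe]
    cand = np.flatnonzero(np.isin(assign, lists))
    dists = np.linalg.norm(index[cand] - query, axis=1)
    return cand[np.argsort(dists)[:k]]

rng = np.random.default_rng(1)
index = rng.standard_normal((5_000, 64)).astype(np.float32)
query = rng.standard_normal(64).astype(np.float32)
centroids, assign = build_ivf(index)
result = ivf_search(query, index, centroids, assign)
```

As the index grows, you need more lists or a larger `nprobe` to hold recall steady, and either choice adds latency.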

The result is a fundamental tradeoff between index size, speed, and quality that cannot be eliminated, only managed.

The Hidden Costs of Workarounds

Index partitioning distributes the corpus across shards. This maintains per-shard speed but adds routing complexity. Cross-shard queries either hit all shards (slow) or risk missing results (inaccurate).
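The hit-all-shards path is a scatter-gather merge. A minimal sketch, assuming each shard search returns `(distance, doc_id)` pairs (the `search_shard` callable and the shard layout are hypothetical):

```python
import heapq

def fan_out_search(query, shards, k, search_shard):
    # Scatter-gather: send the query to every shard, then merge the
    # per-shard top-k lists into a global top-k. Results stay exact,
    # but each query costs num_shards searches and end-to-end latency
    # is bounded by the slowest shard.
    per_shard = [search_shard(query, s, k) for s in shards]
    return heapq.nsmallest(k, (hit for hits in per_shard for hit in hits))

# Toy shards: precomputed (distance, doc_id) hits per shard.
shards = [
    [(0.12, "a1"), (0.40, "a2")],
    [(0.05, "b1"), (0.33, "b2")],
    [(0.27, "c1"), (0.90, "c2")],
]
top2 = fan_out_search(None, shards, 2, lambda q, s, k: s[:k])
# top2 == [(0.05, "b1"), (0.12, "a1")]
```

Routing to a subset of shards avoids the fan-out cost but reintroduces the accuracy risk: any shard you skip may have held a true top-k result.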

Aggressive pre-filtering narrows candidates before similarity computation. This works when filters are selective but requires well-designed metadata layers.

Quantization compresses vectors to reduce computation. The tradeoff is precision — compressed vectors produce less accurate scores, pushing systems further into embedding collapse territory.
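Scalar quantization to int8 is the simplest form of this tradeoff; a sketch of the compression and the error it introduces:

```python
import numpy as np

def quantize_int8(vectors):
    # Scalar quantization: map float32 values to int8, shrinking
    # memory and bandwidth 4x. Distances computed on the compressed
    # vectors are only approximate, which is the precision loss
    # noted above.
    scale = float(np.abs(vectors).max()) / 127.0
    return np.round(vectors / scale).astype(np.int8), scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
vecs = rng.standard_normal((1_000, 128)).astype(np.float32)
q, scale = quantize_int8(vecs)
err = np.abs(dequantize(q, scale) - vecs).max()  # bounded by scale / 2
```

Product quantization compresses far more aggressively than this, at a correspondingly larger hit to score fidelity.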

Latency Compounds Through the Pipeline

Vector search latency is only the first layer. Production pipelines add reranking (50-200ms), context assembly (10-50ms), and the LLM call (500-3000ms).

A 200ms vector search delays every downstream step. For applications needing multiple retrieval steps, latency multiplies — a three-step chain with 150ms per step adds 450ms before any generation.
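Summed as a back-of-the-envelope budget, using illustrative midpoints from the ranges above:

```python
# Hypothetical latency budget (ms) for a single-retrieval pipeline.
vector_search = 200
rerank = 150           # midpoint of the 50-200ms range
context_assembly = 30  # midpoint of the 10-50ms range
llm = 1500

single_step = vector_search + rerank + context_assembly + llm  # 1880 ms

# A three-step retrieval chain at 150ms per step adds 450ms of
# retrieval latency before any generation happens.
three_step_retrieval = 3 * 150  # 450 ms
```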

Metadata-First Architecture

The scaling problem stems from searching too large a space. Most vector databases compute similarity against the full index and filter afterward. Metadata-first architectures flip this, narrowing with structured filters before any vector computation.

If a query scopes to a specific tenant, time range, or category, vector search runs against thousands of candidates instead of millions. HydraDB uses this approach to maintain sub-50ms p50 latency at scale, and it learns from query patterns to identify which filters prune the candidate set most effectively.
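The filter-then-search ordering can be sketched as below. HydraDB's internals are not public, so this only illustrates the general pattern; the `meta` layout and `metadata_first_search` helper are hypothetical:

```python
import numpy as np

def metadata_first_search(query, index, meta, filters, k=5):
    # Narrow with structured filters first, then run the vector
    # computation only on the surviving candidates. The similarity
    # cost now depends on filter selectivity, not total corpus size.
    mask = np.ones(len(index), dtype=bool)
    for field, value in filters.items():
        mask &= meta[field] == value
    cand = np.flatnonzero(mask)
    dists = np.linalg.norm(index[cand] - query, axis=1)
    return cand[np.argsort(dists)[:k]]

rng = np.random.default_rng(0)
index = rng.standard_normal((10_000, 64)).astype(np.float32)
meta = {"tenant": rng.integers(0, 100, 10_000)}  # ~100 docs per tenant
query = rng.standard_normal(64).astype(np.float32)
hits = metadata_first_search(query, index, meta, {"tenant": 7})
```

Here a 10,000-vector index shrinks to roughly 100 candidates before any distance is computed, and that candidate count stays flat as the corpus grows, as long as the filters stay selective.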

Frequently Asked Questions

At what scale does latency become a problem?

Most teams notice meaningful degradation between ten million and one hundred million vectors. By one billion, latency management becomes a primary engineering concern requiring dedicated infrastructure investment.

Can more hardware fix it?

To a point. Additional memory and replicas help but have diminishing returns — algorithmic tradeoffs remain, and costs grow faster than linearly with index size.

Conclusion

Vector search latency is manageable at small scale and deceptive at medium scale. The degradation is gradual enough that teams miss it until production users are affected and performance budgets are already blown. Architectures that filter and scope before computing similarity avoid this trap, keeping query latency roughly flat even as the corpus grows.
