Every RAG pipeline, every recommendation engine, every semantic search feature — they all depend on the same primitive: find me the most similar vectors. You call an API, vectors come back ranked. But what actually happens inside that index? The answer involves graph theory, probabilistic data structures, and some surprisingly elegant math.
What problem does vector similarity search actually solve?
Traditional databases index exact values. You query for rows where user_id = 42 or price < 100. B-trees handle this brilliantly. But embeddings are different. When you encode a sentence, an image, or a user profile into a 1536-dimensional vector, there is no exact match to look for. You need the closest neighbors in a space with more dimensions than you can visualize.
This is the nearest neighbor problem, and it has been studied since the 1960s. The brute-force solution is simple: compute the distance between your query vector and every vector in the database, sort by distance, return the top k. For 1,000 vectors, this is instant. For 10 million, it takes seconds. For a billion, it is unusable.
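The brute-force baseline is worth writing out, both because it is the ground truth every ANN index is measured against and because it is genuinely the right tool at small scale. A minimal sketch with NumPy (function name is illustrative):

```python
import numpy as np

def brute_force_knn(query: np.ndarray, vectors: np.ndarray, k: int) -> np.ndarray:
    """Return indices of the k nearest vectors by Euclidean distance."""
    # Squared L2 distance to every stored vector in one vectorized pass.
    dists = np.sum((vectors - query) ** 2, axis=1)
    # argpartition finds the k smallest in O(n); fully sort only those k.
    nearest = np.argpartition(dists, k)[:k]
    return nearest[np.argsort(dists[nearest])]

# Tiny example: the two closest points to the origin.
points = np.array([[0.1, 0.0], [5.0, 5.0], [0.0, 0.2], [3.0, -4.0]])
print(brute_force_knn(np.zeros(2), points, k=2))  # → [0 2]
```

The vectorized distance computation is why brute force stays viable far longer than a naive loop would suggest: a single matrix pass over 50K vectors is microseconds of NumPy time.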
The key insight that powers every modern vector database: you do not need exact nearest neighbors. Approximate Nearest Neighbors (ANN) algorithms sacrifice a small amount of accuracy — typically returning 95-99% of the true nearest neighbors — in exchange for query times that drop from seconds to milliseconds. This is the recall-latency tradeoff, and it is the single most important concept in vector search engineering.
How do distance metrics shape everything downstream?
Before you can find similar vectors, you have to define similar. Three distance metrics dominate production systems, and choosing the wrong one is a silent accuracy killer.
Cosine similarity measures the angle between two vectors, ignoring magnitude. Two vectors pointing in roughly the same direction score high regardless of how long they are. This is why it dominates text embeddings — a short document and a long document about the same topic produce vectors with similar directions but different magnitudes.
Euclidean distance (L2) measures the straight-line distance between points. It cares about magnitude. If your vectors are not normalized, a vector that is numerically larger will be farther from everything, even semantically similar items. Most embedding models output normalized vectors, which makes cosine and L2 equivalent — but not all do, and this trips up teams constantly.
Dot product (inner product) combines direction and magnitude. It is fastest to compute because it skips the normalization step that cosine requires. When your embeddings are normalized — which OpenAI, Cohere, and most modern providers guarantee — dot product, cosine similarity, and L2 distance all produce identical rankings. When they are not normalized, dot product and cosine diverge, and your recall drops without any obvious error.
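The equivalence for unit-length vectors is easy to verify: cosine similarity equals the dot product, and squared L2 distance is a monotone function of it (‖a−b‖² = 2 − 2·a·b), so all three produce identical rankings. A quick NumPy check:

```python
import numpy as np

rng = np.random.default_rng(0)
query = rng.normal(size=64)
docs = rng.normal(size=(100, 64))

# Normalize everything to unit length, as most embedding providers do.
query /= np.linalg.norm(query)
docs /= np.linalg.norm(docs, axis=1, keepdims=True)

dot = docs @ query                                    # inner product
cos = dot / (np.linalg.norm(docs, axis=1) * np.linalg.norm(query))
l2 = np.linalg.norm(docs - query, axis=1)             # Euclidean distance

# Same ranking from all three: descending similarity == ascending distance.
assert np.array_equal(np.argsort(-dot), np.argsort(l2))
assert np.allclose(cos, dot)  # cosine collapses to dot product at unit norm
```

Skip the normalization step in this snippet and the first assertion fails — which is exactly the silent divergence described above.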
How does HNSW actually work?
Hierarchical Navigable Small World (HNSW) is the algorithm behind Pinecone, Qdrant, Weaviate, pgvector (since 0.5.0), Milvus, and essentially every serious vector database in production today. Published by Yury Malkov and Dmitry Yashunin in 2016, it combines two ideas: navigable small-world graphs and a hierarchical skip-list structure.
The core concept
Imagine you have a million vectors. HNSW builds a multi-layer graph. The bottom layer (layer 0) contains every vector, each connected to its nearest neighbors. Layer 1 contains a random subset of vectors — roughly 1/M of them, where M is a tunable parameter. Layer 2 contains a subset of layer 1, and so on. The top layer might contain just a handful of nodes.
To find the nearest neighbors of a query vector, you start at the top layer, which is sparse. You greedily walk to the node closest to your query. Then you drop down one layer, which is denser, and walk again. At each layer, the graph gets more detailed, and your search gets more precise. By the time you reach layer 0, you are already in the right neighborhood, and you only need to explore a small region.
This is why HNSW achieves logarithmic query time. Each layer acts as an express lane. The top layers skip over vast regions of the space. The bottom layers do the fine-grained work. It is the same principle as a skip list, but in high-dimensional space instead of a sorted list.
The parameters that actually matter
HNSW has two critical build parameters and one query parameter. Getting these wrong is the difference between 99% recall and 80% recall at the same latency.
| Parameter | What It Controls | Higher Value | Lower Value | Typical Production Range |
|---|---|---|---|---|
| M (max connections) | Edges per node in the graph | Better recall, more memory, slower build | Faster build, less memory, worse recall | 16–64 |
| efConstruction | Beam width during index build | More accurate graph, much slower build | Faster build, lower quality graph | 128–512 |
| efSearch | Beam width during query | Better recall, higher latency | Faster queries, lower recall | 64–256 |
M determines how connected the graph is. Each node maintains at most M connections per layer (2×M on layer 0). More connections mean more paths to traverse during search, which improves recall but increases memory consumption. Each connection is a neighbor ID of roughly 4 bytes, so with M=64 (up to 128 links per node on layer 0), a billion-row index carries on the order of half a terabyte of overhead just for the graph structure — on top of the vectors themselves.
efConstruction controls how thoroughly the algorithm searches for neighbors when building the index. It is a beam width: the algorithm maintains a priority queue of this size while looking for the best connections. Higher values produce a better graph but dramatically slow down index building. Setting efConstruction below M makes no sense — the algorithm cannot find enough neighbors to fill the connection slots. A rule of thumb: efConstruction should be at least 2× M.
efSearch is the query-time beam width. This is the parameter you tune in production. It controls the recall-latency tradeoff directly. Start at efSearch = 50, measure recall against a ground-truth brute-force result, and increase until you hit your target recall. Most production systems land between 100 and 200 for 95%+ recall.
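Measuring recall requires a ground-truth set, which is just the brute-force answer on a sample of queries. A sketch of the tuning loop — `ann_search` is a placeholder for whatever client call your database exposes:

```python
import numpy as np

def recall_at_k(ann_ids, exact_ids) -> float:
    """Fraction of the true nearest neighbors the ANN index returned."""
    return len(set(ann_ids) & set(exact_ids)) / len(exact_ids)

def ground_truth(query, vectors, k):
    """Brute-force top-k: the exact answer to measure against."""
    dists = np.linalg.norm(vectors - query, axis=1)
    return np.argsort(dists)[:k]

# Tuning loop sketch: raise efSearch until mean recall clears the target.
# `ann_search(query, k, ef_search)` is hypothetical — substitute your client.
# for ef in (50, 100, 150, 200):
#     r = np.mean([recall_at_k(ann_search(q, 10, ef), ground_truth(q, vectors, 10))
#                  for q in sample_queries])
#     if r >= 0.95:
#         break
```

Hold the ground truth fixed across tuning runs; recomputing it per run wastes time and hides nondeterminism in your sampling.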
What are the alternatives to HNSW, and when do they win?
HNSW dominates because it offers the best recall-latency tradeoff for most workloads. But it is not the only option, and it has real weaknesses — primarily its memory appetite and slow index build time.
IVF (Inverted File Index)
IVF partitions the vector space into clusters using k-means, then at query time searches only the nearest clusters. It is conceptually simpler than HNSW and uses significantly less memory because there is no graph structure — just cluster centroids and inverted lists. The tradeoff: IVF recall is sensitive to the number of clusters probed (nprobe), and it degrades badly on non-uniform distributions where clusters have vastly different sizes. IVF shines when memory is constrained and you can tolerate slightly lower recall.
Product Quantization (PQ)
PQ compresses vectors by splitting each one into subvectors and quantizing each subvector to the nearest centroid in a learned codebook. A 1536-dimensional float32 vector (6 KB) can be compressed to 192 bytes — a 32× reduction. The catch: PQ introduces quantization error, which directly reduces recall. In practice, PQ is rarely used alone. It is combined with IVF (IVF-PQ) or HNSW to reduce memory while keeping acceptable recall. Pinecone and Milvus both use IVF-PQ for their largest-scale tiers.
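The mechanics of PQ fit in a short sketch: split each vector into subvectors, learn a small k-means codebook per subspace, and store one code index per subvector. This NumPy-only version is illustrative — real libraries like faiss train codebooks far more carefully and use 256 codes per subspace:

```python
import numpy as np

def train_pq(vectors, m, ncodes, iters=10, seed=0):
    """Learn one tiny k-means codebook per subspace (toy trainer)."""
    rng = np.random.default_rng(seed)
    d = vectors.shape[1]
    assert d % m == 0, "dimension must divide evenly into m subvectors"
    sub = d // m
    codebooks = []
    for i in range(m):
        chunk = vectors[:, i * sub:(i + 1) * sub]
        cents = chunk[rng.choice(len(chunk), ncodes, replace=False)]
        for _ in range(iters):
            # Assign each subvector to its nearest centroid, then recenter.
            assign = np.argmin(((chunk[:, None] - cents[None]) ** 2).sum(-1), axis=1)
            for c in range(ncodes):
                if np.any(assign == c):
                    cents[c] = chunk[assign == c].mean(axis=0)
        codebooks.append(cents)
    return codebooks

def pq_encode(v, codebooks):
    """Compress one vector to m one-byte codes."""
    sub = len(v) // len(codebooks)
    return np.array([np.argmin(((cents - v[i * sub:(i + 1) * sub]) ** 2).sum(-1))
                     for i, cents in enumerate(codebooks)], dtype=np.uint8)

rng = np.random.default_rng(1)
data = rng.normal(size=(500, 32))
books = train_pq(data, m=8, ncodes=16)
code = pq_encode(data[0], books)
print(code.shape, code.dtype)  # 8 one-byte codes replace 32 float32s: 16x smaller
```

The 32× figure above follows the same arithmetic at production scale: 1536 float32 values (6,144 bytes) compressed to 192 one-byte codes.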
DiskANN
Developed at Microsoft Research, DiskANN stores the graph structure on SSD instead of RAM. It achieves near-HNSW recall at a fraction of the memory cost by using a Vamana graph (a variant of the navigable small-world graph) with an SSD-optimized access pattern. The latency is higher — typically 2-5ms versus sub-millisecond for in-memory HNSW — but for billion-scale deployments where the index would require hundreds of gigabytes of RAM, DiskANN changes the economics entirely. Azure AI Search and some Milvus configurations use DiskANN under the hood.
| Algorithm | Latency at ≥95% Recall | Memory | Build Time | Best For |
|---|---|---|---|---|
| HNSW | Sub-millisecond | High (vectors + graph in RAM) | Slow (hours for 100M+) | Low-latency, high-recall workloads |
| IVF-PQ | 1–5ms | Low (compressed vectors + centroids) | Moderate (training + assignment) | Billion-scale, memory-constrained |
| DiskANN | 2–5ms | Very low (graph on SSD) | Moderate | Billion-scale, cost-sensitive |
| Brute Force | Linear in dataset size (exact, 100% recall) | Vectors only | None | Under ~50K vectors |
Where does vector search get tricky in production?
The demo always works. Production is where the interesting failures live.
The filtering problem
Almost every real query involves metadata filtering: find me similar products in category X, or similar documents from the last 30 days. This sounds trivial. It is not. There are two approaches, and both have sharp edges.
Pre-filtering narrows the candidate set before the ANN search. If your filter eliminates 99% of vectors, the HNSW graph is effectively destroyed — most of the edges lead to filtered-out nodes, and the algorithm cannot navigate efficiently. Recall craters. Post-filtering runs the ANN search first, then removes non-matching results. If the filter is restrictive, you might search for top-100 and get back 3 results after filtering, which means you overfetch massively or return too few results.
The production solution is partitioned indexes — build separate HNSW indexes per high-cardinality filter dimension (like tenant_id), and use bitmap intersections for low-cardinality filters. Qdrant and Weaviate handle this reasonably well out of the box. pgvector does not — you are on your own for filtered search performance tuning.
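The overfetch failure mode in post-filtering is easy to demonstrate. This self-contained sketch uses brute force as a stand-in for the ANN stage; the structure is what matters:

```python
import numpy as np

def post_filtered_search(query, vectors, meta, predicate, k, overfetch=4):
    """Post-filtering sketch: fetch overfetch*k candidates by distance,
    then drop the ones the metadata predicate rejects. When the filter
    is selective, even a large overfetch can come back short."""
    dists = np.linalg.norm(vectors - query, axis=1)   # stand-in for ANN
    candidates = np.argsort(dists)[:overfetch * k]
    hits = [i for i in candidates if predicate(meta[i])]
    return hits[:k]

rng = np.random.default_rng(0)
vecs = rng.normal(size=(1000, 8))
meta = [{"category": "a" if i % 100 == 0 else "b"} for i in range(1000)]

# 1% selectivity: the top-40 candidates rarely contain 10 category-"a" rows.
res = post_filtered_search(vecs[0], vecs, meta, lambda m: m["category"] == "a", k=10)
print(len(res))  # typically far fewer than 10 — the overfetch was not enough
```

To reliably return k results here you would need an overfetch factor near the inverse of the filter's selectivity — 100× for a 1% filter — which is exactly the massive-overfetch cost described above.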
The update problem
HNSW indexes are designed for immutable data. Deleting a vector means removing a node from the graph, which can orphan regions of the space — nodes that were reachable only through the deleted node become disconnected, and recall drops. Most vector databases handle this by soft-deleting (marking nodes as invisible) and periodically rebuilding the index. This works until your deletion rate is high enough that the graph is more holes than structure.
Insertions are slightly better — HNSW supports incremental inserts by connecting new nodes to existing ones. But the connections chosen during incremental insert are lower quality than those chosen during a full build, because the algorithm has less global context. Over time, an index that has been heavily modified will have worse recall than one built from scratch on the same data. Plan for periodic full rebuilds.
“The dirty secret of vector databases: they are append-optimized systems pretending to be general-purpose databases. High update rates and frequent deletes will degrade recall silently, and the only fix is a full index rebuild.”
The dimensionality trap
Higher dimensions do not always mean better results. OpenAI offers text-embedding-3-large at 3072 dimensions, but you can truncate to 1536 or even 256 dimensions with Matryoshka embeddings. Here is what matters: doubling dimensions roughly doubles memory, doubles build time, and increases query latency — but the recall improvement is logarithmic. For many workloads, 512 or 768 dimensions deliver 95% of the accuracy at 25-50% of the cost. Always benchmark your specific data before defaulting to max dimensions.
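Truncating a Matryoshka embedding is just slicing and renormalizing — but only for models trained for it, such as OpenAI's text-embedding-3 family; truncating an ordinary embedding this way destroys its geometry. A sketch:

```python
import numpy as np

def truncate_embedding(v: np.ndarray, dims: int) -> np.ndarray:
    """Keep the first `dims` components and renormalize to unit length.
    Valid only for Matryoshka-trained models, where the leading
    dimensions carry the coarsest semantic information."""
    t = v[:dims]
    return t / np.linalg.norm(t)

full = np.random.default_rng(0).normal(size=3072)
short = truncate_embedding(full, 768)
print(short.shape, round(float(np.linalg.norm(short)), 6))  # (768,) 1.0
```

The renormalization step is the part teams forget: a truncated-but-unnormalized vector silently breaks the cosine/dot-product equivalence discussed earlier.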
What are the real failure modes?
These are the failures we have seen or debugged in production vector search deployments. None of them generate obvious error messages.
- Embedding drift: Model provider updates their embedding model, new vectors are subtly incompatible with old ones, recall degrades over weeks
- Metric mismatch: Index built with L2, queries computed with cosine — results look plausible but are consistently suboptimal
- Stale graph: High churn rate (>20% deletes/month) without rebuild — HNSW graph degrades, recall drops 10-15% silently
- Memory pressure: HNSW graph gets swapped to disk — latency jumps from <1ms to 50ms+, looks like a network issue
- Dimensionality overshoot: Using 3072-dim embeddings when 768 would perform identically for your data, burning 4× the RAM
- Filter selectivity cliff: Filtered queries suddenly return zero results when a new category has too few vectors for effective ANN search
The most insidious failure is embedding drift. When OpenAI or Cohere updates their embedding model — even a minor version — the vector space shifts. Vectors generated by the old model and the new model are in slightly different spaces. Your index, built on old vectors, starts returning slightly worse results for queries embedded with the new model. There is no error. Recall degrades gradually. The fix is to re-embed everything and rebuild the index, which at scale can take hours to days.
How should you choose a vector database in 2026?
The market has consolidated around a few architectures. Here is an honest assessment based on what we have deployed and operated.
| Database | Best For | Watch Out For | Operational Complexity |
|---|---|---|---|
| pgvector | Teams already on PostgreSQL, <5M vectors | No filtered search optimization, single-node only, full rebuild required for HNSW | Low (it is just Postgres) |
| Qdrant | High-performance, filtered search, Rust performance | Younger ecosystem, clustering is newer | Medium |
| Pinecone | Serverless, zero-ops teams, rapid prototyping | Cost at scale (per-read pricing), limited control over indexing | Very Low |
| Weaviate | Multi-modal search, hybrid keyword+vector | Java/Go memory overhead, complex schema model | Medium-High |
| Milvus | Billion-scale, GPU-accelerated, IVF-PQ workloads | Complex deployment (etcd + MinIO + Milvus), steep learning curve | High |
For most teams building RAG or semantic search features, the decision tree is simple. If you are already on PostgreSQL and your dataset is under 5 million vectors, pgvector with HNSW indexing is the pragmatic choice — no new infrastructure, familiar tooling, good-enough performance. If you need filtered search at scale, Qdrant is the current performance leader. If you want zero operational overhead and your budget tolerates per-query pricing, Pinecone serverless removes the most headaches.
The one thing I would not do: build on a vector database that requires you to manage etcd, object storage, and a custom control plane just to store embeddings. Milvus is technically impressive, but unless you are operating at genuine billion-scale with a dedicated infrastructure team, the operational cost will dwarf the infrastructure savings.
When should you avoid vector search entirely?
Vector search is not always the right tool, and the hype cycle makes it easy to over-apply.
If your search is primarily keyword-based — product SKUs, exact names, log queries — a full-text search engine like Elasticsearch or Typesense will outperform vector search in both relevance and latency. Vector search finds semantically similar results, but it also finds results that are semantically adjacent in confusing ways. Search for "Python error handling" and you might get results about "Java exception management" — semantically related, but not what the user wanted.
Hybrid search — combining BM25 keyword scoring with vector similarity using reciprocal rank fusion — is the pragmatic middle ground. Most production search systems we build at Fordel use hybrid approaches. The keyword component handles exact matches and known-item searches. The vector component handles fuzzy, conceptual queries. Reciprocal rank fusion merges the two ranked lists without needing to normalize scores across different systems.
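Reciprocal rank fusion itself is only a few lines — each item scores the sum of 1/(k + rank) over every list it appears in, with k = 60 being the constant from the original RRF paper:

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge ranked lists without normalizing scores across systems:
    each item accumulates 1 / (k + rank) from every list it appears in."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

bm25 = ["doc3", "doc1", "doc7"]      # keyword ranking
vector = ["doc1", "doc9", "doc3"]    # semantic ranking
print(reciprocal_rank_fusion([bm25, vector]))
# → ['doc1', 'doc3', 'doc9', 'doc7']
```

Note how doc1 edges out doc3 despite doc3 topping the BM25 list: appearing near the top of both lists beats winning one. That consensus effect is why RRF works without score normalization.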
What does the architecture look like end-to-end?
A production vector search system has more moving parts than the index itself. Here is the architecture we have converged on across multiple deployments.
The write path: content arrives, gets chunked (for documents) or processed (for structured data), sent to an embedding model, and the resulting vector plus metadata gets upserted into the vector database. A hash of the original content is stored alongside the vector to detect when re-embedding is needed after model updates.
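The content-hash bookkeeping is a few lines. Storing the hash and the pinned embedding model version alongside each vector lets you find stale entries after a model update — the record layout here is illustrative, not any particular database's schema:

```python
import hashlib

EMBED_MODEL = "text-embedding-3-small"  # whatever model/version you pin

def make_record(doc_id: str, text: str, embed) -> dict:
    """Build an upsert payload that remembers what produced the vector."""
    return {
        "id": doc_id,
        "vector": embed(text),
        "metadata": {
            "content_sha256": hashlib.sha256(text.encode()).hexdigest(),
            "embed_model": EMBED_MODEL,
        },
    }

def needs_reembedding(record: dict, text: str) -> bool:
    """True if the content changed or the pinned model moved on."""
    meta = record["metadata"]
    return (meta["content_sha256"] != hashlib.sha256(text.encode()).hexdigest()
            or meta["embed_model"] != EMBED_MODEL)
```

After a provider model update, bump EMBED_MODEL and a scan with needs_reembedding identifies exactly which vectors belong to the old space — turning embedding drift from a silent failure into a batch job.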
The query path: user input gets embedded using the same model, the vector database returns the top-k candidates with scores, a reranker (Cohere Rerank or a cross-encoder) re-scores the candidates for higher precision, and the final results are passed to the LLM or returned to the user.
The reranking step is non-optional for serious deployments. ANN search optimizes for recall — casting a wide net. Reranking optimizes for precision — sorting the contents of the net accurately. A two-stage pipeline with broad ANN retrieval (top-50 to top-100) followed by cross-encoder reranking (to top-5 or top-10) consistently outperforms a single-stage pipeline tuned for higher precision, at lower total latency.
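Wired together, the two stages are only a few lines. `ann_search` and `cross_encoder_score` are injected placeholders for your vector client and reranker — the shape of the pipeline is the point, and the toy stand-ins below exist only to make the sketch runnable:

```python
def retrieve_then_rerank(query, ann_search, cross_encoder_score,
                         fetch_k=100, final_k=5):
    """Stage 1 casts a wide net (recall); stage 2 sorts it precisely.
    Both stages are injected so the pipeline works with any vector
    client and any reranker."""
    candidates = ann_search(query, k=fetch_k)          # broad, approximate
    scored = [(cross_encoder_score(query, doc), doc) for doc in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in scored[:final_k]]

# Toy stand-ins: substring match as "ANN", length similarity as "reranker".
docs = ["alpha beta", "beta gamma", "alpha gamma", "delta"]
fake_ann = lambda q, k: [d for d in docs if q in d][:k]
fake_score = lambda q, d: -abs(len(d) - len(q))
print(retrieve_then_rerank("alpha", fake_ann, fake_score, fetch_k=10, final_k=2))
# → ['alpha beta', 'alpha gamma']
```

Keeping the two stages as injected functions also makes the fetch_k/final_k ratio — the knob that trades reranker cost against final precision — explicit and testable.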
What is the bottom line?
Vector similarity search is not magic. It is a well-understood family of algorithms — primarily HNSW in 2026 — with clear tradeoffs between recall, latency, and memory. The system works by building a multi-layer navigable graph where greedy traversal from sparse upper layers to dense lower layers achieves logarithmic query time at the cost of high memory usage.
The things that actually matter in production are not the algorithm itself — HNSW is HNSW everywhere — but the surrounding decisions: which distance metric matches your embeddings, how you handle filtered queries without destroying recall, how you detect and recover from embedding drift, and whether you are rebuilding indexes frequently enough to maintain graph quality.
The most common mistake we see is treating the vector database as a black box. Teams pick a managed service, throw embeddings at it, and wonder why recall degrades over time. Understanding the index structure — even at the level covered in this article — gives you the vocabulary to debug production issues, right-size your infrastructure, and make informed tradeoffs between cost and quality.
“The embedding model determines what your search can find. The index determines how fast it finds it. The distance metric determines what similar means. Get any one of these wrong and your search looks like it works right up until it does not.”