MervCodes

Tech Reviews From A Programmer

Implementing Semantic Search: Embeddings and Ranking

Keyword search has a fundamental blind spot: it matches strings, not meaning. A user searching for "how to reset my password" will miss a help article titled "recovering account access" because the two share almost no overlapping words. Semantic search closes that gap by representing text as vectors in a high-dimensional space, where proximity reflects conceptual similarity rather than lexical overlap. This post walks through how to actually build it — from generating embeddings to ranking results that users trust.

What Semantic Search Actually Does

At its core, semantic search converts both your documents and the user's query into numerical vectors called embeddings. These vectors are produced by a model trained so that texts with similar meanings land near each other. "Reset my password" and "recover account access" end up as nearby points, even with no shared keywords.

Retrieval then becomes a geometry problem: find the document vectors closest to the query vector. "Closest" is almost always measured by cosine similarity, which compares the angle between two vectors and ignores their magnitude. The result is a ranked list ordered by conceptual relevance.
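
To make the "angle, not magnitude" point concrete, here is a minimal sketch of cosine similarity in plain Python. The three-dimensional vectors are made up for illustration; real embeddings have hundreds of dimensions.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between a and b; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


# Toy "embeddings" (hypothetical values, not produced by any model).
query = [1.0, 2.0, 0.0]
same_direction = [10.0, 20.0, 0.0]  # the query scaled by 10
orthogonal = [0.0, 0.0, 5.0]        # shares no direction with the query

print(cosine_similarity(query, same_direction))  # approximately 1.0
print(cosine_similarity(query, orthogonal))      # 0.0
```

Scaling a vector by 10 leaves its score essentially unchanged, which is what "ignores their magnitude" means in practice.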

The pipeline has three stages worth treating as distinct concerns:

  1. Indexing — embed your corpus once and store the vectors.
  2. Retrieval — embed the incoming query and find nearest neighbors.
  3. Ranking — reorder candidates so the best results surface first.

Choosing and Generating Embeddings

Your embedding model is the single most important decision. A few practical guidelines:

  • Match the model to your domain. General-purpose models work well for broad content, but legal, medical, or code search often benefit from domain-tuned models.
  • Mind the dimensionality. Larger vectors (1024+ dimensions) can capture more nuance but cost more to store and compare. Many production systems do fine with 384–768 dimensions.
  • Stay consistent. The same model must embed both documents and queries. Mixing models produces vectors that live in incompatible spaces, and similarity scores become meaningless.
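
To put rough numbers on the dimensionality guideline above, this back-of-the-envelope sketch estimates raw storage for one million vectors. It assumes 4-byte float32 values and ignores index overhead and metadata, so treat the results as a lower bound.

```python
def index_size_bytes(num_vectors: int, dims: int, bytes_per_value: int = 4) -> int:
    """Raw size of a dense vector matrix (float32 by default), without index overhead."""
    return num_vectors * dims * bytes_per_value


for dims in (384, 768, 1024):
    size_gb = index_size_bytes(1_000_000, dims) / 1e9
    print(f"{dims:>4} dims: {size_gb:.2f} GB")
```

Under these assumptions, going from 384 to 1024 dimensions raises the raw footprint from about 1.5 GB to about 4.1 GB, and every brute-force comparison gets proportionally more expensive.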

Here is the shape of an indexing step using a typical embedding API:

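import numpy as np

# `model` is assumed to be an already-loaded embedding model that exposes
# encode(list_of_texts) and returns an (N, D) array, as sentence-transformers does.
def embed(texts: list[str]) -> np.ndarray:
    """Return an (N, D) array of L2-normalized vectors."""
    vectors = np.asarray(model.encode(texts), dtype=np.float32)
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    # Clip the norms so an all-zero vector does not cause a division by zero.
    return vectors / np.clip(norms, 1e-12, None)

# load_corpus() stands for whatever returns your documents, each with a .text attribute.
documents = load_corpus()
doc_vectors = embed([d.text for d in documents])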

Normalizing vectors to unit length up front is a small trick with a big payoff: once every vector has magnitude 1, cosine similarity reduces to a plain dot product, which is cheaper to compute and means any index built for inner-product search returns cosine rankings directly.

Chunking long documents

Embedding models have token limits, and a single vector for a 5,000-word document blurs together too many distinct ideas. Split long documents into chunks of a few hundred tokens, embed each chunk, and store them with a reference back to the parent document. A common refinement is to overlap chunks slightly (say 10–15%) so that ideas spanning a boundary aren't lost. At query time you retrieve chunks, then collapse them back to documents during ranking.
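
The chunk-then-collapse flow described above can be sketched as follows. Two assumptions are worth flagging: tokens here are whatever list you pass in (a real system would use the embedding model's own tokenizer), and collapsing by each parent's best chunk score is only one reasonable choice among several.

```python
def chunk_tokens(tokens: list[str], size: int = 200, overlap: int = 25) -> list[list[str]]:
    """Split tokens into windows of `size` that share `overlap` tokens with the previous window."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # this window already reaches the end of the document
    return chunks


def collapse_to_documents(chunk_hits: list[tuple[str, float]]) -> list[tuple[str, float]]:
    """Keep each parent document's best chunk score, then sort documents by that score."""
    best: dict[str, float] = {}
    for parent_id, score in chunk_hits:
        if parent_id not in best or score > best[parent_id]:
            best[parent_id] = score
    return sorted(best.items(), key=lambda item: item[1], reverse=True)


# Tiny example: 10 tokens, windows of 4 with 1 token of overlap.
tokens = [str(i) for i in range(10)]
print(chunk_tokens(tokens, size=4, overlap=1))

# Chunk-level hits as (parent_document_id, score) pairs; the ids are hypothetical.
print(collapse_to_documents([("doc-a", 0.7), ("doc-b", 0.9), ("doc-a", 0.8)]))
```

The defaults (200 tokens, 25 of overlap) give 12.5% overlap, inside the 10–15% range mentioned above.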

Storing and Retrieving Vectors

For a few thousand documents, brute-force search is perfectly fine — compute the dot product against every vector and take the top results:

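def search(query: str, k: int = 10):
    q = embed([query])[0]
    scores = doc_vectors @ q          # dot product == cosine for unit-length vectors
    k = min(k, len(scores))           # argpartition raises if k exceeds the corpus size
    if k == 0:
        return []
    top = np.argpartition(-scores, k - 1)[:k]   # indices of the k best, unordered
    top = top[np.argsort(-scores[top])]          # order those k by descending score
    return [(float(scores[i]), documents[i]) for i in top]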

This is exact and trivially correct, which makes it a great baseline even if you later replace it. Once your corpus grows past a few hundred thousand vectors, exhaustive comparison gets slow, and you'll want approximate nearest neighbor (ANN) search. ANN indexes such as HNSW graphs or IVF partitions trade a tiny bit of recall for dramatic speedups. A dedicated vector database handles this for you, along with persistence, filtering, and horizontal scaling.
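
To illustrate the recall-for-speed trade, here is a toy IVF-style sketch in plain Python. It is not how any particular library implements IVF: the centroids are hand-picked (real indexes typically learn them, for example with k-means), the vectors are two-dimensional, and `nprobe` is the conventional name for how many partitions get searched.

```python
def dot(a: list[float], b: list[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def build_ivf(vectors: list[list[float]], centroids: list[list[float]]) -> list[list[int]]:
    """Assign every vector index to the partition of its most similar centroid."""
    partitions: list[list[int]] = [[] for _ in centroids]
    for idx, vec in enumerate(vectors):
        nearest = max(range(len(centroids)), key=lambda c: dot(vec, centroids[c]))
        partitions[nearest].append(idx)
    return partitions


def ivf_search(query, vectors, centroids, partitions, k=2, nprobe=1):
    """Score only the vectors stored in the `nprobe` partitions closest to the query."""
    by_similarity = sorted(range(len(centroids)),
                           key=lambda c: dot(query, centroids[c]), reverse=True)
    candidates = [idx for c in by_similarity[:nprobe] for idx in partitions[c]]
    scored = sorted(((dot(query, vectors[i]), i) for i in candidates), reverse=True)
    return scored[:k]


# Hypothetical unit-length vectors and two hand-picked centroids.
vectors = [[1.0, 0.0], [0.8, 0.6], [0.6, 0.8], [0.0, 1.0]]
centroids = [[1.0, 0.0], [0.0, 1.0]]
partitions = build_ivf(vectors, centroids)
query = [0.6, 0.8]

fast = [i for _, i in ivf_search(query, vectors, centroids, partitions, nprobe=1)]
exact = [i for _, i in ivf_search(query, vectors, centroids, partitions, nprobe=2)]
print(fast, exact)
```

With `nprobe=1` the search scores half the vectors and misses vector 1, which sits in the other partition; with `nprobe=2` it scores everything and matches brute force. Real ANN indexes expose a similar knob for trading recall against latency.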

Don't reach for a vector database reflexively. If brute force answers in under 50 milliseconds for your corpus, the operational simplicity of an in-memory array beats another piece of infrastructure.

Ranking: Where Quality Is Won or Lost

Retrieval gets you candidates. Ranking decides what the user actually sees, and naive cosine similarity alone rarely produces the best ordering. Three techniques meaningfully improve results.

Hybrid search

Pure semantic search is weak at exact matches — product SKUs, error codes, proper names, and rare jargon. Keyword search (BM25) excels at exactly those. Hybrid search runs both and fuses the results. A robust, parameter-light fusion method is Reciprocal Rank Fusion (RRF), which combines rankings rather than raw scores:

def rrf(rankings: list[list[str]], k: int = 60) -> dict[str, float]:
    scores: dict[str, float] = {}
    for ranking in rankings:
        # RRF ranks are 1-based: the top result contributes 1 / (k + 1).
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return scores

Because RRF uses ranks instead of scores, you sidestep the headache of normalizing incompatible score scales between BM25 and cosine similarity. It's a strong default that's hard to beat without significant tuning.
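
A small worked example may help show how the fusion behaves. This sketch assumes ranks are counted from 1 and returns a sorted list rather than a dict; the document ids and both input rankings are invented for illustration.

```python
def rrf_fuse(rankings: list[list[str]], k: int = 60) -> list[tuple[str, float]]:
    """Sum 1 / (k + rank) per document across rankings, then sort by fused score."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical top-3 lists from a keyword engine and a vector index.
bm25_ranking = ["sku-123", "faq-reset", "blog-post"]
semantic_ranking = ["faq-reset", "account-recovery", "sku-123"]

fused = rrf_fuse([bm25_ranking, semantic_ranking])
for doc_id, score in fused:
    print(f"{doc_id:<17} {score:.5f}")
```

Documents that appear in both lists rise to the top: "faq-reset" (ranks 2 and 1) edges out "sku-123" (ranks 1 and 3), and both beat the documents found by only one retriever.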

Reranking with a cross-encoder

The embedding models used for retrieval are bi-encoders: they embed query and document separately, which is fast but loses the interaction between them. A cross-encoder reads the query and a candidate document together and scores their relevance directly. Cross-encoders are far more accurate but far too slow to run over a whole corpus.

The standard pattern is a two-stage funnel: retrieve 50–100 candidates with the fast bi-encoder, then rerank just those with the cross-encoder. You get most of the accuracy of an expensive model at a fraction of the cost.
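
The funnel can be sketched independently of any particular model by passing the two stages in as functions. Everything below the function definition is a stand-in: `toy_retrieve` and `toy_cross_score` are hypothetical placeholders (the scorer is plain word overlap, not a cross-encoder) so the example runs without a model.

```python
from typing import Callable


def retrieve_then_rerank(
    query: str,
    retrieve: Callable[[str, int], list[str]],
    cross_score: Callable[[str, str], float],
    n_candidates: int = 100,
    k: int = 10,
) -> list[tuple[str, float]]:
    """Stage 1: cheap retrieval of n_candidates. Stage 2: expensive scoring of only those."""
    candidates = retrieve(query, n_candidates)
    scored = [(doc, cross_score(query, doc)) for doc in candidates]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


corpus = [
    "recovering account access",
    "reset your router",
    "how to reset a forgotten password",
    "billing and invoices",
]


def toy_retrieve(query: str, n: int) -> list[str]:
    # Placeholder for bi-encoder retrieval: just returns the first n documents.
    return corpus[:n]


def toy_cross_score(query: str, doc: str) -> float:
    # Placeholder for a cross-encoder: Jaccard overlap of lowercase words.
    q, d = set(query.lower().split()), set(doc.lower().split())
    return len(q & d) / len(q | d)


top = retrieve_then_rerank("how to reset my password", toy_retrieve, toy_cross_score, k=2)
print(top)
```

Keeping the stages as plain callables makes it easy to swap in a real retriever or reranker later and to evaluate each stage separately.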

Business signals

Relevance is not purely semantic. Recency, popularity, document authority, and personalization often matter. Blend these as a weighted combination on top of the relevance score, but introduce them deliberately — every signal you add is a knob that can quietly degrade quality if mistuned.
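
One way to express such a blend is sketched below. The weights, the 30-day half-life, and the assumption that relevance and popularity are already scaled to the 0–1 range are all illustrative choices, not recommendations; they are exactly the knobs that need tuning against your evaluation set.

```python
def recency_boost(age_days: float, half_life_days: float = 30.0) -> float:
    """Exponential decay: 1.0 for a brand-new document, 0.5 after one half-life."""
    return 0.5 ** (max(age_days, 0.0) / half_life_days)


def blended_score(
    relevance: float,      # assumed to be in [0, 1]
    age_days: float,
    popularity: float,     # assumed to be in [0, 1]
    w_relevance: float = 0.7,
    w_recency: float = 0.2,
    w_popularity: float = 0.1,
) -> float:
    """Weighted sum of the relevance score and two business signals."""
    return (
        w_relevance * relevance
        + w_recency * recency_boost(age_days)
        + w_popularity * popularity
    )


fresh = blended_score(relevance=0.80, age_days=1, popularity=0.2)
stale = blended_score(relevance=0.82, age_days=365, popularity=0.2)
print(f"fresh={fresh:.3f} stale={stale:.3f}")
```

With these weights a day-old document outranks a year-old one that is slightly more relevant. Whether that is the right call depends on your content, which is why each weight deserves its own evaluation run.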

Evaluating Your Search

You cannot improve what you don't measure. Build a small labeled set of queries paired with their ideal results, then track metrics that reflect ranking quality, not just whether the right answer appears somewhere:

  • Recall@k: what fraction of the relevant documents made it into the top k candidates? This measures retrieval.
  • MRR (Mean Reciprocal Rank): how high did the first relevant result land, averaged over queries? Good for known-item search.
  • nDCG: rewards placing highly relevant results near the top, with diminishing credit further down. Usually the most informative single measure of ranking quality when your labels are graded.
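
The three metrics above are short enough to implement directly. This sketch uses the linear-gain form of DCG (gain divided by log2 of position + 1); other variants exist, so check which one your tooling uses before comparing numbers. The document ids and labels are made up.

```python
import math


def recall_at_k(ranked: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the relevant documents that appear in the top k results."""
    if not relevant:
        return 0.0
    return len(set(ranked[:k]) & relevant) / len(relevant)


def reciprocal_rank(ranked: list[str], relevant: set[str]) -> float:
    """1 / position of the first relevant result, or 0.0 if none was returned."""
    for position, doc_id in enumerate(ranked, start=1):
        if doc_id in relevant:
            return 1.0 / position
    return 0.0


def ndcg_at_k(ranked: list[str], gains: dict[str, float], k: int) -> float:
    """DCG of the actual ordering divided by DCG of the ideal ordering."""
    def dcg(values: list[float]) -> float:
        return sum(g / math.log2(i + 1) for i, g in enumerate(values, start=1))

    actual = dcg([gains.get(doc_id, 0.0) for doc_id in ranked[:k]])
    ideal = dcg(sorted(gains.values(), reverse=True)[:k])
    return actual / ideal if ideal > 0 else 0.0


# One labeled query: d1 is highly relevant (gain 3), d2 somewhat relevant (gain 1).
gains = {"d1": 3.0, "d2": 1.0}
ranked = ["d2", "d1", "d3"]

print(recall_at_k(ranked, set(gains), k=2))
print(reciprocal_rank(ranked, {"d1"}))
print(round(ndcg_at_k(ranked, gains, k=3), 3))
```

MRR is the mean of `reciprocal_rank` over all labeled queries; the other two are usually averaged across queries the same way.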

Even 50 hand-labeled queries are often enough to catch large regressions. Re-run the evaluation every time you change the model, chunking strategy, or ranking weights, and treat a metric drop as a blocker.

Common Pitfalls

  • Embedding query and documents with different models. Subtle, silent, and ruinous to relevance.
  • Forgetting to re-index after a model upgrade. New query vectors and old document vectors live in different spaces, so similarity scores between them are meaningless.
  • Ignoring exact-match cases. Without hybrid search, codes and names fall through the cracks.
  • Mis-sized chunks. Tiny fragments lose context; giant chunks dilute it. Tune chunk size to your content.
  • Optimizing latency before relevance. A fast search that returns the wrong answer is worthless. Get quality right, then make it fast.
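
The first two pitfalls are silent by nature, so one option is to make them loud. The sketch below stores the embedding model's name and dimensionality alongside the index and refuses to search on a mismatch. The `IndexMetadata` class and the model names are hypothetical; adapt the idea to however your index is persisted.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IndexMetadata:
    """Recorded once at indexing time and saved next to the vectors."""
    model_name: str
    dimensions: int


def check_compatible(index_meta: IndexMetadata, query_model: str, query_dims: int) -> None:
    """Raise instead of silently comparing vectors from different embedding spaces."""
    if index_meta.model_name != query_model:
        raise ValueError(
            f"index was built with {index_meta.model_name!r}, "
            f"but queries use {query_model!r}; re-index the corpus"
        )
    if index_meta.dimensions != query_dims:
        raise ValueError(
            f"index holds {index_meta.dimensions}-dim vectors, "
            f"but the query vector has {query_dims} dimensions"
        )


meta = IndexMetadata(model_name="example-embedder-v1", dimensions=384)
check_compatible(meta, "example-embedder-v1", 384)  # passes silently

try:
    check_compatible(meta, "example-embedder-v2", 384)
except ValueError as err:
    print(f"blocked: {err}")
```

A matching dimension count does not prove two models share a space, which is why the check compares model names first.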

FAQ

How is semantic search different from keyword search? Keyword search matches literal terms; semantic search matches meaning via vector similarity. The two are complementary, which is why hybrid approaches usually outperform either alone.

Do I need a vector database? Not necessarily. Below a few hundred thousand vectors, a normalized NumPy array with brute-force dot products is fast, exact, and simple. Adopt a vector database when scale, persistence, or metadata filtering demand it.

What's the difference between a bi-encoder and a cross-encoder? A bi-encoder embeds query and document independently — fast, used for retrieval. A cross-encoder evaluates them jointly — accurate, used to rerank a small candidate set.

How many candidates should I retrieve before reranking? Start with 50–100. Enough to give the reranker good material, small enough to stay fast. Tune using Recall@k: if relevant documents fall outside your retrieval window, the reranker never gets a chance to surface them.

How do I handle documents longer than the model's token limit? Split them into overlapping chunks, embed each chunk, and map results back to parent documents at ranking time. Overlap of 10–15% keeps ideas that straddle chunk boundaries intact.

How often should I re-embed my corpus? Re-embed individual documents whenever their content changes. Switching embedding models is different: it requires re-indexing the entire corpus, since old and new vectors are not comparable.

Wrapping Up

Semantic search rewards a layered approach: solid embeddings for recall, hybrid fusion to cover exact-match gaps, and a cross-encoder rerank to sharpen the top results. Start with the simplest version that works — brute-force retrieval and cosine similarity — measure it honestly with nDCG and Recall@k, and add complexity only where the numbers justify it. The teams that win at search aren't the ones with the fanciest models; they're the ones who measure relentlessly and tune deliberately.
