MervCodes

Tech Reviews From A Programmer

AI Embeddings: Practical Applications for Developers


Embeddings are one of those concepts that sound abstract until you build something with them — and then they feel almost like cheating. At their core, embeddings turn text, images, or other data into lists of numbers (vectors) that capture meaning. Two pieces of content that mean similar things end up close together in vector space, even if they share no literal words. That single property unlocks search, recommendations, classification, deduplication, and a surprising amount more.

This post is a hands-on guide for developers who want to move past the theory and start shipping. We'll cover what embeddings actually do, the most useful application patterns, and the practical pitfalls that trip people up.

What an Embedding Actually Is

An embedding model takes input (say, the sentence "How do I reset my password?") and returns a fixed-length array of floating-point numbers, perhaps 768, 1024, or 1536 dimensions long. No single dimension maps to a human-readable concept, but collectively the vector encodes semantic content.

The key operation is comparing two vectors. The standard measure is cosine similarity: the cosine of the angle between two vectors, ranging from -1 (opposite) to 1 (identical direction). In practice you compute it like this:

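import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between a and b (undefined if either vector is all zeros)."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))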

Many embedding APIs return pre-normalized vectors (unit length), in which case a plain dot product is the cosine similarity and is faster. Check your provider's docs.
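To make the shortcut concrete, here is a small sketch showing that the dot product of unit-length vectors matches the full cosine formula. The `normalize` helper and the two example vectors are made up for illustration; they are not from any provider's API.

```python
import numpy as np

def normalize(v):
    """Scale a vector to unit length (L2 norm of 1)."""
    v = np.asarray(v, dtype=float)
    return v / np.linalg.norm(v)

def cosine_similarity(a, b):
    """Full cosine similarity, valid for vectors of any length."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Hand-picked example vectors (illustrative only).
a = [3.0, 4.0, 0.0]
b = [1.0, 2.0, 2.0]

full = cosine_similarity(a, b)               # divides by both norms
fast = np.dot(normalize(a), normalize(b))    # plain dot product on unit vectors

print(full, fast)
```

If your vectors are normalized once at index time, every later comparison can skip the two norm computations.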

The mental model: once everything is a vector, "find related things" becomes "find nearby vectors." That's the whole trick.

Pattern 1: Semantic Search

Keyword search fails when users phrase things differently than your documents. A user types "my card got declined" but your help article says "payment authorization failure." No keyword overlap, no result.

Semantic search fixes this. The workflow:

  1. Index time: Split your documents into chunks, embed each chunk, store the vectors.
  2. Query time: Embed the user's query, find the nearest stored vectors, return the matching chunks.

# Pseudocode for the core loop
query_vec = embed(user_query)
results = vector_store.search(query_vec, top_k=5)

This is the single most common production use of embeddings, and it's the foundation of Retrieval-Augmented Generation (RAG), where retrieved chunks get fed into an LLM as context so it can ground its answers in your private data. That reduces hallucination, though it doesn't eliminate it.
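For a feel of what the search step does under the hood, here is a minimal brute-force sketch in NumPy. The `top_k` function is a hypothetical stand-in for a vector store's search call, and the three-dimensional vectors are invented by hand; real embeddings come from a model and have hundreds of dimensions.

```python
import numpy as np

def top_k(query_vec, doc_vecs, k=5):
    """Return (index, score) pairs for the k stored vectors most similar to the query."""
    q = np.asarray(query_vec, dtype=float)
    d = np.asarray(doc_vecs, dtype=float)
    q = q / np.linalg.norm(q)                         # unit-length query
    d = d / np.linalg.norm(d, axis=1, keepdims=True)  # unit-length rows
    scores = d @ q                                    # cosine similarity per row
    order = np.argsort(-scores)[:k]                   # highest score first
    return [(int(i), float(scores[i])) for i in order]

chunks = [
    "payment authorization failure",
    "reset your password",
    "update billing address",
]
# Toy vectors made up for illustration, NOT output from a real model.
doc_vecs = [[0.9, 0.1, 0.2], [0.1, 0.9, 0.0], [0.6, 0.0, 0.8]]
query_vec = [0.8, 0.2, 0.1]  # pretend embedding of "my card got declined"

results = top_k(query_vec, doc_vecs, k=2)
for i, score in results:
    print(f"{score:.3f}  {chunks[i]}")
```

A full sort is fine at this scale. A dedicated vector store replaces this loop once the collection grows.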

A practical tip: chunk size matters enormously. Too large, and a chunk dilutes its meaning across many topics, hurting retrieval precision. Too small, and you lose context. Start around 200–500 tokens per chunk with some overlap (say 50 tokens) between adjacent chunks, then tune against real queries.
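Overlapping chunks are easy to get subtly wrong, so here is one way to sketch them. `chunk_tokens` is a hypothetical helper, and splitting on whitespace is only a rough stand-in for your embedding model's real tokenizer.

```python
def chunk_tokens(tokens, size=400, overlap=50):
    """Split a token list into windows of `size` that share `overlap` tokens."""
    if size <= 0:
        raise ValueError("size must be positive")
    if not 0 <= overlap < size:
        raise ValueError("overlap must be in [0, size)")
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # this window already reached the end of the document
    return chunks

# Whitespace splitting is an assumption; swap in the model's tokenizer for real counts.
words = "a b c d e f g h i j".split()
small = chunk_tokens(words, size=4, overlap=1)
print(small)
```

The early `break` avoids emitting a final chunk that only repeats the tail of the previous one.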

Pattern 2: Classification and Routing

You don't always need a fine-tuned model to classify text. With embeddings you can build a lightweight classifier:

  • Embed a handful of labeled examples per category.
  • Average them to get a "centroid" vector per category (or keep them all for k-nearest-neighbors).
  • Classify a new item by finding the closest category.

This is great for intent routing in support systems ("billing" vs. "technical" vs. "sales"), content moderation triage, or tagging. It's cheap, requires few examples, and you can add a new category just by adding examples — no retraining.
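The centroid approach can be sketched in a few lines. The function names (`build_centroids`, `classify`) and the tiny hand-written vectors are assumptions for illustration; in practice the vectors would come from your embedding model.

```python
import numpy as np

def unit(v):
    """Normalize a vector, or each row of a matrix, to unit length."""
    v = np.asarray(v, dtype=float)
    return v / np.linalg.norm(v, axis=-1, keepdims=True)

def build_centroids(examples):
    """examples maps label -> list of example vectors; returns label -> unit centroid."""
    return {label: unit(unit(vecs).mean(axis=0)) for label, vecs in examples.items()}

def classify(vec, centroids):
    """Return the label whose centroid has the highest cosine similarity, plus all scores."""
    v = unit(vec)
    scores = {label: float(c @ v) for label, c in centroids.items()}
    return max(scores, key=scores.get), scores

# Toy 3-dimensional "embeddings", invented for this example.
examples = {
    "billing":   [[0.9, 0.1, 0.0], [0.8, 0.2, 0.1]],
    "technical": [[0.1, 0.9, 0.1], [0.0, 0.8, 0.3]],
    "sales":     [[0.1, 0.1, 0.9], [0.2, 0.0, 0.8]],
}
centroids = build_centroids(examples)
label, scores = classify([0.7, 0.3, 0.1], centroids)
print(label, scores)
```

Adding a category is just another key in `examples` followed by a call to `build_centroids`.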

Pattern 3: Recommendations and "More Like This"

If you embed products, articles, or songs based on their descriptions or metadata, "recommend similar items" becomes a nearest-neighbor lookup. A user reading an article? Embed it, find the closest other articles, surface them. This sidesteps the item cold-start problem of collaborative filtering, because a brand-new item with no interaction history can still be recommended from its content alone, and you can blend it with behavioral signals later.
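A "more like this" lookup differs from plain search in one detail: the item must not recommend itself. Here is a minimal sketch; `more_like_this`, the item ids, and the vectors are all hypothetical.

```python
import numpy as np

def more_like_this(item_id, ids, vectors, k=3):
    """Return the k items most similar to item_id, excluding the item itself."""
    vecs = np.asarray(vectors, dtype=float)
    vecs = vecs / np.linalg.norm(vecs, axis=1, keepdims=True)
    idx = ids.index(item_id)
    scores = vecs @ vecs[idx]      # cosine similarity to every item
    scores[idx] = -np.inf          # never recommend the item being viewed
    order = np.argsort(-scores)[:k]
    return [(ids[i], float(scores[i])) for i in order]

# Invented ids and toy vectors, not real embeddings.
ids = ["python-intro", "python-async", "sourdough", "rust-intro"]
vectors = [
    [0.9, 0.1, 0.1],
    [0.8, 0.3, 0.1],
    [0.0, 0.1, 0.9],
    [0.6, 0.6, 0.1],
]
recs = more_like_this("python-intro", ids, vectors, k=2)
print(recs)
```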

Pattern 4: Deduplication and Clustering

Embeddings shine at finding near-duplicates that aren't exact matches — two support tickets describing the same bug in different words, or scraped listings for the same product. Compute pairwise similarity and threshold it. For larger collections, run a clustering algorithm (HDBSCAN works well because you don't have to pre-specify the number of clusters) over the embedding vectors to discover natural groupings in your data. This is a fast way to understand a large unlabeled corpus.
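The pairwise-threshold approach might look like the sketch below, which also merges chains of near-duplicates into groups with a small union-find. The function name, the toy vectors, and the 0.95 threshold are illustrative assumptions; calibrate the threshold on your own data.

```python
import numpy as np

def near_duplicate_groups(vectors, threshold=0.95):
    """Group indices whose pairwise cosine similarity is at or above threshold."""
    vecs = np.asarray(vectors, dtype=float)
    vecs = vecs / np.linalg.norm(vecs, axis=1, keepdims=True)
    sims = vecs @ vecs.T               # full pairwise similarity matrix, O(n^2)
    n = len(vecs)
    parent = list(range(n))            # union-find forest

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            if sims[i, j] >= threshold:
                parent[find(j)] = find(i)  # merge the two groups

    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return [g for g in groups.values() if len(g) > 1]

# Toy vectors invented for the example.
tickets = [[1, 0, 0], [0.99, 0.05, 0], [0, 1, 0], [0, 0.98, 0.1], [0, 0, 1]]
dupes = near_duplicate_groups(tickets, threshold=0.95)
print(dupes)
```

The quadratic similarity matrix is only practical for modest collections, which is where clustering or an ANN index takes over.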

Choosing and Storing Embeddings

Picking a model. Consider three axes: quality (how well it captures meaning for your domain), dimensionality (higher isn't always better; it costs more storage and compute), and cost/latency. Hosted API models are easy to start with; open-source models you self-host give you control and avoid per-call fees. Many modern models support dimension truncation (Matryoshka embeddings), letting you trade a little accuracy for big storage savings.
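As a rough sketch of the truncation trade-off: keep the first N dimensions, re-normalize, and compare storage. This only preserves quality for models trained to support truncation, so check your model's documentation first. The helper names, the random stand-in vector, and the float32 storage assumption are mine.

```python
import numpy as np

def truncate_embedding(vec, dims):
    """Keep the first `dims` values and re-normalize to unit length."""
    v = np.asarray(vec, dtype=float)[:dims]
    return v / np.linalg.norm(v)

def storage_bytes(num_vectors, dims, bytes_per_value=4):
    """Raw vector storage, assuming float32 values and no index overhead."""
    return num_vectors * dims * bytes_per_value

# Random stand-in for a 1536-dimensional embedding (not real model output).
rng = np.random.default_rng(0)
full = rng.normal(size=1536)
short = truncate_embedding(full, 256)

full_size = storage_bytes(1_000_000, 1536)
short_size = storage_bytes(1_000_000, 256)
print(short.shape, full_size / 1e9, short_size / 1e9)  # sizes in GB
```

Re-normalizing matters because a truncated vector is no longer unit length, which would break any dot-product shortcut.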

A critical rule: you must use the same model to embed your stored data and your queries. Vectors from different models live in incompatible spaces and comparing them is meaningless. If you switch models, you must re-embed everything.

Storing vectors. For small datasets (a few thousand vectors), a NumPy array and brute-force search are genuinely fine; don't over-engineer. As you scale, reach for a vector database (Pinecone, Weaviate, Qdrant, Milvus) or a vector extension on a database you already run (pgvector for PostgreSQL is a popular, low-friction choice). These use approximate nearest neighbor (ANN) indexes like HNSW to search millions of vectors in milliseconds, trading a tiny bit of recall for huge speed gains.

Practical Tips That Save Pain

  • Cache aggressively. Embedding the same text repeatedly wastes money and time. Hash the input and cache the vector.
  • Batch your requests. Most APIs let you embed many texts per call, drastically cutting overhead.
  • Store metadata alongside vectors. You'll almost always want to filter by attributes (date, author, category) as well as search by similarity. Metadata filtering is a first-class feature in good vector stores.
  • Combine with keyword search. Pure semantic search can miss exact matches (product codes, names, acronyms). Hybrid search, which blends embedding similarity with traditional BM25 keyword scoring, often outperforms either alone.
  • Evaluate with real queries. Build a small test set of query/expected-result pairs and measure recall. "It feels better" is not a metric.
  • Watch for stale indexes. When source data changes, the embeddings must be regenerated. Build re-indexing into your pipeline from day one.
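The caching and batching tips combine naturally. Below is a minimal in-memory sketch: hash each input, look it up, and send only the misses to the embedding call in a single batch. `EmbeddingCache` and `fake_embed_batch` are hypothetical names; the fake embedder exists only so the example runs without an API.

```python
import hashlib

class EmbeddingCache:
    """In-memory cache keyed by a SHA-256 hash of the input text."""

    def __init__(self, embed_batch):
        self._embed_batch = embed_batch  # callable: list[str] -> list of vectors
        self._cache = {}

    @staticmethod
    def _key(text):
        return hashlib.sha256(text.encode("utf-8")).hexdigest()

    def get_many(self, texts):
        keys = [self._key(t) for t in texts]
        missing = {}  # key -> text, deduplicated, insertion-ordered
        for key, text in zip(keys, texts):
            if key not in self._cache and key not in missing:
                missing[key] = text
        if missing:
            # One batched call for all uncached texts.
            vectors = self._embed_batch(list(missing.values()))
            for key, vec in zip(missing, vectors):
                self._cache[key] = vec
        return [self._cache[k] for k in keys]

calls = []

def fake_embed_batch(texts):
    """Stand-in for a real embedding API; records each call for inspection."""
    calls.append(list(texts))
    return [[float(len(t))] for t in texts]

cache = EmbeddingCache(fake_embed_batch)
first = cache.get_many(["hello", "world!", "hello"])
second = cache.get_many(["hello", "new text"])
print(calls)
```

In a real system you would likely include the model name in the cache key, since vectors from different models are not interchangeable, and persist the cache somewhere durable.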

A Minimal End-to-End Example

# Placeholders you supply: load_documents, chunk, embed, embed_batch, store, llm.

# 1. Index
docs = load_documents()
chunks = [c for d in docs for c in chunk(d, size=400, overlap=50)]
vectors = embed_batch([c.text for c in chunks])   # one API call per batch
store.upsert(chunks, vectors)

# 2. Query
def search(query, k=5):
    qv = embed(query)   # must use the same model as at index time
    return store.search(qv, top_k=k)

# 3. (Optional) RAG
def answer(query):
    context = "\n\n".join(c.text for c in search(query))
    return llm(f"Answer using this context:\n{context}\n\nQ: {query}")

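Once the helper functions (load_documents, chunk, embed, embed_batch, store, llm) are implemented, that's a working semantic search and RAG system in about a dozen lines of logic. The hard parts are tuning chunking, evaluation, and scaling the store, not the embeddings themselves.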
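Evaluation can start very small. Here is a sketch of recall@k over a handful of query/expected-result pairs. `recall_at_k` is a hypothetical helper, and the canned results stand in for a real search function so the example runs on its own.

```python
def recall_at_k(test_cases, search_fn, k=5):
    """Average fraction of expected ids found in the top-k results per query."""
    if not test_cases:
        raise ValueError("need at least one test case")
    total = 0.0
    for query, expected in test_cases:
        expected = set(expected)
        retrieved = set(search_fn(query, k))
        total += len(expected & retrieved) / len(expected)
    return total / len(test_cases)

# Invented test set: each query maps to the chunk ids a good search should return.
test_cases = [
    ("card declined", {"pay-1"}),
    ("reset password", {"auth-1", "auth-2"}),
]

# Canned results standing in for a real search function.
canned = {
    "card declined": ["pay-1", "bill-3"],
    "reset password": ["auth-1", "faq-9"],
}

def fake_search(query, k):
    return canned[query][:k]

score = recall_at_k(test_cases, fake_search, k=2)
print(score)
```

Re-run the same test set after every chunking or model change so you can compare numbers rather than impressions.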

Frequently Asked Questions

Do I need a GPU to use embeddings? Not necessarily. If you use a hosted embedding API, all the heavy computation happens on the provider's side. Self-hosting an open-source model benefits from a GPU for throughput, but small models run acceptably on CPU for low-volume workloads.

How is an embedding different from what an LLM does? An embedding model is typically smaller and specialized for one job: turning input into a meaning vector. An LLM generates text. They're complementary — embeddings retrieve the right context, and the LLM uses that context to answer (that's RAG).

What's a good similarity threshold for "these are the same"? There's no universal number — it depends on the model and domain. Empirically calibrate it: gather known-similar and known-different pairs, plot their similarity scores, and pick a threshold that separates them well. Don't hardcode a magic value from a blog post (including this one).
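One simple way to do that calibration, sketched under the assumption that you have similarity scores for labeled pairs: try each observed score as a cutoff and keep the one that classifies the most pairs correctly. The function name and the scores below are made up for illustration.

```python
import numpy as np

def pick_threshold(similar_scores, different_scores):
    """Return (threshold, accuracy) for the cutoff that best separates the two sets."""
    similar = np.asarray(similar_scores, dtype=float)
    different = np.asarray(different_scores, dtype=float)
    candidates = np.unique(np.concatenate([similar, different]))
    best_t, best_acc = None, -1.0
    for t in candidates:
        # A pair is called "same" when its score is at or above the cutoff.
        correct = np.sum(similar >= t) + np.sum(different < t)
        acc = correct / (len(similar) + len(different))
        if acc > best_acc:  # ties keep the lower threshold
            best_t, best_acc = float(t), float(acc)
    return best_t, best_acc

# Invented scores for known-similar and known-different pairs.
similar_scores = [0.91, 0.88, 0.95, 0.79]
different_scores = [0.42, 0.61, 0.83, 0.55]

threshold, accuracy = pick_threshold(similar_scores, different_scores)
print(threshold, accuracy)
```

Accuracy treats both error types equally. If false merges cost more than missed duplicates, optimize precision at a fixed recall instead.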

How many dimensions do I need? Use what your chosen model produces, unless it supports truncation. More dimensions can capture more nuance but cost more to store and search. For most applications, 384–1024 dimensions are plenty; benchmark before assuming bigger is better.

Can I embed images, audio, or code — not just text? Yes. There are specialized embedding models for images, audio, and source code, plus multimodal models that place text and images in a shared space — letting you search images with text queries. The application patterns above all transfer.

Why are my search results irrelevant? The usual suspects: mismatched models between index and query, chunks that are too large, missing hybrid keyword search for exact terms, or a stale index. Start by inspecting the actual retrieved chunks for a few failing queries — the problem is usually visible immediately.

Wrapping Up

Embeddings turn the fuzzy problem of "meaning" into the concrete problem of "distance between vectors," and that reframing is what makes them so broadly useful. Start small: a single semantic search feature over your own documents will teach you more than any amount of reading. Get chunking and evaluation right, lean on hybrid search, and reach for a vector database only when scale demands it. From there, the same toolkit extends naturally into recommendations, classification, deduplication, and full RAG pipelines.
