Vector Databases & Embeddings: A Practical Guide for RAG, Search, and AI Apps
Learn what embeddings are, how vector databases work, how to design chunking + indexing, and how to evaluate retrieval quality in production.
Vector databases and embeddings power modern search and RAG. But most teams hit the same wall: “We stored vectors… why does retrieval still feel random?” The fix is not a bigger model. It’s retrieval engineering: embedding choice, chunking strategy, indexing, filters, hybrid search, evaluation, and operational hygiene.
Quick answer (Gemini-style summary)
- Embeddings turn meaning into numbers so you can do semantic similarity search.
- Vector DBs store embeddings + metadata and run fast approximate nearest-neighbor queries.
- Quality comes from chunking + filtering + hybrid retrieval + evaluation, not “just add vectors.”
- Default stack: start with Postgres + pgvector if your scale is modest; use a dedicated vector DB when scale/QPS/filtering demands it.
1) What is an embedding?
An embedding is a vector (a list of numbers) that represents meaning. Two texts with similar meaning end up with vectors that are “close” under a distance metric (cosine, dot product, or Euclidean).
Concrete example
- Query: “How do I reset my password?”
- Doc chunk: “To change your password, go to Settings → Security…”
Even though the key wording differs (“reset” vs. “change”), embeddings can still retrieve the right chunk because the two texts share semantic meaning.
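A tiny sketch of what “close” means in practice, using cosine similarity. The vectors below are made up for illustration; in a real system they would come from an embedding model.

```python
# A toy illustration of vector "closeness". The vectors are made up; in a real
# system they would come from an embedding model or API.
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """1.0 = same direction (very similar), ~0.0 = unrelated."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

query_vec = np.array([0.12, 0.80, 0.05, 0.33])  # "How do I reset my password?"
chunk_vec = np.array([0.10, 0.75, 0.07, 0.40])  # "To change your password, go to Settings..."
other_vec = np.array([0.90, 0.02, 0.60, 0.01])  # "Q3 revenue grew 4% year over year."

print(cosine_similarity(query_vec, chunk_vec))  # high -> semantically close
print(cosine_similarity(query_vec, other_vec))  # low  -> unrelated
```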
2) What does a vector database actually do?
A vector database typically provides:
- Storage: vectors + metadata + IDs
- Indexing: structures for fast approximate nearest-neighbor (ANN) search
- Filtering: metadata filters (tenant_id, doc_type, permissions)
- Upserts & deletes: keep vectors aligned with source-of-truth documents
3) Distance metrics (what “similar” means)
| Metric | Use when | Notes |
|---|---|---|
| Cosine | General semantic similarity; normalized embeddings | Often a safe default |
| Dot product | Normalized vectors or models trained for dot product | Similar to cosine if vectors are normalized |
| Euclidean (L2) | Some ANN indexes; certain model families | Works well when validated on your eval set |
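A quick numeric check of the note above: once vectors are L2-normalized, dot product and cosine similarity give the same value, and therefore the same ranking.

```python
# After L2 normalization, dot product and cosine similarity are identical.
import numpy as np

rng = np.random.default_rng(0)
a, b = rng.normal(size=384), rng.normal(size=384)

cos = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
a_n, b_n = a / np.linalg.norm(a), b / np.linalg.norm(b)
dot = np.dot(a_n, b_n)

print(abs(cos - dot) < 1e-9)  # True: same score once vectors are normalized
```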
3.1) Choosing an embedding model (domain + constraints)
- Domain fit: FAQ/helpdesk vs legal vs code vs product docs.
- Multilingual: if you serve multiple languages, choose a multilingual model or run language detection + per-language indexes.
- Dimension & cost: higher dimensions increase storage/IO; test if quality gains justify it.
- Latency: on-demand embedding (queries) should be low-latency; batch embedding (docs) can be slower.
4) Indexing: why ANN exists
Exact nearest-neighbor search over millions of vectors is slow. ANN indexes trade tiny accuracy for big speed.
| Index type | Strength | Tradeoff |
|---|---|---|
| HNSW | Great recall/latency; widely used | More memory; tuning matters |
| IVF / PQ | Good for very large corpora | More complex; can reduce recall if mis-tuned |
| Flat | Exact results; simplest | Gets slow at scale |
Index parameter tuning (starter)
- HNSW: tune M (graph connectivity) and efConstruction; query-time ef controls recall/latency.
- IVF: choose nlist (clusters) and nprobe (clusters searched). Larger nprobe improves recall but increases latency.
- PQ: product quantization compresses vectors; validate recall loss on your eval set before enabling.
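To make the knobs concrete, here is a sketch using FAISS as one example ANN library; the parameter values are illustrative starting points, not tuned recommendations.

```python
# Where the parameters above plug in, sketched with FAISS as one example ANN
# library. Values are illustrative starting points, not recommendations.
import faiss
import numpy as np

d = 384                                                    # embedding dimension
xb = np.random.random((100_000, d)).astype("float32")      # corpus vectors
xq = np.random.random((5, d)).astype("float32")            # query vectors

# HNSW: M = graph connectivity, efConstruction = build-time effort,
# efSearch = query-time recall/latency knob.
hnsw = faiss.IndexHNSWFlat(d, 32)        # M = 32
hnsw.hnsw.efConstruction = 200
hnsw.add(xb)
hnsw.hnsw.efSearch = 64
D, I = hnsw.search(xq, 10)

# IVF: nlist = number of clusters, nprobe = clusters searched per query.
quantizer = faiss.IndexFlatL2(d)
ivf = faiss.IndexIVFFlat(quantizer, d, 1024)   # nlist = 1024
ivf.train(xb)                                  # IVF needs a training pass
ivf.add(xb)
ivf.nprobe = 16                                # raise for recall, lower for latency
D, I = ivf.search(xq, 10)
```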
5) How-to: choose chunking that doesn’t sabotage retrieval
Chunking is where most RAG systems silently fail. You’re deciding what the retriever can “see” and what it can’t.
Step-by-step starter recipe
- Start with structure: split by headings/sections first, then by length.
- Keep chunks answerable: each chunk should stand alone (no dangling references).
- Add metadata: doc_id, url, section_title, updated_at, tenant_id, access labels.
- Measure recall@K with a small query set, then iterate.
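A minimal sketch of that recipe, assuming Markdown-ish source documents; the splitting rules and field names are illustrative and should be adapted to your corpus.

```python
# Split on headings first, then cap chunk length, and attach metadata to every
# chunk. Assumes Markdown-ish source text; field names mirror the list above.
import re

def chunk_document(text: str, doc_id: str, url: str, tenant_id: str,
                   max_chars: int = 1500) -> list[dict]:
    chunks = []
    # 1) Structure first: split before Markdown headings, keeping the heading line.
    sections = re.split(r"\n(?=#{1,6} )", text)
    for section in sections:
        lines = section.strip().splitlines()
        if not lines:
            continue
        title = lines[0].lstrip("# ").strip()
        body = "\n".join(lines)
        # 2) Then length: split long sections on paragraph boundaries.
        parts, current = [], ""
        for para in body.split("\n\n"):
            if len(current) + len(para) > max_chars and current:
                parts.append(current)
                current = para
            else:
                current = f"{current}\n\n{para}".strip()
        parts.append(current)
        # 3) Metadata on every chunk.
        for i, part in enumerate(parts):
            chunks.append({
                "id": f"{doc_id}:{title}:{i}",
                "text": part,
                "metadata": {"doc_id": doc_id, "url": url,
                             "section_title": title, "tenant_id": tenant_id},
            })
    return chunks
```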
Chunking pitfalls (common)
- Chunks too big: top match contains the answer but also a lot of noise (LLM misses it).
- Chunks too small: answer is split across chunks; retrieval returns fragments without context.
- No metadata filtering: model retrieves the “right” answer from the wrong tenant/version.
Chunk overlap and anchors
- Light overlap (e.g., 10–20% tokens) can preserve context across boundaries.
- Anchors: include headings, IDs, and section paths in metadata for better filtering and attribution.
- Normalization: strip boilerplate, unify punctuation/whitespace, and remove navigation chrome.
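Two small helpers for the points above; the overlap ratio and the boilerplate patterns are placeholder assumptions, not a complete cleaning pipeline.

```python
# Light token overlap across chunk boundaries plus basic normalization.
# The boilerplate patterns and 15% overlap are illustrative.
import re

def normalize(text: str) -> str:
    text = re.sub(r"(Cookie settings|Skip to main content)", "", text)  # example boilerplate
    return re.sub(r"\s+", " ", text).strip()                            # unify whitespace

def sliding_chunks(tokens: list[str], size: int = 300, overlap: float = 0.15) -> list[list[str]]:
    step = max(1, int(size * (1 - overlap)))   # ~15% of tokens repeat across chunks
    return [tokens[i:i + size] for i in range(0, len(tokens), step)]
```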
6) How-to: implement metadata filtering (multi-tenant + permissions)
For real products, metadata filters are not optional. They are the difference between “RAG” and “data leak.”
```json
{
  "query": "How do I rotate an API key?",
  "top_k": 10,
  "filter": {
    "tenant_id": "tenant_123",
    "doc_visibility": "public",
    "product": "enterprise"
  }
}
```
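The payload above is what a caller might send; the enforcement has to happen server-side. A sketch, where `vector_search` is a hypothetical stand-in for your vector DB client and the tenant/visibility values come from the authenticated session rather than the request:

```python
# Enforce tenant/ACL filters in the retriever layer. `vector_search` is a
# placeholder for your vector DB client; tenant_id and visibility come from the
# authenticated session, never from the request body.
def vector_search(query: str, top_k: int, filter: dict) -> list[dict]:
    """Placeholder for your vector DB client's query call."""
    raise NotImplementedError

def retrieve(query: str, session: dict, user_filter: dict | None = None) -> list[dict]:
    enforced = {
        "tenant_id": session["tenant_id"],              # from auth, not the caller
        "doc_visibility": session["allowed_visibility"],
    }
    # Caller-supplied filters can only narrow results, never widen them:
    # enforced keys are merged last, so they always win.
    final_filter = {**(user_filter or {}), **enforced}
    return vector_search(query=query, top_k=10, filter=final_filter)
```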
7) Hybrid search + reranking (the quality multiplier)
Vector search is strong at semantic recall. BM25 is strong at exact-match and proper nouns. Hybrid search combines both, then reranking helps you pick the best final context.
- Hybrid retrieval: retrieve candidates from BM25 + vectors, fuse rankings (RRF/weights).
- Reranking: score candidate chunks against the query with a stronger model (cross-encoder or LLM) and take the top N.
Fusion strategies
- RRF (Reciprocal Rank Fusion): simple and robust; good baseline.
- Weighted linear: weight lexical vs semantic scores based on eval results.
- Learned fusion: train a lightweight model to combine signals if you have labels.
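RRF is simple enough to show in full. The sketch below is a minimal version of the standard formula (score = sum of 1/(k + rank) across rankings, with k = 60 as the commonly used constant); the document IDs are toy data.

```python
# Reciprocal Rank Fusion (RRF): combine a lexical (BM25) ranking and a vector
# ranking into one candidate list. k = 60 is the commonly used smoothing constant.
from collections import defaultdict

def rrf_fuse(rankings: list[list[str]], k: int = 60) -> list[str]:
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:                       # each ranking: doc IDs, best first
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

bm25_hits   = ["doc_7", "doc_2", "doc_9"]          # lexical candidates
vector_hits = ["doc_2", "doc_5", "doc_7"]          # semantic candidates
print(rrf_fuse([bm25_hits, vector_hits]))          # doc_2 and doc_7 rise to the top
```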
Reranker choices
- Cross-encoder: high precision, pairwise scoring; best quality, moderate cost.
- LLM rerank: flexible and explainable, but higher latency/cost; cap candidates and cache aggressively.
- Heuristic rerank: quick filters (dedupe by doc, prefer recent) before expensive reranking.
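A sketch of the cheap heuristic pass, assuming candidates carry the `doc_id` metadata field used earlier in this guide and arrive sorted by retrieval score:

```python
# Cheap pre-filter before an expensive reranker: dedupe by source document and
# cap the candidate list. Candidates are assumed sorted best-first.
def heuristic_prefilter(candidates: list[dict], max_candidates: int = 20) -> list[dict]:
    seen_docs, deduped = set(), []
    for c in candidates:
        doc_id = c["metadata"]["doc_id"]
        if doc_id in seen_docs:
            continue               # keep only the best-ranked chunk per document
        seen_docs.add(doc_id)
        deduped.append(c)
    # A recency boost (e.g. on metadata["updated_at"]) can be layered on top.
    return deduped[:max_candidates]
```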
8) Evaluation: measure retrieval before you blame the generator
Most “hallucinations” in RAG are actually retrieval failures. Separate retrieval eval from generation eval.
- Recall@K: did the correct chunk appear in top K?
- MRR / NDCG: MRR captures how highly the first relevant chunk is ranked; NDCG captures overall ranking quality with graded relevance.
- Slice metrics: performance by doc type, tenant, language, and query type.
How-to: build a tiny evaluation set (fast)
- Collect 30–100 real queries.
- For each query, mark the relevant chunk(s) (IDs) as “gold.”
- Run retrieval and compute recall@5 and recall@10.
- Iterate chunking/embedding/hybrid/rerank until recall stabilizes.
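A minimal evaluation loop for that recipe; the gold labels and retrieved IDs below are toy data to show the computation.

```python
# Compute recall@5, recall@10, and MRR from gold chunk IDs and retrieved top-K IDs.
def evaluate(gold: dict[str, set[str]], retrieved: dict[str, list[str]]) -> dict[str, float]:
    recall5 = recall10 = rr_sum = 0.0
    for query, gold_ids in gold.items():
        top = retrieved.get(query, [])
        recall5  += float(any(doc in gold_ids for doc in top[:5]))
        recall10 += float(any(doc in gold_ids for doc in top[:10]))
        rank = next((i for i, doc in enumerate(top, start=1) if doc in gold_ids), None)
        rr_sum += 1.0 / rank if rank else 0.0
    n = len(gold)
    return {"recall@5": recall5 / n, "recall@10": recall10 / n, "mrr": rr_sum / n}

gold = {"how do I rotate an API key?": {"doc_4:security:2"}}
retrieved = {"how do I rotate an API key?": ["doc_9:intro:0", "doc_4:security:2", "doc_1:faq:3"]}
print(evaluate(gold, retrieved))   # {'recall@5': 1.0, 'recall@10': 1.0, 'mrr': 0.5}
```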
| Metric | Definition | Target (starter) |
|---|---|---|
| Recall@5 | Any gold chunk appears in top 5 | ≥ 0.7 for common queries |
| MRR | Mean reciprocal rank of first relevant | ≥ 0.5 baseline; improve with rerank |
| NDCG@10 | Ranking quality with graded relevance | ≥ 0.6 baseline; watch slices |
9) Operations: re-embedding, updates, and drift
Production vector systems change over time. Plan for:
- Document updates: re-embed changed chunks; delete removed chunks.
- Embedding model upgrades: staged re-embedding, dual indexes, and canary queries.
- Distribution drift: new product features and new vocabulary need fresh examples.
Production checklist
- Write path: idempotent upserts, delete on source removal, backfill jobs.
- Index health: recall canaries, ef/nprobe telemetry, error budgets.
- Filters: enforce tenant/ACL in the retriever layer, not just UI.
- Observability: log queries + topK IDs + final context; sample to review.
- Cost: cache embeddings and rerank results; cap candidates.
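A sketch of the write-path item, assuming hypothetical `store` and `embed` clients: stable chunk IDs make upserts idempotent, and a content hash skips re-embedding unchanged chunks.

```python
# Idempotent write path: stable chunk IDs plus a content hash decide whether
# re-embedding is needed. `store` and `embed` are hypothetical stand-ins for
# your vector DB client and embedding call.
import hashlib

def sync_chunk(store, embed, doc_id: str, chunk_index: int, text: str, metadata: dict) -> None:
    chunk_id = f"{doc_id}:{chunk_index}"                   # stable ID -> upserts are idempotent
    content_hash = hashlib.sha256(text.encode()).hexdigest()
    existing = store.get_metadata(chunk_id)                # hypothetical lookup
    if existing and existing.get("content_hash") == content_hash:
        return                                             # unchanged: skip re-embedding (saves cost)
    store.upsert(chunk_id, embed(text), {**metadata, "content_hash": content_hash})
```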
FAQ (direct answers)
Is pgvector “good enough”?
Often, yes—especially early. If you’re already on Postgres and your vector workload is modest, pgvector is a strong default. If you hit high QPS, huge corpora, or complex hybrid search needs, a dedicated vector DB can be worth it.
Do embeddings remove the need for keywords?
No. Proper nouns, IDs, error codes, and exact phrases are where BM25 shines. Hybrid search is usually the “grown-up” solution.
Bottom line
- Embeddings give you semantic similarity; vector databases give you fast retrieval at scale.
- Retrieval quality is an engineering problem: chunking, metadata filters, hybrid search, reranking, and evaluation.
- Start simple, measure recall@K, and iterate—don’t guess.