Vector Databases & Embeddings: A Practical Guide for RAG, Search, and AI Apps
Learn what embeddings are, how vector databases work, how to design chunking + indexing, and how to evaluate retrieval quality in production.
Vector databases and embeddings power modern search and RAG. But most teams hit the same wall: “We stored vectors… why does retrieval still feel random?” The fix is not a bigger model. It’s retrieval engineering: embedding choice, chunking strategy, indexing, filters, hybrid search, evaluation, and operational hygiene.
Quick answer (Gemini-style summary)
- Embeddings turn meaning into numbers so you can do semantic similarity search.
- Vector DBs store embeddings + metadata and run fast approximate nearest-neighbor queries.
- Quality comes from chunking + filtering + hybrid retrieval + evaluation, not “just add vectors.”
- Default stack: start with Postgres + pgvector if your scale is modest; use a dedicated vector DB when scale/QPS/filtering demands it.
1) What is an embedding?
An embedding is a vector (a list of numbers) that represents meaning. Two texts with similar meaning end up with vectors that are “close” under a distance metric (cosine, dot product, or Euclidean).
Concrete example
- Query: “How do I reset my password?”
- Doc chunk: “To change your password, go to Settings → Security…”
Even though the key wording differs (“reset” vs. “change”), embeddings can still retrieve the right chunk because the two texts share semantic meaning.
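A tiny sketch of what “close” means in practice, using cosine similarity. The vectors below are made up for illustration; in a real system they would come from an embedding model.

```python
# A toy illustration of vector "closeness". The vectors are made up; in a real
# system they would come from an embedding model or API.
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """1.0 = same direction (very similar), ~0.0 = unrelated."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

query_vec = np.array([0.12, 0.80, 0.05, 0.33])  # "How do I reset my password?"
chunk_vec = np.array([0.10, 0.75, 0.07, 0.40])  # "To change your password, go to Settings..."
other_vec = np.array([0.90, 0.02, 0.60, 0.01])  # "Q3 revenue grew 4% year over year."

print(cosine_similarity(query_vec, chunk_vec))  # high -> semantically close
print(cosine_similarity(query_vec, other_vec))  # low  -> unrelated
```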
2) What does a vector database actually do?
A vector database typically provides:
- Storage: vectors + metadata + IDs
- Indexing: structures for fast approximate nearest-neighbor (ANN) search
- Filtering: metadata filters (tenant_id, doc_type, permissions)
- Upserts & deletes: keep vectors aligned with source-of-truth documents
3) Distance metrics (what “similar” means)
| Metric | Use when | Notes |
|---|---|---|
| Cosine | General semantic similarity; normalized embeddings | Often a safe default |
| Dot product | Normalized vectors or models trained for dot product | Similar to cosine if vectors are normalized |
| Euclidean (L2) | Some ANN indexes; certain model families | Works well when validated on your eval set |
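A quick numeric check of the note above: once vectors are L2-normalized, dot product and cosine similarity give the same value, and therefore the same ranking.

```python
# After L2 normalization, dot product and cosine similarity are identical.
import numpy as np

rng = np.random.default_rng(0)
a, b = rng.normal(size=384), rng.normal(size=384)

cos = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
a_n, b_n = a / np.linalg.norm(a), b / np.linalg.norm(b)
dot = np.dot(a_n, b_n)

print(abs(cos - dot) < 1e-9)  # True: same score once vectors are normalized
```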
3.1) Choosing an embedding model (domain + constraints)
- Domain fit: FAQ/helpdesk vs legal vs code vs product docs.
- Multilingual: if you serve multiple languages, choose a multilingual model or run language detection + per-language indexes.
- Dimension & cost: higher dimensions increase storage/IO; test if quality gains justify it.
- Latency: on-demand embedding (queries) should be low-latency; batch embedding (docs) can be slower.
4) Indexing: why ANN exists
Exact nearest-neighbor search over millions of vectors is slow. ANN indexes trade tiny accuracy for big speed.
| Index type | Strength | Tradeoff |
|---|---|---|
| HNSW | Great recall/latency; widely used | More memory; tuning matters |
| IVF / PQ | Good for very large corpora | More complex; can reduce recall if mis-tuned |
| Flat | Exact results; simplest | Gets slow at scale |
Index parameter tuning (starter)
- HNSW: tune M (graph connectivity) and efConstruction; query-time ef controls recall/latency.
- IVF: choose nlist (clusters) and nprobe (clusters searched). Larger nprobe improves recall but increases latency.
- PQ: product quantization compresses vectors; validate recall loss on your eval set before enabling.
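To make the knobs concrete, here is a sketch using FAISS as one example ANN library; the parameter values are illustrative starting points, not tuned recommendations.

```python
# Where the parameters above plug in, sketched with FAISS as one example ANN
# library. Values are illustrative starting points, not recommendations.
import faiss
import numpy as np

d = 384                                                    # embedding dimension
xb = np.random.random((100_000, d)).astype("float32")      # corpus vectors
xq = np.random.random((5, d)).astype("float32")            # query vectors

# HNSW: M = graph connectivity, efConstruction = build-time effort,
# efSearch = query-time recall/latency knob.
hnsw = faiss.IndexHNSWFlat(d, 32)        # M = 32
hnsw.hnsw.efConstruction = 200
hnsw.add(xb)
hnsw.hnsw.efSearch = 64
D, I = hnsw.search(xq, 10)

# IVF: nlist = number of clusters, nprobe = clusters searched per query.
quantizer = faiss.IndexFlatL2(d)
ivf = faiss.IndexIVFFlat(quantizer, d, 1024)   # nlist = 1024
ivf.train(xb)                                  # IVF needs a training pass
ivf.add(xb)
ivf.nprobe = 16                                # raise for recall, lower for latency
D, I = ivf.search(xq, 10)
```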
5) How-to: choose chunking that doesn’t sabotage retrieval
Chunking is where most RAG systems silently fail. You’re deciding what the retriever can “see” and what it can’t.
Step-by-step starter recipe
- Start with structure: split by headings/sections first, then by length.
- Keep chunks answerable: each chunk should stand alone (no dangling references).
- Add metadata: doc_id, url, section_title, updated_at, tenant_id, access labels.
- Measure recall@K with a small query set, then iterate.
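A minimal sketch of that recipe, assuming Markdown-ish source documents; the splitting rules and field names are illustrative and should be adapted to your corpus.

```python
# Split on headings first, then cap chunk length, and attach metadata to every
# chunk. Assumes Markdown-ish source text; field names mirror the list above.
import re

def chunk_document(text: str, doc_id: str, url: str, tenant_id: str,
                   max_chars: int = 1500) -> list[dict]:
    chunks = []
    # 1) Structure first: split before Markdown headings, keeping the heading line.
    sections = re.split(r"\n(?=#{1,6} )", text)
    for section in sections:
        lines = section.strip().splitlines()
        if not lines:
            continue
        title = lines[0].lstrip("# ").strip()
        body = "\n".join(lines)
        # 2) Then length: split long sections on paragraph boundaries.
        parts, current = [], ""
        for para in body.split("\n\n"):
            if len(current) + len(para) > max_chars and current:
                parts.append(current)
                current = para
            else:
                current = f"{current}\n\n{para}".strip()
        parts.append(current)
        # 3) Metadata on every chunk.
        for i, part in enumerate(parts):
            chunks.append({
                "id": f"{doc_id}:{title}:{i}",
                "text": part,
                "metadata": {"doc_id": doc_id, "url": url,
                             "section_title": title, "tenant_id": tenant_id},
            })
    return chunks
```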
Chunking pitfalls (common)
- Chunks too big: top match contains the answer but also a lot of noise (LLM misses it).
- Chunks too small: answer is split across chunks; retrieval returns fragments without context.
- No metadata filtering: model retrieves the “right” answer from the wrong tenant/version.
Chunk overlap and anchors
- Light overlap (e.g., 10–20% tokens) can preserve context across boundaries.
- Anchors: include headings, IDs, and section paths in metadata for better filtering and attribution.
- Normalization: strip boilerplate, unify punctuation/whitespace, and remove navigation chrome.
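Two small helpers for the points above; the overlap ratio and the boilerplate patterns are placeholder assumptions, not a complete cleaning pipeline.

```python
# Light token overlap across chunk boundaries plus basic normalization.
# The boilerplate patterns and 15% overlap are illustrative.
import re

def normalize(text: str) -> str:
    text = re.sub(r"(Cookie settings|Skip to main content)", "", text)  # example boilerplate
    return re.sub(r"\s+", " ", text).strip()                            # unify whitespace

def sliding_chunks(tokens: list[str], size: int = 300, overlap: float = 0.15) -> list[list[str]]:
    step = max(1, int(size * (1 - overlap)))   # ~15% of tokens repeat across chunks
    return [tokens[i:i + size] for i in range(0, len(tokens), step)]
```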
6) How-to: implement metadata filtering (multi-tenant + permissions)
For real products, metadata filters are not optional. They are the difference between “RAG” and “data leak.”
```json
{
  "query": "How do I rotate an API key?",
  "top_k": 10,
  "filter": {
    "tenant_id": "tenant_123",
    "doc_visibility": "public",
    "product": "enterprise"
  }
}
```
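The payload above is what a caller might send; the enforcement has to happen server-side. A sketch, where `vector_search` is a hypothetical stand-in for your vector DB client and the tenant/visibility values come from the authenticated session rather than the request:

```python
# Enforce tenant/ACL filters in the retriever layer. `vector_search` is a
# placeholder for your vector DB client; tenant_id and visibility come from the
# authenticated session, never from the request body.
def vector_search(query: str, top_k: int, filter: dict) -> list[dict]:
    """Placeholder for your vector DB client's query call."""
    raise NotImplementedError

def retrieve(query: str, session: dict, user_filter: dict | None = None) -> list[dict]:
    enforced = {
        "tenant_id": session["tenant_id"],              # from auth, not the caller
        "doc_visibility": session["allowed_visibility"],
    }
    # Caller-supplied filters can only narrow results, never widen them:
    # enforced keys are merged last, so they always win.
    final_filter = {**(user_filter or {}), **enforced}
    return vector_search(query=query, top_k=10, filter=final_filter)
```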
7) Hybrid search + reranking (the quality multiplier)
Vector search is strong at semantic recall. BM25 is strong at exact-match and proper nouns. Hybrid search combines both, then reranking helps you pick the best final context.
- Hybrid retrieval: retrieve candidates from BM25 + vectors, fuse rankings (RRF/weights).
- Reranking: score candidate chunks against the query with a stronger model (cross-encoder or LLM) and take the top N.
Fusion strategies
- RRF (Reciprocal Rank Fusion): simple and robust; good baseline.
- Weighted linear: weight lexical vs semantic scores based on eval results.
- Learned fusion: train a lightweight model to combine signals if you have labels.
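RRF is simple enough to show in full. The sketch below is a minimal version of the standard formula (score = sum of 1/(k + rank) across rankings, with k = 60 as the commonly used constant); the document IDs are toy data.

```python
# Reciprocal Rank Fusion (RRF): combine a lexical (BM25) ranking and a vector
# ranking into one candidate list. k = 60 is the commonly used smoothing constant.
from collections import defaultdict

def rrf_fuse(rankings: list[list[str]], k: int = 60) -> list[str]:
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:                       # each ranking: doc IDs, best first
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

bm25_hits   = ["doc_7", "doc_2", "doc_9"]          # lexical candidates
vector_hits = ["doc_2", "doc_5", "doc_7"]          # semantic candidates
print(rrf_fuse([bm25_hits, vector_hits]))          # doc_2 and doc_7 rise to the top
```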
Reranker choices
- Cross-encoder: high precision, pairwise scoring; best quality, moderate cost.
- LLM rerank: flexible and explainable, but higher latency/cost; cap candidates and cache aggressively.
- Heuristic rerank: quick filters (dedupe by doc, prefer recent) before expensive reranking.
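A sketch of the cheap heuristic pass, assuming candidates carry the `doc_id` metadata field used earlier in this guide and arrive sorted by retrieval score:

```python
# Cheap pre-filter before an expensive reranker: dedupe by source document and
# cap the candidate list. Candidates are assumed sorted best-first.
def heuristic_prefilter(candidates: list[dict], max_candidates: int = 20) -> list[dict]:
    seen_docs, deduped = set(), []
    for c in candidates:
        doc_id = c["metadata"]["doc_id"]
        if doc_id in seen_docs:
            continue               # keep only the best-ranked chunk per document
        seen_docs.add(doc_id)
        deduped.append(c)
    # A recency boost (e.g. on metadata["updated_at"]) can be layered on top.
    return deduped[:max_candidates]
```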
8) Evaluation: measure retrieval before you blame the generator
Most “hallucinations” in RAG are actually retrieval failures. Separate retrieval eval from generation eval.
- Recall@K: did the correct chunk appear in top K?
- MRR / NDCG: MRR captures how highly the first relevant chunk is ranked; NDCG captures overall ranking quality with graded relevance.
- Slice metrics: performance by doc type, tenant, language, and query type.
How-to: build a tiny evaluation set (fast)
- Collect 30–100 real queries.
- For each query, mark the relevant chunk(s) (IDs) as “gold.”
- Run retrieval and compute recall@5 and recall@10.
- Iterate chunking/embedding/hybrid/rerank until recall stabilizes.
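A minimal evaluation loop for that recipe; the gold labels and retrieved IDs below are toy data to show the computation.

```python
# Compute recall@5, recall@10, and MRR from gold chunk IDs and retrieved top-K IDs.
def evaluate(gold: dict[str, set[str]], retrieved: dict[str, list[str]]) -> dict[str, float]:
    recall5 = recall10 = rr_sum = 0.0
    for query, gold_ids in gold.items():
        top = retrieved.get(query, [])
        recall5  += float(any(doc in gold_ids for doc in top[:5]))
        recall10 += float(any(doc in gold_ids for doc in top[:10]))
        rank = next((i for i, doc in enumerate(top, start=1) if doc in gold_ids), None)
        rr_sum += 1.0 / rank if rank else 0.0
    n = len(gold)
    return {"recall@5": recall5 / n, "recall@10": recall10 / n, "mrr": rr_sum / n}

gold = {"how do I rotate an API key?": {"doc_4:security:2"}}
retrieved = {"how do I rotate an API key?": ["doc_9:intro:0", "doc_4:security:2", "doc_1:faq:3"]}
print(evaluate(gold, retrieved))   # {'recall@5': 1.0, 'recall@10': 1.0, 'mrr': 0.5}
```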
| Metric | Definition | Target (starter) |
|---|---|---|
| Recall@5 | Any gold chunk appears in top 5 | ≥ 0.7 for common queries |
| MRR | Mean reciprocal rank of first relevant | ≥ 0.5 baseline; improve with rerank |
| NDCG@10 | Ranking quality with graded relevance | ≥ 0.6 baseline; watch slices |
9) Operations: re-embedding, updates, and drift
Production vector systems change over time. Plan for:
- Document updates: re-embed changed chunks; delete removed chunks.
- Embedding model upgrades: staged re-embedding, dual indexes, and canary queries.
- Distribution drift: new product features and new vocabulary need fresh examples.
Production checklist
- Write path: idempotent upserts, delete on source removal, backfill jobs.
- Index health: recall canaries, ef/nprobe telemetry, error budgets.
- Filters: enforce tenant/ACL in the retriever layer, not just UI.
- Observability: log queries + topK IDs + final context; sample to review.
- Cost: cache embeddings and rerank results; cap candidates.
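A sketch of the write-path item, assuming hypothetical `store` and `embed` clients: stable chunk IDs make upserts idempotent, and a content hash skips re-embedding unchanged chunks.

```python
# Idempotent write path: stable chunk IDs plus a content hash decide whether
# re-embedding is needed. `store` and `embed` are hypothetical stand-ins for
# your vector DB client and embedding call.
import hashlib

def sync_chunk(store, embed, doc_id: str, chunk_index: int, text: str, metadata: dict) -> None:
    chunk_id = f"{doc_id}:{chunk_index}"                   # stable ID -> upserts are idempotent
    content_hash = hashlib.sha256(text.encode()).hexdigest()
    existing = store.get_metadata(chunk_id)                # hypothetical lookup
    if existing and existing.get("content_hash") == content_hash:
        return                                             # unchanged: skip re-embedding (saves cost)
    store.upsert(chunk_id, embed(text), {**metadata, "content_hash": content_hash})
```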
FAQ (direct answers)
Is pgvector “good enough”?
Often, yes—especially early. If you’re already on Postgres and your vector workload is modest, pgvector is a strong default. If you hit high QPS, huge corpora, or complex hybrid search needs, a dedicated vector DB can be worth it.
Do embeddings remove the need for keywords?
No. Proper nouns, IDs, error codes, and exact phrases are where BM25 shines. Hybrid search is usually the “grown-up” solution.
Bottom line
- Embeddings give you semantic similarity; vector databases give you fast retrieval at scale.
- Retrieval quality is an engineering problem: chunking, metadata filters, hybrid search, reranking, and evaluation.
- Start simple, measure recall@K, and iterate—don’t guess.