
Vector Embedding


Definition

A vector embedding is a numerical representation of text (or other data) as a list of numbers. These numbers capture the semantic meaning of the content, allowing computers to understand similarity and relationships between pieces of text.

Think of it as translating words into coordinates in a high-dimensional "meaning space" where similar concepts are close together.

Key Concepts

  • High dimensionality: Typically 384 to 3072 dimensions
  • Semantic similarity: Similar meanings → similar vectors
  • Distance metrics: Cosine similarity, Euclidean distance (see the sketch after this list)
  • Embedding models: Specialized models that create embeddings
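
For instance, cosine similarity can be computed directly with NumPy. The sketch below uses made-up toy vectors rather than real model output:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine of the angle between two vectors: 1.0 = same direction."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy 4-dimensional "embeddings" (real ones have hundreds of dimensions)
cat = np.array([0.02, -0.45, 0.89, 0.23])
feline = np.array([0.02, -0.45, 0.90, 0.23])
stocks = np.array([0.89, 0.23, -0.57, -0.12])

print(cosine_similarity(cat, feline))  # close to 1.0 (similar meaning)
print(cosine_similarity(cat, stocks))  # much lower (unrelated)
```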

Examples

How Embeddings Work
Text to Numbers
"The cat sat on the mat" ↓ [0.023, -0.451, 0.892, ..., 0.234] (1536 numbers) "A feline rested on the rug" ↓ [0.019, -0.448, 0.901, ..., 0.228] (1536 numbers) Similarity score: 0.94 (very similar!) "The stock market crashed" ↓ [0.891, 0.234, -0.567, ..., -0.123] (1536 numbers) Similarity to cat sentence: 0.12 (very different)
Sentences with similar meanings produce similar vectors, regardless of the exact words used.
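
A rough sketch of the same idea in code, assuming the open-source sentence-transformers library and its all-MiniLM-L6-v2 model (384 dimensions, not the 1536 shown above):

```python
from sentence_transformers import SentenceTransformer, util

# Load a small open-source embedding model (384 dimensions)
model = SentenceTransformer("all-MiniLM-L6-v2")

sentences = [
    "The cat sat on the mat",
    "A feline rested on the rug",
    "The stock market crashed",
]
embeddings = model.encode(sentences)  # shape: (3, 384)

# Paraphrases score high; unrelated sentences score low
print(util.cos_sim(embeddings[0], embeddings[1]))  # high (paraphrase)
print(util.cos_sim(embeddings[0], embeddings[2]))  # low (unrelated)
```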
Embedding in RAG
How Retrieval Works
INDEXING (done once):
  1. Split documents into chunks
  2. Create an embedding for each chunk
  3. Store the embeddings in a vector database

RETRIEVAL (per query):
  1. User asks: "How do I reset my password?"
       ↓ Query embedding: [0.234, -0.567, ...]
  2. Compare the query embedding to all stored embeddings
  3. Find the most similar chunks:
       - "Password Reset Guide" (0.92)
       - "Account Recovery FAQ" (0.87)
       - "Login Troubleshooting" (0.84)
  4. Return the top matches as LLM context
Embeddings power the retrieval step in RAG systems.
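
A toy, in-memory version of that pipeline might look like the sketch below; a production system would replace the Python list with a vector database, and the chunk texts here are invented:

```python
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

# INDEXING: embed each chunk once and keep the matrix around
chunks = [
    "Password Reset Guide: click 'Forgot password' on the login page...",
    "Account Recovery FAQ: if you lost access to your email...",
    "Login Troubleshooting: clear your browser cache...",
    "Billing overview: invoices are issued monthly...",
]
index = model.encode(chunks, normalize_embeddings=True)  # shape: (4, 384)

# RETRIEVAL: embed the query and rank chunks by cosine similarity.
# With normalized vectors, cosine similarity is just a dot product.
query = model.encode(["How do I reset my password?"], normalize_embeddings=True)[0]
scores = index @ query

top_k = np.argsort(scores)[::-1][:3]  # indices of the 3 best matches
for i in top_k:
    print(f"{scores[i]:.2f}  {chunks[i]}")
```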
Popular Embedding Models
Model Comparison
Model                                  Dimensions   Good for
OpenAI text-embedding-3-small          1536         General purpose
OpenAI text-embedding-3-large          3072         Higher accuracy needs
Cohere embed-english-v3                1024         English-specific tasks
sentence-transformers (open source)    384-768      Self-hosted, privacy
Voyage AI                              1024         Code and technical content
Choose an embedding model based on your accuracy, cost, and privacy needs.
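
As one concrete example, fetching a vector from a hosted model with OpenAI's Python client looks roughly like this (the query text is illustrative):

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.embeddings.create(
    model="text-embedding-3-small",
    input="How do I reset my password?",
)
vector = response.data[0].embedding  # list of 1536 floats
print(len(vector))
```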

Interactive Exercise

Predict Similarity

Rank these sentence pairs by how similar their embeddings would be (1 = most similar):

A. "I love pizza" vs "Pizza is my favorite food"

B. "The dog ran fast" vs "The canine sprinted quickly"

C. "I love pizza" vs "The stock market is volatile"

D. "Machine learning is fascinating" vs "AI is interesting"

Pro Tips
  • Use the same embedding model for indexing and querying
  • Normalize embeddings so cosine similarity reduces to a fast dot product
  • Consider domain-specific models for specialized content
  • Cache embeddings—they're expensive to regenerate (see the sketch below)
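
One way to implement that last tip is a small on-disk cache. The sketch below is a hypothetical helper, keyed by model name and text so that switching models never reuses stale vectors:

```python
import hashlib
import json
from pathlib import Path

CACHE_DIR = Path("embedding_cache")
CACHE_DIR.mkdir(exist_ok=True)

def cached_embedding(text: str, model_name: str, embed_fn):
    """Return a cached embedding if present; otherwise compute and store it."""
    key = hashlib.sha256(f"{model_name}:{text}".encode()).hexdigest()
    path = CACHE_DIR / f"{key}.json"
    if path.exists():
        return json.loads(path.read_text())
    vector = embed_fn(text)  # e.g. model.encode(text).tolist()
    path.write_text(json.dumps(vector))
    return vector
```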
