
Vector Embedding


Definition

A vector embedding is a numerical representation of text (or other data) as a list of numbers. These numbers capture the semantic meaning of the content, allowing computers to understand similarity and relationships between pieces of text.

Think of it as translating words into coordinates in a high-dimensional "meaning space" where similar concepts are close together.

Key Concepts

  • High dimensionality: Typically 384 to 3072 dimensions
  • Semantic similarity: Similar meanings → similar vectors
  • Distance metrics: Cosine similarity, Euclidean distance (see the sketch after this list)
  • Embedding models: Specialized models that create embeddings
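
For instance, cosine similarity can be computed directly with NumPy. The sketch below uses made-up toy vectors rather than real model output:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine of the angle between two vectors: 1.0 = same direction."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy 4-dimensional "embeddings" (real ones have hundreds of dimensions)
cat = np.array([0.02, -0.45, 0.89, 0.23])
feline = np.array([0.02, -0.45, 0.90, 0.23])
stocks = np.array([0.89, 0.23, -0.57, -0.12])

print(cosine_similarity(cat, feline))  # close to 1.0 (similar meaning)
print(cosine_similarity(cat, stocks))  # much lower (unrelated)
```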

Examples

How Embeddings Work
Text to Numbers
"The cat sat on the mat" ↓ [0.023, -0.451, 0.892, ..., 0.234] (1536 numbers) "A feline rested on the rug" ↓ [0.019, -0.448, 0.901, ..., 0.228] (1536 numbers) Similarity score: 0.94 (very similar!) "The stock market crashed" ↓ [0.891, 0.234, -0.567, ..., -0.123] (1536 numbers) Similarity to cat sentence: 0.12 (very different)
Sentences with similar meanings produce similar vectors, regardless of the exact words used.
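
A rough sketch of the same idea in code, assuming the open-source sentence-transformers library and its all-MiniLM-L6-v2 model (384 dimensions, not the 1536 shown above):

```python
from sentence_transformers import SentenceTransformer, util

# Load a small open-source embedding model (384 dimensions)
model = SentenceTransformer("all-MiniLM-L6-v2")

sentences = [
    "The cat sat on the mat",
    "A feline rested on the rug",
    "The stock market crashed",
]
embeddings = model.encode(sentences)  # shape: (3, 384)

# Paraphrases score high; unrelated sentences score low
print(util.cos_sim(embeddings[0], embeddings[1]))  # high (paraphrase)
print(util.cos_sim(embeddings[0], embeddings[2]))  # low (unrelated)
```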
Embedding in RAG
How Retrieval Works
INDEXING (done once):
  1. Split documents into chunks
  2. Create an embedding for each chunk
  3. Store the embeddings in a vector database

RETRIEVAL (per query):
  1. User asks: "How do I reset my password?"
       ↓ Query embedding: [0.234, -0.567, ...]
  2. Compare the query embedding to all stored embeddings
  3. Find the most similar chunks:
       - "Password Reset Guide" (0.92)
       - "Account Recovery FAQ" (0.87)
       - "Login Troubleshooting" (0.84)
  4. Return the top matches as LLM context
Embeddings power the retrieval step in RAG systems.
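
A toy, in-memory version of that pipeline might look like the sketch below; a production system would replace the Python list with a vector database, and the chunk texts here are invented:

```python
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

# INDEXING: embed each chunk once and keep the matrix around
chunks = [
    "Password Reset Guide: click 'Forgot password' on the login page...",
    "Account Recovery FAQ: if you lost access to your email...",
    "Login Troubleshooting: clear your browser cache...",
    "Billing overview: invoices are issued monthly...",
]
index = model.encode(chunks, normalize_embeddings=True)  # shape: (4, 384)

# RETRIEVAL: embed the query and rank chunks by cosine similarity.
# With normalized vectors, cosine similarity is just a dot product.
query = model.encode(["How do I reset my password?"], normalize_embeddings=True)[0]
scores = index @ query

top_k = np.argsort(scores)[::-1][:3]  # indices of the 3 best matches
for i in top_k:
    print(f"{scores[i]:.2f}  {chunks[i]}")
```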
Popular Embedding Models
Model Comparison
Model                                  Dimensions   Good for
OpenAI text-embedding-3-small          1536         General purpose
OpenAI text-embedding-3-large          3072         Higher accuracy needs
Cohere embed-english-v3                1024         English-specific tasks
sentence-transformers (open source)    384-768      Self-hosted, privacy
Voyage AI                              1024         Code and technical content
Choose an embedding model based on your accuracy, cost, and privacy needs.
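
As one concrete example, fetching a vector from a hosted model with OpenAI's Python client looks roughly like this (the query text is illustrative):

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.embeddings.create(
    model="text-embedding-3-small",
    input="How do I reset my password?",
)
vector = response.data[0].embedding  # list of 1536 floats
print(len(vector))
```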

Interactive Exercise

Predict Similarity

Rank these sentence pairs by how similar their embeddings would be (1 = most similar):

A. "I love pizza" vs "Pizza is my favorite food"

B. "The dog ran fast" vs "The canine sprinted quickly"

C. "I love pizza" vs "The stock market is volatile"

D. "Machine learning is fascinating" vs "AI is interesting"

Pro Tips
  • Use the same embedding model for indexing and querying
  • Normalize embeddings so cosine similarity reduces to a fast dot product
  • Consider domain-specific models for specialized content
  • Cache embeddings—they're expensive to regenerate (see the sketch below)
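
One way to implement that last tip is a small on-disk cache. The sketch below is a hypothetical helper, keyed by model name and text so that switching models never reuses stale vectors:

```python
import hashlib
import json
from pathlib import Path

CACHE_DIR = Path("embedding_cache")
CACHE_DIR.mkdir(exist_ok=True)

def cached_embedding(text: str, model_name: str, embed_fn):
    """Return a cached embedding if present; otherwise compute and store it."""
    key = hashlib.sha256(f"{model_name}:{text}".encode()).hexdigest()
    path = CACHE_DIR / f"{key}.json"
    if path.exists():
        return json.loads(path.read_text())
    vector = embed_fn(text)  # e.g. model.encode(text).tolist()
    path.write_text(json.dumps(vector))
    return vector
```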
