Memory and Context / Retrieval Methods

Dense Retrieval

Advanced [4/5]
Semantic retrieval · Vector retrieval · Embedding-based search

Definition

Dense retrieval uses learned dense vector representations (embeddings) to find semantically similar documents. Unlike keyword matching, it understands meaning—"car" matches "automobile" even without shared words.

Documents and queries are encoded into high-dimensional vectors, and retrieval finds documents whose vectors are closest to the query vector.
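The "closest vector" idea can be seen with toy numbers. This is a minimal sketch using made-up 3-dimensional vectors (real embedding models produce hundreds of dimensions):

```python
import numpy as np

# Toy 3-dimensional "embeddings" (real models produce 384+ dimensions).
query = np.array([0.2, -0.4, 0.7])
docs = np.array([
    [0.21, -0.42, 0.65],   # semantically close to the query
    [0.90,  0.10, -0.30],  # unrelated
])

def cosine(a, b):
    # Cosine similarity: dot product divided by the vector norms.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

scores = [cosine(query, d) for d in docs]
best = int(np.argmax(scores))  # index of the closest document (0 here)
```

The first document's vector points in nearly the same direction as the query, so its cosine similarity is close to 1.0 and it wins the ranking.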

Key Concepts

  • Embeddings: Dense vector representations capturing semantic meaning
  • Similarity metrics: Cosine similarity, dot product, Euclidean distance
  • Bi-encoder: Separate encoders for query and document
  • Approximate nearest neighbor: Efficient search in large vector spaces
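The three similarity metrics listed above are closely related. A short sketch showing that, once vectors are L2-normalized, cosine similarity equals the dot product and Euclidean distance ranks documents identically:

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.normal(size=8)
b = rng.normal(size=8)

# L2-normalize both vectors so they lie on the unit sphere.
a /= np.linalg.norm(a)
b /= np.linalg.norm(b)

cos = float(np.dot(a, b))            # cosine == dot product after normalization
dist = float(np.linalg.norm(a - b))  # Euclidean distance

# For unit vectors: ||a - b||^2 = 2 - 2*cos(a, b),
# so all three metrics produce the same ranking.
assert abs(dist**2 - (2 - 2 * cos)) < 1e-9
```

This is why many vector databases only need a fast dot-product kernel: normalize once at index time and cosine search comes for free.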

Examples

Concept
Dense vs Sparse Retrieval
Query: "How do I fix a flat tire?"

SPARSE (keyword) matching:
  Doc 1: "Tire repair guide"          → Match: "tire" ✓
  Doc 2: "Changing a punctured wheel" → No keyword match ✗
  Doc 3: "Tire pressure tips"         → Match: "tire" ✓

DENSE (semantic) matching:
  Query embedding: [0.23, -0.45, 0.67, ...]
  Doc 1: [0.21, -0.42, 0.65, ...] → Similarity: 0.92 ✓
  Doc 2: [0.24, -0.44, 0.68, ...] → Similarity: 0.95 ✓✓ (best!)
  Doc 3: [0.15, -0.30, 0.40, ...] → Similarity: 0.71

Dense retrieval ranks Doc 2 ("punctured wheel") highest despite zero keyword overlap.
Implementation
Dense Retrieval Pipeline
from sentence_transformers import SentenceTransformer
import numpy as np

class DenseRetriever:
    def __init__(self, model_name="all-MiniLM-L6-v2"):
        self.encoder = SentenceTransformer(model_name)
        self.documents = []
        self.embeddings = None

    def index(self, documents):
        """Embed and store documents."""
        self.documents = documents
        self.embeddings = self.encoder.encode(documents)

    def search(self, query, k=5):
        """Find the k most similar documents."""
        query_embedding = self.encoder.encode([query])[0]
        # Cosine similarity between the query and every document
        similarities = np.dot(self.embeddings, query_embedding) / (
            np.linalg.norm(self.embeddings, axis=1)
            * np.linalg.norm(query_embedding)
        )
        # Indices of the top-k scores, highest first
        top_indices = np.argsort(similarities)[-k:][::-1]
        return [
            {"doc": self.documents[i], "score": similarities[i]}
            for i in top_indices
        ]

Interactive Exercise

Predict Retrieval Results

Which document would dense retrieval rank highest for this query?

Query: "symptoms of the common cold"

Documents:
A: "Cold weather safety tips"
B: "Runny nose, sneezing, and sore throat"
C: "Refrigerator temperature settings"

Pro Tips
  • Choose embedding models trained on your domain
  • Normalize embeddings for stable cosine similarity
  • Use ANN indexes (FAISS, Annoy) for large-scale search
  • Consider hybrid retrieval (dense + keyword/BM25) — combining both often beats either alone, especially for queries with exact terms like IDs or names
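The normalization tip above pays off at search time: with document embeddings pre-normalized, cosine search over the whole index is a single matrix multiply. A minimal sketch with random vectors standing in for real embeddings:

```python
import numpy as np

def normalize(m):
    # L2-normalize along the last axis so dot product == cosine similarity.
    return m / np.linalg.norm(m, axis=-1, keepdims=True)

# Random stand-ins for real embeddings: 1000 docs, 64 dimensions.
doc_embs = normalize(np.random.default_rng(1).normal(size=(1000, 64)))
query = normalize(np.random.default_rng(2).normal(size=(64,)))

scores = doc_embs @ query              # cosine similarities, shape (1000,)
top5 = np.argsort(scores)[-5:][::-1]   # exact top-k, highest first
```

This exact brute-force scan is fine up to roughly a million vectors; beyond that, swap the matmul + argsort for an ANN index (FAISS, Annoy) as the tips suggest.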

Related Terms