Concept
Sparse Vector Representation
Document: "The cat sat on the mat"
Vocabulary: [the, cat, sat, on, mat, dog, ran, ...]
Sparse vector (bag of words):

Index:   0    1    2    3   4    5    6   ...
Term:  [the, cat, sat,  on, mat, dog, ran, ...]
Value: [  2,   1,   1,   1,   1,   0,   0, ...]

("the" appears twice, so its value is 2; absent terms get 0)
TF-IDF weighted:

[0.1, 0.8, 0.7, 0.2, 0.8, 0, 0, ...]

(common words like "the" get low weight; rare words get high weight)
Most values are 0 → "sparse" representation
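The bag-of-words vector above can be built in a few lines. This is a minimal sketch; the small vocabulary list is illustrative, not a real vocabulary:

```python
from collections import Counter

# Illustrative vocabulary and document (matching the diagram above)
vocab = ["the", "cat", "sat", "on", "mat", "dog", "ran"]
doc = "The cat sat on the mat"

# Count term occurrences, then read off a value per vocabulary slot
counts = Counter(doc.lower().split())
vector = [counts.get(term, 0) for term in vocab]
print(vector)  # → [2, 1, 1, 1, 1, 0, 0]
```

With a realistic vocabulary of tens of thousands of terms, almost every slot would be 0, which is why sparse formats store only the nonzero entries.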
Implementation
BM25 Search
from rank_bm25 import BM25Okapi

class SparseRetriever:
    def __init__(self):
        self.documents = []
        self.bm25 = None

    def index(self, documents):
        """Build the BM25 index over a list of document strings."""
        self.documents = documents
        # Tokenize documents (simple lowercase + whitespace split)
        tokenized = [doc.lower().split() for doc in documents]
        self.bm25 = BM25Okapi(tokenized)

    def search(self, query, k=5):
        """Find documents with matching keywords, ranked by BM25 score."""
        tokenized_query = query.lower().split()
        scores = self.bm25.get_scores(tokenized_query)
        # Indices of the top-k scoring documents
        top_indices = sorted(
            range(len(scores)),
            key=lambda i: scores[i],
            reverse=True
        )[:k]
        return [
            {"doc": self.documents[i], "score": scores[i]}
            for i in top_indices
        ]
# Example usage:
retriever = SparseRetriever()
retriever.index([
    "Python programming tutorial",
    "Java development guide",
    "Python data science basics"
])

results = retriever.search("Python tutorial")
# Docs 1 and 3 rank highest (both contain "Python")
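To see what BM25Okapi computes under the hood, here is a from-scratch sketch of the Okapi BM25 scoring formula. The parameter values k1=1.5 and b=0.75 are common defaults, and the smoothed IDF variant used here is one of several in the literature, so exact scores may differ slightly from the library's:

```python
import math

def bm25_score(query_terms, doc_terms, corpus, k1=1.5, b=0.75):
    """Okapi BM25 score of one tokenized document against query terms.
    k1 and b are common defaults (assumptions, not from the text above)."""
    N = len(corpus)
    avgdl = sum(len(d) for d in corpus) / N
    score = 0.0
    for term in query_terms:
        n = sum(1 for d in corpus if term in d)        # docs containing term
        idf = math.log((N - n + 0.5) / (n + 0.5) + 1)  # smoothed IDF
        f = doc_terms.count(term)                      # term frequency in doc
        denom = f + k1 * (1 - b + b * len(doc_terms) / avgdl)
        score += idf * f * (k1 + 1) / denom
    return score

corpus = [d.lower().split() for d in [
    "Python programming tutorial",
    "Java development guide",
    "Python data science basics",
]]
query = "python tutorial".split()
scores = [bm25_score(query, doc, corpus) for doc in corpus]
```

The first document scores highest (it matches both query terms), the third scores lower (matches "python" only), and the second scores zero (no overlap), which matches the ranking the retriever above produces.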