Memory and Context / Retrieval Methods

Sparse Retrieval

Intermediate [3/5]
Keyword retrieval Lexical retrieval BM25 retrieval

Definition

Sparse retrieval finds documents based on exact or near-exact keyword matches. It uses sparse vectors where most values are zero, with non-zero values indicating the presence and importance of specific terms.

Classic algorithms like TF-IDF and BM25 are sparse retrieval methods that have been the foundation of search engines for decades.

Key Concepts

  • Term frequency (TF): How often a term appears in a document
  • Inverse document frequency (IDF): Rarity of term across corpus
  • BM25: Advanced scoring with saturation and length normalization
  • Inverted index: Data structure mapping terms to documents

Examples

Concept
Sparse Vector Representation
Document: "The cat sat on the mat" Vocabulary: [the, cat, sat, on, mat, dog, ran, ...] Sparse vector (bag of words): Index: 0 1 2 3 4 5 6 ... Term: [the, cat, sat, on, mat, dog, ran, ...] Value: [ 2, 1, 1, 1, 1, 0, 0, ...] ↑ ↑ "the" appears zeros for twice absent terms TF-IDF weighted: [0.1, 0.8, 0.7, 0.2, 0.8, 0, 0, ...] ↑ ↑ common word rare word (low weight) (high weight) Most values are 0 → "sparse" representation
Implementation
BM25 Search
from rank_bm25 import BM25Okapi class SparseRetriever: def __init__(self): self.documents = [] self.bm25 = None def index(self, documents): """Build inverted index""" self.documents = documents # Tokenize documents tokenized = [doc.lower().split() for doc in documents] self.bm25 = BM25Okapi(tokenized) def search(self, query, k=5): """Find documents with matching keywords""" tokenized_query = query.lower().split() scores = self.bm25.get_scores(tokenized_query) # Get top-k top_indices = sorted( range(len(scores)), key=lambda i: scores[i], reverse=True )[:k] return [ {"doc": self.documents[i], "score": scores[i]} for i in top_indices ] # Example usage: retriever = SparseRetriever() retriever.index([ "Python programming tutorial", "Java development guide", "Python data science basics" ]) results = retriever.search("Python tutorial") # Returns docs 1 and 3 (contain "Python")

Interactive Exercise

Compare Retrieval Methods

When would sparse retrieval outperform dense retrieval?

Give an example scenario where exact keyword matching is preferred.

Pro Tips
  • Excellent for exact term matching (IDs, codes, names)
  • Fast and interpretable—you know why a doc matched
  • No training required, works out of the box
  • Combine with dense retrieval for best results

Related Terms