Information Retrieval / Search Algorithms

BM25

Intermediate [3/5]
Best Matching 25 Okapi BM25 Probabilistic retrieval

Definition

BM25 (Best Matching 25) is a ranking function used for information retrieval that scores documents based on term frequency and document length. It's a probabilistic model that improves upon TF-IDF by handling term saturation and document length normalization more effectively.

In RAG systems, BM25 is often used as a keyword-based retrieval method, complementing dense vector (semantic) search for hybrid retrieval approaches.

Key Concepts

  • Term frequency saturation: Diminishing returns for repeated terms
  • Document length normalization: Adjusts for document size
  • IDF component: Rare terms weighted more heavily
  • Sparse retrieval: Based on exact term matching

Examples

Formula
BM25 Scoring Function
BM25 FORMULA: For a query Q with terms q1, q2, ..., qn: score(D, Q) = Σ IDF(qi) × (f(qi,D) × (k1 + 1)) ───────────────────────────────── f(qi,D) + k1 × (1 - b + b × |D|/avgdl) WHERE: - f(qi, D) = frequency of term qi in document D - |D| = length of document D (in words) - avgdl = average document length in corpus - k1 = term frequency saturation parameter (1.2-2.0) - b = document length normalization (0.75 typical) - IDF(qi) = inverse document frequency of qi --- INTUITION: Term Frequency Effect (with saturation): ┌────────────────────────────────────────────────┐ │ Score │ │ │ ____... │ │ │ _____/ │ │ │ ____/ │ │ │ ____/ │ │ │ ____/ BM25 (saturates) │ │ │____/ │ │ │ │ │ └──────────────────────────────────────── │ │ Term Frequency │ └────────────────────────────────────────────────┘ vs TF-IDF (linear, no saturation): Score increases linearly with term frequency → 100 occurrences = 100× more important (unrealistic) BM25 says: → 10 occurrences ≈ 8× importance (more realistic) → 100 occurrences ≈ 10× importance (saturates) DOCUMENT LENGTH NORMALIZATION: Long doc (1000 words) with 5 mentions of "python" vs Short doc (100 words) with 5 mentions of "python" TF-IDF: Scores same (both have 5 mentions) BM25: Short doc scores higher (higher density)
Application
BM25 in RAG Systems
BM25 IN HYBRID RAG: WHY USE BM25 + VECTORS? Query: "What is the HTTP 418 error code?" Vector search alone: - Finds: "HTTP error codes explained" - Finds: "Network status responses" - MISSES: Doc specifically about "418 teapot" (embedding doesn't capture exact "418") BM25 alone: - Finds: Exact matches for "HTTP 418" - MISSES: Semantically related content (doesn't understand "error" ≈ "status code") HYBRID (BM25 + Vector): ✓ Finds exact "418" matches (BM25) ✓ Finds semantic relatives (Vector) → Best of both worlds --- IMPLEMENTATION: from rank_bm25 import BM25Okapi # Index documents corpus = [doc.split() for doc in documents] bm25 = BM25Okapi(corpus) # Search query = "python async programming" scores = bm25.get_scores(query.split()) top_docs = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:5] --- HYBRID RETRIEVAL FUSION: def hybrid_search(query, k=10): # BM25 retrieval bm25_results = bm25_search(query, k=k*2) # Vector retrieval vector_results = vector_search(query, k=k*2) # Reciprocal Rank Fusion scores = {} for rank, doc in enumerate(bm25_results): scores[doc] = scores.get(doc, 0) + 1/(60 + rank) for rank, doc in enumerate(vector_results): scores[doc] = scores.get(doc, 0) + 1/(60 + rank) # Return top-k by combined score return sorted(scores.items(), key=lambda x: x[1], reverse=True)[:k] WHEN TO USE EACH: ┌─────────────────┬─────────────┬───────────────┐ │ Query Type │ BM25 │ Vector │ ├─────────────────┼─────────────┼───────────────┤ │ Exact terms │ ★★★★★ │ ★★ │ │ Code/IDs │ ★★★★★ │ ★ │ │ Semantic │ ★★ │ ★★★★★ │ │ Synonyms │ ★ │ ★★★★★ │ │ Typos │ ★ │ ★★★★ │ │ Multilingual │ ★ │ ★★★★★ │ └─────────────────┴─────────────┴───────────────┘ Best practice: Use both (hybrid retrieval)

Interactive Exercise

Choose Retrieval Strategy

You're building a code documentation search. Users will search for things like "asyncio.gather example" or "how to handle exceptions". Which retrieval approach would you use and why?

Pro Tips
  • BM25 excels at exact API/function name matching
  • Tune k1 (1.2-2.0) and b (0.5-0.8) for your corpus
  • Hybrid retrieval typically outperforms either alone
  • BM25 is fast and requires no GPU - good baseline

Related Terms