BM25 | HyperKit.ai

Definition

BM25 (Best Matching 25) is a ranking function used for information retrieval that scores documents based on term frequency and document length. It's a probabilistic model that improves upon TF-IDF by handling term saturation and document length normalization more effectively.

In RAG systems, BM25 is often used as a keyword-based retrieval method, complementing dense vector (semantic) search for hybrid retrieval approaches.

Key Concepts

Term frequency saturation: Diminishing returns for repeated terms
Document length normalization: Adjusts for document size
IDF component: Rare terms weighted more heavily
Sparse retrieval: Based on exact term matching

Examples

Formula

BM25 Scoring Function

BM25 FORMULA:

For a query Q with terms q1, q2, ..., qn:

score(D, Q) = Σ IDF(qi) × (f(qi,D) × (k1 + 1))
              ─────────────────────────────────
              f(qi,D) + k1 × (1 - b + b × |D|/avgdl)

WHERE:
- f(qi, D) = frequency of term qi in document D
- |D| = length of document D (in words)
- avgdl = average document length in corpus
- k1 = term frequency saturation parameter (1.2-2.0)
- b = document length normalization (0.75 typical)
- IDF(qi) = inverse document frequency of qi

---

INTUITION:

Term Frequency Effect (with saturation):
┌────────────────────────────────────────────────┐
│ Score                                          │
│   │                          ____...           │
│   │                    _____/                  │
│   │               ____/                        │
│   │          ____/                             │
│   │     ____/          BM25 (saturates)        │
│   │____/                                       │
│   │                                            │
│   └────────────────────────────────────────    │
│                Term Frequency                  │
└────────────────────────────────────────────────┘

vs TF-IDF (linear, no saturation):
Score increases linearly with term frequency
→ 100 occurrences = 100× more important (unrealistic)

BM25 says:
→ 10 occurrences ≈ 8× importance (more realistic)
→ 100 occurrences ≈ 10× importance (saturates)

DOCUMENT LENGTH NORMALIZATION:

Long doc (1000 words) with 5 mentions of "python"
vs
Short doc (100 words) with 5 mentions of "python"

TF-IDF: Scores same (both have 5 mentions)
BM25: Short doc scores higher (higher density)

Application

BM25 in RAG Systems

BM25 IN HYBRID RAG:

WHY USE BM25 + VECTORS?

Query: "What is the HTTP 418 error code?"

Vector search alone:
- Finds: "HTTP error codes explained"
- Finds: "Network status responses"
- MISSES: Doc specifically about "418 teapot"
  (embedding doesn't capture exact "418")

BM25 alone:
- Finds: Exact matches for "HTTP 418"
- MISSES: Semantically related content
  (doesn't understand "error" ≈ "status code")

HYBRID (BM25 + Vector):
✓ Finds exact "418" matches (BM25)
✓ Finds semantic relatives (Vector)
→ Best of both worlds

---

IMPLEMENTATION:

from rank_bm25 import BM25Okapi

# Index documents
corpus = [doc.split() for doc in documents]
bm25 = BM25Okapi(corpus)

# Search
query = "python async programming"
scores = bm25.get_scores(query.split())
top_docs = sorted(range(len(scores)),
                  key=lambda i: scores[i],
                  reverse=True)[:5]

---

HYBRID RETRIEVAL FUSION:

def hybrid_search(query, k=10):
    # BM25 retrieval
    bm25_results = bm25_search(query, k=k*2)

    # Vector retrieval
    vector_results = vector_search(query, k=k*2)

    # Reciprocal Rank Fusion
    scores = {}
    for rank, doc in enumerate(bm25_results):
        scores[doc] = scores.get(doc, 0) + 1/(60 + rank)
    for rank, doc in enumerate(vector_results):
        scores[doc] = scores.get(doc, 0) + 1/(60 + rank)

    # Return top-k by combined score
    return sorted(scores.items(),
                  key=lambda x: x[1],
                  reverse=True)[:k]

WHEN TO USE EACH:

┌─────────────────┬─────────────┬───────────────┐
│ Query Type      │ BM25        │ Vector        │
├─────────────────┼─────────────┼───────────────┤
│ Exact terms     │ ★★★★★      │ ★★            │
│ Code/IDs        │ ★★★★★      │ ★             │
│ Semantic        │ ★★          │ ★★★★★        │
│ Synonyms        │ ★           │ ★★★★★        │
│ Typos           │ ★           │ ★★★★         │
│ Multilingual    │ ★           │ ★★★★★        │
└─────────────────┴─────────────┴───────────────┘

Best practice: Use both (hybrid retrieval)

Interactive Exercise

✎

Choose Retrieval Strategy

You're building a code documentation search. Users will search for things like "asyncio.gather example" or "how to handle exceptions". Which retrieval approach would you use and why?

Pro Tips

BM25 excels at exact API/function name matching
Tune k1 (1.2-2.0) and b (0.5-0.8) for your corpus
Hybrid retrieval typically outperforms either alone
BM25 is fast and requires no GPU - good baseline

Definition

Key Concepts

Examples

Interactive Exercise

Related Terms