Concept
Sparse Vector Representation
Document: "The cat sat on the mat"
Vocabulary: [the, cat, sat, on, mat, dog, ran, ...]
Sparse vector (bag of words):

Index:   0    1    2    3   4    5    6   ...
Term:  [the, cat, sat,  on, mat, dog, ran, ...]
Value: [  2,   1,   1,   1,   1,   0,   0, ...]

("the" appears twice, so its value is 2; absent terms get 0)
TF-IDF weighted:

[0.1, 0.8, 0.7, 0.2, 0.8, 0, 0, ...]

(common words like "the" get low weight; rare words get high weight)
Most values are 0 → "sparse" representation
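The bag-of-words vector above can be built in a few lines. This is a minimal sketch; the small vocabulary list is illustrative, not a real vocabulary:

```python
from collections import Counter

# Illustrative vocabulary and document (matching the diagram above)
vocab = ["the", "cat", "sat", "on", "mat", "dog", "ran"]
doc = "The cat sat on the mat"

# Count term occurrences, then read off a value per vocabulary slot
counts = Counter(doc.lower().split())
vector = [counts.get(term, 0) for term in vocab]
print(vector)  # → [2, 1, 1, 1, 1, 0, 0]
```

With a realistic vocabulary of tens of thousands of terms, almost every slot would be 0, which is why sparse formats store only the nonzero entries.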
Implementation
BM25 Search
from rank_bm25 import BM25Okapi

class SparseRetriever:
    def __init__(self):
        self.documents = []
        self.bm25 = None

    def index(self, documents):
        """Build the BM25 index over a list of document strings."""
        self.documents = documents
        # Tokenize documents (simple lowercase + whitespace split)
        tokenized = [doc.lower().split() for doc in documents]
        self.bm25 = BM25Okapi(tokenized)

    def search(self, query, k=5):
        """Find documents with matching keywords, ranked by BM25 score."""
        tokenized_query = query.lower().split()
        scores = self.bm25.get_scores(tokenized_query)
        # Indices of the top-k scoring documents
        top_indices = sorted(
            range(len(scores)),
            key=lambda i: scores[i],
            reverse=True
        )[:k]
        return [
            {"doc": self.documents[i], "score": scores[i]}
            for i in top_indices
        ]
# Example usage:
retriever = SparseRetriever()
retriever.index([
    "Python programming tutorial",
    "Java development guide",
    "Python data science basics"
])

results = retriever.search("Python tutorial")
# Docs 1 and 3 rank highest (both contain "Python")
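To see what BM25Okapi computes under the hood, here is a from-scratch sketch of the Okapi BM25 scoring formula. The parameter values k1=1.5 and b=0.75 are common defaults, and the smoothed IDF variant used here is one of several in the literature, so exact scores may differ slightly from the library's:

```python
import math

def bm25_score(query_terms, doc_terms, corpus, k1=1.5, b=0.75):
    """Okapi BM25 score of one tokenized document against query terms.
    k1 and b are common defaults (assumptions, not from the text above)."""
    N = len(corpus)
    avgdl = sum(len(d) for d in corpus) / N
    score = 0.0
    for term in query_terms:
        n = sum(1 for d in corpus if term in d)        # docs containing term
        idf = math.log((N - n + 0.5) / (n + 0.5) + 1)  # smoothed IDF
        f = doc_terms.count(term)                      # term frequency in doc
        denom = f + k1 * (1 - b + b * len(doc_terms) / avgdl)
        score += idf * f * (k1 + 1) / denom
    return score

corpus = [d.lower().split() for d in [
    "Python programming tutorial",
    "Java development guide",
    "Python data science basics",
]]
query = "python tutorial".split()
scores = [bm25_score(query, doc, corpus) for doc in corpus]
```

The first document scores highest (it matches both query terms), the third scores lower (matches "python" only), and the second scores zero (no overlap), which matches the ranking the retriever above produces.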