BM25 Algorithm

Probabilistic ranking function for information retrieval with term frequency saturation

BM25 (Best Matching 25) Algorithm

BM25 is a probabilistic ranking function that estimates how relevant a document is to a search query. It is the default ranking function in Lucene and in search engines built on it, such as Elasticsearch, and it remains competitive with modern neural approaches for keyword-based search.

Interactive BM25 Explorer

(The original page embeds an interactive demo here: sliders vary k1 between "no saturation" and "high saturation" and b between "no normalization" and "full normalization", with a live term frequency saturation chart.)

BM25 vs TF-IDF

BM25 Advantages
  • Term frequency saturation prevents keyword stuffing
  • Better document length normalization
  • Probabilistic foundation
  • Tunable parameters (k1, b)
  • Superior performance in practice
TF-IDF Characteristics
  • Linear term frequency
  • Simple length normalization
  • No saturation effect
  • Fixed formula
  • Easier to understand

Python Implementation

import math

class BM25:
    def __init__(self, corpus, k1=1.2, b=0.75):
        self.k1 = k1
        self.b = b
        self.corpus = corpus
        self.doc_lengths = [len(doc.split()) for doc in corpus]
        self.avgdl = sum(self.doc_lengths) / len(self.doc_lengths)
        self.N = len(corpus)
        self.doc_freqs = self._calculate_doc_frequencies()
    
    def _calculate_doc_frequencies(self):
        doc_freqs = {}
        for doc in self.corpus:
            for term in set(doc.split()):
                doc_freqs[term] = doc_freqs.get(term, 0) + 1
        return doc_freqs
    
    def idf(self, term):
        n = self.doc_freqs.get(term, 0)
        return math.log((self.N - n + 0.5) / (n + 0.5) + 1)
    
    def score(self, query, doc_idx):
        doc = self.corpus[doc_idx]
        doc_terms = doc.split()
        doc_len = self.doc_lengths[doc_idx]
        
        score = 0
        for term in query.split():
            tf = doc_terms.count(term)
            if tf > 0:
                idf = self.idf(term)
                numerator = tf * (self.k1 + 1)
                denominator = tf + self.k1 * (1 - self.b + self.b * doc_len / self.avgdl)
                score += idf * (numerator / denominator)
        
        return score
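
A quick sanity check on a toy corpus (the documents and query here are made up for illustration):

corpus = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "dogs and cats make great pets",
]
bm25 = BM25(corpus)
for i in range(len(corpus)):
    print(i, round(bm25.score("cat mat", i), 3))
# Document 0 should score highest: it matches both query terms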

Mathematical Foundation

The Core Formula

The BM25 score for a document D given query Q is:

score(D,Q) = Σ IDF(qi) · (f(qi,D) · (k1 + 1)) / (f(qi,D) + k1 · (1 - b + b · |D|/avgdl))

Components Breakdown

| Component | Description | Typical Value |
|-----------|-------------|---------------|
| f(qi,D) | Term frequency of qi in document D | Varies |
| \|D\| | Length of document D in words | Varies |
| avgdl | Average document length in collection | Computed |
| k1 | Term frequency saturation parameter | 1.2 |
| b | Document length normalization strength | 0.75 |
| IDF(qi) | Inverse document frequency | Computed |

IDF Calculation

IDF(qi) = log((N - n(qi) + 0.5) / (n(qi) + 0.5))

Where:

  • N: Total number of documents in collection
  • n(qi): Number of documents containing term qi

This classic form goes negative for terms appearing in more than half of the collection, which is why practical implementations (including the code above) often add 1 inside the logarithm to keep IDF non-negative.
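
As a quick worked example (toy numbers chosen only for illustration): with N = 1000 documents, a query term appearing in n = 50 of them, and tf = 3 occurrences in a document 1.2x the average length, the default parameters give a per-term contribution of about 4.5:

import math

N, n = 1000, 50                  # collection size, document frequency
idf = math.log((N - n + 0.5) / (n + 0.5) + 1)            # ~2.99
tf, k1, b = 3, 1.2, 0.75
doc_len, avgdl = 120, 100
norm = 1 - b + b * doc_len / avgdl                        # 1.15
contribution = idf * (tf * (k1 + 1)) / (tf + k1 * norm)   # ~4.50
print(round(contribution, 2))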

Key Innovations

1. Term Frequency Saturation

Unlike TF-IDF's linear term frequency, BM25 uses a saturating function:

def saturation_function(tf, k1):
    """Diminishing returns for repeated terms"""
    return (tf * (k1 + 1)) / (tf + k1)

Benefits:

  • Prevents keyword stuffing
  • First occurrences matter most
  • Tunable via k1 parameter
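
For intuition, evaluating the saturation function above at a few term frequencies (with the default k1 = 1.2) shows how quickly the contribution flattens toward its k1 + 1 asymptote:

for tf in [1, 2, 5, 10, 100]:
    print(tf, round(saturation_function(tf, k1=1.2), 2))
# 1 -> 1.0, 2 -> 1.38, 5 -> 1.77, 10 -> 1.96, 100 -> 2.17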

2. Document Length Normalization

Sophisticated normalization accounting for document length variance:

def length_normalization(doc_length, avg_length, b):
    """Penalize longer-than-average documents, boost shorter ones"""
    return 1 - b + b * (doc_length / avg_length)

Effects of b parameter:

  • b = 0: No length normalization
  • b = 0.75: Balanced (default)
  • b = 1: Full normalization
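
Plugging a few document lengths into the normalization function above (average length 100, default b = 0.75) shows how this factor scales the BM25 denominator:

for doc_length in [50, 100, 200]:
    print(doc_length, length_normalization(doc_length, avg_length=100, b=0.75))
# 50  -> 0.625 (shorter than average: smaller denominator, higher score)
# 100 -> 1.0   (average length: no adjustment)
# 200 -> 1.75  (longer than average: larger denominator, lower score)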

3. Probabilistic Foundation

Based on the Probability Ranking Principle (PRP):

  • Documents ranked by probability of relevance
  • Derived from Robertson-Spärck Jones weight
  • Theoretically grounded in information theory
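
For reference, the Robertson-Spärck Jones weight underlying BM25's IDF is (using N and n(qi) as above; R is the number of known-relevant documents and r the number of those containing qi):

w(qi) = log(((r + 0.5) / (R - r + 0.5)) / ((n(qi) - r + 0.5) / (N - n(qi) - R + r + 0.5)))

With no relevance information available (R = r = 0), this reduces to the IDF formula given above.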

Implementation

Full Python Implementation

import numpy as np
from collections import Counter
from typing import List, Tuple

class BM25:
    def __init__(self, corpus: List[str], k1: float = 1.2, b: float = 0.75):
        """
        Initialize BM25 with a corpus of documents

        Args:
            corpus: List of documents (strings)
            k1: Term saturation parameter
            b: Length normalization parameter
        """
        self.k1 = k1
        self.b = b
        self.corpus = corpus
        self.N = len(corpus)

        # Precompute document lengths and average
        self.doc_lengths = [len(doc.split()) for doc in corpus]
        self.avgdl = np.mean(self.doc_lengths)

        # Build inverted index and document frequencies
        self.doc_freqs = {}
        self.inverted_index = {}
        self._build_index()

    def _build_index(self):
        """Build inverted index and calculate document frequencies"""
        for doc_id, doc in enumerate(self.corpus):
            terms = set(doc.lower().split())
            for term in terms:
                # Document frequency
                self.doc_freqs[term] = self.doc_freqs.get(term, 0) + 1

                # Inverted index
                if term not in self.inverted_index:
                    self.inverted_index[term] = []
                self.inverted_index[term].append(doc_id)

    def idf(self, term: str) -> float:
        """Calculate IDF for a term"""
        n = self.doc_freqs.get(term, 0)
        # Add 1 inside the log to avoid negative IDF
        return np.log((self.N - n + 0.5) / (n + 0.5) + 1)

    def score_document(self, query: str, doc_id: int) -> float:
        """Calculate BM25 score for a single document"""
        doc = self.corpus[doc_id]
        doc_len = self.doc_lengths[doc_id]

        # Count term frequencies in document
        doc_terms = Counter(doc.lower().split())
        query_terms = query.lower().split()

        score = 0
        for term in query_terms:
            if term in doc_terms:
                tf = doc_terms[term]
                idf = self.idf(term)

                # BM25 formula
                numerator = tf * (self.k1 + 1)
                denominator = tf + self.k1 * (1 - self.b + self.b * doc_len / self.avgdl)
                score += idf * (numerator / denominator)

        return score

    def search(self, query: str, k: int = 10) -> List[Tuple[int, float]]:
        """
        Search for top-k documents

        Returns:
            List of (doc_id, score) tuples sorted by score
        """
        # Get candidate documents (containing at least one query term)
        query_terms = set(query.lower().split())
        candidates = set()
        for term in query_terms:
            if term in self.inverted_index:
                candidates.update(self.inverted_index[term])

        # Score all candidates
        scores = []
        for doc_id in candidates:
            score = self.score_document(query, doc_id)
            scores.append((doc_id, score))

        # Sort by score and return top-k
        scores.sort(key=lambda x: x[1], reverse=True)
        return scores[:k]
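
A quick usage sketch (the corpus and query are invented for illustration):

corpus = [
    "the quick brown fox jumps over the lazy dog",
    "a fast brown fox leaps over a sleepy dog",
    "the dog sleeps all day long",
]
bm25 = BM25(corpus)
print(bm25.search("brown fox", k=2))
# The two fox documents should come back first, highest score leading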

Optimized Implementation

from collections import Counter

import numpy as np
from scipy.sparse import csr_matrix

class OptimizedBM25:
    """Memory-efficient BM25 with sparse representations"""

    def __init__(self, k1=1.2, b=0.75):
        self.k1 = k1
        self.b = b
        self.doc_vectors = {}  # Sparse document vectors
        self.idf_cache = {}    # Cached IDF values

    def index_documents(self, documents):
        """Build sparse index with precomputed values"""
        # Build vocabulary
        vocab = {}
        for doc in documents:
            for term in doc.split():
                if term not in vocab:
                    vocab[term] = len(vocab)

        # Create sparse document-term matrix
        rows, cols, data = [], [], []
        for doc_id, doc in enumerate(documents):
            term_counts = Counter(doc.split())
            for term, count in term_counts.items():
                rows.append(doc_id)
                cols.append(vocab[term])
                data.append(count)

        self.doc_term_matrix = csr_matrix(
            (data, (rows, cols)),
            shape=(len(documents), len(vocab))
        )

        # Precompute IDF values
        doc_freqs = (self.doc_term_matrix > 0).sum(axis=0).A1
        N = len(documents)
        self.idf_values = np.log((N - doc_freqs + 0.5) / (doc_freqs + 0.5) + 1)

        # Keep what query-time scoring needs (see the sketch below)
        self.vocab = vocab
        self.doc_lengths = np.array([len(doc.split()) for doc in documents])
        self.avgdl = self.doc_lengths.mean()
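
Query-time scoring can then run as one vectorized pass over the matrix. A sketch of a method that could be added to OptimizedBM25 (it relies on the vocab, doc_lengths, and avgdl attributes stored above):

    def score_query(self, query, k=10):
        """Score every document against the query in a single vectorized pass"""
        scores = np.zeros(self.doc_term_matrix.shape[0])
        norm = 1 - self.b + self.b * (self.doc_lengths / self.avgdl)
        for term in query.split():
            if term not in self.vocab:
                continue
            col = self.vocab[term]
            # Term counts for this term across all documents
            tf = self.doc_term_matrix.getcol(col).toarray().ravel()
            scores += self.idf_values[col] * (tf * (self.k1 + 1)) / (tf + self.k1 * norm)
        top = np.argsort(scores)[::-1][:k]
        return [(int(doc_id), float(scores[doc_id])) for doc_id in top]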

Variants and Extensions

BM25F (Field-weighted BM25)

Handles documents with multiple fields:

def bm25f_score(query, doc, field_weights):
    """
    Simplified BM25F for multi-field documents, e.g.
    field_weights = {'title': 2.0, 'body': 1.0, 'abstract': 1.5}
    """
    score = 0
    for field, weight in field_weights.items():
        # Per-field BM25 score (bm25_score is an assumed helper)
        field_score = bm25_score(query, doc[field])
        score += weight * field_score
    return score
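
Note that the snippet above weights per-field BM25 scores, which is a common simplification. The BM25F of Robertson and Zaragoza instead combines weighted, length-normalized term frequencies across fields first, then applies saturation once. A sketch of that combination step (all names here are illustrative):

def bm25f_combined_tf(term, doc, field_weights, field_b, avg_field_lengths):
    """Fold per-field term frequencies into one pseudo-frequency"""
    combined_tf = 0.0
    for field, weight in field_weights.items():
        tokens = doc[field].split()
        tf = tokens.count(term)
        # Per-field length normalization
        norm = 1 - field_b[field] + field_b[field] * len(tokens) / avg_field_lengths[field]
        combined_tf += weight * tf / norm
    # Saturation is then applied once:
    # idf * combined_tf * (k1 + 1) / (combined_tf + k1)
    return combined_tf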

BM25+

Lower-bounds the term frequency component so that a query term occurring in a very long document still contributes meaningfully instead of being normalized toward zero:

def bm25_plus(tf, idf, k1, b, doc_len, avgdl, delta=1.0):
    """
    BM25+ adds a per-term floor (delta) so that long documents
    containing a query term are not over-penalized
    """
    if tf == 0:
        return 0.0
    # Length normalization (was left undefined in the original snippet)
    norm = 1 - b + b * doc_len / avgdl
    standard_bm25 = idf * ((tf * (k1 + 1)) / (tf + k1 * norm))
    return standard_bm25 + delta * idf

BM25-Adpt

Adapts parameters to the query. The published BM25-Adpt derives a term-specific k1 from information gain; the heuristic below simply varies the parameters with query length:

class AdaptiveBM25:
    def adapt_parameters(self, query):
        """Heuristically adjust k1 and b based on query length"""
        query_length = len(query.split())

        # Shorter queries need less saturation
        if query_length <= 2:
            self.k1 = 0.8
            self.b = 0.5
        elif query_length <= 5:
            self.k1 = 1.2
            self.b = 0.75
        else:
            self.k1 = 1.5
            self.b = 0.8

Performance Optimization

1. Index-time Optimizations

import numpy as np

class IndexOptimizations:
    def __init__(self):
        self.posting_lists = {}   # Inverted index
        self.skip_pointers = {}   # For faster intersection
        self.doc_norms = {}       # Precomputed normalizations

    def build_skip_pointers(self, posting_list, skip_distance=None):
        """Add skip pointers for faster posting-list intersection"""
        if skip_distance is None:
            skip_distance = int(np.sqrt(len(posting_list)))

        skip_list = []
        for i in range(0, len(posting_list), skip_distance):
            if i + skip_distance < len(posting_list):
                skip_list.append((i, i + skip_distance))
        return skip_list

2. Query-time Optimizations

import heapq

def optimized_search(query, index, k=10):
    """
    Optimized search with early termination
    (index.get_batches and compute_scores_vectorized are assumed helpers)
    """
    top_k = []  # min-heap of (score, doc_id)

    # Process documents in chunks
    for doc_batch in index.get_batches():
        scores = compute_scores_vectorized(query, doc_batch)

        # Maintain top-k with the heap
        for doc_id, score in scores:
            if len(top_k) < k:
                heapq.heappush(top_k, (score, doc_id))
            elif score > top_k[0][0]:
                heapq.heapreplace(top_k, (score, doc_id))

    # Sort final results, best first
    return sorted(top_k, reverse=True)

3. Caching Strategies

from functools import lru_cache

class CachedBM25:
    def __init__(self, cache_size=10000):
        # Cache IDF values (_compute_idf is assumed to be defined on the class)
        self.idf = lru_cache(maxsize=cache_size)(self._compute_idf)

        # Cache query processing (_process_query likewise assumed)
        self.process_query = lru_cache(maxsize=1000)(self._process_query)

        # Cache document scores for popular queries
        self.score_cache = {}

Comparison with Modern Approaches

BM25 vs Dense Embeddings

| Aspect | BM25 | Dense Embeddings |
|--------|------|------------------|
| Exact Match | Excellent | Poor |
| Synonyms | Poor | Excellent |
| Typos | Poor | Good |
| Speed | Very Fast | Slower |
| Memory | Low | High |
| Training | Not needed | Required |
| Interpretability | High | Low |
| Domain Transfer | Good | Requires fine-tuning |

Hybrid Approaches

Combining BM25 with dense methods:

class HybridRetriever:
    def __init__(self, bm25_index, dense_index, alpha=0.5):
        self.bm25 = bm25_index
        self.dense = dense_index
        self.alpha = alpha  # Weight for BM25

    def search(self, query, k=10):
        # Get BM25 results (over-fetch to give the fusion more candidates)
        bm25_results = self.bm25.search(query, k=k*2)
        bm25_scores = {doc_id: score for doc_id, score in bm25_results}

        # Get dense results (self.encode is an assumed embedding helper)
        query_embedding = self.encode(query)
        dense_results = self.dense.search(query_embedding, k=k*2)
        dense_scores = {doc_id: score for doc_id, score in dense_results}

        # Combine scores
        all_docs = set(bm25_scores.keys()) | set(dense_scores.keys())
        combined_scores = []
        for doc_id in all_docs:
            bm25_score = bm25_scores.get(doc_id, 0)
            dense_score = dense_scores.get(doc_id, 0)

            # Normalize (assumed helper, sketched below) and combine
            final_score = self.alpha * normalize(bm25_score) + \
                          (1 - self.alpha) * normalize(dense_score)
            combined_scores.append((doc_id, final_score))

        # Sort and return top-k
        combined_scores.sort(key=lambda x: x[1], reverse=True)
        return combined_scores[:k]
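
The normalize call above is an assumed helper. One reasonable choice is min-max normalization over each retriever's candidate scores (a sketch; note it operates on the whole {doc_id: score} mapping rather than one value at a time). Reciprocal rank fusion is a common alternative that sidesteps score-scale mismatch by combining ranks instead of raw scores.

def minmax_normalize(scores):
    """Rescale a {doc_id: score} mapping into [0, 1]"""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc_id: 1.0 for doc_id in scores}
    return {doc_id: (s - lo) / (hi - lo) for doc_id, s in scores.items()}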

Best Practices

1. Parameter Tuning

import numpy as np

def grid_search_parameters(train_queries, relevance_judgments):
    """Find optimal k1 and b parameters
    (corpus, BM25, and compute_ndcg are assumed to be in scope)"""
    best_params = {'k1': 1.2, 'b': 0.75}
    best_score = 0

    for k1 in np.arange(0.5, 3.0, 0.1):
        for b in np.arange(0.0, 1.0, 0.05):
            bm25 = BM25(corpus, k1=k1, b=b)

            # Evaluate on training queries
            ndcg_scores = []
            for query, relevant_docs in zip(train_queries, relevance_judgments):
                results = bm25.search(query)
                ndcg = compute_ndcg(results, relevant_docs)
                ndcg_scores.append(ndcg)

            avg_ndcg = np.mean(ndcg_scores)
            if avg_ndcg > best_score:
                best_score = avg_ndcg
                best_params = {'k1': k1, 'b': b}

    return best_params

2. Preprocessing Pipeline

import re

from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

class BM25Preprocessor:
    def __init__(self):
        self.stemmer = PorterStemmer()
        self.stop_words = set(stopwords.words('english'))

    def preprocess(self, text):
        """Standard preprocessing for BM25"""
        # Lowercase
        text = text.lower()

        # Remove punctuation
        text = re.sub(r'[^\w\s]', ' ', text)

        # Tokenize
        tokens = text.split()

        # Remove stopwords (optional)
        tokens = [t for t in tokens if t not in self.stop_words]

        # Stemming (optional)
        tokens = [self.stemmer.stem(t) for t in tokens]

        return ' '.join(tokens)

3. Domain-Specific Adaptations

class DomainBM25:
    """BM25 with domain-specific enhancements"""

    def __init__(self, domain='general'):
        self.domain = domain
        self.set_domain_parameters()

    def set_domain_parameters(self):
        """Set parameters based on domain characteristics"""
        domain_configs = {
            'legal':      {'k1': 1.5, 'b': 0.9},   # Long documents
            'twitter':    {'k1': 0.8, 'b': 0.3},   # Short texts
            'scientific': {'k1': 1.2, 'b': 0.75},  # Standard
            'news':       {'k1': 1.0, 'b': 0.7},   # Medium length
            'product':    {'k1': 0.9, 'b': 0.5}    # Product descriptions
        }
        config = domain_configs.get(self.domain, {'k1': 1.2, 'b': 0.75})
        self.k1 = config['k1']
        self.b = config['b']

Production Deployment

Elasticsearch Integration

{ "settings": { "index": { "similarity": { "custom_bm25": { "type": "BM25", "k1": 1.2, "b": 0.75 } } } }, "mappings": { "properties": { "content": { "type": "text", "similarity": "custom_bm25" } } } }

Scalability Considerations

  1. Sharding: Distribute index across multiple machines
  2. Caching: Cache popular queries and IDF values
  3. Approximation: Use top-k approximation algorithms
  4. Compression: Compress posting lists (e.g., delta-encode document IDs, as sketched below)
  5. Parallelization: Process queries in parallel
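
Expanding on item 4: posting lists are typically stored as gaps between sorted document IDs, which are small integers that compress well. A minimal sketch (production systems add byte-level encodings such as varints on top):

def delta_encode(postings):
    """Turn sorted doc IDs into the first ID plus successive gaps"""
    return [postings[0]] + [b - a for a, b in zip(postings, postings[1:])]

def delta_decode(gaps):
    """Recover the original doc IDs from the gap encoding"""
    doc_ids, total = [], 0
    for gap in gaps:
        total += gap
        doc_ids.append(total)
    return doc_ids

assert delta_decode(delta_encode([3, 7, 18, 19])) == [3, 7, 18, 19]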

Evaluation Metrics

Standard Metrics

import numpy as np

def evaluate_bm25(bm25_index, test_queries, relevance_judgments):
    """Comprehensive evaluation of BM25 performance
    (the individual metric functions are assumed helpers; two are sketched below)"""
    metrics = {
        'map': [],      # Mean Average Precision
        'mrr': [],      # Mean Reciprocal Rank
        'ndcg@10': [],  # Normalized Discounted Cumulative Gain
        'p@10': [],     # Precision at 10
        'r@10': []      # Recall at 10
    }

    for query, relevant in zip(test_queries, relevance_judgments):
        results = bm25_index.search(query, k=10)
        retrieved = [doc_id for doc_id, _ in results]

        # Calculate metrics
        metrics['map'].append(average_precision(retrieved, relevant))
        metrics['mrr'].append(reciprocal_rank(retrieved, relevant))
        metrics['ndcg@10'].append(ndcg_at_k(retrieved, relevant, 10))
        metrics['p@10'].append(precision_at_k(retrieved, relevant, 10))
        metrics['r@10'].append(recall_at_k(retrieved, relevant, 10))

    # Average across queries
    return {k: np.mean(v) for k, v in metrics.items()}
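
Two of the assumed metric helpers might look like this (relevant is treated as a set of document IDs):

def precision_at_k(retrieved, relevant, k):
    """Fraction of the top-k retrieved documents that are relevant"""
    return sum(1 for doc_id in retrieved[:k] if doc_id in relevant) / k

def reciprocal_rank(retrieved, relevant):
    """Reciprocal of the rank of the first relevant result, 0 if none"""
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0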

Common Pitfalls and Solutions

1. Vocabulary Mismatch

Problem: Query terms do not appear in the documents
Solution: Query expansion, synonyms, stemming

2. Parameter Sensitivity

Problem: Performance varies with k1 and b
Solution: Grid search, domain-specific tuning

3. Length Bias

Problem: Favoring short or long documents
Solution: Adjust the b parameter, use field-specific scoring

4. Negative IDF for Common Terms

Problem: The classic IDF formula goes negative for terms appearing in more than half the documents
Solution: Adjust the IDF formula (e.g., add 1 inside the logarithm, as in the implementations above) or use a variant such as BM25+

Conclusion

BM25 remains a powerful and efficient ranking algorithm that often outperforms complex neural models for keyword-based search. Its probabilistic foundation, tunable parameters, and excellent performance on exact matching make it an essential component of modern search systems.

The interactive visualization demonstrates how term frequency saturation and document length normalization work together to produce robust relevance scores. Understanding BM25 is crucial for building effective search systems, whether used alone or in hybrid approaches with dense embeddings.

If you found this explanation helpful, consider sharing it with others.
