Agents & Tools · intermediate
BM25 (Okapi BM25)
BM25 is the classical keyword-based ranking algorithm: a refined TF-IDF that scores documents by query-term frequency, document length, and corpus-wide rarity. The keyword side of hybrid search.
Explanation
Decades old, BM25 remains a strong lightweight baseline for keyword retrieval. It is backed by an inverted index that grows roughly linearly with corpus size, runs on CPUs without GPUs or model inference, and excels on the exact-match queries where embeddings underperform: product names, error codes, version numbers, jargon.
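To make the three ingredients (term frequency, document length, corpus-wide rarity) concrete, here is a minimal sketch of BM25 scoring in plain Python. It assumes a naive whitespace tokenizer, the commonly used defaults k1 = 1.2 and b = 0.75, and the always-positive IDF variant used by Lucene-based engines. The function names and sample documents are hypothetical, and production engines score against an inverted index rather than looping over every document.

```python
import math
from collections import Counter


def tokenize(text):
    # Naive lowercase whitespace tokenizer (an assumption; real engines use analyzers).
    return text.lower().split()


def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Return one BM25 score per document in `docs` for `query`."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(toks) for toks in tokenized) / n_docs

    # Document frequency: in how many documents each term appears (rarity).
    df = Counter(term for toks in tokenized for term in set(toks))

    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        score = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue
            # Rare terms get a higher weight; the "1 +" keeps the IDF positive.
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Term frequency saturates via k1; b penalises longer-than-average docs.
            length_norm = k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * tf[term] * (k1 + 1) / (tf[term] + length_norm)
        scores.append(score)
    return scores


# Hypothetical mini-corpus: only the first document contains the error code.
docs = [
    "retry failed with error code E1234 on upload",
    "general guide to handling upload failures and retries",
    "release notes for version 2.4.1",
]
scores = bm25_scores("E1234", docs)
best = max(range(len(docs)), key=scores.__getitem__)
```

Only the document containing the literal token receives a non-zero score, which is the exact-match behaviour described above. An embedding model might instead rank the general troubleshooting guide close to it.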
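Modern RAG systems often run BM25 alongside vector search in a hybrid setup. Elasticsearch and OpenSearch use BM25 as their default text scoring, and vector databases such as Weaviate and Qdrant offer BM25-style keyword search for hybrid queries. Postgres full-text search (tsvector with ts_rank) fills a similar role, but its built-in ranking functions are not BM25; getting BM25 in Postgres requires an extension.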
A common surprise: on technical corpora (codebases, internal docs heavy in proper nouns), BM25 alone often beats vector search alone. Hybrid usually outperforms either one.
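One common way to combine the two result lists in a hybrid setup is Reciprocal Rank Fusion (RRF), which uses only each document's rank in each list and so avoids comparing BM25 scores with vector similarities directly. The sketch below is illustrative: the document IDs are made up, and k = 60 is a conventional default rather than a tuned value.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document IDs into a single ranking."""
    fused = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            # Each list contributes 1 / (k + rank); higher ranks contribute more.
            fused[doc_id] += 1.0 / (k + rank)
    # Sort by fused score, best first.
    return sorted(fused, key=fused.get, reverse=True)


# Hypothetical top-3 results from each retriever.
bm25_hits = ["doc_a", "doc_b", "doc_c"]
vector_hits = ["doc_b", "doc_d", "doc_a"]

fused = reciprocal_rank_fusion([bm25_hits, vector_hits])
```

Documents that appear in both lists (doc_a and doc_b here) rise above documents that only one retriever returned.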
Examples
- A codebase search where BM25 finds every file containing the exact function name, while vector search alone often misses some of them.
- An Elasticsearch index serving as the BM25 half of a hybrid RAG pipeline.
Frequently asked
What is BM25?
BM25 is the classical keyword-based ranking algorithm: a refined TF-IDF that scores documents by query-term frequency, document length, and corpus-wide rarity. The keyword side of hybrid search.
What is an example of BM25?
A codebase search where BM25 finds every file containing the exact function name, while vector search alone often misses some of them.
How is BM25 related to Hybrid Search?
BM25 and Hybrid Search are both Agents & Tools concepts. Hybrid search combines vector (semantic) and keyword (BM25) retrieval and fuses their results, usually via Reciprocal Rank Fusion, to get the best of both: semantic recall and exact-match precision.
Is BM25 considered intermediate?
BM25 is generally considered intermediate-level material in the AI and LLM space.