ModelTerms

Agents & Tools · intermediate

BM25 (Okapi BM25)

BM25 is the classical keyword-based ranking algorithm: a refined TF-IDF that scores documents by query-term frequency, document length, and corpus-wide rarity. The keyword side of hybrid search.

Explanation

Decades old, BM25 remains a strong lightweight keyword retriever. Its index grows roughly linearly with corpus size, it runs without GPUs, and it excels on the exact-match queries where embeddings underperform: product names, error codes, version numbers, jargon.
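
For reference, the three ingredients named above (term frequency, document length, rarity) combine in the commonly cited form of the scoring function shown here. The notation is an assumption of this sketch, not something the entry defines: f(q, D) is how often term q occurs in document D, |D| is the document length in tokens, avgdl is the average document length in the corpus, N is the number of documents, n(q) is the number of documents containing q, and k_1 and b are tuning parameters (values around k_1 = 1.2 and b = 0.75 are typical defaults). The IDF term has several published variants; the smoothed one below is only one of them.

```latex
\mathrm{score}(D, Q) = \sum_{q \in Q} \mathrm{IDF}(q) \cdot
  \frac{f(q, D)\,(k_1 + 1)}
       {f(q, D) + k_1 \left(1 - b + b \cdot \frac{|D|}{\mathrm{avgdl}}\right)}

\mathrm{IDF}(q) = \ln\!\left(1 + \frac{N - n(q) + 0.5}{n(q) + 0.5}\right)
```

Setting b = 0 switches off length normalization, and a larger k_1 lets repeated occurrences of a term keep adding to the score for longer before saturating.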

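Modern RAG systems often run BM25 alongside vector search in a hybrid setup. Elasticsearch and OpenSearch use BM25 as their default relevance scoring, and vector databases such as Weaviate and Qdrant offer BM25-style keyword search alongside vectors. Postgres full-text search (tsvector) fills the same keyword role, though its built-in ts_rank ranking is not BM25; BM25 scoring there typically comes from an extension.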

A common surprise: on technical corpora (codebases, internal docs heavy in proper nouns), BM25 alone often beats vector search alone. Hybrid usually beats either one on its own.
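
To make the exact-match behavior concrete, here is a minimal sketch of BM25 scoring in plain Python. It is illustrative rather than a production implementation: the whitespace tokenizer, the smoothed IDF variant, the defaults `k1=1.2` and `b=0.75`, the sample documents, and the error code `E1042` are all assumptions made for the example.

```python
import math
from collections import Counter


def tokenize(text):
    # Naive lowercase whitespace tokenizer (an assumption; real engines use analyzers).
    return text.lower().split()


def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Return one BM25 score per document for the given query."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    avgdl = sum(len(t) for t in tokenized) / n_docs
    # Document frequency: how many documents contain each term at least once.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        score = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue  # terms absent from the document contribute nothing
            # Rare terms get a larger weight (smoothed IDF variant).
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Term frequency saturates via k1; b controls length normalization.
            norm = tf[term] + k1 * (1 - b + b * len(tokens) / avgdl)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


# Hypothetical mini-corpus: only the first document contains the error code.
docs = [
    "connection refused error E1042 when starting the worker",
    "how to restart the worker after a network failure",
    "general troubleshooting guide for network problems",
]
scores = bm25_scores("E1042", docs)
best = max(range(len(docs)), key=scores.__getitem__)
```

Only the first document gets a non-zero score, because it is the only one containing the literal token. An embedding model has no such guarantee for an opaque identifier like an error code.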

Examples

  • A codebase search where BM25 finds every file containing the exact function name, while vector search alone often misses some of them.
  • An Elasticsearch index serving as the BM25 half of a hybrid RAG pipeline.

Frequently asked

How is BM25 related to Hybrid Search?

BM25 is the keyword half of hybrid search. Hybrid search combines vector (semantic) and keyword (BM25) retrieval and fuses their results, usually via Reciprocal Rank Fusion, to get the best of both: semantic recall and exact-match precision.
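
As a rough illustration of the fusion step, here is a minimal Reciprocal Rank Fusion sketch. Each document's fused score is the sum of 1 / (k + rank) over the ranked lists it appears in. The function name, the document ids, and the two input rankings are hypothetical, and `k=60` is a commonly used constant rather than a required value.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked lists of document ids; each list is ordered best first."""
    fused = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            # A document earns more credit the higher it ranks in each list.
            fused[doc_id] += 1.0 / (k + rank)
    # Highest fused score first.
    return sorted(fused, key=fused.get, reverse=True)


# Hypothetical results from the two retrievers for the same query.
bm25_ranking = ["doc_a", "doc_b", "doc_c"]
vector_ranking = ["doc_b", "doc_d", "doc_a"]
fused = reciprocal_rank_fusion([bm25_ranking, vector_ranking])
# fused == ["doc_b", "doc_a", "doc_d", "doc_c"]
```

Because RRF uses only ranks, it avoids having to put BM25 scores and embedding similarities on a common scale.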

Is BM25 considered intermediate?

BM25 is generally considered intermediate-level material in the AI and LLM space.

Hybrid Search · Agents & Tools

Hybrid search combines vector (semantic) and keyword (BM25) retrieval and fuses their results — usually via Reciprocal Rank Fusion — to get the best of both: semantic recall and exact-match precision.

Retrieval-Augmented Generation · Agents & Tools

RAG retrieves relevant documents from a corpus at query time and includes them in the prompt, letting an LLM answer with up-to-date, source-cited, private information without retraining.

Semantic Search · Agents & Tools

Semantic search ranks documents by meaning rather than keyword match, using embedding similarity. "Affordable laptops" can match "cheap notebooks" even with no overlapping words.

Embedding · Architecture

An embedding is a list of numbers (a vector) that represents a piece of input — a word, a sentence, an image — in a space where similar things end up close together.
