Agents & Tools · intermediate

Retrieval-Augmented Generation (RAG)

RAG retrieves relevant documents from a corpus at query time and includes them in the prompt, letting an LLM answer from up-to-date or private information and cite its sources, all without retraining.

Published May 29, 2026

Explanation

The pipeline has two phases. At indexing time: chunk your documents, embed the chunks, and store the embeddings in a vector database. At query time: embed the user's question, retrieve the top-K nearest chunks, place them in the prompt ahead of the question, and let the LLM generate the answer.
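The steps above can be sketched end to end in a few lines. This is a minimal illustration, not a production design: the `embed` function is a toy bag-of-words counter standing in for a real embedding model, the "vector database" is a plain Python list, and the example chunks, function names, and prompt wording are all invented here. The final prompt is printed rather than sent to an LLM.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy embedding: word counts (an assumption; real systems use a learned model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def build_index(chunks: list[str]) -> list[tuple[str, Counter]]:
    """Indexing time: embed every chunk once and keep it next to its text."""
    return [(chunk, embed(chunk)) for chunk in chunks]


def retrieve(index: list[tuple[str, Counter]], question: str, k: int = 2) -> list[str]:
    """Query time: embed the question and return the k most similar chunks."""
    q = embed(question)
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:k]]


def build_prompt(question: str, context: list[str]) -> str:
    """Place the retrieved chunks, numbered for citation, ahead of the question."""
    sources = "\n".join(f"[{i}] {c}" for i, c in enumerate(context, start=1))
    return f"Answer using only these sources:\n{sources}\n\nQuestion: {question}"


# Hypothetical help-center chunks, already split.
chunks = [
    "A refund is issued within 14 days of purchase.",
    "Our office is closed on public holidays.",
    "To request a refund, open a ticket in the help center.",
]
index = build_index(chunks)
question = "How do I get a refund?"
top_chunks = retrieve(index, question, k=2)
prompt = build_prompt(question, top_chunks)
print(prompt)  # this string is what would be sent to the LLM
```

Swapping `embed` for a real embedding model and the list for a vector database changes the quality and scale of retrieval, but not the shape of the pipeline.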

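RAG addresses three of the LLM's biggest limitations: knowledge cutoff (you can re-index new content as often as daily), private data (the model was never trained on your data, but it can read the retrieved passages at query time), and hallucination (answers are grounded in retrieved text and can cite specific sources, which reduces fabrication but does not eliminate it).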

It is the dominant pattern for "chat with your docs", customer support bots, and internal knowledge tools.

Examples

"Chat with your PDFs" — Notion, Glean, ChatGPT custom GPTs.
Customer-support bot that cites help-center articles by URL.

When to use retrieval-augmented generation

When the model needs information that is not baked into its weights — fresh, private, or domain-specific.

Embedding · Architecture

An embedding is a list of numbers (a vector) that represents a piece of input — a word, a sentence, an image — in a space where similar things end up close together.
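A small worked example may make "close together" concrete. The three vectors below are invented for illustration (real embeddings have hundreds or thousands of dimensions and come from a trained model); cosine similarity is one common way to measure closeness, assumed here.

```python
import math


def cosine_similarity(u: list[float], v: list[float]) -> float:
    """Cosine of the angle between u and v: near 1.0 means similar direction."""
    dot = sum(x * y for x, y in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))


# Hypothetical 3-dimensional embeddings, made up for this sketch.
laptop = [0.9, 0.1, 0.0]
notebook = [0.8, 0.2, 0.1]
banana = [0.0, 0.1, 0.9]

sim_related = cosine_similarity(laptop, notebook)
sim_unrelated = cosine_similarity(laptop, banana)
print(sim_related > sim_unrelated)  # True
```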

Vector Database · Agents & Tools

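A vector database stores high-dimensional embeddings and answers "find the K nearest vectors to this query" quickly, typically by using approximate nearest-neighbor indexes rather than comparing the query against every stored vector. It is the retrieval engine behind most RAG systems.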
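To show what that query computes, here is a brute-force sketch that scans every stored vector. The class name and its methods are hypothetical, and Euclidean distance is an assumed metric; a real vector database exposes a different API and avoids the full scan.

```python
import heapq
import math


class BruteForceVectorStore:
    """Exact nearest-neighbor search by scanning every stored vector."""

    def __init__(self) -> None:
        self._items: list[tuple[str, list[float]]] = []

    def add(self, item_id: str, vector: list[float]) -> None:
        self._items.append((item_id, vector))

    def nearest(self, query: list[float], k: int) -> list[str]:
        """Return ids of the k vectors closest to the query, nearest first."""
        closest = heapq.nsmallest(
            k, self._items, key=lambda item: math.dist(query, item[1])
        )
        return [item_id for item_id, _ in closest]


store = BruteForceVectorStore()
store.add("doc-a", [0.0, 0.0])
store.add("doc-b", [1.0, 1.0])
store.add("doc-c", [5.0, 5.0])
result = store.nearest([0.9, 0.8], k=2)
print(result)  # ['doc-b', 'doc-a']
```

The scan costs time proportional to the number of stored vectors per query, which is the cost that specialized indexes are designed to reduce.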

Semantic Search · Agents & Tools

Semantic search ranks documents by meaning rather than keyword match, using embedding similarity. "Affordable laptops" can match "cheap notebooks" even with no overlapping words.

Context Window · Inference

The context window is the maximum number of tokens an LLM can consider in a single call — prompt plus generated output combined.
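Because prompt and output share one budget, a RAG system has to decide how many retrieved chunks fit. The sketch below is one simple way to do that, under loud assumptions: `count_tokens` counts whitespace-separated words as a crude stand-in for a real tokenizer, and the function names and numbers are invented for illustration.

```python
def count_tokens(text: str) -> int:
    """Crude estimate: one token per word (an assumption; real tokenizers differ)."""
    return len(text.split())


def fit_chunks(
    ranked_chunks: list[str],
    question: str,
    context_window: int,
    reserved_for_output: int,
) -> list[str]:
    """Keep the best-ranked chunks that fit alongside the question and the answer."""
    budget = context_window - reserved_for_output - count_tokens(question)
    selected = []
    for chunk in ranked_chunks:  # assumed to be sorted best-first
        cost = count_tokens(chunk)
        if cost > budget:
            break  # stop at the first chunk that no longer fits
        selected.append(chunk)
        budget -= cost
    return selected


ranked = ["a b c d", "e f g", "h i j k l"]
kept = fit_chunks(ranked, "what is x", context_window=20, reserved_for_output=10)
print(kept)  # ['a b c d', 'e f g']
```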

Fine-tuning · Training

Fine-tuning continues training a pretrained model on a smaller, task-specific dataset, adjusting its weights to specialize behavior or knowledge.

Hallucination · Evaluation

A hallucination is a confidently stated, plausible-sounding LLM output that is factually wrong. It is the failure mode that most often surprises non-expert users.

Agent · Agents & Tools

An AI agent is an LLM-driven system that decides which actions to take, executes them via tools, observes the results, and iterates until a goal is met.
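The decide, act, observe loop in that definition can be sketched as follows. Everything here is hypothetical: `scripted_policy` is a hard-coded stand-in for the LLM's decision, `search_docs` is a toy tool backed by a dictionary, and the step limit is an assumed safeguard against a loop that never finishes.

```python
from typing import Callable


def search_docs(query: str) -> str:
    """Toy tool: look up a canned answer (stands in for a real retrieval call)."""
    docs = {"refund window": "Refunds are accepted within 14 days."}
    return docs.get(query, "no result")


TOOLS: dict[str, Callable[[str], str]] = {"search_docs": search_docs}


def scripted_policy(goal: str, observations: list[str]) -> tuple[str, str]:
    """Stand-in for the LLM: returns (action name, argument)."""
    if not observations:
        return ("search_docs", "refund window")
    return ("finish", observations[-1])


def run_agent(goal: str, policy, max_steps: int = 5) -> str:
    observations: list[str] = []
    for _ in range(max_steps):
        action, argument = policy(goal, observations)  # decide
        if action == "finish":
            return argument
        observations.append(TOOLS[action](argument))  # act, then observe
    return "stopped: step limit reached"


answer = run_agent("How long is the refund window?", scripted_policy)
print(answer)  # Refunds are accepted within 14 days.
```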

Chunking · Agents & Tools

Chunking is the process of splitting source documents into smaller passages before embedding them for retrieval. Chunk size and boundaries control how relevant retrievals will be.
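One common approach, shown here as a sketch rather than a recommendation, is fixed-size chunks with overlap so that a sentence cut at one boundary still appears whole in a neighboring chunk. The function name is hypothetical, and splitting on whitespace is an assumption; real pipelines often split on tokens, sentences, or headings instead.

```python
def chunk_words(text: str, chunk_size: int, overlap: int) -> list[str]:
    """Split text into chunks of chunk_size words, each sharing overlap words with the previous one."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last chunk already reaches the end of the text
    return chunks


pieces = chunk_words("one two three four five six seven", chunk_size=4, overlap=1)
print(pieces)  # ['one two three four', 'four five six seven']
```

Larger chunks carry more context per retrieval but dilute the match with unrelated text; smaller chunks match more precisely but may lose the surrounding context an answer needs.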

Sources

RAG paper (arXiv)