Retrieval-Augmented Generation (RAG)

Interactive

Learn how RAG lets an LLM answer questions using relevant external documents fetched at query time.

Try the interactive tools (2)
Difficulty intermediate
Read time 8 min
rag retrieval llm embeddings vector-search vector-database knowledge-grounding nlp
Updated March 9, 2026

What Is Retrieval-Augmented Generation (RAG)?

Retrieval-Augmented Generation (RAG) is a way to make a language model feel like it “looked something up” before answering.

Analogy: imagine an extremely smart colleague with a decent memory… but who also has access to a well-organized library. When you ask a question, they first walk to the right shelf, grab a few relevant pages, skim them, and then answer—using those pages as evidence. That “walk to the shelf” step is retrieval; the final response is generation.

Technically, RAG is a pattern where an LLM (the generator) is given retrieved context (snippets from documents, databases, or knowledge bases) at inference time, so the answer can be based on your data instead of only what the model memorized during training.

Why Does It Matter?

LLMs are impressive, but they have two chronic limitations:

  1. Their built-in knowledge is frozen at training time.
  2. They can’t read your entire data source (wikis, PDFs, policies, tickets, codebases) in one go because of context window limits.

RAG matters because it helps you build systems that:

  • Answer questions using current and private information (company docs, product manuals, internal policies).
  • Provide traceability: you can show the user what text the model used, which is a big deal for trust and auditing.
  • Update knowledge without retraining the model—just update the document store and re-index.

In practice, RAG is the backbone of many “chat with your data” applications and internal copilots.

How It Works

A clean mental model: Index first, then retrieve, then generate.

1) Prepare the knowledge (ingestion)

You start with your source material: PDFs, docs, web pages, tickets, code, etc.

Typical steps:

  • Clean text (remove boilerplate, fix encoding).
  • Chunk into pieces (e.g., 200–800 tokens). Chunking matters because retrieval works better on smaller, focused passages, and the model can’t ingest infinite text anyway.

Example: A 40-page HR policy becomes ~200 chunks like “Vacation policy — carryover rules”, “Sick leave — documentation”, etc.

2) Turn chunks into vectors (embeddings)

Each chunk is converted into an embedding: a list of numbers that represents meaning (roughly: “what this text is about”).

Intuition: embeddings place text into a “meaning-space” so that similar ideas land near each other, even if the wording differs.

3) Store them in an index (vector database / vector index)

You store embeddings (and their original text) in a vector index. This index supports fast “find the most similar vectors” queries.

4) At question time, retrieve relevant chunks

When a user asks a question:

  1. Embed the question.
  2. Search the vector index for the top-k most similar chunks (e.g., k=5).
  3. Optionally re-rank results with a stronger model (common in higher-quality RAG systems).

A common similarity measure is cosine similarity:

sim(q,d)=qdqd\text{sim}(q, d) = \frac{q \cdot d}{|q||d|}

Intuition before math: treat embeddings like arrows in space; cosine similarity measures how closely those arrows point in the same direction. The closer the directions, the more semantically related the texts likely are.

5) Assemble a prompt and generate the answer

You create a prompt like:

  • System instructions (“Answer using the provided context. If missing, say you don’t know.”)
  • The user question
  • The retrieved chunks (often called “context”)

Then the LLM generates an answer conditioned on those chunks. This is the “augmented” part: generation is now anchored in retrieved evidence.

A tiny concrete walkthrough

Suppose your internal doc says:

“Refunds are available within 30 days for unopened items. Opened items are eligible only for store credit.”

User asks: “Can I get my money back if I opened it?”

Retrieval finds the refund chunk. The model answers:

“Opened items aren’t eligible for a cash refund; they’re eligible for store credit.”

Without RAG, the model might guess based on generic e-commerce norms. With RAG, it can align to your policy.

Key Terminology

  • Embedding: A numeric representation of text (or images) capturing semantic meaning, used to compare similarity.
  • Chunking: Splitting documents into smaller passages to improve retrieval quality and fit context limits.
  • Retriever: The component that selects relevant chunks for a query (often via vector similarity search).
  • Vector index / vector database: A data structure/system optimized to store embeddings and quickly return nearest neighbors.
  • Grounding (and provenance): Using retrieved sources to anchor outputs; provenance means you can point to the supporting passages.

Real-World Applications

  • Internal knowledge assistants: “How do we file expenses?” answered from your finance policy docs.
  • Customer support copilots: Draft replies using product manuals and prior resolved tickets.
  • Developer tools: “Where is this API defined?” answered by retrieving relevant code chunks.
  • Search + synthesis: Retrieval finds the best passages across many documents; generation turns them into a concise explanation.
  • Multimodal RAG: Retrieving not only text but also diagrams/tables/images (or their extracted representations) so answers can reflect visual documents too.

Common Misconceptions

  1. “RAG eliminates hallucinations.” It reduces them when retrieval is good, but it doesn’t magically guarantee truth. If retrieval fetches irrelevant chunks (or misses the right ones), the model may still improvise. Guardrails (like “say you don’t know”) and evaluation matter.

  2. “RAG is just keyword search + LLM.” Keyword search helps, but modern RAG typically relies on semantic retrieval (embeddings), which can match meaning even when terms differ. Many production systems use hybrids (keyword + vector) for robustness.

  3. “More context is always better.” Dumping 50 chunks into the prompt often worsens quality. You want the smallest set of the most relevant passages. Too much context dilutes signal and can confuse the model.

Further Reading

  • “Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks” (Lewis et al., 2020)
  • OpenAI Cookbook: “Retrieval augmented generation using Elasticsearch”
  • LangChain docs: “RAG” / “Retrieval”
  • LlamaIndex docs: “Understanding RAG”

Read the article first

These tools reinforce the concepts above — you'll get more out of them after reading through the article.

Interactive: RAG Pipeline Workbench

Simulate retrieval, prompt assembly, and answer generation together. Adjust top-k, reranking, context budget, and answer policy to see how RAG quality changes.

Guided scenario

How long do production API keys stay valid, when must teams rotate them, and what exception exists during customer migrations?

Goal: Surface the primary rule and the exception without letting support guidance outrank the actual policy.

Top-k3

Higher top-k raises recall, but only if the extra chunks still fit the prompt budget.

Context budget170 tok

Retrieved chunks compete for this budget before generation starts.

Answer policy

Evidence coverage

100%

3 of 3 required evidence keys grounded

Citation quality

strong

Hallucination risk

low

Pipeline latency

242 ms

Retrieved context uses 162/170 tokens

Retrieved chunks

Top-k after rerank

#1 Security Operations Handbook

Security · 92 tok · score 1.03

supportingin prompt

Production API keys stay valid for ninety days from issue time, and teams must rotate keys during the final seven days before expiration to avoid forced lockout.

#2 Security Operations Handbook

Security · 70 tok · score 0.92

supportingin prompt

Temporary exceptions are allowed only for customer migration windows and require security approval with an explicit end date.

#3 Customer Support Playbook

Support · 76 tok · score 0.64

partialtrimmed

Security rotation timing in production is owned by internal teams, so avoid giving policy commitments unless they are copied from approved handbook text.

Assembled prompt

Context budget decides which chunks survive

[system]

Answer only from the provided context. If the evidence is missing, say so.

[user]

How long do production API keys stay valid, when must teams rotate them, and what exception exists during customer migrations?

[retrieved context]

Security Operations Handbook (92 tok)

Production API keys stay valid for ninety days from issue time, and teams must rotate keys during the final seven days before expiration to avoid forced lockout.

Security Operations Handbook (70 tok)

Temporary exceptions are allowed only for customer migration windows and require security approval with an explicit end date.

Generated answer

grounded

Production API keys stay valid for ninety days from issue time. Teams should rotate them during the final seven days before expiration. Customer migration windows can use a temporary exception, but only with security approval and an explicit end date.

Security Operations HandbookSecurity Operations Handbook

Grounding status

Required evidence checklist
ninety day validity windowgrounded
final seven day rotation rulegrounded
migration exception with explicit approvalgrounded

Rerank is on, so authority and exact answer fit matter more than rough similarity.

The context budget trimmed at least one retrieved chunk, so some evidence never reached generation.

Higher top-k improves recall headroom, but only if the extra chunks still fit the prompt budget cleanly.

All required evidence keys are covered by the current retrieved context.

Interactive: RAG Failure Debugger

Explore why RAG still fails even when a retriever exists. Compare mitigation strategies across missed evidence, ranking failures, stale sources, and prompt overflow.

Failure preset

The right answer is split across chunk boundaries, so the retriever never surfaces the exact supporting sentence.

Broken stage: Chunking / indexing

Baseline failure

The answer sounds plausible but misses the one sentence that proves the rule.

Critical evidence never enters the candidate pool.

The model mentions rotation timing, but the approval rule is absent because no chunk contains the full exception sentence.

Choose a mitigation

Recommended for this failure: better chunking

Answer quality

high

Citation quality

strong

Latency / cost

+small

+small indexing cost

Hallucination risk

low

Mitigation outcome

The answer span stays intact, so the retriever can finally return it.

Both the main rule and exception sentence appear together in the retrieved context.

Why this changes the result

Overlap or semantic chunking fixes the root cause instead of trying to patch missing evidence later.

Compare to baseline

Baseline: Critical evidence never enters the candidate pool.

After mitigation: The answer span stays intact, so the retriever can finally return it.

Continue learning

Continue directly from here instead of returning to the top navigation.