Retrieval-Augmented Generation (RAG)
Learn how RAG lets an LLM answer questions using relevant external documents fetched at query time.
What Is Retrieval-Augmented Generation (RAG)?
Retrieval-Augmented Generation (RAG) is a way to make a language model feel like it “looked something up” before answering.
Analogy: imagine an extremely smart colleague with a decent memory… but who also has access to a well-organized library. When you ask a question, they first walk to the right shelf, grab a few relevant pages, skim them, and then answer—using those pages as evidence. That “walk to the shelf” step is retrieval; the final response is generation.
Technically, RAG is a pattern where an LLM (the generator) is given retrieved context (snippets from documents, databases, or knowledge bases) at inference time, so the answer can be based on your data instead of only what the model memorized during training.
Why Does It Matter?
LLMs are impressive, but they have two chronic limitations:
- Their built-in knowledge is frozen at training time.
- They can’t read your entire data source (wikis, PDFs, policies, tickets, codebases) in one go because of context window limits.
RAG matters because it helps you build systems that:
- Answer questions using current and private information (company docs, product manuals, internal policies).
- Provide traceability: you can show the user which passages the model was given, which supports trust and auditing.
- Update knowledge without retraining the model—just update the document store and re-index.
In practice, RAG is the backbone of many “chat with your data” applications and internal copilots.
How It Works
A clean mental model: Index first, then retrieve, then generate.
1) Prepare the knowledge (ingestion)
You start with your source material: PDFs, docs, web pages, tickets, code, etc.
Typical steps:
- Clean text (remove boilerplate, fix encoding).
- Chunk into pieces (e.g., 200–800 tokens). Chunking matters because retrieval works better on smaller, focused passages, and only a limited amount of text fits in the model's context window.
Example: A 40-page HR policy becomes ~200 chunks like “Vacation policy — carryover rules”, “Sick leave — documentation”, etc.
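The chunking step can be sketched in a few lines. This is a minimal illustration, not a production splitter: it treats whitespace-separated words as tokens (a real pipeline would normally use the embedding model's tokenizer), and the function names `chunk_tokens` and `chunk_text`, the default sizes, and the overlap between neighboring chunks are all assumptions made for the example.

```python
def chunk_tokens(tokens, chunk_size=200, overlap=40):
    """Split a token list into windows of chunk_size that share `overlap` tokens."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already reached the end of the text
    return chunks


def chunk_text(text, chunk_size=200, overlap=40):
    # Assumption: whitespace splitting stands in for a real tokenizer.
    words = text.split()
    return [" ".join(c) for c in chunk_tokens(words, chunk_size, overlap)]
```

With `chunk_size=4` and `overlap=1`, ten tokens become three chunks, each sharing one token with its neighbor. Overlap is one common way to keep a sentence that straddles a boundary intact in at least one chunk.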
2) Turn chunks into vectors (embeddings)
Each chunk is converted into an embedding: a list of numbers that represents meaning (roughly: “what this text is about”).
Intuition: embeddings place text into a “meaning-space” so that similar ideas land near each other, even if the wording differs.
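To make "text as a list of numbers" concrete, here is a toy embedding based on hashed word counts. It is deliberately not how learned embedding models work: it only counts words, so it cannot match different wording the way a trained model can. The names `toy_embed` and `dot` and the 64-dimension default are hypothetical, chosen only for this sketch.

```python
import math
import zlib


def toy_embed(text, dims=64):
    """Hashed bag-of-words vector scaled to unit length. Illustrative only."""
    vec = [0.0] * dims
    for word in text.lower().split():
        word = word.strip(".,;:!?\"'()")
        if word:
            # crc32 is deterministic across runs, unlike Python's built-in hash().
            vec[zlib.crc32(word.encode("utf-8")) % dims] += 1.0
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else vec


def dot(a, b):
    """Dot product; for unit-length vectors this equals cosine similarity."""
    return sum(x * y for x, y in zip(a, b))
```

Two texts that share words get a positive score, and a text compared with itself scores 1. In a real system a trained embedding model would take the place of `toy_embed`, and the rest of the pipeline would stay the same shape.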
3) Store them in an index (vector database / vector index)
You store embeddings (and their original text) in a vector index. This index supports fast “find the most similar vectors” queries.
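As a rough picture of what such an index does, the sketch below stores vectors next to their text and answers "most similar" queries by brute force. Production vector databases typically use approximate nearest-neighbor structures instead of scanning every row; the class name `TinyVectorIndex` and its methods are hypothetical and exist only for this illustration.

```python
import heapq
import math


class TinyVectorIndex:
    """Brute-force nearest-neighbor index; a stand-in for a real vector database."""

    def __init__(self):
        self._rows = []  # list of (unit_vector, text)

    @staticmethod
    def _normalize(vec):
        norm = math.sqrt(sum(x * x for x in vec))
        if norm == 0:
            raise ValueError("cannot index or query a zero vector")
        return [x / norm for x in vec]

    def add(self, vector, text):
        # Store the original text alongside its normalized embedding.
        self._rows.append((self._normalize(vector), text))

    def search(self, query, k=5):
        q = self._normalize(query)
        scored = (
            (sum(a * b for a, b in zip(q, vec)), text)
            for vec, text in self._rows
        )
        # nlargest returns the k best (score, text) pairs, highest score first.
        return heapq.nlargest(k, scored, key=lambda pair: pair[0])
```

Because vectors are normalized when they are added, the dot product computed in `search` is the cosine similarity between the query and each stored chunk.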
4) At question time, retrieve relevant chunks
When a user asks a question:
- Embed the question.
- Search the vector index for the top-k most similar chunks (e.g., k=5).
- Optionally re-rank results with a stronger model (common in higher-quality RAG systems).
A common similarity measure is cosine similarity.
Intuition before math: treat embeddings like arrows in space; cosine similarity measures how closely those arrows point in the same direction. The closer the directions, the more semantically related the texts likely are. For two embeddings a and b, cos(a, b) = (a · b) / (‖a‖ ‖b‖), which ranges from -1 (opposite directions) to 1 (same direction).
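The arrow intuition translates directly into code. The sketch below computes cosine similarity for two plain Python lists; the function name `cosine_similarity` is chosen for this example, and real systems usually rely on a numerical library or on the vector index itself rather than a hand-written loop.

```python
import math


def cosine_similarity(a, b):
    """cos(a, b) = (a . b) / (|a| * |b|); undefined when either vector is all zeros."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)
```

Arrows pointing the same way score 1 regardless of length, perpendicular arrows score 0, and opposite arrows score -1. Length is ignored, which is why a short question and a long chunk can still match closely.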
5) Assemble a prompt and generate the answer
You create a prompt like:
- System instructions (“Answer using the provided context. If missing, say you don’t know.”)
- The user question
- The retrieved chunks (often called “context”)
Then the LLM generates an answer conditioned on those chunks. This is the “augmented” part: generation is now anchored in retrieved evidence.
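Prompt assembly is mostly string formatting. The sketch below is one possible layout, not a required format: the `[system]`, `[user]`, and `[retrieved context]` labels mirror the assembled prompt shown in the workbench later on this page, while the function name `build_prompt` and the numbered passages are assumptions added for illustration.

```python
def build_prompt(question, chunks):
    """Assemble system instructions, the question, and numbered context passages."""
    system = (
        "Answer using the provided context. "
        "If the answer is missing from the context, say you don't know."
    )
    # Numbering the passages makes it possible to ask the model for citations.
    context = "\n\n".join(
        f"[{i}] {chunk}" for i, chunk in enumerate(chunks, start=1)
    )
    return (
        f"[system]\n{system}\n\n"
        f"[user]\n{question}\n\n"
        f"[retrieved context]\n{context}"
    )
```

The returned string would then be sent to whichever LLM API the application uses. Many chat APIs accept system and user messages as separate fields, in which case the same three parts would be passed separately rather than concatenated.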
A tiny concrete walkthrough
Suppose your internal doc says:
“Refunds are available within 30 days for unopened items. Opened items are eligible only for store credit.”
User asks: “Can I get my money back if I opened it?”
Retrieval finds the refund chunk. The model answers:
“Opened items aren’t eligible for a cash refund; they’re eligible for store credit.”
Without RAG, the model might guess based on generic e-commerce norms. With RAG, it can align to your policy.
Key Terminology
- Embedding: A numeric representation of text (or images) capturing semantic meaning, used to compare similarity.
- Chunking: Splitting documents into smaller passages to improve retrieval quality and fit context limits.
- Retriever: The component that selects relevant chunks for a query (often via vector similarity search).
- Vector index / vector database: A data structure/system optimized to store embeddings and quickly return nearest neighbors.
- Grounding (and provenance): Using retrieved sources to anchor outputs; provenance means you can point to the supporting passages.
Real-World Applications
- Internal knowledge assistants: “How do we file expenses?” answered from your finance policy docs.
- Customer support copilots: Draft replies using product manuals and prior resolved tickets.
- Developer tools: “Where is this API defined?” answered by retrieving relevant code chunks.
- Search + synthesis: Retrieval finds the best passages across many documents; generation turns them into a concise explanation.
- Multimodal RAG: Retrieving not only text but also diagrams/tables/images (or their extracted representations) so answers can reflect visual documents too.
Common Misconceptions
- “RAG eliminates hallucinations.” It reduces them when retrieval is good, but it doesn’t magically guarantee truth. If retrieval fetches irrelevant chunks (or misses the right ones), the model may still improvise. Guardrails (like “say you don’t know”) and evaluation matter.
- “RAG is just keyword search + LLM.” Keyword search helps, but modern RAG typically relies on semantic retrieval (embeddings), which can match meaning even when terms differ. Many production systems use hybrids (keyword + vector) for robustness.
- “More context is always better.” Dumping 50 chunks into the prompt often worsens quality. You want the smallest set of the most relevant passages. Too much context dilutes signal and can confuse the model.
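One simple way to act on the "smallest set of the most relevant passages" advice is to cap the context at a token budget. The sketch below keeps chunks in rank order until the next one would not fit. The function name `pack_context` and the stop-at-first-overflow policy are assumptions; the token counts in the usage note are taken from the workbench scenario later on this page, whose actual trimming policy is not specified.

```python
def pack_context(ranked_chunks, budget_tokens):
    """Greedily keep (text, token_count) pairs in rank order within a token budget."""
    kept, used = [], 0
    for text, tokens in ranked_chunks:
        if used + tokens > budget_tokens:
            break  # stop rather than skip, so lower-ranked chunks never leapfrog
        kept.append(text)
        used += tokens
    return kept, used
```

With ranked chunks of 92, 70, and 76 tokens and a budget of 170, the first two are kept (162 tokens) and the third is dropped. A variant could skip an oversized chunk and keep trying smaller ones, at the cost of letting lower-ranked text into the prompt.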
Further Reading
- “Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks” (Lewis et al., 2020)
- OpenAI Cookbook: “Retrieval augmented generation using Elasticsearch”
- LangChain docs: “RAG” / “Retrieval”
- LlamaIndex docs: “Understanding RAG”
Interactive: RAG Pipeline Workbench
Simulate retrieval, prompt assembly, and answer generation together. Adjust top-k, reranking, context budget, and answer policy to see how RAG quality changes.
Guided scenario
How long do production API keys stay valid, when must teams rotate them, and what exception exists during customer migrations?
Goal: Surface the primary rule and the exception without letting support guidance outrank the actual policy.
Higher top-k raises recall, but only if the extra chunks still fit the prompt budget.
Retrieved chunks compete for this budget before generation starts.
Evidence coverage: 100% (3 of 3 required evidence keys grounded)
Citation quality: strong
Hallucination risk: low
Pipeline latency: 242 ms
Retrieved context uses 162/170 tokens
Retrieved chunks (top-k after rerank)
#1 Security Operations Handbook
Security · 92 tok · score 1.03
Production API keys stay valid for ninety days from issue time, and teams must rotate keys during the final seven days before expiration to avoid forced lockout.
#2 Security Operations Handbook
Security · 70 tok · score 0.92
Temporary exceptions are allowed only for customer migration windows and require security approval with an explicit end date.
#3 Customer Support Playbook
Support · 76 tok · score 0.64
Security rotation timing in production is owned by internal teams, so avoid giving policy commitments unless they are copied from approved handbook text.
Assembled prompt (context budget decides which chunks survive)
[system]
Answer only from the provided context. If the evidence is missing, say so.
[user]
How long do production API keys stay valid, when must teams rotate them, and what exception exists during customer migrations?
[retrieved context]
Security Operations Handbook (92 tok)
Production API keys stay valid for ninety days from issue time, and teams must rotate keys during the final seven days before expiration to avoid forced lockout.
Security Operations Handbook (70 tok)
Temporary exceptions are allowed only for customer migration windows and require security approval with an explicit end date.
Generated answer (grounded)
Production API keys stay valid for ninety days from issue time. Teams should rotate them during the final seven days before expiration. Customer migration windows can use a temporary exception, but only with security approval and an explicit end date.
Grounding status
Required evidence checklist
Rerank is on, so authority and exact answer fit matter more than rough similarity.
The context budget trimmed at least one retrieved chunk, so some evidence never reached generation.
Higher top-k improves recall headroom, but only if the extra chunks still fit the prompt budget cleanly.
All required evidence keys are covered by the current retrieved context.
Interactive: RAG Failure Debugger
Explore why RAG still fails even when a retriever exists. Compare mitigation strategies across missed evidence, ranking failures, stale sources, and prompt overflow.
Failure preset
The right answer is split across chunk boundaries, so the retriever never surfaces the exact supporting sentence.
Broken stage: Chunking / indexing
Baseline failure
The answer sounds plausible but misses the one sentence that proves the rule.
Critical evidence never enters the candidate pool.
The model mentions rotation timing, but the approval rule is absent because no chunk contains the full exception sentence.
Choose a mitigation
Recommended for this failure: better chunking
Answer quality: high
Citation quality: strong
Latency / cost: +small (small added indexing cost)
Hallucination risk: low
Mitigation outcome
The answer span stays intact, so the retriever can finally return it.
Both the main rule and exception sentence appear together in the retrieved context.
Why this changes the result
Overlap or semantic chunking fixes the root cause instead of trying to patch missing evidence later.
Compare to baseline
Baseline: Critical evidence never enters the candidate pool.
After mitigation: The answer span stays intact, so the retriever can finally return it.