Contextual Retrieval: Anthropic's Fix for RAG's Biggest Problem
Anthropic's Contextual Retrieval uses Claude to add context to document chunks before embedding, cutting RAG retrieval failures by 49%. Combined with reranking, it hits 67%. Here's how it works and why it matters.
TL;DR
- Traditional RAG destroys context when chunking documents, causing retrieval failures
- Contextual Retrieval prepends AI-generated context to each chunk before embedding, reducing failed retrievals by 49%
- Combined with reranking, the failure rate drops by 67% (5.7% → 1.9%)
- Prompt caching makes this practical at $1.02 per million document tokens
The Big Picture
RAG is everywhere. If you've built a chatbot that knows your docs, a support system that references past tickets, or a legal assistant that cites case law, you're using Retrieval-Augmented Generation. The pattern is simple: chunk your knowledge base, embed it, retrieve relevant pieces, stuff them in the prompt.
The problem? Chunking destroys context. A sentence like "Revenue grew 3% over the previous quarter" is useless without knowing which company, which quarter, which filing. Traditional RAG embeds that orphaned chunk and hopes semantic similarity saves you. It doesn't.
Anthropic's Contextual Retrieval solves this by having Claude write a short context snippet for every chunk before embedding it. The chunk about revenue growth becomes "This chunk is from an SEC filing on ACME corp's performance in Q2 2023; the previous quarter's revenue was $314 million. Revenue grew 3% over the previous quarter." Now your retrieval system has something to work with.
This isn't theoretical. Anthropic tested this across codebases, research papers, and fiction. The results are consistent: contextual embeddings cut retrieval failures by 35%. Add contextual BM25 and you're at 49%. Throw in reranking and you hit 67%. These aren't marginal gains. This is the difference between a RAG system that works and one that doesn't.
How It Works
Start with the baseline. Traditional RAG splits documents into chunks (usually a few hundred tokens), runs them through an embedding model, stores the vectors in a database. At query time, you search for semantic similarity, grab the top-K chunks, append them to the prompt. Done.
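That baseline fits in a few lines of Python. This is a toy sketch: the embedding model is assumed to exist upstream (each chunk arrives as a `(text, vector)` pair), and real systems chunk by tokens rather than words.

```python
import math

def cosine(a, b):
    # Cosine similarity between two embedding vectors.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def chunk_text(text, max_words=300):
    # Naive fixed-size chunking; production systems split by tokens,
    # sentence boundaries, or document structure.
    words = text.split()
    return [" ".join(words[i:i + max_words])
            for i in range(0, len(words), max_words)]

def retrieve(query_vec, index, top_k=5):
    # index: list of (chunk_text, embedding_vector) pairs built offline.
    ranked = sorted(index, key=lambda item: cosine(query_vec, item[1]),
                    reverse=True)
    return [chunk for chunk, _ in ranked[:top_k]]
```

Everything interesting happens before `retrieve` is ever called — which is exactly where Contextual Retrieval intervenes.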
Better RAG systems add BM25, a lexical matching technique that catches exact phrases your embedding model might miss. If someone searches "Error code TS-999," BM25 finds that exact string while embeddings might return generic error documentation. You combine both approaches with rank fusion and get more accurate results.
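The usual way to merge the two ranked lists is reciprocal rank fusion. A minimal version (the `k=60` constant is the conventional default from the RRF literature, not something the article specifies):

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Merge several ranked lists of chunk IDs into one ranked list.

    Each list contributes 1 / (k + rank) per item, so chunks that rank
    well in both the embedding results and the BM25 results float to
    the top even if neither ranker put them first.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

A chunk like the "Error code TS-999" doc might rank first in BM25 and third in the embedding results; fusion rewards that agreement.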
Contextual Retrieval adds a preprocessing step. Before embedding or indexing, you pass each chunk and its source document to Claude with this prompt:
```
<document>
{{WHOLE_DOCUMENT}}
</document>
Here is the chunk we want to situate within the whole document
<chunk>
{{CHUNK_CONTENT}}
</chunk>
Please give a short succinct context to situate this chunk within the overall document for the purposes of improving search retrieval of the chunk. Answer only with the succinct context and nothing else.
```

Claude generates 50-100 tokens of context explaining what the chunk is about, where it came from, and what document it belongs to. You prepend that context to the chunk, then embed it. You do the same for BM25 indexing. Now every chunk carries its own explanation.
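Wiring that prompt into a preprocessing helper might look like this. Only the prompt template comes from the article; the helper is a sketch, `generate` is a placeholder for whatever function sends the prompt to Claude and returns its text, and the template placeholders are renamed so Python's `format` can fill them.

```python
CONTEXT_PROMPT = """<document>
{doc}
</document>
Here is the chunk we want to situate within the whole document
<chunk>
{chunk}
</chunk>
Please give a short succinct context to situate this chunk within the overall document for the purposes of improving search retrieval of the chunk. Answer only with the succinct context and nothing else."""

def contextualize(chunk, document, generate):
    # `generate` is any callable that sends a prompt to an LLM and
    # returns its text; in production it would wrap a Claude API call,
    # with the document portion of the prompt cached across chunks.
    context = generate(CONTEXT_PROMPT.format(doc=document, chunk=chunk))
    return f"{context.strip()}\n{chunk}"  # context first, then the raw chunk
```

The returned string is what you embed and BM25-index in place of the bare chunk.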
The cost is manageable because of prompt caching. You load the full document into cache once, then reference it for every chunk. Anthropic's math: assuming 800-token chunks, 8k-token documents, and 100 tokens of generated context per chunk, you pay $1.02 per million document tokens. One-time cost. After that, retrieval is the same as always.
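The arithmetic behind that figure can be reconstructed roughly as follows. The prices are assumed Claude 3 Haiku-era list rates (USD per million tokens), and the prompt template's own tokens are ignored, so the total lands near, not exactly at, $1.02.

```python
# Assumed list prices, USD per million tokens (an assumption, not from
# the article): regular input, output, cache write, cache read.
INPUT, OUTPUT = 0.25, 1.25
CACHE_WRITE, CACHE_READ = 0.30, 0.03

doc_tokens = 1_000_000                       # price everything per 1M document tokens
chunk_size, doc_size, context_size = 800, 8_000, 100
chunks = doc_tokens // chunk_size            # 1,250 chunks across 125 documents

cache_write = doc_tokens * CACHE_WRITE / 1e6         # each document cached once
cache_reads = chunks * doc_size * CACHE_READ / 1e6   # full doc re-read per chunk, at cache rates
chunk_input = chunks * chunk_size * INPUT / 1e6      # the chunk itself, passed uncached
gen_output = chunks * context_size * OUTPUT / 1e6    # ~100 tokens of context out per chunk

total = cache_write + cache_reads + chunk_input + gen_output
```

Under these assumptions the total works out to about a dollar per million document tokens; the cache reads are the striking part — 10 million tokens of document re-reads cost only $0.30 because of the 90% caching discount.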
The architecture looks like this: during preprocessing, each chunk goes through Claude to get contextualized. At runtime, your query hits both the contextualized embeddings and the contextualized BM25 index. You merge results with rank fusion, optionally rerank the top 150 down to the top 20, then pass those 20 chunks to Claude for generation.
Anthropic tested this across multiple embedding models. Gemini Text 004 and Voyage performed best, but contextual retrieval improved results for every model they tried. The technique is model-agnostic. It works because you're giving the retrieval system more signal, not because you're gaming a specific embedding architecture.
Reranking adds another layer. After initial retrieval pulls 150 candidates, a reranking model (Anthropic used Cohere's) scores each chunk against the query and selects the top 20. This filtering step is expensive in latency and cost, but it's worth it. Reranking on top of contextual retrieval pushed the failure rate from 2.9% to 1.9%.
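As a sketch, the reranking step is just a re-score and trim. Here `score` stands in for a real reranking model such as Cohere's endpoint, which takes a query and a chunk and returns a relevance score; the 150-in, 20-out numbers are the article's.

```python
def rerank(query, candidates, score, keep=20):
    # candidates: e.g. the top 150 chunks from rank fusion.
    # score: a callable (query, chunk) -> relevance, standing in for a
    # reranking model; real rerankers batch-score all pairs in one call.
    ranked = sorted(candidates, key=lambda c: score(query, c), reverse=True)
    return ranked[:keep]
```

The value comes entirely from `score` seeing the query and the full chunk text together, which the embedding similarity never does.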
What This Changes For Developers
If you're running RAG in production, you've hit this problem. A user asks a specific question, your system retrieves plausible-looking chunks, Claude generates an answer that's confidently wrong. You check the logs and realize the right information was in the knowledge base, just not in the retrieved chunks. Retrieval failed.
Contextual Retrieval fixes the failure mode where chunks lack identifying information. Financial filings, legal documents, technical specs, support tickets — these are full of references that only make sense with context. "The defendant argued" means nothing without the case name. "This API endpoint" is useless without the service name. Traditional chunking strips that context. Contextual Retrieval puts it back.
The workflow change is minimal. You add a preprocessing step that costs about a dollar per million tokens. If you're already using Claude, you're already paying for tokens. This is a one-time cost that improves every query forever. The retrieval code stays the same. The embedding model stays the same. You're just feeding it better input.
The latency impact depends on whether you rerank. Contextual embeddings and BM25 add zero runtime latency — the extra context is baked into the vectors during preprocessing. Reranking adds a round trip to the reranking service, but it's parallelized across chunks. Anthropic doesn't publish numbers, but reranking 150 chunks typically adds 100-300ms depending on your provider.
This stacks with other techniques. Anthropic's agent patterns rely on accurate retrieval. If your agent is searching a knowledge base to decide what to do next, retrieval failures cascade into bad decisions. Contextual Retrieval makes agents more reliable.
Try It Yourself
Anthropic published a cookbook with implementation details. The core logic is straightforward: loop through your chunks, call Claude with the contextualization prompt, prepend the output to each chunk, then proceed with normal embedding and indexing.
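That loop can be sketched in a few lines. This is a hedged outline, not the cookbook's code: `chunker` and `contextualize` are placeholders for your splitter and for a function that calls Claude with the contextualization prompt.

```python
def preprocess(documents, chunker, contextualize):
    # chunker: splits one document into chunks.
    # contextualize: (document, chunk) -> Claude-generated context string.
    indexed = []
    for doc in documents:
        for chunk in chunker(doc):
            context = contextualize(doc, chunk)
            indexed.append(f"{context.strip()}\n{chunk}")
    return indexed  # embed and BM25-index these as usual
```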
If your knowledge base is under 200k tokens (about 500 pages), skip all of this and use prompt caching to stuff the entire knowledge base in the prompt. Anthropic's prompt caching reduces costs by up to 90% and latency by over 2x for repeated content. RAG is only necessary when your knowledge base exceeds Claude's context window.
For larger knowledge bases, the decision tree is simple. Start with contextual embeddings and contextual BM25. That gets you to a 49% reduction in retrieval failures. If that's not enough, add reranking. If you're latency-sensitive, skip reranking or reduce the number of chunks you rerank. If you're accuracy-sensitive, rerank more chunks and pass the top 20 to Claude instead of the top 10.
Anthropic tested chunk counts of 5, 10, and 20. Twenty chunks performed best. More context helps Claude generate better answers, and the models are good enough now that they don't get distracted by extra information. Your mileage may vary depending on your use case, but 20 is a reasonable default.
The Bottom Line
Use Contextual Retrieval if you're running RAG in production and retrieval accuracy matters. The cost is negligible, the implementation is simple, and the gains are real. A 49% reduction in retrieval failures means fewer hallucinations, fewer "I don't know" responses, fewer support tickets about wrong answers.
Skip it if your knowledge base fits in a prompt (under 200k tokens) or if you're prototyping and don't care about accuracy yet. Also skip it if your chunks already have strong context — if you're chunking by section headers and each chunk starts with a descriptive title, you might not need this.
The real opportunity here is that RAG is no longer the bottleneck. For years, the advice has been "RAG is hard, retrieval is the hard part, spend your time tuning embeddings and chunk sizes." Contextual Retrieval doesn't eliminate that tuning, but it raises the baseline enough that most teams can stop obsessing over retrieval and focus on the actual product. That's the win.
Source: Anthropic