
RAG isn't retrieval - it's context engineering

Sylvester S, Founder & CEO
Jan 15, 2025·7 min read

Everyone calls it retrieval-augmented generation, but the bottleneck is never retrieval. It's knowing what context an LLM actually needs to reason correctly.

The name 'retrieval-augmented generation' implies that retrieval is the thing. It's not. Retrieval is the easy part. The hard part - the part that determines whether your RAG pipeline produces useful answers or confident nonsense - is deciding what context to put in front of the model.

After building RAG systems for enterprise knowledge bases, legal document analysis, financial research, and customer support, here's our framework for thinking about context engineering.

The chunk strategy problem

Most teams chunk documents at fixed token counts. It's the default in LangChain, it's what tutorials show, and it's usually wrong. Fixed-size chunking splits sentences mid-thought, separates tables from their headers, and divorces conclusions from the evidence they summarise. The embedding model then generates a vector for a chunk that means nothing in isolation.

Chunk at semantic boundaries instead. For prose: paragraph-level chunking, with sentence-level overlap. For structured documents: section-level chunking that preserves hierarchical context. For tables and code: chunk by logical unit (one table, one function), never mid-structure. The additional complexity in your ingestion pipeline pays off dramatically in retrieval quality.
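A minimal sketch of the paragraph-level approach for prose, with sentence-level overlap carried between chunks (the function name and overlap policy are illustrative, not a fixed API):

```python
import re

def chunk_paragraphs(text, overlap_sentences=1):
    """Split prose into paragraph-level chunks, prepending the last
    sentence(s) of the previous paragraph so no chunk starts mid-thought."""
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks = []
    carry = []  # overlap sentences carried over from the previous paragraph
    for para in paragraphs:
        sentences = re.split(r"(?<=[.!?])\s+", para)
        chunks.append(" ".join(carry + sentences))
        carry = sentences[-overlap_sentences:]
    return chunks
```

Tables and code would bypass this path entirely and be chunked as whole logical units by a separate parser.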

Embedding model choice matters less than you think

Teams spend significant time evaluating embedding models - ada-002 vs. BGE vs. Cohere vs. Jina. The performance differences between modern embedding models on typical enterprise retrieval tasks are smaller than the performance difference between good and bad chunk strategy. Get chunking right first. Then optimise your embedding model if retrieval quality still falls short.

The reranking layer is not optional

Vector similarity retrieval is a blunt instrument. It finds semantically similar chunks, but semantic similarity and relevance-to-this-specific-query are not the same thing. A cross-encoder reranker - which takes the query and each candidate chunk and scores them jointly - dramatically improves precision. We use Cohere Rerank or a fine-tuned cross-encoder as standard on every production RAG system.

Add a reranker before you add a more expensive embedding model. In our benchmarks, switching from no reranker to a cross-encoder reranker improved answer quality more than switching from ada-002 to a state-of-the-art embedding model.
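The reranking step itself is a small piece of plumbing: score every (query, chunk) pair jointly, then keep only the top scorers. Here is a hedged sketch where `score_fn` stands in for a real cross-encoder (Cohere Rerank or a fine-tuned model in production); the `lexical_overlap` scorer below is a toy placeholder so the shape of the pipeline is runnable:

```python
def rerank(query, chunks, score_fn, top_k=3):
    """Score each (query, chunk) pair jointly and keep the best top_k.
    In production score_fn would call a cross-encoder; any callable
    (query, chunk) -> float fits the same slot."""
    return sorted(chunks, key=lambda c: score_fn(query, c), reverse=True)[:top_k]

def lexical_overlap(query, chunk):
    """Toy stand-in scorer: fraction of query terms present in the chunk."""
    q_terms = set(query.lower().split())
    c_terms = set(chunk.lower().split())
    return len(q_terms & c_terms) / max(len(q_terms), 1)
```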

Context assembly: what goes in the prompt

You've retrieved the right chunks. Now: what do you actually put in the prompt, in what order, and with what framing? This is context engineering. A few principles we've landed on:

  • Position matters: LLMs attend better to context at the start and end of the context window. Put the most critical evidence first.
  • Include metadata: chunk source, document date, section title. This helps the model reason about evidence provenance.
  • Filter before you fill: it's better to pass 3 high-quality chunks than 10 mediocre ones. Don't use the full context window by default.
  • Add explicit structure: label each chunk with [Source 1], [Source 2] etc. so the model can cite and distinguish between them.
  • State what you don't know: if retrieval returns nothing relevant, tell the model explicitly rather than sending it empty context.
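The principles above can be collapsed into one assembly function. This is a sketch under assumed field names (`text`, `source`, `date`, `section` are illustrative, not a required schema); it labels each chunk, surfaces metadata, caps the chunk count, and handles the empty-retrieval case explicitly:

```python
def assemble_context(chunks, max_chunks=3):
    """Build a labelled, metadata-rich context block from retrieved chunks.
    Filters before filling: at most max_chunks are included."""
    if not chunks:
        # Tell the model retrieval came back empty rather than sending nothing.
        return "No relevant documents were retrieved. Say so; do not guess."
    blocks = []
    for i, chunk in enumerate(chunks[:max_chunks], start=1):
        header = f"[Source {i}] {chunk['source']} | {chunk['date']} | {chunk['section']}"
        blocks.append(f"{header}\n{chunk['text']}")
    return "\n\n".join(blocks)
```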

Evaluating RAG quality

Build an evaluation set before you start building the pipeline. Sample 50-100 real questions from your target user group, pair them with ground-truth answers, and measure your pipeline against them at every stage. Evaluate retrieval quality (did the right chunks come back?) and generation quality (did the model use the chunks correctly?) separately - they have different failure modes and different fixes.
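For the retrieval half, a simple recall-at-k over the eval set is often enough to catch regressions; a minimal sketch (the list-of-lists input format is an assumption for illustration):

```python
def retrieval_recall_at_k(results, relevant_ids, k=5):
    """Fraction of eval questions where at least one ground-truth chunk
    appears in the top-k retrieved results. `results` is a list of
    ranked chunk-id lists, one per question; `relevant_ids` the matching
    ground-truth chunk ids."""
    hits = 0
    for retrieved, relevant in zip(results, relevant_ids):
        if set(retrieved[:k]) & set(relevant):
            hits += 1
    return hits / len(results)
```

Generation quality needs its own, separate check (e.g. grading answers against ground truth), because a perfect retrieval score says nothing about whether the model used the chunks correctly.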

The teams that build great RAG systems are the ones that treat evaluation as the primary engineering task, not an afterthought. The retrieval and generation are the implementation. The eval suite is the product.


Frequently asked questions

Why does my RAG system give wrong answers even when the source documents have the right information?

Usually a context quality problem, not a retrieval failure. The right documents may be in your vector store, but the chunks being passed to the model are fragmented, lack surrounding context, or are arranged in a way that is hard to reason about. Check your chunking strategy first. Then check whether a reranker is in the pipeline to filter out low-relevance chunks before they reach the model.

What chunk size should I use for a RAG pipeline?

Avoid fixed token counts. Chunk at semantic boundaries instead: paragraph-level for prose, section-level for structured documents, and logical unit boundaries for tables and code. The goal is chunks that mean something in isolation, not chunks that happen to be a uniform size. The added complexity in the ingestion pipeline pays off significantly in retrieval quality.

What is a reranker and do I actually need one?

A reranker is a cross-encoder model that scores each retrieved chunk against your query jointly, rather than independently. Vector similarity retrieval finds semantically similar chunks, but similarity and actual relevance to the query are not always the same thing. In our benchmarks, adding a reranker improved answer quality more than switching to a better embedding model.

How do I evaluate whether my RAG pipeline is actually working well?

Build an evaluation set before you build the pipeline. Sample 50 to 100 real questions from your target users and pair them with ground-truth answers. Then measure retrieval quality and generation quality separately, because they have different failure modes and different fixes. Teams that treat evaluation as the primary task consistently ship better RAG systems than teams that add it as an afterthought.

