Retrieval-Augmented Generation (RAG) is the most common pattern for grounding a language model in documents the model was not trained on. It is also the pattern most often described in ways that make it sound more powerful than it is.
The useful framing is narrower. RAG addresses one specific problem: how to make a model answer questions about a corpus that changes faster than the model can be retrained, while still being able to cite where each answer came from.
The shape of the pipeline
A working RAG system has the same broad shape regardless of size:
- Documents are split into chunks small enough to fit alongside a question in the model’s context window.
- Each chunk is converted into a vector by an embedding model.
- The vectors live in a database that supports similarity search.
- A user query is converted into a vector by the same embedding model.
- The database returns the most similar chunks.
- The chunks and the query are sent to the language model, which generates the answer.
This is what most “chat with your documents” products do, including search-flavored assistants and notebook-style tools. The components are interchangeable: the embedding model, the vector database, the chunking strategy, and the language model can all be swapped without changing the overall pattern.
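Stripped to its essentials, the loop looks roughly like the sketch below. This is not any particular product's implementation: the `embed`, `generate`, and `chunk` functions are placeholders for whichever embedding model, language model, and chunking strategy you plug in, and an in-memory NumPy array stands in for the vector database.

```python
import numpy as np

# --- Placeholders: swap in a real embedding model and LLM client. ---
def embed(texts: list[str]) -> np.ndarray:
    """Return one vector per text (e.g. from a sentence-embedding model)."""
    raise NotImplementedError

def generate(prompt: str) -> str:
    """Call whatever language model you use."""
    raise NotImplementedError

def chunk(text: str, max_chars: int = 1000) -> list[str]:
    """Naive fixed-size chunking; production systems do better."""
    return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]

# --- Indexing: chunk documents, embed chunks, keep vectors for search. ---
def build_index(documents: list[str]):
    chunks = [c for doc in documents for c in chunk(doc)]
    vectors = embed(chunks)                                   # (n_chunks, dim)
    vectors = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    return chunks, vectors

# --- Query time: embed the question, take the nearest chunks, prompt the model. ---
def answer(question: str, chunks: list[str], vectors: np.ndarray, k: int = 4) -> str:
    q = embed([question])[0]
    q = q / np.linalg.norm(q)
    scores = vectors @ q                                      # cosine similarity
    top = np.argsort(scores)[::-1][:k]
    context = "\n\n".join(chunks[i] for i in top)
    prompt = (
        "Answer using only the context below. Cite the passage you used.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )
    return generate(prompt)
```

Swapping the in-memory array for a real vector database, or the naive chunker for something structure-aware, changes nothing about the shape of this loop.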
What it actually solves
RAG is a good fit when:
- The corpus is too large to fit in a single prompt.
- The corpus changes more often than it would be reasonable to retrain or finetune.
- The answer should be traceable back to specific documents.
- Hallucination cost is high enough that grounding matters more than fluency.
In those situations, RAG is usually cheaper and more flexible than finetuning. A finetuned model bakes knowledge into weights, which is expensive to refresh and difficult to attribute. RAG keeps the knowledge in documents and lets the model treat it as input.
What it does not solve
RAG is often described as a fix for hallucination. It is not. It reduces the surface area for hallucination by giving the model relevant text to lean on, but a model can still ignore the retrieved context, blend it with prior knowledge, or interpret it incorrectly.
It also does not solve retrieval quality. A naive setup that splits documents into fixed-size chunks and returns the top few matches will work in demos and fail on real corpora. Production systems usually add some combination of:
- Smarter chunking that respects document structure.
- Hybrid retrieval that combines vector similarity with keyword search.
- A reranking step that scores candidate chunks against the query.
- Query rewriting or expansion before retrieval.
Each of these adds complexity, latency, and a new place for things to go wrong.
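To make the hybrid-retrieval and fusion ideas concrete, here is a rough sketch that builds on the functions and the `vectors` array from the earlier example. The `keyword_scores` helper is a deliberately crude stand-in for a real keyword ranker such as BM25, and the merge step uses reciprocal rank fusion with its conventional constant of 60; the names and parameters are illustrative, not a reference implementation.

```python
from collections import Counter

def keyword_scores(query: str, chunks: list[str]) -> list[float]:
    """Crude stand-in for a keyword ranker such as BM25: count query-term hits."""
    terms = query.lower().split()
    return [sum(c.lower().count(t) for t in terms) for c in chunks]

def rrf(rankings: list[list[int]], k: int = 60) -> list[int]:
    """Reciprocal rank fusion: merge several rankings of chunk indices."""
    scores = Counter()
    for ranking in rankings:
        for rank, idx in enumerate(ranking, start=1):
            scores[idx] += 1.0 / (k + rank)
    return [idx for idx, _ in scores.most_common()]

def hybrid_retrieve(question: str, chunks: list[str], vectors, k: int = 4) -> list[str]:
    # Ranking 1: vector similarity, as in the earlier sketch.
    q = embed([question])[0]
    q = q / np.linalg.norm(q)
    by_vector = list(np.argsort(vectors @ q)[::-1])
    # Ranking 2: keyword match.
    kw = keyword_scores(question, chunks)
    by_keyword = sorted(range(len(chunks)), key=lambda i: kw[i], reverse=True)
    # Fuse the two rankings, then take the top k candidates.
    fused = rrf([by_vector, by_keyword])
    return [chunks[i] for i in fused[:k]]
```

A reranking step would slot in after the fusion: score each fused candidate against the query with a cross-encoder, then make the final cut. Query rewriting happens earlier still, before anything is embedded at all.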
Where RAG fits in a larger system
RAG works at query time. Every question triggers retrieval over the same corpus. For knowledge that is genuinely dynamic — a help center, a documentation site, a frequently updated knowledge base — that is the right shape.
For knowledge that accumulates more slowly and benefits from human judgment, a different pattern may fit better: compiling raw sources into curated, cross-linked notes once, and letting the model read those notes directly. The two are not mutually exclusive. Retrieval can sit in front of curated notes the same way it sits in front of raw documents.
The point is to choose based on how the knowledge is shaped, not on which pattern is most often demoed.