What Is RAG? How Retrieval-Augmented Generation Works in Practice

Published March 16, 2026 by Joel Thyberg

RAG, which stands for Retrieval-Augmented Generation, emerged as an answer to a very concrete problem: large language models are good at writing text, but they do not reliably carry around all the facts you need in a real business. They can lack information, they can be outdated, and they can hallucinate.

The classic purpose of RAG was therefore twofold. First, to reduce hallucinations by grounding the answer in actual documents. Second, to give the model access to information that was never in the training data, such as internal manuals, contracts, product sheets, support cases, or technical documentation.

Today, modern LLMs and AI agents have more ways to fetch information. They can search the web, open files, read terminal output, run code, and navigate systems. But RAG still has a clear advantage when the task is to find relevant information in a known knowledge base quickly, cheaply, and reproducibly. It remains one of the most practical retrieval patterns available.

RAG in One Sentence

RAG is an architectural layer around the language model. Instead of hoping the model already "knows" the answer, the system first retrieves relevant information from an external knowledge base and then sends that information together with the question to the model.

How Does RAG Work in Practice?

In the walkthrough below, we focus on text-based RAG, which is the most common and pedagogically clearest form. It is also the variant most people mean when they talk about RAG over documents, knowledge bases, and internal systems. Later we come back to how the same core idea can also be used multimodally.

At a high level, RAG consists of two phases in practice.

First, the material is prepared so it becomes searchable. The documents are read in, turned into text, split into smaller parts, and converted into numerical representations that can be compared mathematically.

Then, when a user asks a question, the system does the same thing with the question. The question is converted into a numerical representation, the system finds the most relevant passages, and the language model then answers based on that exact material.

That means RAG is not one single model step, but a whole pipeline where several parts must work well together. If document ingestion is weak, retrieval becomes weak. If chunking is poor, the matches get worse. If the embedding model does not capture meaning well, it matters less how strong the language model is at the end. That is why RAG is more system design than prompting in practice.

First, a Searchable Knowledge Base Is Built

Documents Become Readable Text

RAG starts in the documents.

PDFs, Word files, HTML, email, database exports, images, and sometimes audio or video first need to be converted into a format the system can work with. In simple cases, raw text extraction is enough. In more realistic environments, you also need to capture structure, for example headings, tables, page numbers, sections, source IDs, and sometimes OCR if the documents are scanned.
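As a small illustration of why preserving structure matters during ingestion, here is a minimal sketch, using only Python's standard library, that extracts headings together with their paragraphs from HTML instead of flattening everything into one string. The class name `SectionExtractor` and the output shape are assumptions for illustration, not a standard API.

```python
from html.parser import HTMLParser

class SectionExtractor(HTMLParser):
    """Collect (heading, body) sections instead of one flat text blob."""

    def __init__(self):
        super().__init__()
        self.sections = []        # list of {"heading": str, "body": [str, ...]}
        self._current_tag = None

    def handle_starttag(self, tag, attrs):
        self._current_tag = tag

    def handle_endtag(self, tag):
        self._current_tag = None

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return
        if self._current_tag in ("h1", "h2", "h3"):
            # A heading starts a new section
            self.sections.append({"heading": text, "body": []})
        elif self.sections:
            # Everything else belongs to the most recent heading
            self.sections[-1]["body"].append(text)

parser = SectionExtractor()
parser.feed("<h1>Oil changes</h1><p>Change the oil every 500 hours.</p>"
            "<h2>Filters</h2><p>Replace the filter yearly.</p>")
```

A flat text extraction of the same HTML would lose which paragraph belongs to which heading; keeping that link is exactly the kind of structure the later chunking step benefits from.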

This step is often underestimated, because it is easy to think that retrieval and the language model are the parts that "are AI" and therefore matter most. In reality, ingestion quality is often decisive. If the parser loses headings, table structure, or paragraph boundaries, the rest of the RAG pipeline gets worse no matter how good an embedding model or language model you choose later.

That becomes especially visible in material that is not written as plain running text. Manuals, policy documents, contracts, support cases, and technical specifications are often full of tables, bullet lists, footnotes, and appendices. If all of that gets flattened the wrong way, you still have "text", but no longer a very good representation of what the document actually says.

The Text Gets Split into Chunks

Once the documents exist as text, they are normally split into smaller pieces called chunks. The reason is simple: retrieval works better on smaller, focused passages than on whole documents, and it is also much cheaper to send a few relevant chunks to the model than to send everything.

There are several ways to do this. Some systems chunk by headings and semantic sections. Some use recursive strategies. Some use fixed windows. But in practice, token-based chunking is still a common default, because the tokenizer, embeddings, and context window all work in tokens anyway.

What matters is that the chunks are small enough to be precise, but not so small that they lose their context. If you split all text too aggressively, you can end up with passages that look irrelevant on their own even though together they contain the information you actually need. If you make the chunks too large instead, you get worse precision, higher cost, and a greater risk that the language model receives too much noise in each call.

That is why people often also use overlap, meaning that two neighboring chunks share a certain number of tokens. That reduces the risk that important information gets cut off in the middle of a sentence, in the middle of a table row, or exactly between two paragraphs that belong together. At the same time, too much overlap costs both storage and retrieval quality, because you create more nearly identical chunks that compete with each other.
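The chunking-with-overlap idea above can be sketched in a few lines. This toy version splits on whitespace rather than using a real tokenizer, and the window sizes are deliberately tiny; a production pipeline would count actual model tokens.

```python
def chunk_tokens(tokens, chunk_size=100, overlap=20):
    """Split a token list into fixed-size windows; neighbors share `overlap` tokens."""
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the current window already covers the tail
    return chunks

# Toy example: 12 "tokens", windows of 5, overlap of 2
tokens = "the quick brown fox jumps over the lazy dog near the barn".split()
chunks = chunk_tokens(tokens, chunk_size=5, overlap=2)
```

Note how the last two tokens of each window reappear at the start of the next one; that shared region is what keeps a sentence from being cut cleanly in half between two chunks.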

In Chroma's evaluation of chunking, they also found that chunking strategy materially affects retrieval quality, and that no single strategy wins for every type of material. That is worth stressing, because many people describe chunking as a simple preparation step. In practice, it is one of the places where a RAG system often wins or loses on quality.

Every Chunk Becomes an Embedding

Once the text chunks have been created, they are sent to an embedding model. It does not produce a finished answer. It converts each chunk into a numerical representation, a vector, that captures the semantic meaning.

An embedding can look like a long list of floating-point numbers, for example:

[0.018, -0.442, 0.731, 0.094, -0.118, 0.507, -0.221, 0.063, ...]

In reality it is much longer than that. Many modern embedding models work with hundreds or thousands of dimensions, often for example 768, 1024, 1536, 3072, or 4096, depending on the model. The point is not that humans should interpret each number. The point is that texts with similar meaning end up close to each other in a high-dimensional space.
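To make "close to each other in a high-dimensional space" concrete, here is the standard cosine-similarity calculation applied to hand-made 4-dimensional vectors. Real embeddings have hundreds or thousands of dimensions and come from a model; these numbers are invented purely to illustrate the comparison.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Invented toy vectors: the two "leave" texts point in a similar direction
vacation_question = [0.9, 0.1, 0.0, 0.2]
leave_application = [0.8, 0.2, 0.1, 0.3]
oil_change_manual = [0.0, 0.9, 0.8, 0.1]

sim_related = cosine_similarity(vacation_question, leave_application)
sim_unrelated = cosine_similarity(vacation_question, oil_change_manual)
```

The related pair scores much higher than the unrelated pair, even though the numbers themselves mean nothing to a human reader; only the relative geometry matters.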

That is also why the embedding model is an important architectural choice in RAG. A weak embedding model can produce weak retrieval even if the language model afterward is very strong. If the model is bad at Swedish, legal language, technical terminology, or multimodal content, then the whole retrieval chain already starts with a worse representation of the material.

It is also worth understanding that the embedding model and the language model do not have to be the same type of model. In many systems they are different components with very different roles. The embedding model is used to organize material in a searchable semantic space. The language model is used later to read the retrieved context and formulate an answer.

The Vectors Are Stored in a Searchable Index

Once embeddings have been created, they need to be stored in a way that makes them fast to search. That is where vector databases and other vector indexes come in.

They usually store not only the vector itself, but also metadata such as document ID, section, source, page number, language, and sometimes links back to the original chunk. That matters, because retrieval ultimately is not only about finding one vector that lies close to another vector. The system also has to return to the actual text passage and often to its source.

Under the hood, some form of approximate nearest neighbor indexing is often used, for example HNSW or similar structures, because exact comparison against every vector becomes expensive as the amount of data grows. Users rarely notice that, but for system design it is decisive. Without a good index, retrieval becomes either too slow or too expensive when the material gets large.
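A minimal in-memory "index" can be nothing more than a list of records that keep each vector together with its metadata. The field names below are assumptions for illustration; real vector databases, and ANN structures such as HNSW, exist precisely to replace the linear scan this implies once the data grows large.

```python
# Each record keeps the vector together with metadata pointing back to the source
index = [
    {"id": "manual-7#3", "doc": "machine_manual.pdf", "page": 12,
     "text": "Change the oil every 500 operating hours.",
     "vector": [0.1, 0.9, 0.7, 0.0]},
    {"id": "hr-2#1", "doc": "hr_handbook.pdf", "page": 4,
     "text": "Submit a leave application in the HR system.",
     "vector": [0.8, 0.2, 0.1, 0.3]},
]

def exact_search(index, query_vector):
    """Brute-force scan: fine for toy data, too slow for millions of vectors."""
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))
    return max(index, key=lambda rec: dot(rec["vector"], query_vector))

hit = exact_search(index, [0.9, 0.1, 0.0, 0.2])
```

Because the metadata travels with the vector, the best match immediately gives back the text passage, the source document, and the page, which is what the language model and the citation layer actually need.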

Then the Knowledge Base Is Used When Someone Asks a Question

The Question Also Becomes an Embedding

When the user asks a question, the question is also converted into an embedding. Now both the document chunks and the user's question are represented in the same mathematical space.

That allows the system to compare the question with the indexed material in the same way it already compares different chunks with each other. That is exactly why RAG does not get stuck in pure keyword logic. A question does not need to use exactly the same words as the document, as long as the embedding model captures that the meaning is close.

That is also why RAG can work well in real businesses where language use varies. A user can write "how do I submit vacation" while the document talks about a "leave application in the HR system". If the embedding model does its job well, there is still a good chance they land close to each other semantically.

The Most Relevant Passages Are Retrieved

Now comes the retrieval step itself. The system finds the chunks whose vectors lie closest to the question vector.

This is often done with cosine similarity or dot product. The idea is the same in both cases: the system measures which vectors most closely resemble the question vector in a high-dimensional space. Then it normally retrieves the k nearest matches, for example top 3, 5, or 10.

After that, metadata is used to fetch the actual text passages, not just the vectors. Those text passages are what get sent on to the language model. In many practical systems, you also add an extra step here, where the results are reranked or filtered further before being sent onward. But the core is the same: first semantic retrieval, then actual text back into the model.
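The retrieval step can be sketched as a top-k nearest-neighbor query followed by a lookup of the actual text. Everything here, the vectors, the scores, and the helper name `top_k`, is toy illustration; a real system would get the query vector from an embedding model and the candidates from a vector index.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def top_k(index, query_vector, k=3):
    """Score every chunk against the question, return the k best records."""
    ranked = sorted(index, key=lambda rec: cosine(rec["vector"], query_vector),
                    reverse=True)
    return ranked[:k]

index = [
    {"text": "Change the oil every 500 hours.", "vector": [0.9, 0.1, 0.0]},
    {"text": "Replace the air filter yearly.",  "vector": [0.7, 0.3, 0.1]},
    {"text": "Submit leave in the HR system.",  "vector": [0.0, 0.2, 0.9]},
]

hits = top_k(index, query_vector=[0.8, 0.2, 0.0], k=2)
context = "\n".join(rec["text"] for rec in hits)
```

The two maintenance chunks come back and the HR chunk is left out, which is the whole point: only the passages semantically closest to the question are forwarded to the model.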

How RAG finds relevant information

[Interactive visualization: items such as "Cat", "Dog", "Wolf", "Bird", "Paris", "Apple", and "Car" are placed in a vector space so that semantically related items cluster together.]

Semantic retrieval is not about exact keywords, but about proximity in a vector space where similar meanings are placed near each other.

It is important to understand that retrieval does not "understand" in a human sense. It selects the chunks that appear most semantically similar to the question. That often makes it very powerful, but also limited. The system is good at finding what is most relevant, but not automatically at deciding whether the question actually requires the whole distribution of relevant hits, a broad comparison, or an exact enumeration.

Only Now Does the Language Model Generate an Answer

When the retrieval step is complete, the prompt to the language model is built. It often consists of the user's question, instructions about how the answer should be formulated, and the retrieved chunks as context. Only here does the actual generation begin.
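Assembling the final prompt is plain string construction. The template below is one common pattern, not a standard; the exact instructions and formatting vary between systems.

```python
def build_prompt(question, retrieved_chunks):
    """Combine instructions, retrieved context, and the user's question."""
    context = "\n\n".join(f"[Source {i + 1}] {chunk}"
                          for i, chunk in enumerate(retrieved_chunks))
    return (
        "Answer the question using only the context below. "
        "If the context is not sufficient, say so.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )

prompt = build_prompt(
    "How often should the oil be changed?",
    ["Change the oil every 500 operating hours.", "Use grade 10W-40 oil."],
)
```

Numbering the sources in the prompt is a cheap way to let the model cite which passage an answer came from, which in turn makes the answer auditable.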

That is why RAG is not the same thing as model training. You do not change the language model's weights. You change which material the model gets to see in the current call. If you want to understand that difference more deeply, the natural next step is our pages on LLMs, tokenizers, and the context window.

This is one of RAG's great strengths. If the documents change, you often do not need to retrain the model. You need to update the searchable knowledge base instead. That makes RAG especially attractive in environments where information changes often, but where you still want to build AI systems on top of it.

When Are Prompting, RAG, and Finetuning Relevant?

It is easy to mix these three things together, because all of them are used to get better results from a language model. But at a fundamental level, they solve different problems.

Prompting

Prompting is about how you formulate the instruction to the model. You can often get far with better structure, clearer role instructions, better examples, and clearer requirements on format or tone. It is almost always the first thing you should improve, because it is fast, cheap, and does not require any extra system around the model.

But prompting does not give the model new knowledge. It can help the model use what it already has better, but it cannot by itself give the model access to a new contract, an updated price list, or an internal manual that never existed in the training data.

RAG

RAG becomes relevant when the problem is not mainly how you instruct the model, but what material the model actually gets to see. It is therefore the right tool when the answer needs to be based on external, updateable, or internal company information. Instead of trying to write all the knowledge into the prompt, or hoping the model already knows it, the system retrieves the right material at the exact moment the question is asked.

That makes RAG especially strong when knowledge changes over time, when the material is too large to live directly in the prompt, and when you want to connect the answer to actual sources.

Finetuning

Finetuning becomes relevant when the problem sits deeper than both prompting and retrieval. It is not mainly about giving the model more facts in the moment, but about changing the model's behavior more permanently. That can for example mean that the model should follow a specific format very consistently, learn a certain tone, or get better at a narrow task where ordinary prompting is not enough.

Finetuning is not normally the natural first choice if the goal is only for the model to read new or frequently updated information. There, RAG is usually much more natural, because you update the knowledge base instead of the model itself.

In practice, many good systems work with all three levels at once. First you make the prompting clear, then you add RAG if the model needs external knowledge, and only after that do you consider finetuning if the behavior still is not stable or specialized enough. If you want to go deeper into that part, keep reading about finetuning.

RAG Does Not Work Only on Text

Many people associate RAG with text documents, and that is still the most common form. But the pattern itself is broader than that.

If you have embedding models that can place several modalities in the same semantic space, it becomes possible to do retrieval across more than text. Google described this clearly when they launched Gemini Embedding 2, a model that can work multimodally. The idea is that text, image, audio, and video can be represented in a shared vector space so that a question can find the most semantically relevant content regardless of modality.

That means a text question can in principle retrieve a text passage describing an error, an image of the correct component, a video clip showing the step, or an audio clip from support, as long as all of it is embedded in the same kind of semantic space. That makes RAG interesting far beyond simple document chat, especially in environments where knowledge is actually spread across several media types.

Why RAG Still Has a Clear Edge

When people talk about modern agent systems, several kinds of retrieval often get mixed together. An agent can search the web, run terminal commands, open files, or query an API. All of that is useful. But RAG still has several strong properties.

It is fast to search through large amounts of already indexed information. Retrieval becomes more reproducible than free browsing or free tool navigation. It works well for private documents and internal knowledge bases. And it is possible to tie results to sources, metadata, and sometimes quotations in a way that is often harder in more open agent flows.

That is why RAG often becomes the retrieval layer that other systems are built on top of, even when the agent is otherwise more advanced. An agent can be good at planning, but it still often needs a robust way to find the right information in the right amount. RAG still fills that role very clearly.

Strengths and Weaknesses

The greatest strength of RAG is that it is very good at finding a specific needle in a large haystack. If the right information exists in the material, and if chunking plus embeddings are reasonably good, there is a good chance that retrieval finds the most relevant passages even when the exact wording differs from the question.

That makes the technique especially strong when the relevant information really is concentrated in a few clear text passages. That can mean policy documents, instructions, technical answers, support summaries, or internal knowledge that someone quickly needs access to without manually browsing large amounts of material.

But RAG also has clear weaknesses, and they become important to understand as soon as the questions get more complex.

When RAG Often Gets Weaker

RAG is less strong when the question requires the system not only to find the most similar chunks, but to understand relationships or summarize many similar information points at once.

Imagine a long machine manual covering twenty machines, where each machine has a section about oil changes. If you ask "What differences are there in how the machines require oil changes?" or "How many machines require an oil change every 500 hours?", it is not always enough to retrieve the three or five nearest chunks. Top-k retrieval tends to give you the most semantically similar hits, not necessarily the whole set of relevant hits.
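The limitation is easy to demonstrate with toy numbers. Suppose all twenty machine sections are relevant to the question, but the system only retrieves the top 5 by similarity; any count derived from those hits will be wrong. The similarity scores below are invented for the illustration.

```python
import random

# Toy setup: 20 relevant "oil change" chunks with invented similarity scores
random.seed(0)
relevant_chunks = [{"machine": i, "score": random.uniform(0.6, 0.9)}
                   for i in range(20)]

# Standard top-k retrieval only sees the k highest-scoring chunks
k = 5
retrieved = sorted(relevant_chunks, key=lambda c: c["score"], reverse=True)[:k]

count_from_rag = len(retrieved)    # what the model could count from its context
true_count = len(relevant_chunks)  # what the question actually asks about
```

The model can only count what is in its context, so a question like "how many machines…" silently becomes "how many of the 5 retrieved machines…", which is a different question.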

That means ordinary RAG is often stronger at finding the right thing than at creating a complete synthesis of many related things. It is very good at precise retrieval, but much less naturally suited when the question is fundamentally about counting, broad comparison, or understanding relationships across many data points at once.

That is also where ordinary RAG often starts to give way to more specialized patterns, for example Graph RAG when relationships are central, or Agentic RAG when the system must break the question into several subproblems and search iteratively.

Three Common Variants of RAG

Standard RAG

This is the most common form. The question is embedded, the nearest chunks are retrieved from a vector index, and they are then sent to the language model as context. For many document-heavy use cases, that goes a long way.

It is also the variant most people mean when they simply say "RAG" without specifying further. When people talk about "connecting AI to the company's documents", it is often standard RAG they mean.

Graph RAG

Graph RAG tries to solve a different problem from ordinary similarity search. Instead of only asking "which chunks most resemble my question?", the system tries to explicitly model relationships between entities, facts, and concepts.

That fits better when the question is about dependencies, networks, or relationships between several pieces of information. In such cases, semantic proximity is not always enough. It can for example be more important to understand how two things connect than to find the text passage that happens to lie closest to the question's wording.

Graph RAG Visualization

[Interactive visualization: nodes such as "Cat", "Dog", "Wolf", "Paris", and "Apple" connected by edges showing explicit relationships in a graph.]

Graph RAG becomes relevant when the information mainly lives in the relationships between things, not only in individual text passages.

Agentic RAG

Agentic RAG adds planning on top of retrieval. Instead of one single retrieval step, the system can break down a complex question into several subquestions, choose different data sources, perform follow-up searches, and only then assemble the final answer.

At that point it starts to approach AI agents. The difference is that retrieval still sits at the center, but the retrieval chain has become more dynamic and multi-step. Instead of asking once and retrieving once, the system can reason about which source should be used, what is missing, and what the next search should be.

Agentic RAG Visualization

[Interactive visualization: a router directing a query to a vector database, a graph database, or a relational database.]
Agentic RAG is useful when a single similarity search is not enough, and the system must choose source and search strategy step by step.

How RAG Connects to the Rest of the Knowledge Cluster

RAG is rarely an isolated technique. It connects to several other layers.

It connects to LLMs and language models, because RAG is fundamentally a retrieval layer around the model, not a replacement for the model. It connects to the tokenizer, because chunking, token budgets, and embedding flows in practice are built on tokenization. It connects to the context window, because all retrieved chunks must fit inside the model's workspace. And it connects to AI agents, because agents often use RAG as one of several retrieval tools.

If you want to see how this technology is used in practice on internal company information, the natural next step is also the page about AI with your data.

Summary

RAG emerged because language models needed a way to work with current and external information without being retrained every time the data changes. That is still the core point.

In practice, you first build a searchable knowledge base by ingesting documents, splitting them into chunks, creating embeddings, and storing them in a vector index. When the question then arrives, the system does the same thing with the question, retrieves the most relevant text passages, and only after that lets the language model generate its answer.

That is also why RAG is so useful. It lets the language model work with the right information at the right moment, without changing the model's weights. At the same time, the technique has clear limitations, especially when questions require broad comparison, enumeration, or understanding relationships across many data points. That is where more advanced variants like Graph RAG and Agentic RAG become relevant.

If you want to build robust systems on documents and internal knowledge, RAG is still one of the most useful patterns to understand properly.