This article explains retrieval-augmented generation (RAG) and what developers need to build a production-ready RAG solution.
To learn about two ways to build a "chat over your data" app, one of the top generative AI use cases for businesses, see Augment LLMs with RAG or fine-tuning.
The following diagram shows the main steps of RAG:
This process is called naive RAG. It helps you understand the basic parts and roles in a RAG-based chat system.
Real-world RAG systems need more preprocessing and post-processing to handle articles, queries, and responses. The next diagram shows a more realistic setup, called advanced RAG:
This article gives you a simple framework for understanding the main phases of a real-world RAG-based chat system: ingestion, inference, and evaluation.
Ingestion means saving your organization's documents so you can quickly find answers for users. The main challenge is to find and use the parts of documents that best match each question. Most systems use vector embeddings and cosine similarity search to match questions to content. You get better results when you understand the content type (like patterns and format) and organize your data well in the vector database.
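As a rough sketch of that matching step, the following Python example ranks chunks against a question by cosine similarity. The `embed` function here is a deliberately toy stand-in; in a real system you would call your embedding model and let the vector database run the similarity search.

```python
import hashlib
import math

def embed(text: str, dims: int = 64) -> list[float]:
    # Toy embedding for illustration only: hash each word into a fixed-size
    # vector. In a real system, call your embedding model instead.
    vec = [0.0] * dims
    for word in text.lower().split():
        idx = int(hashlib.md5(word.encode()).hexdigest(), 16) % dims
        vec[idx] += 1.0
    return vec

def cosine_similarity(a: list[float], b: list[float]) -> float:
    # Cosine similarity = dot(a, b) / (|a| * |b|)
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a)) or 1.0
    norm_b = math.sqrt(sum(y * y for y in b)) or 1.0
    return dot / (norm_a * norm_b)

def top_chunks(question: str, chunks: list[str], k: int = 3) -> list[str]:
    # Rank every chunk by its similarity to the question and keep the top k.
    q_vec = embed(question)
    ranked = sorted(chunks, key=lambda c: cosine_similarity(q_vec, embed(c)), reverse=True)
    return ranked[:k]
```

In production, the vector database performs this search for you; the sketch only shows the math behind the ranking.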
When setting up ingestion, focus on these steps:
The first step in the ingestion phase is to preprocess and extract the content from your documents. This step is crucial because it ensures that the text is clean, structured, and ready for indexing and retrieval.
Clean and accurate content makes a RAG-based chat system work better. Start by looking at the shape and style of the documents you want to index. Do they follow a set pattern, like documentation? If not, what questions could these documents answer?
At a minimum, set up your ingestion pipeline to:
Some of this information, like metadata, can help during retrieval and evaluation if you keep it with the document in the vector database. You can also combine it with the text chunk to improve the chunk's vector embedding.
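As an illustration, a stored chunk might look like the following record. The field names are assumptions for this sketch, not a required schema; match them to whatever your vector database expects.

```python
from dataclasses import dataclass, field

@dataclass
class ChunkRecord:
    # One row as it might be stored in the vector database.
    chunk_id: str
    text: str
    embedding: list[float]
    metadata: dict = field(default_factory=dict)  # title, author, url, section, date, ...

record = ChunkRecord(
    chunk_id="handbook-0042",
    text="Employees accrue 1.5 vacation days per month of service.",
    embedding=[],  # populate with the embedding of the text (optionally text + metadata)
    metadata={"source": "employee-handbook.pdf", "section": "Leave policy"},
)
```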
Chunking strategy
As a developer, decide how to break up large documents into smaller chunks. Chunking helps send the most relevant content to the LLM so it can answer user questions better. Also, think about how you'll use the chunks after you retrieve them. Try out common industry methods and test your chunking strategy in your organization.
When chunking, think about the size of each chunk, how much overlap to keep between chunks, and whether to split on natural boundaries such as sentences, paragraphs, or sections.
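The sketch below shows one common approach, fixed-size chunking with overlap, so text that straddles a boundary still appears whole in at least one chunk. The sizes are arbitrary examples, not recommendations.

```python
def chunk_text(text: str, chunk_size: int = 1000, overlap: int = 200) -> list[str]:
    # Fixed-size chunking with overlap. chunk_size and overlap are measured in
    # characters here; many systems measure in tokens instead.
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        start += chunk_size - overlap
    return chunks
```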
In a RAG system, how you organize your data in the vector database makes it easier and faster to find the right information. Here are some ways to set up your indexes and searches:
Make retrieved chunks more relevant and accurate by matching them to the types of questions they answer. One way is to create a sample question for each chunk that shows what question it answers best. This approach helps in several ways:
Each chunk's sample question acts as a label that guides the retrieval algorithm. The search becomes more focused and aware of context. This method works well when chunks cover many different topics or types of information.
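A minimal sketch of that idea, assuming you supply your own LLM and embedding clients as callables, might look like this:

```python
from typing import Callable

def index_chunk_with_question(
    chunk_id: str,
    chunk: str,
    ask_llm: Callable[[str], str],           # your LLM client
    embed_fn: Callable[[str], list[float]],  # your embedding model
    store: list[dict],
) -> None:
    # Ask the LLM for the single question this chunk answers best, then index
    # the question's embedding alongside the chunk text.
    prompt = f"Write the one question that the following passage answers best:\n\n{chunk}"
    question = ask_llm(prompt)
    store.append({
        "chunk_id": chunk_id,
        "text": chunk,
        "sample_question": question,
        # Embedding the question (instead of, or in addition to, the raw text)
        # moves the chunk closer in vector space to the questions users ask.
        "embedding": embed_fn(question),
    })
```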
Update strategies
If your organization updates documents often, you need to keep your database current so the retriever can always find the latest information. The retriever component is the part of the system that searches the vector database and returns results. Common ways to keep your vector database up to date include incremental updates, partial updates, versioning, real-time updates, and optimization techniques.
Pick the update strategy, or mix of strategies, that fits your needs. Each method has trade-offs in complexity, cost, and how quickly updates show up, so weigh these factors against your application's requirements.
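As one example of an incremental update, the sketch below re-ingests only documents whose content hash changed since the last run. The `reingest` callable is a placeholder for your own chunk, embed, and upsert logic.

```python
import hashlib
from typing import Callable

def incremental_update(
    documents: dict[str, str],             # doc_id -> current document text
    stored_hashes: dict[str, str],         # doc_id -> hash recorded at last ingestion
    reingest: Callable[[str, str], None],  # re-chunks, re-embeds, and upserts one document
) -> None:
    # Re-process only documents whose content has changed since the last run.
    for doc_id, text in documents.items():
        digest = hashlib.sha256(text.encode()).hexdigest()
        if stored_hashes.get(doc_id) != digest:
            reingest(doc_id, text)
            stored_hashes[doc_id] = digest
```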
Inference pipeline
Your articles are now chunked, vectorized, and stored in a vector database. Next, focus on getting the best answers from your system.
To get accurate and fast results, think about these key questions:
The whole inference pipeline works in real time. There's no single right way to set up your preprocessing and post-processing steps. You use a mix of code and LLM calls. One of the biggest trade-offs is balancing accuracy and compliance with cost and speed.
Let's look at strategies for each stage of the inference pipeline.
Query preprocessing steps
Query preprocessing starts right after the user sends a question:
Policy check: Use logic to spot and remove or flag unwanted content, like personal data, bad language, or attempts to break safety rules (called "jailbreaking").
Query rewriting: Change the question if needed, for example by expanding acronyms, removing slang, or rephrasing it to focus on bigger ideas (step-back prompting).
These steps help make sure the user's question fits your system and is ready to find the best article chunks using cosine similarity or "nearest neighbor" search.
A special version of step-back prompting is Hypothetical Document Embeddings (HyDE). With HyDE, the LLM generates a hypothetical answer to the question, the system creates an embedding from that answer, and then it searches the vector database with that embedding.
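A minimal HyDE sketch, with the LLM, embedding model, and vector store passed in as placeholder callables, could look like this:

```python
from typing import Callable

def hyde_search(
    question: str,
    ask_llm: Callable[[str], str],                            # your LLM client
    embed_fn: Callable[[str], list[float]],                   # your embedding model
    vector_search: Callable[[list[float], int], list[str]],   # your vector store query
    k: int = 5,
) -> list[str]:
    # 1. Ask the LLM for a hypothetical answer. It may contain inaccuracies;
    #    only its wording and vocabulary matter here.
    hypothetical = ask_llm(f"Write a short passage that answers: {question}")
    # 2. Embed the hypothetical answer instead of the raw question.
    vec = embed_fn(hypothetical)
    # 3. Retrieve real chunks that sit near that answer in vector space.
    return vector_search(vec, k)
```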
Subqueries
Subqueries break a long or complex question into smaller, easier questions. The system answers each small question, then combines the answers.
For example, if someone asks, "Who made more important contributions to modern physics, Albert Einstein or Niels Bohr?" you can split it into one subquestion about each scientist's contributions, such as Einstein's theory of relativity and work on the photoelectric effect, and Bohr's model of the atom and contributions to quantum theory. You can then ask follow-up subquestions about how each scientist's work shaped later physics.
The system combines the answers to give a full response to the original question. This method makes complex questions easier to answer by breaking them into clear, smaller parts.
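An orchestration sketch for subqueries might look like the following, where the decomposition, per-subquestion RAG pass, and final synthesis are all placeholder callables you implement with your own LLM and retriever:

```python
from typing import Callable

def answer_with_subqueries(
    question: str,
    decompose: Callable[[str], list[str]],    # LLM call: split into subquestions
    answer_one: Callable[[str], str],         # full RAG pass for a single subquestion
    combine: Callable[[str, list[str]], str], # LLM call: merge partial answers
) -> str:
    # Break the complex question into simpler ones, answer each with its own
    # retrieval pass, then ask the LLM to synthesize a final response.
    subquestions = decompose(question)
    partial_answers = [answer_one(sq) for sq in subquestions]
    return combine(question, partial_answers)
```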
Query router
Sometimes, your content lives in several databases or search systems. In these cases, use a query router. A query router picks the best database or index to answer each question.
A query router works after the user asks a question but before the system searches for answers.
Here's how a query router works: it interprets the user's question, picks the index or database most likely to contain the answer, and sends the query there.
Use different indexes or search engines for different kinds of content. For example, in a medical advice system, you might have a general health index for everyday questions and a research paper index for detailed scientific content. If someone asks about the effects of a new drug, the router sends the question to the research paper index. If the question is about common symptoms, it uses the general health index for a simple answer.
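A simple routing sketch, assuming hypothetical index names for the medical example above and an LLM (or small classifier) call passed in as `classify`, might look like this:

```python
from typing import Callable

# Hypothetical index names and descriptions for the medical example above.
INDEXES = {
    "general_health": "Everyday health questions, symptoms, and self-care guidance.",
    "research_papers": "Peer-reviewed studies, clinical trials, and drug research.",
}

def route_query(question: str, classify: Callable[[str], str]) -> str:
    # classify is an LLM (or small classifier) call that returns one index name
    # given the question and the index descriptions.
    prompt = (
        "Pick the best index for this question.\n"
        + "\n".join(f"- {name}: {desc}" for name, desc in INDEXES.items())
        + f"\nQuestion: {question}\nAnswer with the index name only."
    )
    choice = classify(prompt).strip()
    return choice if choice in INDEXES else "general_health"  # safe fallback
```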
Post-retrieval processing steps
Post-retrieval processing happens after the system finds content chunks in the vector database.
Next, check if these chunks are useful for the LLM prompt before sending them to the LLM.
Keep these things in mind:
Watch out for the "lost in the middle" problem: LLMs often pay more attention to the start and end of a prompt than to the middle.
Also, remember the LLMâs maximum context window and the number of tokens needed for long prompts, especially at scale.
To handle these issues, use a post-retrieval processing pipeline with steps like filtering out low-relevance chunks, re-ranking the remaining chunks, and trimming or summarizing content to fit the context window.
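A minimal post-retrieval sketch, assuming the retriever returns (score, chunk) pairs and using illustrative threshold and limit values, might filter and reorder chunks like this, placing the strongest chunks at the start and end of the prompt to counter the lost in the middle effect:

```python
def prepare_context(scored_chunks: list[tuple[float, str]],
                    min_score: float = 0.75,
                    max_chunks: int = 8) -> list[str]:
    # scored_chunks: (similarity score, chunk text) pairs from the retriever.
    # 1. Drop chunks below a relevance threshold.
    kept = [c for c in scored_chunks if c[0] >= min_score]
    # 2. Keep only the strongest chunks to respect the token budget.
    kept = sorted(kept, key=lambda c: c[0], reverse=True)[:max_chunks]
    # 3. Interleave so the highest-scoring chunks land at the start and end of
    #    the prompt, where the model pays the most attention.
    front, back = [], []
    for i, (_, text) in enumerate(kept):
        (front if i % 2 == 0 else back).append(text)
    return front + back[::-1]
```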
Post-completion processing happens after the user's question and all content chunks go to the LLM.
After the LLM gives an answer, check its accuracy. A post-completion processing pipeline can include steps such as verifying that the answer is grounded in the retrieved chunks and checking the response against your content and safety policies.
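One hedged way to check grounding is an LLM-as-judge call like the sketch below; `ask_llm` is a placeholder for your own model client, and a production system would likely use a more robust groundedness evaluator than a single yes/no prompt.

```python
from typing import Callable

def check_groundedness(
    answer: str,
    chunks: list[str],
    ask_llm: Callable[[str], str],   # your LLM client
) -> bool:
    # Ask a second LLM call to act as a judge: is every claim in the answer
    # supported by the retrieved chunks?
    prompt = (
        "Context:\n" + "\n---\n".join(chunks)
        + f"\n\nAnswer:\n{answer}\n\n"
        "Is every factual claim in the answer supported by the context? "
        "Reply with exactly YES or NO."
    )
    return ask_llm(prompt).strip().upper().startswith("YES")
```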
Evaluating a system like this is more complex than running regular unit or integration tests. Think about these questions:
Work with your organization's privacy team to design feedback capture tools, system data, and logging for forensics and root cause analysis of a query session.
The next step is to build an assessment pipeline. An assessment pipeline makes it easier and faster to review feedback and find out why the AI gave certain answers. Check every response to see how the AI produced it, if the right content chunks were used, and how the documents were split up.
Also, look for extra preprocessing or post-processing steps that could improve results. This close review often finds content gaps, especially when no good documentation exists for a user's question.
You need an assessment pipeline to handle these tasks at scale. A good pipeline uses custom tools to measure answer quality. It helps you see why the AI gave a specific answer, which documents it used, and how well the inference pipeline worked.
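As a sketch of what such a pipeline might capture, each query session could be logged as a trace record like the following; the field names are illustrative, and what you actually store should follow your privacy team's guidance.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class QueryTrace:
    # One record per question, written to whatever store your privacy team approves.
    question: str
    rewritten_query: str
    retrieved_chunk_ids: list[str]
    final_prompt: str
    answer: str
    groundedness_passed: bool
    user_feedback: str | None = None
    timestamp: str = field(default_factory=lambda: datetime.now(timezone.utc).isoformat())
```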
Golden dataset
One way to check how well a RAG chat system works is to use a golden dataset. A golden dataset is a set of questions with approved answers, helpful metadata (like topic and question type), links to source documents, and different ways users might ask the same thing.
A golden dataset shows the "best case scenario." Developers use it to see how well the system works and to run tests when they add new features or updates.
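A minimal evaluation loop over a golden dataset might look like the sketch below, where `run_pipeline` and `grade_answer` are placeholders for your own inference pipeline and scoring method (for example, an LLM-as-judge or a similarity metric):

```python
from typing import Callable

def evaluate_against_golden_dataset(
    golden: list[dict],                                     # each: question, approved_answer, source_chunk_ids
    run_pipeline: Callable[[str], tuple[str, list[str]]],   # returns (answer, retrieved chunk ids)
    grade_answer: Callable[[str, str, str], float],         # (question, approved, actual) -> score 0..1
) -> dict:
    retrieval_hits, answer_scores = 0, []
    for item in golden:
        answer, retrieved_ids = run_pipeline(item["question"])
        # Retrieval check: did we surface at least one expected source chunk?
        if set(retrieved_ids) & set(item["source_chunk_ids"]):
            retrieval_hits += 1
        # Answer check: compare the generated answer to the approved one.
        answer_scores.append(grade_answer(item["question"], item["approved_answer"], answer))
    return {
        "retrieval_hit_rate": retrieval_hits / len(golden),
        "mean_answer_score": sum(answer_scores) / len(answer_scores),
    }
```

Run this loop whenever you change chunking, retrieval, or prompts so regressions show up before users see them.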
Assessing harm
Harms modeling helps you spot possible risks in a product and plan ways to reduce them.
A harms assessment tool should include these key features:
These features help you find and fix risks, and they also help you build more ethical and responsible AI by thinking about all possible impacts from the start.
For more information, see these articles:
Testing and verifying the safeguards
Red-teaming is key: it means acting like an attacker to find weak spots in the system. This step is especially important for preventing jailbreaking. For tips on planning and managing red teaming for responsible AI, see Planning red teaming for large language models (LLMs) and their applications.
Developers should test RAG system safeguards in different scenarios to make sure they work. This step makes the system stronger and also helps fine-tune responses to follow ethical standards and rules.
Final considerations for application design
Here are some key things to remember from this article that can help you design your app:
To build a generative AI app, check out Get started with chat by using your own data sample for Python. The tutorial is also available for .NET, Java, and JavaScript.