Build Advanced Retrieval-Augmented Generation Systems

This article explains retrieval-augmented generation (RAG) and what developers need to build a production-ready RAG solution.

To learn about two ways to build a "chat over your data" app—one of the top generative AI use cases for businesses—see Augment LLMs with RAG or fine-tuning.

The following diagram shows the main steps of RAG:

This process is called naive RAG. It helps you understand the basic parts and roles in a RAG-based chat system.

Real-world RAG systems need more preprocessing and post-processing to handle articles, queries, and responses. The next diagram shows a more realistic setup, called advanced RAG:

This article gives you a simple framework to understand the main phases in a real-world RAG-based chat system: ingestion, the inference pipeline, and evaluation.

Ingestion

Ingestion means saving your organization's documents so you can quickly find answers for users. The main challenge is to find and use the parts of documents that best match each question. Most systems use vector embeddings and cosine similarity search to match questions to content. You get better results when you understand the content type (like patterns and format) and organize your data well in the vector database.
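As a rough illustration of how that matching works, the sketch below ranks stored chunks by cosine similarity to a query embedding. The chunk structure and tiny vectors are made up for the example; a real system uses an embedding model and a vector database rather than an in-memory list.

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def top_k_chunks(query_vector: list[float], indexed_chunks: list[dict], k: int = 3) -> list[dict]:
    """Rank stored chunks by similarity to the query embedding and keep the best k."""
    ranked = sorted(
        indexed_chunks,
        key=lambda chunk: cosine_similarity(query_vector, chunk["embedding"]),
        reverse=True,
    )
    return ranked[:k]

# Example with tiny made-up vectors; real embeddings have hundreds of dimensions.
chunks = [
    {"text": "Expense policy for travel", "embedding": [0.9, 0.1, 0.0]},
    {"text": "Onboarding checklist",      "embedding": [0.1, 0.8, 0.1]},
]
print(top_k_chunks([0.85, 0.15, 0.0], chunks, k=1)[0]["text"])
```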

When setting up ingestion, focus on content preprocessing and extraction, chunking strategy, chunking organization, alignment optimization, and update strategies.

Content preprocessing and extraction

The first step in the ingestion phase is to preprocess and extract the content from your documents. This step is crucial because it ensures that the text is clean, structured, and ready for indexing and retrieval.

Clean and accurate content makes a RAG-based chat system work better. Start by looking at the shape and style of the documents you want to index. Do they follow a set pattern, like documentation? If not, what questions could these documents answer?

At a minimum, set up your ingestion pipeline to:

Some of this information, like metadata, can help during retrieval and evaluation if you keep it with the document in the vector database. You can also combine it with the text chunk to improve the chunk's vector embedding.
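As a small sketch of what this cleaning step might look like, the helper below collapses whitespace in extracted text and keeps source metadata next to it. The function name and field names are illustrative, not part of any particular SDK.

```python
import re

def preprocess_document(raw_text: str, source_path: str, title: str) -> dict:
    """Clean extracted text and keep metadata alongside it for indexing."""
    text = re.sub(r"\s+", " ", raw_text).strip()  # collapse runs of whitespace
    return {
        "text": text,
        "metadata": {
            "source": source_path,  # lets the app cite the document at answer time
            "title": title,         # can be prepended to each chunk before embedding
        },
    }

doc = preprocess_document("  Refund \n\n policy for hardware...  ", "policies/refunds.md", "Refund policy")
print(doc["metadata"]["title"], "->", doc["text"])
```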

Chunking strategy

As a developer, decide how to break up large documents into smaller chunks. Chunking helps send the most relevant content to the LLM so it can answer user questions better. Also, think about how you'll use the chunks after you get them. Try out common industry methods and test your chunking strategy in your organization.
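A minimal sketch of one common approach, fixed-size chunks with overlap, is shown below. Production systems often split on sentence, paragraph, or heading boundaries and count tokens instead of characters, so treat this as a starting point rather than a recommendation.

```python
def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """Split text into overlapping character windows.

    Overlap keeps content that straddles a boundary present in both chunks,
    which reduces the chance of losing context at chunk edges.
    """
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        start += chunk_size - overlap
    return chunks
```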

When chunking, think about:

Chunking organization

In a RAG system, how you organize your data in the vector database makes it easier and faster to find the right information. Here are some ways to set up your indexes and searches:

Alignment optimization

Make retrieved chunks more relevant and accurate by matching them to the types of questions they answer. One way is to create a sample question for each chunk that shows what question it answers best. This approach helps in several ways:

Each chunk’s sample question acts as a label that guides the retrieval algorithm. The search becomes more focused and aware of context. This method works well when chunks cover many different topics or types of information.
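The sketch below shows one way this could be wired up: an LLM call generates a sample question for a chunk, and the question's embedding is stored for retrieval. The generate_question and embed callables are placeholders for your own model calls.

```python
def build_alignment_record(chunk_text: str, generate_question, embed) -> dict:
    """Attach an LLM-generated sample question to a chunk and embed that question.

    At query time, the user's question is compared against the sample-question
    embedding, which is often a closer match than the raw chunk text.
    """
    sample_question = generate_question(
        f"Write one question that the following passage answers:\n\n{chunk_text}"
    )
    return {
        "text": chunk_text,
        "sample_question": sample_question,
        "embedding": embed(sample_question),
    }
```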

Update strategies

If your organization updates documents often, you need to keep your database current so the retriever can always find the latest information. The retriever component is the part of the system that searches the vector database and returns results. Here are some ways to keep your vector database up to date:

Pick the update strategy or mix that fits your needs. Think about:

Review these factors for your application. Each method has trade-offs in complexity, cost, and how quickly updates show up.
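As one example of an incremental update strategy, the sketch below re-embeds a document only when its content hash changes. In practice the stored hash would live in your vector database rather than an in-memory dictionary; the structure here is an assumption for illustration.

```python
import hashlib

def needs_reindex(doc_id: str, new_text: str, index: dict) -> bool:
    """Return True when a document is new or its content changed since indexing.

    `index` maps document IDs to {"hash": ..., "chunks": [...]}.
    """
    new_hash = hashlib.sha256(new_text.encode("utf-8")).hexdigest()
    existing = index.get(doc_id)
    return existing is None or existing["hash"] != new_hash
```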

Inference pipeline

Your articles are now chunked, vectorized, and stored in a vector database. Next, focus on getting the best answers from your system.

To get accurate and fast results, think about these key questions:

The whole inference pipeline works in real time. There’s no single right way to set up your preprocessing and post-processing steps. You use a mix of code and LLM calls. One of the biggest trade-offs is balancing accuracy and compliance with cost and speed.

Let’s look at strategies for each stage of the inference pipeline.

Query preprocessing steps

Query preprocessing starts right after the user sends a question. The following steps help make sure the question fits your system and is ready to find the best article chunks by using cosine similarity or "nearest neighbor" search.

Policy check: Use logic to spot and remove or flag unwanted content, like personal data, bad language, or attempts to break safety rules (called "jailbreaking").

Query rewriting: Change the question if needed—expand acronyms, remove slang, or rephrase it to focus on bigger ideas (step-back prompting).

A special version of step-back prompting is Hypothetical Document Embeddings (HyDE). HyDE has the LLM answer the question, makes an embedding from that answer, and then searches the vector database with it.
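A minimal sketch of HyDE, assuming llm, embed, and search are placeholders for your own model and vector store calls:

```python
def hyde_search(question: str, llm, embed, search) -> list[dict]:
    """Hypothetical Document Embeddings (HyDE).

    The hypothetical answer is only used to build the search vector; it's
    never shown to the user.
    """
    hypothetical_answer = llm(f"Write a short passage that answers: {question}")
    return search(embed(hypothetical_answer))
```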

Subqueries

Subqueries break a long or complex question into smaller, easier questions. The system answers each small question, then combines the answers.

For example, if someone asks, "Who made more important contributions to modern physics, Albert Einstein or Niels Bohr?" you can split it into one subquestion about each scientist's key contributions. The system answers each subquestion from the retrieved content, then asks follow-up questions about how influential each contribution was. Finally, it combines the answers to give a full response to the original question. This method makes complex questions easier to answer by breaking them into clear, smaller parts.
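A minimal sketch of the subquery flow, assuming decompose, answer_one, and combine are placeholders (typically LLM calls, with answer_one running the normal retrieval step for each subquestion):

```python
def answer_with_subqueries(question: str, decompose, answer_one, combine) -> str:
    """Break a complex question into subquestions, answer each, then synthesize."""
    subquestions = decompose(question)            # e.g. one question per scientist
    partial_answers = [answer_one(q) for q in subquestions]
    return combine(question, subquestions, partial_answers)
```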

Query router

Sometimes, your content lives in several databases or search systems. In these cases, use a query router. A query router picks the best database or index to answer each question.

A query router works after the user asks a question but before the system searches for answers.

Here’s how a query router works:

  1. Query analysis: The LLM or another tool looks at the question to figure out what kind of answer is needed.
  2. Index selection: The router picks one or more indexes that fit the question. Some indexes are better for facts, others for opinions or special topics.
  3. Query dispatch: The router sends the question to the chosen index or indexes.
  4. Results aggregation: The system collects and combines the answers from the indexes.
  5. Answer generation: The system creates a clear answer using the information it found.

Use different indexes or search engines for:

For example, a medical advice system might have one index of detailed research papers and another of general health information. If someone asks about the effects of a new drug, the router sends the question to the research paper index. If the question is about common symptoms, it uses the general health index for a simple answer.
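The routing step itself can be very small. In the sketch below, classify is a placeholder for an LLM or lightweight classifier, and the index names mirror the hypothetical medical example above:

```python
def route_query(question: str, classify, indexes: dict, default: str = "general_health"):
    """Send a question to the index best suited to answer it.

    `indexes` maps names (for example, "research_papers" and "general_health")
    to search callables.
    """
    index_name = classify(question)
    search = indexes.get(index_name, indexes[default])  # fall back to a default index
    return index_name, search(question)
```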

Post-retrieval processing steps

Post-retrieval processing happens after the system retrieves content chunks from the vector database. Before sending the chunks to the LLM, check whether they're actually useful additions to the prompt.

Keep these things in mind:

Watch out for the needle in a haystack problem: LLMs often pay more attention to the start and end of a prompt than to the middle (sometimes called the "lost in the middle" effect).

Also, remember the LLM’s maximum context window and the number of tokens needed for long prompts, especially at scale.

To handle these issues, use a post-retrieval processing pipeline with steps like:
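For example, one such step is trimming the retrieved chunks to a token budget. The sketch below assumes the chunks are already sorted by relevance and uses a rough characters-per-token estimate; swap in a real tokenizer for production.

```python
def fit_to_budget(chunks: list[dict], max_tokens: int, reserved_tokens: int = 1000) -> list[dict]:
    """Keep the highest-ranked chunks that fit in the model's context window.

    `reserved_tokens` leaves room for the system prompt, the question, and the answer.
    """
    budget = max_tokens - reserved_tokens
    selected, used = [], 0
    for chunk in chunks:
        cost = len(chunk["text"]) // 4 + 1   # rough 4-characters-per-token estimate
        if used + cost > budget:
            break
        selected.append(chunk)
        used += cost
    return selected
```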

Post-completion processing steps

Post-completion processing happens after the user's question and the retrieved content chunks go to the LLM and it returns an answer.

After the LLM gives an answer, check its accuracy. A post-completion processing pipeline can include:
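One such check is a simple groundedness test that asks a separate LLM call whether the answer is supported by the retrieved sources. The judge callable below is a placeholder for that call; production systems often also report which sentences were unsupported.

```python
def check_groundedness(answer: str, source_chunks: list[str], judge) -> bool:
    """Verify that the answer is supported by the retrieved source chunks."""
    sources = "\n\n".join(source_chunks)
    verdict = judge(
        "Is every claim in the ANSWER supported by the SOURCES? Reply yes or no.\n"
        f"SOURCES:\n{sources}\n\nANSWER:\n{answer}"
    )
    return verdict.strip().lower().startswith("yes")
```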

Evaluation

Evaluating a system like this is more complex than running regular unit or integration tests. Think about these questions:

Capturing and acting on feedback from users

Work with your organization's privacy team to design feedback capture tools, system data, and logging for forensics and root cause analysis of a query session.

The next step is to build an assessment pipeline. An assessment pipeline makes it easier and faster to review feedback and find out why the AI gave certain answers. Check every response to see how the AI produced it, whether the right content chunks were used, and how the documents were split up.

Also, look for extra preprocessing or post-processing steps that could improve results. This close review often finds content gaps, especially when no good documentation exists for a user's question.

You need an assessment pipeline to handle these tasks at scale. A good pipeline uses custom tools to measure answer quality. It helps you see why the AI gave a specific answer, which documents it used, and how well the inference pipeline worked.

Golden dataset

One way to check how well a RAG chat system works is to use a golden dataset. A golden dataset is a set of questions with approved answers, helpful metadata (like topic and question type), links to source documents, and different ways users might ask the same thing.

A golden dataset shows the "best case scenario." Developers use it to see how well the system works and to run tests when they add new features or updates.

Assessing harm

Harms modeling helps you spot possible risks in a product and plan ways to reduce them.

A harms assessment tool should include these key features:

These features help you find and fix risks, and they also help you build more ethical and responsible AI by thinking about all possible impacts from the start.

For more information, see these articles:

Testing and verifying the safeguards

Red-teaming is key: it means acting like an attacker to find weak spots in the system. This step is especially important for stopping jailbreaking. For tips on planning and managing red teaming for responsible AI, see Planning red teaming for large language models (LLMs) and their applications.

Developers should test RAG system safeguards in different scenarios to make sure they work. This step makes the system stronger and also helps fine-tune responses to follow ethical standards and rules.

Final considerations for application design

Here are some key things to remember from this article that can help you design your app:

To build a generative AI app, check out Get started with chat by using your own data sample for Python. The tutorial is also available for .NET, Java, and JavaScript.

