This article explains how large language models (LLMs) can use extra data to provide better answers. By default, an LLM only knows what it learned during training. You can add real-time or private data to make it more useful.
There are two main ways to add this extra data:

- Retrieval-augmented generation (RAG): the system retrieves relevant content from your own data at query time and supplies it to the model along with the user's prompt.
- Fine-tuning: the model is retrained on a smaller, domain-specific dataset.
The next sections break down both methods.
Understanding RAG
RAG enables the key "chat over my data" scenario. In this scenario, an organization has a potentially large corpus of textual content, like documents, documentation, and other proprietary data. It uses this corpus as the basis for answers to user prompts.
RAG lets you build chatbots that answer questions using your own documents. Here's how it works:
Start by building a vector data store. This store holds the embeddings for each document or chunk. The following diagram shows the main steps to create a vectorized index of your documents.
The diagram shows a data pipeline. This pipeline brings in data, processes it, and manages it for the system. It also prepares the data for storage in the vector database and makes sure it's in the right format for the LLM.
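The following Python sketch shows one way such an ingestion pipeline could look. The fixed-size chunking, the record layout, and the embed_text parameter are illustrative assumptions, not part of any specific product.

```python
# Minimal ingestion sketch: split documents into chunks and collect
# (chunk, embedding) records ready to load into a vector store.
# embed_text is a placeholder for whatever embedding call you use.

def chunk_text(text: str, max_chars: int = 1000) -> list[str]:
    """Naive fixed-size chunking; real pipelines usually split on
    sentences or tokens and add overlap between chunks."""
    return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]

def ingest(documents: dict[str, str], embed_text) -> list[dict]:
    records = []
    for doc_id, text in documents.items():
        for n, chunk in enumerate(chunk_text(text)):
            records.append({
                "id": f"{doc_id}-{n}",
                "content": chunk,
                "embedding": embed_text(chunk),  # vector for this chunk
            })
    return records
```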
Embeddings drive the whole process. An embedding is a set of numbers that represents the meaning of words, sentences, or documents so a machine learning model can use them.
One way to create an embedding is to send your content to the Azure OpenAI Embeddings API. The API returns a vector: a list of numbers. Each number describes something about the content, like its topic, meaning, grammar, or style.
All these numbers together show where the content sits in a multi-dimensional space. Imagine a 3D graph, but with hundreds or thousands of dimensions. Computers can work with this kind of space, even if we can't draw it.
For a step-by-step guide to using the Azure OpenAI Embeddings API to create embeddings for your documents, see Tutorial: Explore Azure OpenAI in Azure AI Foundry Models embeddings and document search.
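As a rough sketch, a call to the Embeddings API with the official openai Python package might look like the following. The endpoint, key, API version, and deployment name are placeholders for your own resource.

```python
import os
from openai import AzureOpenAI  # pip install openai

# Placeholder endpoint, key, and API version; substitute values from your resource.
client = AzureOpenAI(
    azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],
    api_key=os.environ["AZURE_OPENAI_API_KEY"],
    api_version="2024-02-01",  # use an API version your resource supports
)

response = client.embeddings.create(
    model="text-embedding-ada-002",  # your embeddings deployment name
    input="Contoso's return policy allows refunds within 30 days.",
)

vector = response.data[0].embedding  # a list of floats
print(len(vector))  # e.g., 1536 dimensions for this model family
```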
Storing the vector and content
The next step involves storing the vector and the content (or a pointer to the content's location) and other metadata in a vector database. A vector database is like any other type of database, but with two key differences:

- Its index keys are vectors rather than scalar values.
- It's optimized for finding the vectors most similar to a given vector (nearest-neighbor search).
With the corpus of documents stored in a vector database, developers can build a retriever component to retrieve documents that match the user's query. The system uses this data to supply the LLM with what it needs to answer the user's query.
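A simple in-memory stand-in can make the record shape concrete. Production systems use a dedicated vector database, but each entry still pairs a vector with the content (or a pointer to it) and metadata. The class and field names below are illustrative only.

```python
from dataclasses import dataclass, field

@dataclass
class VectorRecord:
    id: str
    embedding: list[float]          # the vector for this chunk
    content: str                    # the chunk text (or a pointer to it)
    metadata: dict = field(default_factory=dict)  # e.g., source, title, URL

class InMemoryVectorStore:
    """Toy stand-in for a vector database: it only stores records.
    A retriever component searches them (see the similarity sketch below)."""
    def __init__(self):
        self.records: list[VectorRecord] = []

    def add(self, record: VectorRecord) -> None:
        self.records.append(record)
```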
Answering queries by using your documents
A RAG system first uses semantic search to find articles that might be helpful to the LLM when it composes an answer. It then sends the matching articles, together with the user's original prompt, to the LLM to compose the answer.
The following diagram depicts a simple RAG implementation (sometimes called naive RAG):
In the diagram, a user submits a query. First, the system turns the user's prompt into an embedding. Then, it searches the vector database to find the documents or chunks that are most similar to the prompt.
Cosine similarity measures how close two vectors are by looking at the angle between them. A value near 1 means the vectors are very similar; a value near -1 means they're very different. This approach helps the system find documents with similar content.
Nearest neighbor algorithms find the vectors that are closest to a given point. The k-nearest neighbors (KNN) algorithm looks for the top k closest matches. Systems like recommendation engines often use KNN and cosine similarity together to find the best matches for a user's needs.
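The retrieval step can be sketched with plain NumPy: score every stored vector against the query vector by cosine similarity, then keep the k best matches. A real vector database uses an approximate nearest-neighbor index instead of this brute-force scan.

```python
import numpy as np

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """1.0 means the vectors point the same way; -1.0 means opposite."""
    a, b = np.asarray(a), np.asarray(b)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def top_k(query_embedding: list[float], records: list, k: int = 3) -> list:
    """Brute-force k-nearest neighbors by cosine similarity."""
    scored = [
        (cosine_similarity(query_embedding, r.embedding), r) for r in records
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [record for _, record in scored[:k]]
```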
After the search, send the best matching content and the user's prompt to the LLM so it can generate a more relevant response.
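Putting the pieces together, a naive RAG answer step might look like the following sketch. It reuses the client, store, and top_k helper from the earlier snippets, and the deployment names and prompt wording are assumptions for illustration.

```python
def answer(question: str, store, client, chat_deployment: str = "gpt-4o") -> str:
    # 1. Embed the user's question with the same model used for the documents.
    query_vec = client.embeddings.create(
        model="text-embedding-ada-002",  # your embeddings deployment name
        input=question,
    ).data[0].embedding

    # 2. Retrieve the most similar chunks from the vector store.
    matches = top_k(query_vec, store.records, k=3)
    context = "\n\n".join(record.content for record in matches)

    # 3. Send the retrieved context plus the original question to the LLM.
    completion = client.chat.completions.create(
        model=chat_deployment,  # your chat model deployment name
        messages=[
            {"role": "system",
             "content": "Answer using only the provided context.\n\n" + context},
            {"role": "user", "content": question},
        ],
    )
    return completion.choices[0].message.content
```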
Challenges and considerations
A RAG system comes with its own challenges.
Developers need to address these challenges to build RAG systems that are efficient, ethical, and valuable.
To learn more about building production-ready RAG systems, see Build advanced retrieval-augmented generation systems.
Want to try building a generative AI solution? Start with Get started with the chat using your own data sample for Python. Tutorials are also available for .NET, Java, and JavaScript.
Fine-tuning a model
Fine-tuning retrains an LLM on a smaller, domain-specific dataset after its initial training on a large, general dataset.
During pretraining, LLMs learn language structure, context, and general patterns from broad data. Fine-tuning teaches the model with new, focused data so it can perform better on specific tasks or topics. As it learns, the model updates its weights to handle the details of the new data.
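As a rough sketch, starting a fine-tuning job with the openai Python package might look like the following. It assumes the AzureOpenAI client from the earlier snippet, a resource and base model that support fine-tuning, and a hypothetical train.jsonl file of chat-formatted examples.

```python
# Sketch of launching a fine-tuning job; names and model are assumptions.
# train.jsonl holds chat-formatted examples, one JSON object per line:
# {"messages": [{"role": "user", "content": "..."}, {"role": "assistant", "content": "..."}]}

training_file = client.files.create(
    file=open("train.jsonl", "rb"),
    purpose="fine-tune",
)

job = client.fine_tuning.jobs.create(
    training_file=training_file.id,
    model="gpt-35-turbo",  # a base model that supports fine-tuning
)

print(job.id, job.status)  # poll the job until it finishes, then deploy the result
```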
Key benefits of fine-tuning
Fine-tuning can make a model noticeably better at the specific tasks and topics it was tuned for, but it also comes with some challenges.
Customize a model through fine-tuning explains how to fine-tune a model.
Fine-tuning vs. RAG
Fine-tuning and RAG both help LLMs work better, but each fits different needs. Pick the right approach based on your goals, the data and compute you have, and whether you want the model to specialize or stay general.
When to choose fine-tuning
Decide between fine-tuning and RAG based on what your app needs. Fine-tuning is best for specialized tasks, while RAG gives you flexibility and up-to-date content for dynamic scenarios.