Introduction
Large language models (LLMs) can perform many different tasks with text, audio, images, and even video, which makes them multimodal. With all of these capabilities comes the concern that an LLM might not always stick to the goal you built it for, which is usually a specific task related to your business, such as responding to customers' inquiries or helping your employees find answers to their questions. Modifying the system message (meta prompt) with prompt engineering helps, but it still leaves you with nondeterministic, randomly generated responses. To mitigate this, you can use retrieval-augmented generation (RAG), which we will explain in this blog.
Why Use RAG?
The top reasons why you'd consider using the RAG technique are:
- Provide grounding and context for the large language model.
  The model is no longer responding purely from its training data (original data); instead, it follows your instructions by answering from a retrieved subset of your data.
- Overcome the outdated training data limitation.
  Generative pre-trained (GPT) models are trained on data collected over a specific period of time; after that period, the model has no way of knowing about newer data unless you somehow provide it. The key here is understanding the word "pre-trained".
- Lower starting cost compared to other solutions like fine-tuning.
  Fine-tuning means retraining part of the large language model on your own training data before using it. It is a completely different approach from RAG and may cost more, as retraining the model requires more computing power. See this video to learn more about when to fine-tune.
- Supercharge data retrieval with a powerful generative model.
  This approach supports building a powerful data retrieval system: you retrieve your data the way you normally would, then send it to the large language model to rephrase or format it into whatever form and shape you'd like.
What is RAG?
RAG means intelligently retrieving a subset of data from your data stores to provide specific, contextual knowledge to the large language model and support how it answers a user's prompt (question or query).
Intelligent retrieval is crucial: the model responds only based on the retrieved data, so if that data is bad, the model will give irrelevant responses.
You will often see RAG used with vector databases, but the technique is not limited to them; you can also use RAG with your pre-existing SQL or NoSQL database. Just pass the query results to a large language model, and it will rephrase them into text-based responses that closely resemble human language and structure.
Why are Vector Databases commonly used with RAG?
A user's question is written in natural language and carries context and semantics about what they mean. Applying a full-text search to that query strips all of this away and simply matches the text of the question, fully or partially, against the text in your database. That leaves no room for synonyms or for semantically matching what the user means with your database records; the query has to contain the exact words from the database to pull the right records. This is where the vector database provides that missing component.
Vector databases store vector embeddings, which capture the semantic meaning of words. A vector embedding is a vector (or array) of numbers that represents real-world words.
(Artwork by: sfoteini)
You can use any embedding model (like text-embedding-ada-002) to generate this array of numbers, but note that different embedding models generate vectors of different lengths, so you need to make sure that all of your embeddings have the same length and that your vector database is configured for that length as well.
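As a rough illustration, here is how you might generate an embedding with the openai Python SDK against an Azure OpenAI deployment; the endpoint, key, and deployment name below are placeholders for your own values.

```python
# Minimal sketch: generate one embedding with Azure OpenAI (placeholder credentials).
from openai import AzureOpenAI

client = AzureOpenAI(
    api_key="<your-api-key>",
    api_version="2024-02-01",
    azure_endpoint="https://<your-resource>.openai.azure.com",
)

response = client.embeddings.create(
    model="text-embedding-ada-002",  # the name of your embeddings deployment
    input="Grilled salmon with lemon butter",
)

vector = response.data[0].embedding
print(len(vector))  # text-embedding-ada-002 produces 1536 numbers per input
```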
How to do RAG?
There are two ways to do RAG:
- With a preexisting SQL or NoSQL database.
- With a vector database.
Preexisting database:
- Create a new Azure OpenAI Chat deployment. (see guide for creating a ChatGPT deployment here)
- Get the Model API Key and Endpoint to integrate it into your application.
- Feed the returned records from your database to the model.
- Return the model's response to the user.
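Here is a minimal sketch of these steps, assuming the openai Python SDK and a small SQLite table standing in for your existing database; the endpoint, key, deployment name, database file, and table are placeholders.

```python
# Minimal sketch: RAG over a preexisting relational database (placeholders throughout).
import sqlite3

from openai import AzureOpenAI

client = AzureOpenAI(
    api_key="<your-api-key>",
    api_version="2024-02-01",
    azure_endpoint="https://<your-resource>.openai.azure.com",
)

db = sqlite3.connect("shop.db")  # placeholder for your existing database

def answer(question: str, category: str) -> str:
    # Query the database the way you normally would (here, a simple SQL filter).
    rows = db.execute(
        "SELECT name, description, price FROM products WHERE category = ?",
        (category,),
    ).fetchall()
    records = "\n".join(f"{name}: {description} ({price})" for name, description, price in rows)

    # Feed the returned records to the chat model and return its response to the user.
    response = client.chat.completions.create(
        model="gpt-35-turbo",  # the name of your chat deployment
        messages=[
            {"role": "system", "content": "Answer the question using only these records:\n" + records},
            {"role": "user", "content": question},
        ],
    )
    return response.choices[0].message.content
```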
Vector database:
- Choose a vector database. (Currently, not all databases support storing vectors, so you might need to migrate to a database that supports vector storage and vector operations.)
- Create a new Azure OpenAI Embeddings deployment. (see guide for creating an embedding deployment here)
- Get the Model API Key and Endpoint to use it.
- Choose the columns/keys that you want to convert to vector embeddings.
- Make API calls to generate your embeddings and store them in your database.
- Create a new Azure OpenAI Chat deployment. (see guide for creating a ChatGPT deployment here)
- Get the Model API Key and Endpoint to integrate it into your application.
- When a user asks a question, convert it to a vector embedding and compare its similarity with the embeddings you have in the database (this comparison now captures relationships, patterns, and the meaning of words).
- Feed the returned records from your database to the model.
- Return the model's response to the user.
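Here is a minimal sketch of the vector flow, kept in memory for illustration; in a real application, the stored embeddings and the similarity search would live in your vector database. The endpoint, key, deployment names, and sample documents are placeholders.

```python
# Minimal sketch: embed documents and a question, rank by cosine similarity, then answer.
import numpy as np

from openai import AzureOpenAI

client = AzureOpenAI(
    api_key="<your-api-key>",
    api_version="2024-02-01",
    azure_endpoint="https://<your-resource>.openai.azure.com",
)

def embed(text: str) -> np.ndarray:
    result = client.embeddings.create(model="text-embedding-ada-002", input=text)
    return np.array(result.data[0].embedding)

# Embed the chosen columns/keys once and store the vectors (here, a plain list).
documents = [
    "Margherita pizza with fresh basil",
    "Vegan lentil soup",
    "Grilled salmon with lemon butter",
]
doc_vectors = [embed(d) for d in documents]

def answer(question: str) -> str:
    # Embed the question and rank the stored vectors by cosine similarity.
    q = embed(question)
    scores = [np.dot(q, v) / (np.linalg.norm(q) * np.linalg.norm(v)) for v in doc_vectors]
    best_match = documents[int(np.argmax(scores))]

    # Feed the best-matching record to the chat model and return its response to the user.
    response = client.chat.completions.create(
        model="gpt-35-turbo",  # the name of your chat deployment
        messages=[
            {"role": "system", "content": "Answer the question using only this context:\n" + best_match},
            {"role": "user", "content": question},
        ],
    )
    return response.choices[0].message.content
```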
This GitHub sample demonstrates the difference between full-text, vector-only, and RAG search using Azure Cosmos DB for MongoDB vCore and LangChain; it is available here: Azure-Samples/Cosmic-Food-RAG-app.
You can try it to test the three approaches and see what these techniques can add to your business.
Conclusion
RAG can streamline your recommendation system and take it to the next level with the power of human-like responses and semantic similarity, understanding what your users actually need. You can start by adding a large language model layer between your database and the user to see the difference it makes before migrating your entire database to a vector database. It all depends on your use case and the nature of the data you have. Start by testing a small subset of your data first, then migrate if the results look good.
Further Reading
- What is Azure OpenAI Service? - Azure AI services
- Azure OpenAI Service embeddings - Azure OpenAI - embeddings and cosine similarity
- Vector Search - Azure Cosmos DB for MongoDB vCore
- Advanced Prompt engineering techniques
- 18 Lessons, Get Started Building with Generative AI
- RAG and generative AI - Azure AI Search
- RAG techniques: Cleaning user questions with an LLM
- RAG techniques: Function calling for more structured retrieval
Found this useful? Share it with others and follow me to get updates on:
- Twitter (twitter.com/john00isaac)
- LinkedIn (linkedin.com/in/john0isaac)
Feel free to share your comments and/or inquiries in the comment section below.
See you in future blogs!