Building Retrieval Augmented Generation on VSCode & AI Toolkit

vinayakh · Sep 17, 2024

Retrieval Augmented Generation (RAG) on VS Code AI Toolkit:

AI Toolkit lets users quickly deploy models locally or in the cloud, test and integrate them via a user-friendly playground or REST API, fine-tune models for specific requirements, and deploy AI-powered features either in the cloud or embedded within device applications.

In the previous blogs, we learnt how to get started with AI Toolkit by installing it and creating a basic application. If this is your first time using the VS Code AI Toolkit, please refer to those blogs for detailed insights and updates.

Visual Studio Code AI Toolkit: Run LLMs locally

Visual Studio AI Toolkit: Building GenAI Applications

Retrieval Augmented Generation (RAG):

LLMs are trained on large datasets drawn from many domains. When you work with an LLM, some information may be specific to your own dataset or domain, and the model may not know enough about it. Generated text may also need to cite sources appropriate to the domain and use case. RAG addresses these cases by making the model more applicable to specific domains and datasets. For example, an LLM might know about legal services in general, but if you want it to cite specific statutes in US law, RAG is a good approach. The same approach applies to the laws of any other country.

Retrieval-Augmented Generation (RAG) is a hybrid approach in natural language processing (NLP) that combines two key elements: retrieval of relevant information from external data sources and generation of text based on this retrieved information.

Retrieval: These models retrieve relevant text from a large repository of documents when generating responses or completing tasks. They excel at providing factual and specific information. A better retrieval mechanism leads to a more accurate and relevant response.

Instead of relying solely on learned patterns and data, RAG incorporates a retrieval step. This involves searching or retrieving specific relevant snippets from a large database of documents related to the input query or task. RAG improves the accuracy and relevancy of the response based on the domain specific document repository.

Generation: These models generate responses from scratch based on patterns learned from their training data. They can create fluent and contextually appropriate text but may struggle with accuracy and factual correctness on specific queries. For example, if we ask the model for the current dollar rate, it can only answer from the data it was trained on, so the answer is likely to be inaccurate. For dynamic information such as news and currency prices, the model's knowledge is accurate only up to its training cutoff date. Connecting the model to a reliable source lets it fetch such information from the right place. Similarly, when the model answers from its pretrained data, it may not be able to cite a reference for that answer, which is limiting when the user needs to know the source. With RAG, the source can be referenced as the answer is generated.
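To make the two halves concrete, here is a minimal plain-Python sketch of the retrieve-then-generate flow. It is only an illustration: the word-overlap retriever, the document dictionaries, and the prompt wording are all invented for this example, and a real system would use embeddings and an actual language model.

```python
def retrieve(query, documents, top_k=1):
    """Rank documents by how many query words they share (a toy retriever)."""
    query_words = set(query.lower().split())
    scored = []
    for doc in documents:
        overlap = len(query_words & set(doc["text"].lower().split()))
        scored.append((overlap, doc))
    # Highest overlap first; the sort is stable, so ties keep input order.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in scored[:top_k]]


def build_prompt(query, retrieved):
    """Put the retrieved snippets, with their sources, ahead of the question."""
    context = "\n".join(f"[{doc['source']}] {doc['text']}" for doc in retrieved)
    return (
        "Answer using only the context below and cite the source.\n"
        f"Context:\n{context}\n"
        f"Question: {query}"
    )


# Hypothetical documents, invented for this illustration.
documents = [
    {"source": "rates.txt", "text": "The dollar rate is updated daily by the bank"},
    {"source": "faq.txt", "text": "Fine tuning adapts a model to a domain"},
]
query = "what is the dollar rate"
prompt = build_prompt(query, retrieve(query, documents))
print(prompt)
```

The prompt that reaches the model now carries both the relevant snippet and its source label, which is what allows the generated answer to quote where the information came from.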

Some Applications:

Question Answering: RAG can excel in tasks where precise answers backed by evidence are required, such as in open-domain question answering systems.
Content Creation: It can also be used to generate content that is both informative and accurate, leveraging retrieved knowledge to enhance the generation process.
Summarization: RAG can also be used to create both abstractive and extractive summaries. Extractive summaries identify the important sentences from the document(s) and generate a summary based on their relative importance. Abstractive summaries identify the most important ideas and content and synthesize them in their own words.

In this series, let's create a basic RAG application.

Let's discuss the architecture in two parts: the first is the creation of the database, and the second is retrieval.

Creation of Database

We will use a PDF file for the RAG implementation. We first extract the text from the PDF and then split it into smaller pieces, often referred to as ‘chunks’; the process of doing so is known as ‘chunking’. Chunking lets the retriever pass only the relevant pieces to the language model without exceeding its context limit. Chunk size and chunk overlap must be balanced well to get good results.
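As a rough illustration of how chunk size and chunk overlap interact, the sketch below slices a string into fixed-size character windows. The function name chunk_text is hypothetical, and this is a deliberately simplified version of what a text splitter does, not the splitter used later in this tutorial.

```python
def chunk_text(text, chunk_size, chunk_overlap):
    """Split text into fixed-size character windows that overlap."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
chunks = chunk_text(sample, chunk_size=10, chunk_overlap=3)
print(chunks)  # ['abcdefghij', 'hijklmnopq', 'opqrstuvwx', 'vwxyz']
```

Each chunk repeats the last three characters of the previous one. A larger overlap preserves more context across chunk boundaries but produces more chunks to embed and store.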

Once we have document chunks, we will next proceed to convert them into embeddings. Embeddings are a foundational concept in NLP. Embeddings enable machines to understand and process human language more effectively by representing words or phrases as vectors in a continuous space where semantic relationships are encoded.

How Embeddings Work?

Vector Representation: Each word or phrase is represented as a vector of real numbers. For example, in a 300-dimensional embedding space, each word might be represented by a vector with 300 numerical values.
Semantic Similarity: Words with similar meanings are represented by vectors that are closer together in the embedding space. For instance, vectors for "dog" and "cat" would be closer than vectors for "dog" and "car".
Learned from Data: Embeddings are learned from large amounts of textual data using techniques like Word2Vec, GloVe (Global Vectors for Word Representation), or through neural network-based approaches such as Transformer models.
Applications: Embeddings are widely used in NLP tasks such as sentiment analysis, machine translation, text classification, and more. They allow models to generalize better to data they have not seen before and capture intricate relationships between words.
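The idea of "closer together in the embedding space" is usually measured with cosine similarity. The sketch below uses made-up 3-dimensional vectors purely for illustration; real embeddings come from a trained model and have hundreds of dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Made-up vectors, chosen by hand so that "dog" and "cat" point the same way.
toy_vectors = {
    "dog": [0.9, 0.8, 0.1],
    "cat": [0.8, 0.9, 0.2],
    "car": [0.1, 0.2, 0.9],
}

dog_cat = cosine_similarity(toy_vectors["dog"], toy_vectors["cat"])
dog_car = cosine_similarity(toy_vectors["dog"], toy_vectors["car"])
print(f"dog vs cat: {dog_cat:.2f}")
print(f"dog vs car: {dog_car:.2f}")
```

With these toy values the "dog"/"cat" score is much higher than the "dog"/"car" score, mirroring the semantic relationship described above.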

Once we have the embeddings, we store them in a specialized database called a vector database.

Vector Databases

Vector databases are specialized for storing and retrieving vector data efficiently, making them an essential component in applications that rely on similarity-based search and analysis of high-dimensional vectors. ChromaDB is one such AI-native, open-source vector database, and it is the one used in this tutorial. To learn more about ChromaDB, click here.
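Conceptually, a vector database stores vectors alongside their payloads and returns the nearest ones for a query vector. The toy class below (ToyVectorStore is a hypothetical name) does this by brute force over a Python list. It is only meant to show the idea; real vector databases typically rely on specialized indexes so that search stays fast for very large collections.

```python
import math


class ToyVectorStore:
    """Keeps (vector, payload) pairs in a list and searches them by brute force."""

    def __init__(self):
        self._items = []

    def add(self, vector, payload):
        self._items.append((vector, payload))

    def query(self, vector, k=1):
        """Return the k payloads whose vectors are closest in Euclidean distance."""
        ranked = sorted(self._items, key=lambda item: math.dist(item[0], vector))
        return [payload for _, payload in ranked[:k]]


store = ToyVectorStore()
store.add([0.9, 0.8, 0.1], "chunk about dogs")
store.add([0.8, 0.9, 0.2], "chunk about cats")
store.add([0.1, 0.2, 0.9], "chunk about cars")

nearest = store.query([0.85, 0.85, 0.1], k=2)
print(nearest)  # ['chunk about dogs', 'chunk about cats']
```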

We will be utilizing ChromaDB from the Langchain framework in this tutorial.

Setting up a virtual environment (venv) is highly recommended while following this blog. Python is a prerequisite; if it is not installed, please install the latest version of Python on the machine. For detailed steps, click here. Once the environment is set up, install ChromaDB using the Python package installer "pip".

In the VSCode terminal type the following command,

pip install chromadb

We will also use LangChain, a widely used open-source framework for developing GenAI applications. LangChain acts as an orchestrator for building customizable AI applications, especially those that use LLMs or SLMs. It provides tools and abstractions that make it easier to customize, control, and integrate LLMs into applications.

The following are some of the major components of Langchain.

Prompts: Prompts are the text instructions or questions that you provide to the LLM. Well-crafted prompts are crucial for getting accurate and relevant responses from the LLM. LangChain provides templates to structure prompts and make them more reusable.
LLMs: Large Language models (LLMs) like GPT-4o, LLaMA, and others are pre-trained models with vast knowledge and capabilities. LangChain seamlessly integrates with popular LLM providers. You can also use your own custom-trained models.
Chains: Chains are responsible for orchestrating the flow of data and interactions between different components.

Types of Chains:

Sequential Chains: Execute components in a linear order.

Parallel Chains: Execute components concurrently.

Conditional Chains: Execute components based on certain conditions.

Generative Chains: Generate text or other outputs.

Custom Chains: You can create your own custom chains to suit specific use cases

Callbacks:

Monitoring and Logging: Callbacks provide a way to track the progress of your application and log important events.

Customizations: You can implement custom callbacks to perform actions like sending notifications or storing data.

Indexes:

Document Retrieval: Indexes are used to store and retrieve documents that can be used as context for LLMs.

Vector Databases: LangChain supports various vector databases for efficient document retrieval.

Agents:

Autonomous Actions: Agents are capable of taking actions based on the information they gather from the environment.

Decision-Making: Agents use LLMs to make decisions and complete tasks.

Memory:

Context Preservation: Memory allows LLMs to maintain context and remember information from previous interactions.

Types of Memory:

Conversation Memory: Stores the history of a conversation.

Document Memory: Stores information from documents.

Episodic Memory: Stores information about past events.

Tools:

External Integration: Tools enable LLMs to interact with external resources like search engines, calculators, or APIs.

Expanding Capabilities: Tools can enhance the functionality of your applications.

By effectively combining these components, you can create a wide range of applications, including chatbots, question-answering systems, text summarization tools, and more.
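To give a feel for two of these components, the plain-Python sketch below mimics a reusable prompt template and a sequential chain. None of this is LangChain's actual API: make_prompt, run_sequential, and fake_llm are hypothetical helpers written only to illustrate the concepts.

```python
def make_prompt(template, **values):
    """Fill a reusable prompt template with the values for one request."""
    return template.format(**values)


def run_sequential(steps, value):
    """Pass a value through each step in order, like a sequential chain."""
    for step in steps:
        value = step(value)
    return value


def fake_llm(prompt):
    # Stand-in for a model call; it only reports the prompt length.
    return f"(model reply to a {len(prompt)}-character prompt)"


template = "Summarize the following text in one sentence:\n{text}"
steps = [
    lambda text: make_prompt(template, text=text),  # step 1: build the prompt
    fake_llm,                                       # step 2: call the "model"
    str.strip,                                      # step 3: tidy the output
]
result = run_sequential(steps, "RAG combines retrieval with generation.")
print(result)
```

The same template can be reused with different inputs, and swapping fake_llm for a real model call would leave the rest of the chain unchanged, which is the kind of separation these abstractions aim for.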

To install, type the following command,

pip install langchain

LangChain Community contains third-party integrations that implement the base interfaces defined in LangChain Core, making them ready-to-use in any LangChain application.

Type the following command in VSCode terminal to install the Langchain community,

pip install langchain-community
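Two of the classes imported below depend on packages that the commands above do not install: as far as we know, HuggingFaceEmbeddings relies on sentence-transformers and PyMuPDFLoader relies on PyMuPDF (both installable with pip). The small check below is a sketch for confirming that everything is importable before going further; the list of module names is our assumption about what this tutorial needs.

```python
import importlib.util


def missing_modules(module_names):
    """Return the names from module_names that cannot be imported here."""
    return [
        name for name in module_names
        if importlib.util.find_spec(name) is None
    ]


# Import names (not pip names): PyMuPDF is imported as "fitz".
required = [
    "chromadb",
    "langchain",
    "langchain_community",
    "sentence_transformers",
    "fitz",
]
print("Missing modules:", missing_modules(required))
```

An empty list means the environment is ready. Otherwise, install the missing packages in the same virtual environment that the notebook kernel uses.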

Let’s now begin by importing the required libraries. A notebook file (.ipynb) is recommended for this part. In Visual Studio Code, create a new file and name it dbmaker.ipynb. Once the notebook is open, set the kernel to the virtual environment created earlier, or use the Python version installed on the local machine.

Now import the following libraries,

Code:

from langchain_community.vectorstores import Chroma
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.document_loaders import DirectoryLoader,PyMuPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter

It is quite important to understand the use of the above statements, so let’s look at these one by one,

Chroma:

Module: langchain_community.vectorstores

Purpose: Chroma is a vector store that allows you to store and retrieve high-dimensional vectors. It is used to store embeddings of documents or text and retrieve them based on similarity searches.

Usage: Typically used in applications that require efficient similarity searches, such as document retrieval or question-answering systems.

HuggingFaceEmbeddings:

Module: langchain_community.embeddings

Purpose: HuggingFaceEmbeddings provides a way to generate embeddings using models from the Hugging Face library. These embeddings are numerical representations of text that capture semantic meaning.

Usage: Used to convert text into embeddings that can be stored in a vector store like Chroma for similarity searches.

DirectoryLoader:

Module: langchain_community.document_loaders

Purpose: DirectoryLoader is used to load documents from a specified directory. It can handle various file types and is useful for batch processing of documents.

Usage: Commonly used to load a large number of documents from a directory for further processing, such as embedding generation or text splitting.

PyMuPDFLoader:

Module: langchain_community.document_loaders

Purpose: PyMuPDFLoader is a specialized document loader that uses the PyMuPDF library to load and process PDF documents.

Usage: Used to extract text from PDF files, which can then be processed further, such as generating embeddings or splitting text.

RecursiveCharacterTextSplitter:

Module: langchain.text_splitter

Purpose: RecursiveCharacterTextSplitter is used to split text into smaller chunks based on character count. It recursively splits text to ensure that chunks are of manageable size while preserving semantic meaning.

Usage: Useful in scenarios where large documents need to be broken down into smaller, more manageable pieces for processing, such as embedding generation or indexing.

Once we have done these imports successfully, it's time now to specify the directory of the documents and also to specify the embedding model.

Code:

#Directory of the PDF Files
dir = 'docs/'

# Open-source embedding model from Hugging Face
embeddings = HuggingFaceEmbeddings(model_name='all-MiniLM-L6-v2')

This line defines a variable ‘dir’ that holds the path to a directory containing PDF files. This directory path will be used later to load and process the PDF files stored in this location.

The model that will be used in this tutorial is all-MiniLM-L6-v2, which is a pre-trained model from Hugging Face.

The embeddings object will be used to convert text into numerical embeddings. These embeddings capture the semantic meaning of the text and can be used for various tasks such as similarity searches, clustering, or feeding into other machine learning models.
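A quick way to see what the embeddings object produces is to embed a single string and inspect the result. The helper below is a sketch that works with any object exposing an embed_query method, which the LangChain embedding classes do. FakeEmbedder is a hypothetical stand-in so the example runs without downloading a model.

```python
def embedding_summary(embedder, text):
    """Embed one string and report the vector's length and first few values."""
    vector = embedder.embed_query(text)
    return {"dimensions": len(vector), "preview": list(vector[:3])}


class FakeEmbedder:
    """Hypothetical stand-in so the sketch runs without downloading a model."""

    def embed_query(self, text):
        return [float(len(text)), 0.0, 1.0, 2.0]


print(embedding_summary(FakeEmbedder(), "What is fine tuning?"))

# In the notebook, pass the real object instead:
# embedding_summary(embeddings, "What is fine tuning?")
```

With all-MiniLM-L6-v2, the reported dimension should be 384, since that model maps text to 384-dimensional vectors.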

It's now time to load the documents from the directory, so let’s create a function to do this:

Code:

#Loading the documents
def load_docs(dir):
    loader=DirectoryLoader(dir,loader_cls=PyMuPDFLoader,use_multithreading=True,max_concurrency=128,show_progress=True,silent_errors=True)
    documents=loader.load()
    return documents

Now let’s step through the code and explain what we are doing here,

The load_docs function is designed to load documents from the specified directory. It uses the DirectoryLoader class to handle the loading process, with specific configurations to optimize performance and handle unforeseen errors gracefully.

The load_docs function accepts one parameter, dir: the path of the directory where the PDF files are located.

Initialize DirectoryLoader:
- dir: The directory containing the PDF files.
- loader_cls=PyMuPDFLoader: Specifies that the PyMuPDFLoader class should be used to load the PDF files. This loader is specialized for handling PDF documents.
- use_multithreading=True: Enables multithreading to speed up the loading process.
- max_concurrency=128: Sets the maximum number of concurrent threads to 128. This allows for parallel processing of multiple files.
- show_progress=True: Displays a progress bar to indicate the loading progress.
- silent_errors=True: Logs a warning instead of raising an exception when a file fails to load, so the loading process continues with the remaining files.
Load Documents:
- documents = loader.load(): Calls the load method of the DirectoryLoader instance to load the documents from the specified directory.
Return Documents:
- return documents: Returns the loaded documents.
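Because silent_errors=True lets loading continue past files that fail, it is worth checking what actually came back. The sketch below counts loaded pages per source file, assuming each loaded document carries a "source" entry in its metadata (PyMuPDFLoader produces one document per PDF page). FakeDocument is a hypothetical stand-in that mimics the two attributes used here.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class FakeDocument:
    """Hypothetical stand-in with the same two attributes as a LangChain Document."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def pages_per_source(documents):
    """Count how many loaded pages came from each source file."""
    return Counter(doc.metadata.get("source", "unknown") for doc in documents)


sample_docs = [
    FakeDocument("page one", {"source": "docs/guide.pdf", "page": 0}),
    FakeDocument("page two", {"source": "docs/guide.pdf", "page": 1}),
    FakeDocument("intro", {"source": "docs/faq.pdf", "page": 0}),
]
print(pages_per_source(sample_docs))

# In the notebook: pages_per_source(load_docs(dir))
```

If a PDF that exists in the directory is missing from the counts, it most likely failed to load and was skipped.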

We will use this function when we create chunks. Throughout this tutorial we wrap each step in a function so that it can be reused wherever needed.

To create chunks, let's now write a function.

Code:

#Splitting the documents into chunks
def split_docs(documents,chunk_size=1000,chunk_overlap=100):
    text_splitter=RecursiveCharacterTextSplitter(chunk_size=chunk_size,chunk_overlap=chunk_overlap)
    docs=text_splitter.split_documents(documents)
    return docs

The split_docs function splits a list of documents into smaller chunks. This makes large documents manageable for tasks like embedding generation or indexing, and it is especially helpful when the model has a token limit and only limited context can be sent to it.

Parameters of split_docs:

documents: A list of documents to be split. Each one is a LangChain Document object holding the text (page_content) and its metadata, as returned by load_docs.
chunk_size (default 1000): The maximum number of characters in each chunk.
chunk_overlap (default 100): The number of characters that overlap between consecutive chunks. This helps to maintain context across chunks.

The chunk size and chunk overlap need to be tuned for the use case, but for this tutorial we keep the default values above.

Steps

Initialize RecursiveCharacterTextSplitter:
- chunk_size=chunk_size: Sets the maximum size of each chunk.
- chunk_overlap=chunk_overlap: Sets the number of overlapping characters between chunks.
- The RecursiveCharacterTextSplitter is designed to split text into chunks while preserving semantic meaning as much as possible.
Split Documents:
- docs = text_splitter.split_documents(documents): Calls the split_documents method of the text_splitter instance to split the input documents into smaller chunks.
Return Chunks:
- return docs: Returns the list of document chunks.
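When tuning chunk_size and chunk_overlap, it helps to look at the lengths of the chunks that were actually produced. The helper below is a small sketch for that. It only assumes each chunk has a page_content attribute, and it uses SimpleNamespace objects as stand-ins for real chunks.

```python
from statistics import mean
from types import SimpleNamespace


def chunk_length_stats(chunks):
    """Summarize chunk lengths (in characters) to help tune chunk_size."""
    lengths = [len(chunk.page_content) for chunk in chunks]
    if not lengths:
        return {"count": 0}
    return {
        "count": len(lengths),
        "min": min(lengths),
        "max": max(lengths),
        "mean": round(mean(lengths), 1),
    }


# Stand-in chunks of 10, 20 and 30 characters.
sample_chunks = [
    SimpleNamespace(page_content="a" * 10),
    SimpleNamespace(page_content="b" * 20),
    SimpleNamespace(page_content="c" * 30),
]
print(chunk_length_stats(sample_chunks))

# In the notebook: chunk_length_stats(split_docs(documents))
```

If most chunks are far shorter than chunk_size, the source pages are probably short already. If the maximum sits right at chunk_size, many pages are being cut and the overlap setting matters more.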

We have now written the functions needed to create the vector database.

It's time to use these functions and create the vector database:

Code:

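# Load every PDF page in the directory and report how many were loaded
documents=load_docs(dir)
print(len(documents))
# Split the pages into chunks and report how many chunks were produced
doc=split_docs(documents)
print(len(doc))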

Let’s call the load_docs function with the directory path, to load the documents from the specified directory. The loaded documents are stored in the documents variable. Then, to check the length of the documents, we can use the len() function.

Now call the split_docs function with the loaded documents to split them into smaller chunks. The resulting chunks are stored in the doc variable. The length of the doc list is printed, indicating how many chunks were created from the original documents.

Now that the chunks are ready, it's time to embed them and store them in a vector database. As already discussed, we will use ChromaDB in this tutorial.

save_to=Chroma.from_documents(documents=doc,embedding=embeddings,persist_directory='./ai-toolkit')

The above line of code creates a Chroma vector store from the list of document chunks, using the embedding model to compute an embedding for each chunk. The vector store is then saved to the specified directory for persistence.

Parameters

documents=doc: The list of document chunks that were created by the split_docs function.
embedding=embeddings: The embedding model used to generate embeddings for the document chunks. This is an instance of HuggingFaceEmbeddings.
persist_directory='./ai-toolkit': The directory where the vector store will be saved. This allows the vector store to be persisted and loaded later.
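One thing to watch for in a notebook: our understanding is that re-running the cell above against the same persist_directory adds the same chunks again, so search results may contain duplicates. A possible workaround, sketched below under that assumption, is to derive a stable ID for each chunk and pass the list through the ids argument of Chroma.from_documents. The helper make_chunk_ids is hypothetical, and the SimpleNamespace objects stand in for real chunks.

```python
import hashlib
from types import SimpleNamespace


def make_chunk_ids(chunks):
    """Build one stable ID per chunk from its source, position, and text."""
    ids = []
    for index, chunk in enumerate(chunks):
        source = chunk.metadata.get("source", "")
        raw = f"{source}|{index}|{chunk.page_content}"
        ids.append(hashlib.sha256(raw.encode("utf-8")).hexdigest())
    return ids


# Stand-in chunks; the second and third have identical text on purpose.
sample_chunks = [
    SimpleNamespace(page_content="Fine tuning adapts a model.", metadata={"source": "docs/a.pdf"}),
    SimpleNamespace(page_content="Repeated footer", metadata={"source": "docs/a.pdf"}),
    SimpleNamespace(page_content="Repeated footer", metadata={"source": "docs/a.pdf"}),
]
ids = make_chunk_ids(sample_chunks)
print(ids[0][:12], len(set(ids)))

# In the notebook (assumed usage):
# Chroma.from_documents(documents=doc, embedding=embeddings,
#                       ids=make_chunk_ids(doc), persist_directory='./ai-toolkit')
```

Including the position in the hash keeps IDs unique even when two chunks share the same text, while the same input always yields the same IDs across runs.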

Our database is now ready, and we can run a sample search query against it. We have not yet connected a small language model, but we can still see how the retriever works, so let’s try a sample query:

query="What is Fine tuning"

We have defined our query; now it's time to search for it in our newly created vector database, stored in the ‘ai-toolkit’ directory. To do this, we take the following steps:

Initialize a Chroma vector store by loading it from the specified directory.
Perform a similarity search on the vector store using the provided query.
Print the entire list of search results to the console.
Print the content of the first document or chunk in the search results.

The code is as follows,

Code:

db1=Chroma(persist_directory='./ai-toolkit',embedding_function=embeddings)
results=db1.similarity_search(query)
print(results)
print(results[0].page_content)
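Printing the raw result list can be hard to read. As an optional extra, the Chroma vector store also offers similarity_search_with_score, which returns (document, score) pairs. To our understanding the score is a distance, so lower values mean closer matches. The helper below is a sketch that formats such pairs; format_results is a hypothetical name and the sample data is invented.

```python
from types import SimpleNamespace


def format_results(scored_results, preview_chars=60):
    """Turn (document, score) pairs into readable one-line summaries."""
    lines = []
    for rank, (doc, score) in enumerate(scored_results, start=1):
        source = doc.metadata.get("source", "unknown")
        preview = doc.page_content[:preview_chars].replace("\n", " ")
        lines.append(f"{rank}. score={score:.3f} source={source} text={preview}")
    return lines


# Invented sample standing in for real search output.
sample = [
    (SimpleNamespace(page_content="Fine tuning\nadapts a model", metadata={"source": "a.pdf"}), 0.25),
]
for line in format_results(sample):
    print(line)

# In the notebook (assumed usage):
# for line in format_results(db1.similarity_search_with_score(query, k=3)):
#     print(line)
```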

Upon successful execution, the search results appear in the output cell.

In this article we have learned the concepts behind RAG and successfully created embeddings. In the next part, we will look at how to use RAG with these embeddings in ChromaDB to get better results, using AI Toolkit for VS Code and a locally downloaded Phi-3 model.

Meanwhile you can take a look at the following resources about RAG and AI Toolkit
