Retrieval Augmented Fine-Tuning: Use GPT-4o to fine-tune GPT-4o mini for domain-specific applications

azeltov

Customization is key!



One of the most impactful applications of generative AI for businesses is to create natural language interfaces that have been customized with domain- and use-case-specific data to provide better, more accurate responses. This means answering questions about specific domains such as banking, legal, and medical fields.

We often talk about two methods to achieve this:



  1. Retrieval Augmented Generation (RAG): Storing those documents in a vector database and, at query time, retrieving documents based on their semantic similarity to the question, then using them as context for the LLM.
  2. Supervised Fine-Tuning (SFT): Training an existing base model on a set of prompts and responses representing the domain-specific knowledge.

While most organizations experimenting with RAG aim to extend an LLM's knowledge with their internal knowledge base, many do not achieve the expected results without significant optimization. Similarly, it can be challenging to curate a sufficiently large and high-quality data set for fine-tuning. Both approaches have limitations: fine-tuning confines the model to its trained data, making it susceptible to approximation and hallucination, while RAG grounds the model but retrieves documents based merely on their semantic proximity to the query -- which may not be relevant and can lead to poorly reasoned answers.



RAFT to the rescue!




Instead of choosing RAG or fine-tuning, we can combine them! Think of RAG as an open book exam: the model looks up relevant documents to generate answers. Fine-tuning is like a closed book exam: the model relies on pre-trained knowledge. Just like in exams, the best results come from studying and having notes handy.


Retrieval Augmented Fine-Tuning (RAFT) is a powerful technique to prepare fine-tuning data for domain-specific open-book settings, like in-domain RAG. It’s a game-changer for language models, combining the best parts of RAG and fine-tuning. RAFT helps tailor models to specific domains by boosting their ability to understand and use domain-specific knowledge. It’s the sweet spot between RAG and domain-specific SFT.



How does it work?



RAFT has three steps:

  1. Preparing a dataset to teach the model how to answer questions about your domain.
  2. Fine-tuning a model with your prepared dataset.
  3. Evaluating the quality of your new, custom, domain-adapted model.

The key to RAFT is the training data generation, where each data point includes a question (Q), a set of documents (Dk), and a Chain-of-Thought style answer (A).

The documents are categorized into Oracle Documents (Do), which contain the answer, and Distractor Documents (Di), which do not. Fine-tuning teaches the model to differentiate between these, resulting in a custom model that outperforms the original with RAG or fine-tuning alone.
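To make the data format concrete, here is an illustrative training record in the spirit of RAFT. The field names and banking content are hypothetical and not the exact schema used in the sample repo:

# Illustrative RAFT record; field names and banking content are made up for this example.
raft_example = {
    "question": "What is the daily limit for external transfers from a checking account?",
    "oracle_documents": [
        "Online Banking Guide: External transfers are limited to $5,000 per business day...",
    ],
    "distractor_documents": [
        "Credit Card Agreement: Cash advances are subject to a 5% fee...",
        "Mortgage FAQ: Escrow payments are reviewed once per year...",
    ],
    "cot_answer": (
        "The Online Banking Guide states that external transfers are limited to "
        "$5,000 per business day. The other documents cover cash advances and "
        "escrow, which are not relevant to the question. "
        "<ANSWER>: $5,000 per business day."
    ),
}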

We use GPT-4o to generate training data and fine-tune GPT-4o mini, creating a cost-effective, faster model tailored to your use case. This technique, called distillation, uses GPT-4o as the teacher model and 4o-mini as the student.


In the next section of this blog, we'll get hands-on. If you want to follow along on your own, or see reference code, check out GitHub - Azure-Samples/azure-openai-raft. We'll create a domain-adapted model for a banking use case, capable of answering questions about a bank's online tooling and accounts.



Notebook 1 - Generating your RAFT training data



Start by gathering domain-specific documents; in our example, these are PDFs of bank documentation. Because the PDFs contain a number of tables and charts, we first use Azure OpenAI GPT-4o to convert each page's content to Markdown, producing a Markdown file that is used for downstream processing. We then use GPT-4o (our teacher model) to generate synthetic Question-Document-Answer triplets, including examples of "golden" documents (highly relevant) and "distractors" (misleading). This ensures the model learns to differentiate between relevant and irrelevant information. RAFT also uses a Chain-of-Thought (CoT) process: integrating CoT improves the model’s ability to extract information and perform logical reasoning, helps prevent overfitting, and enhances training robustness, making it particularly effective for tasks that require detailed and structured thinking.
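As a rough sketch of the teacher step, the snippet below asks GPT-4o to write a question for a Markdown chunk and then produce a chain-of-thought answer grounded in the oracle document. It assumes an Azure OpenAI deployment named "gpt-4o" and standard environment variables; the prompts and parsing in the actual notebooks are more elaborate.

import os
from openai import AzureOpenAI

client = AzureOpenAI(
    azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],
    api_key=os.environ["AZURE_OPENAI_API_KEY"],
    api_version="2024-06-01",
)

def generate_question(chunk: str) -> str:
    """Ask the GPT-4o teacher to write one question answerable from a Markdown chunk."""
    response = client.chat.completions.create(
        model="gpt-4o",  # name of your GPT-4o deployment (assumption)
        messages=[
            {"role": "system", "content": "Write one question that can be answered only from the provided banking document."},
            {"role": "user", "content": chunk},
        ],
        temperature=0.7,
    )
    return response.choices[0].message.content

def generate_cot_answer(question: str, oracle_document: str) -> str:
    """Ask the teacher for a chain-of-thought answer grounded in the oracle document."""
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": "Answer using only the provided context. Reason step by step, then give a final answer."},
            {"role": "user", "content": f"Context:\n{oracle_document}\n\nQuestion: {question}"},
        ],
        temperature=0.2,
    )
    return response.choices[0].message.content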


We then format this data for fine-tuning, splitting it into training, validation, and test sets. The validation data is used during training, and the test set measures performance at the end.
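A minimal sketch of that formatting and splitting step, assuming raft_records is a list of dicts shaped like the illustrative example shown earlier (the real notebooks use their own field names and prompt templates):

import json
import random

def to_chat_example(record: dict) -> dict:
    """Convert one RAFT record into the chat format expected by Azure OpenAI fine-tuning."""
    context = "\n\n".join(record["oracle_documents"] + record["distractor_documents"])
    return {
        "messages": [
            {"role": "system", "content": "Answer the banking question using only the provided documents."},
            {"role": "user", "content": f"Documents:\n{context}\n\nQuestion: {record['question']}"},
            {"role": "assistant", "content": record["cot_answer"]},
        ]
    }

random.seed(42)
examples = [to_chat_example(r) for r in raft_records]  # raft_records: generated earlier (assumption)
random.shuffle(examples)

n = len(examples)
splits = {
    "train.jsonl": examples[: int(0.8 * n)],                    # used for fine-tuning
    "validation.jsonl": examples[int(0.8 * n) : int(0.9 * n)],  # monitored during training
    "test.jsonl": examples[int(0.9 * n) :],                     # held out for final evaluation
}
for path, rows in splits.items():
    with open(path, "w") as f:
        for row in rows:
            f.write(json.dumps(row) + "\n")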



Notebook 2 - RAFT Fine-Tuning



Now it's time to teach our student! After preparing the training and validation data, the next step is to upload this data to Azure OpenAI and create the fine-tuning job. This is surprisingly easy: in AI Studio, selecting your model, uploading your training and validation data, and setting your training parameters take just a few clicks. We'll select GPT-4o mini as our student model for training. In the lab we show you how to use the SDK to upload the data and trigger the fine-tuning job: the UI is an easy way to experiment, but the SDK is the preferred approach for productionizing and enabling your LLMOps strategy when deploying to production.
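Here is a rough sketch of that SDK path using the openai v1 Python package; the base model name, file names, and api_version shown are assumptions and may differ from the notebook.

import os
from openai import AzureOpenAI

client = AzureOpenAI(
    azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],
    api_key=os.environ["AZURE_OPENAI_API_KEY"],
    api_version="2024-06-01",
)

# Upload the training and validation JSONL files produced earlier.
train_file = client.files.create(file=open("train.jsonl", "rb"), purpose="fine-tune")
valid_file = client.files.create(file=open("validation.jsonl", "rb"), purpose="fine-tune")

# Create the fine-tuning job with GPT-4o mini as the student model.
job = client.fine_tuning.jobs.create(
    model="gpt-4o-mini-2024-07-18",  # base model name is an assumption; check your region's catalog
    training_file=train_file.id,
    validation_file=valid_file.id,
)
print(job.id, job.status)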






Once the fine-tuning job is running, we can monitor its progress and, upon completion, analyze the fine-tuned model in Azure OpenAI Studio. Finally, we create a new deployment with the fine-tuned model, ready to be used for our specialized domain tasks.
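Continuing the sketch above, a simple polling loop can watch the job and read the fine-tuned model name once training completes; the deployment itself is created in Azure OpenAI Studio or via the Azure control-plane APIs, which are not shown here.

import time

job_id = job.id  # the fine-tuning job created in the previous sketch

while True:
    job = client.fine_tuning.jobs.retrieve(job_id)
    print("Status:", job.status)
    if job.status in ("succeeded", "failed", "cancelled"):
        break
    time.sleep(60)  # poll once a minute

print("Fine-tuned model:", job.fine_tuned_model)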



Notebook 3 - Is our RAFT model really better than the base model? Let's check!



You can start by reviewing the built-in metrics returned by AI Studio, showing loss and accuracy. We want to see accuracy increase while loss goes down.




However – we can do much more to measure the quality of our model. Remember our test dataset from the beginning? This is why we prepared it!



While there are many options for evaluation, including AI Studio evaluations, in our example we use the open-source library RAGAS, which evaluates RAG pipelines with metrics like answer relevancy, faithfulness, answer similarity, and answer correctness. These metrics rely on either an LLM as a judge or an embedding model to assess the quality and accuracy of the generated answers.
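A rough sketch of a RAGAS run is shown below, assuming the ragas 0.1.x API and toy placeholder rows in place of the real test split; RAGAS also needs a judge LLM and an embedding model configured (typically via OpenAI or Azure OpenAI environment variables).

from datasets import Dataset
from ragas import evaluate
from ragas.metrics import (
    answer_correctness,
    answer_relevancy,
    answer_similarity,
    faithfulness,
)

# Toy placeholder rows; in the notebook these come from the held-out test split,
# the fine-tuned model's answers, and the documents it was given as context.
eval_data = Dataset.from_dict({
    "question": ["What is the daily limit for external transfers?"],
    "answer": ["External transfers are limited to $5,000 per business day."],
    "contexts": [["Online Banking Guide: External transfers are limited to $5,000 per business day."]],
    "ground_truth": ["$5,000 per business day."],
})

result = evaluate(
    eval_data,
    metrics=[answer_relevancy, faithfulness, answer_similarity, answer_correctness],
)
print(result)  # per-metric scores for comparing the base and RAFT models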




gpt4o-mini vs gpt4o-mini-raft

We could likely improve these metrics further by adjusting our training parameters and/or generating additional training data.



Ready to get started yourself?

Head over to GitHub - Azure-Samples/azure-openai-raft for the full notebooks and reference code, and try RAFT on your own domain documents.

