The Future of AI: LLM Distillation just got easier - Synthetic Data Gen with Llama 3.1 405B & RAFT

cedricvidal · Sep 4, 2024

The Future of AI: LLM Distillation just got easier

Part 1 - Synthetic Data Gen with Llama 3.1 405B & RAFT

How Llama 405B and RAFT on Azure AI are changing the landscape of synthetic dataset generation and making model distillation much more approachable.

By Cedric Vidal, Principal AI Advocate, Microsoft

Part of the Future of AI

series initiated by Marco Casalaina with his Exploring Multi-Agent AI Systems blog post.

Gorilla scientist distilling, generated using Azure OpenAI DALL-E 3

The AI landscape is continuously evolving, with one of the latest advancements being the ability to generate high-quality synthetic datasets using large language models (LLMs). Llama 3.1 405B Instruct, released on Hugging Face on July 23rd and simultaneously on Azure AI, combined with the RAFT (Retrieval Augmented Fine Tuning) framework, is set to revolutionize how companies create synthetic data. This powerful combination simplifies a previously tedious, time-consuming, and costly process, enabling businesses to generate self-instruct Q&A and Chain of Thought datasets directly from their domain-specific documents.

This blog post is the first in a five-part series where we explore how you can leverage a new GitHub repository that makes it easier than ever to use Llama 405B for distilling smaller, more efficient models. In this first part, we’ll dive into the benefits of synthetic dataset generation with Llama 405B and RAFT, why it’s a game-changer, and how you can get started.

Why Synthetic Dataset Generation Matters

Synthetic dataset generation has become increasingly important in AI development. Acquiring high-quality, task-specific data often requires extensive manual effort, significant costs, and can be hampered by privacy concerns. This is especially challenging in industries where data is sensitive or hard to obtain. Synthetic data generation provides a practical solution by creating data that mirrors real-world scenarios, tailored to specific tasks without the need for traditional data collection.

Key benefits include:

Efficiency: Quickly generate datasets without manual data collection or annotation, saving time and reducing operational costs.
Scalability: Easily scale data creation to match the requirements of training and fine-tuning large models.
Domain Relevance: Generate data specific to the domain, ensuring the training data is highly relevant and contextually accurate.

Llama 3.1 405B Instruct: The Data Generation Powerhouse

Llama 3.1 405B Instruct is a state-of-the-art language model with 405 billion parameters, designed to excel at following instructions for complex text generation tasks. As one of the most powerful models available, it offers unparalleled capabilities in generating high-quality synthetic datasets.

Instruction-Tuned Excellence: Llama 3.1 405B has been tuned specifically to follow detailed instructions, making it particularly effective at generating synthetic Q&A pairs and complex reasoning paths.
High-Quality Outputs: Thanks to its massive scale and extensive fine-tuning, Llama 405B generates text that is contextually rich and accurate, ideal for creating training data for downstream tasks.
Versatility: From conversational data to intricate problem-solving scenarios, Llama 405B can produce data tailored to a wide array of applications, including industry-specific needs.

RAFT: Enhancing Synthetic Q&A and Chain of Thought Generation

RAFT, detailed in a recent paper from UC Berkeley’s Gorilla project and summarized in a previous blog post, significantly enhances the synthetic generation capabilities of Llama 3.1 405B. Originally, the Self-Instruct method—outlined in the Self-Instruct paper—advanced traditional synthetic dataset creation by automating the generation of questions and instructions typically crafted by humans. RAFT extends this methodology by generating synthetic questions directly from domain-specific documents, such as PDFs, and incorporating Chain of Thought reasoning. This approach trains the student model not only to understand the domain but also to comprehend the reasoning behind answering the questions.

RAFT is specifically designed to optimize Retrieval-Augmented Generation (RAG) workflows by being trained to identify and utilize relevant documents while discarding those that are irrelevant.

The RAFT self-instruct synthetic dataset generation steps

Here’s how RAFT makes the process more efficient:

Domain-Specific Data: RAFT enables you to input a collection of domain-specific documents—like technical manuals, research papers, or internal company guides. It then automatically generates relevant questions, answers, and reasoning paths tailored to those documents. This is particularly beneficial for industries where it’s challenging to gather high-quality, task-specific data.
Automated Self-Instruct Datasets: With RAFT, manual annotation and expert involvement in data generation become unnecessary. RAFT autonomously creates instruction-following data that simulates how experts would analyze and reason about the content of the documents.
Cost and Time Efficiency: Previously, creating such specialized datasets needed extensive resources and expert time. RAFT automates this, significantly cutting down the time and cost required for developing domain-specific training data.

RAFT Distillation Recipe: Streamlining the Distillation Process with RAFT and Llama 3.1 405B

The primary goal of the raft-distillation-recipe is to simplify and automate the end-to-end process of distilling large language models. The project automates the provisioning of the infrastructure required to run RAFT and Llama and provides a set of notebooks for each step of the distillation process. These notebooks are designed to be as hands-free as possible, ensuring that even complex tasks—such as synthetic dataset generation, model fine-tuning, and deployment—can be accomplished with minimal manual intervention while documenting the code necessary to run each step.

Whether you’re new to AI or an experienced practitioner, the focus is on delivering a seamless, user-friendly experience that allows you to concentrate on the outcomes rather than the process itself.

In this blog post, we will focus on the first step: self-instruct dataset generation.

Why the RAFT Distillation Recipe Matters

Documentation on distillation often requires setting up GPUs, which requires some expertise, or leaves out critical steps like creating the dataset.

Azure AI Serverless offers an enterprise-ready solution, making it easy and cost-effective to run fine-tuning jobs at scale with a curated selection of teacher models, including Llama 3.1 405B Instruct.

RAFT simplifies synthetic dataset creation, generating data from documents that most companies already have on hand.

The RAFT Distillation Recipe combines Azure AI, Llama 3.1 405B, and RAFT to automate the distillation process end-to-end while explaining each step.

Getting Started with Llama 405B and RAFT for Synthetic Dataset Generation

To begin generating synthetic datasets using Llama 3.1 405B and RAFT, we will utilize the raft-distillation-recipe GitHub repository that streamlines this process. Here’s how you can get started:

Clone the Repository: Start by visiting the project’s GitHub repository and open the project easily in one click using one of the options below:
- Open in GitHub Codespaces
- Open in Dev Containers

Alternatively, you can clone the repository locally and set up a Python virtual environment if you prefer more control over your environment setup.

Set Up Your Environment: Codespaces and Dev Containers automatically install the required dependencies. The repository uses GitHub’s prebuild image feature, which includes all necessary pip requirements, greatly speeding up setup. The setup script also automatically clones the RAFT repository, making its command-line tools available directly in the notebooks.
Provision the Infrastructure: To provision the Azure AI infrastructure, log in to Azure using the Azure Developer CLI:

Note: You will need an Azure Pay-As-You-Go account. If you don’t have one, head over to the Azure signup page. It will require a credit card, but for the sample datasets included, costs are capped and estimated on the project page.

azd auth login --use-device-code

Then, deploy the necessary resources with:

azd up

Upload Your Documents or Use Sample Documents: The project comes with sample documents for various domains, sourced from public data such as Wikipedia. You can use these or upload your own domain-specific PDFs. RAFT will parse these documents, generating relevant questions, answers, and reasoning chains optimized for your needs.

By default, the generation notebook uses the sample surfing domain and loads a PDF named Surfing - Wikipedia.pdf.

You can use other sample domains and documents or upload your own.

Open the Generation Notebook: Open the 1_gen.ipynb notebook and follow the explanations. To use a different sample domain, PDF document, or your own, update the notebook parameters:

Code:

ds_name: str = "surfing"
doc_path: str = "sample_data/surfing/Surfing - Wikipedia.pdf"

To load a directory of PDFs, set the doc_path parameter to the directory.

Run the Synthetic Data Generation Notebook: After setting a few variables, the main step is running the raft.py script.

The script stops generating Q/A samples when it reaches the --qa-threshold, controlling dataset size, cost, and generation time. After running the raft.py script, the notebook will export the generated dataset to a format suitable for the Azure AI Fine-tuning as a Service API using RAFT’s format.py script. This script supports multiple formats, depending on the intended use of the dataset.

Next, the dataset is split into three sets: training, validation, and evaluation (test). This splitting process is fundamental to ensuring your model is trained effectively, evaluated properly, and generalizes well to new data. Here’s a brief overview of each split:

Training Split (80%): Used to train the model, allowing it to learn patterns and relationships by adjusting internal parameters based on input-output pairs.
Validation Split (10%): Used during training to monitor performance and fine-tune hyperparameters, ensuring the model doesn’t overfit by providing feedback on unseen data during training.
Evaluation Split (10%): Used only after training is complete to evaluate the model’s final performance, offering an unbiased measure of how well the model generalizes to new, unseen data.

What’s Next?

This blog post covered the first step in leveraging Llama 405B and RAFT for streamlined distillation: self-instruct synthetic dataset generation using RAFT. In the next installment, we’ll explore how to fine-tune a Llama 3.1 8B model using these synthetically generated datasets with the Azure AI Serverless Python SDK.

Stay tuned as our next blog post in this series will be out in a couple of weeks, continuing our exploration of cutting-edge methodologies that are making AI model development faster, more accessible, and more impactful for businesses everywhere.

Continue reading...

The Future of AI: LLM Distillation just got easier - Synthetic Data Gen with Llama 3.1 405B & RAFT

cedricvidal

Why Synthetic Dataset Generation Matters​

Llama 3.1 405B Instruct: The Data Generation Powerhouse​

RAFT: Enhancing Synthetic Q&A and Chain of Thought Generation​

RAFT Distillation Recipe: Streamlining the Distillation Process with RAFT and Llama 3.1 405B​

Why the RAFT Distillation Recipe Matters​

Getting Started with Llama 405B and RAFT for Synthetic Dataset Generation​

What’s Next?​