The Future of AI: Distillation Just Got Easier
Part 3 - Deploying your LoRA Fine-tuned Llama 3.1 8B model, why it's a breeze!
Learn how Azure AI makes it effortless to deploy your LoRA fine-tuned models. (GitHub recipe repo)
By Cedric Vidal, Principal AI Advocate, Microsoft
Part of the Future of AI series initiated by Marco Casalaina with his Exploring Multi-Agent AI Systems blog post.
A Llama on a rocket launched in space, generated using Azure OpenAI DALL-E 3
Welcome back to our series on leveraging Azure AI Studio to accelerate your AI development journey. In our previous posts, we’ve explored synthetic dataset generation and the process of fine-tuning models. Today, we’re diving into the crucial step that turns your hard work into actionable insights: deploying your fine-tuned model. In this installment, we’ll guide you through deploying your model using Azure AI Studio and the Python SDK, ensuring a seamless transition from development to production.
Why Deploying GPU Accelerated Inference Workloads is Hard
Deploying GPU-accelerated inference workloads comes with a unique set of challenges that make the process significantly more complex compared to standard CPU workloads. Below are some of the primary difficulties encountered:
- GPU Resource Allocation: GPUs are specialized and limited resources, requiring precise allocation to avoid wastage and ensure efficiency. Unlike CPUs that can be easily provisioned in larger numbers, the specialized nature of GPUs means that effective allocation strategies are crucial to optimize performance.
- GPU Scaling: Scaling GPU workloads is inherently more challenging due to the high cost and limited availability of GPU resources. It requires careful planning to balance cost efficiency with workload demands, unlike the more straightforward scaling of CPU resources.
- Load Balancing for GPU Instances: Implementing load balancing for GPU-based tasks is complex due to the necessity of evenly distributing tasks across available GPU instances. This step is vital to prevent bottlenecks, avoid overload in certain instances, and ensure optimal performance of each GPU unit.
- Model Partitioning and Sharding: Large models that cannot fit into a single GPU memory require partitioning and sharding. This process involves splitting the model across multiple GPUs, which introduces additional layers of complexity in terms of load distribution and resource management.
- Containerization and Orchestration: While containerization simplifies the deployment process by packaging models and dependencies, managing GPU resources within containers and orchestrating them across nodes adds another layer of complexity. Effective orchestration setups need to be fine-tuned to handle the subtle dynamics of GPU resource utilization and management.
- LoRA Adapter Integration: LoRA, which stands for Low-Rank Adaptation, is a powerful optimization technique that reduces the number of trainable parameters by decomposing the original weight matrices into lower-rank matrices. This makes it efficient for fine-tuning large models with fewer resources. However, integrating LoRA adapters into deployment pipelines involves additional steps to efficiently store, load, and merge the lightweight adapters with the base model and serve the final model, which increases the complexity of the deployment process (a minimal sketch of the low-rank idea follows this list).
- Monitoring GPU Inference Endpoints: Monitoring GPU inference endpoints is complex due to the need for specialized metrics to capture GPU utilization, memory bandwidth, and thermal limits, not to mention model specific metrics such as token counts or request counts. These metrics are vital for understanding performance bottlenecks and ensuring efficient operation but require intricate tools and expertise to collect and analyze accurately.
- Model Specific Considerations: It’s important to acknowledge that the deployment process is often specific to the base model architecture you are working with. Each new version of a model or a different model vendor will require a fair amount of adaptations in your deployment pipeline. This could include changes in preprocessing steps, modifications in environment configurations, or adjustments in the integration or versions of third-party libraries. Therefore, it’s crucial to stay updated with the model documentation and vendor-specific deployment guidelines to ensure a smooth and efficient deployment process.
- Model Versioning Complexity: Keeping track of multiple versions of a model can be intricate. Each version may exhibit distinct behaviors and performance metrics, necessitating thorough evaluation to manage updates, rollbacks, and compatibility with other systems. We’ll cover the subject of model evaluation more thoroughly in the next blog post. Another difficulty with versioning is storing the weights of the different LoRA adapters and keeping track of the versions of the base models they must be adapted onto.
- Cost Planning: Planning the costs for GPU inference workloads is challenging due to the variable nature of GPU usage and the higher costs associated with GPU resources. Predicting the precise amount of GPU time required for inference under different workloads can be difficult, leading to unexpected expenses.
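To make the LoRA integration point above more concrete, here is a minimal, illustrative sketch of the low-rank idea using NumPy. The dimensions and rank are made up for illustration, and production tooling (such as the peft library) adds scaling, dropout, and per-layer targeting on top of this.
Code:
import numpy as np

# Illustrative dimensions for a single transformer weight matrix; real values vary per layer.
d_out, d_in, r = 2048, 2048, 16          # r is the LoRA rank, much smaller than d_out/d_in

W = np.random.randn(d_out, d_in)         # frozen base weights (never updated during fine-tuning)
A = np.random.randn(r, d_in) * 0.01      # trainable low-rank factor
B = np.zeros((d_out, r))                 # trainable low-rank factor, initialized to zero

# Only A and B are trained: d_out*r + r*d_in parameters instead of d_out*d_in.
print(f"Trainable params: {A.size + B.size:,} vs full update: {W.size:,}")

# At deployment time the adapter can be merged into the base weights once,
# so serving the merged model costs the same as serving the original model.
W_merged = W + B @ A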
Understanding and addressing these difficulties is crucial for successfully deploying GPU-accelerated inference workloads, ensuring that the full potential of GPU capabilities is harnessed.
Azure AI Serverless: A Game Changer
Azure AI Serverless is a game changer because it effectively addresses a lot of challenges with deploying GPU-accelerated inference workloads. By leveraging the serverless architecture, it abstracts away the complexities associated with GPU resource allocation, model specific deployment considerations, and API management. This means you can deploy your models without worrying about the underlying infrastructure management, allowing you to focus on your application’s needs. Additionally, Azure AI Serverless supports a diverse collection of models and abstracts away the choice and provisioning of GPU hardware accelerators, ensuring efficient and fast inference times. The platform’s integration with managed services enables robust container orchestration, simplifying the deployment process even further and enhancing overall operational efficiency.
Attractive pay-as-you-go cost model
One of the standout features of Azure AI Serverless is its token-based cost model, which greatly simplifies cost planning. With token-based billing, you are charged based on the number of tokens processed by your model, making it easy to predict costs based on expected usage patterns. This model is particularly beneficial for applications with variable loads, as you only pay for what you use.
Because the managed infrastructure needs to keep LoRA adapters in memory and swap them on demand, there is an additional hourly cost associated with fine-tuned serverless endpoints, but it is billed only while the endpoint is in use. This makes it straightforward to plan future bills based on your expected usage profile.
The hourly cost is also trending downward: it has already dropped dramatically from $3.09/hour for a Llama 2 7B based model to $0.74/hour for a Llama 3.1 8B based model.
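To illustrate how predictable this pricing makes budgeting, here is a rough back-of-the-envelope estimate. The per-token prices and traffic numbers below are hypothetical placeholders; only the $0.74/hour figure comes from above, so substitute the current rates for your model and region.
Code:
# Back-of-the-envelope cost estimate for a fine-tuned serverless endpoint.
# NOTE: token prices and traffic figures are hypothetical placeholders;
# check the marketplace listing for your model and region for actual rates.
PRICE_PER_1K_INPUT_TOKENS = 0.0003   # USD, placeholder
PRICE_PER_1K_OUTPUT_TOKENS = 0.0006  # USD, placeholder
HOURLY_HOSTING_RATE = 0.74           # USD/hour, fine-tuned Llama 3.1 8B endpoint (see above)

monthly_requests = 500_000
avg_input_tokens = 800
avg_output_tokens = 200
active_hours_per_month = 720         # endpoint in use around the clock

token_cost = monthly_requests * (
    avg_input_tokens / 1000 * PRICE_PER_1K_INPUT_TOKENS
    + avg_output_tokens / 1000 * PRICE_PER_1K_OUTPUT_TOKENS
)
hosting_cost = active_hours_per_month * HOURLY_HOSTING_RATE

print(f"Estimated monthly cost: ${token_cost + hosting_cost:,.2f}")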
Beyond cost, there are a few other practical factors to consider to ensure that your model deployment is robust, secure, and capable of meeting the demands of your application.
Region Availability
When deploying your Llama 3.1 fine-tuned model, it’s important to consider the geographical regions where the model can be deployed. As of now, Azure AI Studio supports the deployment of Llama 3.1 fine-tuned models in the following regions: East US, East US 2, North Central US, South Central US, West US, and West US 3. Choosing a region that’s closer to your end-users can help reduce latency and improve performance. Ensure you select the appropriate region based on your target audience for optimal results.
For the most up-to-date information on region availability for other models, please refer to this guide on deploying models serverlessly.
Let’s get coding with Azure AI Studio and the Python SDK
Before proceeding to deployment, you’ll need a model that you have previously fine-tuned. One way is to use the process described in the two preceding installments of this fine-tuning blog post series: the first one covers synthetic dataset generation using RAFT and the second one covers fine-tuning. This ensures that you can fully benefit from the deployment steps using Azure AI Studio.
Note: All code samples that follow have been extracted from the 3_deploy.ipynb notebook of the raft-recipe GitHub repository. The snippets have been simplified and some intermediate steps omitted for readability. You can either head over there, clone the repo, and start experimenting right away, or stick with me here for an overview.
Step 1: Set Up Your Environment
First, ensure you have the necessary libraries installed. You’ll need the Azure Machine Learning SDK for Python (azure-ai-ml), along with azure-identity for authentication. You can install them using pip:
pip install azure-ai-ml azure-identity
Next, you’ll need to import the required modules and authenticate to your Azure ML workspace. This is standard boilerplate: the MLClient is the gateway to the ML workspace, which gives you access to everything AI and ML on Azure.
Code:
from azure.ai.ml import MLClient
from azure.identity import (
    DefaultAzureCredential,
    InteractiveBrowserCredential,
)
from azure.ai.ml.entities import MarketplaceSubscription, ServerlessEndpoint

# Try managed/CLI credentials first, then fall back to an interactive browser login.
try:
    credential = DefaultAzureCredential()
    credential.get_token("https://management.azure.com/.default")
except Exception as ex:
    credential = InteractiveBrowserCredential()

# Load the workspace details from a config.json in the current directory.
try:
    client = MLClient.from_config(credential=credential)
except Exception:
    print("Please create a workspace configuration file in the current directory.")

# Get AzureML workspace object.
workspace = client._workspaces.get(client.workspace_name)
workspace_id = workspace._workspace_id
Step 2: Resolving the previously registered fine-tuned model
Before deploying, you need to resolve your fine-tuned model in the Azure ML workspace.
Since the fine-tuning job might still be running, you may want to wait for the model to be registered. Here’s a simple helper function you can use.
Code:
def wait_for_model(client, model_name):
    """Wait for the model to be available, typically waiting for a finetuning job to complete."""
    import time
    attempts = 0
    while True:
        try:
            model = client.models.get(model_name, label="latest")
            return model
        except Exception:
            print(f"Model not found yet #{attempts}")
            attempts += 1
            time.sleep(30)
The above function is basic but will make sure your deployment can proceed as soon as your model becomes available.
Code:
print(f"Waiting for fine tuned model {FINETUNED_MODEL_NAME} to complete training...")
model = wait_for_model(client, FINETUNED_MODEL_NAME)
print(f"Model {FINETUNED_MODEL_NAME} is ready")
Step 3: Subscribe to the model provider
Before deploying a model fine-tuned using a base model from a third-party non-Microsoft source, you need to subscribe to the model provider’s marketplace offering. This subscription allows you to access and use the model within Azure ML.
Code:
print(f"Deploying model asset id {model_asset_id}")
from azure.core.exceptions import ResourceExistsError
marketplace_subscription = MarketplaceSubscription(
model_id=base_model_id,
name=subscription_name,
)
try:
marketplace_subscription = client.marketplace_subscriptions.begin_create_or_update(marketplace_subscription).result()
except ResourceExistsError as ex:
print(f"Marketplace subscription {subscription_name} already exists for model {base_model_id}")
Details on how to construct the base_model_id and subscription_name are available in the 3_deploy.ipynb notebook.
Step 4: Deploy the model as a serverless endpoint
This section manages the deployment of a serverless endpoint for your fine-tuned model using the Azure ML client. It checks for an existing endpoint and creates one if it doesn’t exist, then proceeds with the deployment.
Code:
from azure.core.exceptions import ResourceNotFoundError

# Reuse the endpoint if it already exists, otherwise create it from the fine-tuned model asset.
try:
    serverless_endpoint = client.serverless_endpoints.get(endpoint_name)
    print(f"Found existing endpoint {endpoint_name}")
except ResourceNotFoundError as ex:
    serverless_endpoint = ServerlessEndpoint(name=endpoint_name, model_id=model_asset_id)
    print("Waiting for deployment to complete...")
    serverless_endpoint = client.serverless_endpoints.begin_create_or_update(serverless_endpoint).result()
    print("Deployment complete")
Step 5: Check that the endpoint is correctly deployed
As part of a deployment pipeline, it is good practice to include integration tests that check that the model is correctly deployed and fail fast, instead of letting steps further down the line fail without context.
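The test below references endpoint.scoring_uri and endpoint_keys.primary_key. As a hedged sketch, assuming your azure-ai-ml version exposes get_keys on the serverless endpoint operations, these could be resolved like this:
Code:
# Sketch: resolve the endpoint details and auth keys used by the test below.
# Assumes the azure-ai-ml serverless endpoint operations expose get_keys;
# check your SDK version if this call is unavailable.
endpoint = client.serverless_endpoints.get(endpoint_name)
endpoint_keys = client.serverless_endpoints.get_keys(endpoint_name)
print(f"Scoring URI: {endpoint.scoring_uri}")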
Code:
import requests

url = f"{endpoint.scoring_uri}/v1/chat/completions"

prompt = "What do you know?"
payload = {
    "messages": [{"role": "user", "content": prompt}],
    "max_tokens": 1024,
}
headers = {"Content-Type": "application/json", "Authorization": endpoint_keys.primary_key}

response = requests.post(url, json=payload, headers=headers)
response.json()
This code assumes that the deployed model is a chat model for simplicity. The code available in the 3_deploy.ipynb notebook is more generic and covers both completion and chat models.
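As a hedged illustration of what that more generic handling might look like (the is_chat flag and the /v1/completions route are assumptions here, not taken from the notebook):
Code:
# Sketch: pick the route and payload shape based on the model flavor.
# The is_chat flag and the /v1/completions route are illustrative assumptions.
is_chat = True  # set according to the base model you fine-tuned

if is_chat:
    url = f"{endpoint.scoring_uri}/v1/chat/completions"
    payload = {"messages": [{"role": "user", "content": prompt}], "max_tokens": 1024}
else:
    url = f"{endpoint.scoring_uri}/v1/completions"
    payload = {"prompt": prompt, "max_tokens": 1024}

response = requests.post(url, json=payload, headers=headers)
assert response.status_code == 200, f"Endpoint check failed: {response.text}"
print(response.json())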
Conclusion
Deploying your fine-tuned model with Azure AI Studio and the Python SDK not only simplifies the process but also empowers you with unparalleled control, ensuring you have a robust and reliable platform for your deployment needs.
Stay tuned for our next blog post: in two weeks, we will delve into assessing the performance of your deployed model through rigorous evaluation methodologies. Until then, head over to the GitHub repo and happy coding!