The Azure AI Model Inference API provides a unified interface for developers to interact with various foundational models deployed in Azure AI Studio. This API allows developers to generate predictions from multiple models without changing their underlying code. By providing a consistent set of capabilities, the API simplifies the process of integrating and switching between different models, enabling seamless model selection based on task requirements.
Features
Foundational models have made significant advancements, particularly in natural language processing and computer vision. However, these models often excel in specific tasks and may approach the same problem differently. The Azure AI Model Inference API enables developers to:
- Enhance performance by selecting the most suitable model for a particular task.
- Optimize efficiency by using smaller, faster models for simpler tasks.
- Create complex experiences by composing multiple models.
- Maintain code portability across different models without sacrificing performance or capabilities.
Availability of Models
The Azure AI Model Inference API is available for the following models:
- Serverless API Endpoints
- Managed Inference:
  - Meta Llama 3 instruct family
  - Phi-3 family
  - Mistral and Mixtral family
Additionally, the API is compatible with Azure OpenAI model deployments. Note that models deployed after June 24th, 2024, can take advantage of managed inference capabilities.
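According to the SDK documentation, the same clients can point at an Azure OpenAI deployment. Here is a minimal sketch, assuming the deployment-scoped endpoint form and an api_version supported by your resource (the resource name, deployment name, and environment variable are placeholders):
Code:
import os
from azure.ai.inference import ChatCompletionsClient
from azure.core.credentials import AzureKeyCredential

# Azure OpenAI endpoints are deployment-scoped:
# https://<resource>.openai.azure.com/openai/deployments/<deployment>
client = ChatCompletionsClient(
    endpoint="https://your-resource.openai.azure.com/openai/deployments/your-deployment",
    credential=AzureKeyCredential(os.environ["AZURE_OPENAI_KEY"]),
    api_version="2024-06-01",  # assumed; use an API version your resource supports
)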
API Capabilities
The API supports multiple modalities, allowing developers to:
- Retrieve model information: Get details about the deployed model.
- Text embeddings: Generate an embedding vector for the input text.
- Text completions: Generate text based on a provided prompt.
- Chat completions: Create responses for chat conversations.
- Image embeddings: Generate embedding vectors for text and image inputs.
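As a quick sketch of the first two modalities, the snippet below (using the Python SDK introduced in the next section) retrieves model information and generates text embeddings. The environment variable names are illustrative:
Code:
import os
from azure.ai.inference import EmbeddingsClient
from azure.core.credentials import AzureKeyCredential

client = EmbeddingsClient(
    endpoint=os.environ["AZUREAI_ENDPOINT_URL"],
    credential=AzureKeyCredential(os.environ["AZUREAI_ENDPOINT_KEY"]),
)

# Retrieve model information: name, type, and provider of the deployment
info = client.get_model_info()
print(info.model_name, info.model_type, info.model_provider_name)

# Text embeddings: one vector per input string
response = client.embed(input=["The API is model-agnostic.", "Swap models freely."])
for item in response.data:
    print(f"Embedding {item.index} has {len(item.embedding)} dimensions")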
Inference SDK Support
The Azure AI Inference SDK provides streamlined clients in several languages, including Python, JavaScript, and C#, making it easy to consume predictions from models using the Azure AI Model Inference API.
Installation
To install the Python package, use:
Code:
pip install azure-ai-inference
Example: Creating a Client for Chat Completions
Here’s a quick example of how to create a client for chat completions using Python:
Code:
import os
from azure.ai.inference import ChatCompletionsClient
from azure.core.credentials import AzureKeyCredential

# Create a client using an API key
client = ChatCompletionsClient(
    endpoint=os.environ["AZUREAI_ENDPOINT_URL"],
    credential=AzureKeyCredential(os.environ["AZUREAI_ENDPOINT_KEY"]),
)

# Create a client using Microsoft Entra ID
from azure.identity import DefaultAzureCredential

client = ChatCompletionsClient(
    endpoint=os.environ["AZUREAI_ENDPOINT_URL"],
    credential=DefaultAzureCredential(),
)
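With a client in hand, a chat completion is a single complete() call. The prompt below is illustrative:
Code:
from azure.ai.inference.models import SystemMessage, UserMessage

# Send a system prompt and a user question; print the first generated choice
response = client.complete(
    messages=[
        SystemMessage(content="You are a helpful assistant."),
        UserMessage(content="Explain what a model-agnostic API is in one sentence."),
    ],
)
print(response.choices[0].message.content)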
Extensibility
The API allows developers to pass additional parameters to models beyond the specified modalities, using the extra-parameters header. For example, you can pass the safe_mode parameter, which isn't part of the API specification, to the Mistral-Large model like this:
Code:
from azure.ai.inference.models import SystemMessage, UserMessage
response = client.complete(
messages=[
SystemMessage(content="You are a helpful assistant."),
UserMessage(content="How many languages are in the world?"),
],
model_extras={"safe_mode": True}
)
print(response.choices[0].message.content)
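At the HTTP level, supplying model_extras corresponds to sending the extra-parameters header with the value pass-through, which instructs the endpoint to forward the unknown parameter to the model instead of rejecting the request.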
Handling Different Model Capabilities
If a model doesn't support a specific parameter, the API returns an error. You can handle these cases by inspecting the response:
Code:
import json
from azure.ai.inference.models import SystemMessage, UserMessage, ChatCompletionsResponseFormat
from azure.core.exceptions import HttpResponseError

try:
    response = client.complete(
        messages=[
            SystemMessage(content="You are a helpful assistant."),
            UserMessage(content="How many languages are in the world?"),
        ],
        response_format={"type": ChatCompletionsResponseFormat.JSON_OBJECT}
    )
except HttpResponseError as ex:
    if ex.status_code == 422:
        response = json.loads(ex.response._content.decode('utf-8'))
        for offending in response.get("detail", []):
            # "loc" entries can include list indices, so coerce them to strings
            param = ".".join(map(str, offending["loc"]))
            value = offending["input"]
            print(f"Model doesn't support the parameter '{param}' with value '{value}'")
    else:
        raise
Content Safety
The API also integrates with Azure AI Content Safety, filtering potentially harmful content. If a request triggers content safety measures, the response will indicate this, allowing developers to handle it accordingly.
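A minimal sketch of how such a response might be handled, assuming a filtered request surfaces as an HTTP 400 whose body carries an error code and message (the payload shape mirrors the 422 handling shown earlier; the prompt is a placeholder):
Code:
import json
from azure.ai.inference.models import SystemMessage, UserMessage
from azure.core.exceptions import HttpResponseError

try:
    response = client.complete(
        messages=[
            SystemMessage(content="You are a helpful assistant."),
            UserMessage(content="A prompt that might trip the content filters."),
        ],
    )
    print(response.choices[0].message.content)
except HttpResponseError as ex:
    if ex.status_code == 400:
        # Assumed payload shape: {"error": {"code": ..., "message": ...}}
        body = json.loads(ex.response._content.decode("utf-8"))
        error = body.get("error", {})
        print(f"Request was filtered ({error.get('code')}): {error.get('message')}")
    else:
        raise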
Getting Started
To start using the Azure AI Model Inference API, deploy any of the supported models to serverless API endpoints or managed online endpoints, then use the provided code to consume predictions.
Model Swapping Demo
Here’s an example of how easy it is to swap models in a Python solution while keeping the code consistent:
Code:
import os
from azure.ai.inference import ChatCompletionsClient
from azure.core.credentials import AzureKeyCredential
from azure.ai.inference.models import UserMessage

# Example function to swap models
def swap_model(endpoint_url, api_key):
    client = ChatCompletionsClient(
        endpoint=endpoint_url,
        credential=AzureKeyCredential(api_key),
    )
    return client

# Swapping between two models for evaluation
model_1 = swap_model(os.environ["MODEL1_ENDPOINT"], os.environ["MODEL1_KEY"])
model_2 = swap_model(os.environ["MODEL2_ENDPOINT"], os.environ["MODEL2_KEY"])

response_1 = model_1.complete(messages=[UserMessage(content="What's the weather today?")])
response_2 = model_2.complete(messages=[UserMessage(content="What's the weather today?")])

# Compare the results from the two models
print("Model 1 Response:", response_1.choices[0].message.content)
print("Model 2 Response:", response_2.choices[0].message.content)
Comparing Model Outputs Using the Azure AI Model Inference API
The Azure AI Model Inference API provides a convenient way to evaluate the effectiveness of different models and compare their outputs. Using the API, you can easily swap between models and test their performance on a given input prompt.
Here is an example of how to use the API to compare the outputs of two models:
Code:
import os
from azure.ai.inference import ChatCompletionsClient
from azure.core.credentials import AzureKeyCredential
from azure.ai.inference.models import UserMessage

# Set up the models (placeholder endpoints and keys)
model_1_endpoint = "https://your-model-1-resource.cognitiveservices.azure.com/"
model_1_key = "your-model-1-key"
model_2_endpoint = "https://your-model-2-resource.cognitiveservices.azure.com/"
model_2_key = "your-model-2-key"

# Example function to swap models
def swap_model(endpoint_url, api_key):
    client = ChatCompletionsClient(
        endpoint=endpoint_url,
        credential=AzureKeyCredential(api_key),
    )
    return client

# Swapping between two models for evaluation
model_1 = swap_model(model_1_endpoint, model_1_key)
model_2 = swap_model(model_2_endpoint, model_2_key)

# Set the model names for clarity
model_1_name = "text-davinci-002"
model_2_name = "text-curie-001"

# Set the input prompt
input_prompt = "What's the weather today?"

# Get responses from both models
response_1 = model_1.complete(messages=[UserMessage(content=input_prompt)], model=model_1_name)
response_2 = model_2.complete(messages=[UserMessage(content=input_prompt)], model=model_2_name)

# Compare the results from the two models
print(f"Model 1 ({model_1_name}) Response:", response_1.choices[0].message.content)
print(f"Model 2 ({model_2_name}) Response:", response_2.choices[0].message.content)
Comparison and Contrast:
Both models respond to the input prompt but with different styles and levels of detail.
- text-davinci-002 (Model 1) provides a more conversational and friendly response, acknowledging the user's question and offering suggestions on how to find the answer. The response is longer and more elaborate, with a more personal touch.
- text-curie-001 (Model 2) provides a more concise and direct response, simply stating that it's not aware of the current weather and offering suggestions on how to find out. The response is shorter and more to the point.
In general, text-davinci-002 is the larger, more capable model and tends to generate more creative and conversational responses, while text-curie-001 is smaller and faster, favoring shorter, more direct answers. This is reflected in their responses to this prompt.
Conclusion:
Using the Azure AI Model Inference API is a quick way to evaluate models and compare their outputs. By swapping between models and testing them on the same input prompt, you can gain valuable insight into each model's strengths and weaknesses and make an informed decision about which one fits your use case.
Explore our samples and read the API reference documentation to get started.