Introduction
When integrating Azure OpenAI’s powerful models into your production environment, it’s essential to follow best practices to ensure security, reliability, and scalability. Azure provides a robust platform with enterprise capabilities that, when leveraged with OpenAI models like GPT-4, DALL-E 3, and various embedding models, can revolutionize how businesses interact with AI. This guidance document contains best practices for scaling OpenAI applications within Azure, detailing resource organization, quota management, rate limiting, and the strategic use of Provisioned Throughput Units (PTUs) and Azure API Management (APIM) for efficient load balancing.
Why do we care?
Large transformer models are mainstream nowadays, creating SoTA results for a variety of tasks. They are powerful but very expensive to train and use. The extremely high inference cost, in both time and memory, is a big bottleneck for adopting a powerful transformer for solving real-world tasks at scale.
Why is it hard to run inference for large transformer models? Besides the increasing size of SoTA models, there are two main factors contributing to the inference challenge (Pope et al. 2022):
1. Large memory footprint. Both the model parameters and intermediate states must be held in memory at inference time. For example:
  - The KV cache must be stored in memory during decoding; e.g., for a batch size of 512 and a context length of 2048, the KV cache totals 3 TB, that is, 3x the model size.
  - Inference cost from the attention mechanism scales quadratically with the input sequence length.
2. Low parallelizability. Inference generation is executed in an autoregressive fashion, making the decoding process hard to parallelize.
Best Practices for Azure OpenAI Resources
- Consolidate Azure OpenAI workloads under a single Azure subscription to streamline management and cost optimization.
- Treat Azure OpenAI resources as a shared service to ensure efficient usage of PTU and PAYG resources.
- Utilize separate subscriptions only for distinct development and production environments or for geographic requirements.
- Prefer resource groups for regional isolation, which simplifies scaling and management compared to multiple subscriptions.
- Maintain a single Azure OpenAI resource per region, allowing up to 30 enabled regions within a single subscription.
- Create both PAYG and PTU deployments within each Azure OpenAI resource for each model to ensure flexible scaling.
- Leverage PTUs for business-critical usage and PAYG for traffic that exceeds the PTU allocation.
Quotas and Rate Limiting
Azure imposes certain quotas and limits to manage resources effectively. Be aware of these limits and plan your usage accordingly. If your application is expected to scale, consider how you’ll manage dynamic quotas and provisioned throughput units (PTUs) to handle the load.
- Tokens: Tokens are the basic text units processed by OpenAI models. Efficient token management is crucial for cost and load balancing.
- Quotas:
  - OpenAI sets API quotas based on subscription plans, dictating API usage within specific time frames.
  - Quotas are set per model, per region, and per subscription.
  - Proactively monitor quotas to prevent unexpected service disruptions.
  - Quotas do not guarantee capacity, and traffic may be throttled if the service is overloaded.
  - During peak traffic, the service may throttle requests even if the quota has not been reached.
- Rate limiting:
  - Rate limiting ensures equitable API access and system stability.
  - Rate limits are imposed on the number of requests per minute (RPM) and the number of tokens per minute (TPM).
  - Implement backoff strategies to handle rate-limit errors effectively (see the retry sketch after this list).
- PTUs: The Azure OpenAI service, which provides Azure customers access to these models, offers fundamentally two different levels of service:
  - Pay-as-you-go (PAYG), priced based on usage of the service
  - Provisioned Throughput Units (PTU), fixed-term commitment pricing
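To make the backoff guidance concrete, below is a minimal retry sketch using the openai Python SDK against an Azure OpenAI deployment. The environment variable names and deployment name are placeholders, and the backoff parameters are illustrative rather than recommended values.

```python
import os
import random
import time

from openai import AzureOpenAI, RateLimitError

# Placeholder endpoint, key, and deployment name; substitute your own values.
client = AzureOpenAI(
    azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],
    api_key=os.environ["AZURE_OPENAI_API_KEY"],
    api_version="2024-02-01",
)

def chat_with_backoff(messages, deployment="gpt-4", max_retries=5):
    """Call chat completions, backing off exponentially when throttled (HTTP 429)."""
    for attempt in range(max_retries):
        try:
            return client.chat.completions.create(model=deployment, messages=messages)
        except RateLimitError as err:
            # Prefer the service's Retry-After hint; otherwise use exponential backoff with jitter.
            retry_after = err.response.headers.get("retry-after")
            delay = float(retry_after) if retry_after else (2 ** attempt + random.random())
            time.sleep(delay)
    raise RuntimeError("Still rate limited after all retries")
```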
Metrics and Monitoring
Azure OpenAI Metrics Dashboards: Start with the out-of-box dashboards provided by Azure OpenAI in the Azure portal. These dashboards display key metrics such as HTTP requests, token-based usage, PTU utilization, and fine-tuning activities, offering a quick snapshot of your deployment's health and performance.
Analyze Metrics: Utilize Azure Monitor metrics explorer to delve into essential metrics captured by default:
- Azure OpenAI Requests: Tracks the total number of API calls split by Status Code.
- Generated Completion Tokens and Processed Inference Tokens: Monitors token usage, which is crucial for managing capacity and operational costs.
- Provision-managed Utilization V2: Provides insights into utilization percentages, helping prevent overuse and ensuring efficient resource allocation.
- Time to Response: Time taken for the first response to appear after a user sends a prompt.
To calculate usage-based chargebacks for Provisioned Throughput Units (PTUs) when sharing an Azure OpenAI instance across multiple business units, it is essential to monitor and log token consumption accurately. Incorporate the "azure-openai-emit-token-metric" policy in Azure API Management to emit token consumption metrics directly into Application Insights. This policy facilitates tracking various token metrics such as Total Tokens, Prompt Tokens, and Completion Tokens, allowing for a thorough analysis of service utilization. Configure the policy with specific dimensions such as User ID, Client IP, and API ID to enhance granularity in reporting and insights. By implementing these strategies, organizations can ensure transparent and fair chargebacks based on actual usage, fostering accountability and optimized resource allocation across different business units.
Azure API Management policy reference - azure-openai-emit-token-metric
Given the two options, you would probably gravitate toward pay-as-you-go pricing; that is a logical conclusion for customers just starting to use these models in proof-of-concept or experimental use cases. But as customer use cases become production-ready, the PTU model becomes the obvious choice.
- Utilize PTUs for baseline usage of OpenAI workloads to guarantee consistent throughput.
- PAYG deployments should handle traffic that exceeds the PTU allocation.
If you think of the Azure OpenAI service as analogous to a freeway, the service helps cars (requests) travel to the models and ultimately back to their original location. The funny thing about highways and interstates, like standard pay-as-you-go deployments, is that you cannot control who is using the highway at the same time as you, which is akin to the service utilization we all experience during peak hours of the day. There is a posted speed limit, like rate limits, but you may never reach the speed you expect due to the factors mentioned above. Moreover, if you managed a fleet of vehicles (think of them as service calls) all using different parts of the highway, you also cannot predict which lane you will get stuck in. Some may luckily find the fast lane, but you can never prevent the circumstances ahead on the road. That is the risk we take when using the highway, but tollways (token-based consumption) give us the right to use it whenever we want. While some high-demand times are foreseeable, such as rush hour, there are also phantom slowdowns with no rhyme or reason behind them. As a result, your estimated travel time (response latency) can vary drastically based on the different traffic scenarios on the road.
Provisioned throughput (PTUs) is more analogous to The Boring Company's Loop than anything else. Unlike public transportation with predefined stops, the Loop gives you a predetermined estimate of how long it will take to arrive at your destination because there are no scheduled stops: you travel directly to where you are going. Provisioned throughput, like the Loop, is a function of how many tunnels (capacity), stations (client-side queuing), and vehicles (concurrency) you can handle at any one time. During peak travel times, even with the queued wait to get into the Loop at your first station (time to first token), you may arrive at your destination (end-to-end response time) faster than by taking the highway, because the speed limit is conceptually much higher with no traffic. This makes provisioned throughput much more advantageous, if and only if we rethink our client-side retry logic compared with how we have handled it previously. For instance, if you have a tolerance for longer per-call latencies, adding only a little latency in front of the call (time to first token) and exploiting the retry-after-ms value returned with the 429 response, you can define how long you are willing to wait before you redirect traffic to pay-as-you-go or other models. This implementation also ensures you get the highest possible throughput out of your PTUs.

In summary, for Azure OpenAI use cases that require predictable, consistent, and cost-efficient usage of the service, the Provisioned Throughput Unit (PTU) offering becomes the most reasonable solution, especially for business-critical production workloads.
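As a sketch of that pattern, the snippet below waits on a 429 from the PTU deployment only if the retry-after-ms hint is within a latency budget, and otherwise spills the request over to a pay-as-you-go deployment. The deployment names and the five-second tolerance are hypothetical assumptions, not part of the original guidance.

```python
import os
import time

from openai import AzureOpenAI, RateLimitError

client = AzureOpenAI(
    azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],
    api_key=os.environ["AZURE_OPENAI_API_KEY"],
    api_version="2024-02-01",
)

PTU_DEPLOYMENT = "gpt-4-ptu"      # hypothetical PTU-backed deployment
PAYG_DEPLOYMENT = "gpt-4-paygo"   # hypothetical pay-as-you-go deployment
MAX_WAIT_SECONDS = 5.0            # how long we are willing to queue before spilling over

def chat(messages):
    try:
        return client.chat.completions.create(model=PTU_DEPLOYMENT, messages=messages)
    except RateLimitError as err:
        # PTU deployments return a retry-after-ms hint on 429 responses.
        wait_s = float(err.response.headers.get("retry-after-ms", 0)) / 1000.0
        if 0 < wait_s <= MAX_WAIT_SECONDS:
            time.sleep(wait_s)
            return client.chat.completions.create(model=PTU_DEPLOYMENT, messages=messages)
        # The hint exceeds our tolerance: redirect this call to pay-as-you-go.
        return client.chat.completions.create(model=PAYG_DEPLOYMENT, messages=messages)
```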
Latency Improvement Techniques:
End-to-end latency depends on several factors:
- the model used
- the number of input tokens
- the number of completion tokens
- the infrastructure and load on the inference engine
Let's now talk about some of the best practices for reducing latency.
1. Prompt compression using LLMLingua: Benchmarks show that time to first token increases with the number of input tokens, so it is imperative to reduce input tokens as effectively as possible. The LLMLingua library compresses the prompt before it is passed to the model (see the compression sketch after this list). I discussed this in more detail in my previous blog here.
2. Skeleton of Thought (SoT): This technique makes it possible to produce long generations more quickly by first generating a skeleton and then expanding each point of the outline. SoT first assembles a skeleton request using a skeleton prompt template with the original question; the template guides the LLM to output a concise skeleton of the answer. The points are then extracted from the skeleton response and expanded. You can find the implementation here.
3. Maximizing the shared prompt prefix: Put dynamic portions (e.g., RAG results, history) later in the prompt. This makes your requests more KV-cache-friendly (which most LLM providers exploit) and means fewer input tokens are processed on each request.
4. Streaming: The single most effective approach, as it cuts the perceived waiting time to a second or less. (ChatGPT would feel very different if you saw nothing until each response was done.)
5. Limiting output tokens: Generating tokens is almost always the highest-latency step when using an LLM; as a general heuristic, cutting 50% of your output tokens may cut roughly 50% of your latency. Setting MAX_TOKENS close to the actual expected generation size helps too.
6. Parallelization: In use cases like classification, you can parallelize requests and use async as much as possible (see the streaming and parallelization sketch after this list).
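As referenced in item 1, here is a minimal prompt-compression sketch with the LLMLingua library. Note that the default compression model it downloads is large; the target token budget and the example strings are purely illustrative.

```python
# pip install llmlingua
from llmlingua import PromptCompressor

# Loads a small LM that scores tokens so low-information spans can be dropped.
compressor = PromptCompressor()

long_context = "...retrieved documents, chat history, and other bulky context..."
result = compressor.compress_prompt(
    long_context,
    instruction="Answer the user's question using only the provided context.",
    question="What are the best practices for scaling Azure OpenAI?",
    target_token=500,  # illustrative budget for the compressed context
)

compressed_prompt = result["compressed_prompt"]  # send this to the model instead
print(result["origin_tokens"], "->", result["compressed_tokens"])
```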
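And for items 4 through 6, a sketch that combines streaming, a capped max_tokens, and async fan-out using the openai Python SDK; the deployment name and the 256-token cap are placeholder assumptions.

```python
import asyncio
import os

from openai import AsyncAzureOpenAI

client = AsyncAzureOpenAI(
    azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],
    api_key=os.environ["AZURE_OPENAI_API_KEY"],
    api_version="2024-02-01",
)

DEPLOYMENT = "gpt-4"  # placeholder deployment name

async def stream_answer(prompt: str) -> str:
    """Stream tokens as they arrive so the user sees output almost immediately."""
    stream = await client.chat.completions.create(
        model=DEPLOYMENT,
        messages=[{"role": "user", "content": prompt}],
        max_tokens=256,  # cap output tokens to what the use case actually needs
        stream=True,
    )
    chunks = []
    async for event in stream:
        if event.choices and event.choices[0].delta.content:
            chunks.append(event.choices[0].delta.content)
    return "".join(chunks)

async def classify_many(texts: list[str]) -> list[str]:
    """Fan out independent classification calls concurrently."""
    return await asyncio.gather(*(stream_answer(f"Classify the sentiment: {t}") for t in texts))

if __name__ == "__main__":
    print(asyncio.run(classify_many(["I love this product", "Terrible support experience"])))
```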
Load Balancing with Azure API Management (APIM)
- APIM plays a pivotal role in managing, securing, and analyzing APIs.
- Policies within APIM can be used to manage traffic, secure APIs and enforce usage quotas.
- Load Balancing within APIM distributes traffic evenly, ensuring no single instance is overwhelmed.
- Circuit Breaker policies in APIM prevent cascading failures and improve system resilience.
- Smart Load Balancing with APIM ensures prioritized traffic distribution across multiple OpenAI resources.
APIM Policies for OpenAI:
Many service providers, including OpenAI, set limits on API calls. Azure OpenAI, for instance, has limits on tokens per minute (TPM) and requests per minute (RPM). Exceeding these limits results in a 429 ‘TooManyRequests’ HTTP Status code and a ‘Retry-After’ header, indicating a pause before the next request.
This solution incorporates a comprehensive approach, considering UX/workflow design, application resiliency, fault-handling logic, appropriate model selection, API policy configuration, logging, and monitoring. It introduces an Azure API Management Policy that seamlessly integrates a single endpoint to your applications while efficiently managing consumption across multiple OpenAI or other API backends based on their availability and priority.
Smart vs. Round-Robin Load Balancers
Our solution stands out in its intelligent handling of OpenAI throttling. It is responsive to the HTTP status code 429 (Too Many Requests), a common occurrence due to rate limits in Azure OpenAI. Unlike traditional round-robin methods, our solution dynamically directs traffic to non-throttling OpenAI backends, based on a prioritized order. When a high-priority backend starts throttling, traffic is automatically rerouted to lower-priority backends until the former recovers.
This smart load balancing solution effectively addresses the challenges posed by API limit constraints in Azure OpenAI. By implementing the strategies outlined in the provided documentation, you can ensure efficient and reliable application performance, leveraging the full potential of your OpenAI and Azure API Management resources.
Learn more about this implementation in this GitHub repo.
With support for round-robin, weighted (new), and priority-based (new) load balancing, you can now define your own load distribution strategy according to your specific requirements.
Define priorities within the load balancer configuration to ensure optimal utilization of specific Azure OpenAI endpoints, particularly those purchased as PTUs. In the event of any disruption, a circuit breaker mechanism kicks in, seamlessly transitioning to lower-priority instances based on predefined rules.
Our updated circuit breaker now features dynamic trip duration, leveraging values from the retry-after header provided by the backend. This ensures precise and timely recovery of the backends, maximizing the utilization of your priority backends to their fullest.
Learn more about load balancer and circuit breaker here.
Import OpenAI in APIM
The new Import Azure OpenAI as an API experience in Azure API Management provides an easy, single-click way to import your existing Azure OpenAI endpoints as APIs.
We streamline the onboarding process by automatically importing the OpenAPI schema for Azure OpenAI and setting up authentication to the Azure OpenAI endpoint using managed identity, removing the need for manual configuration. Additionally, within the same user-friendly experience, you can pre-configure Azure OpenAI policies, such as token limit and emit token metric, enabling swift and convenient setup.
Learn more about Import Azure OpenAI as an API here.
High Availability
- Use Azure API Management to route traffic, ensuring centralized security and compliance.
- Implement private endpoints to secure OpenAI resources and prevent unauthorized access.
- Leverage Managed Identity to secure access to OpenAI resources and other Azure services.
Azure API Management has built a set of GenAI Gateway capabilities:
- Azure OpenAI Token Limit Policy
- Azure OpenAI Emit Token Metric Policy
- Load Balancer and Circuit Breaker
- Import Azure OpenAI as an API
- Azure OpenAI Semantic Caching Policy (in public preview)
Source: GitHub - Azure-Samples/AI-Gateway, a repo that contains a set of experiments on using the GenAI capabilities of Azure API Management with Azure OpenAI and other services.
Security and Compliance
Security is paramount when deploying any application in a production environment. Azure OpenAI offers features to help secure your data and comply with various regulations:
- Role-based access control (RBAC) allows you to define who has access to what within your Azure resources.
- Content filtering and asynchronous content filtering can help ensure that the content generated by the AI models aligns with your policies and standards.
- Red teaming large language models (LLMs) can help identify potential vulnerabilities before they become issues.
Use managed identities to access Azure OpenAI in scenarios that do not require key sharing, which is more appropriate for a production environment. This approach has several advantages. If you employ managed identities, you don't have to handle credentials; in fact, the credentials are not even accessible to you. Moreover, you can use managed identities to authenticate to any resource that supports Microsoft Entra authentication, including your own applications. Finally, managed identities are free to use, which also matters if you have multiple applications that use OpenAI. Learn more about it here.
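For illustration, here is a minimal keyless-authentication sketch using the azure-identity and openai packages; the endpoint and deployment name are placeholders, and it assumes the calling identity has an appropriate role (such as Cognitive Services OpenAI User) on the resource.

```python
# pip install azure-identity openai
import os

from azure.identity import DefaultAzureCredential, get_bearer_token_provider
from openai import AzureOpenAI

# Exchanges the managed identity (or local developer credential) for Microsoft Entra
# tokens scoped to the Cognitive Services audience; no API key is ever handled.
token_provider = get_bearer_token_provider(
    DefaultAzureCredential(),
    "https://cognitiveservices.azure.com/.default",
)

client = AzureOpenAI(
    azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],  # placeholder endpoint
    azure_ad_token_provider=token_provider,
    api_version="2024-02-01",
)

response = client.chat.completions.create(
    model="gpt-4",  # placeholder deployment name
    messages=[{"role": "user", "content": "Hello from a keyless client"}],
)
print(response.choices[0].message.content)
```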
Responsible AI Practices
Adhering to responsible AI principles is essential. Azure OpenAI provides guidelines and tools to help:
- Transparency notes and a code of conduct can guide your AI’s behavior.
- Data privacy and security measures are crucial to protect your and your customers’ data.
- Monitoring for abuse and managing system message templates can help prevent and respond to any misuse of the AI services.
The Azure Responsible AI team recently announced the public preview of the 'Risks & safety monitoring' feature in Azure OpenAI Service. Microsoft is committed to ensuring that AI systems are developed and deployed in a way that is safe, secure, and trustworthy, and a set of tools exists to help make that possible. In addition to detecting and mitigating harmful content in near real time, risks & safety monitoring provides a better view of how the content filter mitigations perform on real customer traffic and surfaces insights on potentially abusive end users. With the risks & safety monitoring feature, customers can:
- Visualize the volume and ratio of user inputs and model outputs blocked by the content filters, along with a detailed breakdown by severity and category, and use that data to help developers or model owners understand harmful-request trends over time and inform adjustments to content filter configurations, blocklists, and the application design.
- Understand whether the service is being abused by any end users through 'potentially abusive user detection', which analyzes user behavior and the harmful requests sent to the model and generates a report for further action.
Conclusion
This guidance outlines a strategy for leveraging Azure OpenAI resources at an enterprise level. By centralizing OpenAI resources and adopting smart load balancing with APIM, organizations can maximize their investment in OpenAI, ensuring scalability, cost-effectiveness, and performance across a wide range of applications and use cases.
It takes a lot of effort to write this kind of blog, so please do clap and follow me to keep me motivated to write more such blogs.
Additional Resources
- AI-in-a-Box/guidance/scaling/README.md at main · Azure/AI-in-a-Box
- Smart load balancing with Azure API Management
- Smart load balancing with Azure Container Apps
- Using Azure API Management Circuit Breaker and Load balancing with Azure OpenAI Service
- Azure OpenAI offering models - Explain it Like I'm 5
- Introducing GenAI Gateway Capabilities in Azure API Management
- Best Practice Guidance for PTU
For more detailed information on Azure OpenAI's capabilities, tokens, quotas, rate limits, and PTUs, visit the Azure OpenAI Service documentation (Quickstarts, Tutorials, API Reference - Azure AI services).