alexanderhughes
Taking your machine learning (ML) models from local development into production can be challenging and time-consuming. It requires creating an HTTP layer above your model to process incoming requests, integrate with logging services, and handle errors safely. What’s more, the code required for pre- and post-processing, model loading, and model inference varies across models and must integrate smoothly with the HTTP layer.
Today, we are excited to announce the General Availability (GA) and open sourcing of the Azure Machine Learning Inference Server. This easy-to-use Python package provides an extensible HTTP layer that enables you to rapidly prepare your ML models for production scenarios at scale. The package takes care of request processing, logging, and error handling. It also provides a score script interface that allows for custom, user-defined pre- and post-processing, model loading, and inference code for any model.
Summary of the AzureML Inference HTTP Server
The Azure Machine Learning Inference Server is a Python package that exposes your ML model as an HTTP endpoint. The package contains a Flask-based server run via Gunicorn and is designed to handle production-scale requests. It is currently the default server used in the Azure Machine Learning prebuilt Docker images for inference. And, while it is built for production, it is also designed to support rapid local development.
Figure 1: How the Azure Machine Learning Inference Server Handles Incoming Requests
Score Script
The score script (sometimes referred to as the “scoring script” or “user code”) is how you provide your model to the server. It consists of two parts: an init() function, executed on server startup, and a run() function, executed when the server receives a request to the “/score” route.
On server startup...
The init() function is designed to hold the code for loading the model from the filesystem. It is only run once.
On request to “/score” route...
The run() function is designed to hold the code to handle inference requests. The code written here can be simple: passing raw JSON input to the model loaded in the init() function and returning the output. Or, it can be complex: running several pre-processing functions defined across multiple files, delegating inference to a GPU, and running content moderation on the model output before returning results to the user.
The score script is designed for maximum extensibility. Any code can be placed into init() or run() and it will be run when those functions are called as described above.
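As an illustration, the sketch below shows the typical shape of a score script. It assumes a scikit-learn model serialized with joblib as model.pkl; the file name and the use of joblib are assumptions for the example, while AZUREML_MODEL_DIR is the environment variable AzureML sets to the mounted model directory in a deployment.

```python
import json
import os

import joblib

model = None


def init():
    # Runs once at server startup: load the model from disk.
    # AZUREML_MODEL_DIR points to the mounted model folder in an AzureML
    # deployment; fall back to the current directory for local runs.
    global model
    model_dir = os.getenv("AZUREML_MODEL_DIR", ".")
    model = joblib.load(os.path.join(model_dir, "model.pkl"))  # assumed file name


def run(raw_data):
    # Runs on every request to the /score route.
    # raw_data is the raw JSON body of the request as a string.
    data = json.loads(raw_data)["data"]
    predictions = model.predict(data)
    return predictions.tolist()
```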
An example score script can be found in the azureml-examples repository on GitHub.
Key AzureML Inference Server Scenarios
Local Score Script Development and Debugging
Developing a complex score script may require iterative debugging, and it’s often not feasible to redeploy an online endpoint several times to debug potential issues. The AzureML Inference Server allows you to run a score script locally to test both model loading and inference request handling. It integrates easily with the VS Code debugger and allows you to step through potentially complex processing or inference steps.
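For example, after installing the azureml-inference-server-http package you can start the server against your score script and send it test requests. The sketch below assumes the documented azmlinfsrv entry point and its default port of 5001; the payload shape is only an example and depends on what your run() function expects.

```python
# In a separate terminal, start the server against your score script:
#   pip install azureml-inference-server-http
#   azmlinfsrv --entry_script score.py --port 5001
import requests

# Example payload; adjust to whatever your run() function expects.
payload = {"data": [[0.1, 0.2, 0.3]]}

response = requests.post(
    "http://127.0.0.1:5001/score",
    json=payload,
    timeout=30,
)
print(response.status_code, response.json())
```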
Note: To test your Docker image in addition to your score script, please refer to the AzureML Local Endpoints documentation.
Validation Gates for CI/CD Pipelines
The Azure Machine Learning Inference Server can also be used to create validation gates in a continuous integration and continuous deployment (CI/CD) pipeline. For example, you can start the server with a candidate score script and run a test suite against this local instance directly in the pipeline, enabling a safe, efficient, and automatable deployment process.
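One way to implement such a gate is a small pytest suite that launches the server with the candidate score script as a subprocess and asserts on its responses. The azmlinfsrv command and port below follow the package documentation; the startup wait, payload, and expected response shape are assumptions you would tune for your model.

```python
# test_score_endpoint.py - a minimal sketch of a CI validation gate.
import subprocess
import time

import pytest
import requests

SCORE_URL = "http://127.0.0.1:5001/score"


@pytest.fixture(scope="session")
def inference_server():
    # Launch the server with the candidate score script.
    proc = subprocess.Popen(
        ["azmlinfsrv", "--entry_script", "score.py", "--port", "5001"]
    )
    time.sleep(10)  # crude startup wait; a real pipeline would poll the server
    yield proc
    proc.terminate()


def test_score_route_returns_predictions(inference_server):
    payload = {"data": [[0.1, 0.2, 0.3]]}  # adjust to your model's input
    response = requests.post(SCORE_URL, json=payload, timeout=30)
    assert response.status_code == 200
    assert isinstance(response.json(), list)
```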
Production Deployments
The Azure Machine Learning Inference Server is designed to support production-scale inference. Once local testing is complete, you can feel confident using the score script you developed alongside the Azure Machine Learning prebuilt inference images to deploy your model as an AzureML Managed Online Endpoint.
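For reference, a deployment along these lines might look like the following sketch using the azure-ai-ml (SDK v2) Python package. The endpoint name, instance type, image reference, and file paths are placeholders, and the exact arguments should be checked against the AzureML documentation.

```python
from azure.ai.ml import MLClient
from azure.ai.ml.entities import (
    CodeConfiguration,
    Environment,
    ManagedOnlineDeployment,
    ManagedOnlineEndpoint,
    Model,
)
from azure.identity import DefaultAzureCredential

# Connect to the AzureML workspace (placeholders for your own IDs).
ml_client = MLClient(
    DefaultAzureCredential(),
    subscription_id="<subscription-id>",
    resource_group_name="<resource-group>",
    workspace_name="<workspace>",
)

# Create (or update) the endpoint, then a deployment that uses the
# score script developed and tested locally.
endpoint = ManagedOnlineEndpoint(name="my-endpoint", auth_mode="key")
ml_client.online_endpoints.begin_create_or_update(endpoint).result()

deployment = ManagedOnlineDeployment(
    name="blue",
    endpoint_name="my-endpoint",
    model=Model(path="./model"),
    environment=Environment(
        image="<azureml-prebuilt-inference-image>",  # placeholder for a prebuilt inference image
        conda_file="./environment/conda.yaml",
    ),
    code_configuration=CodeConfiguration(code="./src", scoring_script="score.py"),
    instance_type="Standard_DS3_v2",
    instance_count=1,
)
ml_client.online_deployments.begin_create_or_update(deployment).result()
```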
Safely bring your models into production using the Azure Machine Learning Inference Server and AzureML Managed Inference by referencing the resources below.
Learn More
- Azure Machine Learning (AzureML) Inference Server Documentation: Azure Machine Learning inference HTTP server - Azure Machine Learning | Microsoft Learn
- Azure Machine Learning (AzureML) Inference Server open-source GitHub repository: microsoft/azureml-inference-server (github.com)
- Azure Machine Learning (AzureML) Inference Server PyPI package: azureml-inference-server-http · PyPI
Continue reading...