Hi, I'm Jambo, a Microsoft Learn Student Ambassador.
This article shows how to run the quantized Phi-3-vision model in ONNX format on the Jetson platform and perform inference for image + text dialogue tasks.
What is Jetson?
The Jetson platform, introduced by NVIDIA, consists of small arm64 devices equipped with powerful GPU computing capabilities. Designed specifically for edge computing and AI applications, Jetson devices run on Linux, enabling complex computing tasks with low power consumption. This makes them ideal for developing embedded AI and machine learning projects.
For other versions of the Phi-3 model, we can use llama.cpp to convert them into GGUF format to run on Jetson, and easily switch between different quantizations. Alternatively, you can conveniently use services like Ollama or LlamaEdge, which are based on llama.cpp. More information can be found in the Phi-3CookBook.
However, for the vision version, there is currently no way to convert it into GGUF format (#7444). Additionally, resource-constrained edge devices struggle to run the original, unquantized model via transformers. Therefore, we can use ONNX Runtime to run the quantized model in ONNX format.
What is ONNX Runtime?
ONNX Runtime is a high-performance inference engine designed to accelerate and execute AI models in the ONNX (Open Neural Network Exchange) format. The onnxruntime-genai is an API specifically built for LLM (Large Language Model) models, providing a simple way to run models like Llama, Phi, Gemma, and Mistral.
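To give a sense of the API, here is a minimal sketch of running a text-only model with the onnxruntime-genai Python package. The method names reflect the interface at the time of writing and may differ in newer releases, so check the version you build; the model path and prompt are placeholders.
Code:
import onnxruntime_genai as og

# Load an ONNX GenAI model folder (placeholder path)
model = og.Model("path/to/onnx-model-dir")
tokenizer = og.Tokenizer(model)
tokenizer_stream = tokenizer.create_stream()

params = og.GeneratorParams(model)
params.set_search_options(max_length=256)
params.input_ids = tokenizer.encode("<|user|>\nHello!<|end|>\n<|assistant|>\n")

# Generate token by token and stream the decoded text
generator = og.Generator(model, params)
while not generator.is_done():
    generator.compute_logits()
    generator.generate_next_token()
    print(tokenizer_stream.decode(generator.get_next_tokens()[0]), end="", flush=True)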
At the time of writing this article, onnxruntime-genai does not have a precompiled version for aarch64 + GPU, so we need to compile it ourselves.
Compiling onnxruntime-genai
Environment
- Jetpack 6.0 [L4T 36.3.0]
- Compilation platform: Jetson Orin
- Inference platform: Jetson Orin Nano
Cloning the onnxruntime-genai Repository
Code:
git clone https://github.com/microsoft/onnxruntime-genai
cd onnxruntime-genai
Installing ONNX Runtime
Download the aarch64 version of ONNX Runtime from the GitHub releases page and extract the header files and necessary library files.
Note: Ensure you download the aarch64 version. If there is a newer version, you can replace the version number in the link.
Code:
wget https://github.com/microsoft/onnxruntime/releases/download/v1.18.1/onnxruntime-linux-aarch64-1.18.1.tgz
tar -xvf onnxruntime-linux-aarch64-1.18.1.tgz
mv onnxruntime-linux-aarch64-1.18.1 ort
ONNX Runtime does not provide a precompiled version for aarch64 + GPU, but we can get the required library files from the dusty-nv image.
The following commands will copy the required library files from the dusty-nv image to the ort/lib directory.
Code:
id=$(docker create dustynv/onnxruntime:r36.2.0)
# docker cp does not expand wildcards, so copy the whole lib directory and pick out the files
docker cp $id:/usr/local/lib ./dusty-lib
cp ./dusty-lib/libonnxruntime*.so* ort/lib/
docker rm -v $id
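Optionally, we can quickly confirm that the GPU-enabled libraries landed in ort/lib before starting the build:
Code:
ls -l ort/lib/libonnxruntime*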
Compiling onnxruntime-genai
You should still be in the onnxruntime-genai directory at this point. Now we can build the Python API. You can use Python >= 3.6 for the compilation. JetPack 6.0 comes with Python 3.10 by default, but you can switch to another version for the compilation. The compiled whl can only be installed on the Python version used during the compilation.
Note: The compilation process requires a significant amount of memory. If your Jetson device has limited memory (like the Orin NX), do not use the --parallel parameter.
Code:
python3 build.py --use_cuda --cuda_home /usr/local/cuda-12.2 --skip_tests --skip_csharp [--parallel]
The compiled files will be located in the build/Linux/Release/dist/wheel directory, and we only need the .whl file. You can copy the whl file to other Jetson platforms with the same environment (CUDA) for installation.
Note: The generated subdirectory may differ, but we only need the .whl file from the build directory.
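If you are unsure where the wheel ended up, you can locate it with:
Code:
find build -name "*.whl"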
Installing onnxruntime-genai
If you have multiple CUDA versions, you might need to set the CUDA_PATH environment variable to ensure it points to the same version used during compilation.
Code:
export CUDA_PATH=/usr/local/cuda-12.2
Navigate to the directory where the whl file is located (or copy the whl file to another directory), then install it with the following command.
Code:
pip3 install *.whl
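Optionally, verify that the package imports correctly on the target device:
Code:
python3 -c "import onnxruntime_genai as og; print('onnxruntime-genai imported OK')"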
Running the Phi-3-vision Model
Downloading the Model
Download the onnx-cuda version of the Phi-3-vision model from Hugging Face.
Code:
pip3 install "huggingface-hub[cli]"
The FP16 model requires 8 GB of VRAM. If you are running on a device with more resources, like the Jetson Orin, you can opt for the FP16 model.
The Int 4 model is a quantized version, requiring only 3 GB of VRAM. This is suitable for more compact devices like the Jetson Orin Nano.
Code:
huggingface-cli download microsoft/Phi-3-vision-128k-instruct-onnx-cuda --include cuda-fp16/* --local-dir .
# Or
huggingface-cli download microsoft/Phi-3-vision-128k-instruct-onnx-cuda --include cuda-int4-rtn-block-32/* --local-dir .
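Once the download finishes, the target folder should contain the ONNX weights along with genai_config.json and the tokenizer/processor configuration files that onnxruntime-genai reads. For example, for the Int 4 model:
Code:
ls cuda-int4-rtn-block-32/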
Running the Example Script
Download the official example script and an example image.
Code:
# Download example script
wget https://raw.githubusercontent.com/microsoft/onnxruntime-genai/main/examples/python/phi3v.py
# Download example image
wget https://onnxruntime.ai/images/table.png
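For reference, the core of phi3v.py looks roughly like the following simplified sketch (not the full script; the multimodal processor API shown here matches the onnxruntime-genai version at the time of writing and may differ in newer releases):
Code:
import onnxruntime_genai as og

model = og.Model("cuda-int4-rtn-block-32")        # path to the downloaded model folder
processor = model.create_multimodal_processor()   # handles both image and text inputs
tokenizer_stream = processor.create_stream()

image = og.Images.open("table.png")
prompt = "<|user|>\n<|image_1|>\nConvert this image to markdown format<|end|>\n<|assistant|>\n"
inputs = processor(prompt, images=image)

params = og.GeneratorParams(model)
params.set_inputs(inputs)
params.set_search_options(max_length=3072)

generator = og.Generator(model, params)
while not generator.is_done():
    generator.compute_logits()
    generator.generate_next_token()
    print(tokenizer_stream.decode(generator.get_next_tokens()[0]), end="", flush=True)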
Run the example script.
Code:
python3 phi3v.py -m cuda-int4-rtn-block-32
First, input the path to the image, for example: table.png
Next, input the prompt text, for example: Convert this image to markdown format
Code:
```markdown
| Product | Qtr 1 | Qtr 2 | Grand Total |
|---------------------|----------|----------|-------------|
| Chocolade | $744.60 | $162.56 | $907.16 |
| Gummibarchen | $5,079.60| $1,249.20| $6,328.80 |
| Scottish Longbreads | $1,267.50| $1,062.50| $2,330.00 |
| Sir Rodney's Scones | $1,418.00| $756.00 | $2,174.00 |
| Tarte au sucre | $4,728.00| $4,547.92| $9,275.92 |
| Chocolate Biscuits | $943.89 | $349.60 | $1,293.49 |
| Total | $14,181.59| $8,127.78| $22,309.37 |
```
The table lists various products along with their sales figures for Qtr 1, Qtr 2, and the Grand Total. The products include Chocolade, Gummibarchen, Scottish Longbreads, Sir Rodney's Scones, Tarte au sucre, and Chocolate Biscuits. The Grand Total column sums up the sales for each product across the two quarters.
Note: The first round of dialogue during script execution might be slow, but subsequent dialogues will be faster.
We can use jtop to monitor resource usage:
The above inference is run on Jetson Orin Nano using the Int 4 quantized model. As shown, the Python process occupies 5.4 GB of VRAM for inference, with minimal CPU load and nearly full GPU utilization during inference.
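If jtop is not installed yet, it comes with the jetson-stats package:
Code:
sudo pip3 install -U jetson-stats
# you may need to log out and back in (or reboot) so the jtop service starts, then run:
jtop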
We can modify the example script to use the time function at key points to measure the inference speed, which is remarkably fast. All of this is achieved on a device with a power consumption of just 15 W.
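For example, a minimal way to time the generation loop (variable names follow the phi3v.py sketch above; this is an illustrative modification, not part of the official script):
Code:
import time

start = time.time()
token_count = 0
while not generator.is_done():
    generator.compute_logits()
    generator.generate_next_token()
    token_count += 1
    print(tokenizer_stream.decode(generator.get_next_tokens()[0]), end="", flush=True)

elapsed = time.time() - start
print(f"\n{token_count} tokens in {elapsed:.2f} s ({token_count / elapsed:.1f} tokens/s)")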