Use WebGPU + ONNX Runtime Web + Transformers.js to build RAG applications with Phi-3-mini


kinfey

Phi-3-mini can be deployed on many different edge devices, such as iPhone/Android, AI PC/Copilot+ PC, as well as cloud and IoT, demonstrating the cross-platform flexibility of SLMs. If you want to follow these deployment methods, you can follow the content of the Phi-3 Cookbook. For model inference, computing power is essential: through quantization, an SLM can be deployed and run on a GPU or even a traditional CPU. In this topic, we will focus on model inference with WebGPU.

What's WebGPU?


“WebGPU is a JavaScript API provided by a web browser that enables webpage scripts to efficiently utilize a device's graphics processing unit. This is achieved with the underlying Vulkan, Metal, or Direct3D 12 system APIs. On relevant devices, WebGPU is intended to supersede the older WebGL standard.” - Wikipedia

WebGPU allows developers to leverage the power of modern GPUs to implement web-based graphics and general-purpose computing applications on all platforms and devices, including desktops, mobile devices, and VR/AR headsets. WebGPU is not only promising for front-end applications, but is also important in the field of machine learning. For example, the familiar TensorFlow.js uses WebGPU to accelerate machine learning/deep learning workloads.

Required environment


  1. A supported browser: Google Chrome 113+, Microsoft Edge 113+, Safari 18 (macOS 15), or Firefox Nightly


  2. Enable WebGPU
  • Chrome / Microsoft Edge: the chrome://flags/#enable-unsafe-webgpu flag must be enabled (not enable-webgpu-developer-features). Experimental support on Linux also requires launching the browser with --enable-features=Vulkan.


  • Safari 18 (macOS 15): WebGPU is enabled by default


  • Firefox Nightly: enter about:config in the address bar and set dom.webgpu.enabled to true

  3. Use a JavaScript snippet to check whether WebGPU is supported

Code:
if (!navigator.gpu) {
  throw new Error("WebGPU not supported on this browser.");
}
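
The navigator.gpu check only confirms that the API is present. To verify that a usable GPU can actually be acquired, you can also request an adapter and device, as in this minimal sketch (standard WebGPU API calls; run inside an async function):

Code:
// Feature detection alone doesn't guarantee a usable GPU;
// also try to acquire an adapter and a device.
const adapter = await navigator.gpu.requestAdapter();
if (!adapter) {
  throw new Error("No appropriate GPUAdapter found.");
}
const device = await adapter.requestDevice();
console.log("WebGPU device acquired:", device);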

Why should Phi-3-mini run on WebGPU


We hope that application scenarios are cross-platform, not limited to a single device. The browser, as a cross-platform tool for accessing the Internet, can quickly expand our application scenarios. For Phi-3-mini, a quantized ONNX model with WebGPU support has been released, which lets us quickly build WebApp applications with Node.js and ONNX Runtime Web. Combined with WebGPU, we can build Copilot-style applications very simply.

Learn about ONNX Runtime Web


ONNX Runtime Web enables you to run and deploy machine learning models in your web application using JavaScript APIs and libraries. This section outlines the general flow of the development process. You can also integrate machine learning into the server side of your web application with ONNX Runtime using other language libraries, depending on your application development environment.

Starting with ONNX Runtime 1.17, ONNX Runtime Web supports WebGPU acceleration. We can combine the quantized Phi-3-mini-4k-instruct-onnx-web model with Transformers.js to build a web-based Copilot application.
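
As a rough sketch of what this looks like in code (the model path, input name, and tensor shape below are illustrative placeholders, not from the sample):

Code:
// Minimal sketch: create an inference session with the WebGPU execution
// provider and run it on a dummy input. The './model.onnx' path and the
// 'input' feed name are assumptions for illustration.
import * as ort from 'onnxruntime-web/webgpu';

const session = await ort.InferenceSession.create('./model.onnx', {
    executionProviders: ['webgpu'],
});

const input = new ort.Tensor('float32', new Float32Array([1, 2, 3, 4]), [1, 4]);
const results = await session.run({ input }); // feed name must match the model's input
console.log(results);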

Transformers.js


Transformers.js is designed to be functionally equivalent to Hugging Face's transformers Python library, meaning you can run the same pretrained models using a very similar API. These models support common tasks in different modalities, such as:

  • Natural Language Processing: text classification, named entity recognition, question answering, language modeling, summarization, translation, multiple choice, and text generation.
  • Computer Vision: image classification, object detection, and segmentation.
  • Audio: automatic speech recognition and audio classification.
  • Multimodal: zero-shot image classification.

Transformers.js uses ONNX Runtime to run models in the browser. Best of all, you can easily convert your pretrained PyTorch, TensorFlow, or JAX models to ONNX using Optimum.

Transformers.js supports numerous models across the Natural Language Processing, Vision, Audio, Tabular, and Multimodal domains.
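
As a quick taste of the API, here is a minimal sketch using the feature-extraction pipeline with the Xenova conversion of the embedding model used later in this post (treat the exact snippet as illustrative):

Code:
import { pipeline, cos_sim } from '@xenova/transformers';

// Create a feature-extraction pipeline and compare two sentences.
const extractor = await pipeline('feature-extraction', 'Xenova/jina-embeddings-v2-base-en');
const output = await extractor(
    ['What is WebGPU?', 'WebGPU is a JavaScript API provided by web browsers.'],
    { pooling: 'mean' }
);
console.log(cos_sim(output[0].data, output[1].data)); // cosine similarity in [-1, 1]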

Build a RAG WebApp with Phi-3-mini-4k-instruct-onnx-web


RAG applications are among the most popular scenarios for generative AI. This example integrates the Phi-3-mini-4k-instruct-onnx-web model and the jina-embeddings-v2-base-en embedding model to build a WebApp solution that can run on multiple devices.


A. Create the Phi3SLM class

Using ONNX Runtime Web as the backend for Phi-3-mini-4k-instruct-onnx-web, I built phi3_slm.js with reference to llm.js. If you want to see the complete code, please visit Phi-3CookBook/code/08.RAG/rag_webgpu_chat at main · microsoft/Phi-3CookBook. The following are the relevant points.

  1. The constructor sets the location from which Transformers.js loads models, and whether access to remote models is allowed.

Code:
    // `env` is imported from '@xenova/transformers'
    constructor() {
        env.localModelPath = 'models';   // load model files from the local 'models' folder
        env.allowRemoteModels = false;   // disable remote models
        env.allowLocalModels = true;     // enable local models
    }
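
With these settings, Transformers.js resolves model files from the local models/ folder served with the page, and never falls back to downloading them from the Hugging Face Hub.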
  2. ONNX Runtime Web settings

The standard ONNX Runtime Web library ships WebAssembly binaries that vary along the following feature dimensions:


  • SIMD: whether the Single Instruction, Multiple Data (SIMD) feature is supported.


  • Multi-threading: whether the WebAssembly multi-threading feature is supported.


  • JSEP: whether the JavaScript Execution Provider (JSEP) feature is enabled. This feature powers the WebGPU and WebNN execution providers.


  • Training: whether the training feature is enabled.

When using the WebGPU or WebNN execution providers, the ort-wasm-simd-threaded.jsep.wasm file is used.

So, add the following to phi3_slm.js:

Code:
// `ort` is imported from the onnxruntime-web package
ort.env.wasm.numThreads = 1;
ort.env.wasm.simd = true;
// Resolve the .wasm binaries from the app's dist/ folder (populated by webpack below)
ort.env.wasm.wasmPaths = document.location.pathname.replace('index.html', '') + 'dist/';

And set it in webpack.config.js

Code:
    // requires `const CopyWebpackPlugin = require('copy-webpack-plugin');` at the top of the config
    plugins: [
        // Copy .wasm files to dist folder
        new CopyWebpackPlugin({
            patterns: [
                {
                    from: 'node_modules/onnxruntime-web/dist/*.jsep.*',
                    to: 'dist/[name][ext]'
                },
            ],
        })
    ],
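
This copies the ort-wasm-simd-threaded.jsep.* binaries into dist/, which is exactly where the wasmPaths setting above tells ONNX Runtime Web to look for them at runtime.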
  3. To use WebGPU, specify the execution provider when creating the ORT session, for example:

Code:
const session = await ort.InferenceSession.create(modelPath, { ..., executionProviders: ['webgpu'] });

For the rest of the text-generation logic, refer to async generate(tokens, callback, options) in phi3_slm.js; a hedged sketch of how it might be called is shown below.
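
A sketch of a call site for generate (the tokenizer handle, callback, and option names are assumptions modeled on llm.js, not the sample's exact code):

Code:
// Tokenize the prompt with a Transformers.js tokenizer, then stream
// generated tokens through the callback. Names here are illustrative.
const { input_ids } = await this.tokenizer(prompt, { return_tensor: false });
await this.generate(input_ids, (output) => {
    console.log(output); // e.g. append the partial output to the chat UI
}, { max_tokens: 256 });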

B. Create RAG class

Calling the jina-embeddings-v2-base-en model through Transformers.js is consistent with how it is used in Python, but there are a few things to note.


  1. For jina-embeddings-v2-base-en, it is recommended to use the Xenova/jina-embeddings-v2-base-en model on Hugging Face, which has been converted and adjusted for better performance in the browser.


  2. Because a vector database is not used, vector similarity is computed directly over the embeddings to do the retrieval. This is also the most basic method, as shown below.

Code:
// `this.extractor` is a Transformers.js feature-extraction pipeline;
// `cos_sim` is imported from '@xenova/transformers'.
async getEmbeddings(query, kbContents) {
    let sim_result = [];

    // Embed the question together with each knowledge-base chunk,
    // then score the pair with cosine similarity.
    for (const content of kbContents) {
        const output = await this.extractor([query, content], { pooling: 'mean' });
        const sim = cos_sim(output[0].data, output[1].data);
        sim_result.push({ content, sim });
    }

    // Sort by similarity, descending, and return the best-matching chunk.
    sim_result.sort((a, b) => b.sim - a.sim);
    console.log(sim_result);

    return sim_result[0].content;
}
  3. Place both jina-embeddings-v2-base-en and the Phi-3-mini model under the models directory.
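
Note that getEmbeddings assumes this.extractor has already been created; a minimal initialization sketch (the init method name is an assumption):

Code:
import { pipeline } from '@xenova/transformers';

class RAG {
    async init() {
        // Resolves to models/jina-embeddings-v2-base-en because env.localModelPath = 'models'
        this.extractor = await pipeline('feature-extraction', 'jina-embeddings-v2-base-en');
    }
}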


C. Running


This application implements the RAG workflow over uploaded Markdown documents. We can see that it performs well in content generation.
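
To make the end-to-end flow concrete, here is a hypothetical sketch of how retrieval and generation fit together (rag, markdownChunks, and the prompt wording are illustrative, not the sample's exact code):

Code:
// 1. Retrieve the most relevant chunk from the uploaded markdown.
const question = 'What is WebGPU?';
const context = await rag.getEmbeddings(question, markdownChunks);

// 2. Build a Phi-3 chat prompt that grounds the answer in the retrieved context.
const prompt = `<|system|>\nAnswer the question using only the provided context.<|end|>\n` +
    `<|user|>\nContext:\n${context}\n\nQuestion: ${question}<|end|>\n<|assistant|>\n`;

// 3. Tokenize the prompt and stream the answer via generate() in phi3_slm.js.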

If you wish to run the example, you can visit the Sample Code link.

Resources


  1. Learning Phi-3-mini-4k-instruct-onnx-web: microsoft/Phi-3-mini-4k-instruct-onnx-web · Hugging Face

  2. Learning ONNX Runtime Web

  3. Learning WebGPU: https://www.w3.org/TR/webgpu/

  4. Reading: Enjoy the Power of Phi-3 with ONNX Runtime on your device

  5. Official E2E samples: onnxruntime-inference-examples/js/chat at main · microsoft/onnxruntime-inference-examples
