
Use WebGPU + ONNX Runtime Web + Transformers.js to build RAG applications with Phi-3-mini



Guest kinfey
Posted

Phi-3-mini can be deployed on many different edge devices, such as iPhone/Android, AI PC/Copilot+ PC, as well as in the cloud and on IoT devices, which demonstrates the cross-platform flexibility of SLMs. If you want to follow these deployment methods, see the Phi-3 Cookbook. For model inference, computing power is essential: through quantization, an SLM can be deployed and run on a GPU or a traditional CPU. In this topic, we will focus on model inference with WebGPU.

 

[HEADING=1]What's WebGPU?[/HEADING]

 

“WebGPU is a JavaScript API provided by a web browser that enables webpage scripts to efficiently utilize a device's graphics processing unit. This is achieved with the underlying Vulkan, Metal, or Direct3D 12 system APIs. On relevant devices, WebGPU is intended to supersede the older WebGL standard.” - Wikipedia

 

WebGPU allows developers to leverage the power of modern GPUs to implement web-based graphics and general computing applications on all platforms and devices, including desktops, mobile devices, and VR/AR headsets. Beyond its rich prospects in front-end applications, WebGPU also plays an important role in machine learning: for example, the familiar TensorFlow.js uses WebGPU to accelerate machine learning/deep learning workloads.
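To make "general computing" concrete, here is a minimal sketch (standard WebGPU/WGSL APIs, independent of the Phi-3 sample) that doubles an array of floats on the GPU:

// Minimal WebGPU compute sketch: double every element of a float array
const adapter = await navigator.gpu.requestAdapter();
const device = await adapter.requestDevice();

// WGSL compute shader: each invocation doubles one array element
const shader = device.createShaderModule({ code: `
  @group(0) @binding(0) var<storage, read_write> data: array<f32>;
  @compute @workgroup_size(64)
  fn main(@builtin(global_invocation_id) id: vec3<u32>) {
    if (id.x < arrayLength(&data)) { data[id.x] = data[id.x] * 2.0; }
  }` });

const pipeline = device.createComputePipeline({
  layout: 'auto',
  compute: { module: shader, entryPoint: 'main' },
});

const input = new Float32Array([1, 2, 3, 4]);
const dataBuffer = device.createBuffer({
  size: input.byteLength,
  usage: GPUBufferUsage.STORAGE | GPUBufferUsage.COPY_SRC | GPUBufferUsage.COPY_DST,
});
device.queue.writeBuffer(dataBuffer, 0, input);

const readBuffer = device.createBuffer({
  size: input.byteLength,
  usage: GPUBufferUsage.MAP_READ | GPUBufferUsage.COPY_DST,
});

const bindGroup = device.createBindGroup({
  layout: pipeline.getBindGroupLayout(0),
  entries: [{ binding: 0, resource: { buffer: dataBuffer } }],
});

const encoder = device.createCommandEncoder();
const pass = encoder.beginComputePass();
pass.setPipeline(pipeline);
pass.setBindGroup(0, bindGroup);
pass.dispatchWorkgroups(Math.ceil(input.length / 64));
pass.end();
encoder.copyBufferToBuffer(dataBuffer, 0, readBuffer, 0, input.byteLength);
device.queue.submit([encoder.finish()]);

await readBuffer.mapAsync(GPUMapMode.READ);
console.log(new Float32Array(readBuffer.getMappedRange())); // [2, 4, 6, 8]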

 

[HEADING=2]Required environment[/HEADING]


  1. Supported browsers: Google Chrome 113+, Microsoft Edge 113+, Safari 18 (macOS 15), Firefox Nightly
     
     

  2. Enable WebGPU
     

  • Google Chrome / Microsoft Edge: perform the following operations in the address bar

 

The chrome://flags/#enable-unsafe-webgpu flag must be enabled (not enable-webgpu-developer-features). Experimental support on Linux also requires launching the browser with --enable-features=Vulkan.

 


  • Safari 18 (macOS 15): WebGPU is enabled by default
     
     

  • Firefox Nightly: enter about:config in the address bar and set dom.webgpu.enabled to true
     

  3. Use a JavaScript snippet to check whether WebGPU is supported:

 

if (!navigator.gpu) {
    throw new Error("WebGPU not supported on this browser.");
}
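Beyond feature detection, a page must also request an adapter and a logical device before doing any WebGPU work. A minimal sketch using the standard WebGPU API:

const adapter = await navigator.gpu.requestAdapter();
if (!adapter) {
    throw new Error("No appropriate GPUAdapter found.");
}
// The device is the handle through which all GPU resources are created
const device = await adapter.requestDevice();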

[HEADING=1]Why should Phi-3-mini run on WebGPU?[/HEADING]

 

We hope our application scenarios are cross-platform rather than tied to a single device. The browser, as a cross-platform gateway to the Internet, can quickly expand our application scenarios. For Phi-3-mini, a quantized ONNX model with WebGPU support has been released, which lets you quickly build WebApp applications with Node.js and ONNX Runtime Web. Combined with WebGPU, we can build Copilot applications very simply.

 

[HEADING=2]Learn about ONNX Runtime Web[/HEADING]

 

ONNX Runtime Web enables you to run and deploy machine learning models in your web application using JavaScript APIs and libraries. This page outlines the general flow through the development process. You can also integrate machine learning into the server side of your web application with ONNX Runtime using other language libraries, depending on your application development environment.

 

Starting with ONNX Runtime 1.17, ONNX Runtime Web supports WebGPU acceleration. Combining the quantized Phi-3-mini-4k-instruct-onnx-web model with Transformers.js, we can build a Web-based Copilot application.

 

[HEADING=2]Transformers.js[/HEADING]

 

Transformers.js is designed to be functionally equivalent to Hugging Face's transformers Python library, meaning you can run the same pretrained models using a very similar API. These models support common tasks in different modalities, such as:

 

  • Natural Language Processing: text classification, named entity recognition, question answering, language modeling, summarization, translation, multiple choice, and text generation.
  • Computer Vision: image classification, object detection, and segmentation.
  • Audio: automatic speech recognition and audio classification.
  • Multimodal: zero-shot image classification.

 

Transformers.js uses ONNX Runtime to run models in the browser. The best part is that you can easily convert your pretrained PyTorch, TensorFlow, or JAX models to ONNX using Optimum.
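As a quick illustration of how similar the API feels, here is a minimal Transformers.js sketch (the default model downloaded for this task is a stock Transformers.js example, not part of this project):

import { pipeline } from '@xenova/transformers';

// Create a text-classification pipeline; Transformers.js runs the
// underlying ONNX model via ONNX Runtime in the browser
const classifier = await pipeline('text-classification');
const result = await classifier('WebGPU makes in-browser inference fast.');
console.log(result); // e.g. [{ label: 'POSITIVE', score: 0.99... }]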

 

Transformers.js supports numerous models across the Natural Language Processing, Vision, Audio, Tabular, and Multimodal domains.

 

[HEADING=1]Build Phi-3-mini-4k-instruct-onnx-web RAG WebApp application[/HEADING]

 

RAG applications are among the most popular scenarios for generative AI. This example integrates the Phi-3-mini-4k-instruct-onnx-web and jina-embeddings-v2-base-en embedding models to build a WebApp that delivers the solution across multiple terminals.

 


 

A. Create the Phi3SLM class

 

Using ONNX Runtime Web as the backend for Phi-3-mini-4k-instruct-onnx-web, I built phi3_slm.js with reference to llm.js. For the complete code, please visit Phi-3CookBook/code/08.RAG/rag_webgpu_chat at main · microsoft/Phi-3CookBook. The following are the relevant points.

 

  1. The constructor sets the location from which Transformers.js loads models, and whether remote model access is allowed:

 

   constructor() {
       // Load models from the local 'models' folder only
       env.localModelPath = 'models';
       env.allowRemoteModels = false; // disable remote models
       env.allowLocalModels = true;   // enable local models
   }
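For reference, with the settings above the tokenizer can then be loaded through Transformers.js from the local models folder. A minimal sketch (the folder name is illustrative):

import { AutoTokenizer } from '@xenova/transformers';

// With env.localModelPath = 'models', this resolves to a local folder
// such as models/Phi-3-mini-4k-instruct-onnx-web/ (name illustrative)
const tokenizer = await AutoTokenizer.from_pretrained('Phi-3-mini-4k-instruct-onnx-web');
const { input_ids } = await tokenizer('Hello, Phi-3!');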


  2. ONNX Runtime Web settings

 

The standard ONNX Runtime Web library ships WebAssembly binaries that vary along the following feature dimensions:

 


  • SIMD: whether the Single Instruction, Multiple Data (SIMD) feature is supported.
     
     

  • Multi-threading: whether the WebAssembly multi-threading feature is supported.
     
     

  • JSEP: whether the JavaScript Execution Provider (JSEP) feature is enabled. This feature powers the WebGPU and WebNN execution providers.
     
     

  • Training: whether the training feature is enabled.
     

 

When using the WebGPU or WebNN execution provider, the ort-wasm-simd-threaded.jsep.wasm file is used.

 

So add the following to phi3_slm.js:

 

ort.env.wasm.numThreads = 1;
ort.env.wasm.simd = true;
ort.env.wasm.wasmPaths = document.location.pathname.replace('index.html', '') + 'dist/';


 

And configure webpack.config.js to copy the .wasm files into the dist folder:

 

   plugins: [
       // Copy .wasm files to dist folder
       new CopyWebpackPlugin({
           patterns: [
               {
                   from: 'node_modules/onnxruntime-web/dist/*.jsep.*',
                   to: 'dist/[name][ext]'
               },
           ],
       })
   ],

  3. To use WebGPU, we need to specify it when creating the ORT session, like this:

 

const session = await ort.InferenceSession.create(modelPath, { ..., executionProviders: ['webgpu'] });


 

For the rest of the text-generation logic, refer to async generate(tokens, callback, options) in phi3_slm.js.
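At its core, such a generate loop is greedy decoding against the ORT session. The sketch below is deliberately simplified and assumes a decoder-only model whose only inputs are input_ids and attention_mask; the real Phi-3 web model additionally manages past key/value buffers, which llm.js handles:

// Simplified greedy-decoding sketch against an ORT Web session.
// Assumption: the model accepts only input_ids and attention_mask;
// the real Phi-3 web model also needs past key/value inputs (see llm.js).
async function greedyGenerate(session, inputIds, maxNewTokens, eosTokenId) {
    const tokens = inputIds.map((t) => BigInt(t));
    for (let step = 0; step < maxNewTokens; step++) {
        const feeds = {
            input_ids: new ort.Tensor('int64', BigInt64Array.from(tokens), [1, tokens.length]),
            attention_mask: new ort.Tensor('int64', new BigInt64Array(tokens.length).fill(1n), [1, tokens.length]),
        };
        const output = await session.run(feeds);
        // logits shape: [1, seq_len, vocab_size]; take the last position
        const [, seqLen, vocabSize] = output.logits.dims;
        const last = output.logits.data.subarray((seqLen - 1) * vocabSize, seqLen * vocabSize);
        let next = 0;
        for (let i = 1; i < vocabSize; i++) if (last[i] > last[next]) next = i;
        if (next === eosTokenId) break;
        tokens.push(BigInt(next));
    }
    return tokens.map(Number);
}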

 

B. Create the RAG class

 

Calling the jina-embeddings-v2-base-en model through Transformers.js is consistent with its use in Python, but there are a few things to note.

 


  1. For jina-embeddings-v2-base-en, it is recommended to use Xenova/jina-embeddings-v2-base-en from Hugging Face, which performs better after conversion for the browser.
     
     

  2. Because no vector database is used, vector similarity is computed directly to complete the embedding lookup; this is the most basic approach (see the extractor setup and getEmbeddings method below).
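Before the similarity search can run, the extractor has to be created once. A minimal sketch using the Transformers.js feature-extraction pipeline with the Xenova model recommended above:

import { pipeline } from '@xenova/transformers';

// Create the embedding extractor once; in the RAG class this
// would be assigned to this.extractor (e.g. in the constructor)
const extractor = await pipeline('feature-extraction', 'Xenova/jina-embeddings-v2-base-en');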
     

 

async getEmbeddings(query, kbContents) {
    // Embed the question together with each knowledge-base chunk,
    // then rank the chunks by cosine similarity to the question
    const sim_result = [];

    for (const content of kbContents) {
        const output = await this.extractor([query, content], { pooling: 'mean' });
        const sim = cos_sim(output[0].data, output[1].data);
        sim_result.push({ content, sim });
    }

    // Sort descending and return the most similar chunk
    sim_result.sort((a, b) => b.sim - a.sim);
    return sim_result[0].content;
}
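Once the most relevant chunk is returned, it can be stitched into the Phi-3 chat prompt before generation. A hypothetical sketch (the helper name is illustrative; the markers follow Phi-3's chat template):

// Hypothetical helper: combine the retrieved chunk with the user's
// question using Phi-3's chat template, then hand the prompt to generate()
function buildRagPrompt(context, question) {
    return `<|system|>\nAnswer using only the provided context.\nContext: ${context}<|end|>\n` +
           `<|user|>\n${question}<|end|>\n<|assistant|>\n`;
}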


  3. Place both jina-embeddings-v2-base-en and the Phi-3 mini model in the models directory.

 


 

C. Running

 


 

This application implements the RAG function by uploading Markdown documents. As you can see, it performs well in content generation.

 

If you wish to run the example, visit this link: Sample Code

 

[HEADING=1]Resources[/HEADING]


  1. Learning Phi-3-mini-4k-instruct-onnx-web: microsoft/Phi-3-mini-4k-instruct-onnx-web · Hugging Face

  2. Learning ONNX Runtime Web

  3. Learning WebGPU: https://www.w3.org/TR/webgpu/

  4. Reading: Enjoy the Power of Phi-3 with ONNX Runtime on your device

  5. Official E2E samples: onnxruntime-inference-examples/js/chat at main · microsoft/onnxruntime-inference-examples
     

 
