Visual Studio Code AI Toolkit: Run LLMs locally

Posted June 10, 2024Jun 10

The generative AI landscape is in a constant state of flux, with new developments emerging at a breakneck pace. In recent times along with LLMs we have also seen the rise of SLMs. From virtual assistants to chatbots, SLMs are revolutionizing how we interact with technology through conversation. As the backbone of many conversational models, SLMs enable natural language understanding and generation, leading to more engaging user experiences.

The deployment of large language models (LLMs) and smaller language models (SLMs) on local infrastructure has emerged as a critical area of discussion due to several compelling factors. These factors include maintaining stringent data privacy regulations, achieving cost-effectiveness over time, and enabling greater flexibility for customization and integration.

AI Toolkit (Earlier known as Windows AI Studio) is here to address such problems, some major problems this solves is,

Onboarding the LLMs/ SLMs on our local machines. This toolkit lets us to easily download the models on our local machine.
Evaluation of the model. Whenever we need to evaluate a model to check for the feasibility to any particular application, then this tool lets us do it in a playground environment, which is what we will seeing in this blog.
Fine-tuning, this majorly delas with training the model further to do the tasks that we specifically want the model to do. Usually, it does a generic task and has generic data, with fine-tuning we can give it a particular flavor to perform particular task.

The best part is that it runs on windows machine and has models which are optimized for windows machine. The AI toolkit lets the models run locally and makes it offline capable. AI toolkit opens up plethora of scenarios for organizations in various sectors like healthcare, education, banking, governments and so on.

Bring AI development into your VS Code workflow with the AI Toolkit extension. It empowers you to:

Run pre-optimized AI models locally: Get started quickly with models designed for various setups, including Windows 11 running with DirectML acceleration or direct CPU, Linux with NVIDIA GPUs, or CPU-only environments.

Test and integrate models seamlessly: Experiment with models in a user-friendly playground or use a REST API to incorporate them directly into your application.

Fine-tune models for specific needs: Customize pre-trained models (like popular SLMs Phi-3 and Mistral) locally or in the cloud to enhance performance, tailor responses, and control their style.

Deploy your AI-powered features: Choose between cloud deployment or embedding them within your device applications.

Alright! Now let’s experience this amazing extension on our machines using Visual Studio Code. Since this is available as VS Code extension, Visual Studio code is a direct prerequisite to use this tool. Use this link to download VSCode on your machines.

We can run AI Toolkit Preview directly on local machine. However, certain tasks might only be available on Windows or Linux depending on the chosen model. Mac support is on the way!

For local run on Windows + WSL, WSL Ubuntu distro 18.4 or greater should be installed and is set to default prior to using AI Toolkit. Learn more how to install Windows subsystem for Linux and changing default distribution or I have explained it step-wise in one of the previous blog where I have demonstrated the installation of windows AI studio. You can find it here. Steps of installation of WSL remains the same as explained in that blog. Windows AI Studio is deprecated and is now rebranded as AI Toolkit. For the latest documentation and to download and use the AI Toolkit, please visit the GitHub page.

Once the WSL is installed, launch the Ubuntu terminal and type the following,

[iCODE]code .[/iCODE]

This should launch the Visual studio code. Since this is the first launch, it will collect few things.

Now Visual studio code window will be launched.

Note that this will be in the remote connection with session name as WSL Ubuntu. Now the extensions we will be installing will be done in WSL.

On the activity bar Visual Studio Code window, there is an “Extension” option

Click on this and search for “AI Toolkit” and install the extension, once it is installed, we can see an extra icon on the activity bar.

Once it is installed, a new extension will be visible on the left side menu, when clicked on it, a pop-up notification comes up showcasing the port forwarding capabilities and also auto assigns one port for the Toolkit.

Also, two fresh sections are shown under AI-toolkit namely Models and Resources.

Models section contains the following,

Model Catalog

Resources section contains the following,

Model Playground
Model Finetuning

Models contains the Model Catalog which is basically the list of all the available AI Models. This is where we can choose and download a model which fits our use case.AI Toolkit offers the collection of publicly available AI models already optimized for Windows. The models are stored in the different locations including Hugging Face, GitHub and others, but we can browse the models and find all of them in one place ready for downloading and using in windows application.

We can also find the model cards for each of the model, to check various parameters of the model in order to further decide which one to choose for a particular application. Few more details like, number of parameters the model is pre-trained on, Dependency on CPU or GPU, the size of the model is all available here. Finally, upon deciding, the model can be downloaded using the “Download” button for each model. Any number of models can be downloaded.

For the purpose of this demonstration, I will download Mistral-mistral-7b-v02-int4-gpu and one of the recent SLM of Microsoft Phi-3-mini-128k-cuda-int4-onnx

Note: For optimized performance on Windows devices that have at least one GPU, select model versions that only target Windows. This ensures you have a model optimized for the DirectML accelerator. The model names are in the format of {model_name}-{accelerator}-{quantization}-{format}.

To check whether you have a GPU on your Windows device, open Task Manager and then select the Performance tab. If you have GPU(s), they will be listed under names like "GPU 0" or "GPU 1".

The next interesting part is the Model playground which is available in the resources section. For the models that we have evaluated using model card and downloaded, its time now to test them out using the Playground!

Playground has multiple sections, let’s see each one.

Model: This is the placeholder which lets us load the model. In this case I will be using the Phi-3-mini-128k-cuda-int4-onnx.
Context Instructions: This is the system prompt for the model. It guides the model the way in which it has to behave to a particular scenario. For example, we can ask it to respond in a Shakespearean tone, and it will respond accordingly. I will input “Respond in Shakespearean accent” as the Context Instruction.
Inference Parameters: These are the adjustment parameters for the model. Under this section we have Max response length (tokens), Temperature, Top P, Frequency Penalty, Presence penalty. Hovering over the small “i” icon explains about each parameter.
Chat Area: This is where we type in our messages and finally engage in a chat conversation with the model. The model responds on the pretrained data.

Note: Some machines might show the error as follows while loading the model in the playground.

Failed loading model Phi-3-mini-128k-cuda-int4-onnx: /onnxruntime_src/onnxruntime/core/session/provider_bridge_ort.cc:1426 onnxruntime::Provider& onnxruntime::ProviderLibrary::Get() [ONNXRuntimeError] : 1 : FAIL : Failed to load library libonnxruntime_providers_cuda.so with error: libcudnn.so.8: cannot open shared object file: No such file or directory

This is majorly due to some missing libraries, to fix it, execute the following commands one by one at a time in the Ubuntu terminal or the VS Code terminal of the WSL session.

•	pip install onnxruntime
•	pip install onnxruntime-gpu
•	cd /usr/local/cuda/lib64
•	ls 
•	sudo apt install nvidia-cudnn
•	sudo apt update
•	apt list --upgradable
•	sudo apt upgrade
•	sudo apt update
•	sudo apt update —fix-missing
•	sudo apt-get install libcudnn8
•	sudo apt update

This must help you resolve the error!

It is now responding in Shakespearean accent because of the Context information.

We can further evaluate the model based on our needs and the best part is that it is absolutely free and is now running on Local machine!! AI toolkit thus solves a major problem and also helps us in streamlining the development of GenAI applications. In the further blogs let’s see how to interact with the model using Python and build some cool applications. Stay tuned!

Quote

Sign In

Visual Studio Code AI Toolkit: Run LLMs locally

Featured Replies

Join the conversation

Account

Navigation

Search