A better Phi-3 Family is coming - multi-language support, better vision, intelligence MOEs

kinfey

Phi3getstarted.png



After the release of Phi-3 at Microsoft Build 2024, it has received considerable attention, especially for the use of Phi-3-mini and Phi-3-vision on edge devices. In the June update, we improved benchmark performance and system-role support by adjusting the high-quality training data. In the August update, based on community and customer feedback, we brought multi-language support to Phi-3-mini-128k-instruct, multi-frame image input to Phi-3-vision-128k, and added the new Phi-3 MoE for AI Agents. Let's take a look.

Multi-language support




In previous versions, Phi-3-mini had good support for English corpora but weak support for other languages. When we asked questions in Chinese, we often got wrong answers, such as

instruct_ghm.png

Obviously, this is a wrong answer.

But in the new version, with the added Chinese prediction support, the model has much better understanding and corpus coverage

instruct_new.png
You can also try the enhancements in other languages. Even in scenarios without fine-tuning or RAG, it is a capable model.
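To try it yourself, here is a minimal sketch using Hugging Face transformers; the model id comes from the Phi-3 collection linked in the resources below, and the question and generation settings are illustrative assumptions:

Code:
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

# Minimal sketch: ask the updated Phi-3-mini a question in Chinese.
model_id = "microsoft/Phi-3-mini-128k-instruct"
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto", trust_remote_code=True
)
tokenizer = AutoTokenizer.from_pretrained(model_id)

pipe = pipeline("text-generation", model=model, tokenizer=tokenizer)
messages = [{"role": "user", "content": "你能介绍一下自己吗?"}]  # "Can you introduce yourself?"
print(pipe(messages, max_new_tokens=256, return_full_text=False)[0]["generated_text"])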

Code Sample: https://github.com/microsoft/Phi-3CookBook/blob/main/code/09.UpdateSamples/Aug/phi3-instruct-demo.ipynb

Better vision

Phi-3-Vision enables Phi-3 not only to understand text and hold dialogues, but also to have visual capabilities such as OCR, object recognition, and image analysis. In real application scenarios, however, we often need to analyze multiple images together to find the associations between them, for example across videos, PPTs, and books. The new Phi-3-Vision supports multi-frame, multi-image input, so we can better perform this kind of summary analysis of videos, PPTs, and books in visual scenarios.

As an example, take a video. We can use OpenCV to extract key frames from it, in this case 21 key frame images, and store them in an array.
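One way to get the frames is to sample 21 evenly spaced positions with OpenCV; this is a minimal sketch, and the video path and output directory are placeholder assumptions:

Code:
import cv2

cap = cv2.VideoCapture("../input/video.mp4")  # placeholder path
total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
for i in range(1, 22):
    # Jump to the i-th of 21 evenly spaced positions and save that frame
    cap.set(cv2.CAP_PROP_POS_FRAMES, (i - 1) * total // 21)
    ok, frame = cap.read()
    if ok:
        cv2.imwrite(f"../output/keyframe_{i}.jpg", frame)
cap.release()

The saved frames can then be loaded and combined into Phi-3-Vision's multi-image placeholder format: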


Code:
from PIL import Image

images = []
placeholder = ""
# Load the 21 extracted key frames and build the multi-image placeholder string
for i in range(1, 22):
    images.append(Image.open(f"../output/keyframe_{i}.jpg"))
    placeholder += f"<|image_{i}|>\n"
Combined with Phi-3-Vision's chat template, we can perform a comprehensive analysis of multiple frames.
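For example, here is a minimal sketch of the multi-frame call via Hugging Face transformers, following the public microsoft/Phi-3-vision-128k-instruct model card; the question text and generation settings are illustrative:

Code:
from transformers import AutoModelForCausalLM, AutoProcessor

model_id = "microsoft/Phi-3-vision-128k-instruct"
model = AutoModelForCausalLM.from_pretrained(
    model_id, device_map="cuda", torch_dtype="auto", trust_remote_code=True
)
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)

# One user turn containing all 21 image placeholders plus the question
messages = [{"role": "user", "content": placeholder + "Summarize what happens across these video frames."}]
prompt = processor.tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

inputs = processor(prompt, images, return_tensors="pt").to("cuda")
generate_ids = model.generate(**inputs, max_new_tokens=500, eos_token_id=processor.tokenizer.eos_token_id)
generate_ids = generate_ids[:, inputs["input_ids"].shape[1]:]  # drop the prompt tokens
print(processor.batch_decode(generate_ids, skip_special_tokens=True)[0])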

vision_new.png

This allows us to perform dynamic vision-based work more efficiently, especially in edge scenarios.

Code Sample: https://github.com/microsoft/Phi-3CookBook/blob/main/code/09.UpdateSamples/Aug/phi3-vision-demo.ipynb


Intelligence MOEs

To achieve higher model performance, model size is, alongside computing power, one of the key factors. Under a limited compute budget, training a larger model for fewer steps is often better than training a smaller model for more steps.

Mixture-of-Experts (MoE) models have the following characteristics (a minimal routing sketch follows the list):

  • Faster pre-training than dense models
  • Faster inference than a dense model with the same number of parameters
  • High GPU memory requirements, because all experts need to be loaded into memory
  • Many challenges in fine-tuning, although recent research shows that instruction tuning of MoE models has great potential
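To make the routing idea concrete, here is a minimal, illustrative top-2 gated MoE feed-forward layer in PyTorch. It is not the actual Phi-3 MoE implementation, and all dimensions are made up for the demo:

Code:
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoELayer(nn.Module):
    # Illustrative top-2 gated mixture-of-experts feed-forward layer
    def __init__(self, d_model=512, d_ff=1024, n_experts=8, top_k=2):
        super().__init__()
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )
        self.gate = nn.Linear(d_model, n_experts)  # the router
        self.top_k = top_k

    def forward(self, x):  # x: (n_tokens, d_model)
        weights, idx = self.gate(x).topk(self.top_k, dim=-1)  # pick top-2 experts per token
        weights = F.softmax(weights, dim=-1)
        out = torch.zeros_like(x)
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, k] == e  # tokens routed to expert e in slot k
                if mask.any():
                    out[mask] += weights[mask, k:k + 1] * expert(x[mask])
        return out

# All 8 experts must sit in memory, but only 2 of them run per token.
layer = MoELayer()
print(layer(torch.randn(4, 512)).shape)  # torch.Size([4, 512])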

There are now many AI Agent applications, and we can use MoEs to empower them: in multi-task scenarios, responses are faster.

Let's explore a simple scenario: we want AI to help us write a tweet based on some content, translate it into Chinese, and publish it to a social network. We can use Phi-3 MoE to complete this, using the prompt to define and arrange the tasks, such as publishing blog content, translating content, and producing the final answer.


Code:
"""

sys_msg = """You are a helpful AI assistant, you are an agent capable of using a variety of tools to answer a question. Here are a few of the tools available to you:

- Blog: This tool helps you describe a certain knowledge point and content, and finally write it into Twitter or Facebook style content
- Translate: This is a tool that helps you translate into any language, using plain language as required
- Final Answer: the final answer tool must be used to respond to the user. You must use this when you have decided on an answer.

To use these tools you must always respond in JSON format containing `"tool_name"` and `"input"` key-value pairs. For example, to answer the question "Build Multi Agents with MOE models" you must use the Blog tool like so:


{
    "tool_name": "Blog",
    "input": "Build Muliti Agents with MOE models"
}


Or to translate the question "can you introduce yourself in Chinese" you must respond:


{
    "tool_name": "Search",
    "input": "can you introduce yourself in Chinese"
}


Remember to output only the final result, in JSON format containing `"agentid"`, `"tool_name"`, `"input"` and `"output"` key-value pairs:


[
    {   "agentid": "step1",
        "tool_name": "Blog",
        "input": "Build Muliti Agents with MOE models",
        "output": "........."
    },

    {   "agentid": "step2",
        "tool_name": "Search",
        "input": "can you introduce yourself in Chinese",
        "output": "........."
    },
    {
        "agentid": "final"
        "tool_name": "Result",
        "output": "........."
    }
]


The user's question is as follows.


"""


We can see that by telling the model which skills it has and how the tasks are arranged, Phi-3 MoE can assign the work to the different tools and complete the related tasks. For example:


Write something about Generative AI with MOEs, translate it to Chinese
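A minimal sketch of sending this system prompt and user request to the model via transformers; the Phi-3.5-MoE-instruct model id is an assumption based on the Hugging Face Phi-3 collection, and generation settings are illustrative:

Code:
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

model_id = "microsoft/Phi-3.5-MoE-instruct"  # assumed checkpoint from the Phi-3 collection
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto", trust_remote_code=True
)
tokenizer = AutoTokenizer.from_pretrained(model_id)

pipe = pipeline("text-generation", model=model, tokenizer=tokenizer)
messages = [
    {"role": "system", "content": sys_msg},
    {"role": "user", "content": "Write something about Generative AI with MOEs, translate it to Chinese"},
]
print(pipe(messages, max_new_tokens=600, return_full_text=False)[0]["generated_text"])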

The result:


Code:
[
    {   
        "agentid": "step1", 
        "tool_name": "Blog",
        "input": "Generative AI with MOE", 
        "output": "Generative AI with MOE (Mixture of Experts) is a powerful approach that combines the strengths of generative models and the flexibility of MOE architecture. This hybrid model can generate high-quality, diverse, and contextually relevant content, making it suitable for various applications such as content creation, data augmentation, and more."
    },
    {   
        "agentid": "step2",
        "tool_name": "Translate",    
        "input": "Generative AI with MOE is a powerful approach that combines the strengths of generative models and the flexibility of MOE architecture. This hybrid model can generate high-quality, diverse, and contextually relevant content, making it suitable for various applications such as content creation, data augmentation, and more.", 
        "output": "基于生成AI的MOE(Mixture of Experts)是一种强大的方法,它结合了生成模型的优势和MOE架构的灵活性。这种混合模型可以生成高质量、多样化且上下文相关的内容,使其适用于各种应用,如内容创建、数据增强等。"
    },
    {  
         "agentid": "final",
         "tool_name": "Result",
         "output": "基于生成AI的MOE(Mixture of Experts)是一种强大的方法,它结合了生成模型的优势和MOE架构的灵活性。这种混合模型可以生成高质量、多样化且上下文相关的内容,使其适用于各种应用,如内容创建、数据增强等。"
    }
]

If conditions permit, we can integrate the Phi-3 MoE model more smoothly into frameworks such as AutoGen, Semantic Kernel, and LangChain.

Code Sample: https://github.com/microsoft/Phi-3CookBook/blob/main/code/09.UpdateSamples/Aug/phi3_moe_demo.ipynb

Thoughts on SLMs


SLMs do not replace LLMs; rather, they open generative AI to a broader range of scenarios. The Phi-3 update brings better support to more edge devices, covering text, chat, and vision. In modern AI Agent scenarios, we want tasks to be executed more efficiently, and beyond raw computing power, MoEs are a key part of the solution. Phi-3 is still iterating; we hope you will keep following it and give us your feedback.

Resources

1. Download Microsoft Phi-3 Family https://huggingface.co/collections/microsoft/phi-3-6626e15e9585a200d2d761e3

2. Read the Phi-3 Cookbook https://aka.ms/phi-3cookbook

3. Learn about MoEs https://huggingface.co/blog/moe

