Azure Video Indexer & Phi-3 introduce Textual Video Summary on Edge: Better Together story

  • Thread starter Thread starter Shay_Amram
  • Start date Start date
S

Shay_Amram

Azure AI Video Indexer collaborated with the Phi-3 team to introduce a Textual Video Summary capability on Edge.

This collaboration showcases the utilization of the SLM, Phi-3 model, enabling the Azure AI Video Indexer team to extend the same LLM based summarization capabilities that are available for the cloud, to also be available on the Edge.

This comes following Build’s 2024 announcements of the integration of Azure AI Video Indexer with language models to generate textual summaries of videos and the expansion of the Phi-3 models family. The feature is accessible both in the cloud, utilizing Azure Open AI, and at the Edge via the Phi-3-mini-4k-instruct model.



Powered by Phi-3, the Edge video summarization generates summaries for videos and audio files of any length, processing all data locally.

These summaries are accessible through the Azure AI Video Indexer portal or via the Azure AI Video Indexer API. Users have the flexibility to customize the length and style of the summaries to meet their specific requirements, ranging from brief and concise to extensive and formal.


large?v=v2&px=999.jpgFigure 1: a demonstration of VI's summarization capabilities is found here:
View: https://www.youtube.com/watch?v=56NY0oc2470




In this blog, we’ll discuss how both teams collaborated to integrate the Phi-3 Language Model into an Edge environment, offering high-quality video summarization in Azure AI Video Indexer enabled by ARC. We’ll cover the main challenges, the work done to achieve high quality results, and our commitment to maintaining high responsible AI standards.



Background


Azure AI Video Indexer is a one-stop-shop for video analytics and insights, with video summarization being a key component for quickly understanding content without watching the entire video. It also helps in searching and maintaining archives by providing the right level of detail. Given the rapid increase in video content, efficient summarization is essential.

At the same time, concerns about data privacy, residency, and regulations are growing among organizations, law enforcement, and private users. They may also wish to leverage their existing computing resources. Therefore, utilizing Edge infrastructure becomes vital, especially for companies facing legal, security, or privacy challenges, making an Edge solution necessary for video analytics and summarization.



Phi-3-Mini-4K


Creating a summary on Edge requires balancing many requirements: summary quality (see the Summarization section below for more details), runtime, costs, and various aspects of responsible AI. In our experiments with several small language models, Phi-3-Mini-4K provided the best balance between these factors.

Phi-3 is the latest Small Language Model that was released by Microsoft under the Phi-3 family. The Phi-3-Mini-4K-Instruct is a 3.8B parameters, lightweight, state-of-the-art open model trained with the Phi-3 datasets that includes both synthetic data and the filtered publicly available websites data with a focus on high-quality and reasoning dense properties. The model belongs to the Phi-3 family with the Mini version in two variants of context size (tokens): a 4K and a 128K variant (see quality benchmarks). For the summarization task we selected the 4K variant, which is high-quality, comparatively light, and can run on Edge, enabling customers to retain their data locally. In addition, the Phi-3 official model’s training included a safety alignment phase, which makes it compliant with responsible-AI considerations (trained to avoid harmful content, XPIA, etc.). All these make Phi-3 an ideal choice for the task of summarization on Edge.



Summarization


When summarizing a video, it is important to note the multi-modality of the input, as opposed to summarization of a text such as a textbook. We consume multiple products of Azure AI Video Indexer’s other insights to be used for summarization. When evaluating (both manually and semi-automatically) and scoring the summaries, we specifically note these aspects of the summary:

  • Conciseness: the summary is short and to the point. It doesn’t repeat itself.
  • Coherence: the structure is logical and easy to follow.
  • Objectivity: the summary is stated in an unbiased manner.
  • Accurate: the summary is factually accurate with respect to the video (also known as groundness).
  • Completeness: the summary contains all the main points of the video.



Multi-Modality Summarization and Sectioning


Providing a good summary of a video can be challenging. It must be able to correctly weigh the various elements that describe a video: Transcript, OCR, Audio Effects, Visual Labels, Detected Objects, and more. This work explored different methods to achieve that and produce a high-quality video summarization. Due to the limitations of context size (4K) we had to employ smart-sectioning to split the video into context-sized sections (this work has been previously discussed here). But in order to get a unified summary that is based of the entire video and not just a single section, all the sections must be somehow aggregated.

To facilitate this aggregation, we made a “running” summary window: in the beginning only the first section is summarized, but the summary of the second section would include, as part of the input, the result from the first:

607x55?v=v2.png

In this manner each iteration produces a summary that is based on the previous ones, carrying the main points from start to finish. Our summarization solution adapted prompt-engineering to coalesce the multi-modality inputs into a cohesive video summary, without any need to fine-tune or apply additional training to the Phi-3 base-model to adapt it for the summarization task.



Prompt Engineering​


We used the suggested chat-format for Phi-3: “system” (aka meta-prompt), “user”, and “assistant” roles. Our system prompt had to cover the following aspects:

  • Summarization instructions.
  • Guidelines for excluding harmful content in the output summary: avoid hate speech, violence, self-harm, etc.
  • Our prompt includes instructions to protect the model against indirect prompt injection attacks (XPIA).
  • Groundness to ensure that the summary only includes information discussed in the video and does not introduce external knowledge or fabricate facts.
  • Instructions to adhere to the meta-prompt and to avoid modifying the instructions, as well as to suppress the instructions from the summary output (this is also a quality concern, as internal instructions should not appear in the output).
  • Summary styles, such as “formal”, “casual”, “short”, or “long” that can be customized by the users to fit their preferences.

We analyzed 50 videos of different types, lengths, and domains to reflect the typical range of content indexed on Azure AI Video Indexer. These videos were manually assessed for summarization quality (conciseness, coherence, objectivity, etc.) through iterative experimentation with the aforementioned prompt aspects, until achieving satisfactory results across all criteria.

Adjusting the prompt to enhance one aspect can theoretically affect others and may necessitate costly and time-consuming human review. However, Phi-3 demonstrated considerable robustness, effectively following instructions without compromising the quality of the resulting summary. It adapted well to specific prompt changes tailored to our responsible-AI needs, ensuring the output met our requirements.



Evaluations​


After finalizing the prompts, we implemented an external review cycle with independent reviewers to ensure the system consistently meets quality and safety standards. The reviewers rated the summaries for a set of videos not used during development on a scale from 1 (bad) to 10 (perfect). The average scores for the manually labeled videos were: GPT-3.5Turbo at 6.9, Phi-3 at 7.5, and GPT-4 at 8.5. This shows that Phi-3, despite being a much smaller language model, achieved comparable scores to the GPT models in the summarization task.



Conclusions​


This article presents a case study of embedding Phi-3, a new Small Language Model from Microsoft, as an Edge solution for video summarization using Azure AI Video Indexer. Video summarization is a powerful feature that enables users to quickly grasp the content of a video without watching it entirely. It can also help in searching and maintaining an archive, giving just the right level of detail. Summarizing a video requires combining various modalities, extracted by Azure AI Video Indexer such as transcript, OCR, audio effects, visual labels, detected objects, and more, and weighing them accordingly. The article discusses the data science aspects of creating a high-quality video summarization, such as what makes a good summary, how to section videos, and the challenges of maintaining high-quality summary, while addressing responsible AI consideration, such as avoiding harmful content, ensuring data privacy, and complying with regulations. The article showcases the advantages of using Phi-3 as an Edge solution for video summarization, such as its high quality, lightweight, state-of-the-art performance, and its ability to run on Edge devices, powered by ARC.



Note: Azure AI Video Indexer enabled by Arc is an Azure Arc extension enabled service that runs video analysis, audio analysis, and generative AI on Edge devices. The solution is designed to run on Azure Arc enabled Kubernetes and supports many video formats. To leverage the summarization capability on Edge, you must sign up using this form to approve your subscription-id.



Read More


About the feature

About Phi-3 model

About Azure AI Video Indexer


Continue reading...
 
Back
Top