Multimodal Public Preview Blog

Sherry_Shao · Sep 24, 2024


We are thrilled to announce the Public Preview release of our Multimodal model in Azure AI Content Safety. The Multimodal API analyzes materials containing both image content and text content to help make applications and services safer from harmful user-generated or AI-generated content.

The key objectives of the "Multimodal" feature are:

Detect Harmful Content Across Multiple Modalities: The primary objective is to detect harmful, inappropriate, or unsafe content by analyzing both text and images (including emojis). This includes identifying explicit content, hate speech, violence, self-harm, and sexual content within text-image combinations.
Contextual Analysis Across Text and Visuals: The multimodal model interprets textual and visual elements together, so it can detect subtle or implicit harmful content that might not be evident when looking at the text or image in isolation.
Real-Time Moderation: Provide real-time detection and moderation to prevent the generation, sharing, or dissemination of harmful content across multimodal platforms. This ensures that potentially harmful content is stopped before reaching users.

By addressing these objectives, the multimodal detection feature ensures a safer and more respectful user environment where content generation is creative yet responsible.

User Scenarios

Multimodal harmful content detection involves analyzing and moderating content across multiple modes, including text, images, and videos, to identify harmful, unsafe, or inappropriate materials. It becomes particularly crucial in scenarios where tools like DALL·E 3 are used for generating visual content based on textual prompts. The primary challenge lies in the variety and complexity of how harmful content might manifest, sometimes subtly across both the text and the generated images.

Harmful Imagery

User Scenario: A user prompts DALL·E 3 with text that seems innocent but leads to the generation of subtly harmful imagery (e.g., glorifying violence, hate symbols, or discriminatory representations).

Detection Mechanism: Multimodal detection evaluates the image content after it is generated, using models that can recognize visual cues related to hate speech, violence, and other harmful material.

Mitigation: Multimodal detection flags the generated image, so the application can prevent sharing and ask the user to revise their prompt.
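To make the mitigation step concrete, here is a minimal sketch of how an application might turn per-category severities into a flag, block, and revise decision. The `mitigate` function, the `Decision` type, the category names in the usage example, and the threshold of 4 are all hypothetical choices for illustration; they are not part of the service.

```python
from dataclasses import dataclass
from typing import List, Mapping

# Hypothetical policy: act on any category at or above this severity.
BLOCK_THRESHOLD = 4


@dataclass
class Decision:
    flagged: bool        # the image was flagged by the policy
    allow_sharing: bool  # whether the app should let the image be shared
    message: str         # text shown to the user, empty when nothing is wrong


def mitigate(severities: Mapping[str, int], threshold: int = BLOCK_THRESHOLD) -> Decision:
    """Map category -> severity results to an application-level decision."""
    offending: List[str] = sorted(
        category for category, severity in severities.items() if severity >= threshold
    )
    if not offending:
        return Decision(flagged=False, allow_sharing=True, message="")
    reasons = ", ".join(offending)
    return Decision(
        flagged=True,
        allow_sharing=False,
        message=f"The generated image was flagged ({reasons}). Please revise your prompt.",
    )
```

For example, `mitigate({"Hate": 0, "Violence": 6})` would return a decision that is flagged, blocks sharing, and asks the user to revise the prompt. The right threshold depends on the application's own policy.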

Text Embedded in Images

User Scenario: A user asks for an image containing text that promotes hate speech or false information (e.g., a sign or banner in an image with offensive language).

Detection Mechanism: Text within generated images is analyzed for harmful content using optical character recognition (OCR) alongside NLP techniques to understand the meaning and intent behind the text.

Mitigation: Once harmful text is detected, the application can refuse to display the image.
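As a rough illustration of submitting an image together with its accompanying text, the sketch below builds a JSON request body with the image base64-encoded and OCR switched on. The field names (`image.content`, `text`, `enableOcr`) and the helper name `build_request_body` are assumptions made for this example; check the API reference for the exact request schema before relying on them.

```python
import base64
import json


def build_request_body(image_bytes: bytes, text: str = "", enable_ocr: bool = True) -> str:
    """Serialize an image plus optional text into a JSON request body.

    Field names here are assumed for illustration, not confirmed by this post.
    """
    body = {
        # Binary image data is sent as a base64 string.
        "image": {"content": base64.b64encode(image_bytes).decode("ascii")},
        # Text that accompanies the image, such as the prompt or a caption.
        "text": text,
        # Assumed flag asking the service to also read text inside the image.
        "enableOcr": enable_ocr,
    }
    return json.dumps(body)
```

The resulting string could then be sent as the body of an HTTPS POST to the service endpoint using the HTTP client of your choice.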

Multimodal Detection API

The multimodal API accepts both text and image inputs. It is designed to perform multi-class and multi-severity detection, meaning it can classify content across multiple categories and assign a severity level to each one. For each category, the system returns a severity level of 0, 2, 4, or 6. The higher the number, the more severe the content.
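The sketch below shows one way a client might read such a result: it collects one severity per category and rejects any value outside 0, 2, 4, and 6. The response shape (a `categoriesAnalysis` list of `category` and `severity` pairs) and the helper names are assumptions for illustration; the documentation is the authority on the actual response format.

```python
from typing import Dict, Optional, Tuple

# Severity levels described above.
VALID_SEVERITIES = {0, 2, 4, 6}


def parse_severities(response: dict) -> Dict[str, int]:
    """Extract category -> severity from an (assumed) response dictionary."""
    result: Dict[str, int] = {}
    for item in response.get("categoriesAnalysis", []):
        severity = item["severity"]
        if severity not in VALID_SEVERITIES:
            raise ValueError(f"Unexpected severity {severity!r} for {item['category']}")
        result[item["category"]] = severity
    return result


def most_severe(severities: Dict[str, int]) -> Optional[Tuple[str, int]]:
    """Return the (category, severity) pair with the highest severity, if any."""
    if not severities:
        return None
    return max(severities.items(), key=lambda pair: pair[1])
```

A caller could pass the parsed JSON response to `parse_severities` and then use `most_severe` to decide which category to surface in a moderation log.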

Reference Links

Get started in Azure AI Content Safety studio: Content Safety Studio - Microsoft Azure
Get started in Azure AI Studio
To learn more about our new multimodal model, read our documentation.
For API input limits, see the Input requirements section of the Overview.
