Make your voice chatbots more engaging with new text to speech features

Qinying Liao
In our increasingly digital world, the importance of giving a voice and image to chatbots cannot be overstated. Transforming a chatbot from an impersonal, automated responder into a relatable and personable assistant significantly enhances user engagement.



Today we're thrilled to announce Azure AI Speech's latest updates, enhancing text to speech capabilities for a more engaging and lifelike chatbot experience. These updates include:

  • A wider range of multilingual voices for natural and authentic interactions;
  • More prebuilt avatar options, with the latest sample code for seamless GPT-4o integration; and
  • A new text stream API that significantly reduces latency for ChatGPT integration, ensuring smoother and faster responses.



Introducing new multilingual and IVR-styled voices




We're excited to introduce our newest collection of voices, equipped with advanced multilingual capabilities. These voices are crafted from a variety of source languages, bringing a rich diversity of personas to enhance your user experience. With their authentic and natural delivery, they promise to make your chatbot interactions noticeably more engaging.



Discover the diverse range of our new voices:


Voice name                        | Main locale                        | Gender
en-GB-AdaMultilingualNeural       | en-GB (English – United Kingdom)   | Female
en-GB-OllieMultilingualNeural     | en-GB (English – United Kingdom)   | Male
pt-BR-ThalitaMultilingualNeural   | pt-BR (Portuguese – Brazil)        | Female
es-ES-IsidoraMultilingualNeural   | es-ES (Spanish – Spain)            | Female
es-ES-ArabellaMultilingualNeural  | es-ES (Spanish – Spain)            | Female
it-IT-IsabellaMultilingualNeural  | it-IT (Italian – Italy)            | Female
it-IT-MarcelloMultilingualNeural  | it-IT (Italian – Italy)            | Male
it-IT-AlessioMultilingualNeural   | it-IT (Italian – Italy)            | Male



We're also delighted to present two new optimized en-US voices, specifically designed for call center scenarios - a prevalent application of text-to-speech technology.



They are:


Voice name        | Main locale                       | Gender
en-US-LunaNeural  | en-US (English – United States)   | Female
en-US-KaiNeural   | en-US (English – United States)   | Male



These voices are currently available in public preview in three regions: East US, West Europe, and Southeast Asia. Discover more in our Voice Gallery and delve deeper into the details via our developer documentation.
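To try one of these voices in your own bot, you can set it as the synthesis voice in the Speech SDK. The minimal sketch below uses Python and en-GB-AdaMultilingualNeural; the subscription key, region, and sample sentence are placeholders, not values from this announcement.

```python
import azure.cognitiveservices.speech as speechsdk

# Placeholder credentials - replace with your Speech resource key and a region
# where the preview voices are available (e.g. eastus, westeurope, southeastasia).
speech_config = speechsdk.SpeechConfig(subscription="YOUR_SPEECH_KEY", region="eastus")

# Select one of the new multilingual voices from the table above.
speech_config.speech_synthesis_voice_name = "en-GB-AdaMultilingualNeural"

# With no explicit audio config, audio plays to the default speaker.
synthesizer = speechsdk.SpeechSynthesizer(speech_config=speech_config)

# Synthesize a short chatbot reply; a multilingual voice can also speak other languages.
result = synthesizer.speak_text_async("Hello! How can I help you today?").get()
if result.reason == speechsdk.ResultReason.SynthesizingAudioCompleted:
    print("Speech synthesized to the default speaker.")
```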



Announcing advanced features for text to speech avatars




Text to speech avatar, previewed at Ignite 2023, lets users create realistic videos of a speaking avatar from simple text input, and build real-time interactive bots with more engaging visual elements. Since its preview, we have received great feedback and appreciation from customers in various industries. Today, we are glad to share what's been added to the avatar portfolio.



More prebuilt avatar options and more regions available




Our prebuilt text-to-speech avatars offer ready-to-deploy solutions for our customers. We've recently enriched our portfolio's diversity by introducing five new avatars. They can be used for both batch synthesis and real-time conversational scenarios. We remain committed to expanding our avatar collections to encompass a broader range of cultures and visual identities.



[Image: Text to speech avatars]



These newly introduced avatars can be accessed in Speech Studio for video creation and live chats. Dive deeper into the process of synthesizing a text-to-speech avatar using the Speech SDK for real-time synthesis in chatbot interactions, or batch synthesis for generating video content, as in the sketch below.
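As a rough illustration of the batch route, the sketch below submits a plain-text script to the avatar batch synthesis REST API with Python's requests library. The endpoint path, API version, voice, and avatar character and style names here are assumptions based on the public batch synthesis documentation and may differ from the current API; treat the linked documentation as the authoritative reference.

```python
import uuid
import requests

SPEECH_KEY = "YOUR_SPEECH_KEY"   # placeholder Speech resource key
REGION = "westus2"               # a region where the avatar service is available

# Assumed endpoint shape and API version for avatar batch synthesis; verify against the docs.
synthesis_id = str(uuid.uuid4())
url = (f"https://{REGION}.api.cognitive.microsoft.com/avatar/batchsyntheses/"
       f"{synthesis_id}?api-version=2024-08-01")

payload = {
    "inputKind": "PlainText",
    "inputs": [{"content": "Welcome! I'm your virtual assistant."}],
    "synthesisConfig": {"voice": "en-US-LunaNeural"},
    # Prebuilt avatar character and style names are illustrative; pick one from Speech Studio.
    "avatarConfig": {"talkingAvatarCharacter": "lisa", "talkingAvatarStyle": "graceful-sitting"},
}

resp = requests.put(url, json=payload, headers={"Ocp-Apim-Subscription-Key": SPEECH_KEY})
resp.raise_for_status()
print("Submitted batch avatar synthesis job:", synthesis_id)
# Poll the same URL with GET until the job status is "Succeeded", then download the result video.
```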



Beyond the previously available service regions - West US 2, West Europe, and Southeast Asia - we are excited to announce the expansion of our avatar service to three additional regions: Sweden Central, North Europe, and South Central US. Learn more here.



Enhanced text to speech avatar chat experience with Azure OpenAI capabilities




Text-to-speech avatars are increasingly leveraged for live chatbots, with many of our customers using Azure OpenAI to develop customer service bots, virtual assistants, AI educators, and virtual tourist guides, among others. These avatars, with their lifelike appearance and natural-sounding neural TTS or custom voice, combined with the advanced natural language processing capabilities of the Azure OpenAI GPT models, provide an interaction experience that closely mirrors human conversation.



The Azure OpenAI GPT-4o model is now part of the live chat avatar application in Speech Studio, so users can see firsthand how the live chat avatar and Azure OpenAI GPT-4o work together. Additionally, we provide sample code to aid in integrating the text-to-speech avatar with the GPT-4o model. Learn more about how to create lifelike chatbots with real-time avatars and Azure OpenAI GPTs, or dive into the code samples here (JavaScript and Python code samples).



This update also includes sample code to assist in customizing Azure OpenAI GPT on your data. Azure OpenAI On Your Data is a feature that enables users to tailor the chatbot's responses according to their own data source. This proves especially beneficial for enterprise customers aiming to develop an avatar-based live chat application capable of addressing business-specific queries from clients. For guidance on creating a live chat app using Azure OpenAI On Your Data, please refer to this sample code (search "On Your Data"); a minimal sketch of the underlying request follows.
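To give a flavor of what the sample wires up, the sketch below sends a chat completion request that grounds GPT-4o on an Azure AI Search index via the On Your Data data_sources parameter of the OpenAI Python library. All endpoints, deployment names, index names, keys, and the API version are placeholders or assumptions; the linked sample remains the authoritative reference.

```python
from openai import AzureOpenAI

# Placeholder resource details - replace with your own Azure OpenAI deployment.
client = AzureOpenAI(
    azure_endpoint="https://YOUR-AOAI-RESOURCE.openai.azure.com",
    api_key="YOUR_AOAI_KEY",
    api_version="2024-02-15-preview",  # assumed; use the version from the sample
)

completion = client.chat.completions.create(
    model="YOUR_GPT4O_DEPLOYMENT",
    messages=[{"role": "user", "content": "What is our return policy for online orders?"}],
    # "On Your Data": ground the answer on an Azure AI Search index of your documents.
    extra_body={
        "data_sources": [
            {
                "type": "azure_search",
                "parameters": {
                    "endpoint": "https://YOUR-SEARCH-RESOURCE.search.windows.net",
                    "index_name": "YOUR_INDEX_NAME",
                    "authentication": {"type": "api_key", "key": "YOUR_SEARCH_KEY"},
                },
            }
        ]
    },
)

# The grounded answer can then be handed to the avatar / text to speech pipeline.
print(completion.choices[0].message.content)
```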



More Responsible AI support for avatars




Ensuring responsibility in both the development and delivery of AI products is a core value for us. In line with this, we've introduced two features to bolster the responsible AI support for text-to-speech avatars, supplementing our existing transparency note, code of conduct, and disclosure guidelines.



  • We've integrated Azure AI Content Safety into the batch synthesis process of text to speech avatars for video creation scenarios. This added layer of text moderation allows for the detection of offensive, risky, or undesirable text input, thereby preventing the avatar from producing harmful output. The text moderation feature spans multiple categories, including sexual, violent, hate, self-harm content, and more. It's available for batch synthesis of text-to-speech avatars both in Speech Studio and via the batch synthesis API.
  • In our bid to provide audiences with clearer insight into the source and history of video content created by text to speech avatars, we've adopted the Coalition for Content Provenance and Authenticity (C2PA) standard. This standard offers transparent information about how video content was generated by AI. For more details on the integration of C2PA with text to speech avatars, refer to Content Credentials in Azure Text to Speech Avatar.



Unlocking real-time speech synthesis with the new text stream API




Our latest release introduces an innovative Text Stream API designed to harness the power of real-time text processing to generate speech with unprecedented speed. This new API is perfect for dynamic text vocalization, such as reading outputs from AI models like GPT in real-time.



The Text Stream API represents a significant leap forward from traditional non-text stream TTS technologies. By accepting input in chunks (as opposed to whole responses), it significantly reduces the latency that typically hinders seamless audio synthesis.



Comparison: Non-Text Stream vs. Text Stream




                        | Non-Text Stream                             | Text Stream
Input Type              | Whole GPT response                          | Each GPT output chunk
TTS First Byte Latency  | High (total GPT response time + TTS time)  | Low (time of a few GPT chunks + TTS time)



The Text Stream API not only minimizes latency but also enhances the fluidity and responsiveness of real-time speech outputs, making it an ideal choice for interactive applications, live events, and responsive AI-driven dialogues.



Utilizing the Text Stream API is straightforward: simply follow the steps provided with the Speech SDK. For detailed implementation, see the sample code on GitHub; a minimal sketch follows below.
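As a rough sketch of the flow, the example below streams chunks from a GPT chat completion straight into a text-stream synthesis request using the Speech SDK for Python. The websocket v2 endpoint, the SpeechSynthesisRequest text-stream input type, and the deployment, voice, and key values reflect the GitHub sample as understood here and should be treated as assumptions to verify against that sample.

```python
import azure.cognitiveservices.speech as speechsdk
from openai import AzureOpenAI

REGION, SPEECH_KEY = "eastus", "YOUR_SPEECH_KEY"  # placeholders

# Text streaming uses the websocket v2 synthesis endpoint (per the GitHub sample).
speech_config = speechsdk.SpeechConfig(
    endpoint=f"wss://{REGION}.tts.speech.microsoft.com/cognitiveservices/websocket/v2",
    subscription=SPEECH_KEY,
)
speech_config.speech_synthesis_voice_name = "en-US-LunaNeural"
synthesizer = speechsdk.SpeechSynthesizer(speech_config=speech_config)

# Open a text-stream synthesis request; audio starts as soon as the first chunks arrive.
tts_request = speechsdk.SpeechSynthesisRequest(
    input_type=speechsdk.SpeechSynthesisRequestInputType.TextStream
)
tts_task = synthesizer.speak_async(tts_request)

# Stream the GPT response and forward each text chunk to the synthesizer immediately.
aoai = AzureOpenAI(
    azure_endpoint="https://YOUR-AOAI-RESOURCE.openai.azure.com",
    api_key="YOUR_AOAI_KEY",
    api_version="2024-02-15-preview",  # assumed
)
stream = aoai.chat.completions.create(
    model="YOUR_GPT4O_DEPLOYMENT",
    messages=[{"role": "user", "content": "Tell me a short story about a friendly robot."}],
    stream=True,
)
for event in stream:
    if event.choices and event.choices[0].delta.content:
        tts_request.input_stream.write(event.choices[0].delta.content)

tts_request.input_stream.close()  # signal end of text
result = tts_task.get()           # wait for synthesis to finish
```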



Get started




Microsoft provides access to more than 500 neural voices spanning more than 140 languages and locales, complemented by avatar add-ons. These text-to-speech capabilities, part of the Azure AI Speech service, allow you to swiftly give chatbots a natural voice and realistic image, thereby enriching the conversational experience for users. Furthermore, the Custom Neural Voice and custom avatar features facilitate the creation of a distinctive brand voice and image for your chatbots. With a unique voice and image, a chatbot can seamlessly integrate into your brand's identity, contributing to a cohesive and memorable brand experience.



For more information




Zheng Niu and Junwei Gan also contributed to this article.


