Azure Neural TTS now available on devices for disconnected and hybrid scenarios


Posted by Qinying Liao

Azure Neural Text-to-Speech (Neural TTS) is a powerful AIGC (AI-generated content) service that allows users to turn text into lifelike speech. It has been applied to a wide range of scenarios, including voice assistants, content read-aloud capabilities, and accessibility tools. Over the past months, Azure Neural TTS has achieved parity with natural human recordings (see details) and has been extended to support more than 140 languages and variants (see details). These highly natural voices are available in the cloud, and on premises through containers.

At the same time, we have received many customer requests to support Neural TTS on devices, especially for scenarios where devices have no or unstable network connectivity, require extremely low latency, or have privacy constraints. For example, users of screen readers for accessibility (such as the speech feature on Windows) are asking for a better voice experience through higher on-device TTS quality. Automobile manufacturers are requesting features that keep in-car voice assistants working while disconnected.

To address the need for high-quality embedded TTS, we developed a new generation of device neural TTS technology that significantly improves embedded TTS quality compared to traditional on-device TTS voices, e.g., those based on the legacy SPS (Statistical Parametric Speech Synthesis) technology. Thanks to this new technology, natural on-device voices have shipped in Microsoft's flagship products such as Windows 11 Narrator, and are now available in the Speech service for Azure customers.

Neural voices on-device

A set of natural on-device voices recently became available with Narrator on Windows 11. Check the video below to hear how natural the new voices sound and how much better they are than the previous generation of embedded voices.

Windows 11's Narrator natural voices, from 30'27'' to 32'44''

With the Azure Speech service, you can easily embed the same natural on-device voices into your own business scenarios. Check the demo below to see how seamlessly a mobile speech experience switches from a connected environment to a disconnected one, with neural TTS voices available both in the cloud and embedded.

Seamless switch between cloud TTS and device TTS with Azure device neural TTS technology

This new generation of on-device neural TTS has three key advances: high quality, high efficiency, and high responsiveness.

High quality

Traditional on-device TTS voices are built with the legacy SPS technology, and their quality is significantly lower than that of cloud-based TTS, typically with a MOS (Mean Opinion Score) gap greater than 0.5. Now, with the new device neural TTS technology, we have closed the gap between device TTS and cloud TTS. Our MOS and CMOS (Comparative Mean Opinion Score) tests show that device neural TTS voice quality is very close to that of cloud TTS.

The table below compares voice naturalness, output support, and available features across traditional device TTS (SPS), embedded neural TTS, and cloud neural TTS. Here, "traditional device TTS" refers to the device SPS technology we shipped on Windows 10 and earlier Windows versions, which is also the dominant embedded TTS technology in the industry today.
|  | Traditional device TTS | Device neural TTS | Cloud neural TTS |
| --- | --- | --- | --- |
| MOS gap (on-device neural TTS as the baseline) | ~ -0.5 | 0 | ~ +0.05 |
| 16 kHz fidelity | Yes | Yes | Yes |
| 24 kHz fidelity | No | Yes | Yes |
| 48 kHz fidelity | No | No | Yes |
| Styles/emotions | No | No | Yes |

As the comparison above shows, with the new technology the naturalness of device neural TTS voices has reached near parity with the cloud version. Hear how close they sound in the samples below.


| Voice | Device neural TTS | Cloud neural TTS |
| --- | --- | --- |
| Jenny, en-US | (audio sample) | (audio sample) |
| Guy, en-US | (audio sample) | (audio sample) |
| Xiaoxiao, zh-CN | (audio sample) | (audio sample) |
| Yunxi, zh-CN | (audio sample) | (audio sample) |

High efficiency

Deploying neural network models to IoT devices is a major challenge for AI researchers and for many industries today. For device TTS scenarios, the challenge is even greater: in our customers' experience, the target devices are often low end, and the system reserves only a small share of CPU for TTS. We therefore had to create a highly efficient solution for device neural TTS.

Below are the metrics and scorecard for our device neural TTS system. Overall, its efficiency is close to that of traditional device TTS systems and can meet almost all customers' efficiency requirements.


| Metric | Value |
| --- | --- |
| CPU usage (DMIPS) | ~1200 |
| RTF¹ (820A², 1 thread) | ~0.1 |
| Output sample rate | 24 kHz |
| Model size | ~5 MB (acoustic model + vocoder) |
| Memory usage | <120 MB |
| NPU³ support | Yes |

Notes:

  1. RTF, or Real-Time Factor, measures the time in seconds needed to generate 1 second of audio.
  2. 820A is a CPU that is broadly used in automotive systems today. It is a typical platform for running device TTS and one that most customers can adopt, so we use this CPU as our measurement platform.
  3. NPU, or Neural Processing Unit, is a critical on-chip component, especially for AI-related processing. It accelerates neural network inference efficiently without increasing general CPU usage. More and more IoT device makers, such as car manufacturers, are using NPUs to accelerate their systems.
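As a concrete illustration of the RTF metric, here is a minimal Python sketch; the engine function is a hypothetical stand-in, not the actual SDK API:

```python
import time

def real_time_factor(synthesize, text, sample_rate=24000):
    """RTF = synthesis time / duration of the generated audio."""
    start = time.perf_counter()
    samples = synthesize(text)                 # engine call (stand-in below)
    elapsed = time.perf_counter() - start
    return elapsed / (len(samples) / sample_rate)

# Hypothetical stand-in engine: 1 second of silence per 10 characters.
fake_engine = lambda text: [0.0] * (len(text) // 10 * 24000)

# An RTF of ~0.1 means 1 second of audio is produced in ~0.1 seconds,
# i.e., synthesis runs roughly 10x faster than real time.
print(real_time_factor(fake_engine, "Hello world, this is a test sentence."))
```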


High responsiveness

High synthesis speed and low latency are critical factors for the user experience of a text-to-speech system. To ensure a highly responsive system, we designed device neural TTS to synthesize in a streaming mode, which means the latency is independent of the length of the input sentence. This allows for consistently low latency and a highly responsive experience during synthesis. To achieve streaming synthesis, both the acoustic model and the vocoder must support streaming inference.

With the streaming inference design, we achieved a latency of 100 ms on the 820A with a single thread.
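The sketch below illustrates the streaming idea under stated assumptions: synthesize_chunk is a hypothetical, simulated chunk-level engine, not the actual device TTS API. The time to first audio depends only on the first chunk, not on the total input length:

```python
import time

def synthesize_chunk(text_piece):
    # Hypothetical stand-in for chunk-level neural synthesis:
    # simulate per-chunk compute and return fake 24 kHz PCM samples.
    time.sleep(0.01)
    return [0.0] * (len(text_piece) * 600)

def synthesize_streaming(text, chunk_chars=20):
    """Yield audio chunk by chunk so playback can start immediately."""
    for i in range(0, len(text), chunk_chars):
        yield synthesize_chunk(text[i:i + chunk_chars])

# Time to first audio stays flat as the input grows.
for text in ("A short input.", "A much longer input sentence. " * 20):
    start = time.perf_counter()
    next(synthesize_streaming(text))   # first audio chunk only
    print(f"{len(text):4d} chars -> first audio in "
          f"{time.perf_counter() - start:.3f} s")
```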


How did we do that?

To improve device TTS technology with neural networks, we adopted an overall pipeline architecture similar to that of cloud TTS. The pipeline contains three major components: a text analyzer, an acoustic model, and a vocoder.
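Conceptually, the three stages chain together as in the toy Python sketch below; every function here is an illustrative placeholder rather than the actual engine implementation:

```python
def text_analyzer(text):
    """Normalize text into linguistic features (toy placeholder)."""
    return [ord(c) for c in text.lower()]

def acoustic_model(features):
    """Map linguistic features to mel-spectrogram frames (toy placeholder)."""
    return [[f / 255.0] * 80 for f in features]   # 80-band mel frames

def vocoder(mel_frames, samples_per_frame=300):
    """Convert mel frames into waveform samples (toy placeholder)."""
    return [frame[0] for frame in mel_frames for _ in range(samples_per_frame)]

def tts_pipeline(text):
    # text analyzer -> acoustic model -> vocoder
    return vocoder(acoustic_model(text_analyzer(text)))

audio = tts_pipeline("Hello from device neural TTS!")
print(f"{len(audio)} samples generated")
```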


For the acoustic model, we designed a brand-new model architecture named "LeanSpeech", an extremely lightweight and efficient model with strong learning capability. We use LeanSpeech as a student model that learns from the service model on the cloud, which acts as the teacher. With this design, we achieved an acoustic model of about 2.9 MB with quality close to that of the cloud service acoustic model.
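The post does not publish LeanSpeech training details, so the snippet below is only a generic teacher-student (knowledge distillation) sketch of the idea in PyTorch, with stand-in linear models and random data:

```python
import torch
import torch.nn.functional as F

# Stand-ins: both models map 256-dim linguistic features to 80-band mel frames.
teacher = torch.nn.Linear(256, 80)   # proxy for the large cloud acoustic model
student = torch.nn.Linear(256, 80)   # proxy for the small on-device model
optimizer = torch.optim.Adam(student.parameters(), lr=1e-3)

for step in range(100):
    features = torch.randn(32, 256)           # a batch of input features
    with torch.no_grad():
        soft_targets = teacher(features)       # teacher outputs as soft targets
    loss = F.l1_loss(student(features), soft_targets)  # student mimics teacher
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```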


[Figure: LeanSpeech acoustic model architecture]


In addition, we developed the device vocoder based on our latest cloud service vocoder, HiFiNet. The biggest challenge we faced was computation cost: if we had simply combined LeanSpeech with HiFiNet, the vocoder would have accounted for more than 90% of the computation, and the total CPU usage would block adoption on low-end devices or on systems with only a limited CPU budget, as in many in-car assistant scenarios.


To solve these challenges, we redesigned HiFiNet using highly efficient model units and applied model compression methods such as distillation. As a result, the on-device vocoder is 7x smaller and requires 4x less computation than the service vocoder on the cloud.


Get started

Embedded Speech with device neural TTS is in public preview with limited access. You can learn how to use it with the Speech SDK here. Apply for access through the Azure Cognitive Services embedded speech limited access review. For more information, see Limited access for embedded speech.
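For orientation, here is a minimal sketch of the SpeechSynthesizer pattern in the Speech SDK for Python using a cloud voice; the key and region values are placeholders. At the time of writing, embedded speech follows the same synthesizer pattern but is configured through an EmbeddedSpeechConfig in the SDK languages that support the preview:

```python
import azure.cognitiveservices.speech as speechsdk

# Placeholder credentials; use your own Speech resource key and region.
speech_config = speechsdk.SpeechConfig(subscription="YOUR_KEY", region="YOUR_REGION")
speech_config.speech_synthesis_voice_name = "en-US-JennyNeural"

# Synthesize to the default speaker.
synthesizer = speechsdk.SpeechSynthesizer(speech_config=speech_config)
result = synthesizer.speak_text_async("Neural TTS, now on devices too.").get()

if result.reason == speechsdk.ResultReason.SynthesizingAudioCompleted:
    print("Synthesis completed.")
```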


The languages and voices below have been released through the Azure embedded speech public preview. More languages will be supported based on business needs.


| Locale | Voice name | Gender |
| --- | --- | --- |
| en-US | Jenny | Female |
| en-US | Aria | Female |
| zh-CN | Xiaoxiao | Female |
| de-DE | Katja | Female |
| en-GB | Libby | Female |
| ja-JP | Nanami | Female |
| ko-KR | SunHi | Female |
| en-AU | Annette | Female |
| en-CA | Clara | Female |
| es-ES | Elvira | Female |
| es-MX | Dalia | Female |
| fr-CA | Sylvie | Female |
| fr-FR | Denise | Female |
| it-IT | Isabella | Female |
| pt-BR | Francisca | Female |
| en-US | Guy | Male |
| zh-CN | Yunxi | Male |
| de-DE | Conrad | Male |
| en-GB | Ryan | Male |
| ja-JP | Keita | Male |
| ko-KR | InJoon | Male |
| en-AU | William | Male |
| en-CA | Liam | Male |
| es-ES | Alvaro | Male |
| es-MX | Jorge | Male |
| fr-CA | Jean | Male |
| fr-FR | Henri | Male |
| it-IT | Diego | Male |
| pt-BR | Antonio | Male |


Microsoft offers a best-in-class AI voice generator with Azure Cognitive Services. Quickly add read-aloud functionality for a more accessible app design, or give a voice to chatbots to provide your users with a richer conversational experience, with over 400 highly natural voices available across more than 140 languages and locales. Or easily create a brand voice for your business with the Custom Neural Voice capability.

For more information:


Continue reading...
