TalkwithMe leverages the Azure API to provide customized, immersive AI language learning experiences

Jiechen_Li

Azure API Meets Language Learning: The Story of “TalkwithMe”




Inspiration


The genesis of our AI language learning product is quite a story: it all started during a discussion about the lack of customizable learning tools for reading practice. My friend Fan (Benjamin) Wang, an Online Course Designer & Technical Writer at Shanbay China, shared his ideas on how to create one with the Azure API, and we thought it would be worthwhile to turn the prototype into a product. So I formed a team with Yanxin (Jeff) Luo (Embedded Software Engineer), Yankun (Alex) Meng (Duke CS & ECE), and Vivian Yang (Duke Fuqua Quantitative Management) at the Duke Generative AI Hackathon to make it happen. Our innovative approach and dedication culminated in a significant achievement: “TalkwithMe” was honored as the winner of the Beginner Track at the Duke Generative AI Hackathon 2023, a testament to our team's hard work and the potential of our project.



Our product “TalkwithMe” is an innovative browser extension that revolutionizes language learning by enabling users to practice pronunciation with their favorite scripts, providing instant feedback. This AI-driven tool, developed to enrich the language-learning landscape, leverages Microsoft Azure’s advanced speech synthesis models. Our journey in creating “TalkwithMe” involved selecting the most effective AI services and integrating Text-to-Speech (TTS) and Automatic Pronunciation Assessment (APA) functionalities. The challenge was to ensure a seamless user experience, which necessitated a user-friendly interface and smooth backend-frontend integration. Our commitment to solving these challenges has been pivotal in bringing this unique language-learning solution to life.



Technical Implementation


The general pipeline for our Minimum-Viable-Product (MVP) is divided into two independent processes: Text-to-Speech (TTS) and Automatic Pronunciation Assessment (APA). Both are done with the help of Microsoft Azure.



1. Text-to-Speech (TTS)

The initial step in the technical implementation of text-to-speech involves obtaining user input text. This input is provided via a text box or as a .txt file. The user's input serves as the content that needs to be converted into synthetic speech.



Once the user's input is acquired, we convert it into an audio blob. This step is handled by the Microsoft Speech Synthesizer class (updated in September 2023), which takes the input text and returns an audio blob, a binary representation of the synthesized audio.



Within the Microsoft Speech Synthesizer class, we used specific configurations to fine-tune the text-to-speech process. These configurations are essential for customizing the synthesized speech to meet the user's requirements. The language setting is crucial and depends on the language being practiced; in this case, it is set to "en-US", ensuring that the generated speech is in US English. The choice of voice is also significant: we specify "Microsoft Server Speech Text to Speech Voice (en-US, JennyNeural)", which we found to be the most natural and realistic voice available.
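
For illustration, a minimal Python sketch of this configuration with the Azure Speech SDK might look like the following; the environment variable names and the synthesize_to_bytes helper are assumptions made for this write-up rather than our exact production code.

```python
# Minimal TTS sketch using the Azure Speech SDK (azure-cognitiveservices-speech).
# AZURE_SPEECH_KEY / AZURE_SPEECH_REGION are placeholders for real credentials.
import os
import azure.cognitiveservices.speech as speechsdk

def synthesize_to_bytes(text: str) -> bytes:
    speech_config = speechsdk.SpeechConfig(
        subscription=os.environ["AZURE_SPEECH_KEY"],
        region=os.environ["AZURE_SPEECH_REGION"])
    # Language and voice settings described above (short name for JennyNeural).
    speech_config.speech_synthesis_language = "en-US"
    speech_config.speech_synthesis_voice_name = "en-US-JennyNeural"

    # audio_config=None keeps the synthesized audio in memory instead of playing it.
    synthesizer = speechsdk.SpeechSynthesizer(speech_config=speech_config, audio_config=None)
    result = synthesizer.speak_text_async(text).get()
    if result.reason != speechsdk.ResultReason.SynthesizingAudioCompleted:
        raise RuntimeError(f"Speech synthesis failed: {result.reason}")
    return result.audio_data  # raw audio bytes, returned to the frontend as a blob
```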



After the text-to-speech conversion is successfully executed, the resulting audio blob is returned to the user interface, where an audio player element is created in JavaScript. The audio blob contains the synthesized speech, and users can play and stop it at their discretion.
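
On the backend side, a hedged sketch of how the audio bytes might be handed back to the browser from a Flask route follows; the /tts route name and form field are illustrative assumptions, and synthesize_to_bytes refers to the helper sketched above.

```python
# Hypothetical Flask route returning synthesized audio to the frontend.
from flask import Flask, Response, request

app = Flask(__name__)

@app.route("/tts", methods=["POST"])
def tts():
    text = request.form.get("text", "")
    audio_bytes = synthesize_to_bytes(text)  # helper sketched above
    # The browser can wrap this response in a Blob and attach it to an <audio> element.
    return Response(audio_bytes, mimetype="audio/wav")
```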



2. Automatic Pronunciation Assessment (APA)

This pipeline begins with user input in the form of audio, which is captured and processed using the Media Capture and Streams API in JavaScript. The main objective here is to assess the user's pronunciation based on the reference text they provide.

To initiate the pronunciation assessment pipeline, we create a WAV blob from the user's input audio. This allows us to efficiently work with the audio data. The audio data and reference text are then extracted from the user's request.



Once we have the necessary data, we proceed to assess the pronunciation. The audio data is read in manageable chunks; we chose 1024 bytes, a typical chunk size. These chunks are then sent to the Azure Cognitive Services Speech API, configured with Pronunciation Assessment settings (see the sketch after this list). The key configurations include:

  1. ReferenceText: The user-provided text is used as the reference for pronunciation assessment. It is a crucial element in evaluating the user's speech.
  2. Grading System: The grading system is set to the HundredMark scale, providing a floating-point value ranging from 0 to 100. This scale serves as a comprehensive measure of pronunciation quality.
  3. Granularity: Pronunciation assessment is performed at the Phoneme level. Phonemes are the fundamental sound units of a language; American English phonemes are represented using Microsoft's American English phoneme representation (SAPI 5.3). For instance, the word "hello" is represented as "h eh l ow," and the word "photosynthesis" as "f ow t ax s ih n th ax s ih s."
  4. EnableMiscue: This configuration is set to "True," so that words omitted or inserted relative to the reference text are also counted against the assessment.

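For illustration, a condensed Python sketch of these settings with the Azure Speech SDK is shown below; the credentials, the WAV file path, and the assess_pronunciation helper name are assumptions, and the sketch approximates rather than reproduces our production pipeline.

```python
# Pronunciation assessment sketch with the Azure Speech SDK.
import os
import azure.cognitiveservices.speech as speechsdk

def assess_pronunciation(wav_path: str, reference_text: str):
    speech_config = speechsdk.SpeechConfig(
        subscription=os.environ["AZURE_SPEECH_KEY"],
        region=os.environ["AZURE_SPEECH_REGION"])

    # Push the recorded audio into the recognizer via a stream.
    stream = speechsdk.audio.PushAudioInputStream()
    audio_config = speechsdk.audio.AudioConfig(stream=stream)
    recognizer = speechsdk.SpeechRecognizer(
        speech_config=speech_config, language="en-US", audio_config=audio_config)

    # Key configurations described above: reference text, HundredMark grading,
    # phoneme-level granularity, and miscue detection.
    pron_config = speechsdk.PronunciationAssessmentConfig(
        reference_text=reference_text,
        grading_system=speechsdk.PronunciationAssessmentGradingSystem.HundredMark,
        granularity=speechsdk.PronunciationAssessmentGranularity.Phoneme,
        enable_miscue=True)
    pron_config.apply_to(recognizer)

    # Feed the audio in 1024-byte chunks, as described above.
    with open(wav_path, "rb") as f:
        while chunk := f.read(1024):
            stream.write(chunk)
    stream.close()

    result = recognizer.recognize_once_async().get()
    return speechsdk.PronunciationAssessmentResult(result)
```
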
The result of the assessment is provided as JSON data, which includes both word-level and phoneme-level evaluations. The scoring process compares the phonemes spoken in the user's audio with the expected phonemes from the reference text and computes a confidence score for how well they match.



The four most significant features are extracted from the JSON results and persisted in CSV format (a small sketch follows the list below). These features are presented to the user, allowing them to track their pronunciation progress effectively. The four primary features include:

  1. Accuracy Score: This score quantifies the accuracy of the user's pronunciation, providing insight into how well they match the reference text.
  2. Fluency Score: The fluency score measures the smoothness and flow of spoken words and phrases, giving users valuable feedback on their fluency.
  3. Completeness Score: This score assesses how comprehensively the user pronounces words and ensures that no parts of the text are omitted.
  4. Pronunciation Score: The overall pronunciation score combines various aspects of pronunciation quality and fluency, offering a comprehensive evaluation of the user's spoken language.
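
As a small sketch of the persistence step, the following appends these four scores to a CSV file; the scores.csv filename, the column layout, and the persist_scores helper are assumptions for illustration.

```python
# Append the four headline scores to a CSV file for progress tracking.
import csv
import os
from datetime import datetime

def persist_scores(pron_result, csv_path: str = "scores.csv") -> None:
    new_file = not os.path.exists(csv_path)
    with open(csv_path, "a", newline="") as f:
        writer = csv.writer(f)
        if new_file:
            # Write the header row the first time the file is created.
            writer.writerow(["timestamp", "accuracy", "fluency", "completeness", "pronunciation"])
        writer.writerow([
            datetime.now().isoformat(),
            pron_result.accuracy_score,
            pron_result.fluency_score,
            pron_result.completeness_score,
            pron_result.pronunciation_score,
        ])
```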



To see these features in action and understand how they enhance the language learning experience, you can view our product walkthrough video.




Challenges and Solutions


One of the initial hurdles was conducting a comprehensive literature review and industry research to identify the best available tools for delivering fast and accurate results. We looked into recent NLP papers on TTS from renowned universities and well-developed AI services from OpenAI, AWS, and Google, and we eventually decided to use Microsoft Azure for our task due to the naturalness of its speech synthesis models and its easy-to-configure parameters.



A significant challenge we overcame was figuring out how to seamlessly combine the Text-to-Speech (TTS) and Automatic Pronunciation Assessment (APA) pipelines to create a user experience that was both smooth and comfortable. This involved careful configuration and integration of various components to ensure that users could easily transition from generating synthetic speech to assessing their pronunciation with minimal friction.



Other challenges we encountered were programming and implementation tasks, such as persisting user data to CSV format effectively, sending requests between components, and combining the JavaScript frontend with the Python backend without bugs. This involved carefully coordinating the two components to ensure synchronized and efficient operation (a hypothetical sketch of this hand-off follows below). In terms of user interface (UI) design, we needed to create an interface that was not only intuitive and easy to use but also visually appealing. We opted for a Duke-themed color scheme that struck a balance between aesthetics and user-friendliness. These design choices enhanced the overall user experience and usability of our system.
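
As one hypothetical illustration of that hand-off, the Flask endpoint below receives the recorded audio and reference text from the JavaScript frontend, runs the assessment, persists the scores, and returns JSON; the /assess route, the form field names, and the reuse of the assess_pronunciation and persist_scores helpers sketched earlier are all assumptions.

```python
# Hypothetical Flask endpoint tying the JavaScript frontend to the Python backend.
import tempfile
from flask import Flask, jsonify, request

app = Flask(__name__)

@app.route("/assess", methods=["POST"])
def assess():
    reference_text = request.form["reference_text"]  # text the user read aloud
    with tempfile.NamedTemporaryFile(suffix=".wav", delete=False) as tmp:
        request.files["audio"].save(tmp.name)  # WAV blob recorded in the browser
        result = assess_pronunciation(tmp.name, reference_text)  # sketch from the APA section
    persist_scores(result)  # sketch from the CSV section
    return jsonify({
        "accuracy": result.accuracy_score,
        "fluency": result.fluency_score,
        "completeness": result.completeness_score,
        "pronunciation": result.pronunciation_score,
    })
```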



Accomplishments


In our technical achievements for the demo, we successfully integrated various APIs and SDKs provided by Microsoft Azure and deployed them using Flask/Python3. This integration allowed us to create a seamless user experience, combining Text-to-Speech (TTS) and Automatic Pronunciation Assessment (APA) functionalities.



We successfully fine-tuned the parameters and configurations of our models to tailor them to our specific requirements. For the TTS component, we paid careful attention to language settings, ensuring that the generated speech matched the user's preferences. In the Pronunciation Assessment Model, we made key configurations such as granularity and miscue to improve the accuracy of assessments. Additionally, we developed a visually appealing and fully interactive user interface using vanilla JavaScript, HTML, and CSS. This interface facilitated a user-friendly experience, allowing users to interact easily with the system.



We provided a practical and intuitive working MVP designed for people who wish to practice their speaking skills extensively. Furthermore, our efforts were recognized at the hackathon, where we received excellent feedback from our peers. This positive reception is a testament to the potential of our system and its effectiveness in addressing the needs of speakers looking to enhance their pronunciation and speaking abilities.



Commercial Viability


Referencing and analyzing the charts “Size of Global E-learning market from 2019 to 2026”, “Revenues generated by Duolingo from 2019 to 2022”, and “Language learning apps awareness and usage in the U.S. from 2019 to 2022”, we drew the following conclusions:

  1. Strong revenue growth among market players: In 2022, Duolingo Inc. experienced significant growth in revenues from subscriptions to its premium plan, Duolingo Plus. Subscription revenue reached 273.5 million U.S. dollars, more than double the subscription revenues from 2019. This indicates a growing interest in premium language learning services, while the diversified revenue stream suggests that the company is not solely reliant on one source of income. Our product “TalkwithMe” will also be able to capture part of this market.






  2. Prospective market growth: In a survey conducted in the United States during the third quarter of 2022, 77 percent of respondents were aware of mobile language learning apps, suggesting a high level of awareness and interest in such apps. Additionally, 22 percent of respondents reported using mobile language learning apps. With users demonstrating growing interest and demand, the global e-learning market is expected to reach nearly 400 billion U.S. dollars by 2026, significant growth compared to its size of almost 200 billion U.S. dollars in 2019. The learning management system (LMS) market alone generated around 18 billion U.S. dollars in 2019. Our Chrome extension business model offers a more convenient and accessible way to reach these learners.








  3. Competitive landscape with unique features: Though Duolingo and Babbel have a stronger presence in the online language learning market, we are targeting a unique niche of new language learners with customized, immersive learning needs aimed at improving communication and speech delivery skills. In summary, the online language learning market is experiencing robust growth, with Duolingo a prominent player and a large proportion of potential users still untapped. The increasing demand for premium language learning services, the diversification of revenue sources, and the expanding global e-learning market all indicate a positive outlook for this industry. Additionally, the popularity of mobile language learning apps suggests that learners are increasingly turning to digital platforms to acquire language skills.


Lessons We Learned


Our journey with “TalkwithMe” was an enlightening experience. We learned the importance of comprehensive market research, the need for a pedagogically sound approach, and the intricacies of combining different AI technologies. These insights have enriched our understanding of creating impactful educational tools.



Future Steps


Our future product development strategy encompasses the following key objectives, designed to enhance user experience and engagement:

  1. Customized Learning Enhancement: We aim to facilitate personalized learning experiences by enabling users to select or upload distinct text content, with a focus on refining pronunciation. This feature empowers individuals to address their unique linguistic challenges and improve their language proficiency effectively.
  2. Innovative Translation with "Anti-Vanishing Mode": Our product will introduce a cutting-edge "Anti-vanishing mode" for translation, providing users with real-time, context-aware translation capabilities. This ensures that the selected text is not only translated but also comprehensively explained, bridging language gaps and enhancing linguistic comprehension.
  3. Collaborative Learning Platform: To foster efficient and collaborative user study, our platform will introduce a shared learning space. Leveraging AI capabilities, the platform will recommend learning methods based on collective user learning curves and individual assessment results. This collective intelligence approach will empower users to optimize their learning strategies.
  4. Digital Twins for Personalized Interaction: We are pioneering the development of "digital twins" that will create AI representations tailored to individual users. These digital figures will engage users in face-to-face interactions, offering conversational and instructional support, further enhancing the immersive and personalized nature of our plan.

If you're as passionate about language mastery as we are, discover more about AI and language learning with Microsoft Learn Modules/Docs that inspired us. We'd love to dive deeper into a conversation about our product. Join us in shaping the future of language education!
