How Accurate is Pronunciation Assessment?

Pronunciation assessment APIs usually offer pronunciation scores at phoneme level, syllable level, word level, and sentence level. Yet, how do you know if the scores are accurate?

We compare the pronunciation scores the algorithm predicts on a test set with the gold standard (human labels). The closer the two sets of scores are, the more accurate the algorithm is.

To make the metric(s) representative and useful, we need to think carefully about 1) the test set and 2) the performance metrics.


The test set is usually designed by an AI product manager, who should ensure that it 1) reflects real user scenarios, 2) has good data variety, and 3) covers a wide range of use cases. For example, SpeechSuper's API test sets consist of masked audio recordings from language learners sampled from our real user base. They usually cover a wide spectrum of phonetic combinations in a specific language, and they include data recorded both in quiet environments and with background noise.


I think the most important decision in a machine learning project is which metric(s) to optimize iteratively. In pronunciation assessment, the Pearson correlation coefficient is widely accepted as a reliable measure of how strongly the algorithm's scores and the human labels correlate linearly. It ranges from -1 to 1, where 1 means a perfect positive linear relationship, 0 means no linear relationship, and -1 means a perfect negative one.
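To make the metric concrete, here is a minimal Python sketch of the standard Pearson formula applied to machine and human scores. The function name `pearson_r` and the sample scores are made up for illustration; they are not SpeechSuper data or part of any SpeechSuper API.

```python
import math


def pearson_r(predicted, human):
    """Pearson correlation between two equal-length lists of scores."""
    if len(predicted) != len(human):
        raise ValueError("both score lists must have the same length")
    n = len(predicted)
    if n < 2:
        raise ValueError("at least two score pairs are required")

    mean_p = sum(predicted) / n
    mean_h = sum(human) / n

    # Sum of co-deviations and sums of squared deviations from the means.
    cov = sum((p - mean_p) * (h - mean_h) for p, h in zip(predicted, human))
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    var_h = sum((h - mean_h) ** 2 for h in human)

    if var_p == 0 or var_h == 0:
        raise ValueError("correlation is undefined when one list is constant")
    return cov / math.sqrt(var_p * var_h)


# Hypothetical scores for five utterances (illustration only).
machine_scores = [85, 62, 90, 40, 75]
human_scores = [80, 60, 95, 45, 70]

r = pearson_r(machine_scores, human_scores)
print(f"Pearson's r = {r:.3f}")
```

Note that Pearson's r measures linear agreement only. An algorithm whose scores sit consistently 10 points above the human labels would still reach a value of 1, so it is worth looking at the raw score differences as well.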

In the image below, ρ stands for the Pearson correlation coefficient (Pearson's r). The good scenario is a ρ value close to 1, as shown in the bottom-left chart.

By Kiatdd - Own work, CC BY-SA 3.0,

State-of-the-art pronunciation assessment can achieve a Pearson correlation coefficient of ~0.9 at the word level. Usually, the finer the granularity (for example, phoneme-level pronunciation scores), the less precise the algorithm is, because the acoustic features are transient and hard to capture. At the phoneme level, Pearson's r typically drops to ~0.8.
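Because accuracy differs by granularity, it helps to report one correlation per scoring level rather than a single pooled number. The sketch below shows one way to do that. The record layout `(level, machine_score, human_score)`, the function names, and all the numbers are hypothetical and only illustrate the bookkeeping.

```python
import math
from collections import defaultdict


def pearson_r(xs, ys):
    """Pearson correlation between two equal-length lists of scores."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def pearson_by_level(records):
    """Group (level, machine, human) records by level and correlate each group."""
    grouped = defaultdict(lambda: ([], []))
    for level, machine, human in records:
        grouped[level][0].append(machine)
        grouped[level][1].append(human)
    return {level: pearson_r(m, h) for level, (m, h) in grouped.items()}


# Hypothetical test-set records (illustration only, not real evaluation data).
records = [
    ("word", 90, 88), ("word", 55, 60), ("word", 72, 70), ("word", 30, 35),
    ("phoneme", 80, 70), ("phoneme", 40, 55), ("phoneme", 65, 60), ("phoneme", 95, 85),
]

for level, value in pearson_by_level(records).items():
    print(f"{level}: Pearson's r = {value:.3f}")
```

Keeping the levels separate avoids mixing word-level and phoneme-level scores in one correlation, which could hide a weak result at the finer granularity.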

At SpeechSuper, we develop AI-based speech technologies that analyze language learners' speech, including pronunciation, fluency, completeness, and more. If you’re interested, please contact us through our website.


Qiusi is a product manager in China’s EdTech industry focusing on language learning and AI. She enjoys writing stories. You can reach her at

SpeechSuper provides cutting-edge AI speech assessment (a.k.a. pronunciation assessment or pronunciation scoring) APIs for language learning products. Its comprehensive feedback covers pronunciation score, fluency, completeness, rhythm, stress, liaison, etc. Supported languages include English, Mandarin Chinese, French, German, Korean, Japanese, Russian, Spanish, and more.

*Prior written consent is needed for any form of republication, modification, repost, or distribution of the contents.

