Speechmatics pushes forward recognition of accented English

With the rise of smart speakers and driver-assist modes in recent years, speech recognition has gone from a convenience to a necessity, yet not everyone's voice is recognized equally well. Speechmatics claims to have the most comprehensive and accurate model available for speech outside the most common American accents, outperforming Amazon, Google, and others.

The company said that a 2019 Stanford study titled “Racial Disparities on Speech Recognition” prompted it to look into the accuracy problem. According to the study, speech engines from Amazon, Apple, Google, IBM, and Microsoft showed “significant racial discrepancies, with an average word error rate (WER) of 0.35 for black speakers compared to 0.19 for white speakers.” That is not good! A lack of variety in the datasets used to train these systems may be part of the reason for the difference. After all, if the data contains only a few black speakers, the model will not learn their speech patterns as well. The same can be said for speakers with other accents, dialects, and characteristics. America (let alone the United Kingdom) is full of accents, and any company claiming to provide services for “everyone” should be aware of this.
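For readers unfamiliar with the metric, WER is conventionally the number of word substitutions, deletions, and insertions needed to turn the system's transcript into the reference transcript, divided by the number of words in the reference. The sketch below is a minimal illustration of that standard definition, not the scoring code used by the Stanford study or any vendor; the function name and the example sentences are made up, and real evaluations usually apply more text normalization than lowercasing.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    # Simplifying assumption: lowercase and split on whitespace only.
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # turning i reference words into nothing takes i deletions
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion: reference word missing from hypothesis
                curr[j - 1] + 1,     # insertion: extra word in hypothesis
                prev[j - 1] + cost,  # substitution, or a match when cost is 0
            ))
        prev = curr
    return prev[-1] / len(ref)


# Made-up example: one of six reference words is dropped by the recognizer.
print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))
```

On this scale, a WER of 0.35 means roughly one error for every three words spoken, while 0.19 is closer to one in five.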

In any case, UK-based Speechmatics made accuracy in transcribing accented English a top priority for its latest model, and it claims to have blown the competition out of the water. Based on the same data sets used in the Stanford study (but using the newest versions of the speech software), “Speechmatics recorded an overall accuracy of 82.8 percent for African American voices compared to Google (68.7%) and Amazon (68.6%),” the company said in a press release.
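To put those percentages on the same footing as the WER figures from the study, the short sketch below converts them into error rates and a relative error reduction. It rests on an assumption that the press release does not spell out: that "accuracy" here means 100 percent minus WER. The function names are illustrative only.

```python
def error_rate(accuracy_percent: float) -> float:
    # Assumption (not stated in the source): accuracy = 100% - WER.
    return (100.0 - accuracy_percent) / 100.0


def relative_error_reduction(baseline_accuracy: float, new_accuracy: float) -> float:
    """Fraction of the baseline's errors that the new system avoids."""
    baseline = error_rate(baseline_accuracy)
    return (baseline - error_rate(new_accuracy)) / baseline


# Accuracy figures as quoted in the press release.
reported = {"Speechmatics": 82.8, "Google": 68.7, "Amazon": 68.6}

for vendor in ("Google", "Amazon"):
    reduction = relative_error_reduction(reported[vendor], reported["Speechmatics"])
    print(f"{vendor}: {reduction:.0%} fewer errors under this assumption")
```

Read this way, the quoted figures would amount to a little over 45 percent fewer errors than either competitor, but that reading holds only if the accuracy-to-WER assumption does.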

The company attributes this result to a novel approach to building a speech recognition model. Traditionally, a machine learning system is given labeled data: for speech, an audio file accompanied by metadata or a text file containing what is being said, usually transcribed and reviewed by humans. For a cat detection algorithm, you would need photographs along with data such as which ones contain cats, where the cat is in each picture, and so on. This is known as supervised learning, in which a model learns correlations between two types of provided data. Speechmatics used self-supervised learning, an approach that has gained popularity in recent years as datasets, learning efficiency, and processing power have grown. In addition to labeled data, it uses raw, unlabeled data, and a lot of it, to develop its own “knowledge” of speech with significantly less direction.
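The difference between the two approaches comes down to where the training targets come from. The toy sketch below shows one common self-supervised setup, masked prediction, in which a piece of the raw input is hidden and the model is asked to reconstruct it, so no human transcript is needed. This is an illustration of the general idea only; the article does not describe Speechmatics' actual method, and the function name, the mask token, and the numbers standing in for audio frames are all hypothetical.

```python
import random


def make_masked_examples(frames, mask_prob=0.3, mask_token=None, seed=0):
    """Build (corrupted_input, masked_index, target) triples from unlabeled data.

    The target is simply the value that was hidden, so the "label" comes
    from the data itself rather than from a human annotator.
    """
    rng = random.Random(seed)
    examples = []
    for index, frame in enumerate(frames):
        if rng.random() < mask_prob:
            corrupted = list(frames)        # copy so the raw data is untouched
            corrupted[index] = mask_token   # hide one frame from the model
            examples.append((corrupted, index, frame))
    return examples


# Supervised learning needs pairs prepared by people (hypothetical file name):
supervised_example = ("clip_001.wav", "turn left at the next junction")

# Self-supervised learning derives its training pairs from raw data alone.
# These numbers are toy stand-ins for audio feature frames.
unlabeled_frames = [0.12, 0.40, 0.38, 0.05, 0.91, 0.77]
for corrupted, index, target in make_masked_examples(unlabeled_frames, mask_prob=0.5):
    print(f"input={corrupted} -> predict frame {index} = {target}")
```

Because such examples can be generated automatically, the amount of usable training data is limited by how much raw audio is available rather than by how much of it humans have transcribed.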