Researchers find major demographic differences in speech recognition accuracy
Research indicates that speech recognition technology from the world's leading consumer technology brands performs with markedly different accuracy for different demographic groups, leading some to call the technology "biased" against black people.
A team of academics from Stanford University tested automated speech recognition (ASR) systems from Amazon, Apple, Google, IBM, and Microsoft for the paper "Racial disparities in automated speech recognition," published in the Proceedings of the National Academy of Sciences. They found that the systems misidentified roughly 19 percent of words uttered by white speakers, but had an average word error rate (WER) of 35 percent for black speakers. Audio snippets from white speakers were considered incomprehensible 2 percent of the time, while for black speakers the systems could not produce a usable transcription for 20 percent of snippets.
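WER, the metric at the center of the study, is conventionally computed as the word-level edit distance between a reference transcript and the system's output, divided by the number of reference words. A minimal sketch (assuming simple whitespace tokenization, not the paper's exact scoring pipeline) might look like this:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level Levenshtein distance / reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    # Dynamic-programming edit distance over words
    # (substitutions, insertions, deletions each cost 1).
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution
    return d[len(ref)][len(hyp)] / len(ref)

# One dropped word out of a six-word reference gives a WER of about 0.167.
print(wer("the cat sat on the mat", "the cat sat on mat"))
```

A WER of 0.35, as measured for black speakers, thus means roughly one in three reference words required a correction.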
To analyze WER across linguistic groups, the researchers drew on the Corpus of Regional African American Language (CORAAL), compiled in three U.S. communities, and on samples from the Voices of California (VOC) dataset. Human experts transcribed interview snippets 5 to 50 seconds long, and their transcriptions were compared with the output of the above-mentioned tech giants' machine-learning systems.
The researchers propose increasing the diversity of training datasets, and including African American Vernacular English, to reduce performance differences.
Apple had the highest error rates on both datasets, with a WER gap between black and white speakers of more than 20 percentage points. Google and Microsoft had the smallest gaps, though both were still over 10 points. Amazon's WER for black speakers matched Google's, but its system was slightly more accurate for white speakers. Microsoft's was the only system with a WER below 30 percent for black speakers.
The findings also offer some geographic insight: speech collected from black speakers in a rural community (Princeville, North Carolina) and a heavily urban one (Washington, D.C.) had higher error rates than speech collected in Rochester, New York.
The researchers explored two possible explanations for the disparity: a shortfall in the lexicon and grammar of the language models, such as black speakers using words absent from the ASR systems' vocabularies, and a performance gap in the systems' acoustic models.
The lexical explanation largely fell short, however: words spoken by white and black speakers were found in the vocabulary of Google's ASR 98.6 percent and 98.7 percent of the time, respectively. And when phrases with identical text were analyzed, the ASR technology still made more errors on samples spoken by black speakers, indicating that differences in pronunciation and prosody, such as rhythm, pitch, syllable stress, vowel duration, and lenition, may be behind the performance gap.
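A vocabulary-coverage check of the kind described above can be sketched as follows; the toy lexicon and transcript here are hypothetical illustrations, not data from the study:

```python
def in_vocabulary_rate(transcript: str, lexicon: set) -> float:
    """Fraction of transcript words present in an ASR system's lexicon."""
    words = transcript.lower().split()
    if not words:
        return 0.0
    return sum(w in lexicon for w in words) / len(words)

# Hypothetical toy lexicon and utterance for illustration only.
toy_lexicon = {"i'm", "finna", "go", "to", "the", "store"}
print(in_vocabulary_rate("I'm finna go to the store", toy_lexicon))  # 1.0
```

A near-identical coverage rate for both groups, as the researchers found for Google's system, rules out missing vocabulary as the main driver and points instead at the acoustic model.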
Bias has been a significant issue in facial biometrics as well, where NIST testing has shown that demographic differences in accuracy vary widely from vendor to vendor.
R7 Speech Sciences co-founder Delip Rao explained in a 2018 blog post that physiological differences between male and female voices make it difficult to train AI speech recognition systems to perform as accurately on speech from women as on speech from men.
Voice and speech recognition are expected to make up a $26.8 billion market by 2025.