New speech datasets, software target greater inclusion
The University of Illinois Urbana-Champaign (UIUC) has unveiled the Speech Accessibility Project, an initiative to make voice biometrics and speech analysis systems more inclusive of the diverse speech patterns of people with disabilities.
According to a blog post on the UIUC website, the project will be supported by tech giants Amazon, Apple, Google, Meta, and Microsoft, alongside various nonprofits.
The Speech Accessibility Project will focus on developing speech recognition and biometric systems capable of interpreting speech patterns associated with disabilities like Lou Gehrig’s disease (ALS), Parkinson’s disease, cerebral palsy, and Down syndrome.
To this end, the initiative will see the collection of speech samples from paid volunteers representing a diversity of speech patterns.
The samples will then be compiled into a private, de-identified dataset that can be used to train machine learning models to understand various speech patterns better.
The Speech Accessibility Project will initially focus on American English. It will be led by Mark Hasegawa-Johnson, a UIUC professor of electrical and computer engineering, with the support of Heejin Kim, a research professor in linguistics, and Clarion Mendes, a clinical professor in speech and hearing science and a speech-language pathologist.
The initiative will also involve several staff members from UIUC’s Beckman Institute for Advanced Science and Technology, as well as the community-based organizations the Davis Phinney Foundation and Team Gleason, which will assist with participant recruitment, user testing, and feedback.
OpenAI releases multilingual speech recognition system
OpenAI has made its speech recognition software Whisper available as open source models and inference code.
Trained on 680,000 hours of multilingual and multitask supervised data collected from the web, Whisper “approaches human level robustness and accuracy” on English speech recognition, according to OpenAI.
“We show that the use of such a large and diverse dataset leads to improved robustness to accents, background noise and technical language,” the company wrote on a web page dedicated to Whisper.
“Moreover, it enables transcription in multiple languages, as well as translation from those languages into English.”
According to the company, other existing approaches frequently use smaller, more closely paired audio-text training datasets or broad but unsupervised audio pretraining.
“Because Whisper was trained on a large and diverse dataset and was not fine-tuned to any specific one, it does not beat models that specialize in LibriSpeech performance, a famously competitive benchmark in speech recognition,” OpenAI explains.
“However, when we measure Whisper’s zero-shot performance across many diverse datasets, we find it is much more robust and makes 50 percent fewer errors than those models.”
Additionally, the company said roughly a third of Whisper’s audio dataset is non-English. The program is either given the task of transcribing in the original language or translating to English.
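As a rough sketch of how those two tasks are exposed in the open-sourced `whisper` Python package, the snippet below wraps them in a small helper. The file name, helper function, and model size are illustrative assumptions, not part of OpenAI’s announcement:

```python
# Sketch of calling the open-sourced Whisper package (pip install openai-whisper).
# "interview.mp3" and the helper name are illustrative, not from OpenAI's docs.

def transcribe_file(path: str, task: str = "transcribe", model_size: str = "base") -> str:
    """Return the text of an audio file; task="translate" yields English output."""
    import whisper  # deferred import so the sketch can be read without the package installed
    model = whisper.load_model(model_size)  # sizes range from tiny to large
    result = model.transcribe(path, task=task)  # task: "transcribe" or "translate"
    return result["text"]

# Usage (requires ffmpeg and a one-time model download):
#   transcribe_file("interview.mp3")                    # transcription in the original language
#   transcribe_file("interview.mp3", task="translate")  # translation into English
```

The `task` switch mirrors the two training objectives described above: the same model either transcribes speech in its source language or translates it into English.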
“We find this approach is particularly effective at learning speech-to-text translation and outperforms the supervised SOTA on CoVoST2 to English translation zero-shot.”
Because the system was trained on such a large and diverse dataset, however, Whisper does not always perform at its best when predicting text: it sometimes hallucinates words that were never spoken but were ‘learned’ from the training data.
Just like any other AI system, the software also has limitations when it comes to speakers of languages that are not well-represented in the training data.
Despite these limitations, a recent analysis of Whisper by VentureBeat suggests the speech recognition software represents a potential ‘return to openness’ for OpenAI, which had been harshly criticized by the community for not open-sourcing its GPT-3 and DALL-E models.
In particular, Whisper is available in several model sizes that trade accuracy against speed, allowing it to run on a wide range of hardware, from mobile devices and laptops to desktop workstations and cloud servers.
The open source community already uses the voice tool, with journalist Peter Sterne and GitHub engineer Christina Warren recently unveiling a joint project aimed at creating a transcription app for journalists.