Google develops audio-visual AI speech separation model
Google researchers have developed a method of using computer vision, pattern recognition, and speech processing to separate the speech of a single speaker from other speakers and background noise.
Using AI to mimic the “cocktail party effect,” in which people effectively “mute” other voices and sounds to focus on a particular speaker or source, could have a wide range of applications, including video speech enhancement and recognition, video conferencing, and hearing aids, the researchers write in a blog post.
In a research paper titled “Looking to Listen at the Cocktail Party: A Speaker-Independent Audio-Visual Model for Speech Separation,” the team describes using visual input not only to significantly improve the speech separation quality of audio, but also to associate the separated audio signals with specific speakers visible in the video. The team introduced a new AVSpeech dataset, made up of thousands of hours of video segments from the internet, to train the audio-visual model, and achieved better results than leading audio-only speech separation technologies. Despite being speaker-independent, the new method also outperformed speaker-dependent audio-visual methods, which require a separate model for each speaker.
“Our method works on ordinary videos with a single audio track, and all that is required from the user is to select the face of the person in the video they want to hear, or to have such a person be selected algorithmically based on context,” Google Research Software Engineers Inbar Mosseri and Oren Lang explain.
The AI system uses face recognition to identify speakers, then a dilated convolutional neural network to learn visual features, which it combines with the audio input to correlate each speaker with the corresponding separated speech. The system was found to be only marginally less effective at separating female speakers, despite the inherent challenges that female voices are reported to present.
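The fusion idea described above can be illustrated with a minimal sketch. This is not Google's implementation; it is a hypothetical toy in NumPy that applies a dilated 1-D convolution to an audio feature vector, additively fuses it with a per-speaker face embedding, and emits a soft spectrogram mask for that speaker. All function names and the additive-fusion choice are assumptions for illustration only.

```python
import numpy as np

def dilated_conv1d(x, kernel, dilation):
    """Causal 1-D convolution with a dilation factor (hypothetical helper).
    Dilation spaces out the kernel taps, widening the temporal receptive
    field without adding parameters."""
    k = len(kernel)
    pad = (k - 1) * dilation
    xp = np.pad(x, (pad, 0))  # left-pad so output length matches input
    return np.array([
        sum(kernel[j] * xp[i + pad - j * dilation] for j in range(k))
        for i in range(len(x))
    ])

def separation_mask(audio_feat, face_embedding, kernel, dilation=2):
    """Toy audio-visual fusion: dilated-conv audio features plus a scalar
    face embedding for the chosen speaker, squashed into a [0, 1] mask
    that would weight that speaker's spectrogram bins."""
    conv = dilated_conv1d(audio_feat, kernel, dilation)
    fused = conv + face_embedding          # naive additive fusion (assumption)
    return 1.0 / (1.0 + np.exp(-fused))    # sigmoid keeps the mask in [0, 1]

mask = separation_mask(np.ones(10), 0.5, np.array([0.2, 0.3]), dilation=2)
```

In the real model the mask is predicted per time-frequency bin and multiplied with the noisy spectrogram to recover the selected speaker's audio; the sketch keeps only the one-dimensional skeleton of that idea.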