
Google develops audio-visual AI speech separation model


Google researchers have developed a method of using computer vision, pattern recognition, and speech processing to separate the speech of a single speaker from other speakers and background noise.

Using AI to mimic the “cocktail party effect,” in which people effectively “mute” other voices and sounds to focus on a particular speaker or source, could have a wide range of applications, including video speech enhancement and recognition, video conferencing, and hearing aids, the researchers write in a blog post.

In a research paper titled “Looking to Listen at the Cocktail Party: A Speaker-Independent Audio-Visual Model for Speech Separation,” the team describes using visual input not only to significantly improve the quality of speech separation, but also to associate the separated audio signals with specific speakers visible in the video. The team introduced a new dataset, AVSpeech, made up of thousands of hours of video segments from the internet, to train the audio-visual model, and achieved better results than leading audio-only speech separation technologies. The new method is speaker-independent, yet it also outperformed speaker-dependent audio-visual methods, which require a separate model for each speaker.

“Our method works on ordinary videos with a single audio track, and all that is required from the user is to select the face of the person in the video they want to hear, or to have such a person be selected algorithmically based on context,” Google Research Software Engineers Inbar Mosseri and Oren Lang explain.

The AI system uses face recognition to identify speakers, then uses a dilated convolutional neural network to learn a feature, which it compares with the audio input to correlate each speaker with the corresponding separated speech. The system was found to be only marginally less effective at separating female speakers, despite the inherent challenges that female voices are reported to present.
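For readers unfamiliar with the term, a dilated convolution spaces its filter taps apart so that each output covers a wider stretch of the input without adding weights. The sketch below is a generic, pure-Python illustration of that idea only; it is not Google's model, and the function names, kernel, and dilation rates are hypothetical values chosen for the example.

```python
def dilated_conv1d(signal, kernel, dilation=1):
    """Valid-mode 1-D dilated convolution (cross-correlation form, no kernel flip)."""
    span = (len(kernel) - 1) * dilation  # distance between the first and last tap
    out = []
    for start in range(len(signal) - span):
        total = 0.0
        for k, weight in enumerate(kernel):
            # Taps are spaced `dilation` samples apart.
            total += weight * signal[start + k * dilation]
        out.append(total)
    return out


def receptive_field(kernel_size, dilations):
    """Input samples seen by one output after stacking layers with these dilations."""
    return 1 + sum((kernel_size - 1) * d for d in dilations)


# Hypothetical toy input: the samples 0.0 .. 9.0 and a 3-tap summing kernel.
signal = [float(i) for i in range(10)]
kernel = [1.0, 1.0, 1.0]

dense = dilated_conv1d(signal, kernel, dilation=1)    # taps at offsets 0, 1, 2
dilated = dilated_conv1d(signal, kernel, dilation=2)  # taps at offsets 0, 2, 4

print(dense[0], dilated[0])
print(receptive_field(3, [1, 2, 4]))
```

With the same three weights, the dilated version reaches across five input samples instead of three, and stacking layers with growing dilation rates widens the receptive field quickly.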

Shanghai’s busy subway system is planning to deploy facial and speech recognition technologies developed by Alibaba, which will reportedly enable accurate communication with a smart device five meters away, even in noisy areas.
