
Google develops audio-visual AI speech separation model

 

Google researchers have developed a method of using computer vision, pattern recognition, and speech processing to separate the speech of a single speaker from other speakers and background noise.

Using AI to mimic the “cocktail party effect,” in which people effectively “mute” other voices and sounds to focus on a particular speaker or source, could have a wide range of applications, including video speech enhancement and recognition, video conferencing, and hearing aids, the researchers write in a blog post.

In a research paper titled “Looking to Listen at the Cocktail Party: A Speaker-Independent Audio-Visual Model for Speech Separation,” the team of researchers describes using visual input not only to significantly improve speech separation quality, but also to associate the separated audio signals with the speakers visible in the video. The team introduced a new dataset, AVSpeech, made up of thousands of hours of video segments from the internet, to train the audio-visual model, and achieved better results than leading audio-only speech separation technologies. Although the new method is speaker-independent, it also produced better results than speaker-dependent audio-visual methods, which require a separate model for each speaker.

“Our method works on ordinary videos with a single audio track, and all that is required from the user is to select the face of the person in the video they want to hear, or to have such a person be selected algorithmically based on context,” Google Research Software Engineers Inbar Mosseri and Oran Lang explain.

The AI system uses face recognition to identify speakers, and then a dilated convolutional neural network to learn a visual feature, which it compares with the audio input to correlate each separated speech signal with its speaker. The system was found to be only marginally less effective at separating female speakers, despite the inherent challenges that female voices are reported to present.
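
For readers unfamiliar with the term, a dilated convolution spaces out the inputs each filter looks at, so a network can cover a longer stretch of a signal without adding weights. The sketch below is a generic illustration of that idea in plain Python, not Google's implementation. The function names, kernel values, and dilation rates are hypothetical, and the article does not specify the model's actual layer configuration.

```python
def dilated_conv1d(signal, kernel, dilation=1):
    """Valid-mode 1-D dilated convolution (cross-correlation form).

    Illustrative only: `signal`, `kernel`, and `dilation` are hypothetical
    names, not taken from the research described in the article.
    """
    if dilation < 1:
        raise ValueError("dilation must be >= 1")
    # Distance between the first and last input sample one output touches.
    span = (len(kernel) - 1) * dilation
    out = []
    for start in range(len(signal) - span):
        total = 0.0
        for k, weight in enumerate(kernel):
            # Taps are `dilation` samples apart instead of adjacent.
            total += weight * signal[start + k * dilation]
        out.append(total)
    return out


def receptive_field(kernel_size, dilations):
    """Input samples seen by one output after stacking dilated layers."""
    field = 1
    for d in dilations:
        field += (kernel_size - 1) * d
    return field


if __name__ == "__main__":
    samples = [1, 2, 3, 4, 5, 6, 7, 8]
    print(dilated_conv1d(samples, [1, 1, 1], dilation=1))
    print(dilated_conv1d(samples, [1, 1, 1], dilation=2))
    print(receptive_field(3, [1, 2, 4]))
```

With a three-tap kernel, stacking layers with dilation rates 1, 2, and 4 lets a single output depend on 15 input samples, whereas three undilated layers would cover only 7.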

Shanghai’s busy subway system is planning to deploy facial and speech recognition technologies developed by Alibaba, which will reportedly enable accurate communication with a smart device five meters away, even in noisy areas.



