Google develops audio-visual AI speech separation model
Google researchers have developed a method of using computer vision, pattern recognition, and speech processing to separate the speech of a single speaker from other speakers and background noise.
Using AI to mimic the “cocktail party effect,” in which people effectively “mute” other voices and sounds to focus on a particular speaker or source, could have a wide range of applications, including video speech enhancement and recognition, video conferencing, and hearing aids, the researchers write in a blog post.
In a research paper titled “Looking to Listen at the Cocktail Party: A Speaker-Independent Audio-Visual Model for Speech Separation,” the team describes using visual input not only to significantly improve the speech separation quality of audio, but also to associate the separated audio signals with specific speakers visible in the video. The team introduced a new AVSpeech dataset, made up of thousands of hours of video segments from the internet, to train the audio-visual model, and achieved better results than leading audio-only speech separation technologies. Despite being speaker-independent, the new method also outperformed speaker-dependent audio-visual methods, which require a separate model for each speaker.
“Our method works on ordinary videos with a single audio track, and all that is required from the user is to select the face of the person in the video they want to hear, or to have such a person be selected algorithmically based on context,” Google Research Software Engineers Inbar Mosseri and Oren Lang explain.
The AI system uses face recognition to identify speakers, then a dilated convolutional neural network to learn visual features, which it combines with the audio input to correlate each speaker with the corresponding separated speech. The system was found to be only marginally less effective at separating female speakers, despite the inherent challenges that female voices are reported to present.
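The fusion idea described above can be illustrated with a minimal sketch. This is not Google's implementation; it is a hypothetical toy in NumPy that applies a dilated 1-D convolution to an audio feature vector, additively fuses it with a per-speaker face embedding, and emits a soft spectrogram mask for that speaker. All function names and the additive-fusion choice are assumptions for illustration only.

```python
import numpy as np

def dilated_conv1d(x, kernel, dilation):
    """Causal 1-D convolution with a dilation factor (hypothetical helper).
    Dilation spaces out the kernel taps, widening the temporal receptive
    field without adding parameters."""
    k = len(kernel)
    pad = (k - 1) * dilation
    xp = np.pad(x, (pad, 0))  # left-pad so output length matches input
    return np.array([
        sum(kernel[j] * xp[i + pad - j * dilation] for j in range(k))
        for i in range(len(x))
    ])

def separation_mask(audio_feat, face_embedding, kernel, dilation=2):
    """Toy audio-visual fusion: dilated-conv audio features plus a scalar
    face embedding for the chosen speaker, squashed into a [0, 1] mask
    that would weight that speaker's spectrogram bins."""
    conv = dilated_conv1d(audio_feat, kernel, dilation)
    fused = conv + face_embedding          # naive additive fusion (assumption)
    return 1.0 / (1.0 + np.exp(-fused))    # sigmoid keeps the mask in [0, 1]

mask = separation_mask(np.ones(10), 0.5, np.array([0.2, 0.3]), dilation=2)
```

In the real model the mask is predicted per time-frequency bin and multiplied with the noisy spectrogram to recover the selected speaker's audio; the sketch keeps only the one-dimensional skeleton of that idea.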