Google patents method of matching voices to speakers’ faces in video
A patent filed by Google for an automated method of matching faces to voices in videos has been published by the World Intellectual Property Organization.
The patent, which was originally filed in April of last year, describes a computer-implemented method for speech diarization, in which a convolutional neural network is used to recognize faces, and a machine learning model is applied to segments of speech to detect different speakers. Wikipedia describes speaker diarization as the process of partitioning an audio input stream into homogeneous segments according to speaker identity.
“The content system detects speech sounds in the audio track of the video, and clusters these speech sounds by individual distinct voice,” inventors Sourish Chaudhuri and Kenneth Hoover write in the application. “The content system further identifies faces in the video, and clusters these faces by individual distinct faces. The content system correlates the identified voices and faces to match each voice to each face. By correlating voices with faces, the content system is able to provide captions that accurately represent on-screen and off-screen speakers.”
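The correlation step the inventors describe can be illustrated with a minimal Python sketch. It is not the method in the application, which the quoted passage does not detail. It assumes the voice and face clustering has already produced labeled time spans, and it uses a simple stand-in heuristic: pair each voice with the face that is on screen for the most time while that voice is speaking. The function names, the cluster labels, and the `min_overlap` threshold are all hypothetical.

```python
from collections import defaultdict


def overlap(a_start, a_end, b_start, b_end):
    """Length in seconds of the intersection of two time intervals."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))


def match_voices_to_faces(voice_segments, face_segments, min_overlap=0.5):
    """Pair each voice cluster with the face cluster it co-occurs with most.

    Both inputs are lists of (start, end, label) tuples. A voice whose best
    face has less than `min_overlap` seconds of co-occurrence is mapped to
    None, i.e. treated as an off-screen speaker. This co-occurrence rule is
    an assumption made for illustration only.
    """
    # Total seconds each (voice, face) pair is active at the same time.
    totals = defaultdict(float)
    for v_start, v_end, voice in voice_segments:
        for f_start, f_end, face in face_segments:
            totals[(voice, face)] += overlap(v_start, v_end, f_start, f_end)

    matches = {}
    for _, _, voice in voice_segments:
        candidates = {f: t for (v, f), t in totals.items() if v == voice}
        best = max(candidates, key=candidates.get, default=None)
        if best is None or candidates[best] < min_overlap:
            matches[voice] = None  # no face on screen long enough
        else:
            matches[voice] = best
    return matches


# Hypothetical clustering output: (start_sec, end_sec, cluster_label).
voice_segments = [
    (0.0, 4.0, "voice_A"),
    (4.0, 7.0, "voice_B"),
    (7.0, 9.0, "voice_A"),
    (9.0, 11.0, "voice_C"),  # nobody is on screen during this span
]
face_segments = [
    (0.0, 4.5, "face_1"),
    (4.0, 7.0, "face_2"),
    (6.5, 9.0, "face_1"),
]

matches = match_voices_to_faces(voice_segments, face_segments)
for voice, face in matches.items():
    print(voice, "->", face if face is not None else "off-screen")
```

In this toy example the third voice never overlaps a visible face, so it is labeled off-screen, which mirrors the captioning distinction between on-screen and off-screen speakers mentioned in the quote.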
Google researchers also published a paper earlier this year detailing an audio-visual method for using AI to separate speech from different individuals, mimicking the “cocktail party effect.”