MIT researchers demonstrate cross-modal biometrics with facial approximation from voice
Researchers at MIT have developed a neural network model that associates vocal characteristics with facial features to generate approximate images of speakers from short audio clips, Fast Company reports.
“Speech2Face: Learning the Face Behind a Voice” goes beyond predicting age and gender from speech by including “non-negligible correlations between craniofacial features (e.g., nose structure) and voice,” the researchers write.
“We have demonstrated that our method can predict plausible faces with the facial attributes consistent with those of real images,” they conclude. “By reconstructing faces directly from this cross-modal feature space, we validate visually the existence of cross-modal biometric information postulated in previous studies.”
The researchers briefly acknowledge ethical concerns raised by the technology, but suggest it could be used for applications such as attaching a representative face to phone or video calls. The system, they note, only generalizes from common physical features, such as age and gender, rather than attempting to produce an image specific to the speaker.
Cloudflare researcher Nick Sullivan tweeted his concern about the inclusion of his face and voice in the research, but Fast Company notes that voice biometrics have mostly slipped under the radar of lawmakers and regulators now scrutinizing facial recognition.