Voice deepfakes are inkblots for consumers – they’re threats or entertainment
A recent marketing survey funded by a voice biometrics firm has found that a majority of respondents are concerned about the threat of deepfakes and voice clones.
The company, Pindrop, held a webinar to discuss the survey. Consumers surveyed were more likely to report negative feelings about voice cloning and deepfakes than positive ones, but the totals are not hugely lopsided. Synthedia performed the research.
Voice deepfakes can be detected with software, but the technology is still maturing, as Identt AI Technical Leader for Biometrics Piotr Kawa explained in another webinar to members of the EAB (European Association for Biometrics).
Consumers’ opinions are well-grounded, according to Pindrop webinar panelist Bret Kinsella, CEO and research director of trade publisher Voicebot.ai.
“The level of awareness (among consumers) is higher than I thought it’d be,” Kinsella said. Awareness is not the only insight from the survey that caught his eye.
Among consumers surveyed about deepfakes, 22.3 percent said they felt extremely positive about use of the software, and an identical 22.3 percent said they felt extremely negative.
When surveyors asked consumers about voice clones as a concept, 18.8 percent saw maximum upside. More people, 21.6 percent, were extremely negative about voice clones.
Among the positives some respondents saw in both voice clones and deepfakes was improved entertainment. Not surprisingly, people who were more concerned saw the negative possibility of impersonations and other problems.
Social media is where most people encounter doctored video and audio. In descending order: YouTube, TikTok, Instagram and Facebook. After that it’s movies and news publications.
Kinsella said that is a problem because it is harder to detect deepfakes and voice clones when someone is distracted.
That matches research Kawa cited, published in 2021, which found that only 80 percent of participants in a study were able to correctly identify the authenticity of content they were shown. Detection algorithms set on the same task were right 95 percent of the time. Subsequent studies are no more reassuring.
Building resources to meet generalization challenges
Kawa began the latest EAB lunch talk with an overview of speech synthesis, and the impact generative AI has had on the field. A variety of commercial SaaS and open-source tools for speech synthesis are now widely available, making it “pretty easy” to synthesize speech, according to Kawa.
He differentiated between text-to-speech (TTS) and voice conversion, in which one person is made to sound like another. Either can be used to carry out audio deepfake attacks.
Deepfake detection methods today mostly rely on deep learning algorithms developed by biometrics researchers, and largely work by finding artefacts left by synthetic speech algorithms. Kawa listed over a dozen, divided between models that operate on raw audio and front-end-based models, the latter using either algorithmic front-ends or embedding-based front-ends from self-supervised learning.
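The artefact-hunting idea can be illustrated with a deliberately crude, hypothetical sketch: over-smooth synthetic audio often has an unnaturally clean spectrum, which even a simple statistic like spectral flatness can pick up. Everything below (the threshold, the signals, the decision rule) is invented for illustration; real detectors are the deep learning models Kawa surveyed, not a one-feature rule like this.

```python
import numpy as np

def spectral_flatness(signal: np.ndarray) -> float:
    # Power spectrum of the clip; small floor avoids log(0)
    spectrum = np.abs(np.fft.rfft(signal)) ** 2 + 1e-12
    # Geometric mean / arithmetic mean: near 1 for noise-like audio,
    # near 0 for audio whose energy sits in a few spectral bins
    return float(np.exp(np.mean(np.log(spectrum))) / np.mean(spectrum))

def classify(signal: np.ndarray, threshold: float = 0.1) -> str:
    # Toy rule: unnaturally "clean" (low-flatness) audio is flagged as synthetic
    return "synthetic" if spectral_flatness(signal) < threshold else "bona fide"

rng = np.random.default_rng(0)
t = np.linspace(0, 1, 16000, endpoint=False)
# Stand-in for real speech: a tone buried in broadband noise
noisy_speechlike = np.sin(2 * np.pi * 220 * t) + 0.5 * rng.standard_normal(16000)
# Stand-in for an over-smooth synthetic clip: a pure tone with no noise floor
pure_tone = np.sin(2 * np.pi * 220 * t)

print(classify(noisy_speechlike))  # bona fide
print(classify(pure_tone))         # synthetic
```

The point of the sketch is only the pipeline shape shared by the methods Kawa described: extract a representation of the audio, then score it for synthesis artefacts.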
The number of datasets to train them on has also increased rapidly, particularly over the last two years, according to Kawa.
Audio deepfake detection faces a major challenge in generalization. Kawa demonstrated how models tend to do a good job of detecting deepfakes that are created using the same techniques as the dataset the detection model was trained on. For those made with different techniques, however, performance is poor.
Larger training databases that include deepfakes made with various techniques, along with data augmentation techniques, can improve the detection results, but introducing variables like more background noise can make fakes more difficult to detect.
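One augmentation of the kind mentioned above, mixing background noise into training clips at a controlled signal-to-noise ratio, can be sketched as follows. The function name and parameters are illustrative, not taken from any particular toolkit:

```python
import numpy as np

def add_noise_at_snr(clean: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    """Mix background noise into a clip at a requested signal-to-noise ratio (dB)."""
    clean_power = np.mean(clean ** 2)
    noise_power = np.mean(noise ** 2)
    # Scale noise so that clean_power / (scale**2 * noise_power) == 10**(snr_db / 10)
    scale = np.sqrt(clean_power / (noise_power * 10 ** (snr_db / 10)))
    return clean + scale * noise

rng = np.random.default_rng(1)
t = np.linspace(0, 1, 16000, endpoint=False)
clean = np.sin(2 * np.pi * 180 * t)   # stand-in for a clean training utterance
noise = rng.standard_normal(16000)    # stand-in for recorded background noise
augmented = add_noise_at_snr(clean, noise, snr_db=10.0)
```

Training on such noisier copies can make a detector more robust to varied recording conditions, which is the upside, while the same noise can mask the very artefacts the detector looks for, which is the trade-off the article notes.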
Kawa concluded with a review of open problems in deepfake detection, including generalization and creating models that can run quickly on consumer-grade electronics.