Voice deepfakes from single facial image reveal fine-tuning detection trade-off

A technique for generating a spoof of a person’s voice from only a single facial image, demonstrated at the USENIX Security 2024 conference, is among the more alarming deepfake creation methods uncovered so far. Worse, voice deepfake detection tools on the market tend to struggle with these audio deepfakes, according to a team of Australian researchers.
Fortunately, as the team from Australian digital research network Data61 at CSIRO shows in a recently published paper, it is possible to tune those tools to more accurately detect deepfakes created with face-to-voice synthesis, also known as “FOICE.”
In the paper “Can Current Detectors Catch Face-to-Voice Deepfakes?”, the researchers tested FOICE outputs against biometric voice authentication software including WeChat Voiceprint and Microsoft Azure. The spoof attempts were frequently successful, and the success rate approached 100 percent when multiple attempts were made.
The researchers point out that this is troubling because facial images are more widely available than voice samples.
Four deepfake detectors the researchers characterize as state-of-the-art models “that span distinct architectural families and design goals” performed poorly when tested with deepfakes produced from four datasets. The best performer, AASIST, had an equal error rate (EER) of 0.163. All models improved when fine-tuned, with AASIST’s EER dropping to 0.003.
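For readers unfamiliar with the metric, EER is the error rate at the decision threshold where a detector wrongly accepts spoofed audio as often as it wrongly rejects genuine audio, so lower is better. The sketch below is a minimal illustration of that general definition, not the researchers' evaluation code. The function name and the sample scores are hypothetical, and it assumes that a higher score means the detector considers the clip more likely to be genuine.

```python
def equal_error_rate(bonafide_scores, spoof_scores):
    """Approximate the EER by sweeping every observed score as a threshold.

    Assumption for this sketch: higher score = more likely genuine (bona fide).
    """
    thresholds = sorted(set(bonafide_scores) | set(spoof_scores))
    best_gap, eer = float("inf"), None
    for t in thresholds:
        # False rejection rate: genuine clips scored below the threshold.
        frr = sum(s < t for s in bonafide_scores) / len(bonafide_scores)
        # False acceptance rate: spoofed clips scored at or above the threshold.
        far = sum(s >= t for s in spoof_scores) / len(spoof_scores)
        gap = abs(far - frr)
        if gap < best_gap:
            # Keep the operating point where the two error rates are closest.
            best_gap, eer = gap, (far + frr) / 2
    return eer


# Hypothetical scores, made up purely for illustration.
genuine = [0.9, 0.8, 0.7, 0.4]
spoofed = [0.6, 0.3, 0.2, 0.1]
print(equal_error_rate(genuine, spoofed))
```

On that scale, an EER of 0.163 means roughly one in six clips is misjudged in each direction at the balanced threshold, while 0.003 corresponds to about three in a thousand.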
Three of the four fine-tuned voice deepfake detectors became less accurate at identifying other kinds of spoofs, however. AASIST’s accuracy dropped only modestly and the Ren et al. model’s accuracy actually improved, but TCM’s dropped by 10 percent and the Sun et al. model was rendered almost completely ineffective.
“Only domain-invariant approaches maintained relatively stable cross-vocoder behavior; noise robustness varied widely, and denoising can unintentionally remove forensic cues,” the researchers conclude. “Lasting defenses therefore require (i) larger, more diverse corpora (including FOICE variants and modern vocoders) and (ii) architectures and training regimes that target vocoder-independent, cross-modal representations.”
Voice deepfake checks are forecast to surpass 4.8 billion and generate over $2.4 billion in revenue by 2027, according to the 2025 Deepfake Detection Market Report and Buyers Guide from Biometric Update and Goode Intelligence.





