New studies warn of difficulty detecting audio deepfakes, but progress is being made

As AI-enhanced audio deepfakes advance, distinguishing what’s real from what’s fake is becoming increasingly difficult, allowing the technology – which isn’t hard to find or use – to be employed ever more often for criminal and malign purposes. That highlights the pressing need for robust audio deepfake detection (ADD) systems that can weed out the threats.

ADD is the process of detecting spoofing attacks generated by text-to-speech or voice conversion systems. The problem is that ADD technologies are struggling to keep pace with the threats they are designed to catch.

While recent research indicates that deepfake audio is becoming ever harder to pinpoint, papers published in recent weeks and months offer promising solutions. However, those solutions may be out of reach for media organizations and the general public.

V.S. Subrahmanian, a Northwestern University computer science professor, tested 14 publicly available detection tools, and told The Poynter Institute that “you cannot rely on audio deepfake detectors today, and I cannot recommend one for use.”

In an interview with Scientific American earlier this year, University of California, Berkeley computer science professor Hany Farid, who studies digital forensics and media analysis, said the skill level needed to identify AI-generated audio is “very high. There’s a huge asymmetry here – in part because there’s a lot of money to be made by creating fake stuff, but there’s not a lot of money to be made in detecting it. Detection is also harder because it’s subtle; it’s complicated; the bar is always moving higher. I can count on one hand the number of labs in the world that can do this in a reliable way. That’s disconcerting.”

Farid said the deepfake detection tools that are publicly available today simply aren’t “reliable enough. I wouldn’t use them. The stakes are too high not only for individual peoples’ livelihoods and reputations but also for the precedent that each case sets.”

Nevertheless, research to combat the problem continues in earnest because of the increasing threat that’s posed to privacy and security. And if recent research findings are accurate, potential solutions could be on the horizon.

In their paper, Deepfake Forensics: A Survey of Digital Forensic Methods for Multimodal Deepfake Identification on Social Media, researchers from the Department of Computer Science at COMSATS University Islamabad, Lahore, Pakistan, and the Department of Cybersecurity, College of Computing, Umm Al-Qura University, Makkah City, Kingdom of Saudi Arabia, said their “systematic survey has illuminated the pressing need for advancing innovation in digital forensic techniques to combat the rapidly evolving threat of deepfakes.”

The research team said that “while methods are progressing, limitations around cross-modality detection, real-time capability, algorithmic bias, and insufficient generalization reveal blindspots demanding attention from researchers. Practical constraints also persist around aspects like computational overhead and the quality/diversity of training datasets.”

However, they said there are “several promising directions” they found that “can guide future efforts to address these gaps. Exploring self-supervised and semi-supervised techniques can potentially reduce dependence on large, labeled datasets,” and that “simpler specialized models can improve detection accuracy while minimizing training requirements. Multi-modal frameworks fusing audio, visual, and textual cues also warrant deeper investigation. Notably, research into ethical considerations around privacy, consent and potential suppression of legitimate speech merits priority to balance security and freedom of expression as detection capability evolves.”

“However,” they pointed out, “the most pivotal direction remains sustained, rapid-cycle innovation as deepfake generation methods continue advancing unabated. Developing agile adaptation mechanisms to respond to novel manipulation techniques could be game-changing. Fostering open-source decentralized communities to crowdsource detection development might confer an edge over adversaries. Insights from intersecting domains like computer vision and multimedia forensics also need synthesis to spur breakthroughs. Underscoring it all is the need to increase awareness among citizens and policymakers so that evidence-based defenses can be enacted before threats overwhelm.”

“For reliable detection, ADD systems must be robust against emerging and unknown deepfake techniques, provide justifiable evidence for their decisions, and integrate seamlessly with other detection tools,” wrote scientists conducting an ongoing study funded by the National Institute of Justice (NIJ), a component of the US Department of Justice.

One of the scientists, You (Neil) Zhang, a PhD candidate at the Audio Information Research Lab at the University of Rochester, will present the group’s findings next week at the NIJ’s 2024 Forensic Science Graduate Research Symposium, hosted by the agency’s Forensic Technology Center of Excellence.

The team is working on “a one-class learning approach that compacts the distribution of bona fide speech representations while pushing away deepfake attacks, thereby enhancing detection performance. This framework also encourages the separation of deepfakes in the embedding space and clusters real recordings from diverse settings around multiple centers. This results in ADD systems that surpass existing models in detecting deepfakes generated from novel methods.”
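
To make the idea concrete, here is a minimal, single-center sketch of a one-class margin loss in the spirit the researchers describe: bona fide speech embeddings are pulled toward a learned center while deepfake embeddings are pushed beyond a margin. The embedding dimension, margins, and scale factor are illustrative assumptions, and the simplification omits the multiple bona fide centers the team mentions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class OneClassMarginLoss(nn.Module):
    """Illustrative one-class loss: compact bona fide embeddings around a
    learned center, push spoofed embeddings below a smaller similarity margin."""
    def __init__(self, embed_dim=160, margin_real=0.9, margin_fake=0.2, alpha=20.0):
        super().__init__()
        self.center = nn.Parameter(torch.randn(1, embed_dim))
        self.margin_real = margin_real   # minimum cosine similarity for bona fide speech
        self.margin_fake = margin_fake   # maximum cosine similarity for deepfakes
        self.alpha = alpha               # scale factor for the softplus loss

    def forward(self, embeddings, labels):
        # labels: 1 = bona fide, 0 = deepfake
        w = F.normalize(self.center, dim=1)
        x = F.normalize(embeddings, dim=1)
        scores = (x @ w.t()).squeeze(1)  # cosine similarity to the center
        # bona fide should score above margin_real; deepfakes below margin_fake
        margin = torch.where(labels == 1,
                             self.margin_real - scores,
                             scores - self.margin_fake)
        loss = F.softplus(self.alpha * margin).mean()
        return loss, scores              # the similarity score doubles as a detection score
```

At inference time, the cosine similarity to the learned center can be thresholded directly, which is what makes the one-class formulation attractive for deepfakes generated by methods never seen in training.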

In the paper, Does Audio Deepfake Detection Generalize?, published last month, researchers from the Munich, Germany-based Fraunhofer Institute for Applied and Integrated Security, the Technical University of Munich, and Berlin-based Why Do Birds GmbH said “current text-to-speech algorithms produce realistic fakes of human voices, making deepfake detection a much-needed area of research.” But, they said, “while researchers have presented various deep learning models for audio spoofs detection, it is often unclear exactly why these architectures are successful: Preprocessing steps, hyperparameter settings, and the degree of fine-tuning are not consistent across related work. Which factors contribute to success, and which are accidental?”

In their conclusion, the researchers said they found “that the ‘in-the-wild’ generalization capabilities of many models may have been overestimated. We demonstrate this by collecting our own audio deepfake dataset and evaluating twelve different model architectures on it. Performance drops sharply, and some models degenerate to random guessing. It may be possible that the community has tailored its detection models too closely to the prevailing benchmark, ASVSpoof, and that deepfakes are much harder to detect outside the lab than previously thought.”
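
How such a performance drop is typically quantified is worth spelling out: anti-spoofing systems are usually scored by equal error rate (EER) on held-out data, and generalization is judged by comparing the EER on a familiar benchmark against the EER on unseen, in-the-wild recordings. The sketch below, with illustrative variable names, is one straightforward way to compute that comparison; it is not the authors' evaluation code.

```python
import numpy as np

def compute_eer(bona_fide_scores, spoof_scores):
    """Equal error rate: the operating point where false acceptance of spoofs
    equals false rejection of bona fide speech. Higher score = more 'bona fide'."""
    thresholds = np.sort(np.concatenate([bona_fide_scores, spoof_scores]))
    far = np.array([(spoof_scores >= t).mean() for t in thresholds])      # false acceptance rate
    frr = np.array([(bona_fide_scores < t).mean() for t in thresholds])   # false rejection rate
    idx = np.argmin(np.abs(far - frr))
    return (far[idx] + frr[idx]) / 2.0

# Illustrative cross-dataset check (scores arrays are placeholders):
# eer_benchmark = compute_eer(scores_benchmark_real, scores_benchmark_fake)
# eer_wild      = compute_eer(scores_wild_real, scores_wild_fake)
# A large gap between the two EERs signals the overfitting the paper describes.
```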

Similarly, in their paper published in June, Harder or Different? Understanding Generalization of Audio Deepfake Detection, researchers from Fraunhofer Institute for Applied and Integrated Security; EURECOM, a French graduate school and research center in digital sciences at the Institut Mines-Télécom; and Pindrop, USA, said “experiments performed using ASVspoof databases indicate that the hardness component is practically negligible, with the performance gap being attributed primarily to the difference component,” and that “this has direct implications for real-world deepfake detection, highlighting that merely increasing model capacity, the currently dominant research trend, may not effectively address the generalization challenge.”

In their Journal of Electrical Systems paper, Efficient Deepfake Audio Detection Using Spectro-Temporal Analysis and Deep Learning, the authors acknowledged that “with the advancement of deepfake technology, particularly in the audio domain, there is an imperative need for robust detection mechanisms to maintain digital security and integrity.” However, they said that “by integrating advanced spectro-temporal analysis with a hybrid deep learning model” they were able to develop “a robust framework [that was] capable of distinguishing between genuine and manipulated audio with high accuracy.”
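
The paper's exact model is not reproduced here, but the general recipe it names, spectro-temporal features feeding a hybrid deep network, can be sketched as follows. The log-mel front end and the toy CNN-plus-GRU classifier are assumptions for illustration only, not the authors' architecture or hyperparameters.

```python
import librosa
import numpy as np
import torch
import torch.nn as nn

def log_mel_spectrogram(path, sr=16000, n_mels=80):
    """Spectro-temporal features: log-mel spectrogram of a mono waveform."""
    y, _ = librosa.load(path, sr=sr, mono=True)
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels)
    return np.log(mel + 1e-6)                 # shape: (n_mels, time)

class HybridDetector(nn.Module):
    """Toy hybrid model: CNN over the spectrogram, GRU over time, binary output."""
    def __init__(self, n_mels=80):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        self.gru = nn.GRU(input_size=32 * (n_mels // 4), hidden_size=64, batch_first=True)
        self.head = nn.Linear(64, 1)           # logit: bona fide vs. deepfake

    def forward(self, spec):                   # spec: (batch, 1, n_mels, time)
        h = self.cnn(spec)                     # (batch, 32, n_mels // 4, time // 4)
        h = h.permute(0, 3, 1, 2).flatten(2)   # (batch, time // 4, features)
        _, last = self.gru(h)
        return self.head(last.squeeze(0))      # (batch, 1)
```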

In their paper, Audio-Deepfake Detection: Adversarial Attacks and Countermeasures, published this week in Expert Systems with Applications, authors Mouna Rabhi, College of Science and Engineering, Hamad Bin Khalifa University, Doha, Qatar; Spiridon Bakiras, Singapore Institute of Technology; and Roberto Di Pietro, King Abdullah University of Science and Technology, Saudi Arabia, wrote that “audio has always been a powerful resource for biometric authentication: thus, numerous AI-based audio authentication systems (classifiers) have been proposed. While these classifiers are effective in identifying legitimate human-generated input, their security, to the best of our knowledge, has not been explored thoroughly when confronted with advanced attacks that leverage AI-generated deepfake audio.”

The research team concluded that “GAN-based adversarial attacks are quite effective in DNN-trained models and can cause serious threats to DNN detectors. However, such attacks have not yet been addressed in the context of audio-deepfake detection.”

The researchers said they “attempted to fill this gap by demonstrating that a state-of-the-art audio-deepfake detector can be bypassed easily if, as commonly assumed in the literature, the adversary possesses knowledge of the detector’s architecture, and the dataset used for training.”

The researchers further said they were able to “demonstrate that state-of-the-art audio deepfake classifiers are vulnerable to adversarial attacks.”
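
The study's attacks are GAN-based; as a simpler illustration of the same white-box threat model, the sketch below applies a gradient-sign (FGSM-style) perturbation to a deepfake waveform so that a differentiable detector's “bona fide” score rises. The detector interface and the perturbation budget are assumptions, and this is a stand-in for, not a reproduction of, the paper's method.

```python
import torch

def fgsm_evade(detector, waveform, epsilon=0.002):
    """White-box evasion sketch: nudge a deepfake waveform in the direction of the
    gradient sign so the detector's 'bona fide' logit increases (FGSM-style).
    `detector` is any differentiable model mapping a waveform tensor to a logit."""
    waveform = waveform.clone().detach().requires_grad_(True)
    logit = detector(waveform)                    # higher = judged more 'bona fide'
    logit.sum().backward()
    adversarial = waveform + epsilon * waveform.grad.sign()
    return adversarial.clamp(-1.0, 1.0).detach()  # keep samples in a valid audio range
```

The point the authors make is that when the adversary knows the detector's architecture and training data, even small, carefully chosen perturbations like this can flip the classifier's decision while leaving the audio perceptually unchanged.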

Last week at Interspeech 2024 in Kos, Greece, researchers presented their paper, Source Tracing of Audio Deepfake Systems. The team said “while current research on anti-spoofing systems primarily focuses on assessing whether a given audio sample is fake or genuine, there has been limited attention on discerning the specific techniques to create the audio deepfakes. Algorithms commonly used in audio deepfake generation, like text-to-speech and voice conversion, undergo distinct stages including input processing, acoustic modeling, and waveform generation.”

The researchers introduced “a system designed to classify various spoofing attributes, capturing the distinctive features of individual modules throughout the entire generation pipeline,” and evaluated their “system on two datasets: the ASVspoof 2019 Logical Access and the Multi-Language Audio Anti-Spoofing Dataset. Results from both experiments demonstrate the robustness of the system to identify the different spoofing attributes of deepfake generation systems.”
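
A minimal sketch of that multi-attribute idea might look like the following: a shared embedding from an anti-spoofing front end feeds one classification head per stage of the generation pipeline. The attribute inventories and dimensions below are placeholders, not the categories the authors actually used.

```python
import torch
import torch.nn as nn

class SourceTracer(nn.Module):
    """Toy source-tracing head: a shared embedding, one classifier per pipeline
    stage. Class counts are illustrative placeholders."""
    def __init__(self, embed_dim=256):
        super().__init__()
        self.heads = nn.ModuleDict({
            "input_processing":   nn.Linear(embed_dim, 3),  # e.g. text vs. speech input
            "acoustic_model":     nn.Linear(embed_dim, 6),  # e.g. families of acoustic models
            "waveform_generator": nn.Linear(embed_dim, 5),  # e.g. vocoder families
        })

    def forward(self, embedding):
        # One set of logits per generation stage
        return {stage: head(embedding) for stage, head in self.heads.items()}

# Training would typically sum a cross-entropy loss per stage:
# loss = sum(F.cross_entropy(logits[s], labels[s]) for s in logits)
```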

Researchers at the Federal University of Ceará, Campus de Sobral, Brazil, have also put forward promising research. They state in their paper, Speech Audio Deepfake Detection via Convolutional Neural Networks, published in the proceedings of the 2024 IEEE International Conference on Evolving and Adaptive Intelligent Systems, that “supervised experiments with speech samples signals, collected from several voice datasets, were conducted to find the best convolutional neural networks (CNN) topology that performs the detection, in terms of accuracy, regardless of the language spoken.”

They reported that “the best accuracy scores found are: 99 percent for the FoR dataset, 94 percent for the ASV, and 98 percent for the WaveFake. Training the model with all datasets together, and testing with individual datasets, yields accuracies of 98 percent for the FoR base, 92 percent for the ASV, and 96 percent for WaveFake.”

“These results are compatible with those found in state-of-the-art, proving the viability of the model,” the researchers said.

In their paper, AASIST3: KAN-Enhanced AASIST Speech Deepfake Detection using SSL Features and Additional Regularization, presented at the recent ASVspoof 2024, researchers noted that “the advancement of deep learning algorithms has enabled the generation of synthetic audio through text-to-speech and voice conversion systems, exposing ASV systems to potential vulnerabilities.” Using a novel architecture named AASIST3, an enhanced version of the AASIST framework that incorporates Kolmogorov-Arnold networks, additional layers, encoders, and pre-emphasis techniques, they were able to achieve “a more than twofold improvement in performance … significantly enhancing the detection of synthetic voices and improving ASV security.”
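
Of the components listed, pre-emphasis is the simplest to illustrate: a first-order high-pass filter applied to the raw waveform before feature extraction, which boosts the high frequencies where many synthesis artifacts live. The coefficient below is the conventional 0.97, not necessarily the value used in AASIST3.

```python
import numpy as np

def pre_emphasis(waveform, coeff=0.97):
    """First-order pre-emphasis filter: y[n] = x[n] - coeff * x[n-1]."""
    return np.append(waveform[0], waveform[1:] - coeff * waveform[:-1])
```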

Still, the problem remains a challenge. Jennifer Williams, a lecturer at the University of Southampton who specializes in audio AI safety, told The Poynter Institute earlier this year that “detecting audio deepfakes is an active research area, meaning that it is currently treated as an unsolved problem.”
