New Microsoft benchmark for evaluating deepfake detection prioritizes breadth

Open source dataset project sees collaboration with Northwestern, WITNESS

Jul 22, 2025, 2:54 pm EDT | Joel R. McConvey

Juan M. Lavista Ferres, corporate vice president and chief data scientist at Microsoft’s AI for Good Lab, has announced the release of a “large-scale, open-source benchmark for evaluating deepfake and manipulated media detection systems.”

Writing on LinkedIn, he says the initiative is a collaborative effort between the lab, Northwestern University’s Security and AI Lab, and tech-focused human rights nonprofit WITNESS. In Northwestern University’s words, it is “intended to help evaluate and improve algorithms to detect AI-generated audio, video, and image content.”

Lavista Ferres says it “introduces a rigorously curated dataset designed to support robust, real-world evaluation of multimodal detection tools,” intended to provide a shared foundation for empirical comparison of detection methods.

The dataset is licensed for evaluation only and is not intended for training or commercial use.

The dataset includes more than 50,000 samples of real, AI-generated and manipulated audio-visual content – deepfakes and synthetic media – annotated with data from real-world use cases. It also includes adversarial attacks, which allow model robustness to be tested.

Lavista Ferres says the benchmark is intended to support research in multimodal forensics, adversarial robustness and detection in real-world media ecosystems, and invites the research community to “explore the dataset and help maintain its relevance by contributing new data and evaluation protocols over time.”

Northwestern offers more background on the project, and how it is driven by advances in generative technologies. “In the past few years, a new paradigm has emerged with the diffusion architecture, showing impressive achievements in audio, image and video generation,” it says. “Previous approaches to detection are now obsolete and the detection scene must re-invent itself.”

The summary from Northwestern notes that, historically, deepfake detection models were evaluated on large datasets released during deepfake detection challenges. “These datasets typically had a lot of depth but almost no breadth. They were suitable for the previous era (the GAN era) but are not up to the challenge brought by the new generative AI landscape and the evolving type of harm it brings: scams, non-consensual intimate image generation, disinformation, etc.”

“We argue that depth is less important than breadth and we propose the creation of an evaluation set that contains small samples of as many generators and ‘in the wild’ cases as possible – rather than millions of samples from a few generators.”
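One way to see why breadth matters is to compare a pooled score with a per-generator score. The Python sketch below is only an illustration, not the benchmark's actual evaluation protocol: the generator names, the sample tuples and the function names are all hypothetical. It shows how a detector that handles one heavily sampled generator well can look strong on pooled accuracy while failing completely on a generator that is barely represented.

```python
from collections import defaultdict

def per_generator_accuracy(samples):
    """Accuracy per generator.

    samples: list of (generator, is_fake, predicted_fake) tuples.
    This tuple layout is an assumption made for illustration only.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for generator, is_fake, predicted_fake in samples:
        total[generator] += 1
        correct[generator] += int(is_fake == predicted_fake)
    return {g: correct[g] / total[g] for g in total}

def pooled_accuracy(samples):
    """Accuracy over all samples, ignoring which generator made them."""
    hits = sum(int(is_fake == pred) for _, is_fake, pred in samples)
    return hits / len(samples)

def macro_accuracy(samples):
    """Unweighted mean of per-generator accuracies (each generator counts once)."""
    per_gen = per_generator_accuracy(samples)
    return sum(per_gen.values()) / len(per_gen)

# Hypothetical data: a "deep" set dominated by one generator the detector
# handles well, plus two samples from a newer generator that it misses.
samples = (
    [("gan_a", True, True)] * 8            # 8 fakes, all detected
    + [("diffusion_b", True, False)] * 2   # 2 fakes, both missed
)

print(pooled_accuracy(samples))         # 0.8
print(macro_accuracy(samples))          # 0.5
print(per_generator_accuracy(samples))  # {'gan_a': 1.0, 'diffusion_b': 0.0}
```

In this toy case the pooled score of 0.8 hides a total failure on the second generator, which the per-generator breakdown exposes. An evaluation set with small samples from many generators makes that kind of breakdown possible across a much wider range of sources.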
