July 22, 2025

Juan M. Lavista Ferres, corporate vice president and chief data scientist at Microsoft’s AI for Good Lab, has announced the release of a “large-scale, open-source benchmark for evaluating deepfake and manipulated media detection systems.”

Writing on LinkedIn, he says the initiative is a collaborative effort between the lab, Northwestern University’s Security and AI Lab, and tech-focused human rights nonprofit WITNESS. In Northwestern University’s words, it is “intended to help evaluate and improve algorithms to detect AI-generated audio, video, and image content.”

Lavista Ferres says it “introduces a rigorously curated dataset designed to support robust, real-world evaluation of multimodal detection tools,” intended to provide a shared foundation for empirical comparison of detection methods.

The dataset is licensed for evaluation only and is not intended for training or commercial purposes.

The dataset includes more than 50,000 samples of real, AI-generated and manipulated audio-visual content – deepfakes and synthetic media – annotated with data from real-world use cases. It also includes adversarial attacks, allowing model robustness to be tested.

Lavista Ferres says the benchmark is intended to support research in multimodal forensics, adversarial robustness and detection in real-world media ecosystems, and invites the research community to “explore the dataset and help maintain its relevance by contributing new data and evaluation protocols over time.”

Northwestern offers more background on the project and on how advances in generative technologies are driving it. “In the past few years, a new paradigm has emerged with the diffusion architecture, showing impressive achievements in audio, image and video generation,” it says. “Previous approaches to detection are now obsolete and the detection scene must re-invent itself.”

The summary from Northwestern notes that, historically, deepfake detection models were evaluated on large datasets released during deepfake detection challenges. “These datasets typically had a lot of depth but almost no breadth. They were suitable for the previous era (the GAN era) but are not up to the challenge brought by the new generative AI landscape and the evolving type of harm it brings: scams, non-consensual intimate image generation, disinformation, etc.”

“We argue that depth is less important than breadth and we propose the creation of an evaluation set that contains small samples of as many generators and ‘in the wild’ cases as possible – rather than millions of samples from a few generators.”

