Pindrop’s researchers have dropped a new paper on “Audio and Visual Deepfake Countermeasures for Robust Detection and Fine Grained-Localization.”

The paper “presents solutions for the problems of deepfake video classification and localization.” In this case, classification refers to the question, “does this video contain any synthetic content?” And localization means, “which segments of the video are synthetic, if any?”
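The difference between the two tasks can be illustrated with a minimal Python sketch. It assumes a hypothetical detector has already produced one fake-probability per frame; the function names, the 0.5 threshold and the 25 fps frame rate are illustrative assumptions, not details from the paper.

```python
from typing import List, Tuple


def classify(frame_scores: List[float], threshold: float = 0.5) -> bool:
    """Classification: does the video contain any synthetic content?"""
    return any(score >= threshold for score in frame_scores)


def localize(frame_scores: List[float], fps: float = 25.0,
             threshold: float = 0.5) -> List[Tuple[float, float]]:
    """Localization: which time segments (start, end in seconds) look synthetic?"""
    segments: List[Tuple[float, float]] = []
    start = None  # index of the first frame of the current suspicious run
    for i, score in enumerate(frame_scores):
        if score >= threshold and start is None:
            start = i  # a suspicious run begins here
        elif score < threshold and start is not None:
            segments.append((start / fps, i / fps))  # the run ended before frame i
            start = None
    if start is not None:  # the run continues to the end of the video
        segments.append((start / fps, len(frame_scores) / fps))
    return segments
```

Classification collapses the whole video into a single yes-or-no answer, while localization has to get the segment boundaries right, which is why the two tasks are scored separately.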

In effect, Pindrop’s team says that instead of detecting misalignments in audio and video streams, deepfake detection efforts should deploy “an ensemble of specialized networks that independently target audio and visual manipulations,” wherein specific architectures are optimized for each classification and localization task.

At the same time, the paper notes that “methods of learning from audio and visual information together can be explored as a way of improving performance over fused single-modality systems,” and, indeed, they probably should be.

The work centers on the face reenactment methods Diff2Lip and TalkLip, and “particularly focuses on lip synchronization and YourTTS and VITS audio generative engines.” The team combines an array of countermeasures in the form of audio and visual models, plus a fusion model.
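As a rough illustration of what fusing independent audio and visual detectors can look like, the Python sketch below applies a simple noisy-OR rule to two scores. This rule is an assumption chosen for clarity, not Pindrop’s fusion model, which the article does not describe; the function and parameter names are hypothetical.

```python
def fuse_scores(audio_fake_prob: float, visual_fake_prob: float) -> float:
    """Noisy-OR late fusion: the clip is fake if the audio OR the video was manipulated.

    Each input is assumed to come from a separate single-modality detector.
    """
    for p in (audio_fake_prob, visual_fake_prob):
        if not 0.0 <= p <= 1.0:
            raise ValueError("probabilities must lie in [0, 1]")
    # Probability that at least one modality is fake, treating the two as independent.
    return 1.0 - (1.0 - audio_fake_prob) * (1.0 - visual_fake_prob)
```

With a rule like this, a confident alarm from either detector is enough to flag the clip, which fits datasets where only the audio, only the video, or both may have been altered.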

The project was submitted to the ACM 1M-Deepfakes Detection Challenge, wherein the team achieved “best performance in the temporal localization task and a top four ranking in the classification task for the TestA split of the evaluation dataset.”

Detection challenges have played a significant role in driving innovation in the absence of an international standard for deepfake detection. The 1M-Deepfakes Detection Challenge is based on the AV-Deepfake1M dataset, released in 2024, and on its extended and enhanced successor, AV-Deepfake1M++, released in 2025.

The latest dataset contains over two million samples across thousands of speakers, introducing “audio-level manipulations through word-level deletions, insertions, and replacements, followed by fine grained alignment of lip movements and facial expressions to match the altered speech content.”

Innovative deepfake detection technique looks into the light

The market for deepfake detection is growing, but it isn’t exclusive to biometric algorithms. A team from Cornell University has developed a novel system for watermarking video with fluctuations in on-location lighting, called “noise-coded illumination” (NCI).

According to New Atlas, the technique adds a very mild flicker to lights used during recording, which functions as a code. Though imperceptible to the naked human eye, the data can be read by a computer and used to reveal discrepancies in video segments.
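The underlying idea can be sketched in a few lines of Python. This is a simplified illustration, not the Cornell team’s algorithm: it assumes the verifier knows the secret flicker code and has one average brightness value per frame, then flags windows where the brightness no longer tracks the code. All names, the window size and the correlation threshold are hypothetical.

```python
import random
from statistics import mean
from typing import List


def make_code(n_frames: int, seed: int) -> List[int]:
    """Pseudo-random +/-1 flicker code, known only to whoever controls the light."""
    rng = random.Random(seed)
    return [rng.choice((-1, 1)) for _ in range(n_frames)]


def correlation(xs: List[float], ys: List[float]) -> float:
    """Pearson correlation; returns 0.0 if either input is constant."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / (var_x * var_y) ** 0.5


def suspicious_windows(brightness: List[float], code: List[int],
                       window: int = 50, min_corr: float = 0.5) -> List[int]:
    """Return start indices of windows where brightness does not follow the code."""
    flagged = []
    for start in range(0, len(code) - window + 1, window):
        c = correlation(brightness[start:start + window],
                        code[start:start + window])
        if c < min_corr:
            flagged.append(start)
    return flagged
```

Footage recorded under the coded light carries the flicker in every frame, so a segment that was replaced or synthesized afterwards stands out because its brightness is uncorrelated with the code.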

As the volume of deepfake fraud surges and the deepfake threat takes on new forms, more innovation will be needed to ensure that detection methods keep pace. Facial colour, blood flow, flickering lights: anything that can tell us whether what we’re looking at is real.
