Pindrop collaboration allows Nvidia to rein in zero-shot cloning feature

Pindrop has announced a collaboration with Nvidia to “advance defenses against unauthorized synthetic speech in support of building safe, robust, and responsibly deployed AI systems,” according to a company blog post.
Specifically, the voice deepfake detection firm is being tapped to provide adequate defenses against an until-now dormant feature in Nvidia’s Riva Magpie, a quadrilingual text-to-speech (TTS) model.
Zero-shot voice cloning is a tool based on the zero-shot learning concept, which refers to scenarios in which a model is not trained on any labeled examples of data classes it will be asked to make predictions about. As such, zero-shot cloning enables synthetic speech to generate a desired voice using just a few seconds of reference audio.
In Pindrop’s words, “‘zero-day’ cloning exploits occur when a new synthetic speech model is used before detection systems have seen or adapted to its artifacts. These blind spots can make even state-of-the-art protections vulnerable.”
The upshot is that voices will be easier to clone with less reference material, and – until now – nothing could detect the results. For this reason, Nvidia has withheld the feature. But with Pindrop among a group of firms granted early access to help develop and reinforce safeguards, it can soon be released into the world.
Pindrop gets to train its tech on cutting-edge models
The upside for Pindrop is clear: early access allows it to “proactively train detectors against emerging models before they’re widely available.” It says its detectors are designed to find subtle artifacts like unnatural prosody or spectral anomalies in each stage of the TTS process, and that the partnership with Nvidia allows it to assess detection accuracy across “a wide range of conditions, including male and female voices, multiple languages, short and long utterances, and varying sampling rates and compression levels.”
Nvidia’s AI and audio codec architectures are similar enough to the ones on which Pindrop trains its tech that Pindrop’s systems can generalize well, even for models it hasn’t yet encountered.
Pindrop reports: “In our initial evaluation of Riva Magpie, using a few thousand 5-second utterances, our technology was able to detect over 90 percent of synthetic samples with false accept rates below 1 percent (meaning fewer than 1 in 100 synthetic samples are incorrectly classified as genuine).” In a subsequent pass, samples were augmented with varying levels of noise, sampling rates and compression formats; the retrained model brought detection accuracy to 99.2 percent, while keeping false accept rates below one percent.
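To make those figures concrete, here is a minimal sketch of how detection rate and false accept rate are commonly computed for a deepfake detector, using the article’s definition of false accepts (synthetic samples wrongly classified as genuine). The scores, labels, and threshold below are hypothetical and are not Pindrop’s data or methodology.

```python
def detection_metrics(scores, labels, threshold):
    """scores: detector outputs (higher = more likely synthetic);
    labels: 1 = synthetic, 0 = genuine.
    Returns (detection_rate, false_accept_rate) over synthetic samples."""
    flagged = [s >= threshold for s in scores]
    # Outcomes for the synthetic samples only.
    synth_flags = [f for f, lab in zip(flagged, labels) if lab == 1]
    # Detection rate: share of synthetic samples correctly flagged.
    detection_rate = sum(synth_flags) / len(synth_flags)
    # False accept rate, as the article defines it: share of synthetic
    # samples incorrectly classified as genuine.
    false_accept_rate = 1 - detection_rate
    return detection_rate, false_accept_rate

# Hypothetical detector scores for six utterances.
scores = [0.97, 0.88, 0.92, 0.15, 0.65, 0.70]
labels = [1, 1, 1, 0, 0, 1]
dr, far = detection_metrics(scores, labels, threshold=0.8)
# → detection rate 0.75, false accept rate 0.25 for this toy set
```

In production systems the threshold is tuned on held-out data to trade off missed deepfakes against false alarms on genuine speakers; Pindrop’s reported numbers reflect that tuning across varied audio conditions.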
The collaboration is framed as a way to make sure detection systems keep up with potentially harmful generative AI. As such, while Pindrop gets training, Nvidia gets to push its latest technology into the market.
But is there a practical use for zero-shot cloning? Nvidia’s press materials lean on the by-now tired sales pitch that zero-shot cloning “unlocks creative applications,” even though it “can also create new opportunities for misuse, such as impersonation, fraud and misinformation.”
Pindrop’s tech may be up to the task of exposing it – but as AI continues to proliferate, one feels a creeping sense that some firms are simply unleashing potent fraud engines, with little benefit.
Article Topics
deepfake detection | deepfakes | generative AI | Nvidia | Pindrop | synthetic voice | voice biometrics | zero-shot cloning