‘Hi mom, it’s me’: voice cloning services demand stronger voice deepfake detection

“AI can steal your voice, and there’s not much you can do about it”: so says a recent headline from a Philadelphia NBC affiliate for a story that looks at the threat of voice cloning services. It cites a new Consumer Reports investigation, based on a survey of six “leading publicly available AI voice cloning tools,” which found that five have “easily bypassable safeguards.”
Consumer Reports identifies ElevenLabs, Speechify, PlayHT and Lovo as services that “erected no meaningful barriers to cloning someone’s voice without their consent” and “simply require checking a box saying that the person whose voice is being cloned had given authorization.” ElevenLabs is one of the few platforms that charge a fee to create a voice clone, but at $5 a pop, the charge is hardly a barrier to effective biometric voice cloning.
In its response, ElevenLabs – which was implicated in the deepfake Joe Biden robocall scam of January 2024 – says it is “implementing Coalition for Content Provenance and Authenticity (C2PA) standards by embedding cryptographically-signed metadata into the audio generated on our platform,” and lists customer screening, voice CAPTCHA and its No-Go Voice technology, which blocks the voices of hundreds of public figures, among the safeguards it already deploys.
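For readers unfamiliar with C2PA, the core idea is to cryptographically bind provenance claims (who generated the content, and when) to the content itself, so downstream tools can verify both the claims and that the audio has not been altered. The sketch below is a deliberately simplified illustration of that signed-provenance pattern, using Ed25519 signatures from Python’s cryptography package; it is not the actual C2PA manifest format, and the claim fields are hypothetical.

```python
# Simplified illustration of signed provenance metadata for generated audio.
# This is NOT the actual C2PA manifest format, just the underlying idea:
# bind a cryptographic signature to the audio bytes plus provenance claims.
import json
import hashlib

from cryptography.hazmat.primitives.asymmetric import ed25519


def make_provenance_record(audio_bytes: bytes,
                           key: ed25519.Ed25519PrivateKey) -> dict:
    claims = {
        "generator": "example-voice-model",  # hypothetical claim field
        "content_sha256": hashlib.sha256(audio_bytes).hexdigest(),
    }
    payload = json.dumps(claims, sort_keys=True).encode()
    return {"claims": claims, "signature": key.sign(payload).hex()}


def verify_provenance(audio_bytes: bytes, record: dict,
                      pub: ed25519.Ed25519PublicKey) -> bool:
    claims = record["claims"]
    # Reject if the audio was altered after signing.
    if claims["content_sha256"] != hashlib.sha256(audio_bytes).hexdigest():
        return False
    payload = json.dumps(claims, sort_keys=True).encode()
    try:
        pub.verify(bytes.fromhex(record["signature"]), payload)
        return True
    except Exception:
        return False
```

In the real C2PA specification, the manifest also carries the signer’s certificate chain, so a verifier can establish who made the claims, not merely that a signature is valid.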
The report says Descript and Resemble AI, the other two companies surveyed, “took steps to make it more difficult for customers to misuse their products by creating a non-consensual voice clone.” Yet, while “imperfect safeguards are better than none,” more work is needed to implement stronger protections governed by more robust rules and enforcement.
Partnership between Reality Defender, ElevenLabs expands voice training datasets
In the words of Grace Gedye, a Consumer Reports policy analyst quoted in a summary of the investigation, “it’s clear that there are techniques companies can use to make it a bit harder to clone someone’s voice without their consent.”
The gravity of the issue becomes clear in financial terms. In a recent blog post, Reality Defender CEO Ben Colman cites a study by Deloitte’s Center for Financial Services predicting that generative AI could push fraud losses in the United States to $40 billion by 2027, up from $12.3 billion in 2023 – a compound annual growth rate of 32 percent.
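As a quick sanity check on those figures (the arithmetic below is ours, not Deloitte’s):

```python
# Implied compound annual growth rate from the cited endpoints:
# US$12.3B (2023) to US$40B (2027).
start, end, years = 12.3, 40.0, 2027 - 2023
cagr = (end / start) ** (1 / years) - 1
print(f"Implied CAGR: {cagr:.1%}")  # ~34.3%; the cited 32 percent likely
                                    # reflects Deloitte's unrounded endpoints
```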
Working together could help. Another blog, this one by Reality Defender CTO Ali Shahriyari, digs into the firm’s strategic partnership with ElevenLabs, which has seen the New York-based deepfake detection firm integrate ElevenLabs’ voice synthesis data from existing and future models into its detection systems.
According to Shahriyari, the impact has been transformative: “our training datasets have been enriched with over 295 hours of high-quality synthetic voice data, providing unprecedented depth and variety in our detection capabilities.”
The collaboration, says Shahriyari, has yielded a tenfold improvement in data generation efficiency, accelerating Reality Defender’s ability to adapt to emerging identity fraud threats. And, “perhaps most significantly, thanks to our partnership with ElevenLabs, the Reality Defender team has expanded our detection capabilities to cover multiple languages and accents, reflecting the global nature of synthetic voice challenges.”
More languages, more training on commercial-grade deepfakes
ElevenLabs’ synthetic voice data now represents approximately 20 percent of Reality Defender’s total training dataset, adding exposure to commercial-grade synthetic voices and critical real-world diversity.
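As an illustration of what a fixed vendor share means in practice, here is a minimal sketch of assembling a training set in which vendor-supplied synthetic audio makes up roughly 20 percent of the total. The sampling scheme is an assumption for illustration, not a description of Reality Defender’s pipeline.

```python
import random


def mix_training_set(base_data: list, vendor_synth: list,
                     vendor_share: float = 0.20, seed: int = 0) -> list:
    """Combine a base corpus with vendor synthetic clips at a fixed share.

    vendor_share is the fraction of the *final* set drawn from vendor data,
    so n_vendor / (len(base_data) + n_vendor) == vendor_share.
    """
    rng = random.Random(seed)
    n_vendor = int(vendor_share * len(base_data) / (1 - vendor_share))
    mixed = base_data + rng.sample(vendor_synth,
                                   min(n_vendor, len(vendor_synth)))
    rng.shuffle(mixed)  # avoid ordering artifacts during training
    return mixed
```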
Technically, the implementation focuses on three key areas: comprehensive model training to improve accuracy, multi-language capability enabling detection across eight languages, and inference-based detection that can identify synthetic content regardless of its origin.
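Reality Defender has not published its architecture, but an inference-based audio detector generically looks something like the sketch below: extract spectral features from an input clip and score it with a pretrained binary classifier. The feature choice (log-mel spectrograms), sample rate and model interface here are all assumptions for illustration.

```python
import torch
import torchaudio

# Log-mel features are a common front end for audio classifiers;
# the exact parameters here are illustrative.
mel = torchaudio.transforms.MelSpectrogram(sample_rate=16000, n_mels=80)


def detect(path: str, model: torch.nn.Module) -> float:
    """Return an assumed probability that the clip at `path` is synthetic."""
    wav, sr = torchaudio.load(path)
    if sr != 16000:
        wav = torchaudio.functional.resample(wav, sr, 16000)
    feats = mel(wav.mean(dim=0, keepdim=True))    # downmix to mono, then mel
    feats = torch.log(feats + 1e-6).unsqueeze(0)  # (batch, ch, mel, time)
    with torch.no_grad():
        logit = model(feats)                      # assumed scalar logit output
    return torch.sigmoid(logit).item()            # higher = more likely fake
```

Because the classifier scores the audio itself rather than checking for a watermark, this style of detection works on synthetic content regardless of which generator produced it – the “regardless of its origin” property described above.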
“Our enhanced ability to identify commercial-grade voice deepfakes represents a crucial advancement in protecting against sophisticated threats,” Shahriyari says. “Through more efficient data generation processes, we’ve accelerated our development cycles, enabling faster response to emerging synthetic voice technologies.”
The post calls the partnership “a model for how deepfake detection companies can work together to ensure the responsible development of powerful new technologies.”
Voice Channel AI Disruption enables long conversations with emotional bots
In responding to the Consumer Reports investigation, Surya Koppisetti, a senior applied scientist at Reality Defender, says that “a lot has changed” in the synthetic audio threat landscape: “not only is the generated audio very stable for a long conversation, it can be very expressive in its emotions. Human perception of what is a fake voice and what isn’t is no longer good enough.”
Yet another post by Colman addresses a new threat: Voice Channel AI Disruption, or VCAD. “Unlike traditional Telephony Denial of Service (TDoS) attacks that utilize AI-generated voices to overwhelm systems with high call volumes, VCAD employs sophisticated conversational AI bots to engage call center agents in prolonged, realistic dialogues,” Colman writes.
“These interactions drain corporate resources, evade standard detection mechanisms, and inflict significant financial and reputational damage.” He cites a study by Truecaller, which found that voice-based fraud results in $25 billion in annual losses in the U.S.
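Because VCAD calls are individually plausible, detection tends to lean on behavioral signals in call metadata rather than audio alone. The heuristic below is a hypothetical illustration of that idea – flagging unusually long calls that never reach a resolution event – and is not Reality Defender’s method; the record fields and threshold are assumptions.

```python
from dataclasses import dataclass


@dataclass
class CallRecord:
    caller_id: str
    duration_s: float
    resolved: bool  # did the call end in a completed task?


def flag_vcad_suspects(calls: list[CallRecord],
                       max_unresolved_s: float = 1200.0) -> list[CallRecord]:
    """Flag calls that ran past a threshold without reaching resolution."""
    return [c for c in calls
            if not c.resolved and c.duration_s > max_unresolved_s]
```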
Regulations on deepfakes still lagging, fragmented: Colman
“It is now a matter of fact that deepfake technology has reached a level of sophistication that makes it an immediate and ongoing threat to enterprises, financial institutions, and national security,” Colman writes. In a further blog, the CEO (and avid blogger) includes a list of regulatory measures meant to address the deepfake plague, but says the current regulatory ecosystem, a patchwork of state laws, amounts to a fragmented approach that “ensures that cybercriminals will continue to exploit inconsistencies, increasing fraud losses and undermining trust in digital communications.”
“The most effective legislative framework will address deepfakes not as isolated content issues but as sophisticated vectors for fraud, impersonation, and information warfare that threaten both individuals and organizations,” Colman writes. “Until such comprehensive regulation emerges, enterprises must rely on technical safeguards that protect their communications, authentication systems, and digital transactions from increasingly convincing AI-enabled threats.”
Whether regulation can keep pace with the threat is an open question. Even as cheap, simple voice cloning tools become increasingly accessible, large players like Microsoft and OpenAI have thus far held back their offerings from wider public release for fear of misuse. But the technology already exists to obliterate the boundary between real human voices and synthetic audio deepfakes, and to erase the line between vox populi and vox fallaciae.
Article Topics
biometrics | deepfake detection | deepfakes | ElevenLabs | fraud prevention | generative AI | Reality Defender | regulation | synthetic voice | voice biometrics