Meta declines to make voice tool public as BixeLab highlights voice fraud concerns
AI voice technology can quite literally bring a voice to the voiceless and help us transcend language barriers. Even with such impactful use cases, heightened security risks follow the rise of AI-generated voice technology, particularly for systems using biometric voice authentication and in social engineering attacks, as highlighted in the second issue of BixeLab’s I.D. Risk Alerts newsletter.
BixeLab notes the account of an Australian journalist who used an AI-generated clone of his own voice to gain unauthorized access to his Centrelink account. In the UK, a cybersecurity researcher used an AI-generated version of his own voice to access a bank account. The testing and consulting firm rates the criticality of the fraud risk as “high.”
Aware of the security risks, Meta recently announced – but did not release – its newest generative AI system, Voicebox. The technology can generate spoken dialogue through speech samples and text and has capabilities like speech denoising and editing, text-to-speech synthesis, and diverse speech sampling. Still, the tech giant is “not making the Voicebox model or code publicly available at this time” due to “the potential risks of misuse.”
Voicebox can create outputs from scratch or based on a sample it is given. With a word error rate of 1.9 percent, the system currently outperforms VALL-E, which has an error rate of 5.9 percent. On cross-lingual style transfer, Voicebox beats YourTTS with an average word error rate of 5.2 percent compared to 10.9 percent. It also outperforms both VALL-E and YourTTS on audio style similarity.
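For readers unfamiliar with the metric, word error rate is conventionally the number of word-level substitutions, deletions, and insertions needed to turn a transcript into the reference text, divided by the number of words in the reference. The short Python sketch below illustrates that standard definition only. It is not Meta's evaluation code, the function name word_error_rate is hypothetical, and it assumes simple lowercase whitespace tokenization.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the reference length.

    Illustrative only: assumes lowercase, whitespace-separated tokens.
    """
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # i deletions turn i reference words into nothing
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One substituted word out of six reference words.
    print(word_error_rate("the cat sat on the mat", "the cat sat on a mat"))
```

Under this definition, a rate of 1.9 percent corresponds to roughly two word errors per hundred reference words.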
The technology is built on Flow Matching, a non-autoregressive generative model that can learn a non-deterministic mapping between text and speech, enabling it to learn from varied speech data without using labels. As a result, Voicebox can train on more diverse data at a much larger scale.
Meta trained Voicebox with “more than 50,000 hours of recorded speech and transcripts from public domain audiobooks in English, French, Spanish, German, Polish, and Portuguese.” It can infill speech from context and generate the middle of an audio recording without having to re-create the input entirely.
Voicebox can use a two-second audio sample to match an audio style and then apply that style to text-to-speech generation, which can give a voice to someone unable to speak. Cross-lingual style transfer allows users to turn text from one language into audio in another language, creating a new avenue to overcome language barriers. It can also resynthesize speech to remove background noise, simplifying the audio editing process.
Voice authentication and security threats continue
Voicebox can reportedly enable nefarious AI-generated voice cloning that can defeat voice authentication.
The technology can also be used to strengthen social engineering attacks. At the 2023 Regional Anti-Scam Conference in Singapore, Sun Xueling, the Minister of State for Home Affairs, expressed concerns that this technology could be used to impersonate public figures and spread disinformation.
In January, an Arizona mother was the target of a ransom scam that used deepfake voice generation technology to trick her into thinking her own daughter had been kidnapped and held for ransom. “I will never be able to shake that voice and the desperate cries for help out of my mind,” she said in testimony to the Senate Judiciary Committee.