A new idea to fight voice deepfakes from Ruhr University Bochum researchers
Researchers from the Ruhr-University Bochum in Germany have released a new report with suggestions on how to tackle voice deepfakes through the use of a novel dataset.
Deepfake research has so far focused mainly on the “image domain,” the researchers say, while studies exploring generated audio signals have been neglected. To “narrow this gap,” Joel Frank and Lea Schönherr address three different aspects of the audio deepfake challenge.
The first is an introduction to common signal processing techniques for analyzing audio, including how to read spectrograms, along with an overview of Text-To-Speech (TTS) models.
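For readers unfamiliar with spectrograms, the idea can be sketched in a few lines of Python. This is a minimal illustration and not code from the paper: it cuts a signal into overlapping frames, applies a Hann window, and takes the magnitude of a naive discrete Fourier transform for each frame. The 8 kHz sample rate, the frame length of 64, the hop of 32, and the 1 kHz test tone are all assumptions chosen to keep the example small.

```python
import cmath
import math
from typing import List


def hann(n: int) -> List[float]:
    """Periodic Hann window of length n."""
    return [0.5 - 0.5 * math.cos(2 * math.pi * i / n) for i in range(n)]


def spectrogram(signal: List[float], frame_len: int = 64, hop: int = 32) -> List[List[float]]:
    """Return one list of magnitudes (bins 0..frame_len//2) per frame."""
    window = hann(frame_len)
    frames = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = [signal[start + i] * window[i] for i in range(frame_len)]
        mags = []
        for k in range(frame_len // 2 + 1):
            # Naive DFT: slow, but dependency-free and easy to follow.
            acc = sum(
                frame[n] * cmath.exp(-2j * math.pi * k * n / frame_len)
                for n in range(frame_len)
            )
            mags.append(abs(acc))
        frames.append(mags)
    return frames


SAMPLE_RATE = 8000  # Hz (assumed for this example)
FRAME_LEN = 64
TONE_HZ = 1000      # test tone (assumed for this example)

tone = [math.sin(2 * math.pi * TONE_HZ * t / SAMPLE_RATE) for t in range(512)]
spec = spectrogram(tone, frame_len=FRAME_LEN, hop=32)

# Each bin covers SAMPLE_RATE / FRAME_LEN = 125 Hz, so the tone lands in bin 8.
peak_bin = max(range(len(spec[0])), key=lambda k: spec[0][k])
peak_hz = peak_bin * SAMPLE_RATE / FRAME_LEN
```

Reading a spectrogram amounts to looking at where such peaks sit over time: each row of `spec` is one time step, and each column is one frequency band. Practical tools use a fast Fourier transform and often a Mel-scaled frequency axis, but the structure is the same.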
“While there has been some research into end-to-end models, typical TTS models consist of a two-stage approach,” write the researchers.
“First, we enter the text sequence which we want to generate. This sequence gets mapped by some model (or feature extraction method) to a low-dimensional intermediate representation, often linguistic features or Mel spectrograms. Second, we use an additional model (often called vocoder) to map this intermediate representation to raw audio.”
Specifically, the researchers focus on vocoder literature, since it directly connects to their work on audio deepfakes.
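The two-stage approach described in the quote can be pictured as two functions chained together. The sketch below is a toy illustration of that data flow only, not a working TTS system and not the researchers' code: `toy_acoustic_model` and `toy_vocoder` are hypothetical stand-ins, the single "pitch" value per frame is an invented intermediate representation, and the sample rate and frame size are assumptions.

```python
import math
from typing import Callable, List

Frame = List[float]     # one vector of the intermediate representation
Waveform = List[float]  # raw audio samples in [-1, 1]


def toy_acoustic_model(text: str) -> List[Frame]:
    """Stage 1 stand-in: map each character to a one-value frame (a pitch in Hz).

    A real system would output linguistic features or Mel spectrogram frames.
    """
    return [[200.0 + 10.0 * (ord(ch) % 32)] for ch in text]


def toy_vocoder(frames: List[Frame], sample_rate: int = 8000,
                samples_per_frame: int = 80) -> Waveform:
    """Stage 2 stand-in: turn each frame into a short burst of sine-wave audio."""
    audio: Waveform = []
    phase = 0.0
    for (pitch_hz,) in frames:
        step = 2 * math.pi * pitch_hz / sample_rate
        for _ in range(samples_per_frame):
            audio.append(math.sin(phase))
            phase += step  # keep the phase continuous across frames
    return audio


def synthesize(text: str,
               acoustic_model: Callable[[str], List[Frame]],
               vocoder: Callable[[List[Frame]], Waveform]) -> Waveform:
    """Chain the two stages: text -> intermediate frames -> raw audio."""
    return vocoder(acoustic_model(text))


audio = synthesize("hello", toy_acoustic_model, toy_vocoder)
```

Because the stages only meet at the intermediate representation, the vocoder can be swapped out without touching the first stage. The vocoder is also the stage that produces the final waveform, which is consistent with the researchers' focus on it.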
Second, the researchers present a novel dataset, built from nine sample sets generated by five different network architectures and spanning two languages.
The new dataset, hosted on Zenodo, consists of approximately 196 hours of generated audio files and is mostly based on the LJSPEECH and JSUT datasets. It also includes a range of architectures, including MelGAN, Parallel WaveGAN (PWG), and WaveGlow, among others.
Finally, Frank and Schönherr supplied practitioners with two baseline models adopted from the signal processing community and designed to facilitate further research in the area.
“To provide a baseline for future practitioners, we trained several baseline models. We evaluated their performance across the different data sets and multiple settings. Specifically, we trained Gaussian Mixture Model (GMM) and neural network-based solutions.”
While they found the neural networks performed better overall, the GMM classifiers proved to be more robust, which might give them an advantage in real-life settings.
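To give a sense of how a GMM-based detector can work, the sketch below scores audio frames with two mixture models, one standing for real speech and one for generated speech, and compares their log-likelihoods. This is a common arrangement in spoofing-detection baselines, but whether it matches the paper's exact setup is an assumption here. The `DiagonalGMM` class, the `llr_score` helper, and all parameter values are hypothetical and hand-picked for illustration; in practice the parameters would be fitted to extracted audio features.

```python
import math
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class DiagonalGMM:
    """Gaussian mixture with diagonal covariances (parameters assumed given)."""
    weights: List[float]          # mixture weights, summing to 1
    means: List[List[float]]      # one mean vector per component
    variances: List[List[float]]  # one variance vector per component

    def log_likelihood(self, x: Sequence[float]) -> float:
        component_logs = []
        for w, mu, var in zip(self.weights, self.means, self.variances):
            log_p = math.log(w)
            for xi, m, v in zip(x, mu, var):
                log_p += -0.5 * (math.log(2 * math.pi * v) + (xi - m) ** 2 / v)
            component_logs.append(log_p)
        # Log-sum-exp over components for numerical stability.
        top = max(component_logs)
        return top + math.log(sum(math.exp(c - top) for c in component_logs))


def llr_score(frames: List[Sequence[float]],
              real_gmm: DiagonalGMM, fake_gmm: DiagonalGMM) -> float:
    """Average log-likelihood ratio; positive means the real model fits better."""
    total = sum(real_gmm.log_likelihood(f) - fake_gmm.log_likelihood(f)
                for f in frames)
    return total / len(frames)


# Hand-picked toy parameters, not fitted to any data.
real_gmm = DiagonalGMM(weights=[0.5, 0.5],
                       means=[[0.0, 0.0], [1.0, 1.0]],
                       variances=[[1.0, 1.0], [1.0, 1.0]])
fake_gmm = DiagonalGMM(weights=[1.0],
                       means=[[4.0, 4.0]],
                       variances=[[1.0, 1.0]])

score_real_like = llr_score([[0.2, -0.1], [0.9, 1.1]], real_gmm, fake_gmm)
score_fake_like = llr_score([[4.1, 3.8], [3.9, 4.2]], real_gmm, fake_gmm)
```

Frames close to the "real" model produce a positive score and frames close to the "fake" model produce a negative one, so a threshold on the score serves as the decision rule.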
“Finally, we inspected the different classifiers using an attribution method. We found that lower frequencies cannot be neglected while high-frequency information proved indispensable.”
However, the researchers warn that the difficulty of obtaining realistic datasets has been a longstanding problem in the security community, and that it may limit how widely the results apply.
“Often benign data is readily available, but data used in malicious contexts is hard to come by. That leaves us with estimating real-world performance on proxy data.”
Frank and Schönherr argue that, in their case, the odds are good that the results will transfer to the kinds of data used in actual attacks.
“Currently, images generated by off-the-shelf neural networks are used in malicious attempts. We expect the number of audio Deepfakes to increase as well.”
For more information about the Ruhr-University Bochum paper, you can follow this link to read it in its entirety.