Pindrop presents three research papers on voice biometrics, speech recognition at ICASSP
Three research papers from Pindrop were presented at the 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), indicating the direction of the company’s efforts to further innovate in voice biometrics and speech recognition technologies.
The first paper is titled ‘Distribution Learning for Age Estimation from Speech.’ It explores a different approach to voice-based age estimation, modeling the task as a distribution learning problem rather than the traditional classification or regression problem. The first obstacle Pindrop’s researchers encountered with distribution learning is that audio research lacks datasets tagged with “apparent” age.
However, they also found that distribution learning, already validated for facial age estimation, remains viable for audio, meaning a general age range can be estimated at a particular confidence interval. The paper concludes that while distribution learning for speech is more constrained than for facial age estimation, it can even outperform regression and classification algorithms under both matched and mismatched conditions.
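To make the distinction concrete, a common way to frame age estimation as distribution learning is to replace the single age label with a soft distribution over discrete age bins and train the model to match it. The sketch below illustrates that setup only; the bin range, Gaussian label shape, and sigma value are assumptions for illustration, not details taken from the paper.

```python
import math

# Assumed discrete age bins for illustration.
AGES = list(range(18, 81))

def soft_age_label(true_age, sigma=3.0):
    """Gaussian label distribution centered on the annotated age,
    instead of a one-hot class label or a single regression target."""
    weights = [math.exp(-0.5 * ((a - true_age) / sigma) ** 2) for a in AGES]
    total = sum(weights)
    return [w / total for w in weights]

def kl_divergence(target, predicted, eps=1e-12):
    """A typical training loss for distribution learning:
    KL(target || predicted) over the age bins."""
    return sum(t * (math.log(t + eps) - math.log(p + eps))
               for t, p in zip(target, predicted))

def expected_age(predicted):
    """At inference, a point estimate is the expectation over bins."""
    return sum(a * p for a, p in zip(AGES, predicted))

# A perfectly matched prediction recovers the annotated age.
target = soft_age_label(35.0)
estimate = expected_age(target)  # close to 35 by construction
```

Because the label is a distribution rather than a point, the model’s output directly supports the “age range at a confidence interval” reading described above.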
The second paper is titled ‘Speaker Embedding Conversion for Backward and Cross-Channel Compatibility.’ It examines solutions to compatibility issues among voice biometric authentication providers that have been migrating their models to newer deep learning techniques. Pindrop’s researchers suggest a deep neural network (DNN)-based method to allow for backward compatibility. The experimental results found that the DNN can deliver feature-embedding compatibility between two automatic speaker verification (ASV) systems with improved performance over a baseline converter system, though the converted feature embeddings performed worse than the traditional ASV systems in the low false acceptance rate (FAR) range. The researchers say an extension of their work could explore score calibration to improve performance in that low-FAR range.
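The core idea of embedding conversion can be sketched as learning a mapping from a legacy ASV system’s embedding space into a newer system’s space, using utterances enrolled in both systems as paired training data. The paper uses a deep neural network; the toy sketch below uses a single linear layer trained by gradient descent on synthetic paired embeddings, purely to illustrate the training setup, and all dimensions and hyperparameters are invented.

```python
import random

random.seed(0)
OLD_DIM, NEW_DIM = 4, 3  # toy dimensions; real speaker embeddings are far larger

def apply(w, x):
    """Apply a linear map (matrix) w to an embedding vector x."""
    return [sum(wi[j] * x[j] for j in range(len(x))) for wi in w]

# Ground-truth mapping, used only to fabricate paired toy data
# (legacy embedding -> corresponding new-system embedding).
true_w = [[random.uniform(-1, 1) for _ in range(OLD_DIM)] for _ in range(NEW_DIM)]
pairs = []
for _ in range(200):
    x = [random.uniform(-1, 1) for _ in range(OLD_DIM)]
    pairs.append((x, apply(true_w, x)))

# Train the converter with plain stochastic gradient descent on MSE.
w = [[0.0] * OLD_DIM for _ in range(NEW_DIM)]
lr = 0.1
for _ in range(300):
    for x, y in pairs:
        y_hat = apply(w, x)
        for i in range(NEW_DIM):
            err = y_hat[i] - y[i]
            for j in range(OLD_DIM):
                w[i][j] -= lr * err * x[j]

# After training, converted legacy embeddings land close to the
# new system's embeddings for the same utterance.
x0, y0 = pairs[0]
mse = sum((a - b) ** 2 for a, b in zip(apply(w, x0), y0)) / NEW_DIM
```

In practice a converter like this lets enrolled voiceprints from the old system be scored by the new system without re-enrolling every speaker, which is the compatibility problem the paper addresses.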
The third paper, ‘Unsupervised Model Adaptation for End-to-End ASR,’ looks into ways to improve automatic speech recognition (ASR) transcription systems, which often struggle with mismatched train-test conditions, as in call centers where accents and audio quality vary widely. The Pindrop researchers propose a cost-effective way to improve the accuracy of ASR systems using in-domain data, without the need for costly human annotations. They do so by exploiting the relationship between word error rate (WER) and connectionist temporal classification (CTC, a training objective for alignment-free sequence transcription) loss on one hand, and between WER and the probability ratio-based confidence (PRC) on the other. The approach reduced WER by 8 percent in absolute terms without supervision, allowing the system to adapt to suboptimal conditions.
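A typical way to act on the observation that confidence correlates with WER is to let the model transcribe unlabeled in-domain audio and keep only the high-confidence outputs as pseudo-labels for adaptation. The sketch below shows only that selection step; the field names, confidence values, and threshold are all hypothetical, not taken from the paper.

```python
# Data-selection step for unsupervised adaptation: since confidence measures
# such as PRC track WER, high-confidence ASR hypotheses can serve as
# pseudo-labels for fine-tuning without human transcripts.

def select_pseudo_labels(hypotheses, confidence_threshold=0.85):
    """Keep only (audio_id, transcript) pairs the model is confident about."""
    return [(h["audio_id"], h["transcript"])
            for h in hypotheses
            if h["confidence"] >= confidence_threshold]

# Illustrative ASR outputs on unlabeled call-center audio.
hyps = [
    {"audio_id": "call_001", "transcript": "hello thanks for calling", "confidence": 0.93},
    {"audio_id": "call_002", "transcript": "uh can you um", "confidence": 0.41},
    {"audio_id": "call_003", "transcript": "i need to update my address", "confidence": 0.88},
]
selected = select_pseudo_labels(hyps)
# The low-confidence hypothesis (call_002) is dropped; the rest can be
# used to fine-tune the model on in-domain data.
```

The threshold trades off pseudo-label quantity against quality: a higher cutoff yields cleaner transcripts but fewer adaptation examples.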
However, Pindrop says that the research is experimental and does not reflect the performance of its products.
The online paper presentation portion of ICASSP closes this week, with the in-person event running in Singapore from May 22 to 27.