Google develops on-device real-time speech recognition with new neural network technique
Google is rolling out an end-to-end on-device speech recognition technology entirely driven by neural networks for speech input in its Gboard virtual keyboard app.
In a blog post, Google describes a recent paper presenting a new model, trained with recurrent neural network transducer (RNN-T) technology, that is compact enough to run on a smartphone. According to “Streaming End-to-End Speech Recognition for Mobile Devices,” end-to-end models directly predict character output based on speech input, and are good candidates for running speech recognition on edge devices. The Google research team found in its experiments that the RNN-T approach outperformed a conventional model based on connectionist temporal classification (CTC) in both latency and accuracy.
Traditional speech recognition systems consist of an acoustic model that identifies phonemes (sound units) from segments of audio, a pronunciation model that connects phonemes into words, and a language model that estimates the likelihood of a given phrase, according to the blog. Around 2014, researchers began attempting to go directly from input waveform to output sentence by training a single neural network, which led to the development of “attention-based” and “listen-attend-spell” models. While these systems have promising accuracy, they typically need to analyze the whole input sequence before producing any output, so they cannot support real-time transcription. CTC techniques were also developed, decreasing the latency of speech recognition systems.
“This proved to be an important step in creating the RNN-T architecture adopted in this latest release, which can be seen as a generalization of CTC,” Johan Schalkwyk, a Google Fellow with the company’s Speech Team, writes.
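To make the CTC idea concrete: a CTC model labels every audio frame, using a special blank symbol for frames that add nothing, and the transcript is read off by merging consecutive repeats and then dropping the blanks. The following is a minimal sketch of that collapsing step only, not Google's code; the `_` blank marker, the function name, and the sample frame labels are assumptions made for illustration.

```python
from itertools import groupby

BLANK = "_"  # hypothetical blank marker chosen for this sketch


def ctc_collapse(frame_labels, blank=BLANK):
    """Turn per-frame labels into a transcript: merge repeats, then drop blanks."""
    # groupby yields one key per run of identical consecutive labels
    merged = [label for label, _ in groupby(frame_labels)]
    return "".join(label for label in merged if label != blank)


# Twelve made-up frame labels for one spoken word
frames = list("hh_e_ll_lloo")
print(ctc_collapse(frames))
```

The blank between the two runs of `l` is what lets a doubled letter survive the merge step.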
The RNN-T model outputs characters one by one, using a feedback loop that feeds predicted symbols (usually letters) back into the model to predict the next one. Early versions reduced word error rates, but training was computationally intensive. The researchers developed a parallel implementation that runs efficiently in large batches on Google’s TPU v2 high-performance cloud hardware, which sped up training.
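The feedback loop can be pictured with a toy greedy decoder. This is a sketch under stated assumptions, not the model from the paper: `toy_joint` is a hypothetical stand-in for the trained network, each "frame" is simply the text audible so far, and `None` plays the role of the blank symbol that tells the decoder to move on to the next frame.

```python
def toy_joint(frame, history):
    """Hypothetical stand-in for the trained network.

    `frame` is the text audible so far; `history` is the list of symbols
    already emitted. Returns the next symbol, or None (blank) when the
    output has caught up with the audio.
    """
    if len(history) < len(frame):
        return frame[len(history)]
    return None


def greedy_transduce(frames, joint, max_symbols_per_frame=5):
    """Emit symbols one by one, feeding the output so far back into `joint`."""
    output = []
    for frame in frames:
        for _ in range(max_symbols_per_frame):
            symbol = joint(frame, output)  # prediction depends on past output
            if symbol is None:             # blank: advance to the next frame
                break
            output.append(symbol)
    return "".join(output)


frames = ["h", "h", "hel", "hel", "hello"]
print(greedy_transduce(frames, toy_joint))
```

Because text is emitted as each frame arrives rather than after the whole utterance, a loop of this shape is compatible with streaming transcription.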
The search graphs used by traditional speech recognition engines are still too large to run on mobile devices, however, and Google’s production models were almost 2GB despite sophisticated decoding techniques. The researchers developed a decoding method that uses a beam search through a single neural network to achieve the same accuracy with a 450MB model. They then shrank the model further with parameter quantization and hybrid kernel techniques, bringing the final size down to 80MB.
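Parameter quantization, in general terms, stores each weight in fewer bits than the 32-bit floats typically used in training. The sketch below shows one common variant, symmetric 8-bit linear quantization with a single shared scale. The article does not describe Google's exact scheme or bit width, so the 8-bit choice, the function names, and the sample weights are all assumptions for illustration.

```python
from array import array


def quantize(weights):
    """Map floats to signed 8-bit codes using one shared scale (assumed scheme)."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs else 1.0
    codes = [round(w / scale) for w in weights]
    return codes, scale


def dequantize(codes, scale):
    """Recover approximate float weights from the 8-bit codes."""
    return [c * scale for c in codes]


weights = [0.82, -1.27, 0.03, 0.5]  # made-up sample weights
codes, scale = quantize(weights)
restored = dequantize(codes, scale)

# Storage per weight: 4 bytes as float32 versus 1 byte as int8
float_bytes = array("f", weights).itemsize
int8_bytes = array("b", codes).itemsize
print(float_bytes, int8_bytes)

# Each restored weight is within half a quantization step of the original
print(max(abs(w - r) for w, r in zip(weights, restored)) <= scale / 2 + 1e-12)
```

Going from 32 bits to 8 bits per weight cuts storage roughly fourfold at the cost of a small, bounded rounding error per weight.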
The new Gboard speech recognizer will initially launch on Pixel phones in American English, but the researchers are optimistic that more languages and domains of application can be added with specialized hardware and algorithm improvements.