Google develops on-device real-time speech recognition with new neural network technique


Google is rolling out an end-to-end on-device speech recognition technology entirely driven by neural networks for speech input in its Gboard virtual keyboard app.

In a blog post, Google describes a recent paper presenting a new model, trained using recurrent neural network transducer (RNN-T) technology, that is compact enough to run on a smartphone. According to “Streaming End-to-End Speech Recognition for Mobile Devices,” end-to-end models directly predict character output from speech input, and are good candidates for running speech recognition on edge devices. In its experiments, the Google research team found that the RNN-T approach outperformed a conventional model based on connectionist temporal classification (CTC) in both latency and accuracy.

Traditional speech recognition systems chain several components: an acoustic model that identifies phonemes (sound units) from segments of audio, a pronunciation model that connects phonemes into words, and a language model that scores the likelihood of a given phrase, according to the blog. Around 2014, researchers began attempting to go directly from input waveform to output sentence by training a single neural network, which led to the development of “attention-based” and “listen-attend-spell” models. While these systems have promising accuracy, they typically require the whole input sequence to be analyzed before producing any output, so they cannot support real-time transcription. CTC techniques were also developed, decreasing the latency of speech recognition systems.

“This proved to be an important step in creating the RNN-T architecture adopted in this latest release, which can be seen as a generalization of CTC,” writes Johan Schalkwyk, a Google Fellow with the company’s Speech Team.

The RNN-T model outputs characters one by one, using a feedback loop that feeds each predicted symbol (usually a letter) back into the model to predict the next one. Early versions reduced word error rates, but training was computationally intensive. The researchers developed a parallel implementation that runs efficiently in large batches on Google’s TPU v2 high-performance cloud hardware, which sped up training.
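
The feedback loop described above can be sketched in a few lines of Python. This is only a toy illustration, not Google's implementation: the `TOY_MODEL` lookup table, the frame names, and the `toy_joint` and `greedy_decode` functions are all hypothetical stand-ins for a trained network and its decoder. The point is the control flow: at each audio frame the model sees the previously emitted symbol, emits either a new symbol or a "blank", and only advances to the next frame on a blank.

```python
# Toy sketch of RNN-T style greedy decoding. Everything here is a
# hypothetical stand-in; a real system uses trained neural networks.

BLANK = "<blank>"

# Hypothetical "model": maps (audio frame, previous symbol) to the next symbol.
# Any pair that is not listed yields BLANK.
TOY_MODEL = {
    ("f0", None): "h",
    ("f1", "h"): "i",
}


def toy_joint(frame, prev_symbol):
    """Return the symbol predicted for this frame given the previous symbol."""
    return TOY_MODEL.get((frame, prev_symbol), BLANK)


def greedy_decode(frames, max_symbols_per_frame=5):
    """Emit symbols one at a time, feeding each prediction back into the model."""
    output = []
    prev_symbol = None
    for frame in frames:
        # Several symbols may be emitted for one frame before moving on.
        for _ in range(max_symbols_per_frame):
            symbol = toy_joint(frame, prev_symbol)
            if symbol == BLANK:
                break  # blank means "nothing more here, go to the next frame"
            output.append(symbol)
            prev_symbol = symbol  # feedback: the prediction becomes the next input
    return "".join(output)


transcript = greedy_decode(["f0", "f1", "f2"])
```

Because output is produced frame by frame as audio arrives, a loop of this shape can emit text while the speaker is still talking, which is what makes the approach suitable for streaming.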

The search graphs used by traditional speech recognition engines are still too large to run on mobile devices, however, and Google’s production models were almost 2GB despite sophisticated decoding techniques. The researchers developed a decoding method that runs a beam search through a single neural network, achieving the same accuracy with a 450MB model. Parameter quantization and hybrid kernel techniques then reduced the final model to 80MB.
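
To give a sense of how parameter quantization shrinks a model, here is a minimal Python sketch of one common scheme, symmetric 8-bit linear quantization. The article does not say which scheme Google used, so this is an assumption for illustration only; the `quantize_int8` and `dequantize` functions and the sample weights are hypothetical. Each 32-bit floating-point weight is replaced by an 8-bit integer plus one shared scale factor, cutting weight storage to a quarter at the cost of a small rounding error.

```python
from array import array


def quantize_int8(weights):
    """Map floats to signed 8-bit integers using one shared scale factor."""
    max_abs = max(abs(w) for w in weights)
    # Guard against an all-zero weight list.
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = array("b", (round(w / scale) for w in weights))
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate float weights from the 8-bit values."""
    return [q * scale for q in quantized]


# Hypothetical sample weights, chosen only for illustration.
weights = [0.12, -0.5, 0.33, 0.0, 0.49]
quantized, scale = quantize_int8(weights)
restored = dequantize(quantized, scale)

# Storage for the weights alone: 4 bytes per float32 vs 1 byte per int8.
float_bytes = len(weights) * array("f").itemsize
int8_bytes = len(quantized) * quantized.itemsize
```

Each restored weight differs from the original by at most half of `scale`, which is why quantized models can often keep accuracy close to the full-precision version.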

The new Gboard speech recognizer will initially launch on Pixel phones in American English, but the researchers are optimistic that more languages and domains of application can be added with specialized hardware and algorithm improvements.

Syntiant launched a new line of speech processors for edge devices at MWC 2019 earlier this year, and the voice and speech recognition markets are projected to be worth $6.9 billion by 2025.
