IBM researchers achieve new records in speech recognition

March 10, 2017

IBM researchers have set a milestone in conversational speech recognition by achieving a new industry record of a 5.5 percent word error rate, surpassing the company's previous record of 6.9 percent, according to IBM's blog post.
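Word error rate is the standard metric behind these records: the minimum number of word substitutions, deletions, and insertions needed to turn the system's transcript into a reference transcript, divided by the number of words in the reference. A minimal sketch of the computation:

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein edit distance divided by reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = minimum edits to turn ref[:i] into hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i  # deleting all i reference words
    for j in range(len(hyp) + 1):
        dp[0][j] = j  # inserting all j hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1  # substitution cost
            dp[i][j] = min(dp[i - 1][j] + 1,       # deletion
                           dp[i][j - 1] + 1,       # insertion
                           dp[i - 1][j - 1] + cost)
    return dp[len(ref)][len(hyp)] / len(ref)
```

On this measure, a 5.5 percent rate means the system gets roughly one word in eighteen wrong on the benchmark transcripts.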

The researchers achieved this record on a difficult speech recognition task: transcribing recorded conversations between humans discussing typical everyday topics such as "buying a car."

This recorded corpus, titled “SWITCHBOARD”, has been used for over two decades to benchmark speech recognition systems.

To achieve the 5.5 percent record, the researchers focused on extending the company's application of deep learning technologies by combining Long Short-Term Memory (LSTM) and WaveNet language models with three strong acoustic models.

The first two models were six-layer bidirectional LSTMs, one equipped with multiple feature inputs and the other trained with speaker-adversarial multi-task learning.

The third model learns from both positive and negative examples, which effectively increases its intelligence over time and boosts its performance when similar speech patterns are repeated.
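The article does not spell out how the language and acoustic model scores are fused, but a common recipe for combining multiple models in speech recognition is log-linear interpolation: each model assigns a log-probability to a candidate transcript, and the weighted sum becomes the candidate's overall score. A minimal sketch, with all scores and weights made up purely for illustration:

```python
def combine_scores(log_probs, weights):
    """Log-linear interpolation: weighted sum of per-model log-probabilities."""
    return sum(w * lp for w, lp in zip(weights, log_probs))

# Hypothetical log-probabilities for one candidate transcript from an
# LSTM language model, a WaveNet-style language model, and an acoustic
# model; the weights would be tuned on held-out data.
lstm_lp, wavenet_lp, acoustic_lp = -12.4, -11.8, -30.2
score = combine_scores([lstm_lp, wavenet_lp, acoustic_lp], [0.5, 0.5, 1.0])
```

The candidate with the highest combined score is chosen as the final transcript; this rescoring step is where strong language models typically earn their keep.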

The ultimate industry goal has long been to achieve human parity, meaning an error rate equivalent to that of a human transcribing a conversation between two people.

IBM noted that other researchers in the field are also working towards this goal, with some experts recently claiming that 5.9 percent is equivalent to human parity.

In the process of achieving the 5.5 percent error rate, the IBM researchers determined that human parity is actually at 5.1 percent.

IBM researchers determined this number by reproducing human-level results with the help of the company's speech and search technology partner, Appen.

With the discovery that human parity is at 5.1 percent, the researchers said they have a considerable amount of work to do before they can claim their technology is on par with humans.

The team consulted several industry experts to get their feedback on the study including Yoshua Bengio, leader of the University of Montreal’s MILA (Montreal Institute for Learning Algorithms) Lab, who agreed that the researchers still have a considerable amount of work to do in order to achieve human parity.

“In spite of impressive advances in recent years, reaching human-level performance in AI tasks such as speech recognition or object recognition remains a scientific challenge,” said Bengio. “Indeed, standard benchmarks do not always reveal the variations and complexities of real data.

“For example, different data sets can be more or less sensitive to different aspects of the task, and the results depend crucially on how human performance is evaluated, for example using skilled professional transcribers in the case of speech recognition. IBM continues to make significant strides in advancing speech recognition by applying neural networks and deep learning into acoustic and language models.”

The researchers have also realized that the industry standard measurement for human parity is far more complex than it seems.

Aside from SWITCHBOARD, there is another industry corpus called “CallHome” that offers a different set of linguistic data to test against. This data is generated from more informal conversations between family members on improvisational topics.

It is significantly more difficult for machines to transcribe conversations from CallHome data than from SWITCHBOARD data, making breakthroughs harder to achieve.

With the CallHome data, IBM researchers achieved an industry record of a 10.3 percent word error rate, and with the help of Appen, measured human performance on the same data to be 6.8 percent.

Another challenge is that with SWITCHBOARD, some of the same speakers who appear in the test data also appear in the training data used to build the acoustic and language models.

Since CallHome has no such overlap, the speech recognition models never encounter the test speakers during training. The absence of familiar voices widens the discrepancy between human and machine performance.

The researchers said that ongoing advancements in IBM's deep learning technologies that can detect these kinds of repetitions are essential to resolving these issues.

The research team details the automatic speech recognition milestone in a new white paper.


About Justin Lee

Justin Lee has been a contributor with Biometric Update since 2014. Previously, he was a staff writer for web hosting magazine and website, theWHIR. For more than a decade, Justin has written for various publications on issues relating to technology, arts and culture, and entertainment. Follow him on Twitter @BiometricJustin.