If you use spoken commands to initiate Google searches on your iPhone or Android device, you might notice that the results you’re getting are not only more accurate but coming in faster than ever. That’s because Google researchers have developed a new way to combine acoustic models with machine learning for improved handling of voice queries, even when there’s background noise.
The improved capabilities are the latest of several updates Google has made to its voice recognition and transcription tools in recent months. Earlier this month, Google rolled out a new transcription tool for Docs in Google Chrome with support for more than 40 languages. And in July, the company’s software engineers announced that better neural network modeling had reduced transcription errors by 49 percent in Google Voice and the company’s Project Fi phone service.
In a blog post yesterday, members of the Google Speech Team said that they’d achieved further improvements in how machines can “understand” spoken language through new refinements to recurrent neural network modeling. Recurrent neural networks (RNNs) are neural networks whose feedback connections let them model sequences that unfold over time; in this case, that means accurately “hearing” spoken words on the fly, in real time.
New Models Are ‘Blazingly Fast’
The improved RNNs are “more accurate, especially in noisy environments, and they are blazingly fast,” according to members of the team. They also require “much lower computational resources,” they noted.
Haşim Sak, one of the team members, posted a video on YouTube showing how Google’s improved model helps a computer recognize a simple sentence like, “How cold is it outside?”
Most humans processing those words wouldn’t think twice about how their brains put together each individual sound, or phoneme, in the sentence. But it takes acoustic models built on advanced neural networks, using techniques like “connectionist temporal classification” and “sequence discriminative training,” to enable a machine to make sense of those sounds.
The Google team explained it this way: modeling can help a computer capture the individual sounds in a word like “museum” — spelled out in phonetic notation, that would be “/m j u z i @ m/” — and remember each sound with long short-term memory so that, once the entire word is spoken, it can then assemble the sequence of sounds into a word that best matches what was spoken. The tricky part: how to make this happen in real time, the engineers noted.
Fresh Wave of Advances Expected Soon
The team ran into another challenge when they began testing the model on real-world voice traffic. That’s when the engineers discovered the model was delaying its predictions of what sound came next after each phoneme in a word by about 300 milliseconds.
“[I]t had just learned it could make better predictions by listening further ahead in the speech signal,” they said. “This was smart, but it would mean extra latency for our users, which was not acceptable.”
Some additional training helped the computer to output its sound predictions closer to real time, which solved that problem.
Google, of course, isn’t the only technology company working on advances in speech recognition and artificial intelligence technologies. Microsoft and Facebook, for instance, have teams of engineers and researchers focused on developing those capabilities as well.
Microsoft currently leads in the number of speech recognition patents held, according to a recent report from iRunway, a technology consulting, finance and litigation firm.
“Another force of change will soon arrive as at least 172 seminal patents belonging to the leading 10 seminal patent owners expire in 2016, bringing them into public domain,” the iRunway report noted. “This will likely prompt a new wave of development in the speech recognition domain with a dramatic impact on the application of this technology.”