Good Technology Exists – So Why Does Speech Recognition Still Fall Short?

At Mobile World Congress, I participated in ZTE’s Mobile Voice Alliance panel. ZTE presented research data from China showing that people want to use speech recognition on their phones, but they don’t use it because it doesn’t work well enough. I have seen similar data on US mobile phone users, and the automotive industry has published data pointing to the same high level of dissatisfaction with speech recognition.

In fact, when I bought my new car last year I wanted the state of the art in speech recognition to make navigation easier… but sadly I’ve come to learn that the system in my Lexus just doesn’t work well; even the voice dialing is unreliable.

As an industry, I feel we must do better than this, so in this blog I’ll offer my two cents on why speech recognition isn’t where it should be today, even though technology that works well exists:

  1. Many core algorithms, especially the ones provided to the automotive industry, are just not that good. It’s kind of ironic, but the largest independent supplier of speech technologies actually has one of the worst-performing speech engines. Sadly, it’s this engine that gets used by many of the automotive companies, as well as some of the mobile companies.
  2. Even many of the good engines don’t work well in noise. In many tests, Google’s speech recognition comes out on top, but when the environment gets noisy even Google fails. I use my Moto X to voice dial while driving (at least I try to), and I also listen to music while driving. The “OK Google Now” trigger works great (kudos to Sensory!), but everything I say after that gets lost and I see an “it’s too noisy” message from Google. I end up turning down the radio to voice dial, or I use Sensory’s VoiceDial app, because Sensory always works… even when it’s noisy!
  3. Speech application designs are really bad. I was using the recognizer last week on a popular phone. The room was quiet, I had a great internet connection, and the recognizer was working great, but as a user I was totally confused. I said “set alarm for 4am” and it accurately transcribed “set alarm for 4am,” but rather than confirm that the alarm was set for 4am, it asked me what I wanted to do with the alarm. I repeated the command; it accurately transcribed it again and asked one more time what I wanted to do with the alarm. Even though it was recognizing correctly, it was interfacing so poorly with me that I couldn’t tell what was happening, and it didn’t appear to be doing what I asked it to do. Simple and clear application designs can make all the difference in the world (the first sketch after this list shows the kind of confirmation flow I mean).
  4. Wireless connections are unreliable. This is a HUGE issue. If the recognizer only works when there’s a strong Internet connection, then the recognizer is going to fail A GREAT DEAL of the time. My prediction: over the next couple of years, the speech industry will come to realize that embedded speech recognition offers HUGE advantages over the common cloud-based approaches used today, and these advantages cover not just accuracy and response time, but privacy too! (The second sketch after this list shows one way an embedded-first design could work.)
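
To illustrate the design point in item 3, here is a minimal sketch of a confirm-what-you-did flow: parse the transcription into an intent, act on it, and tell the user exactly what happened. The intent pattern and function names are illustrative assumptions on my part, not any vendor’s API.

```python
import re

# The intent pattern and function names below are illustrative assumptions,
# not any vendor's API.
ALARM_PATTERN = re.compile(r"set (?:an )?alarm for (\d{1,2}(?::\d{2})?\s?(?:am|pm))",
                           re.IGNORECASE)

def handle_transcription(text: str) -> str:
    """Map a transcription to an action plus a clear confirmation for the user."""
    match = ALARM_PATTERN.search(text)
    if match:
        alarm_time = match.group(1)
        set_alarm(alarm_time)                      # act on the request immediately
        return f"OK, alarm set for {alarm_time}."  # confirm exactly what was done
    # If the intent is unclear, say so plainly instead of re-prompting vaguely.
    return "Sorry, I didn't understand. Try something like 'set alarm for 4am'."

def set_alarm(alarm_time: str) -> None:
    # Placeholder for the device's real alarm-scheduling call.
    print(f"[alarm scheduled for {alarm_time}]")

print(handle_transcription("set alarm for 4am"))  # -> OK, alarm set for 4am.
```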

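And to illustrate the embedded-first idea in item 4, here is a minimal sketch that runs an on-device recognizer first and only falls back to the cloud when the device result is uncertain and a connection actually exists. The engine classes and the confidence threshold are hypothetical stand-ins, not a real SDK.

```python
from dataclasses import dataclass

# The engine classes and the 0.8 confidence threshold are hypothetical
# stand-ins, not a real SDK.

@dataclass
class Result:
    text: str
    confidence: float

class EmbeddedRecognizer:
    def recognize(self, audio: bytes) -> Result:
        # Stand-in for an on-device engine: always available, low latency,
        # and the audio never leaves the device (the privacy advantage).
        return Result(text="call mom", confidence=0.92)

class CloudRecognizer:
    def recognize(self, audio: bytes) -> Result:
        # Stand-in for a cloud engine: often more accurate, but only
        # reachable when there is a good connection.
        return Result(text="call mom", confidence=0.99)

def recognize(audio: bytes, connected: bool) -> str:
    local = EmbeddedRecognizer().recognize(audio)
    # Trust the embedded result when it is confident, or when there is no network.
    if local.confidence >= 0.8 or not connected:
        return local.text
    return CloudRecognizer().recognize(audio).text

print(recognize(b"\x00", connected=False))  # still works with no connection
```
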
Deep learning nets have enabled some amazing progress in speech recognition over the last five years. The next five years will see embedded recognition with high-performance noise cancelling and beamforming coming to the forefront, and Sensory will be leading this charge… and just as Sensory led the way with the “always on” low-power trigger, I expect to see Google, Apple, Microsoft, Amazon, Facebook and others follow suit.