The Problem with Speech Recognition

Speech recognition today works great. Sure, accents create challenges, and poor signal-to-noise ratios can make things more difficult, but through the use of deep learning approaches to modeling, we are able to achieve accuracy rates well above 90%.

When you add a little Natural Language Understanding (NLU) around the voice recognition engine, it is not hard to correct simple speech confusion problems. For example, if I ask my watch “What time is it?” it can make a common substitution error between “time” and “dime,” but through statistical modeling and/or NLU approaches it’s pretty easy to decide that the user wants to know the time.
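As a rough illustration of the statistical side of that correction, here is a minimal sketch that rescores two acoustically similar hypotheses with a toy bigram language model. Everything in it is an assumption made for illustration: the names `N_BEST`, `BIGRAM_PROB`, `lm_log_prob` and `rescore` are hypothetical, the scores are invented numbers, and a production recognizer would use a far larger model.

```python
import math

# Hypothetical recognizer output: (transcript, acoustic score in (0, 1]).
# The numbers are invented for illustration, not measured.
N_BEST = [
    ("what dime is it", 0.52),
    ("what time is it", 0.48),
]

# Hypothetical bigram probabilities; unseen word pairs fall back to FLOOR.
BIGRAM_PROB = {
    ("what", "time"): 0.20,
    ("time", "is"): 0.30,
    ("is", "it"): 0.40,
    ("what", "dime"): 0.0001,
    ("dime", "is"): 0.01,
}
FLOOR = 1e-6


def lm_log_prob(transcript):
    """Sum the log-probabilities of each adjacent word pair."""
    words = transcript.split()
    return sum(math.log(BIGRAM_PROB.get(pair, FLOOR))
               for pair in zip(words, words[1:]))


def rescore(n_best, lm_weight=1.0):
    """Return the transcript with the best combined acoustic + language score."""
    def combined(item):
        text, acoustic = item
        return math.log(acoustic) + lm_weight * lm_log_prob(text)
    return max(n_best, key=combined)[0]


print(rescore(N_BEST))                 # language model included
print(rescore(N_BEST, lm_weight=0.0))  # acoustic score only
```

With the language model included, the “time” hypothesis wins even though “dime” scored slightly higher acoustically; with `lm_weight=0.0` the acoustic favorite is returned unchanged.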

I would argue that today the recognizers have gotten so good that more problems actually reside in the NLU than in the speech recognition. The more domains that are covered, the more likely domain confusion becomes. This is exactly why Sensory’s embedded speech engine can outperform much larger cloud-based solutions.

But there is a deeper-seated problem than NLU and domain confusion. It’s a problem that exists outside of morphology, syntax, phonology, deep learning, statistics, and all the various things we associate with speech recognition.

The issue is a speech recognizer that works fine, but that can’t communicate with the systems it should control. This problem doesn’t exist in human speech. When we learn a language, we can communicate with another person who knows that language. It doesn’t matter what electronics they have in their pocket, or what car they are driving, or what shirt they are wearing.

But speech recognition systems today are getting more complex. They are trying to control other items, but that control isn’t always within their grasp.

Amazon’s Voice Interoperability Initiative (VII) will help to address some of these issues by creating common protocols and communication systems across voice assistants. But that’s just a first step and doesn’t address the full scope of the problem.

We live in a world of competing systems. We have multiple streaming video platforms coming into our homes, and different family members subscribe to different audio platforms. As a family, we are Amazon Prime members with subscriptions to Netflix, Hulu, Spotify and Comcast. We have smart speakers by Apple, Google and Amazon. I am an Android user, but everyone else is on iOS. Music I can listen to on one smart speaker, I can’t access on another.

Here’s a great example. I LOVE my Xfinity Remote. The speech recognizer works great (although I’d prefer it not be push-to-talk). But I was watching Netflix last night and I wanted to get rid of captions. I tried saying “Remove Captions.” It was recognized but didn’t work. I tried “Remove Subtitles” and once again it was recognized but had no effect. So, I had to use my remote to navigate by hand.

OK, so I admit it. The title is misleading. It’s not really a speech recognition problem between a human and a machine, but a systemic issue in machine-to-system communication. I’m hoping the VII can expand to go beyond Voice Assistant handoffs and address what I think is a more important issue: any single front-end voice device should have proper protocols to know all the systems it can access, and should be enabled to communicate with those systems.
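To make that wish a little more concrete, here is a minimal sketch of what such a protocol might look like from the front-end device’s side. It is purely hypothetical and does not describe the VII or any real product’s API: the `Backend` and `VoiceFrontEnd` classes, the `streaming_app` name and the `captions_off` intent are all invented for illustration. The idea is simply that each system advertises the commands it accepts, and the front end routes a recognized phrase to whichever system is active.

```python
from dataclasses import dataclass, field


@dataclass
class Backend:
    """A hypothetical controllable system that advertises its commands."""
    name: str
    # Maps each supported intent to the set of phrases that trigger it.
    intents: dict = field(default_factory=dict)

    def handle(self, intent):
        return f"{self.name}: executed {intent}"


class VoiceFrontEnd:
    """A hypothetical front-end device that knows which systems it can reach."""

    def __init__(self):
        self.backends = {}

    def register(self, backend):
        self.backends[backend.name] = backend

    def dispatch(self, utterance, active_backend):
        backend = self.backends.get(active_backend)
        if backend is None:
            return f"no registered system named {active_backend}"
        phrase = utterance.lower().strip()
        for intent, phrases in backend.intents.items():
            if phrase in phrases:
                return backend.handle(intent)
        return f"{active_backend} does not advertise a command for '{utterance}'"


front_end = VoiceFrontEnd()
front_end.register(Backend(
    "streaming_app",
    {"captions_off": {"remove captions", "remove subtitles"}},
))
print(front_end.dispatch("Remove Captions", "streaming_app"))
```

In this sketch both “Remove Captions” and “Remove Subtitles” reach the same intent, and a phrase the active system does not advertise produces an explicit message instead of being silently dropped.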