At Mobile World Congress, I participated in ZTE’s Mobile Voice Alliance panel. ZTE presented data researched in China that basically said people want to use speech recognition on their phones, but they don’t use it because it doesn’t work well enough. I have seen similar data on US mobile phone users, and the automotive industry has also shown data supporting the high level of dissatisfaction with speech recognition.
In fact, when I bought my new car last year I wanted the state of the art in speech recognition to make navigation easier… but sadly I’ve come to learn that the system used in my Lexus just doesn’t work well — even the voice dialing doesn’t work well.
As an industry, I feel we must do better than this, so in this blog I’ll provide my two cents as to why speech recognition isn’t where it should be today, even when technology that works well exists:
Many core algorithms, especially the ones provided to the automotive industry, are just not that good. It’s kind of ironic, but the largest independent supplier of speech technologies actually has one of the worst-performing speech engines. Sadly, it’s this engine that gets used by many of the automotive companies, as well as some of the mobile companies.
Even many of the good engines don’t work well in noise. In many tests, Google’s speech recognition would come out on top, but when the environment gets noisy even Google fails. I use my Moto X to voice dial while driving (at least I try to). I also listen to music while driving. The “OK Google Now” trigger works great (kudos to Sensory!), but everything I say after that gets lost and I see an “it’s too noisy” message from Google. I end up turning down the radio to voice dial or use Sensory’s VoiceDial app, because Sensory always works… even when it’s noisy!
Speech application designs are really bad. I was using the recognizer last week on a popular phone. The room was quiet, I had a great internet connection, and the recognizer was working great, but as a user I was totally confused. I said “set alarm for 4am” and it accurately transcribed “set alarm for 4am,” but rather than confirm that the alarm was set for 4am, it asked me what I wanted to do with the alarm. I repeated the command; it accurately transcribed it again and asked one more time what I wanted to do with the alarm. Even though it was recognizing correctly, it was interfacing so poorly with me that I couldn’t tell what was happening, and it didn’t appear to be doing what I asked it to do. Simple and clear application designs can make all the difference in the world.
Wireless connections are unreliable. This is a HUGE issue. If the recognizer only works when there’s a strong Internet connection, then the recognizer is going to fail A GREAT DEAL of the time. My prediction: over the next couple of years, the speech industry will come to realize that embedded speech recognition offers HUGE advantages over the common cloud-based approaches used today, and these advantages exist in not just accuracy and response time, but privacy too!
Deep learning nets have enabled some amazing progress in speech recognition over the last five years. The next five years will see embedded recognition with high-performance noise cancelling and beamforming coming to the forefront, and Sensory will be leading this charge… and just as Sensory led the way with the “always on” low-power trigger, I expect to see Google, Apple, Microsoft, Amazon, Facebook and others follow suit.
I was very excited to hear Motorola’s announcements today about the new Moto X, Moto G, Moto Hint and Moto 360.
What particularly caught my ear was the statement that they were changing the name from Touchless Control to Moto Voice. They made this decision because so many people thought the technology came from Google in the form of Android, and Moto wanted everyone to know it DIDN’T come from Google.
Actually… it came from Sensory. At least we were an important part of it!!! We have been working on the cool new user-defined triggers and are excited that Moto has adopted them for the flagship Moto X (Write-up).
This feature was announced in our TrulyHandsfree 3.0
The new Moto Hint headset is really cool too. It’s a bit like Intel’s Jarvis headset that was announced by Intel CEO Brian Krzanich at CES (and of course uses Sensory!).
Of course the Moto 360 is AWESOME, and has some pretty cool voice control features. Yes, Sensory has done an “OK Google” trigger… we even benchmarked our trigger against Google’s… I might share the results in an upcoming blog if there is interest.
Android introduced the new KitKat OS for the Nexus 5, and Sensory has gotten lots of questions about the new “always listening” feature that allows a user to say “OK Google” followed by a Google Now search. Here are some of the common questions:
Is it Sensory’s? Did it come from LG (like the hardware)? Is it Google’s in-house technology? I believe it was developed within the speech team at Android. LG does use Sensory’s technology in the G2, but this does not appear to be an implementation of Sensory. Google has one of the smartest, most capable, and larger speech recognition groups in the industry, and they certainly have the chops to build a keyword spotting technology. Actually, developing a voice-activated trigger is not very hard. There are several dozen companies that can do this today (including Qualcomm!). However, making it usable in an “always on” mode, where accuracy is really important, is very difficult.
The KitKat trigger is just like the one on Moto X, right? Ugh, definitely not. Moto X really has “always on” capabilities. This requires low power operation. The Android approach consumes too much power to be left “always on”. Also, the Moto X approach combines speaker verification so the “wrong” users can’t just take over the phone with their voice. Motorola is a Sensory licensee; Android isn’t.
How is Sensory’s trigger word technology different than others?
First of all, Sensory’s approach is ultra low power. We have IC partners like Cirrus Logic, DSPG, Realtek, and Wolfson that are measuring current consumption in the 1.5-2 mA range. My guess is that the KitKat implementation consumes 10-100 times more power than this. This is for two reasons: 1) we have implemented a “deeply embedded” approach on these tiny DSPs, and 2) Sensory’s approach requires as little as 5 MIPS, whereas most other recognizers need 10 to 100 times more processing power and must run on the power-hungry Android processor!
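To put those current figures in perspective, here is a rough back-of-the-envelope sketch. The 2200 mAh battery capacity is purely an assumption for illustration (it is not a figure from any of the partners above), and the sketch treats “10-100 times more power” as 10-100 times more current at the same voltage, with the trigger as the only load on the battery.

```python
# Rough battery-life arithmetic for an always-listening trigger.
# ASSUMPTION: the 2200 mAh capacity is illustrative only, not a measured figure.
BATTERY_MAH = 2200.0


def standby_hours(current_ma: float, battery_mah: float = BATTERY_MAH) -> float:
    """Hours the battery would last if the trigger were the only load."""
    return battery_mah / current_ma


# Current draw quoted above for the DSP-based trigger (1.5 to 2 mA).
for current_ma in (1.5, 2.0):
    print(f"{current_ma} mA -> {standby_hours(current_ma):.0f} hours")

# The guess above: other implementations draw 10 to 100 times more.
# ASSUMPTION: same supply voltage, so power scales with current.
for factor in (10, 100):
    current_ma = 2.0 * factor
    print(f"{current_ma:.0f} mA -> {standby_hours(current_ma):.0f} hours")
```

Under these assumptions, a trigger in the 2 mA range could listen for weeks on its own, while one drawing 100 times more would drain the same battery in about half a day.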
Second… Sensory’s approach requires minimal memory. These small DSPs that run at ultra low power offer less RAM and more limited memory access. The traditional approach to speech recognition is to collect tons of data and build huge models that take a lot of memory… very difficult to move this approach onto low power silicon.
Thirdly, being left always on really pushes accuracy, and Sensory is unique in the accuracy of its triggers. Accuracy is usually measured by looking at the two types of errors: “false accepts,” when it fires unintentionally, and “false rejects,” when it doesn’t let a person in when they say the right phrase. When there’s a short listening window, “false accepts” aren’t too much of an issue, and the KitKat implementation has very intentionally allowed a “loose” setting which I suspect would produce too many false accepts if it were left “always on”. For example, I found this YouTube video that shows “OK Google” works great, but so does “OK Barry” and “OK Jarvis”.
Finally, Sensory has layered other technologies on top of the trigger, like speaker verification and speaker identification. Also, Sensory has implemented a “user defined trigger” capability that allows the end customer to define their own trigger, so the phone can accurately and at ultra low power respond to the user’s personalized commands!
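The trade-off between the two error types described above can be sketched in a few lines. This is only an illustration, not Sensory’s or Google’s method: the scores are invented, and the assumption is that a trigger produces a confidence score and fires when that score meets a threshold. A “loose” setting corresponds to a low threshold.

```python
from typing import List, Tuple


def error_rates(target_scores: List[float],
                non_target_scores: List[float],
                threshold: float) -> Tuple[float, float]:
    """Return (false_accept_rate, false_reject_rate) at one threshold.

    ASSUMPTION: the trigger fires whenever score >= threshold.
    """
    # Non-trigger audio that still fired the trigger.
    false_accepts = sum(s >= threshold for s in non_target_scores)
    # Genuine trigger phrases that failed to fire.
    false_rejects = sum(s < threshold for s in target_scores)
    return (false_accepts / len(non_target_scores),
            false_rejects / len(target_scores))


# HYPOTHETICAL scores, invented for illustration only.
target = [0.91, 0.85, 0.78, 0.66, 0.95]      # user said the trigger phrase
non_target = [0.10, 0.42, 0.71, 0.30, 0.55]  # other speech, e.g. similar phrases

for threshold in (0.5, 0.8):  # loose setting, then strict setting
    far, frr = error_rates(target, non_target, threshold)
    print(f"threshold={threshold}: false accepts={far:.0%}, false rejects={frr:.0%}")
```

With these made-up numbers, the loose threshold never misses the real phrase but fires on two of the five non-trigger utterances, while the strict threshold eliminates those false accepts at the cost of rejecting two genuine attempts. In a short listening window the loose setting may be tolerable; left always on, the false accepts add up.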
Radio Rex. There’s always something special about the first one – this was from almost 100 years ago! Rex was a toy dog that lived in a doghouse, and the waveform from calling his name would vibrate a spring at a certain frequency that would make Rex exit the doghouse. Basically, a mechanical speech recognition device!
Radar the Robot. Sure, this list will be highly biased with products that used Sensory technology. Fisher Price released Radar the Robot back in 1995! Radar would talk to kids, sing songs with them, do math games, word games, and much, much more. I remember one of my kids walking into my room and speaking in a robotic voice to imitate Radar, “I’m sorry, I can’t hear you. Would you like to play word games? Please say yes or no.”
Password Journal. Not only is this the bestselling girls’ electronic product of all time, but it uses voice biometrics as a key feature (to lock a diary). I once heard that half of all 11-year-old girls in the US have a diary and their top concern is that someone unintended will open it and read it. This product was so successful that Girltech, the company Sensory worked with, was acquired by Radica, which was then acquired by Mattel. Most new toy introductions have a 1-2 year life. This product, and its many revisions, has been on the market for over 15 years!
Voice Signal and VOS light switches. Voice Signal Technologies was a company started around 1995 to build voice controlled light switches. They got so excited about speech technology that they successfully transitioned into a leader in embedded speech (they went from Sensory’s customer to competitor!), and were eventually sold to Nuance for just under $300M! Sensory’s customer VOS also made light switches. VOS even introduced a Star-Trek branded light switch and licensed Majel Roddenberry’s voice. Computer Lights On!
Uniden Voice Dial. I’ll never forget the thrill of landing in Las Vegas for CES, and going down the escalator into the baggage claim area and seeing a HUGE sign saying “Uniden Introduces VoiceDial.” The phones worked great. They even ran a TV commercial featuring the famous sumo wrestler Konishiki saying “Pizza-man.”
Moshi Clock. What a great clock! You could set the alarm or time just by speaking to it. The clock would even tell you the weather. And this was pre-SIRI!!
BlueAnt V1. BlueAnt moved two steps ahead of its competitors with the V1. It had a completely voice-driven user interface that replaced the buttons and flashing lights on a Bluetooth headset. This was probably the first consumer electronic device that enabled a full and complex VUI-based experience. And the reviews were some of the best reviews I have ever seen.
Apple SIRI/iPhone 4s. SIRI was an amazing breakthrough for voice recognition – not so much in the capabilities it presented, but in the marketing and brand support behind it. When Apple said the time was right for speech recognition, the world listened and consumer electronic OEMs suddenly changed!
Google Glass. OK, it’s not shipping yet, but they have taken a VERY novel approach to speech by using what they refer to in the press as “hotword” models. We in the industry call this keyword spotting. I handed my Glass to my wife and she put it on and said “You mean I just say OK Glass? Oh now I see all these other things so I can say Get Directions to Chef Chu’s restaurant? Whoa! It’s showing me directions to Chef Chu’s!” The device throws out all the wrong words and captures the key words it wants to hear, then seamlessly switches to a cloud-based recognizer.
Motorola Moto X. 15M plus views for a TV commercial featuring voice control!!! And the users LOVE it! Touchless Control is one of the best reviewed apps in the Google Play store!
Motorola, who just happens to be a Sensory customer, launched a suite of new phones including Moto X and three Droids (Maxx, Ultra, and Mini), all with this awesome feature called “touchless control.” The “touchless control” uses a technology to wake up the phone by voice from a low power state, so the phone is always on and listening. Sorta like TrulyHandsfree! It links into Google Now so you can control pretty much anything and access information without touching the phone.
Moto launched an advertising campaign around the Lazy Phone Guy. These are my favorite ads ever, and the best of all these ads is the “no touching” Moto X phone. It’s already hit about 15M views!
Just saw this AdAge article about the Lazy Phone Guy going viral and beating out the iPhone at its new launch. It says the touch ad has hit about 20M views!
Even more impressive are the customer reviews for the “touchless control” technology. It’s one of the highest rated apps in the Google Play store.