At Mobile World Congress, I participated in ZTE’s Mobile Voice Alliance panel. ZTE presented data researched in China that basically said people want to use speech recognition on their phones, but they don’t use it because it doesn’t work well enough. I have seen similar data on US mobile phone users, and the automotive industry has also shown data supporting the high level of dissatisfaction with speech recognition.
In fact, when I bought my new car last year I wanted the state of the art in speech recognition to make navigation easier… but sadly I’ve come to learn that the system used in my Lexus just doesn’t work well — even the voice dialing doesn’t work well.
As an industry, I feel we must do better than this, so in this blog I’ll provide my two cents as to why speech recognition isn’t where it should be today, even when technology that works well exists:
Many core algorithms, especially the ones provided to the automotive industry, are just not that good. It’s kind of ironic, but the largest independent supplier of speech technologies actually has one of the worst-performing speech engines. Sadly, it’s this engine that gets used by many of the automotive companies, as well as some of the mobile companies.
Even many of the good engines don’t work well in noise. In many tests, Google’s speech recognition would come in as tops, but when the environment gets noisy even Google fails. I use my Moto X to voice dial while driving (at least I try to). I also listen to music while driving. The “OK Google Now” trigger works great (kudos to Sensory!), but everything I say after that gets lost and I see an “it’s too noisy” message from Google. I end up turning down the radio to voice dial or use Sensory’s VoiceDial app, because Sensory always works… even when it’s noisy!
Speech application designs are really bad. I was using the recognizer last week on a popular phone. The room was quiet, I had a great internet connection, and the recognizer was working great, but as a user I was totally confused. I said “set alarm for 4am” and it accurately transcribed “set alarm for 4am,” but rather than confirm that the alarm was set for 4am, it asked me what I wanted to do with the alarm. I repeated the command; it accurately transcribed it again and asked one more time what I wanted to do with the alarm. Even though it was recognizing correctly, it was interfacing so poorly with me that I couldn’t tell what was happening, and it didn’t appear to be doing what I asked it to do. Simple and clear application designs can make all the difference in the world.
Wireless connections are unreliable. This is a HUGE issue. If the recognizer only works when there’s a strong Internet connection, then the recognizer is going to fail A GREAT DEAL of the time. My prediction: over the next couple of years, the speech industry will come to realize that embedded speech recognition offers HUGE advantages over the common cloud-based approaches used today, and these advantages exist not just in accuracy and response time, but in privacy too!
Deep learning nets have enabled some amazing progress in speech recognition over the last five years. The next five years will see embedded recognition with high performance noise cancelling and beamforming coming to the forefront, and Sensory will be leading this charge… and just like how Sensory led the way with the “always on” low-power trigger, I expect to see Google, Apple, Microsoft, Amazon, Facebook and others follow suit.
It feels like I had a whole week’s worth of the trade show wrapped into one day! By the time midweek hits, I’ll surely be ready to head home! Here are some of the highlights from the first day of Mobile World Congress 2015:
First a word about Catalonia. That’s where Barcelona is…in the heart of Catalonia, an autonomous community of Spain. Don’t expect delayed meetings, inefficiencies, relaxed long lunches or anything like that. The Catalans have the precision of Germans (to continue my gross stereotyping!), and my experience with one of the largest trade shows on the planet is that it’s going off without a hitch! I picked up my badge at the airport in a five-minute line that was well staffed and moved rapidly. I could just about walk into the show yesterday morning. The subways and trains, though crowded and overheated, ran extremely smoothly. Kudos to the show management for pulling off such a difficult feat!
I’d be remiss without mentioning the Galaxy S6. Samsung invited us to the launch and of course they continue to use Sensory in a relationship that has grown quite strong over the years. Samsung continues to innovate with the Edge, and other products that everyone is talking about. It’s amazing how far Apple took the mantle in the first iPhone and how companies like Samsung and the Android system seem to now be leading the charge on innovation!
My favorite product that doesn’t feature Sensory technology that I bumped into was an electronic jump rope. They put sensors in the handles and a visual display shows across the field of the rope, kind of like those clocks that rapidly flash LEDs as the pendulum quickly moves back and forth in order to display the time. I talked with Alex Woo from Tangram and he said they were going to launch a crowdfunding campaign. I gave Alex a demo of our TrulyHandsfree with jump ropers jumping and all the show noise and of course it worked flawlessly. It would be really cool to be able to ask things like “How much time,” “How many jumps,” “What’s my heart rate,” or “How many calories burned” and so on, and the display would make voice control so much more functional!
We had a couple of partnership announcements here at the show, supporting both Qualcomm and Synopsys – both great partners to add to our support mix, and always nice when it’s customers driving our platform directions. The Qualcomm platform is interesting because it’s not their standard platform for 3rd parties to support. As far as I know they opened it up to Sensory and ONLY Sensory, and already we are seeing much interest!
Last night ZTE had a press party to induct Sensory and NXP into its Smart Voice Alliance. ZTE is really putting some forward thinking into the user experience and their research shows how much people want a voice interface but how dissatisfying the current state of the art actually is. Sensory’s hoping to change that! We’ll make one of our biggest announcements in history over the next month… and I’ll let you in on the secret (it’s on our website already!). We call it TrulyNatural, and it will be the highest accuracy large vocabulary embedded speech engine that the world has ever seen!
I know it’s been months since Sensory has blogged and I thank you for pinging me to ask what’s going on…Well, a lot is going on at Sensory. There are really 3 areas that we are putting a strategic focus on, and I’ll briefly mention each:
Applications. We have put our first applications into the Google Play store, and it is our goal over the coming year to put increased focus on making applications and in particular making good user experiences through Sensory technologies in these applications. Download AppLock or VoiceDial. These are both free products, intended more as a means to help tune our models and get real user feedback to refine the applications so they delight end users! We will offer the applications with the technology to our mobile, tablet, and PC customers so they can build them directly into their customers’ user experience.
Authentication. Sensory has been a leader in embedded voice authentication for years. Over the past year, though, we have placed increased focus on this area, and we have some EXCELLENT voice authentication technologies that we will be rolling out into our SDKs in the months ahead. Of course, we aren’t just investing in voice! We have a vision program in place and our vision focus is also on authentication. We call this fusion of voice and vision TrulySecure™, and we think it offers the best security with the most convenience. Try out AppLock in the above link and I hope you will agree that it’s great.
TrulyNatural™. For many years now, Sensory has been a leader in on-device speech recognition. We have seen our customers going to cloud-based solutions for the more complex and large vocabulary tasks. In the near future this will no longer be necessary! We have built from the ground up an embedded deep neural net implementation with FST, bag of words, robust semantic parsing and all the goodies you might expect from a state-of-the-art large vocabulary speech recognition solution! We recently benchmarked a 500,000-word vocabulary and we are measuring about a 10% word error rate (WER). On smaller 5K-vocabulary tasks the WER is down to the 7-8% range. This is as good as or better than today’s published state-of-the-art cloud-based solutions!
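For readers unfamiliar with the metric, WER is conventionally defined as the number of word substitutions, deletions, and insertions needed to turn the recognizer's output into the reference transcript, divided by the number of words in the reference. The sketch below shows that standard definition only; it is not Sensory's benchmark procedure, and the function name and sample phrases are made up for illustration.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Standard WER: word-level edit distance / number of reference words.

    Illustrative helper (hypothetical name); real benchmarks usually also
    normalize punctuation and numerals before scoring.
    """
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] = edit distance between the first i-1 reference words
    # and the first j hypothesis words (classic Levenshtein recurrence).
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1      # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One inserted word ("an") and one substitution ("four" -> "for")
    # against a five-word reference.
    print(word_error_rate("set alarm for four am", "set an alarm for for am"))
```

Note that because insertions count as errors, WER can exceed 100% when the recognizer outputs many more words than were spoken.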
Of course, there’s a lot more going on than just this…we recently announced partnerships with Intel and Nok Nok Labs, and we have further lowered power consumption in touchless control and always-on voice systems with the addition of our hardware block for low power sound detection.
A couple of news headlines have appeared recently asserting that voice activation is unsafe. I thought it was time for Sensory to weigh in on a few aspects of this since we are the pioneers in voice activation:
In-Car Speech Recognition. There have been a few studies like AAA/U of Utah. The headlines from these studies claim speech recognition creates distraction while driving. Other recent studies have shown that voice recognition in the car is one of the biggest complaints. But if you read into these studies carefully, what you really find are several important aspects:
What they call “hands free” is not 100% TrulyHandsfree. It requires touch to activate, so right there I agree it can take your eyes off the road, and potentially your hands off the wheel.
It’s really bad UX design that is distracting and not the speech recognition per se.
It’s not that people don’t want speech recognition. It’s that they don’t want speech recognition that fails all the time.
Here’s my conclusion on all this denigration of in-car speech recognition: there are huge problems with what the automotive companies have been deploying. The UX is bad and the speech recognition is bad. That doesn’t mean that speech recognition is not needed in the car…on the contrary, what’s needed is good speech recognition implemented in good design.
From my own experience it isn’t just that the speech recognition is bad and the UX is bad. The flaky Bluetooth connections and the problems of changing phones add to the perception of speech not working. When I’m driving, I use speech recognition all the time, and it’s GREAT, but I don’t use the recognizer in my Lexus…I use my Moto X with the always-on trigger, and then with Google Now, I can make calls or listen to music, etc.
Lack of Security. The CTO of AVG blasted speech recognition because it is unsafe. Now I previously resisted the temptation to comment on this, because the CTO’s boss (the CEO) is on my board of directors. I kind of agree and I kind of disagree with the CTO. I agree that speech recognition CAN BE unsafe…that’s EXACTLY why we add speaker verification into our wake up triggers…then ONLY the right person can get in. It’s really kind of surprising to me that Apple and Google haven’t done this yet! On the other hand, there are plenty of tasks that don’t really require security. The idea of a criminal lurking outside my home and controlling my television screen seemed more humorous than scary. In the case of TVs, I do think password protection is great but it’s really more for the purpose of identifying who is using the television and to call up their favorites, their voice adapted templates, and their restrictions (if any) on what they can watch AND how long they can watch…yeah I’m thinking about my kids and their need to get homework done. :-)
I was very excited to hear Motorola’s announcements today about the new Moto X, Moto G, Moto Hint and Moto 360.
What particularly caught my ear was the statement that they were changing the name from Touchless Control to Moto Voice. They made this decision because so many people thought the technology came from Google in the form of Android, and Moto wanted everyone to know it DIDN’T come from Google.
Actually…It came from Sensory. At least we were an important part of it!!! We have been working on the cool new user defined triggers and are excited that Moto has adopted them for the flagship MotoX (Write-up).
This feature was announced in our TrulyHandsfree 3.0.
The new Moto Hint headset is really cool too. It’s a bit like Intel’s Jarvis headset that was announced by Intel CEO Brian Krzanich at CES (and of course uses Sensory!).
Of course the Moto360 is AWESOME, and has some pretty cool voice control features. Yes, Sensory has done an “OK Google” trigger…we even benchmarked our trigger against Google’s…I might share the results in an upcoming blog if there is interest.
I see a bit of irony that a great Saturday Night Live alumnus is launching a campaign to decrease spoofing. I’m talking about Senator Al Franken, who has been looking into the problem of stolen fingerprints, see article.
Senator Franken challenges Samsung and Apple with some fair concerns about the problem of stolen or spoofed biometrics. The issue is that most biometrics that could be stolen can’t be easily replaced. We only have one face, two eyes, and 10 fingers, so not a lot of chances to replace or change them if they are stolen.
The mobile phone companies, challenged on the fingerprint issue, had two responses:
The biometric data is ON DEVICE. This is very important because when it’s stored in the cloud it becomes much more accessible to a hacker AND much more desirable because the payoff is a whole lot of user information. Cloud security is often breached, as in the recent break-in at the European Central Bank. In fact many banks I have spoken to insist that passwords can’t be stored in the cloud because they are just too easy to hack that way.
The fingerprint biometric is not stored as a fingerprint image, but as some sort of mathematical representation. I’m not sure I understand this argument because if the digital representation can be copied and replicated, then the system is cracked whether or not it looks like a fingerprint.
I think Franken is right to question the utility of biometric fingerprints, because a product like Sensory’s TrulySecure (combining voice and vision authentication) offers a large number of advantages:
The TrulySecure biometric is not easy to copy or find. Unlike a fingerprint which gets left everywhere, a voice print with a video image of a person saying a particular phrase is NOT easy to find, and even if well recorded, would fall apart with Sensory’s anti-spoofing technology that requires a live image.
The TrulySecure biometric is readily changeable. Unlike the nine chances that a user has to replace a fingerprint, there are a virtually unlimited number of TrulySecure password phrases that can be used. If by some nearly impossible chance a TrulySecure biometric phrase is copied, it can be changed in a matter of seconds and a virtually unlimited number of times.
TrulySecure works across conditions. Every biometric seems to have a failure mode. Fingerprint sensors seem to require a highly directionalized swipe of a very clean finger. If I cut my finger or have a little peanut butter on it, it just doesn’t work. Likewise a voiceprint by itself might fail in high noise, and a faceprint might fail in low lighting, but that magical dual biometric fusion in TrulySecure seems immune to conditions.
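To make the intuition behind dual-biometric fusion concrete, here is a minimal sketch of one common textbook approach: score-level fusion in which each modality's match score is weighted by how trustworthy its capture conditions are. This is purely an assumption for illustration and not a description of how TrulySecure works; the function names, the 0-to-1 scales, and the 0.7 threshold are all hypothetical.

```python
def fused_score(voice_score: float, face_score: float,
                noise_level: float, light_level: float) -> float:
    """Quality-weighted average of two match scores (all inputs in [0, 1]).

    Hypothetical weighting: voice is trusted less as noise rises,
    face is trusted less as lighting falls.
    """
    voice_weight = 1.0 - noise_level
    face_weight = light_level
    total = voice_weight + face_weight
    if total == 0.0:
        return 0.0  # neither modality is usable, so nothing to fuse
    return (voice_weight * voice_score + face_weight * face_score) / total


def accept(voice_score: float, face_score: float,
           noise_level: float, light_level: float,
           threshold: float = 0.7) -> bool:
    """Accept the user when the fused score clears an illustrative threshold."""
    return fused_score(voice_score, face_score, noise_level, light_level) >= threshold


if __name__ == "__main__":
    # Very noisy but well lit: the weak voice score barely counts.
    print(accept(voice_score=0.2, face_score=0.9, noise_level=0.9, light_level=0.8))
    # Quiet but dark: the weak face score barely counts.
    print(accept(voice_score=0.9, face_score=0.2, noise_level=0.1, light_level=0.1))
```

The design point is that a single poor capture condition degrades only one modality's contribution, so the combined decision can stay reliable where either biometric alone would fail.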
Here’s a demo I gave to UberGizmo in a somewhat dark and very noisy hotel lobby. I like this demo because it shows a real world situation and how FAST TrulySecure works.
Here’s a more canned demo on Sensory’s home page that better showcases some of the anti-spoofing features.
TrulySecure™ is now announced!!!! This is the first on-device fusion of voice and vision for authentication, and it really works AMAZINGLY well. I’m so proud of our new computer vision team and of Sensory’s expansion from speech recognition to speech and vision technologies. Now we are much more than “The Leader in Speech Technologies for Consumer Electronics”: we are “The Leader in Speech and Vision Technology for Consumer Products!” Hey, check out the new TrulySecure video on our home page, and our new TrulySecure Product Brief. We hope and expect that TrulySecure will have the same HUGE impact on the market as Sensory had with TrulyHandsfree, the technology that pioneered always-on touchless control!
Google I/O. Android wants to be everywhere: in our cars, in our homes, and in our phones. They are willing to spend billions of dollars to do it. Why? To observe our behaviors, which in turn will help provide us more of what we want…and they will also assist in those purchases. Of course this is what Microsoft and Apple and others want as well, but right now Google has the best cloud based voice experience, and if you ask me it’s the best user experience that will win the game. Seems like they should try and move ahead on the client, but lucky for Sensory we are staying ahead!
Rumors about Samsung acquiring Nuance…Why would they spend $7B for Nuance when they can pick up a more unique solution from Sensory for only $1B? Yeah, that’s a joke, and is definitely not intended as an offer or solicitation to sell Sensory!
OH! Sensory has a new logo! We made it to celebrate our 20 year anniversary!
I still subscribe to the San Jose Mercury News, as they do a good job of tech business reporting. One of my favorite Mercury News writers is a true critic in the literary sense of the term, Troy Wolverton. Troy rarely raves and is typically critical, but in a smart, logical, and unemotional way.
I was eager to read his review of Cortana this morning and in particular his comparison with Siri. He ended up giving it a 7/10, and concluding Siri was still ahead. What I thought was most interesting though was that in his final summary, he compared three products and three assistants based on the ease of calling up each of those assistants:
Cortana – required two touch steps to activate the personal voice assistant
Siri – required one touch step to activate the personal voice assistant
MotoX – The best, because you can just start talking with the keyword phrase “OK Google Now” making a TrulyHandsfree experience!!
Motorola is Sensory’s customer, and I am happy to read that Troy gets it and considers this front end activation an important metric in comparing personal assistants!
It was about 4 years ago that Sensory partnered with Vlingo to create a voice assistant with a special “in car” mode that would allow the user to just say “Hey Vlingo” then ask any question. This was one of the first “TrulyHandsfree” voice experiences on a mobile phone, and it was this feature that was often cited for giving Vlingo the lead in the mobile assistant wars (and helped lead to their acquisition by Nuance).
About 2 years ago Sensory introduced a few new concepts including “trigger to search” and our “deeply embedded” ultra-low power always listening (now down to under 2mW, including audio subsystem!). Motorola took advantage of these excellent approaches from Sensory and created what I most biasedly think is the best voice experience on a mobile phone. Samsung too has taken the Sensory technology and used it in a number of very innovative ways, going beyond mere triggers and using the same noise robust technology for what I call “sometimes always listening”. For example, when the camera is open it is always listening for “shoot,” “photo,” “cheese” and a few other words.
So I’m curious about what Google, Microsoft, and Apple will do to push the boundaries of voice control further. Clearly all 3 like this “sometimes always on” approach, as they don’t appear to be offering the low power options that Motorola has enabled. At Apple’s WWDC there wasn’t much talk about Siri, but what they did say seemed quite similar to what Sensory and Vlingo did together 4 years ago…enable an in car mode that can be triggered by “Hey Siri” when the phone is plugged in and charging.
I don’t think that will be all…I’m looking forward to seeing what’s really in store for Siri. They have hired a lot of smart people, and I know something good is coming that will make me go back to the iPhone, but for now it’s Moto and Samsung for me!
Nick Bilton, in a New York Times article, cites Forrester Research analysts who point out the importance of software in differentiating and creating value in the wearables market while avoiding commoditization.
While the new hardware is fun and exciting for consumers, the ultimate value will come from creating a connection and engaging the consumers with effective and useful analysis of all the data collected. And in the small wearable form factor, the user interface is always going to be critical. With little or no room for buttons and displays, and not always having a smartphone handy to run an app, voice will increasingly become the user interface of choice for these devices.
Sensory is very well positioned to support voice user interfaces for wearables with ultra-low power implementations that can be woken by a gesture, and quickly respond to commands or shut down to minimize impact on battery life. Watch this space (pun intended) for product announcements of wearables with great voice user interfaces!