Posts Tagged ‘Wake Word Accuracy’
June 11, 2019
I used to blog a lot about wake words and voice triggers. Sensory pioneered this technology for voice assistants, and we evangelized the importance of not hitting buttons to speak to a voice recognizer. Then everybody caught on and the technology went into main stream use (think Alexa, OK Google, Hey Siri, etc.), and I stopped blogging about it. But I want to reopen the conversation…partly to talk about how important a GREAT wake word is to the consumer experience, and partly to congratulate my team on a recent comparison test that shows how Sensory continues to have the most accurate embedded wake word solutions.
Competitive Test Results. The comparison test was done by Vocalize.ai. Vocalize is an independent test house for voice enabled products. For a while, Sensory would contract out to them for independent testing of our latest technology updates. We have always tested in-house but found that our in-house simulations didn’t always sync up with our customers’ experience. Working with Vocalize allowed us to move from our in-house simulations to more real-world product testing. We liked Vocalize so much that we acquired them. So, now we “contract in” to them but keep their data and testing methodology and reporting uninfluenced by Sensory.
Vocalize compared two Sensory TrulyHandsfree wake word models (1MB size and 250KB size) with two external wake words (Amazon and Kitt.ai’s Snowboy), all using “Alexa” as the trigger. The results are replicable and show that Sensory’s TrulyHandsfree remains the superior solution on the market. TrulyHandsfree was better/lower on BOTH false accepting AND false rejecting. And in many cases our technology was better by a longshot! If you would like see the full report and more details on the evaluation methods, please send an email request to either Vocalize (firstname.lastname@example.org) or Sensory (email@example.com).
It’s Not Easy. There are over 20 companies today that offer on-device wake words. Probably half of these have no experience in a commercially shipping product and they never will; there are a lot of companies that just won’t be taken seriously. The other half can talk a good talk, and in the right environment they can even give a working demo. But this technology is complex, and really easy to do badly and really hard to do great. Some demos are carefully planned with the right noise in the right environment with the right person talking. Sensory has been focused on low power embedded speech for 25 years, we have 65 of the brightest minds working on the toughest challenges in embedded AI. There’s a reason that companies like Amazon, Google, Microsoft and Samsung have turned to Sensory for our TrulyHandsfree technology. Our stuff works, and they understand how difficult it is to make this kind of technology work on-device! We are happy to provide APK’s so you can do you’re your own testing and judge for yourself! OK, enough of the sales pitch…some interesting stuff lays ahead…
It’s Really Important. Getting a wake word to work well is more important than most people realize. It’s like the front door to your house. It might be a small part of your house, but if you aren’t letting the homeowners in then that’s horrible, and if you are letting strangers in by accident that’s even worse. The name a company gives their wake word is usually the company brand name, imagine the sentiment that comes off when I say a brand name and it doesn’t work. Recently I was at a tradeshow that had a Mercedes booth. There were big signs that said “Hey Mercedes”…I walked up to the demo area and I said “Hey Mercedes” but nothing happened…the woman working there informed me that they couldn’t demo it on the show floor because it was really too noisy. I quickly pulled out my mobile phone and showed her that I could use dozens of wake words and command sets without an error in that same environment. Mercedes has spent over 100 years building up one of the best quality brand reputations in the car industry. I wonder what will happen to that reputation if their wake word doesn’t respond in noise? Even worse is when devices accidentally go off. If you have family members that listen to music above volume 7 then you already know the shock that a false alarm causes!
It’s about Privacy. Amazon, like Google and a few others seem to have pretty good wake words, but if you go into your Alexa settings you can see all of the voice data that’s been collected, and a lot of it is being collected when you weren’t intentionally talking to Alexa! You can see this performance issue in the Vocalize test report. Sensory substantially outperformed Amazon in the false reject area. This is when a person tries to speak to Alexa and she doesn’t respond. The difference is most apparent in babble noise where Sensory falsely rejected 3% and Amazon falsely rejected 10% in comparable sized models (250KB). However the False Accept difference is nothing short of AMAZING. Amazon false accepted 13 times in 24 hours of random noise. In this same time period Sensory false accepted ZERO times (on comparably sized 250KB models). How is this possible you may be wondering? Amazon “fixes” its mistakes in the cloud. Even though the device falsely accepts quite frequently, their (larger and more sophisticated) models in the cloud collect the error. Was that a Freudian slip? They correct the error…AND they COLLECT the error. In effect, they are disregarding privacy to save device cost and collect more data.
As the voice revolution continues to grow, you can bet that privacy will continue to be a hot topic. What you now understand is that wake word quality has a direct impact on both the user experience and PRIVACY! While most developers and product engineers in the CE industry are aware of wake words and the difficulty in making them work well on-device, they don’t often consider that competing wake words technologies aren’t created equally – the test results from Vocalize prove it! Sensory is more accurate AND allows more privacy!
January 5, 2017
Virtual handsfree assistants that you can talk to and that talk back have rapidly gained popularity. First, they arrived in mobile phones with Motorola’s MotoX that had an ‘always listening’ Moto Voice powered by Sensory’s TrulyHandsfree technology. The approach quickly spread across mobile phones and PCs to include Hey Siri, OK Google, and Hey Cortana.
Then Amazon took things to a whole new level with the Echo using Alexa. A true voice interface emerged, initially for music but quickly expanding domain coverage to include weather, Q&A, recipes, and the most common queries. On top of that, Amazon took a unique approach by enabling 3rd parties to develop “skills” that now number over 6000! These skills allow Amazon’s Echo line (with Tap, Dot) and 3rd Party Alexa equipped products (like Nucleus and Triby) to be used to control various functions, from reading heartrates on Fitbits to ordering Pizzas and controlling lights.
Until recently, handsfree assistants required a certain minimum power capability to really be always on and listening. Additionally, the hearable market segment including fitness headsets, hearing aids, stereo headsets and other Bluetooth devices needed to use touch control because of their power limitations. Also, Amazons Alexa had required WIFI communications so you could sit on your couch talking to your Echo and query Fitbit information, but you couldn’t go out on a run and ask Alexa what your heartrate was.
All this is changing now with Sensory’s VoiceGenie!
The VoiceGenie runs an embedded recognizer in a low power mode. Initially this is on a Qualcomm/CSR Bluetooth chip, but could be expanded to other platforms. Sensory has taken an SBC music decoder and intertwined a speech recognition system, so that the Bluetooth device can recognize speech while music is playing.
The VoiceGenie is on and listening for 2 keywords:
For example, a Bluetooth headset’s volume, pairing, battery strength, or connection status can only be controlled by the device itself, so VoiceGenie handles those controls without touching required. VoiceGenie can also read incoming callers’ names and ask the user if they want to answer or ignore. VoiceGenie can call up the phone’s assistant, like Google Assistant or Siri or Cortana, to ask by voice for a call to be made or a song to be played.
Some of the important facts behind the new VoiceGenie include:
This third point is perhaps the least understood, yet the most important. People want a personalized assistant that knows them, keeps their secrets safe, and helps them in their daily lives. This help can be accessing information or controlling your environment. It’s very difficult to accomplish this for privacy and power reasons in a cloud powered environment. There needs to be embedded intelligence. It needs to be low power. VoiceGenie is that low powered voice assistant.
June 22, 2016
I’ve written a series of blogs about consumer devices with speech recognition, like Amazon Echo. I mentioned that everyone is getting into the “always listening” game (Alexa, OK Google, Hey Siri, Hi Galaxy, Assistant, Hey Cortana, OK Hound, etc.), and I’ve explained that privacy concerns attempt to be addressed by putting the “always listening” mode on the device, rather than in the cloud.
Let’s now look deeper into the “always listening” approaches and compare some of the different methods and platforms available for embedded triggers.
There are a few basic approaches for running embedded voice wakeup triggers:
First, is running on an embedded DSP, microprocessor, and/or smart microphones. I like to think of this as a “deeply embedded: approach as opposed to running embedded on the operating system (OS). Knowles recently announced a design with a smart mike that provides low-power wake up assistance.
Many leading chip companies have small DSPs that are enabled for “wake up word” detection. These vendors include Audience, Avnera, Cirrus Logic, Conexant, DSPG, Fortemedia, Intel, InvenSense, NXP, Qualcomm, QuickLogic, Realtek, STMicroelectronics, TI, and Yamaha. Many of these companies combine noise suppression or acoustic echo cancellation to make these chips add value beyond speech recognition. Quicklogic recently announced availability of an “always listening” sensor fusion hub, the EOS S3, which lets the sensor listen while consuming very little power.
Next is DSP IP availability. The concept of low-power voice wakeup has gotten so popular amongst processor vendors that the leading DSP/MCU IP cores from ARM, Cadence, CEVA, NXP CoolFlux, Synopsys, and Verisilicon all offer this capability, and some even offer special versions targeting this function.
Running on an embedded OS is another option. Bigger systems like Android, Windows, or Linux can also run voice wake-up triggers. The bigger systems might not be so applicable for battery-operated devices, but they offer the advantage of being able to implement larger and more powerful voice models that can improve accuracy. The DSPs and MCUs might run a 50-kbyte trigger at 1 mA, while bigger systems can cut error rates in half by increasing models to hundreds of megabytes and power consumption to hundreds of milliamps. Apple used this approach in its initial implementation of Siri, thus explaining why the iPhone needed to be plugged in to be “always listening.”
Finally, one can try combinations and multi-level approaches. Some companies are implementing low-power wake-up engines that look to a more powerful system when woken up to confirm its accuracy. This can be done on the device itself or in the cloud. This approach works well for more complex uses of speech technology like speaker verification or identification, where the DSPs are often crippled in performance and a larger system can implement a more state of the art approach. It’s basically getting the accuracy of bigger models and systems, while lowering power consumption by running a less accurate and smaller wakeup system first.
A variant of this approach is accomplished with a low-power speech detection block acting as an always listening front-end, that then wakes up the deeply embedded recognition. Some companies have erred by using traditional speech-detection blocks that work fine for starting a recording of a sentence (like an answering machine), but fail when the job is to recognize a single word, where losing 100 ms can have a huge effect on accuracy. Sensory has developed a very low power hardware sound-detection block that runs on systems like the Knowles mike and Quicklogic sensor hub.
March 28, 2016
Just saw an interesting article on www.eweek.com
Covers a consumer survey about being connected and particularly with IoT devices. What’s interesting is that those surveyed were technically savvy (70% were self-described as intermediate or advanced with computers, and 83% said they could set up their own router), yet the survey found:
1) 68 percent of consumers expressed concern about security risks such as viruses, malware and hackers;
These concerns are quite understandable, since we as consumers tend to give away many of our data rights in return for free services and software.
People have asked me if embedded speech and other embedded technologies will continue to persist if our cloud connections get better and faster, and the privacy issues are one of the reasons why embedded is critical.
This is especially true for “always on” devices that listen for triggers; if the always on listening is in the cloud, then everything we discuss around the always on mics goes into the cloud to be analyzed and potentially collected!
March 30, 2015
At Mobile World Congress, I participated in ZTE’s Mobile Voice Alliance panel. ZTE presented data researched in China that basically said people want to use speech recognition on their phones, but they don’t use it because it doesn’t work well enough. I have seen similar data on US mobile phone users, and the automotive industry has also shown data supporting the high level of dissatisfaction with speech recognition.
In fact, when I bought my new car last year I wanted the state of the art in speech recognition to make navigation easier… but sadly I’ve come to learn that the system used in my Lexus just doesn’t work well — even the voice dialing doesn’t work well.
As an industry, I feel we must do better than this, so in this blog I’ll provide my two-cents as to why speech recognition isn’t where it should be today, even when technology that works well exists:
Deep learning nets have enabled some amazing progress in speech recognition over the last five years. The next five years will see embedded recognition with high performance noise cancelling and beamforming coming to the forefront, and Sensory will be leading this charge… and just like how Sensory led the way with the “always on” low-power trigger, I expect to see Google, Apple, Microsoft, Amazon, Facebook and others follow suit.
February 11, 2015
The advent of “always on” speech processing has raised concerns about organizations spying on us from the cloud.
In this Money/CNN article, Samsung is quoted as saying, “Samsung does not retain voice data or sell it to third parties.” But, does this also mean that your voice data isn’t being saved at all? Not necessarily. In a separate article, the speech recognition system in Samsung’s TVs is shown to be an always-learning cloud-based system solution from Nuance. I would guess that there is voice data being saved, and that Nuance is doing it.
This doesn’t mean Nuance is doing anything evil; this is just the way that machine learning works. There has been this big movement towards “deep” learning, and what “deep” really means is more sophisticated learning algorithms that require more data to work. In the case of speech recognition, the data needed is speech data, or speech features data that can be used to train and adapt the deep nets.
But just because there is a necessary use for capturing voice data and invading privacy, doesn’t mean that companies should do it. This isn’t just a cloud-based voice recognition software issue; it’s an issue with everyone doing cloud based deep learning. We all know that Google’s goal in life is to collect data on everything so Google can better assist you in spending money on the right things. We in fact sign away our privacy to get these free services!
I admit guilt too. When Sensory first achieved usable results for always-on voice triggers, the basis of our TrulyHandsfree technology, I applied for a patent on a “background recognition system” that listens to what you are talking about in private and puts together different things spoken at different times to figure out what you want…. without you directly asking for it.
Can speech recognition be done without having to send all this private data to the cloud? Sure it can! There’s two parts in today’s recognition systems: 1) The wake up phrase; 2) The cloud based deep net recognizer – AND NOW THEY CAN BOTH BE DONE ON DEVICE!
Sensory pioneered the low-power wake up phrase on device (item 1), now we have a big team working on making an EMBEDDED deep learning speech recognition system so that no personal data needs to be sent to the cloud. We call this approach TrulyNatural, and it’s going to hit the market very soon! We have benchmarked TrulyNatural against state-of-the-art cloud-based deep learning systems and have matched and in some cases bested the performance!
January 21, 2015
I know it’s been months since Sensory has blogged and I thank you for pinging me to ask what’s going on…Well, lot’s going on at Sensory. There are really 3 areas that we are putting a strategic focus on, and I’ll briefly mention each:
Of course, there’s a lot more going on than just this…we recently announced partnerships with Intel and Nok Nok Labs, and we have further lowered power consumption in touchless control and always-on voice systems with the addition of our hardware block for low power sound detection.
June 4, 2014
It was about 4 years ago that Sensory partnered with Vlingo to create a voice assistant with a special “in car” mode that would allow the user to just say “Hey Vlingo” then ask any question. This was one of the first “TrulyHandsfree” voice experiences on a mobile phone, and it was this feature that was often cited for giving Vlingo the lead in the mobile assistant wars (and helped lead to their acquisition by Nuance).
About 2 years ago Sensory introduced a few new concepts including “trigger to search” and our “deeply embedded” ultra-low power always listening (now down to under 2mW, including audio subsystem!). Motorola took advantage of these excellent approaches from Sensory and created what I most biasedly think is the best voice experience on a mobile phone. Samsung too has taken the Sensory technology and used in a number of very innovative ways going beyond mere triggers and using the same noise robust technology for what I call “sometimes always listening”. For example when the camera is open it is always listening for “shoot” “photo” “cheese” and a few other words.
So I’m curious about what Google, Microsoft, and Apple will do to push the boundaries of voice control further. Clearly all 3 like this “sometimes always on” approach, as they don’t appear to be offering the low power options that Motorola has enabled. At Apple’s WWDC there wasn’t much talk about Siri, but what they did say seemed quite similar to what Sensory and Vlingo did together 4 years ago…enable an in car mode that can be triggered by “Hey Siri” when the phone is plugged in and charging.
I don’t think that will be all…I’m looking forward to seeing what’s really in store for Siri. They have hired a lot of smart people, and I know something good is coming that will make me go back to the iPhone, but for now it’s Moto and Samsung for me!
November 15, 2013
Android introduced the new KitKat OS for the Nexus 5, and Sensory has gotten lots of questions about the new “always listening” feature that allows a user to say “OK Google” followed by a Google Now search. Here’s some of the common questions:
August 21, 2013
Saw an article about game changers in the Galaxy Note 3.
It has a few interesting insights. They refer to Samsung’s S-Voice now as “Always on S Voice” and mention that the new Note 3 will be designed to be always on, listening for your wake up command.
The Galaxy Note 3 also uses the Qualcomm SnapDragon 800. This is the chip from Qualcomm that has an always listening wake up command built in. Sorry, Qualcomm, but I don’t think Samsung will be using your technology!
The best performing “always listening” processors combine Sensory’s TrulyHandsfree with an ultra-low power chip, like IP from Tensilica and CEVA. Chip companies like Cirrus Logic, DSPG, Realtek, and Wolfson seem well positioned to lead in mobile chips with “always on” listening features.