The Key to a Great Custom Voice Assistant

A large European telecom company invested close to a hundred million euros in rolling out their own voice assistant on their own hardware. They tested their custom wake word with Sensory’s wake word algorithm and found that it met their accuracy specification. Their existing voice trigger (aka wake word) solution hadn’t met the market requirement, but they shipped anyway, hoping it would improve. It didn’t improve, and the voice assistant got pulled from the market due to lack of traction. One might conclude that Sensory’s help came too late…

People think wake words are easy because you only need to listen for and identify one or two things. Wake words are, however, one of the harder challenges in speech recognition. The wake word needs to be always listening, all other speech needs to be rejected and not responded to, and it needs to work in noise, with accents, from a distance, and on device (which usually means small footprint!). People get really frustrated with false accepts (the device wakes when nobody said the wake word) and false rejects (the device ignores a real wake word). Wake words are very easy to make with mediocre performance and very hard to make with high accuracy. I’ve been really disappointed to see some of the automotive companies with custom wake words that don’t allow their devices to be on and always listening at trade shows because of false accept and false reject problems. I suspect these issues don’t disappear in an automotive environment.
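To make the two failure modes concrete, here is a minimal Python sketch of how they are commonly summarized: a false reject rate over recordings that do contain the wake word, and false accepts per hour over audio that does not. The function name, parameters, and sample numbers are hypothetical and for illustration only; they are not Sensory’s test procedure or anyone’s published specification.

```python
def wake_word_metrics(detected, spoken, false_triggers, other_audio_hours):
    """Summarize wake word accuracy from simple test counts.

    detected          -- wake word utterances the device responded to
    spoken            -- total wake word utterances played to the device
    false_triggers    -- times the device woke on audio with no wake word
    other_audio_hours -- hours of audio containing no wake word
    All four names are hypothetical, chosen for this sketch.
    """
    if spoken <= 0 or other_audio_hours <= 0:
        raise ValueError("need at least one utterance and some non-wake-word audio")
    if not 0 <= detected <= spoken:
        raise ValueError("detected must be between 0 and spoken")

    false_reject_rate = (spoken - detected) / spoken
    false_accepts_per_hour = false_triggers / other_audio_hours
    return false_reject_rate, false_accepts_per_hour


# Made-up example: 95 of 100 wake words caught, 3 false fires in 24 hours.
frr, fa_per_hour = wake_word_metrics(95, 100, 3, 24)
print(f"false reject rate: {frr:.1%}, false accepts per hour: {fa_per_hour:.3f}")
```

The two numbers pull against each other: making the detector more eager lowers false rejects but raises false accepts, so both need to be measured on the same model in the conditions the product will actually see (noise, distance, accents).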

Sensory’s new VoiceHub can create pretty good wake words using synthetic data approaches. Our in-house tech team says a hand-curated wake word model with actual data and hand tuning will be more accurate than a VoiceHub wake word. But VoiceHub is fast and easy to use. VoiceHub can take 4 or 5 hours to create a model with a single wake word (assuming our servers aren’t backed up). Now contrast this with a complex grammar created on VoiceHub. The complex grammar can have tens of thousands of possible phrases, with NLU built in and a much bigger model size. Guess which builds faster? It’s not the wake word! The wake word modeling takes a LOT more computing resources to build!

I usually have multiple voice assistants working around my house. My personal experience is as follows:

Alexa works quite well. There’s some privacy sacrificed because they do a fair amount of analysis in the cloud. Amazon hands out their test data (which is not a hard test to pass), and we have found that some wake word companies will train on this test data to pass the Amazon test. Uh…that’s a no-no.

Hey Siri performs amazingly well and wins my vote for the best performing big assistant wake word. It appears to be done all or mostly on device to maintain privacy. I suspect they use a very big model. As a side note…I love the brief “hmm?” response that Siri can give if I just say “Hey Siri.”

Hey/OK Google false fires all the time. I have it running on my phone, and the false fires occur when I’m in conversations about Google. Admittedly, this is a difficult challenge to overcome. I suspect Google knew this when they chose their wake word and it’s really a data gathering technique. Interestingly, I haven’t been able to figure out why my Google hot word sometimes seems to get disabled automatically. Maybe they only need a certain amount of my data?

Actually, all three work quite well, but this is because a TON of data has been collected to improve performance. A company hoping to have a custom wake word and get this level of performance might have unrealistic hopes. Here are a few ideas to help:

1) Use Sensory technology… try it out on VoiceHub first. If VoiceHub matches or exceeds the performance of your current provider, then you are in great shape, because we can customize it to perform even better.

2) Choose a phrase with more distinguishing sounds. More syllables are better. More unique sounds are better. Unless you want false fires, don’t choose a phrase where 60% or more of it is commonly spoken (e.g. Hey Google).

3) Collect user data speaking the phrase. More data is usually better.

4) Allocate more memory for more difficult tasks. It’s easy to go very small on a TWS earbud. It’s very hard to go small on a speaker that’s supposed to pick up from across the room with music playing.
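As a rough illustration of tip 2, here is a small Python sketch that estimates how much of a candidate phrase is made up of commonly spoken words. Everything in it is an assumption made for the example: the tiny COMMON_WORDS set is a hypothetical stand-in for a real word frequency list, letter count is only a crude proxy for how long a word takes to say, and “Hey Zanzibar” is an invented phrase. It is a back-of-the-envelope screen, not how VoiceHub or any wake word engine evaluates phrases.

```python
import string

# Hypothetical stand-in for a real list of frequently spoken words.
COMMON_WORDS = {"hey", "hi", "hello", "ok", "okay", "google"}

RISK_THRESHOLD = 0.6  # the "60% or more" rule of thumb from tip 2


def common_fraction(phrase, common_words=COMMON_WORDS):
    """Fraction of the phrase's letters that belong to commonly spoken words."""
    words = [w.strip(string.punctuation).lower() for w in phrase.split()]
    words = [w for w in words if w]
    if not words:
        raise ValueError("phrase must contain at least one word")
    total = sum(len(w) for w in words)
    common = sum(len(w) for w in words if w in common_words)
    return common / total


def is_risky(phrase, threshold=RISK_THRESHOLD):
    """True if the phrase is mostly common words and likely to false fire."""
    return common_fraction(phrase) >= threshold


for candidate in ["Hey Google", "Hey Zanzibar"]:
    print(candidate, round(common_fraction(candidate), 2), is_risky(candidate))
```

A screen like this only flags obviously risky phrases. It says nothing about syllable count or how distinctive the sounds are, so real testing with recorded speech is still needed.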

Wake words aren’t easy, but careful choice of wake word, vendor, and platform can make your product successful…and unfortunately the converse is true too. Make sure your wake word accuracy is great, and your product will be off to a good start!