Battle of the Voice Assistants – Embedded vs Cloud

Can a small, embedded custom voice assistant take on the 800-pound cloud gorillas?

Before answering this and diving into the how, why, and how much, let me back up and talk a little bit about testing and accuracy.

Testing speech recognition performance across speech engines in a fair manner is VERY difficult. The tests need to be representative of the real world, and the same test needs to be applied to the different speech engines. This is not easy, because different engines run on different hardware with different noise-reduction front ends and different memory requirements (bigger usually means better accuracy), so selecting “fair” platforms is challenging. Choosing the right data to test with is even more challenging, because of accents, language, background noise, distance, signal-to-noise ratios, and all the other assumptions that must be made.

Sensory has always done computer-based, statistically significant testing with real-world audio files and digital modeling of noise and room acoustics, including various sound artifacts (reverb, echo, etc.). We found, however, that our in-house testing did not always align with our customers’ testing, and it wasn’t just differences in noise assumptions and hardware: the simple act of playing audio live out of a speaker is different from a live person talking, which is different from a high-fidelity digital recording of a real person talking. So Sensory turned to an outside testing firm, Vocalize.ai, and contracted them to do completely independent testing, free of the biases of Sensory’s assumptions and methodologies. Our thinking was that neither approach would be completely “right,” but at least we’d get a few data points. We did indeed find that this dual-testing approach aligned much more closely with our customers’ in-house tests and helped us better understand their results. Soon we became Vocalize’s biggest customer, and we ended up acquiring them, with the goal of keeping them independent in their processes, data, and methodologies.

A year or so ago, Vocalize did a nice report comparing wake words under various noise conditions and at various distances. A lot of people kept asking: what if you go beyond simple wake words and command and control? What about a comparison between large-vocabulary engines with NLU and voice assistant capabilities? Vocalize has just completed a report on a domain-specific test comparing Amazon, Google, and Sensory, and the results are quite interesting.

Here’s the report, which provides details on the methodology, the vocabularies tested, and the results.

We should understand, however, that it’s not a totally “fair” test, for a few reasons:

1) Vocalize is measuring domain-specific performance. Amazon and Sensory had advance knowledge of and preparation for this domain – microwaves – and both had the opportunity to develop a custom voice assistant. Google didn’t. For this reason Vocalize didn’t and couldn’t test Google on Task Completion Rate (TCR), which is probably the best measure of accuracy. Instead, the Google engine was compared to Sensory and Amazon only on word error rate, whereas Sensory and Amazon could also be compared fairly on TCR.


2) The Google and Amazon engines are WAY, WAY bigger than Sensory’s. They essentially have unlimited cloud-based resources and models, whereas Sensory runs entirely on device; as such, Sensory was forced to make design decisions that trade off cost, accuracy, size, and speed…a huge disadvantage for Sensory.


3) A lot of important things weren’t evaluated. Performance was tested without noise, and noise is an important variable. The Vocalize rationale was that it was too hard to balance hardware (microphones, filtering, etc.) and added front-end noise suppression software to compare engine performance in noise in a fair way. Also, many of the intrinsic advantages of running embedded – cloud computing cost, security/privacy, response time, and the ability to perform in changing or loosely connected environments – were not part of the analysis. For example, if connections failed just 3% of the time, Google’s word error rate would have roughly doubled, leaving it about 50% higher than Sensory’s (see the quick calculation below). But the assumption was 100% reliable connections.
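
To make that concrete, here is a rough back-of-the-envelope calculation. It is my own sketch, not part of the Vocalize report: it assumes a dropped connection loses the whole utterance and reuses the word accuracies reported at the end of this post (97% for Google, 96% for Sensory).

```python
# Back-of-the-envelope sketch (my assumption, not from the Vocalize report):
# if the cloud connection drops, count every word in that utterance as an error;
# the on-device recognizer keeps working regardless of connectivity.

def effective_wer(base_wer, connection_failure_rate=0.0):
    """Blend an engine's base word error rate with total loss on failed connections."""
    return (1 - connection_failure_rate) * base_wer + connection_failure_rate * 1.0

# Word error rates implied by the accuracies reported later in this post.
google_cloud = effective_wer(base_wer=0.03, connection_failure_rate=0.03)  # ~5.9% WER
sensory_edge = effective_wer(base_wer=0.04)                                # 4.0% WER, no cloud needed

print(f"Google with 3% dropped connections: {google_cloud:.1%} WER")
print(f"Sensory on device:                  {sensory_edge:.1%} WER")
```

Under those assumptions Google’s effective word error rate lands around 5.9%, roughly 50% higher than Sensory’s 4%.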


Nevertheless, the results are quite interesting. For Task Completion Rate on a real microwave, Amazon achieved 55% accuracy and Sensory achieved 93%. Vocalize took it a step further and looked at whether the errors came from speech recognition (word errors) or from the NLU behaving improperly when the recognition was correct. Quite frankly, Amazon made some very silly NLU mistakes: it correctly recognized phrases that should have been expected, like “stop cooking” or “cancel,” but then failed to carry those tasks out. My guess is they didn’t create a custom voice assistant but instead used their generic Alexa capabilities, which failed miserably.


Part of the problem general-purpose assistants will have with domain-specific usage is domain confusion, where the assistant gets the recognition all or mostly right but then applies it to the wrong domain. A great example comes from a user named Family05, who gave the Amazon Smart Oven a 1-star review in November 2019 and said, “When I asked Alexa to connect to my device and air fryer chicken legs, she replied, ‘chickens have two legs’.”


The interesting thing here is that with domain-specific usage the NLU can actually correct speech recognition errors, so even if word errors occur, the task may still get done correctly. This shows up in the results: Sensory had 3 speech recognition errors and 3 NLU errors, but only the NLU errors led to task failures, and not every speech recognition error caused an NLU failure.


Perhaps even more important is the benefit of constrained domains on false accepts from wake words erroneously firing. This wasn’t tested in the Vocalize study, but a constrained domain with a custom voice assistant is very good at rejecting random requests outside its domain. Google is very bad at this; whenever I’m having conversations about Google, it will accidentally hear “Hey Google,” start listening, and do some random thing (usually a voice search). In all of our testing and usage of Sensory’s microwave we have NEVER seen it accidentally start cooking, and that’s because even on the rare occasions when there is a false accept, the NLU rejects what follows and ignores it as a command. Alexa is pretty good at this too: you can see the Echo devices light up when they think they heard “Alexa,” and a lot of the time those false fires get automatically rejected by the phrases that follow.
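
Both effects, absorbing small recognition slips and ignoring out-of-domain requests after a false wake word fire, fall naturally out of constrained-domain intent matching. Here is a toy sketch of the idea; it is purely my own illustration with made-up phrases and a made-up threshold, not Sensory’s actual NLU.

```python
# Toy illustration only (not Sensory's actual NLU): in a constrained domain the
# transcript just has to land closer to one in-domain command than to anything else,
# and anything that isn't close enough gets rejected instead of acted on.

import difflib

MICROWAVE_INTENTS = {
    "stop cooking": "STOP",
    "cancel": "STOP",
    "cook for two minutes": "COOK_TIMED",
    "add thirty seconds": "ADD_TIME",
    "defrost one pound of chicken": "DEFROST",
}

def resolve_intent(transcript, min_similarity=0.7):
    """Map a transcript to the closest in-domain intent, or None if nothing is close."""
    best_phrase, best_score = None, 0.0
    for phrase in MICROWAVE_INTENTS:
        score = difflib.SequenceMatcher(None, transcript.lower(), phrase).ratio()
        if score > best_score:
            best_phrase, best_score = phrase, score
    return MICROWAVE_INTENTS[best_phrase] if best_score >= min_similarity else None

# A small recognition slip still resolves to the intended command...
print(resolve_intent("cook four two minutes"))        # COOK_TIMED
# ...while chatter that follows a false wake word fire is ignored, not cooked.
print(resolve_intent("what's the weather in paris"))  # None
```

The same thresholding that lets a slightly garbled transcript snap to the right command is what keeps random speech after a false wake word fire from turning into an action.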


So…yes, Sensory’s on-device assistant WAY outperformed the Amazon cloud assistant in a domain-specific environment. But how did Google perform vs. Sensory in terms of pure recognition scores? Well, Google outperformed Sensory. If you are surprised by that, you shouldn’t be. Google probably has the best-performing speech recognizer in the world for US English. Google’s word accuracy was 97% and Sensory’s was 96%. I’m pretty happy that Sensory is able to create an engine several orders of magnitude smaller than Google’s and still come so close in accuracy!