Introducing SensoryCloud.ai Part 3: Speech-to-Text & Accuracy

When evaluating speech-to-text (STT) solutions, businesses face many options and varying degrees of marketing hype. However, when it comes down to choosing an STT partner, one factor tends to outweigh the others: accuracy. In fact, in a 2022 survey from Opus Research, 90% of respondents indicated that increased accuracy was the critical enabler to expanding speech technology use across their businesses.

State-of-the-art accuracy was a top requirement for Sensory when we launched the SensoryCloud.ai solution. To demonstrate the performance of SensoryCloud speech-to-text, we hired a third-party company to perform a word error rate (WER) test and compare our STT solution to the other offerings in their system.
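For readers new to the metric, WER is conventionally the number of word substitutions, deletions, and insertions needed to turn the engine's output into the reference transcript, divided by the number of words in the reference. The sketch below is only an illustration of that standard definition; the function name and the simple lowercase-and-split normalization are our assumptions, and the test house's actual scoring and text normalization rules are not described here.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Illustrative WER: word-level edit distance divided by reference length.

    Assumption: text is normalized only by lowercasing and splitting on
    whitespace. Real evaluations usually also normalize punctuation,
    numbers, and spelling variants.
    """
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words (classic Levenshtein dynamic programming).
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr

    return prev[-1] / len(ref)


# One deleted word ("the") and one substituted word ("light" -> "lights")
# against a five-word reference.
print(word_error_rate("turn on the kitchen light", "turn on kitchen lights"))
```

A lower WER is better, and because insertions count as errors, the value can exceed 1.0 when an engine outputs many extra words.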

As you can see in the tables below, the SensoryCloud.ai STT engine achieves best-in-class performance among well-known STT cloud services. Each engine was tested on hours of audio with text transcripts for the WER calculations. This audio was played back in two scenarios: 1) relatively clearly spoken and free of background noise, and 2) with noise added to create SNR 10.

The third-party test house used identical test data (podcasts and various other audio files) for every engine, and no company had access to customized language models (which can improve performance in known domains where data is accessible).

STT Accuracy Table: Quiet

Scenario 1: relatively clearly spoken and free of background noise

STT Accuracy Table: Noise

Scenario 2: noise added to create SNR 10

Performance in normal conditions, as shown above, reveals that the SensoryCloud STT engine provides best-in-class accuracy. The same WER testing was performed with a mix of added noise files (TV, radio, babble, car, and office) to create SNR 10, and Sensory continued to be near the top in performance (as shown, one company did outperform Sensory in this test). However, these results predate the addition of our new noise-robust front end, so we expect that within this quarter Sensory will be the most accurate in both quiet and noise!
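To make the noise condition concrete, here is a rough sketch of how a noise recording can be scaled and added to clean speech to hit a target signal-to-noise ratio. It assumes "SNR 10" means 10 dB and that the noise is mixed digitally sample by sample; the function name is hypothetical, and the test house's actual mixing or playback procedure is not described in this post.

```python
import math


def mix_at_snr(speech, noise, snr_db):
    """Return speech plus noise, with the noise scaled to the target SNR in dB.

    Assumptions: both inputs are sequences of float samples at the same
    sample rate, and the noise is looped if it is shorter than the speech.
    """
    if not speech or not noise:
        raise ValueError("speech and noise must both contain samples")

    # Loop the noise so it covers the full length of the speech.
    noise = [noise[i % len(noise)] for i in range(len(speech))]

    speech_power = sum(s * s for s in speech) / len(speech)
    noise_power = sum(n * n for n in noise) / len(noise)
    if speech_power == 0 or noise_power == 0:
        raise ValueError("speech and noise must both be non-silent")

    # SNR (dB) = 10 * log10(speech_power / noise_power), so solve for the
    # noise power that yields the requested SNR and scale the noise to it.
    target_noise_power = speech_power / (10 ** (snr_db / 10))
    gain = math.sqrt(target_noise_power / noise_power)

    return [s + gain * n for s, n in zip(speech, noise)]


# Synthetic stand-ins for a speech clip and a shorter noise clip.
speech = [math.sin(0.1 * i) for i in range(1000)]
noise = [((i * 37) % 11 - 5) / 5 for i in range(300)]
noisy = mix_at_snr(speech, noise, snr_db=10.0)
```

At 10 dB, the speech carries ten times the average power of the added noise, which is a noticeably noisy but still intelligible condition.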

As mentioned, no customized language models were used to match the domain of the test data. However, Sensory can customize language models to substantially outperform broader-domain STT models. An earlier study by Sensory’s Vocalize showed that, by deploying customized language models, our embedded engines (e.g. TrulyNatural) could match or beat Google and Amazon in a microwave domain with over 50,000 possible commands.

So, if you are looking for the highest accuracy and the flexibility to work with your team to build a customized solution, then SensoryCloud’s speech-to-text is the best choice for large-vocabulary natural language speech recognition! We invite you to subscribe to our blog and stay up to date on all the services offered by SensoryCloud: Speech-to-Text, Wake Word Verification, Sound ID, Face & Voice Biometrics, and Text-to-Speech.