Embedded Dictation, Can You Do it?

It’s funny how many companies ask for embedded dictation, or claim they can provide it, without qualifying what they really want or what they can deliver. Embedded dictation is very easy to do, but one must consider…

Which languages?

Some languages are easier than others, and higher accuracy targets or harsher conditions can require more training data. Luckily, Sensory has been around long enough to have collected over 150,000 hours of audio data across more than 50 languages and dialects.

What size engine and platform?

Sensory has engines running on everything from the tiniest of platforms, with less than 50 KB of memory, to large solutions requiring powerful DSPs and inference engines. We can run a speech-to-text algorithm in as little as 3 MB of memory, and that algorithm will deliver extremely high task completion rates and low word error rates (under 5%) for in-domain usage, but it won’t perform as well out of domain. For reasonable cross-domain performance, engines need to grow to around 20 MB and have reasonably powerful processing, or at least specialized inference functions.

What domain coverage?

Dictation isn’t specific to domains, or is it? Sensory’s top-of-the-line engine can get under 5% word error rates on certain TED Talks, but apply a different test set or a different domain and accuracy can degrade. The better we understand the domain and the testing methodology, the better we can do.

What about accuracy?

Accuracy is typically measured in word error rate (WER) and task completion rate (TCR). If the task isn’t straight dictation, then task completion rate is usually what matters most: even if a word is recognized incorrectly, it doesn’t really matter as long as the right function is performed. Sensory likes TCR to ALWAYS exceed 95%; much below that, a system starts to feel unusable. The nice thing is that WER can climb as high as 10 or 15% while a good TCR is still achieved.
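For concreteness, WER is conventionally computed as the word-level edit distance (substitutions, insertions, and deletions) between the recognizer’s output and a reference transcript, divided by the number of reference words. Here is a minimal sketch of that standard calculation; the example phrases are hypothetical and not from any Sensory test set:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance / reference length."""
    r, h = reference.split(), hypothesis.split()
    # Dynamic-programming table for Levenshtein distance over words.
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i          # i deletions
    for j in range(len(h) + 1):
        d[0][j] = j          # j insertions
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            sub = 0 if r[i - 1] == h[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + sub)  # match/substitution
    return d[len(r)][len(h)] / len(r)

ref = "turn on the kitchen lights"
# One substitution out of five reference words -> WER = 0.2,
# yet a command interpreter would still complete the task.
print(wer(ref, "turn on the kitchen light"))
```

This also illustrates the WER/TCR gap mentioned above: the hypothesis gets one word wrong (20% WER), but any reasonable command handler would still turn the lights on, so the task completes.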

What noise or signal to noise ratio?

Noise and distance make it much harder to recognize accurately, so it is important to implement noise management strategies that fit the usage model. Sensory’s noise data includes about 15,000 hours of recordings, and we have a variety of noise and acoustic simulation tools. Typically, multi-mic beamforming helps, but watch out for noise suppression algorithms and nonlinear echo cancellation schemes that were developed around the psychoacoustics of human perception rather than around deep-learned speech recognizers! Sensory partners with companies like Alango, Andrea Electronics Corporation, Bolom, DSP Concepts, Meeami Technologies, MightyWorks, Phillips, and Yobe to manage noise for a wide range of environments and usages.
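When specifying a use case, signal-to-noise ratio is the usual way to quantify "how noisy." As a reminder of the standard definition, SNR in decibels is ten times the base-10 log of the ratio of signal power to noise power. The sketch below uses made-up constant-amplitude samples purely to show the arithmetic:

```python
import math

def snr_db(signal, noise):
    """SNR in dB: 10 * log10(mean signal power / mean noise power)."""
    p_signal = sum(x * x for x in signal) / len(signal)
    p_noise = sum(x * x for x in noise) / len(noise)
    return 10 * math.log10(p_signal / p_noise)

# Hypothetical example: speech 10x the noise amplitude -> 100x the power -> 20 dB.
speech = [1.0] * 100
noise = [0.1] * 100
print(snr_db(speech, noise))
```

A dictation engine spec’d at 20 dB SNR on a close-talking mic will behave very differently at 5 dB across a noisy room, which is why the question has to be asked up front.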

Who can do embedded dictation?

Obviously I believe Sensory can, and that’s because we take all the factors shared above into consideration when customizing a solution. To find out more, contact us today.