Last week we shared the recording of our latest webinar: Build Quality Voice & Vision AI Solutions with Sensory & SensoryCloud. This week we’re excited to share the transcript of that webinar, made with the SensoryCloud speech-to-text engine! It has been edited for clarity in some areas, but we’re thrilled with the results. So if you’d prefer to read the content of the webinar, or skim it while watching the video on YouTube, keep scrolling. This transcript is for you!
Anu Adeboje (Moderator):
Good morning, good afternoon, good evening. Thank you, everyone, for joining today’s webinar. We’ll get started in just about a minute, as we wait for a few more people to join.
As I go through the agenda, members of our panel/team will wave, so you know who they are. We’ll begin with an introduction to Sensory from our CEO Todd Mozer, followed by building embedded voice user interfaces with VoiceHub from Jeff Rogers.
And then Bryan McGrane will provide an introduction to the SensoryCloud and all its features, and I’ll do a dive into SensoryCloud’s speech-to-text.
Jonathan Welch will provide an overview of face biometrics, both embedded and in the cloud, and then we’ll continue with Bryan Pellom, who will provide an introduction and overview of voice biometrics and sound ID, both embedded and cloud. And finally, we’ll wrap up with some exciting live demos and a Q&A.
Without further ado, I’ll hand it over to Todd.
Todd Mozer:
Thanks, Anu. I’m Todd Mozer, Sensory’s president, founder, and CEO. I started Sensory over twenty-five years ago to allow people to communicate with products very naturally, the way we communicate with each other using sensory functions. In fact, the company was started as a chip company with an inference engine on a very low-cost chip that was running a neural network.
So, kind of like what people are doing today; we were a good twenty-five, twenty-seven years ahead of our time. We’ve been very successful at getting our technologies out into a lot of products. We’ve shipped in over three billion products over the years, from hundreds of different companies. In fact, I think one of the marketing taglines to join our webinar today asked why Google and Amazon and all these giant companies, and, you know, the company in Cupertino, have licensed our technology. All these giants are great at AI, so why do they choose Sensory? The quick answer is that it’s a combination of a few very important things. There’s accuracy: we’re super accurate. We allow privacy, because we can do it on device. We have very, very low-resource engines, with low power and low heat. We support a whole lot of languages and platforms. And over the last year we’ve rolled out a really state-of-the-art cloud solution. So the technologies that we have are really good, and we’ll show that to you today.
So let me talk a little bit about the big picture here, about the features of voice and the types of solutions, and I’ll use one of my favorite products on the market, the Amazon Echo with Alexa, as an example of how these things come together. Sensory has a variety of different solutions where we put domain-specific assistants into devices. But I love the Echo, which was introduced back in, I think, 2014, and it’s been one of my favorite products. I have about five of them in my house. When you use it, you start with a wake word. You say “Alexa” to it, and that wakes the device up, and then you can do a cloud-based revalidation of the wake word. Amazon does this. In fact, when you see the light go off but then it doesn’t respond, what that usually means is that the product false fired on the wake word, and the cloud decided that the things you said after that weren’t relevant enough, so it didn’t go forward. So there’s this embedded and cloud working together, and Sensory offers both. We’ll show you different solutions on the embedded side and the cloud side, and the hybrid kind of solutions as well.
You can add biometrics to the wake word if you like and Amazon does this.
For example, if I say “hook up to my phone and play my music”, it knows who I am by my voice. When I ask it to play the news in the morning, it tells me I can do that, and it calls me by name. That way, if my wife uses it or my son uses it, it can address them, too, by training on the biometrics of our voices.
Now at Sensory, we do voice, and we do face biometrics. Natural language is another component in the stack, because after you’ve said the wake word and the device has identified you, it needs to figure out the intents: what you want to do and what you are trying to accomplish. Large vocabulary engines that are typically domain specific accomplish this task.
When Amazon first rolled out the Echo, it was very, very good at the music domain. I basically used it for playing music and for setting timers. Those were kind of the most popular purposes for it, and they were very, very good at that. Over time they’ve added more and more domains to their stack, so it can do more and more different types of things.
And then there’s speech-to-text and text-to-speech output on the SensoryCloud side of things.
We do both the tiniest speech engines in the world, which were used in answering machines back in the days when we had answering machines, and cloud-based text-to-speech that’s really state of the art in quality.
A final type of technology that can be deployed in these systems is sound ID. Amazon does this with their away mode, which listens for glass breaking. You can set this away mode, and if it hears glass break, it can respond to that and notify you.
Next slide, please.
Let me talk a little bit about Sensory’s product line, and I’ll go through this very briefly because you’re going to get demonstrations and examples of each and every one of these. We have both the embedded side and the cloud side of things. On the embedded side we brand it with the Truly name: TrulyHandsfree, TrulySecure, and TrulyNatural. TrulyHandsfree is a wake word engine that can do multiple wake words in parallel and can do small command sets. It’s highly accurate, super tiny, and very robust to noise. It has a micro footprint that we’ve put on all sorts of DSPs and microcontrollers, so it really runs everywhere. TrulyNatural is the next step up. It’s a large vocabulary engine where we can do things like statistical language modeling. We can do domain-specific assistants, and we’ve developed a variety of NLU technologies, which can be very, very tiny or larger, for intents, entities, and understanding what a person wants when they speak to it. Then the third embedded product we have is TrulySecure, and that’s both face and voice biometrics. We’ll show you some really neat demos of this in the upcoming discussions.
On the cloud side of things, we do everything that we do embedded, but we do it on steroids, because we have fewer limits and more resources to work with. So we have a very, very state-of-the-art speech-to-text engine and text-to-speech engine, and we can do all these sorts of technologies with higher accuracy and super fast response times. We’ll give you some examples of our cloud and walk you through what the cloud is in the upcoming demonstrations and talks. Up next is Jeff Rogers. He’s going to use one of our tools, called VoiceHub, and that will allow you to see examples of all of our embedded technologies.
Jeff Rogers:
Great, thanks. Good morning, afternoon, or evening to all.
We created VoiceHub as a tool to let companies and developers very quickly create projects and proofs of concept. In the past, and with other solutions, if you wanted to create a custom wake word, there was development time and cost. A command set and a natural English grammar take a lot of development time and cost, and data, data, data. You’ll hear “data” all the time. Because we’ve been in business for so many years, we’ve collected a lot of different data, and so we came up with this proprietary approach to synthetic data. Data is really important, obviously, because that’s how you build a model, and that’s how you balance between false accepts, where I’m talking and it says “oh, I think I just heard the wake word”, and false rejects, where I say the wake word and nothing happens. So using this synthetic data approach, we can now allow developers and companies to create custom wake words, and you can create one to ten different wake words that are always on, always listening.
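Editor’s illustration: the balance between false accepts and false rejects described above can be sketched in a few lines of Python. This is not Sensory’s training or tuning method. The scores are made up, and `error_rates` is a hypothetical helper that only shows how moving a detection threshold trades one error type against the other.

```python
def error_rates(target_scores, impostor_scores, threshold):
    """Return (false_reject_rate, false_accept_rate) at a detection threshold.

    target_scores: detector scores for audio that really contains the wake word.
    impostor_scores: detector scores for audio that does not contain it.
    """
    false_rejects = sum(1 for s in target_scores if s < threshold)
    false_accepts = sum(1 for s in impostor_scores if s >= threshold)
    return (false_rejects / len(target_scores),
            false_accepts / len(impostor_scores))


# Made-up scores, for illustration only.
targets = [0.91, 0.85, 0.78, 0.66, 0.95, 0.59, 0.88, 0.72]
impostors = [0.10, 0.32, 0.61, 0.45, 0.05, 0.70, 0.22, 0.15]

for threshold in (0.5, 0.65, 0.8):
    frr, far = error_rates(targets, impostors, threshold)
    print(f"threshold={threshold:.2f}  false rejects={frr:.1%}  false accepts={far:.1%}")
```

With these toy numbers, a low threshold gives no false rejects but some false accepts, and a high threshold gives the reverse. More and better data lets a model separate the two score groups further, which is why data matters so much for this balance.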
And even if there’s noise before the wake word or command set, whether speech or just background noise, it still does a very good job of spotting it, so it has a very natural language feel to it. Then there’s our TrulyNatural engine, which is our large vocabulary engine. All of this can be done for free using VoiceHub.
You just go to our website, go to Resources, and you’ll see VoiceHub, and you can request access, absolutely free. So let me pop out of this and do a screen share here. I’ll show you what VoiceHub looks like, and I’ll do a couple of quick demos.
Okay, so on the left side of my screen is my phone; I’m just screen sharing my phone. This is an Android phone. We also support iOS in this tool. On the right side you can see the VoiceHub tool. To start, I can create a new project, and you can see I can select a custom wake word; wake words and simple commands, which is TrulyHandsfree; or TrulyNatural, our large vocabulary engine. So I could just select a custom wake word as an example, and then go in here and build it out. The first thing I can do is name it whatever I want. I select my language, and you can see that there are a lot of different languages already included in VoiceHub, from a lot of different Englishes to other European languages and Asian languages as well, and we continue to add languages to this.
And you’ll note that, for example, we’ve got English adults. We also have English kids, so if you’re doing products where kids are going to speak to it, we have a different model for kids. We have different models for UK English, Indian English, and Australian English, as an example. I’ll just stick with US English. Then there’s the initial size.
This is the size of the model that you’re going to be building. Now, in a wake word, the larger the model size the better, of course, but depending on what platform you’re running on, you might need a much smaller size.
It also depends on the use case. Todd was talking about the Amazon Echo as an example. That’s one typical use case, where there is distance. But if I’m doing a TWS product or a wearable, well, then I’ve got a much closer signal and a much better signal-to-noise ratio, so I can do a much smaller model. You can pick here, as you can see, between eighty kilobytes and about a megabyte. Then choose the output format. The default is just a Sensory standard SDK, TrulyHandsfree or TrulyNatural.
But many of our close chip partners are also listed in here, as well as IP cores. So you could develop for the Arm Cortex-M4, or we also have the M7 supported in here; Cadence, as far as the HiFis go; or chips specifically, like Ambiq or Cirrus or ST or Qualcomm. You can see a lot of different chip partners listed here.
So when you choose one of those, whatever you create in VoiceHub will automatically be formatted for that platform.
So then you go down to the wake word and you just type in whatever wake word you want. I can just type in “Sensory” as an example, and then I hit enter. You see the little speaker icon here. If I press that, you’ll hear a TTS engine that plays back what the recognizer is expecting. So it has a little bit of debug built right into the tool, which is handy. Now, creating a single wake word will typically take about 45 to 60 minutes, depending on how busy the VoiceHub servers are at the time. The reason is that we’re balancing between false accepts and false rejects with the tons and tons of data that I mentioned.
So let me jump to a project that I already created, since we don’t have an hour to wait for the wake word, and this is one that uses TrulyHandsfree and phrase spotting commands.
So in my house I’ve got a GE dishwasher, and on the front you’ve got these little buttons and this tiny, tiny little text that I can no longer read without my reading glasses on. I thought, you know, this is a perfect example of where I’d love to add a voice user interface.
So I name my project and choose my language. I’ve already created a custom wake word here. I choose the size that I want, and you can see I’m doing a 147 KB size. I’ll just use our standard output here, and then I type in the commands. These are the commands that are on the front of my dishwasher, so there’s nothing special about them. I kind of wondered how “sani rinse” would work, and again, by pressing the speaker button, I can hear that, yeah, this is correct. Once it’s all done, I build it, and once it’s built, I can either test directly on the PC or Mac here, or download it and test it on my mobile phone. That’s the one I like to do, so this pops up a QR code that I can scan.
Now you’ll notice that the size is 1.2 MB, and that’s probably because the GE wake word I made was about 1 MB. The other model you saw was very small, so the whole thing can be small or larger.
It’s totally up to you. So then on my phone I scan the QR code here, and you’ll see that it scans it right in, and now I’ve got my wake word over here. You can see the green light bouncing, so the microphone’s on, and I can just say, “Hey GE, I want pro wash right now.”
See, even though I said the wake word and spoke through the whole thing, it still spotted that pro wash. And again, with natural language, people might say something like “uh, GE, express wash”. So even though I said that “uh” before, it doesn’t destroy our recognizer. I mean, that’s pretty awesome and amazing. This is all part of TrulyHandsfree.
With just a couple more minutes here, let me show you a TrulyNatural project. This is obviously very different because it allows me to construct a natural language grammar, so this is a home automation demo.
Todd was talking about how he uses Amazon Echo for different things.
Well, you can do the same thing with Sensory, but even better than Amazon. We can do custom wake words, and we can do custom commands in a domain that is specific to your product, your customers, and your target market. So here I name it and choose my language. I’m choosing a size of 256 KB for the acoustic model, which is pretty dang small. You can see there are a lot of different options here, from 80 KB up to 8 MB. And then if I want a background model, it goes up to 100 MB. The background model is basically a huge vocabulary of words that fill in the gaps between the things that you might say. I’ll use the same output format. I’m using a 200 KB Voice Genie wake word here. Then you also have this out-of-vocabulary rejection setting, so I can be less rejecting or more rejecting. Again, this looks linear, although it’s really not. Changing from twenty to ten, as an example, is a pretty significant change, and I found that twenty usually worked pretty well.
So then I come down here, and you’ll see intents, slots, and phrases. An intent is basically the thing I want to do. Todd talked a little bit about that. In my demo here, I want to do lighting control, scene control, security control, temperature control, and window covering control. I just made all these up, and just by hitting the plus key I can add more intents. A slot is like a bucket of words or phrases that might be said. You can see I’ve got raise and lower and rooms and scene, and all these different things I might say. As an example, if I go into the room slot, I just add in the different rooms in my house, including some of my kids’ rooms. If I want to add a Sensory room, I can just type in “sensory”. You’ll notice here that it’s page one of one, but I can have tens of pages, hundreds or thousands of pages, so it really can be as small or as complex as you want to make it.
The other thing to point out is these predefined slots. These are pretty handy, because for things like temperature we’ve already got a temperature slot for an oven setting or a thermostat. In my case here, I used the thermostat grammar, so I didn’t have to write and create it myself.
Under the phrases, it’s all drag and drop. I drag down the intent that I want, and then you can see how I have “set temperature”, and then I’m using the predefined temperature slot and the rooms. You’ll notice there are a lot of question marks. The question mark means that the preceding word or slot is optional, so I could say “set”, and I could say “temperature”, but I don’t have to. I could just say “seventy five degrees in the master bedroom”. Lighting control is another example: I’ve got “turn on” or “turn off”, then “the bedroom lights” or “the lights in the bedroom”.
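Editor’s illustration: to show how optional words and slots multiply into many phrases, here is a minimal Python sketch. It does not use VoiceHub’s actual grammar format. The template syntax (a trailing `?` for optional tokens, braces for slots), the `expand` function, and the slot values are all assumptions made for this example.

```python
from itertools import product


def expand(template, slots):
    """Expand a phrase template into every utterance it covers.

    A token ending in "?" is optional. A token in braces, such as "{room}",
    is replaced by each value of the slot with that name.
    """
    choices = []
    for token in template.split():
        optional = token.endswith("?")
        if optional:
            token = token[:-1]
        if token.startswith("{") and token.endswith("}"):
            values = list(slots[token[1:-1]])
        else:
            values = [token]
        if optional:
            values = values + [""]  # the empty string means "skip this token"
        choices.append(values)
    phrases = set()
    for combo in product(*choices):
        phrases.add(" ".join(word for word in combo if word))
    return sorted(phrases)


# Hypothetical slot contents, loosely modeled on the demo.
slots = {
    "temperature": ["seventy degrees", "seventy five degrees"],
    "room": ["bedroom", "family room"],
}
phrases = expand("set? temperature? to? {temperature} in the {room}", slots)
print(len(phrases), "phrases, for example:")
for phrase in phrases[:3]:
    print("  " + phrase)
```

Three optional words, two temperatures, and two rooms already give 32 distinct phrases from one template, which is how a modest grammar can cover thousands of utterances.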
So again, this allows you to build out and cover all the different variations of things that you think people might say to the product. I did the same with security, scene control, window covering, and whatnot. So I come down here and I build it. Now, this builds very quickly. This is all a phonetic recognizer up here at this point, and I’ve already got the wake word built. There’s a coverage tab, which will randomly generate things that you could say from the grammar I’ve created here. Again, I’ll go to a quick download. And you’ll see that the total size of everything, with literally thousands of different things that I could say, is about six hundred kilobytes.
So then again with my phone, oh, and by the way the whole time I was talking there were no mistakes over here. It never thought I accidentally said that wake word or any of these commands.
So just really awesome again.
So I scan the QR code, and that quickly loads into my phone. On my phone you can see this auto-generated grammar; there are literally thousands and thousands of different things that could be said. So now with this done, I can say things like “Voice Genie, set temperature to seventy degrees in the family room.” Super fast, and this is exactly what I said. You’ll also notice that not only do I get the result here, but I get the NLU, or natural language understanding. So it basically knew it was a temperature control, what I was setting it to, and that it was in the family room.
“Voice Genie close the master bedroom shades”, “Voice Genie select dinner scene”, “Voice Genie, disarm the security system”.
I created all this in probably twenty or thirty minutes, and you can see it’s extremely fast, and very accurate.
VoiceHub has been a success, with more than a thousand developers and companies using it today, and it really speeds up product development, ideation, and proofs of concept.
Let me turn it over to Bryan McGrane, who’s going to talk more about our cloud, and if you think these demos are great, wait till you see the demo that he’s got to share, it’s pretty awesome.
Bryan McGrane:
Thank you so much, Jeff. I appreciate that, and the wonderful demos as always. VoiceHub is a really fantastic product.
So everybody, my name’s Bryan McGrane. I’m the lead cloud architect here at Sensory, and I’ve had the privilege of taking all of the amazing technology that Sensory has developed over the last twenty-five years and putting it into a cloud product. As Todd mentioned, yes, we offer pretty much everything that our embedded products offer, but on steroids, and there are a couple of extra things that the cloud can offer in addition. To give you an idea of the features that we offer: speech-to-text is by far our most compelling product, because we can handle streaming at a higher accuracy than Google and Microsoft, as proven by a third party. We support 17 languages, and, amazingly, our team is able to crank out a new language almost every week or two, so we’re adding more every day.
Our face biometrics support single-frame passive liveness, and Jonathan Welch is going to talk a bit later about what that really means. We are highly resistant to attacks by people trying to impersonate somebody else, and we are glasses invariant.
For voice identification we support active liveness, which basically means that we can ensure that somebody isn’t playing a recording to try to act as another individual. We can handle text-dependent and text-independent authentication, which just means we can authenticate you whether you say a specific phrase or anything you want.
For sound identification, we have custom sound recognition, which is going to be demoed, so you’ll see that. We support hundreds of sounds in many different domains, and we can even support enrolling custom sounds: maybe, you know, doorbells or certain sounds that you might want to recognize in your applications. And finally, we have text-to-speech across four different languages, and we support the creation of custom voices in case you want to have your own.
Now, what’s really exciting about SensoryCloud is that a cloud-based system has a lot of advantages. One major advantage, of course, is the horsepower. We have access to GPUs, which means we can use very sophisticated and very large models to perform all of the features that I showed on the prior slide. But we are also leveraging open source technologies such as the Nvidia Triton server, which acts as our inference engine.
What that means is that all our data scientists have to focus on is creating models in the frameworks that they choose, PyTorch for instance. We can take all of those models and drop them right into the Nvidia Triton server, and that allows us to rapidly release and iterate on our models.
For instance, if you had a model that was performing well but you requested some improvements, we could create a new model for you and deploy it to your server within five minutes. So it’s extremely easy for us to deploy new models into the field, and the fact that we wrote everything in Go means it’s super fast and super secure. If your developers are interested in communicating with SensoryCloud, we offer SDKs in eight different languages, and we can add more at your request; the languages that we support are in that first bullet. We also have a security server that handles OAuth. I have a security background, and I could probably bore you to tears with that. But just know that we have built SensoryCloud to be secure and private, and, most importantly, data is not persisted. All of the inference and all of the processing of images and audio can be done on your servers, and we never see any of it and we don’t want to see any of it, unlike a lot of competitors. Finally, we are fully SOC 2 compliant and pen tested, so you can feel very comfortable using this product. What’s really exciting is that I’m going to show you this automotive demo that Bryan Pellom put together, and he’ll be talking in a little bit.
But what’s really cool about this automotive demo is that it takes all the individual pieces of SensoryCloud and combines them together into a singular product. What’s really neat about our cloud is we’ve created these really amazing verticals of technology. And so when you start to combine them and mix and match them, you can make some really incredible things like this. So you’ll notice a local wake word detection. You’ll notice text-to-speech, speech-to-text, natural language understanding, like Jeff had mentioned and yeah, it’s basically going to act as if you were in a car controlling the navigation system and a few other components.
One really important thing to note is that, while I have mentioned that a lot of things run in a cloud somewhere with some remote GPU, for automotive specifically we can run our entire cloud on-device in the car itself, and we have. I’ll talk about this in a slide afterward, so stay tuned for that.
Voice Genie, turn on my headlights and roll down the front passenger window.
Your low beams are turned on. Opening the front passenger window.
Voice Genie, navigate me to fifteen thirty six Pine Street, Boulder, Colorado.
Starting navigation to fifteen thirty six Pine Street, Boulder, Colorado.
Voice Genie, what’s my battery level and what’s my range?
Battery level is at ninety nine percent, the remaining range is one hundred and ninety eight miles.
Voice Genie, send a message to Todd Mozer.
Sending message to Todd Mozer, what’s your message?
Hey Todd, I’m running fifteen minutes late, I’ll see you soon.
Your message is Hey Todd, I’m running fifteen minutes late, I’ll see you soon. Do you want me to send it?
Voice Genie navigate me to the nearest charging station.
Starting navigation to the nearest charging station.
Voice Genie turn on cruise control and set my cruising speed to seventy.
Cruise control activated. Cruising speed has been set to seventy miles per hour.
Voice Genie increase cruising speed by five miles per hour.
Cruising speed has been set to seventy-five miles per hour.
Voice Genie, set the temperature to seventy-two degrees.
Your vehicle has been set to seventy-two degrees Fahrenheit.
Play Hotel California by the Eagles.
A song titled Hotel California by the Eagles cannot be played, but a different song can be played instead. Playing smooth jazz music from Internet radio.
Voice Genie, set the volume to seven.
Setting volume. The volume has been set to seven.
Voice Genie, stop music.
What’s really neat about that demo is that all of the major components of that product are running on a Raspberry Pi, a tiny little underpowered device, and all the requests are going out to a cloud that’s in Oregon, which is about six or seven hundred miles away from where that demo was located. And you could notice how snappy it was.
So our cloud is not only handling all of those requests and all of those different technologies, but it’s doing so in a highly efficient, highly performant way.
Now, like I mentioned before that demo, we can also deploy SensoryCloud on what we call big embedded. For instance, one of these pieces of technology is the Nvidia Jetson platform, which is kind of like a Raspberry Pi with a GPU, if that’s the way to put it. It’s quite small, maybe the size of a couple of credit cards, and we can deploy all of our stack, all of our technologies, on the Jetson itself. We offer smaller models as well, for things like the Jetson Nano, so we can get a highly accurate speech-to-text system working on a Jetson in under 30 MB. Typically our cloud models are on the order of hundreds of megabytes, so, as you can imagine, this is still a pretty small device, and that’s why we call it big embedded.
So now I’m excited to hand it over to Anu, who is going to be demoing, probably my favorite technology in the cloud and our most powerful, that is speech-to-text.
Anu Adeboje:
Hello again, everyone, I’m Anu. Thanks, Bryan, for that exciting demo. As Bryan mentioned, the SensoryCloud speech-to-text engine is rich with features and capabilities for a variety of applications. Not only is it available in over sixteen languages, it’s capable of stream or batch processing, it’s noise robust and highly accurate with fast response times, and it’s scalable and efficient.
Other unique features include the ability to add custom vocabulary for proper names and brand identity, and real-time attribution of punctuation and capitalization. It’s highly contextual, meaning it’s able to use the context to recognize the difference between words like “cite” and “sight” when you say something like “I’m citing this reference” or “I’m losing sight”.
So I’ll show those features in an upcoming demo, but before we get to that, let’s talk a little bit about the performance and the accuracy.
Early third-party testing using hours of tech talks demonstrated that the SensoryCloud speech-to-text engine achieved best-in-class performance, comparable to and exceeding well-known speech-to-text cloud services in normal conditions. The third-party test house used identical test data, with no company having access to customized language models, which can affect results. So that’s really great: straight off the bat, in early testing, we’re doing really well. I can tell you all about it and go on and on about my favorite features, but I think it’s best that I just show you in this demo. Sit back and watch.
SensoryCloud has one of the most accurate and lowest latency speech-to-text products in the world. Not only can it handle seventeen languages and counting, but it is one hundred percent private, secure, and can be deployed in your cloud.
It handles punctuation, capitalization, and a whole bunch of other amazing features that you just won’t get anywhere else.
Taking a look at the text that was generated, we notice a couple of things. First, SensoryCloud was transcribed as Sensory Cloud, because SensoryCloud itself is a proper noun that our speech-to-text engine doesn’t recognize.
We can fix that by adding custom vocabulary words. For instance, we can say: put down SensoryCloud when we hear “century cloud” or “Sensory Cloud” or things close to SensoryCloud.
In that way, our speech-to-text engine knows how to transcribe it correctly.
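Editor’s illustration: the idea of mapping sound-alike phrases to a preferred spelling can be sketched as a text post-processing step. This is not how SensoryCloud implements custom vocabulary, which works inside the speech-to-text engine, and it is not the SensoryCloud API. The `apply_custom_vocabulary` function and the dictionary layout are hypothetical.

```python
import re


def apply_custom_vocabulary(transcript, vocabulary):
    """Replace sound-alike phrases with the preferred spelling.

    vocabulary maps a preferred spelling to the phrases that should become it.
    Matching is case-insensitive and only on whole words.
    """
    for preferred, sounds_like in vocabulary.items():
        for phrase in sounds_like:
            pattern = r"\b" + re.escape(phrase) + r"\b"
            # A function replacement avoids any special handling of
            # backslashes in the preferred spelling.
            transcript = re.sub(pattern, lambda match: preferred,
                                transcript, flags=re.IGNORECASE)
    return transcript


vocabulary = {"SensoryCloud": ["century cloud", "sensory cloud"]}
print(apply_custom_vocabulary("Century cloud can be deployed in your cloud.", vocabulary))
# prints: SensoryCloud can be deployed in your cloud.
```

A real engine can do better than this sketch because it can bias recognition toward the custom word while decoding, instead of patching the text afterward.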
The next thing we’re going to notice here is one of our words was highlighted orange. This is because our engine was not entirely sure as to the accuracy of that word, but it printed it out anyway.
And each one of these words, as you’ll notice when I hover over them, has an accuracy number that indicates our engine’s confidence in that word. One last thing you’ll notice is that the number seventeen and the number one hundred were transcribed as words.
If you want these to be transcribed as numbers in your application, that’s called inverse text normalization, and it’s a feature coming to SensoryCloud in the next couple of months.
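Editor’s illustration: inverse text normalization turns spoken-form numbers such as “seventeen” into written-form digits. The sketch below is a deliberately tiny version written for this transcript, not the SensoryCloud feature. It assumes whitespace-separated words with no attached punctuation, handles only simple English numbers built from units, tens, and “hundred”, and its function names are hypothetical.

```python
UNITS = {word: value for value, word in enumerate(
    "zero one two three four five six seven eight nine ten eleven twelve "
    "thirteen fourteen fifteen sixteen seventeen eighteen nineteen".split())}
TENS = {word: 10 * value for value, word in enumerate(
    "twenty thirty forty fifty sixty seventy eighty ninety".split(), start=2)}


def words_to_number(words):
    """Convert a run of number words, such as ["one", "hundred"], to an int."""
    total = 0
    for word in words:
        if word in UNITS:
            total += UNITS[word]
        elif word in TENS:
            total += TENS[word]
        elif word == "hundred":
            total = max(total, 1) * 100
    return total


def inverse_text_normalize(text):
    """Replace each run of consecutive number words with digits."""
    output, run = [], []
    for word in text.split():
        key = word.lower()
        if key in UNITS or key in TENS or key == "hundred":
            run.append(key)
            continue
        if run:
            output.append(str(words_to_number(run)))
            run = []
        output.append(word)
    if run:
        output.append(str(words_to_number(run)))
    return " ".join(output)


print(inverse_text_normalize(
    "it can handle seventeen languages and is one hundred percent private"))
# prints: it can handle 17 languages and is 100 percent private
```

A production system needs much more than this, because context matters: in “one of the most accurate”, the word “one” should stay a word, and this sketch would wrongly turn it into a digit.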
So now that we have our custom vocabulary set, let’s give it a second try. SensoryCloud can be tested for free at a demo found at SensoryCloud.ai. Go check it out and try it for yourself.
Thank you very much.
I just love that automatic punctuation and capitalization feature! I don’t know about you, but when I dictate text while I’m driving with CarPlay, I’m always embarrassed to see what the text looks like later. I would just love to see a feature like that on my iPhone.
Now I’m going to hand things over to Jonathan, and he’ll tell you all about face biometrics.
Jonathan Welch:
I’m Jon Welch, director of vision technologies here at Sensory, and I’m going to talk to you a little bit about our face authentication and liveness detection products.
So face authentication is something that most people are probably familiar with at this stage. Sometimes you hear it called face ID or facial biometrics. It’s a technology where you can take an image of somebody’s face and process it in such a way that you get a unique identifier for that individual. Our particular face authentication product has been trained with a number of different types of data that we’ve collected over the years. We have an application in the Android store that you can download right now, called AppLock. AppLock enables you to secure access to individual applications using our biometric technology. Because we distribute it for free, we have an opt-in program where you can share your session data, and all that session data gives us access to a very demographically unbiased set of facial biometric information. We’ve accumulated this data from all sorts of international markets and across a wide variety of devices. So whether you’re running very low-power devices with very low-resolution camera sensors or a state-of-the-art smartphone, we have data for it in our data set, and that helps our models become very robust across all of those variations. Our system is very robust and very strong, and what I’ve shown here is a very high security threshold. Two numbers are shown here, the false reject rate and the false accept rate. For those who are not familiar, the first question in this context is: if an impostor is trying to authenticate as somebody else, how often will you see that impostor’s biometric actually bypassing the system?
That’s what we call the false accept rate, so at our highest security threshold, 1/500,000 means that if I am an impostor trying to authenticate to somebody else it will, on average, take me 500,000 attempts before I will get one system to get through or one frame to get through. And typically when you have a false accept rate, then of course you have a paired false reject rate. So that is the opposite, which is, if I am the person that I’m trying to authenticate, how often will the system reject my true biometric? And at this security threshold we say five percent, and what that really means is that, five out of every hundred attempts, you may see a false rejection. But from a customer user experience perspective, if you’re running, even if something as slow as five frames a second, your users would never notice that you had to collect an additional frame in case you got a false recheck, so it’s very robust.
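To make the arithmetic behind those two numbers concrete, here is a minimal sketch. It uses the figures quoted in the talk (a false accept rate of 1/500,000, a false reject rate of five percent, five frames a second) and adds one assumption that is not from the talk: that each frame is an independent trial. Real consecutive camera frames are correlated, so treat the results as rough intuition rather than a measured property of the product. The function names are made up for illustration.

```python
FAR = 1 / 500_000   # false accept rate quoted in the talk
FRR = 0.05          # false reject rate quoted in the talk
FPS = 5             # "as slow as five frames a second"


def expected_impostor_attempts(far: float) -> float:
    """Mean number of independent attempts until the first false accept.

    Under the independence assumption the attempt count follows a
    geometric distribution, whose mean is 1 / far.
    """
    return 1.0 / far


def prob_all_rejected(frr: float, frames: int) -> float:
    """Probability that a genuine user is rejected on every one of
    `frames` independent frames."""
    return frr ** frames


attempts = expected_impostor_attempts(FAR)
print(f"Expected impostor attempts per false accept: {attempts:,.0f}")

for frames in (1, 2, 3):
    seconds = frames / FPS
    p = prob_all_rejected(FRR, frames)
    print(f"{frames} frame(s), {seconds:.1f} s: P(genuine user still rejected) = {p:.6f}")
```

Under these assumptions, a genuine user who is rejected on the first frame only needs one more frame, a fifth of a second later, which is why a five percent per-frame reject rate is hard to notice in practice.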
It’s highly secure, and we have a variety of different thresholds that you can authenticate across. For face authentication, we have two types of deployment. We have our traditional embedded deployment, in which everything, including the data and the authentication system, is on device. So if you have a FIDO-compliant application, this is one way to add biometrics to it. We also have the cloud deployment, which we call in cloud, on prem, or big embedded, and all three of these use our cloud stack. In this scenario you get to use those super thin client SDKs that Bryan McGrane mentioned. But you also get an additional benefit that you don’t currently get with embedded: if you want to do cross-device enrollment with your biometrics, you can do that with the cloud. That means I can enroll on my phone and then go to a workstation or an access panel or anything like that and authenticate with the same enrollment that I had before.
Additionally, all of our cloud communications are end-to-end encrypted, so everything is still private in that regard. The next thing that we have is called liveness detection or spoof detection. When you have a biometric system, it’s really important that you are robust against what are called replay attacks or presentation attacks. This is where I might take an image, you know, a photo or a computer screen or something like that, of the person I’m trying to impersonate and show that to the biometric system. Very often you’ll find that, without proper spoof detection, that biometric replay attack will get through.
And so we have a state-of-the-art approach to doing this liveness detection and we’re very robust against a variety of different attacks, whether they’re photo cutouts or what we call depth attacks. Depth attacks are trying to give the perception of depth to the camera, or even these hyper realistic 3D masks, like you see in the lower right corner here.
This is something that was created by hand by an artist from our CEO’s face, and so we’re robust against a variety of different attacks like this. Again, we have two deployment modes for this. We have an on-device model, which is multi-frame, meaning it requires more than one frame to be able to process the liveness signal, but it’s a much lower compute, much lower power model. Or you can use our in-cloud model, which is much more sophisticated and actually operates on single frames.
So that’s the high-level overview of our vision biometric systems, and at this point I think we’ll hand it over to Bryan Pellom.
Hi, I’m Bryan Pellom. I’m vice president of emerging technologies at Sensory, and today I want to talk to you about voice biometrics and sound ID. We’ve been producing voice biometrics for over a decade now and, as Todd mentioned earlier, they help your product to know your user. There are other use cases for enhanced security or customization, in all kinds of ways. You may be using it to access a bank account, or you might be using it simply to know which playlist to play. If I say, “Hey Sensory, play my favorite songs”, well, who said that?
We offer three biometric types. The first is an enrolled wake word: if you say the wake word “Hey GE”, as we saw earlier, you can extend it with biometrics and know who said it. You can also allow your customers to select and define their own passphrase. If I pick my own passphrase, say my favorite football team, Denver Broncos, that could be my passphrase as well.
We also support a text-independent mode, where I can enroll my voice by saying a few sentences and then it can identify me by anything I say. Jeff will show you a live demo of that at the end of this webinar. Regardless of your use case, whether it be security or customization of your app, we offer a large degree of flexibility in how you integrate our solution into your products. We have high security modes that really focus on a low impostor accept rate, and we also have low security modes that really dial in low false rejects, where we want to identify the user without much friction. Whichever you pick, it’s really up to you to adjust to suit your needs. We also support some interesting added options, such as active voice liveness, and again, Jeff will show you a digit recognition based voice liveness later in this presentation.
In terms of accuracy, we are always trying to balance the false reject rate against the impostor accept rate. The operating point where the two are equal is what we call the equal error rate, and lower is better for these biometric systems. We also look at things such as the false accept rate: how often does a continuous listening system accidentally trigger if it’s listening twenty-four by seven, all day long? You have probably experienced false triggers on your TV set or your Alexa product as well.
So we do measure that and we report it. In terms of our biometrics, we typically see between one and three percent equal error rate, and it depends on the amount of control you have in your product. If you know the wake word, and it’s a branded wake word such as the “Hey GE” we saw earlier, then we can build you a very customized wake word model with the lowest equal error rate possible. As you go out to very unconstrained environments, you give up a little bit in performance, but that may lead to very interesting and dynamic applications. Typically, when we integrate wake words, we do wake word detection and then layer biometrics on top of it. This allows us to constrain the false accepts in our systems. And then we could use the following query. As you saw earlier, Jeff said “Hey GE, standard wash” or “Hey GE, extended wash”; that “extended wash” command could also be used to instantiate a biometric. So, combining it all together, as you build your products think about the things that affect biometrics: the environment and noise, the distance of the microphone from the user, the enrollment audio quality, and the selection of mic in your products. The good news is that Sensory will work with you to build your products to get the best out of voice biometrics.
We often tune and optimize for specific products, so we’ll be looking forward to hearing about the products you want to build, tune, and optimize for your needs.
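For readers who want to see how an equal error rate is read off a set of match scores, here is a minimal sketch. Everything in it is an assumption for illustration: the scores are synthetic toy values, the function names are invented, and the simple threshold sweep is a generic textbook approach, not a description of how Sensory measures its systems.

```python
from typing import Sequence, Tuple


def error_rates(genuine: Sequence[float], impostor: Sequence[float],
                threshold: float) -> Tuple[float, float]:
    """Return (false reject rate, impostor accept rate) at a threshold.

    A score >= threshold counts as "accepted".
    """
    frr = sum(s < threshold for s in genuine) / len(genuine)
    iar = sum(s >= threshold for s in impostor) / len(impostor)
    return frr, iar


def equal_error_rate(genuine: Sequence[float],
                     impostor: Sequence[float]) -> Tuple[float, float]:
    """Sweep every observed score as a candidate threshold and return
    (threshold, eer) at the point where the two error rates are closest."""
    best_gap, best_threshold, best_eer = None, None, None
    for t in sorted(set(genuine) | set(impostor)):
        frr, iar = error_rates(genuine, impostor, t)
        gap = abs(frr - iar)
        if best_gap is None or gap < best_gap:
            best_gap, best_threshold, best_eer = gap, t, (frr + iar) / 2
    return best_threshold, best_eer


# Synthetic toy scores, invented for this example only.
genuine_scores = [0.9, 0.8, 0.75, 0.6, 0.4]
impostor_scores = [0.1, 0.2, 0.3, 0.5, 0.65]

threshold, eer = equal_error_rate(genuine_scores, impostor_scores)
print(f"threshold={threshold}, EER={eer:.0%}")
```

Raising the threshold above the equal error point gives the high security behavior described earlier (fewer impostor accepts, more false rejects), and lowering it gives the low friction behavior.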
On the sound ID side, we’ve built a really interesting set of technologies that take voice and face applications to the next level, where you’re actually sensing the environment and trying to understand what’s going on around you. Our solutions today work on-device or in our cloud. As you saw with SensoryCloud, we can detect things such as dog barks and baby cries, very discrete sound events. We support sixteen different sounds that are highly tuned, highly optimized. We can also enroll sounds, so if you have a particular sound that’s very unique to your product, or something that you specifically want to listen for, or a sound that your customers must hear, such as their own unique doorbell, we can do that with Sensory technology. We can also listen continuously and tell you things such as, is my customer listening to music right now? Maybe I shouldn’t interrupt them at this point in their interaction.
I’m happy to say that later this quarter we are going to release a new Sound ID system that can classify up to four hundred different sounds, listen to the environment, and tell you what it’s hearing. So I’m very excited to see a much more open idea of sounds in the coming months.
Just to summarize, our solutions are kind of unique. We have a low power first stage detector that listens for events. It filters out over 95% of the events, and then once we detect an event we reclassify it using deep learning technologies. We pair the two to get the best of low power and high accuracy. And again, here are some sounds that we can detect, just as an example. You’re going to see more of that from Jeff, who I’m going to pass this over to, and Jeff’s going to show you some very interesting live demonstrations.
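The two-stage idea described above can be sketched in a few lines. This is only an illustration of the general cascade pattern: the first stage here is a plain energy gate and the second stage is a toy stand-in function, both chosen for brevity. Neither is Sensory’s detector or model, and every name and number below is hypothetical.

```python
from typing import Callable, List, Optional, Sequence, Tuple

Frame = Sequence[float]


def frame_energy(frame: Frame) -> float:
    """Mean squared amplitude: a very cheap measure that runs on every frame."""
    return sum(x * x for x in frame) / len(frame)


def run_cascade(frames: Sequence[Frame], energy_threshold: float,
                classifier: Callable[[Frame], Optional[str]]
                ) -> Tuple[List[Tuple[int, str]], int]:
    """Run the cheap gate on every frame and the expensive classifier only
    on frames that pass it. Returns (detected events, classifier calls)."""
    events: List[Tuple[int, str]] = []
    classifier_calls = 0
    for index, frame in enumerate(frames):
        if frame_energy(frame) < energy_threshold:
            continue  # stage one filtered this frame out
        classifier_calls += 1
        label = classifier(frame)  # stage two: the costly model
        if label is not None:
            events.append((index, label))
    return events, classifier_calls


def toy_classifier(frame: Frame) -> Optional[str]:
    """Hypothetical stand-in for a deep learning sound classifier."""
    return "loud_event" if max(abs(x) for x in frame) > 0.8 else None


# 18 near-silent frames, one loud frame, one moderate frame (made-up data).
frames = [[0.01, -0.02, 0.01, 0.0]] * 18 + [
    [0.9, -0.85, 0.7, -0.6],
    [0.3, -0.3, 0.3, -0.3],
]
events, calls = run_cascade(frames, energy_threshold=0.05, classifier=toy_classifier)
print(events, f"classifier ran on {calls} of {len(frames)} frames")
```

The point of the design is that the expensive model only runs on the small share of frames the cheap stage lets through, which is where the power saving comes from.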
Great, thanks Bryan. We’re a little short on time, so I’ll go through these demos quickly, but know that you can access these demos from our site as well. So let me turn off my camera so I can use my camera for the demos, and then if I can just grab the screen here. So let me show you a couple of quick demos in the time that we have left. This is our biometric demo, and I’ll do a new enrollment just so you see the whole process. I’ve got face turned on, I’ve got voice turned on. I typically wear glasses when I’m looking at my screen, so I’ve got that on. You can see the liveness I’m setting here; I’ll just leave that at medium. I can pick the language I want for the voice recognition, because this is going to have active liveness with voice, which is super cool, and then I just type in the name of this demo that I want. So let me do a quick enrollment. Actually I did that backwards, but that’s okay.
Eight, seven, four, six, one, zero.
Six, seven, zero, four, zero, nine.
Five, eight, four, five, zero, zero.
One, seven, two, six, six, zero.
So now I’m enrolled with both face and voice, so here’s my Sensory demo that I just did. If I want to do face only, you’ll see as soon as the camera turns on it’s like instant: face matched, liveness detected. And if I use the creepy photo of myself that my daughter loves so much and I put that in front of the camera, I’m like, oh yeah, this is me, and I’m moving around. I mean, this is a high quality photo as far as printing goes, and you see that it wasn’t matched and there was no liveness detected, but with the real me it’s almost instant. I mean, it’s really awesome.
Then, if I want to do face and voice together, I’ll combine these together, you’ll see how quickly the camera says, yep, that’s me, and then I read out the digits and it says, yeah, that’s actually me saying the digits.
Nine, one, six, one, six, six. So I spoke those digits very quickly. It matched my face, it detected liveness, and it passed active liveness. So if somebody had a recording of me, they obviously couldn’t get in. Now we go down to the voice ID side of things. This is what we talked about: text-independent verification of anything I say. This is a Sensory demo here, and as I’m speaking now, it’s basically learning my voice. It’s building a template of my voice. So really, once I’m using the product, I can literally say anything, and it will say that’s Jeff. And you could combine this with our speech-to-text and other technologies such that I say the wake word, and then I say I want to go here or do this or turn on this or whatnot, and it combines those together and says, okay, not only did I recognize what you said, but I know it was Jeff saying it. So now when I run this demo here, as I start talking you’ll see that it’s going to say, yep, that recognized me, and as I keep talking it’ll keep saying, I recognize Jeff. If I stop a second and then I start talking again, you can see right here how quickly it goes from the silence to recognizing what I said. I mean, it’s literally within one or two seconds, so it’s really, really fast, very responsive.
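Going back to the digit prompt in the face and voice demo: the reason a recording does not get in is that the digits change on every attempt. Here is a minimal sketch of that challenge-response idea. It is a generic illustration, not the SensoryCloud API. The function names are hypothetical, the transcript is assumed to come from some speech recognizer, and `voice_matched` is assumed to come from a separate speaker verification step.

```python
import secrets

WORD_TO_DIGIT = {
    "zero": "0", "one": "1", "two": "2", "three": "3", "four": "4",
    "five": "5", "six": "6", "seven": "7", "eight": "8", "nine": "9",
}


def make_challenge(length: int = 6) -> str:
    """Generate a fresh, unpredictable digit string for the user to read."""
    return "".join(secrets.choice("0123456789") for _ in range(length))


def transcript_to_digits(transcript: str) -> str:
    """Turn recognized digit words into a digit string, ignoring punctuation."""
    words = transcript.lower().replace(",", " ").replace(".", " ").split()
    return "".join(WORD_TO_DIGIT[w] for w in words if w in WORD_TO_DIGIT)


def passes_active_liveness(challenge: str, transcript: str,
                           voice_matched: bool) -> bool:
    """Accept only if the voice matched AND the spoken digits are the ones
    that were just prompted."""
    return voice_matched and transcript_to_digits(transcript) == challenge


# The digits read out in the demo, checked against the same prompt.
print(passes_active_liveness("916166", "Nine one, six, one, six, six", True))
# A replayed recording of an earlier prompt fails even if the voice matches.
print(passes_active_liveness("916166", "eight, seven, four, six, one zero", True))
```

A recording of an earlier session contains the right voice but the wrong digits, so it fails the second condition.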
Allow me to jump to our Sound ID just really quickly here. With Sound ID we’ve classified different categories, as Bryan showed. So here’s the health one, and you’ll notice that I’ve got coughs, sneezes, and snoring. As I turn this on, if I start coughing, it’s going to respond very quickly to that, and it sees that I coughed twice in there. You’ll notice that we’ve got home sounds, so you see a bunch of different home sounds as well as safety sounds, different alarms and whatnot. This can be used in the car, and this can be used in the home. This is all part of our Sound ID technology.
And then just for the last one, our speech-to-text. I only have to turn on a couple of quick things here. And to end our demonstrations today, let me just end with this.
Thank you for allowing us to spend time together to show you all the different technologies that we can provide. We really oh-
I must have missed this here, let me start again.
And to end our demonstration today, let me just end with this.
It doesn’t seem like it’s doing that for some reason, I’m not sure why. Okay, here, now it’s actually listening to me. I’m not sure why it wasn’t before, but you can see this is totally live, and so whatever I’m saying, it’s typing it out. And I turned on the capitalization, and I turned on the automatic stop listening after I was done talking, so.
Really cool. You could obviously tell that I wasn’t reading that and I was making it up as I went. This is our speech-to-text. So again, I appreciate all the attendees here. A lot of different cool technologies. Now let’s turn it back to Anu for the Q&A section.
Thanks Jeff. All those demos that Jeff just demonstrated are available at Sensorycloud.ai, and if you want to sign up for any of the other demos, you can do that there. I’m going to copy and paste those URLs into the chat so you can see them.
And now we’ll head into the Q&A. Like I said before, please enter your questions into the Q&A, and I will read them, and then our panelists will answer them live.
Okay, so we’ve got one anonymous question that says:
Q: What is the throughput for your STT engine? How many users can one GPU support?
A: That’s a great question, and it’s a tough one to answer as well, mainly because it highly depends on your application. If you have a product that is streaming continuously for twenty-four hours to the GPU, essentially hogging up GPU space, that’s a much different problem than, let’s say, a command and search application where your users would hit a button and say, “Where’s the nearest restaurant?” That’s about eight seconds of audio versus twenty-four hours of audio being used by the same user, so it really depends on your users. For instance, we’ve had a couple of customers approach us with more of the command and search application requirement in mind, and that can easily support tens of thousands of users on a single GPU over the course of a month.
So really, you know, that’s if you’re talking about individual GPUs, and that’s on a 1080, which is quite underpowered, but really it depends on your application. So if you’re interested in speech-to-text, you should probably reach out to our sales team, and they can figure out with the engineering team exactly how many GPUs you would need for your application. But we are highly competitive in terms of how we benchmark against other technologies.
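To show why the answer swings so much with usage, here is a back-of-the-envelope sketch. Only the eight seconds per query and the twenty-four hours of continuous streaming come from the answer above. The number of queries per day and the number of real-time streams one GPU can handle at once are placeholders invented for illustration, not SensoryCloud benchmarks, and the function names are hypothetical.

```python
SECONDS_PER_DAY = 24 * 3600


def monthly_audio_seconds(requests_per_day: float, seconds_per_request: float,
                          days: int = 30) -> float:
    """Audio one user sends per month."""
    return requests_per_day * seconds_per_request * days


def users_per_gpu(audio_seconds_per_user_month: float, concurrent_streams: int,
                  days: int = 30) -> float:
    """Average-load estimate: audio seconds the GPU can process in a month
    divided by what one user sends. `concurrent_streams` is a placeholder
    for how many real-time streams the GPU handles at once."""
    capacity_seconds = concurrent_streams * days * SECONDS_PER_DAY
    return capacity_seconds / audio_seconds_per_user_month


# Command-and-search user: 8 s per query (from the talk), 10 queries a day (assumed).
command_user = monthly_audio_seconds(requests_per_day=10, seconds_per_request=8)
# Always-on user: one 24-hour stream per day (from the talk).
streaming_user = monthly_audio_seconds(requests_per_day=1, seconds_per_request=SECONDS_PER_DAY)

for name, usage in (("command and search", command_user), ("always-on", streaming_user)):
    print(f"{name}: {users_per_gpu(usage, concurrent_streams=1):,.0f} users per stream slot")
```

This averages load over the month and ignores peak-hour concurrency and latency targets, so a real sizing exercise would need measured throughput and a traffic profile, which is what the answer suggests working out with the sales and engineering teams.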
So I’ll read it out loud, and then Todd will take it. It says:
Q: Does adding a voice ID increase or decrease FARs or FRRs? Does the new automotive implementation account for environmental noise?
A: Yes, so these are really, really great questions. Adding a voice ID actually does decrease the false accept rate, because it’s listening for a specific person’s voice. It’s not going to go off on other people’s voices, and it’s not going to go off on other environmental sounds. Now, it probably slightly increases the probability that it would go off on your own voice, so there’s some question of what the mix is between your own voice and other things, but in general we believe it decreases the false accept rate and improves the accuracy overall. In terms of the automotive implementation and environmental noise, the SensoryCloud is the most noise robust speech recognizer I’ve ever experimented with in my life. I’ve tried everything out for years and years. I can crank up noise in my background and it seems to work well. As for how it performs in automotive noise, we haven’t done any benchmarking at this point, but we’d be happy to work with anybody out there that’s interested in trying it.
Great, thanks Todd. So we’ve got another question here. It says:
Q: How simple is it to have Sound ID or the LVCSR on an M4 or M33 with large memory, but running a simple RTOS?
A: I can take that one, Anu. I’ll start with Sound ID. Sound ID today has been designed to run on an application processor with a standard OS. So right now, today’s Sound ID will not run on an M4 or M33. We did a demo where we’re running Sound ID on an M7, so that’s moving in that direction. To answer your question, today it does require standard Linux or Android or another OS, as opposed to an RTOS.
LVCSR, TrulyNatural, can also run on an M7 today, so in the demo that I did with the home control, I could have done that whole demo running on an M7. As an example, we partner very closely with ST, and on ST we’re running on their H7, which is based on the M7. But today TrulyNatural doesn’t actually run on an M4. We’ve got a demo where I think it will run on an M33, so these are things that are kind of in development.
We’re testing right now. TrulyHandsfree can run across all of those platforms, TrulyNatural on some of those, and Sound ID, again, is more designed today for running on an application processor. Good question.
One thing I’ll add to that that’s kind of interesting: we’ve been doing some work to increase our compatibility with TensorFlow Lite, and we have compatibility of TrulyHandsfree, TrulyNatural, and Sound ID with certain TensorFlow models.
It depends on which choices you make, so it’s possible that if TensorFlow is supported on a sub-OS system, on a microchip, or DSP, we could also support that.
Okay, and I just want to thank everyone, because we’re running out of time and I don’t want to keep anyone longer. Thank you so much for joining this webinar.
I know it was packed full of information. I’ve copied and pasted the relevant links into the chat and, as I noted before, the webinar will be recorded and sent to you via email. So I just want to thank everyone again for joining today, and we hope you have a good rest of your day wherever you are. Thanks a lot!
Thanks all, we appreciate your showing up, thanks, everyone.