Archive for the ‘speaker identification’ Category
August 21, 2019
At a recent meeting Sensory was credited with “inventing the wake word”. I explained that Sensory certainly helped to evangelize and popularize it, but we didn’t “invent” it. What we really did was substantially improve upon the state of the art so that it became usable. And it was a VERY hard challenge, since we did it in an era before deep learning allowed us to further improve the performance.
Today Sensory is taking on the challenge of sound and scene identification. There are many dozens of companies working on this challenge…and it’s another HUGE challenge. There are some similarities with wake words and dealing with speech, but a lot of differences too! I’m writing this to provide an update on our progress, to share some of our techniques, to compare a bit with wake words and speech, and to bring clearer accuracy metrics to the table!
Sensory announced our initial SoundID solution at CES 2019.
Since then we have been working on accuracy improvements and adding gunshot identification to the mix of sounds we identify (CO2 and smoke alarms, glass break, baby cry, snoring, door knock/bell, scream/yell, etc.).
Sensory is very proud of our progress in sound identification. We welcome and encourage others to share their accuracy reporting…I couldn’t find much online to determine “state of the art”.
Now we will begin work on scene analysis…and I expect Sensory to lead in this development as well!
January 11, 2019
Interview with Karen Webster, one of the best writers and interviewers in tech/fintech.
In 1994 the fastest imaginable connection to the internet was a 28.8 kbps dial-up modem and email was still mostly a new thing that many people were writing off as a fad. There was no such thing as Amazon.com for the first half of the year and fewer than a third of American households owned computers. Given that, it’s not much of a surprise that the number of people thinking about voice-activated, artificial intelligence (AI)-enhanced wireless technology was extremely small, roughly the same as the number of people putting serious thought into flying cars.
But the team at Sensory is not quite as surprised by the rapid onset evolution of the voice-activated technology marketplace as everyone else may be — because when they were first opening their doors 25 years ago in 1994, this is exactly the world they had hoped to see developing two-and-a-half decades down the line, even if the progress has been a bit uneven.
“We still have a long way to go,” Sensory CEO Todd Mozer told Karen Webster in a recent conversation. “I am excited about how good speech recognition has gotten, but natural language comprehension still needs a lot of work. And combining the inputs of all the sensors devices have, for vision and speech together to make things really smart and functional in context, we just aren’t there yet.”
But for all that is still to be done, and the advances that still need to be made, the simple fact that the AI-backboned neural net approach to developing for interactive technology has become “more powerful than we ever imagined it would be with deep learning” is a huge accomplishment in and of itself.
And the accomplishments are rolling forward, he noted, as AI’s reach and voice control of devices is expanding — and embedding — and the nascent voice ecosystem is quickly growing into its adolescent phase.
“Today these devices do great if I need the weather or a recipe. I think in the future they will be able to do far more than that, but they will increasingly be invisible in the context of what we are otherwise doing.”
Embedding The Intelligence
Webster and Mozer were talking on the eve of the launch of Sensory’s VoiceGenie for Bluetooth speakers, a new product that lets speaker makers add voice controls and functions like wake words without needing any special apps or a Wi-Fi connection. Said simply, Mozer explained, what Sensory is offering Bluetooth makers is embedded voice, instead of voice via a connection to the cloud.
And the expansion into embedded AI and voice control, he noted, is necessary, particularly in the era of data breach, cyber-crime and good old-fashioned user error with voice technology due to its relative newness.
“There are a lot of sensors on our products and phones that are gathering a lot of interesting information about what we are doing and who we are,” Mozer said.
Apart from the security problem of sending all of that information to the cloud, embedding in devices the ability to extract useful information and adapt on demand to a particular user is an area of great potential for improving the devices we all use multiple times daily.
This isn’t about abandoning the cloud, or even a great migration away from it, he said; there’s always going to be a cloud and clients for it. The cloud natively has more power, memory and capacity than anything that can be put into a device at this point on a cost-effective basis.
“But there is going to be this back-and-forth and things right now are swinging toward more embedded ability on devices,” he said. “There is more momentum in that direction.”
The cloud, he noted, will always be the home of things like transactions, which will have to flow through it. But things like verification and authentication, he said, might be centered in the devices’ embedded capacity, as opposed to in the cloud itself.
The Power Of Intermediaries
Scanning the headlines of late in the world of voice connection and advancing AI, it is easy to see two powerful players emerging in Amazon and Google. Amazon announced Alexa’s presence on 100 million devices, and Google immediately followed up with an announcement of its own that Google Assistant will soon be available on over a billion devices.
Their sheer size and scale give those intermediaries a tremendous amount of power, as they are increasingly becoming the connectors for these services on the way to critical mass and ubiquity, Webster remarked.
Mozer agreed, and noted that this can look a little “scary” from the outside looking in, particularly given how deeply embedded Amazon and Google otherwise are with their respective mastery of eCommerce and online search.
As in many complex ecosystems, Mozer said, the “giants” (Amazon, Google and, to a lesser extent, Apple) are both partners and competitors. He added that Sensory’s greatest value to the voice ecosystem comes when a product needs highly customized tech with a high level of accuracy and customer service. Sensory’s technology appears in products by Google, Alibaba, Docomo and Amazon, to name a few.
But ultimately, he noted, the marketplace is heading for more consolidation — and probably putting more power in the hands of very few selected intermediaries.
“I don’t think we are going to have 10 different branded speakers. There will be some kind of cohesion — someone or maybe two someones will kick butt and dominate, with another player struggling in third place. And then a lot of players who aren’t players but want to be. We’ve seen that in other tech, I think we will see it with voice.”
As for who those winning players will be, Google and Amazon look good today, but, Mozer noted, it’s still early in the race.
The Future of Connectedness
In the long-term future, Mozer said, we may someday look back on all these individual smart devices as a strange sort of clutter from the past, when everyone was making conversation with different appliances. At some point, he ventured, we may just have sensors embedded in our heads that allow us to think about commands and have them go through, with no voice interface necessary.
“That sounds like science fiction, but I would argue it is not as far out there as you think. It won’t be this decade, but it might be in the next 50 years.”
But in the more immediate (and less Space Age) future, he said, the next several years will be about enhancing and refining voice technology’s ability to understand and respond to the human voice and, ultimately, to anticipate the needs of human users.
There won’t be a killer app for voice that sets it on the right path, according to Mozer; it will simply be a lot of capacity unlocked over time that will make voice controls the indispensable tools Sensory has spent the last 25 years hoping they would become.
“When a device is accurate in identifying who you are, and carrying out your desires seamlessly, that will be when it finds its killer function. It is not a thing that someone is going to snap their fingers and come out with,” he said, “it is going to be an ongoing evolution.”
August 30, 2017
A few days ago I wrote a blog that talked about assistants and wake words and I said:
“We’ll start seeing products that combine multiple assistants into one product. This could create some strange and interesting bedfellows.”
Interesting that this was just announced:
Here’s another prediction for you…
All assistants will start knowing who is talking to them. They will hear your voice and look at your face and know who you are. They will bring you the things you want (e.g. play my favorite songs), and only allow you to conduct transactions you are qualified for (e.g. order more black licorice). Today there is some training required, but in the near future they will just learn who is who, much like a newborn quickly learns the family members without any formal training.
October 16, 2015
I saw a LinkedIn message to one of the biometrics groups in which I’m a member linking to a new video on biometrics:
I was quite surprised to see that I am actually in it!
It’s a great topic…Banks turning to biometrics. The video doesn’t talk too much about what’s really happening and why, so I’ll blog about a few salient points, worthy of understanding:
1) Passwords are on their deathbed. This is old news and everyone gets it, but it is worth repeating. They are too easy to crack and/or too hard to remember.
2) Mobile is everything, and mobile biometrics will be the entry point. Our mobile phones will be the tools to control and open a variety of things. Our phones will know who we are and keep track of the probability of that changing as we use them. Mobile banking apps will be accessed through biometrics and that will allow us to not only check balances, but pay or send money or speed ATM transactions.
3) EMV credit cards are here…Biometric credit confirmation is next! Did you get a smart card from your bank? Europay, Visa, and MasterCard decided to reduce fraud by shifting fraud risk based on the security implemented. Smart cards are here now; biometrics will be added to aid fraud prevention.
4) It’s all about convenience & security. So much focus has been on security that convenience was often overlooked. There was a perception that you can’t have both! With Biometrics you actually can have an extremely fast and convenient solution that is highly accurate.
5) Layered biometrics will rule. Any one biometric or authentication approach in isolation will fail. The key is to layer a variety of authentication techniques that enhance the system’s security but don’t hurt convenience. Voice and face authentication can be used together, passwords can be thrown on top if the biometric confirmation is unsure, and tokens or fingerprint or iris scans can also be deployed if the security isn’t high enough. The key is knowing the accuracy of the match and increasing the security to the desired level in a stepped function so as to maximize user convenience.
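To make the stepped idea in point 5 concrete, here is a minimal sketch in Python. It is not Sensory’s implementation: the function name, the factor names, and the score fusion (summing log-odds under a naive independence assumption) are all hypothetical choices made for illustration. Each factor returns a match confidence between 0 and 1, and the most convenient factors are tried first.

```python
import math
from typing import Callable, List, Tuple

# A factor is a (name, check) pair; check() returns a match confidence in [0, 1].
# All names here are hypothetical, chosen only for illustration.
Factor = Tuple[str, Callable[[], float]]

def _log_odds(p: float) -> float:
    """Convert a confidence to log-odds, clamped to avoid log(0)."""
    p = min(max(p, 1e-6), 1 - 1e-6)
    return math.log(p / (1 - p))

def layered_authenticate(factors: List[Factor], required: float) -> Tuple[bool, List[str]]:
    """Step through factors in order of convenience until confidence is high enough.

    Evidence is fused by summing log-odds, which assumes the factors are
    independent (a simplifying assumption, not a claim about real systems).
    """
    evidence = 0.0
    used: List[str] = []
    for name, check in factors:
        evidence += _log_odds(check())
        used.append(name)
        confidence = 1.0 / (1.0 + math.exp(-evidence))
        if confidence >= required:
            return True, used   # secure enough, stop bothering the user
    return False, used          # every layer tried, still not confident

# Usage: voice alone is enough for a low-risk action, while a higher
# security level steps up to face as well.
factors = [("voice", lambda: 0.9), ("face", lambda: 0.9), ("password", lambda: 0.99)]
low_risk = layered_authenticate(factors, required=0.85)
high_risk = layered_authenticate(factors, required=0.95)
```

The point of the ordering is convenience: a user only meets the slower layers (password, token) when the faster ones leave the match uncertain for the security level requested.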
October 1, 2015
Todd Mozer’s interview with Martin Wasserman on FutureTalk
August 6, 2015
We first came out with TrulyHandsfree about five years ago. I remember talking to speech tech executives at MobileVoice as well as other industry tradeshows, and when talking about always-on hands-free voice control, everybody said it couldn’t be done. Many had attempted it, but their offerings suffered from too many false fires, or not working in noise, or consuming too much power to be always listening. Seems that everyone thought a button was necessary to be usable!
In fact, I remember the irony of being on an automotive panel, and giving a presentation about how we’ve eliminated the need for a trigger button, while the guy from Microsoft presented on the same panel the importance of where to put the trigger button in the car.
Now, five years later, voice activation is the norm… we see it all over the place with OK Google, Hey Siri, Hey Cortana, Alexa, Hey Jibo, and of course if you’ve been watching Sensory’s demos over the years, Hello BlueGenie!
Sensory pioneered the button-free, touch-free, always-on voice trigger approach with TrulyHandsfree 1.0 using a unique, patented keyword spotting technology we developed in-house, and from its inception it was highly robust to noise and ultra-low power. Over the years we have ported it to dozens of platforms, including DSP/MCU IP cores from ARM, Cadence, CEVA, NXP CoolFlux, Synopsys and Verisilicon, as well as integrated circuits from Audience, Avnera, Cirrus Logic, Conexant, DSPG, Fortemedia, Intel, Invensense, NXP, Qualcomm, QuickLogic, Realtek, STMicroelectronics, TI and Yamaha.
This vast platform compatibility has allowed us to work with numerous OEMs to ship TrulyHandsfree in over a billion products!
Sensory didn’t just innovate a novel keyword spotting approach, we’ve continually improved it by adding features like speaker verification and user defined triggers. Working with partners, we lowered the draw on the battery to less than 1mA, and Sensory introduced hardware and software IP to enable ultra-low-power voice wakeup of TrulyHandsfree. All the while, our accuracy has remained the best in the industry for voice wakeup.
We believe the bigger, more capable companies trying to make voice triggers have been forced to use deep learning speech techniques to try to catch up with Sensory in the accuracy department. They have yet to catch up. Deep learning has brought their products to a very usable accuracy level, but they have lost many of the advantages of small footprint and low power in the process.
Sensory has been architecting solutions for neural nets in consumer electronics since we opened the doors more than 20 years ago. With TrulyHandsfree 4.0 we are applying deep learning to improve accuracy even further, pushing the technology even more ahead of all other approaches, yet enabling an architecture that has the ability to remain small and ultra-low power. We are enabling new feature extraction approaches, as well as improved training in reverb and echo. The end result is a 60-80% boost in what was already considered industry-leading accuracy.
I can’t wait for TrulyHandsfree 5.0…we have been working on it in parallel with 4.0, and although it’s still a long ways off, I am confident we will make the same massive improvements in speaker verification with 5.0 that we are doing for speech recognition in 4.0! Once again further advancing the state of the art in embedded speech technologies!
June 30, 2014
May 7, 2014
If you read through the biometrics literature you will see a general security based ranking of biometric techniques starting with retinal scans as the most secure, followed by iris, hand geometry and fingerprint, voice, face recognition, and then a variety of behavioral characteristics.
The problem is that these studies have more to do with “in theory” than “in practice” on a mobile phone, but they nevertheless mislead many companies into thinking that a single biometric can provide the results required. This is really not the case in practice. Most companies will require that False Accepts (errors caused by the wrong person or thing getting in) and False Rejects (errors caused by the right person not getting in) be so low that the rate where these two are equal (the equal error rate, or EER) would be well under 1% across all conditions. Here’s why the studies don’t reflect the real world of a mobile phone user:
A great case in point is the fingerprint readers now deployed by Apple and Samsung. These are extremely expensive devices, and the literature would make one think that they are highly accurate, but Apple doesn’t have the confidence to allow them to be used in the iTunes store for ID, and San Jose Mercury News columnist Troy Wolverton says:
“I’ve not been terribly happy with the fingerprint reader on my iPhone, but it puts the one on the S5 to shame. Samsung’s fingerprint sensor failed repeatedly. At best, I would get it to recognize my print on the second try. But quite often, it would fail so many times in a row that I’d be prompted to enter my password instead. I ended up turning it off because it was so unreliable (full article).”
There is a solution to this problem…It’s to utilize sensors already on the phone to minimize cost, and to deploy a biometric chain combining face verification, voice verification, or other techniques that can be easily implemented in a user-friendly manner. The combined usage creates a very low equal error rate and becomes “immune” to conditions and compliance issues by having a series of biometric and other secure backup systems.
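For anyone who wants to check an equal error rate on their own data, here is a minimal sketch in Python. It assumes you have two lists of match scores, one from genuine users and one from impostors, where a higher score means a stronger match; the function name and the sample scores are made up for illustration and are not measured results.

```python
from typing import List, Tuple

def equal_error_rate(genuine: List[float], impostor: List[float]) -> Tuple[float, float]:
    """Return (threshold, eer) at the point where false accepts and false rejects are closest.

    A score at or above the threshold counts as an accept.
    """
    best = None
    for t in sorted(set(genuine) | set(impostor)):
        far = sum(s >= t for s in impostor) / len(impostor)  # wrong person gets in
        frr = sum(s < t for s in genuine) / len(genuine)     # right person kept out
        gap = abs(far - frr)
        if best is None or gap < best[0]:
            best = (gap, t, (far + frr) / 2)
    return best[1], best[2]

# Usage with made-up scores (illustrative only):
genuine_scores = [0.9, 0.8, 0.7, 0.4]
impostor_scores = [0.1, 0.2, 0.3, 0.6]
threshold, eer = equal_error_rate(genuine_scores, impostor_scores)
```

With real data the two error rates rarely cross exactly on a sampled threshold, so this sketch reports the average of the two rates at the closest point.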
Sensory has an approach we call SMART (Sensory Methodology for Adaptive Recognition Thresholding) that looks at environmental and usage conditions and intelligently deploys thresholds across a multitude of biometric technologies. The result is a highly accurate solution that is easy to use and fast in responding, yet robust to environmental and usage models, AND uses existing hardware to keep costs low.
November 15, 2013
Android introduced the new KitKat OS for the Nexus 5, and Sensory has gotten lots of questions about the new “always listening” feature that allows a user to say “OK Google” followed by a Google Now search. Here are some of the common questions: