HEAR ME - Speech Blog

Archive for the ‘speaker identification’ Category

Identifying Sounds as Accurately as Wake Words

August 21, 2019

At a recent meeting, Sensory was credited with “inventing the wake word.” I explained that Sensory certainly helped to evangelize and popularize it, but we didn’t “invent” it. What we really did was substantially improve upon the state of the art so that it became usable. And it was a VERY hard challenge, since we did it in an era before deep learning was available to push performance further.

Today Sensory is taking on the challenge of sound and scene identification. There are many dozens of companies working on this challenge…and it’s another HUGE one. There are some similarities with wake words and speech, but a lot of differences too! I’m writing this to provide an update on our progress, to share some of our techniques, to compare a bit with wake words and speech, and to bring clearer accuracy metrics to the table!

Sensory announced our initial SoundID solution at CES 2019.

Since then we have been working on accuracy improvements and adding gunshot identification to the mix of sounds we identify (CO and smoke alarms, glass break, baby cry, snoring, door knock/bell, scream/yell, etc.).

  1.  General Approach. Sensory is using its TrulySecure Speaker Verification platform for sound ID. This approach, which uses proprietary statistical and shallow-learning techniques, runs small models on device. It also uses wider-bandwidth filtering, since it is intended to differentiate speech and sounds rather than simply recognize words.
    1. A 2nd-stage approach can be applied to improve accuracy. This second stage uses a deep neural network and can run on device or in the cloud. It is more MIPS- and memory-intensive, but because the first stage gates it, power consumption is easily managed; the first stage can be more accepting while the 2nd stage eliminates false alarms.
      1. The second stage (deep neural network) eliminates 95% of false alarms from the first stage, while passing 97% of the real events.
      2. This enables tuning to the desired operating point (1 FA/day, 0.5 FA/day, etc.).
      3. The FR rate stays extremely low (despite the FA reduction) thanks to a very accurate deep neural network and a “loose” first stage that is less discriminative.
    2. The second-stage classifier (deep neural network) is trained on many examples of each target sound. To separate target events from similar-sounding non-target events, we apply proprietary algorithmic and model-building approaches to remove false alarms.
    3. The combined model (1st and 2nd stage) is smaller than 5 MB.
    4.  Does a 3rd stage make sense? Sensory uses its TrulyHandsfree (THF) technology to perform keyword spotting for its wake words, and often hands off to TrulySecure for higher-performance speaker verification. This allows wake words to be listened for at the lowest possible power consumption. Sensory is now exploring THF as an initial stage for Sound ID, enabling a 3-stage approach with the best accuracy and the best power consumption. This way, average power consumption can be less than 2 milliamps.
  2. Testing Results. Here are a few important findings that affect our test results:
    1. The difference between a quiet and a noisy environment is quite pronounced. It’s easy to perform well in quiet and very difficult to perform great in noise, and it’s a different challenge than we faced with speech recognition, as the sounds we are looking to identify can cover a much wider range of frequencies that more closely match background noises. There’s a very good reason that when Alexa listens for glass-break sounds, she does it in an “away” mode…that is, when the home is quiet!! (Kudos to Amazon for the clever approach!) The results we report all use noise-based testing. Spoiler alert…Sensory kicks ass! In our Alexa test, simple drum beats and music caused glass breaks to be detected. Sensory’s goal is avoiding this!
    2. Recorded sound effects sound quite different than they do live. The playback medium (mobile phone vs. PC vs. high-end speaker) can have a very big impact on the frequency spectrum and the ability to identify a sound. Once again this is quite different from human speech, which falls into a relatively narrow frequency band and isn’t as affected by the playback mechanism. For testing, Sensory is using only high-quality sound playback.
    3. Some sounds are repeated; others aren’t. This can have a huge effect on false rejects, where a sound isn’t properly identified: a repetition can be a “free” second chance to get it right. But this varies from sound to sound. For example, a glass break probably happens just once and it is absolutely critical to catch it, whereas a dog bark or baby cry that happens once and doesn’t repeat may be unimportant and OK to ignore. We will show the effect of repeated sounds on accuracy tests.
    4. Our 2-stage approach works great. All the results shown reflect the performance of the two stages running together (a minimal sketch of the cascade appears right after this list).
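To make the cascade concrete, here is a minimal sketch in Python of the general two-stage pattern described above. It is an illustration under assumptions, not Sensory’s actual implementation; the stage_one_score and stage_two_score model functions are hypothetical placeholders.

```python
from typing import Callable, Sequence

def cascade_detect(
    audio: Sequence[float],
    stage_one_score: Callable[[Sequence[float]], float],  # small always-on model (hypothetical)
    stage_two_score: Callable[[Sequence[float]], float],  # heavier DNN, run on demand (hypothetical)
    loose_threshold: float = 0.3,   # first stage: tuned loose, so real events are rarely missed
    strict_threshold: float = 0.9,  # second stage: tuned strict, to prune the false alarms
) -> bool:
    """Return True when a target sound (e.g., a glass break) is detected."""
    # Stage 1 runs continuously at low power; most audio never gets past it.
    if stage_one_score(audio) < loose_threshold:
        return False
    # Stage 2 wakes only on first-stage hits, so its MIPS/memory cost is paid
    # rarely. Per the numbers above, a strict second stage can drop ~95% of
    # the first stage's false alarms while passing ~97% of real events.
    return stage_two_score(audio) >= strict_threshold
```

Moving strict_threshold up or down is how one would slide along the FA/FR tradeoff curve to land on an operating point such as 1 FA/day or 0.5 FA/day; the optional 3rd stage discussed above would simply become an even cheaper gate in front of stage 1.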

 

  • These results are at 1 FA in 24 hours on a balanced mix of noise data. We tend to work on the sounds until we exceed 90% accuracy at 1 FA/day, so it’s no surprise that they hover in the same percentage region…some of these took more work than others. ;-)

  • Once again, these are at 1 FA in 24 hours on a balanced mix of data. You can see how detection accuracy drops as noise levels grow. Of course, we could trade off FA and FR so performance doesn’t drop as rapidly, and as the chart below shows, we can also improve performance by requiring multiple events.

  • Assuming 1 FA in 24 hours on a balanced mix of data. The general effects of multiple instances hold true across sound ID categories, so for things like repeated dog barks or baby cries the solution can be very accurate. As a dog owner, I really wouldn’t want to be notified if my dog barked once or twice in a minute, but if it barked 10 times within a minute it might be more indicative of an issue I want to be notified about. Devices with Sensory technology can allow parametric control of the number of instances that triggers a notification (a minimal sketch of such a control follows).
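A minimal sketch of what such a parametric control could look like (my own illustration; the class and parameter names are made up):

```python
from collections import deque

class RepeatedEventNotifier:
    """Fire a notification only when min_events detections land within
    window_secs, e.g. 10 dog barks within 60 seconds."""

    def __init__(self, min_events: int = 10, window_secs: float = 60.0):
        self.min_events = min_events
        self.window_secs = window_secs
        self._times: deque = deque()

    def on_detection(self, timestamp: float) -> bool:
        """Call once per detected event; returns True when it's time to notify."""
        self._times.append(timestamp)
        # Discard detections that have aged out of the sliding window.
        while self._times and timestamp - self._times[0] > self.window_secs:
            self._times.popleft()
        return len(self._times) >= self.min_events
```

A single-shot critical sound like a glass break would use min_events=1, while a bark or baby cry might use a higher count, exactly as described above.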

Sensory is very proud of our progress in sound identification. We welcome and encourage others to share their accuracy reporting…I couldn’t find much online from which to determine the “state of the art.”

Now we will begin work on scene analysis…and I expect Sensory to lead in this development as well!

Sensory CEO On Voice-Activated Technology’s Next Big Wave

January 11, 2019

Interview with Karen Webster, one of the best writers and interviewers in tech/fintech.

In 1994 the fastest imaginable connection to the internet was a 28.8 kbps dial-up modem, and email was still mostly a new thing that many people were writing off as a fad. There was no such thing as Amazon.com for the first half of the year, and less than a third of American households owned computers. Given that, it’s not much of a surprise that the number of people thinking about voice-activated, artificial intelligence (AI)-enhanced wireless technology was extremely small — roughly the same as the number of people putting serious thought into flying cars.

But the team at Sensory is not quite as surprised by the rapid evolution of the voice-activated technology marketplace as everyone else may be — because when they first opened their doors 25 years ago in 1994, this is exactly the world they had hoped to see developing two-and-a-half decades down the line, even if the progress has been a bit uneven.

“We still have a long way to go,” Sensory CEO Todd Mozer told Karen Webster in a recent conversation. “I am excited about how good speech recognition has gotten, but natural language comprehension still needs a lot of work. And combining the inputs of all the sensors devices have — for vision and speech together to make things really smart and functional in context — we just aren’t there yet.”

But for all there is still to be done, and all the advances that still need to be made, the simple fact that the AI-backboned neural net approach to developing interactive technology has become “more powerful than we ever imagined it would be with deep learning” is a huge accomplishment in and of itself.

And the accomplishments are rolling forward, he noted, as AI’s reach and voice control of devices is expanding — and embedding — and the nascent voice ecosystem is quickly growing into its adolescent phase.

“Today these devices do great if I need the weather or a recipe. I think in the future they will be able to do far more than that — but they will increasingly be invisible in the context of what we are otherwise doing.”

Embedding The Intelligence

Webster and Mozer were talking on the eve of the launch of Sensory’s VoiceGenie for Bluetooth speakers — a new product that lets speaker makers add voice controls and functions like wake words, without needing any special apps or a Wi-Fi connection. Said simply, Mozer explained, what Sensory is offering Bluetooth makers is embedded voice — instead of voice via a connection to the cloud.

And the expansion into embedded AI and voice control, he noted, is necessary, particularly in the era of data breach, cyber-crime and good old-fashioned user error with voice technology due to its relative newness.

“There are a lot of sensors on our products and phones that are gathering a lot of interesting information about what we are doing and who we are,” Mozer said.

Apart from the security problem of sending all of that information to the cloud, embedding in devices the ability to extract useful information and adapt on demand to a particular user is an area of great potential for improving the devices we all use multiple times daily.

This isn’t about abandoning the cloud, or even a great migration away from it, he said; there’s always going to be a cloud and clients for it. The cloud natively has more power, memory and capacity than anything that can be put into a device at this point on a cost-effective basis.

“But there is going to be this back-and-forth and things right now are swinging toward more embedded ability on devices,” he said. “There is more momentum in that direction.”

The cloud, he noted, will always be the home of things like transactions, which will have to flow through it. But things like verification and authentication, he said, might be centered in the devices’ embedded capacity, as opposed to in the cloud itself.

The Power Of Intermediaries

Scanning the headlines of late in the world of voice connection and advancing AI, it is easy to see two powerful players emerging in Amazon and Google. Amazon announced Alexa’s presence on 100 million devices, and Google immediately followed up with an announcement of its own that Google Assistant will soon be available on over a billion devices.

Their sheer size and scale gives those intermediaries a tremendous amount of power, as they are increasingly becoming the connectors for these services on the way to critical mass and ubiquity, Webster remarked.

Mozer agreed, and noted that this can look a little “scary” from the outside looking in, particularly given how deeply embedded Amazon and Google otherwise are with their respective mastery of eCommerce and online search.

As in many complex ecosystems, Mozer said, the “giants” — Amazon, Google and, to a lesser extent, Apple — are both partners and competitors, adding that Sensory’s greatest value to the voice ecosystem is where highly customized technology with a high level of accuracy and customer service is needed. Sensory’s technology appears in products by Google, Alibaba, Docomo and Amazon, to name a few.

But ultimately, he noted, the marketplace is heading for more consolidation — and probably putting more power in the hands of very few selected intermediaries.

“I don’t think we are going to have 10 different branded speakers. There will be some kind of cohesion — someone or maybe two someones will kick butt and dominate, with another player struggling in third place. And then a lot of players who aren’t players but want to be. We’ve seen that in other tech, I think we will see it with voice.”

As for who those winning players will be, Google and Amazon look good today, but, Mozer noted, it’s still early in the race.

The Future of Connectedness

In the long-term future, Mozer said, we may someday look back on all these individual smart devices as a strange sort of clutter from the past, when everyone was making conversation with different appliances. At some point, he ventured, we may just have sensors embedded in our heads that allow us to think about commands and have them go through — no voice interface necessary.

“That sounds like science fiction, but I would argue it is not as far out there as you think. It won’t be this decade, but it might be in the next 50 years.”

But in the more immediate — and less Space Age — future, he said, the next several years will be about enhancing and refining voice technology’s ability to understand and respond to the human voice — and, ultimately, to anticipate the needs of human users.

There won’t be a killer app for voice that sets it on the right path, according to Mozer; it will simply be a lot of capacity unlocked over time that will make voice controls the indispensable tools Sensory has spent the last 25 years hoping they would become.

“When a device is accurate in identifying who you are, and carrying out your desires seamlessly, that will be when it finds its killer function. It is not a thing that someone is going to snap their fingers and come out with,” he said, “it is going to be an ongoing evolution.”

I Nailed It!

August 30, 2017

A few days ago I wrote a blog that talked about assistants and wake words and I said:

“We’ll start seeing products that combine multiple assistants into one product. This could create some strange and interesting bedfellows.”

Interesting that this was just announced:

http://fortune.com/2017/08/30/amazon-alexa-microsoft-cortana-siri/

Here’s another prediction for you…

All assistants will start knowing who is talking to them. They will hear your voice and look at your face and know who you are. They will bring you the things you want (e.g., play my favorite songs), and only allow you to conduct transactions you are qualified for (e.g., order more black licorice). Today some training is required, but in the near future they will just learn who is who, much like a newborn quickly learns the family members without any formal training.

Banks Looking to Biometrics for Improved Customer Security

October 16, 2015

I saw a LinkedIn message in one of the biometrics groups I belong to, linking to a new video on biometrics:

I was quite surprised to see that I am actually in it!

It’s a great topic…Banks turning to biometrics. The video doesn’t talk too much about what’s really happening and why, so I’ll blog about a few salient points, worthy of understanding:

1)    Passwords are on their deathbed. This is old news and everyone gets it, but it’s worth repeating: too easy to crack and/or too hard to remember.

2)    Mobile is everything, and mobile biometrics will be the entry point. Our mobile phones will be the tools to control and open a variety of things. Our phones will know who we are and keep track of the probability of that changing as we use them. Mobile banking apps will be accessed through biometrics and that will allow us to not only check balances, but pay or send money or speed ATM transactions.

3)    EMV credit cards are here…biometric credit confirmation is next! Did you get a smart card from your bank? Europay, MasterCard, and Visa decided to reduce fraud by shifting fraud risk based on the security implemented. Smart cards are here now; biometrics will be added next to aid fraud prevention.

4)    It’s all about convenience & security. So much focus has been on security that convenience was often overlooked. There was a perception that you can’t have both! With biometrics you actually can have an extremely fast and convenient solution that is highly accurate.

5)    Layered biometrics will rule. Any one biometric or authentication approach in isolation will fail. The key is to layer a variety of authentication techniques that enhance the system’s security but don’t hurt convenience. Voice and face authentication can be used together; passwords can be thrown on top if the biometric confirmation is unsure; tokens, fingerprints or iris scans can also be deployed if the security isn’t high enough. The key is knowing the accuracy of each match and stepping up to the desired security level so as to maximize user convenience (a minimal sketch of this stepped approach follows).
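Here is a minimal sketch of that stepped layering, assuming normalized match scores in [0, 1] and a simple noisy-OR fusion; the numbers and function names are purely illustrative, not any vendor’s actual algorithm:

```python
def layered_authenticate(voice_score: float,
                         face_score: float,
                         required_confidence: float = 0.99) -> str:
    """Fuse the 'free' biometrics first, then escalate only as far as needed."""
    # Noisy-OR fusion: both factors must fail for the fused check to fail.
    confidence = 1.0 - (1.0 - voice_score) * (1.0 - face_score)
    if confidence >= required_confidence:
        return "accept"              # fast path: no extra user friction
    if confidence >= 0.5:
        return "ask_password"        # biometric match unsure: add a light factor
    return "ask_stronger_factor"     # still unsure: token, fingerprint, iris, etc.
```

For example, layered_authenticate(0.97, 0.92) returns "accept" with no extra friction, while a noisy voice sample plus a dark photo would send the user down the stepped path instead.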

Sensory’s CEO, Todd Mozer, interviewed on FutureTalk

October 1, 2015

Todd Mozer’s interview with Martin Wasserman on FutureTalk

TrulyHandsfree 4.0… Maintaining the big lead!

August 6, 2015

We first came out with TrulyHandsfree about five years ago. I remember talking to speech tech executives at MobileVoice and at other industry tradeshows, and when talking about always-on, hands-free voice control, everybody said it couldn’t be done. Many had attempted it, but their offerings suffered from too many false fires, not working in noise, or consuming too much power to be always listening. It seems everyone thought a button was necessary to be usable!

In fact, I remember the irony of being on an automotive panel, and giving a presentation about how we’ve eliminated the need for a trigger button, while the guy from Microsoft presented on the same panel the importance of where to put the trigger button in the car.

Now, five years later, voice activation is the norm… we see it all over the place with OK Google, Hey Siri, Hey Cortana, Alexa, Hey Jibo, and of course if you’ve been watching Sensory’s demos over the years, Hello BlueGenie!

Sensory pioneered the button-free, touch-free, always-on voice trigger approach with TrulyHandsfree 1.0, using a unique, patented keyword-spotting technology we developed in-house, and from its inception it was highly robust to noise and ultra-low power. Over the years we have ported it to dozens of platforms, including DSP/MCU IP cores from ARM, Cadence, CEVA, NXP CoolFlux, Synopsys and Verisilicon, as well as integrated circuits from Audience, Avnera, Cirrus Logic, Conexant, DSPG, Fortemedia, Intel, Invensense, NXP, Qualcomm, QuickLogic, Realtek, STMicroelectronics, TI and Yamaha.

This vast platform compatibility has allowed us to work with numerous OEMs to ship TrulyHandsfree in over a billion products!

Sensory didn’t just innovate a novel keyword spotting approach, we’ve continually improved it by adding features like speaker verification and user defined triggers. Working with partners, we lowered the draw on the battery to less than 1mA, and Sensory introduced hardware and software IP to enable ultra-low-power voice wakeup of TrulyHandsfree. All the while, our accuracy has remained the best in the industry for voice wakeup.

We believe the bigger, more capable companies trying to make voice triggers have been forced to use deep learning speech techniques to try to catch up with Sensory in the accuracy department. They have yet to catch up, but through deep learning they have brought their products to a very usable accuracy level, losing much of the advantage of small footprint and low power in the process.

Sensory has been architecting solutions for neural nets in consumer electronics since we opened the doors more than 20 years ago. With TrulyHandsfree 4.0 we are applying deep learning to improve accuracy even further, pushing the technology even more ahead of all other approaches, yet enabling an architecture that has the ability to remain small and ultra-low power. We are enabling new feature extraction approaches, as well as improved training in reverb and echo. The end result is a 60-80% boost in what was already considered industry-leading accuracy.

I can’t wait for TrulyHandsfree 5.0…we have been working on it in parallel with 4.0, and although it’s still a long ways off, I am confident we will make the same massive improvements in speaker verification with 5.0 that we are doing for speech recognition in 4.0! Once again further advancing the state of the art in embedded speech technologies!

Random Blogger Thoughts

June 30, 2014

  • TrulySecure™ is now announced!!!! This is the first on-device fusion of voice and vision for authentication, and it really works AMAZINGLY well. I’m so proud of our new computer vision team and of Sensory’s expansion from speech recognition to speech and vision technologies. Now we are much more than “The Leader in Speech Technologies for Consumer Electronics”: we are “The Leader in Speech and Vision Technology for Consumer Products!” Hey, check out the new TrulySecure video on our home page, and our new TrulySecure Product Brief. We hope and expect that TrulySecure will have the same HUGE impact on the market as Sensory had with TrulyHandsfree, the technology that pioneered always-on, touchless control!
  • Google I/O. Android wants to be everywhere: in our cars, in our homes, and in our phones. They are willing to spend billions of dollars to do it. Why? To observe our behaviors, which in turn will help provide us more of what we want…and they will also assist in those purchases. Of course this is what Microsoft and Apple and others want as well, but right now Google has the best cloud-based voice experience, and if you ask me, it’s the best user experience that will win the game. Seems like they should try to move ahead on the client, but lucky for Sensory, we are staying ahead!
  • Rumors about Samsung acquiring Nuance…Why would they spend $7B for Nuance when they can pick up a more unique solution from Sensory for only $1B? Yeah, that’s a joke, and is definitely not intended as an offer or solicitation to sell Sensory!
  • OH! Sensory has a new logo! We made it to celebrate our 20-year anniversary!

Biometrics – The Studies Don’t Reveal the Truth

May 7, 2014

If you read through the biometrics literature you will see a general security-based ranking of biometric techniques, starting with retinal scans as the most secure, followed by iris, hand geometry and fingerprint, voice, face recognition, and then a variety of behavioral characteristics.

The problem is that these studies have more to do with “in theory” than “in practice” on a mobile phone, but they nevertheless mislead many companies into thinking that a single biometric can provide the results required. This is really not the case in practice. Most companies will require that False Accepts (errors caused by the wrong person or thing getting in) and False Rejects (errors caused by the right person not getting in) be so low that the rate where these two are equal (the equal error rate, or EER) is well under 1% across all conditions (a small sketch of how EER is measured appears after the list below). Here’s why the studies don’t reflect the real world of a mobile phone user:

  1. Cost is key. Mobile phone manufacturers will not be willing to invest in the highest-end approaches for capturing and measuring biometrics that are used by academic studies. This means fewer MIPS, less memory, and poorer-quality readers.
  2. Size matters. Mobile phone manufacturers have extremely limited real estate, so larger systems cannot be properly deployed; further complicating things, extremely fast enrollment and usage are required without a form-factor change.
  3. Conditions are uncontrollable. Noisy environments, lighting, dirty hands, and oily screens/cameras/readers are all uncontrollable and will affect performance.
  4. User compliance cannot be assumed. The careful placement of an eye, finger or face does not always happen.
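As promised above, here is a small sketch of how an EER is measured, assuming you have match scores from genuine attempts and impostor attempts on a test set (the score distributions below are invented for illustration):

```python
import numpy as np

def equal_error_rate(genuine: np.ndarray, impostor: np.ndarray) -> float:
    """Sweep the accept threshold and find where FA and FR rates cross."""
    best_gap, eer = float("inf"), 1.0
    for threshold in np.sort(np.concatenate([genuine, impostor])):
        far = float(np.mean(impostor >= threshold))  # wrong person gets in
        frr = float(np.mean(genuine < threshold))    # right person locked out
        if abs(far - frr) < best_gap:
            best_gap, eer = abs(far - frr), (far + frr) / 2.0
    return eer

# Toy example: well-separated score distributions give a low EER.
rng = np.random.default_rng(0)
genuine = rng.normal(0.8, 0.1, 1000)   # scores for the right person
impostor = rng.normal(0.3, 0.1, 1000)  # scores for everyone else
print(equal_error_rate(genuine, impostor))  # roughly 0.006, i.e. well under 1%
```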

A great case in point is the fingerprint readers now deployed by Apple and Samsung. These are extremely expensive devices, and the literature would make one think that they are highly accurate, but Apple doesn’t have the confidence to allow them to be used in the iTunes store for ID, and San Jose Mercury News columnist Troy Wolverton says:

“I’ve not been terribly happy with the fingerprint reader on my iPhone, but it puts the one on the S5 to shame. Samsung’s fingerprint sensor failed repeatedly. At best, I would get it to recognize my print on the second try. But quite often, it would fail so many times in a row that I’d be prompted to enter my password instead. I ended up turning it off because it was so unreliable (full article).”

There is a solution to this problem: utilize sensors already on the phone to minimize cost, and deploy a biometric chain combining face verification, voice verification, and other techniques that can be implemented in a user-friendly manner. The combined usage creates a very low equal error rate and becomes “immune” to condition and compliance issues by having a series of biometric and other secure backup systems.

Sensory has an approach we call SMART (Sensory Methodology for Adaptive Recognition Thresholding) that looks at environmental and usage conditions and intelligently deploys thresholds across a multitude of biometric technologies, yielding a highly accurate solution that is easy to use and fast to respond, robust to environmental and usage conditions, AND uses existing hardware to keep costs low.
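SMART itself is proprietary, but the general idea of condition-adaptive thresholding can be shown with a toy illustration (all condition estimates and numbers here are invented, not Sensory’s actual algorithm):

```python
def condition_adaptive_thresholds(noise_db: float, lux: float) -> dict:
    """Toy illustration: shift trust between modalities as conditions change."""
    thresholds = {"voice": 0.80, "face": 0.80}  # baseline: quiet and well lit
    if noise_db > 65:
        # Noisy room: voice scores are less trustworthy, so demand more from
        # voice and lean more heavily on face.
        thresholds["voice"] += 0.10
        thresholds["face"] -= 0.05
    if lux < 50:
        # Dim lighting: the reverse.
        thresholds["face"] += 0.10
        thresholds["voice"] -= 0.05
    return thresholds
```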

KitKat’s Listening!

November 15, 2013

Android introduced the new KitKat OS for the Nexus 5, and Sensory has gotten lots of questions about the new “always listening” feature that allows a user to say “OK Google” followed by a Google Now search. Here are some of the common questions:

  1. Is it Sensory’s? Did it come from LG (like the hardware)? Is it Google’s in-house technology? I believe it was developed within the speech team at Android. LG does use Sensory’s technology in the G2, but this does not appear to be a Sensory implementation. Google has one of the smartest, most capable, and larger speech recognition groups in the industry, and they certainly have the chops to build a keyword-spotting technology. Actually, developing a voice-activated trigger is not very hard; several dozen companies can do this today (including Qualcomm!). However, making it usable in an “always on” mode, where accuracy is really important, is very difficult.
  2. The KitKat trigger is just like the one on the Moto X, right? Ugh, definitely not. The Moto X really has “always on” capabilities, which require low-power operation. The Android approach consumes too much power to be left “always on”. Also, the Moto X approach adds speaker verification, so the “wrong” users can’t just take over the phone with their voice. Motorola is a Sensory licensee; Android isn’t.
  3. How is Sensory’s trigger word technology different from others’?
    • First of all, Sensory’s approach is ultra-low power. We have IC partners like Cirrus Logic, DSPG, Realtek, and Wolfson that are measuring current consumption in the 1.5-2mA range. My guess is that the KitKat implementation consumes 10-100 times more power than this, for two reasons: 1) we have implemented a “deeply embedded” approach on these tiny DSPs, and 2) Sensory’s approach requires as little as 5 MIPS, whereas most other recognizers need 10 to 100 times more processing power and must run on the power-hungry Android processor!
    • Second…Sensory’s approach requires minimal memory. The small DSPs that run at ultra-low power allow less RAM and more limited memory access. The traditional approach to speech recognition is to collect tons of data and build huge models that take a lot of memory…very difficult to move onto low-power silicon.
    • Thirdly, being left always on really pushes accuracy, and Sensory is VERY unique in the accuracy of its triggers. Accuracy is usually measured by looking at two types of errors: “false accepts,” when the trigger fires unintentionally, and “false rejects,” when it doesn’t let a person in when they say the right phrase. With a short listening window, “false accepts” aren’t too much of an issue, and the KitKat implementation has very intentionally allowed a “loose” setting, which I suspect would produce too many false accepts if it were left “always on” (some back-of-the-envelope arithmetic on this follows the list). For example, I found a YouTube video showing that “OK Google” works great, but so does “OK Barry” and “OK Jarvis”.
    • Finally, Sensory has layered other technologies on top of the trigger, like speaker verification and speaker identification. Sensory has also implemented a “user-defined trigger” capability that allows the end customer to define their own trigger, so the phone can accurately, and at ultra-low power, respond to the user’s personalized commands!
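To see why the “always on” requirement changes the accuracy math so dramatically, here is that back-of-the-envelope arithmetic, with invented numbers purely for illustration:

```python
# Invented example: a trigger that false-fires once per 10 hours of ambient audio.
fa_per_hour = 1 / 10

# Always-on: the recognizer hears roughly 16 waking hours of audio per day.
always_on_fas_per_day = fa_per_hour * 16               # ~1.6 false fires/day

# Button- or window-gated: fifty 10-second listening windows per day.
gated_hours_per_day = 50 * 10 / 3600                   # ~0.14 hours of audio
gated_fas_per_day = fa_per_hour * gated_hours_per_day  # ~0.014/day, ~1 per 72 days

print(f"always-on: {always_on_fas_per_day:.2f} FA/day, "
      f"gated: {gated_fas_per_day:.3f} FA/day")
```

The same model that feels flawless in a short listening window can be unusable when left always on, which is why a “loose” setting that works fine behind a button is not good enough for an always-listening trigger.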