Posts Tagged ‘speech recognition’
June 8, 2017
Since the beginning, Sensory has been a pioneer in advancing AI technologies for consumer electronics. Not only did Sensory implement the first commercially successful speech recognition chip, but we also were first to bring biometrics to low cost chips, and speech recognition to Bluetooth devices. Perhaps what I am most proud of though, more than a decade ago Sensory introduced its TrulyHandsfree technology and showed the world that wakeup words could really work in real devices, getting around the false accept and false reject, and power consumption issues that had plagued the industry. No longer did speech recognition devices require button presses…and it caught on quickly!
Let me go on boasting because I think Sensory has a few more claims to fame… Do you think Apple developed the first “Hey Siri” wake word? Did Google develop the first “OK Google” wake word? What about “Hey Cortana”? I believe Sensory developed these initial wake words, some as demos and some shipped in real products (like the Motorola MotoX smartphone and certain glasses). Even third-party Alexa and Cortana products today are running Sensory technology to wake up the Alexa cloud service.
Sensory’s roots are in neural nets and machine learning. I know everyone does that today, but it was quite out of favor when Sensory used machine learning to create a neural net speech recognition system in the 1990’s and 2000’s. Today everyone and their brother is doing deep learning (yeah that’s tongue in cheek because my brother is doing it too! (http://www.cs.colorado.edu/~mozer/index.php). And a lot of these deep learning companies are huge multi-billion-dollar business or extremely well-funded startups.
So, can Sensory stay ahead now and continuing pioneering innovation in AI now that everyone is using machine learning and doing AI? Of course, the answer is yes!
Sensory is now doing computer vision with convolutional neural nets. We are coming out with deep learning noise models to improve speech recognition performance and accuracy, and are working on small TTS systems using deep learning approaches that help them sound lifelike. And of course, we have efforts in biometrics and natural language that also use deep learning.
We are starting to combine a lot of technologies together to show that embedded systems can be quite powerful. And because we have been around longer and thought through most of these implementations years before others, we have a nice portfolio of over 3 dozen patents covering these embedded AI implementations. Hand in hand with Sensory’s improvements in AI software, companies like ARM, NVidia, Intel, Qualcomm and others are investing and improving upon neural net chips that can perform parallel processing for specialized AI functions, so the world will continue seeing better and better AI offerings on “the edge”.
Curious about the kind of on-device AI we can create when combining a bunch of our technologies together? So were we! That’s why we created this demo that showcases Sensory’s natural language speech recognition, chatbots, text-to-speech, avatar lip-sync and animation technologies. It’s our goal to integrate biometrics and computer vision into this demo in the months ahead:
Let me know what you think of that! If you are a potential customer and we sign an NDA, we would be happy to send you an APK of this demo so you can try it yourself! For more information about this exciting demo, please check out the formal announcement we made: http://www.prnewswire.com/news-releases/sensory-brings-chatbot-and-avatar-technology-to-consumer-devices-and-apps-300470592.html
February 1, 2017
The hands-free personal assistant that you can wake on voice and talk to naturally has significantly gained popularity the last couple of years. This kind of technology made its debut not all that long ago as a feature of Motorola’s MotoX, a smartphone that had always-listening Moto Voice technology powered by Sensory’s TrulyHandsfree technology. Since then, the always-listening digital assistant quickly spread across mobile phones and PCs from several different brands, making phrases like, “Hey Siri,” “Okay Google,” and, “Hey Cortana,” commonplace.
Then, out of nowhere, Amazon successfully tried its hand at the personal assistant with the Echo, sporting a true natural language voice interface and Alexa cloud-based AI. It was initially marketed for music, but quickly expanded domain coverage to include weather, Q&A, recipes, and the ability to answer common questions. On top of that, Amazon also opened its platform up to third-party developers, allowing them to proliferate the skill sets available on the Alexa platform, with now more than 10,000 skills accessible to users. These skills allow Amazon’s Echo, Tap, and Dot, as well as the several new third-party Alexa-equipped products like Nucleus and Triby, to be used to access and control various IoT functions, from reading heart rates on Fitbits to ordering pizzas and controlling lights within the home.
Until recently, always-listening, hands-free assistants required a certain minimum power capability, restricting form factors to table top speakers or appliance devices that had to either be plugged in to an outlet or have a large battery. Also, Amazon’s Echo, Tap, and Dot all required a Wi-Fi connection for communicating with the Alexa AI engine to make use of its available skills. Unfortunately, this meant you were restricted to using Alexa within your home or Wif-Fi network. If you wanted to go on a run, the only way to ask Alexa for your step count or heart rate was to wait until you got back home.
This is changing now with technology like Sensory’s VoiceGenie, an always-listening embedded speech recognizer for wearables and hearables that runs in a low power mode on a Qualcomm/CSR Bluetooth chip. The solution takes a session border controller (SBC) music decoder and intertwines it with a speech recognition system so that while music is playing and the decoder is in-use, VoiceGenie is on and actively listening, allowing the Bluetooth device to listen for two keywords:
To give an example of how this works, a Bluetooth headset’s volume, pairing process, battery strength, or connection status can only be controlled or monitored through the device itself, so VoiceGenie handles those controls with no touching required. VoiceGenie can also read the incoming caller’s name and ask the user if they want to answer or ignore. Additionally, VoiceGenie can call up the phone’s assistant like Google Assistant, Siri, or Cortana, to ask by voice for a call to be made or a song to be played. By saying, “Alexa,” the user can access the Alexa service directly from their Bluetooth headsets while out and about, using their smartphone as the connection to the Alexa cloud.
Today’s consumer wants a personalized assistant that knows them, is convenient to use, keeps their secrets safe, and helps them in their daily lives. This help can be accessing information, getting answers to questions or intelligently controlling your home environment. It’s very difficult to accomplish this for privacy and power reasons solely using cloud-based AI technology. There needs to be embedded intelligence on devices, and it needs to run at low power. A low-power embedded voice assistant that adds an intelligent voice interface to portable and wearable devices, while also adding Alexa functionality to them, can address those needs.
June 11, 2015
Guest post by: Michael Farino
Sensory’s CEO, Todd Mozer joined Alan Taylor, host of Popular Science Radio, in a fun discussion about artificial intelligence, Sensory’s involvement with the Jibo robot development team, and also gave the show’s listeners a look into the past 20 years of speech recognition. Todd and Alan additionally discussed some of the latest advancements in speech technology, and Todd provided an update on Sensory’s most recent achievements in the field of speech recognition as well as a brief look into what the future holds.
Listen to the full radio show at the link below:
Big Bang Theory, Science, and Robots | FULL EPISODE | Popular Science Radio #269
June 3, 2015
When I started Sensory over 20 years ago, I knew how difficult it would be to sell software to cost sensitive consumer electronic OEMs that would know my cost of goods. A chip based method of packaging up the technology made a lot of sense as a turnkey solution that could maintain a floor price by adding the features of a microcontroller or DSP with the added benefit of providing speech I/O. The idea was “buy Sensory’s micro or DSP and get speech I/O thrown in for free”.
After about 10 years it was becoming clear that Sensory’s value add in the market was really in technology development, and particularly in developing technologies that could run on low cost chips and with smaller footprints, less power, and superior accuracy than other solutions. Our strategy of using trailing IC technologies to get the best price point was becoming useless because we lacked the scale to negotiate the best pricing, and more cutting edge technologies were becoming further out of reach; even getting the supply commitments we needed was difficult in a world of continuing flux between over and under capacity.
So Sensory began porting our speech technologies onto other people’s chips. Last year about 10% of our sales came from our internal IC’s! Sensory’s DSP, IP, and platform partners have turned into the most strategic of our partnerships.
Today in the semiconductor industry there is a consolidation that is occurring that somewhat mirrors Sensory’s thinking over the past 10 years, albeit at a much larger scale. Avago pays $37 billion dollars for Broadcom, Intel pays $16.7B for Altera, and NXP pays $12B for Freescale, and the list goes on, dwarfing acquisitions of earlier time periods.
It used to be the multi-billion dollar chip companies gobbled up the smaller fabless companies, but now even the multibillion-dollar chip companies are being gobbled up. There’s a lot of reasons for this but economies of scale is probably #1. As chips get smaller and smaller, there are increasing costs for design tools, tape outs, prototyping, and although the actual variable per chip cost drops, the fixed costs are skyrocketing, making consolidation and scale more attractive.
That sort of consolidation strategy is very much a hardware centered philosophy. I think the real value will come to these chip giants through in house technology differentiation. It’s that differentiation that will add value to their chips, enabling better margins and/or more sales.
I expect that over time the chip giants will realize what Sensory concluded 10 years ago…that machine learning, algorithmic differentiation, and software skills, are where the majority of the value added equation on “smart” chips needs to come from, and that improving the user experience on devices can be a pot of gold! In fact, we have already seen Intel, Qualcomm and many other chip giants investing in speech recognition, biometrics, and other user experience technologies, so the change is underway!
May 1, 2015
Winning on Accuracy & Speed… How can a tiny player like Sensory compete in deep learning technology with giants like Microsoft, Google, Facebook, Baidu and others?
There’s a number of ways, and let me address them specifically:
These 3 items together have provided Sensory with the highest quality embedded speech engines in the world. It’s worth reiterating why embedded is needed, even if speech recognition can all be done in the cloud:
March 30, 2015
At Mobile World Congress, I participated in ZTE’s Mobile Voice Alliance panel. ZTE presented data researched in China that basically said people want to use speech recognition on their phones, but they don’t use it because it doesn’t work well enough. I have seen similar data on US mobile phone users, and the automotive industry has also shown data supporting the high level of dissatisfaction with speech recognition.
In fact, when I bought my new car last year I wanted the state of the art in speech recognition to make navigation easier… but sadly I’ve come to learn that the system used in my Lexus just doesn’t work well — even the voice dialing doesn’t work well.
As an industry, I feel we must do better than this, so in this blog I’ll provide my two-cents as to why speech recognition isn’t where it should be today, even when technology that works well exists:
Deep learning nets have enabled some amazing progress in speech recognition over the last five years. The next five years will see embedded recognition with high performance noise cancelling and beamforming coming to the forefront, and Sensory will be leading this charge… and just like how Sensory led the way with the “always on” low-power trigger, I expect to see Google, Apple, Microsoft, Amazon, Facebook and others follow suit.
February 11, 2015
The advent of “always on” speech processing has raised concerns about organizations spying on us from the cloud.
In this Money/CNN article, Samsung is quoted as saying, “Samsung does not retain voice data or sell it to third parties.” But, does this also mean that your voice data isn’t being saved at all? Not necessarily. In a separate article, the speech recognition system in Samsung’s TVs is shown to be an always-learning cloud-based system solution from Nuance. I would guess that there is voice data being saved, and that Nuance is doing it.
This doesn’t mean Nuance is doing anything evil; this is just the way that machine learning works. There has been this big movement towards “deep” learning, and what “deep” really means is more sophisticated learning algorithms that require more data to work. In the case of speech recognition, the data needed is speech data, or speech features data that can be used to train and adapt the deep nets.
But just because there is a necessary use for capturing voice data and invading privacy, doesn’t mean that companies should do it. This isn’t just a cloud-based voice recognition software issue; it’s an issue with everyone doing cloud based deep learning. We all know that Google’s goal in life is to collect data on everything so Google can better assist you in spending money on the right things. We in fact sign away our privacy to get these free services!
I admit guilt too. When Sensory first achieved usable results for always-on voice triggers, the basis of our TrulyHandsfree technology, I applied for a patent on a “background recognition system” that listens to what you are talking about in private and puts together different things spoken at different times to figure out what you want…. without you directly asking for it.
Can speech recognition be done without having to send all this private data to the cloud? Sure it can! There’s two parts in today’s recognition systems: 1) The wake up phrase; 2) The cloud based deep net recognizer – AND NOW THEY CAN BOTH BE DONE ON DEVICE!
Sensory pioneered the low-power wake up phrase on device (item 1), now we have a big team working on making an EMBEDDED deep learning speech recognition system so that no personal data needs to be sent to the cloud. We call this approach TrulyNatural, and it’s going to hit the market very soon! We have benchmarked TrulyNatural against state-of-the-art cloud-based deep learning systems and have matched and in some cases bested the performance!
October 15, 2014
A couple of news headlines have appeared recently asserting that voice activation is unsafe. I thought it was time for Sensory to weigh in on a few aspects of this since we are the pioneers in voice activation:
October 7, 2013
October 2, 2012
I really enjoyed reading this article interviewing Vlad Sejnoha, Nuance’s CTO. Most people would consider Nuance the leader in speech recognition today, and Vlad is certainly a very smart, thoughtful, and articulate man.
I enjoyed it for a few different reasons. The first and main reason I liked the article is it helps to push the idea Sensory has been championing for the past several years that devices don’t have to be touched to enable voice commands, and that you should be able to just start talking to things like we talk to each other. That’s what Sensory calls TrulyHandsfree, and it’s the technology that showed up in the first Bluetooth carkit that requires no touching (by BlueAnt) AND the first mobile phones that responded to voice without touch (Samsungs Galaxy SII and SIII and Note). Even hit toys like Mattel’s award winning Fijit Friends and Hallmarks Interactive Books use this unique technology that just works when you talk to it. In fact, it really was the TrulyHandsfree feature that made Vlingo so popular, as this Vlingo video nicely states in its comparison between Vlingo and Siri. (Nuance bought Vlingo earlier this year, but the Sensory TrulyHandsfree didn’t come with it!).
The article says “Sejnoha believes that within a year or two you’ll be able to talk to your smartphone even as it lies idle on a desk, asking it questions such as, “When’s my next appointment?” The phone will be able to detect that you are speaking, wake itself up, and accomplish the task at hand.” Check out this Sensory video…this is definitely what Vlad is talking about! Yeah, we can do it today, and it’s REALLY FAST and really accurate.
But is it low power? Well that’s ABSOLUTELY KEY. That’s why Sensory partnered with Tensilica. Tensilica is a leader in low power audio DSP’s for Mobile Phones. Sensory already has its TrulyHandsfree running on chips that run under 5 mW for a COMPLETE audio system. And that’s without having to wake up to understand the task at hand. We can drop by another 1-2mW by not being always on, but turning the recognizer off doesn’t do much. That’s because even if the full recognizer is shut down, you still need to run a mic and preamp, which drives a lot of the current consumption when you have a low power recognizer like TrulyHandsfree (it can run on as little as 7 MIPS!). This means it’s REALLY critical to have a low power recognizer as well, and that’s Sensory’s forte. We are expecting that by next year we will have systems running at 1-3mW!
The article mentions “persistent” listening, but even though I’ve always preached this “always on” concept, I think what will really explode is “intelligent automatic listening”. That is, the device figures out when it needs to listen for what and turns on to listen for it. So it doesn’t always have to be on…it will just seem that way because the devices are so intelligent. For example a certain traveling speed could make a phone listen for car commands or car wake up words. An incoming call could cause the recognizer to wake up and listen for Answer/Ignore. For these to work, the device needs to run not only at very low power but also with VERY high accuracy. You don’t want to have a background conversation triggering the phone call to hang up! Accuracy is another Sensory forte! The combination of accuracy with low power consumption is a difficult mix to conquer! Sensory’s accuracy is not only in noise but also from a distance…that is when a recognizer works well with a poor S/N ratio, that means the signal can be lower (like from distance) and/or the noise can be higher.
So it’s really cool that Nuance is getting on the bandwagon behind Sensory’s innovations like TrulyHandsfree at low power. In fact after Samsungs release on the Galaxy SII with Sensory, Nuance did come out with an always “on and listening mobile device”; for fun we quickly ported our technology onto the same phone to compare…check out this video.
Something interesting we noticed was that after Sensory announced its speaker verification and speaker ID for mobile devices at CTIA this year, Nuance shortly thereafter came out with their own announcement, but there were no demo’s available so we couldn’t do a comparison video.