August 6, 2018
Here’s the basic motivation that I see in creating Voice Assistants…Build a cross platform user experience that makes it easy for consumers to interact, control and request things through their assistant. This will ease adoption and bring more power to consumers who will use the products more and in doing so create more data for the cloud providers. This “data” will include all sorts of preferences, requests, searches, purchases, and will allow the assistants to learn more and more about the users. The more the assistant knows about any given user, the BETTER the assistant can help the user in providing services such as entertainment and assisting with purchases (e.g. offering special deals on things the consumer might want). Let’s look at each of these in a little more detail:
1. Owning the cross platform user experience and collecting user data to make a better Voice Assistant.
Owning the user experience on a single device is not good enough. The goal of each of these voice assistants is to be your personal assistant across devices. On your phone, in your home, in your car, wherever you may go. This is why we see Alexa and Google and Siri all battling for, as an example, a position in automotive. Your assistant wants to be the place you turn for consistent help. In doing so it can learn more about your behaviors…where you go, what you buy, what you are interested in, who you talk to, and what your history is. This isn’t just scary big brother stuff. It’s quite practical. If you have multiple assistants for different things, they may each think of you and know you differently, thereby having a less complete picture. It’s really best for the consumer to have one assistant that knows you best.
For example, let’s take the simple case of finding food when I’m hungry. I might say “I’m hungry.” Then the assistant’s response would be much more helpful the more it knows about me. Does it know I’m a vegetarian? Does it know where I’m located, or whether I am walking or driving? Maybe it knows I’m home and what’s in my refrigerator, and can suggest a recipe…does it know my food/taste preferences? How about cost preferences? Does it have the history of what I have eaten recently, and know how much variety I’d like? Maybe it should tell me something like “Your wife is at Whole Foods, would you like me to text her a request or call her for you?” It’s easy to see how these voice assistants could really be quite helpful the more they know about you. But with multiple assistants in different products and locations, the picture wouldn’t be as complete. In this example it might know I’m home, but NOT know what’s in my fridge. Or it might know what’s in the fridge and know I’m home but NOT know my wife is currently shopping at Whole Foods, etc.
The more I use my assistant across more devices in more situations and over more time, the more data it could gather and the better it should get at servicing my needs and assisting me! It’s easy to see that once it knows me well and is helping me with this knowledge it will get VERY sticky and become difficult to get me to switch to a new assistant that doesn’t know me as well.
2. Entertainment and other service package sales.
3. Selling and recommending products to consumers
It would be really obnoxious if Alexa or Siri or Cortana or Google Assistant suddenly suggested I buy something that I wasn’t interested in, but what if it knew what I needed? For example, it could track vitamin usage and ask if I want more before they run out, or it could know how frequently I wear out my shoes, and recommend a sale for my brand and my size, when I really needed them. The more my assistant knows me the better it can “advertise” and sell me in a way that’s NOT obnoxious but really helpful. And of course making extra money in the process!
July 25, 2018
I have spoken on a lot of “voice” oriented shows over the years, and it has been disappointing that there hasn’t been more discussion about the competition in the industry and what is driving the huge investments we see today. Because companies like Amazon and Google participate in and sponsor these shows, there is a tendency to avoid the more controversial aspects of the industry. I wrote this blog to share some of my thoughts on what is driving the competition, why the voice assistant space is so strategically important to companies, and some of the challenges resulting from the voice assistant battles.
In September of 2017 it was widely reported that Amazon had over 5000 employees working on Alexa with more than 1000 more to be hired. To use a nice round and conservative number, let’s assume an average Alexa employee’s fully weighted cost to Amazon is $200K. With about 6,000 employees on the Alexa team today, that would mean a $1.2 billion investment. Of course, some of this is recouped by the Echos and Dots bringing in profits, but when you consider that Dots sell for $30-$50 and Echos at $80-$100, it’s hard to imagine a high enough profit to justify the investment through hardware sales. For example, if Amazon can sell 30 million Alexa devices and make an average of $30 per unit profit, that only covers 75% of the cost of the conservative $1.2 billion investment.
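The arithmetic above is easy to check with a quick back-of-envelope sketch. Every figure here is one of the assumptions stated in this post (headcount, fully weighted cost, units sold, profit per unit), not a number reported by Amazon.

```python
# Back-of-envelope check using only the assumptions stated in the post.
employees = 6_000              # approximate Alexa headcount assumed above
cost_per_employee = 200_000    # assumed fully weighted cost per employee, USD
investment = employees * cost_per_employee

devices_sold = 30_000_000      # hypothetical unit sales from the example
profit_per_device = 30         # hypothetical average profit per unit, USD
hardware_profit = devices_sold * profit_per_device

# Share of the investment that hardware profit alone would cover
coverage = hardware_profit / investment

print(f"Investment:      ${investment / 1e9:.1f} billion")
print(f"Hardware profit: ${hardware_profit / 1e9:.1f} billion")
print(f"Coverage:        {coverage:.0%}")
```

Under these assumptions hardware profit comes to $0.9 billion, which leaves a quarter of the investment to be justified by something other than device sales.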
Other evidence supporting the huge investments being made in voice assistants is the battle in advertising. Probably the most talked about thing at 2018’s CES show was the enormous position Google took in advertising the Google Assistant. In fact, if you watch any of the most expensive advertising slots on TV (Super Bowl, NBA finals, World Cup, etc.) you will see a preponderance of advertisements with known actors and athletes saying “Hey Google,” “Alexa,” or, “Hey Siri.” (Being in the wakeword business, I particularly like the Kevin Durant “Yo Google” ad!)
And it’s not just the US giants that are investing big into assistants: Docomo, Baidu, Tencent, Alibaba, Naver, and other large international players are developing their own or working with 3rd party assistants.
So what is driving this huge investment companies are making? It’s a multitude of factors including:
In my next blog, I’ll discuss these three factors in more detail, and in a final blog on this topic I will discuss the challenges being faced by consumer OEMs and service providers that must play in the voice assistant game to not lose out to service and hardware competition from Apple, Amazon, Google, and others.
April 3, 2018
Santa Clara, Calif., April 3, 2018 – Sensory’s TrulyHandsfree speech recognition has been re-engineered to run ultra-low-power on Android and iOS mobile applications without special hardware
Sensory, a Silicon Valley-based company focused on improving the user experience and security of consumer electronics through state-of-the-art embedded AI technologies, today announced that it has made a significant breakthrough in running its TrulyHandsfree™ wake word and speech recognition AI engine directly on Android and iOS smartphone applications at low-power. As a software component, TrulyHandsfree can be adapted to any app without requiring special purpose hardware or DSPs to capture efficiencies in computing.
Introduced in 2009, TrulyHandsfree paved the way for the hands-free operation we have come to expect with today’s always-listening personal assistant solutions. When released it revolutionized voice user interfaces by offering the first commercially successful always-listening low power wake word. With each succeeding generation, TrulyHandsfree has continually upped the benchmark for always-listening speech recognition performance, by increasing accuracy, lowering power consumption, and running across an increasing number of hardware platforms at ultra-low-power consumption.
TrulyHandsfree has seen large commercial success by running on special purpose hardware for low-power operation. Companies like Avnera, Cirrus Logic, Conexant/Synaptics, CSR/Qualcomm, DSP Group, Knowles, QuickLogic, Realtek, XMOS and many others have penetrated the market for voice assistants using Sensory TrulyHandsfree technology. This specialized hardware approach has worked well for Sensory’s customers like Samsung, Huawei, LG, Motorola and other Android mobile providers who design their own phones and wearables with their choice of hardware.
Until now, always-listening wake word solutions for apps required too much power to be practical, especially for apps that remain open and active in the background. Additionally, having to maintain the same user experience across operating systems, and across all different devices added an extra layer of complexity. However, this isn’t the case anymore. TrulyHandsfree streamlines the implementation and coding process, allowing developers to quickly and easily deploy apps with power-efficient always-listening wake word and command set capabilities across all popular mobile and PC operating systems.
In 2017 Sensory embarked on investigations of using Qualcomm and ARM as more standard cross-platform solutions to figure out how to lower power consumption for wake words used across mobile platforms. Sensory came up with a series of independent actions that when combined could lower power consumption on a mobile app using a wake word by more than 80%, or a reduction of approximately 200mAh in a 12-hour day. That enables a mobile app wake word to consume approximately one-percent of the smartphone battery in 12 hours. To achieve this outstanding reduction in power consumption, Sensory utilized an approach known as “little-big,” which uses a very small model to identify an interesting event and then revalidates the event on a large model (both events are processed on the Application Processor). This method provides the optimal user experience of the big model only when needed, while maintaining the power consumption of the little model most of the time. Frame stacking approaches further cut certain wake word model processing functions’ MIPS in half with negligible accuracy impact. Additionally, multithreading has been deployed to allow more efficient processing of speech recognition and can significantly improve the speed of execution for larger wake word models.
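The “little-big” idea can be sketched in a few lines. This is only an illustrative two-stage cascade, not Sensory’s implementation: the function name, the thresholds, and the toy scoring models are all hypothetical stand-ins, and a real system would use trained acoustic models in both stages.

```python
from typing import Callable, List, Sequence

Frame = Sequence[float]
Model = Callable[[Sequence[Frame]], float]


def little_big_detect(
    window: Sequence[Frame],
    little_model: Model,
    big_model: Model,
    little_threshold: float = 0.3,   # hypothetical, deliberately permissive
    big_threshold: float = 0.8,      # hypothetical, deliberately strict
) -> bool:
    """Return True only if both stages agree the wake word was heard."""
    # Stage 1: the cheap model scores every window of audio.
    if little_model(window) < little_threshold:
        return False
    # Stage 2: the expensive model revalidates only the interesting events.
    return big_model(window) >= big_threshold


class CountingModel:
    """Wraps a scoring function and counts how often it is invoked."""

    def __init__(self, score_fn: Model) -> None:
        self.score_fn = score_fn
        self.calls = 0

    def __call__(self, window: Sequence[Frame]) -> float:
        self.calls += 1
        return self.score_fn(window)


# Toy stand-ins: "little" looks at the peak value, "big" at the mean value.
little = CountingModel(lambda w: max(max(f) for f in w))
big = CountingModel(lambda w: sum(sum(f) for f in w) / sum(len(f) for f in w))

windows: List[List[List[float]]] = [
    [[0.0, 0.1], [0.1, 0.0]],   # silence: rejected by the little model
    [[0.9, 0.0], [0.0, 0.0]],   # a click: passes little, rejected by big
    [[0.9, 0.9], [0.8, 1.0]],   # sustained speech: accepted by both
]
results = [little_big_detect(w, little, big) for w in windows]
print(results, "little calls:", little.calls, "big calls:", big.calls)
```

The little model runs on all three windows while the big model runs on only two of them, which is where the power saving would come from: the less often the little model is triggered, the closer average consumption stays to that of the little model alone.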
“Hands-free operation for voice control has become the norm, and application developers are now looking to create hands-free wake words for their own apps,” said Todd Mozer, CEO of Sensory. “For example, we recently helped Google’s Waze accept hands-free voice commands by supplying them with Sensory’s ‘OK Waze’ wake word that runs when the app is open. With previous versions of TrulyHandsfree, having our always-on wake word engine listening for the OK Waze wake word during a short trip would have had minimal effect on a smartphone’s battery, but for longer trips a more efficient system was desired – so we created it. Sensory is excited to now offer TrulyHandsfree with excellent low-power performance to all app developers!”
TrulyHandsfree is the most widely deployed embedded speech recognition engine in the world, having enabled a hands-free voice user experience on more than two billion devices from leading brands worldwide. TrulyHandsfree offers support for every voice UI application with several types of wake word options, such as independent fixed wake words, user enrolled fixed wake words, and user defined wake words. Sensory offers off-the-shelf wake word models for all major Assistant services, including Alexa, Hey Siri, OK Google, Hey Cortana, as well as wake word models for third-party devices that support cloud AI systems from Baidu, Alibaba and Tencent. Sensory can also combine multiple wake words into one solution and is the only supplier to have deployed numerous cross-assistant wake word solutions to the market.
Sensory’s TrulyHandsfree currently supports US English, UK English, Australian English, Indian English, Arabic, Dutch, French (EU and Canadian), German, Italian, Japanese, Korean, Mandarin, Portuguese (EU and Brazil), Russian, Spanish (EU, Latin America and US), Swedish and Turkish. An SDK for TrulyHandsfree is available for Android, iOS, Linux, Mac OS, QNX and Windows. Sensory provides developer support for cloud service interfaces on Android, iOS, Linux, Mac OS, Windows as well as support for dozens of proprietary DSPs, microcontrollers, smart microphones and other low-power embedded devices. SDK updates taking advantage of lower power TrulyHandsfree are now being rolled out for Android and iOS in Q2 2018.
TrulyHandsfree is a trademark of Sensory Inc.
October 12, 2017
Amazon, Google, Sonos, and LINE all introduced smart speakers within a few weeks of each other. Here’s my quick take and commentary on those announcements.
Amazon now has the new Echo, the old Echo, the Echo Plus, Spot, Dot, Show, and Look. The company is improving quality, adding incremental features, lowering cost, and seemingly expanding its leadership position. They make great products for consumers, have a very strong eco-system, and make very tough products to compete with for both their competitors and their many platform partners that use Alexa. Seems that their branding strategy is to use short three- or four-letter names that have Os. The biggest thing that was missing was speaker identification to know who’s talking to it. Interestingly, Amazon just added that capability.
Google execs wore black shirts and jeans in a very ironic-seeming Steve Jobs fashion. They attacked the Amazon Dot with their Mini, and announced the Max to compete with the quality expectations of Sonos and Apple. I didn’t find much innovation in the product line or in their dress, but I’d still rank the Google Assistant as the most capable assistant I’ve used. Of course, Google got caught stealing data, so it makes sense they have more knowledge about us and can make a better assistant.
Sonos invented the Wi-Fi speaker market and has always been known for quality. They announced the Sonos One at a surprisingly aggressive $199 price point. Their unique play is to support Alexa, Assistant, and Siri, starting first with Alexa. Now this would put price pressure on Apple’s planned $349 HomePod, but my guess is that Apple will aggressively sell this into its captive, and demographically wealthy market before they allow Sonos to incorporate Siri. Like Apple, Sonos will have a nice edge in being able to sell into its existing customer base who will certainly want the added convenience and capability of voice control, with their choice of assistant.
American readers might not be familiar with LINE, but the company offers a hugely popular communications app that’s been downloaded by about a billion people. They’re big in Japan and owned by Naver, an even bigger Korean company that’s also working on a smart speaker.
Most notable about LINE (besides the unique looking speaker that resembles a cone with the top cut off) is that it appears that they’re not only beating Amazon, Google, Apple, and Sonos to Japan, but they’re also getting there before the Japanese giants like Docomo, Sony, Sharp, and Softbank. And all of these companies are making smart speakers.
Then, there are the Chinese giants who are all making smart speakers, and the old-school speaker companies who are trying to get into the game. It’s going to be crowded very quickly, and I’m very excited to see quality going up and costs staying low.
September 28, 2017
Finovate is one of those shows where you get up on stage and give a short intro and live demo. They are selective in who they allow to present and many applicants are rejected. Sensory demonstrated some really cutting-, perhaps bleeding-, edge stuff by combining animated talking avatars, with text-to-speech, lip movement synchronization, natural language speech recognition and face and voice biometrics. I don’t know of any company ever combining so many AI technologies into a single product or demo!
Speech recognition has a long history of failing on stage, and one of the ways Sensory has always differentiated itself is that our demos always work! And all our AI technologies worked here too! Even with bright backlighting, our TrulySecure face recognition was so fast and accurate some missed it. With the microphones and echoes in the large room, our TrulyNatural speech recognition was perfect! That said, we did have a user error… before Jeff and I got on stage he put his demo phone in DND mode, which cut our audio output, but we quickly recovered from that mishap.
September 25, 2017
Several hundred articles have been written about Amazon’s new moves into Smart Glasses with the Alexa assistant. And it’s not just TechCrunch, Gizmodo, The Verge, Engadget, and all the consumer tech pubs doing the writing. It’s also places like CNBC, USA Today, Fox News, Forbes, and many others.
I’ve read a dozen or more and they all say similar things about Amazon (difficulties in phone hardware), Google (failure in Glass), bone conduction mics, mobility for Alexa, strategy to get Alexa Everywhere, etc. But something big got lost in the shuffle.
Here’s your clue: the day before the Alexa Smart Glasses were announced, Amazon released details of a Fire Tablet upgrade, with one of the key features being a way to make Alexa Handsfree. That’s right, in both the glasses and the Fire Tablet, we have Alexa implementations running on batteries.
This is a REALLY big deal! This means that Amazon has already caught up to Google in being able to implement low-power devices with its handsfree Alexa Assistant. Is this important? Yes, it is. It may be the most important battle to be waged in the Assistant wars. This is because the assistant we want is the invisible assistant that’s embedded into our bodies and our clothing. This assistant would be so small that it enables a seamless experience to augment our intelligence and capabilities without anyone even knowing. This assistant has to be low power, and handsfree Alexa is now enabled in extremely power sensitive modes. Kudos to Amazon!
September 15, 2017
On the same day that Apple rolled out the iPhone X on the coolest stage of the coolest corporate campus in the world, Sensory gave a demo of an interactive talking and listening avatar that uses a biometric ID to know who’s talking to it. In Trump metrics, the event I attended had a few more attendees than Apple.
Interestingly, Sensory’s face ID worked flawlessly, and Apple’s failed. Sensory used a traditional camera using convolutional neural networks with deep learning anti-spoofing models. Apple used a 3D camera.
There are many theories about what happened with FaceID at Apple. Let’s discuss what failure even means and the effects of 2D versus 3D cameras. There are basically three classes of failure: accuracy, spoofability, and user experience. It’s important to understand the differences between them.
It’s easy to reach a false accept (FA) rate of one in a million or one in a billion by making the system false reject (FR) all of the time. For example, a rock will never respond to the wrong person… it also won’t respond to the right person! This is where Apple failed. They might have had amazing false accept rates, but they hit two false rejects on stage!
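To make the tradeoff concrete, here is a small sketch of how false accept (FA) and false reject (FR) rates move in opposite directions as the decision threshold changes. The match scores and thresholds are made up purely for illustration and do not come from any real biometric system.

```python
from typing import Sequence, Tuple


def error_rates(
    genuine_scores: Sequence[float],
    impostor_scores: Sequence[float],
    threshold: float,
) -> Tuple[float, float]:
    """Return (false_accept_rate, false_reject_rate) at a given threshold."""
    # An impostor scoring at or above the threshold is wrongly accepted.
    false_accepts = sum(s >= threshold for s in impostor_scores)
    # A genuine user scoring below the threshold is wrongly rejected.
    false_rejects = sum(s < threshold for s in genuine_scores)
    return (
        false_accepts / len(impostor_scores),
        false_rejects / len(genuine_scores),
    )


# Made-up match scores between 0 and 1, for illustration only.
genuine = [0.62, 0.71, 0.80, 0.88, 0.93]
impostor = [0.05, 0.15, 0.30, 0.45, 0.66]

for threshold in (0.5, 0.7, 1.01):
    fa, fr = error_rates(genuine, impostor, threshold)
    print(f"threshold={threshold:.2f}  FA={fa:.0%}  FR={fr:.0%}")
```

The last threshold is the “rock”: nothing can score above 1.0, so the FA rate is a perfect zero while every genuine user is rejected.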
I believe that there is too much emphasis placed on FA. The presumption is random users trying to break in, and 1 in 50,000 seems fine. The break-in issue typically relates to spoofability, which needs to be thought of in a different way – it’s not a random face, it’s a fake face of you.
Every biometric that gets introduced gets spoofed. Gummy bears, cameras, glue, and tape were all used to spoof fingerprints. Photos, masks, and videos have been used to spoof faces.
To prevent this, Sensory built anti-spoof models that reduce the probability of spoofing. 3D cameras also make it easier to reduce spoofs, and Apple moved in the right direction here. But the real solution is to layer biometrics, using additional layers when more security is needed.
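One simple way to picture layering is a policy that demands more independent biometric checks as the stakes go up. The sketch below is a hypothetical illustration only: the security levels, the threshold, and the layer names are assumptions, and this is not how TrulySecure or FaceID actually works.

```python
from typing import Dict

# Hypothetical policy: how many independent layers must pass at each level.
REQUIRED_LAYERS: Dict[str, int] = {"low": 1, "medium": 2, "high": 3}


def authenticate(
    scores: Dict[str, float],
    security_level: str,
    threshold: float = 0.8,   # assumed per-layer pass mark
) -> bool:
    """Grant access if enough biometric layers pass for this security level."""
    passed = [name for name, score in scores.items() if score >= threshold]
    return len(passed) >= REQUIRED_LAYERS[security_level]


# Made-up scores: face and voice match well, the spoken passphrase does not.
scores = {"face": 0.93, "voice": 0.85, "passphrase": 0.40}

decisions = {level: authenticate(scores, level) for level in REQUIRED_LAYERS}
print(decisions)
```

With these scores the same user unlocks low and medium security actions but is turned away at the high level, so a spoofer has to beat several different biometrics at once to reach anything sensitive.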
Apple misfires on UX?
Apple set the FA bar so high on FaceID that it hurt the consumer experience by rejecting too much, which is what we saw on stage. But there’s more to it in the tradeoffs.
The easiest way to prevent spoofing is to get the user to do unnatural things, live and randomly. Blinking was a less intrusive version that Google and others have tried, but a photo with the eyes cut out could spoof it.
Having people turn their face, widen their nostrils, or look in varying directions might help prevent spoofing, but also hurt the user experience. The trick is to get more intrusive only when the security needs demand it. Training the device is also part of the user experience.
August 30, 2017
A few days ago I wrote a blog that talked about assistants and wake words and I said:
“We’ll start seeing products that combine multiple assistants into one product. This could create some strange and interesting bedfellows.”
Interesting that this was just announced:
Here’s another prediction for you…
All assistants will start knowing who is talking to them. They will hear your voice and look at your face and know who you are. They will bring you the things you want (e.g. play my favorite songs), and only allow you to conduct transactions you are qualified for (e.g. order more black licorice). Today there is some training required but in the near future they will just learn who is who much like a newborn quickly learns the family members without any formal training.
August 28, 2017
Ten years ago, I tried to explain to friends and family that my company Sensory was working on a solution that would allow IoT devices to always be “on” and listening for a key wake up word without “false firing” and doing it at ultra-low power and with very little processing power. Generally, the response was “Huh?”
Today, I say, “Just like Hey Siri, OK Google, Alexa, Hey Cortana, and so on.” Now, everybody gets it and the technology is mainstream. In fact, next year, Sensory will have technology that’s embedded in IoT devices that listens for all those things (and more). But that’s not good enough.
Here are some of the things that will be appearing over the next 10 (or more) years to make always listening better and different:
June 26, 2017
Setting aside the question of whether rogue robots will create a dystopian future, there is one point on which movies about artificial intelligence (AI) all seem to converge: biometrics will take over for keys and passwords. There are over 200 movies that show the use of biometrics – here’s a list of 184 of them, and here’s a compilation of clips from several dozen movies.
Whether it’s fingerprint, voiceprint, iris, retina, face, or other biometrics, there always seems to be some sort of physical scanner in Hollywood depictions of biometrics in action. People have to hold their face or hand up to a device and the device often shines a laser and makes a noise. When they speak, a pass phrase like, “My voice is my password,” is typically required. In other words, the biometrics aren’t particularly fast or easy. The devices don’t just know who people are; they need to be queried and some sort of physical analysis needs to happen after the query.
That’s not how it’s going to play out. In fact, it’s not going to be one biometric that gets a person entrance. It will be a layering of biometrics. They won’t all happen right when you want to open a door. Some will follow you around, maintaining an ongoing assessment of who you are. Other biometrics will be seamlessly assessed from cameras or other sensors in your environment, and still other biometric elements can be added by pinging your phone and asking the phone’s opinion on who you are.
One thing Hollywood got right, though, is how spoof-able biometrics tend to be, whether it’s by removing body parts, taking pictures or videos, or capturing a fingerprint with glue or gummy bears. In one scene in the movie The 6th Day, Adam Gibson, played by Arnold Schwarzenegger, is prevented from entering a restricted area when a scanner rejects his thumbprint. When a security guard approaches asking if he can help, Schwarzenegger holds the guard at gunpoint and says, “Yeah, you can stick your thumb in that.” The guard complies, which gains Schwarzenegger access. Spoofing isn’t necessarily easy – biometric vendors try to make it hard – but most single biometrics are spoof-able, and the movies we watch certainly convey that.
We will see more of these biometric implementations with a mixture of face, voice, and behavioral biometrics combined with hand, eye, or other scans that are seamlessly taken and associated with a given person. This approach substantially increases the difficulty in spoofing, yet it can be done in a completely un-intrusive manner without wasting time. Of course, in a movie it would look like people gain access without doing anything special, and that may take away from some of the “cool factor” in watching biometrics work.