HEAR ME -
Speech Blog

Archive for the ‘Voice Control’ Category

Assistant vs Alexa: 8 things not discussed (enough)

October 14, 2016

I watched Sundar and Rick and the team at Google announce all the great new products from Google. I’ve read a few reviews and comparisons with Alexa/Assistant and Echo/Home, but it struck me that there’s quite an overlap in the reports I’m reading and some of the more interesting things aren’t being discussed. Here are a few of them, roughly in increasing order of importance:

  1. John Denver. Did anybody notice the Google Home advertisement using John Denver’s “Country Roads”? Really? Couldn’t they have found something better? “Country Roads” didn’t make PlayBuzz’s list of the 15 best “home” songs or Jambase’s top 10 home songs. Couldn’t someone have Googled “best home songs” to find something better?
  2. Siri and Cortana. With all the buzz about Amazon vs. Google, I’m wondering what’s up with Siri and Cortana? Didn’t see much commentary on that.
  3. AI acquisitions. Anybody notice that Google acquired API.ai? API.ai always claimed to have the highest rated voice assistant in the Play Store. They called it “Assistant.” Hm. Samsung just acquired VIV – that’s Adam, Dag, Marco, and company, who were behind the original Siri. Samsung has known for a while that they couldn’t trust Google and they always wanted to keep a distance.
  4. Assistant is a philosophical change. Google’s original positioning for its voice services was that Siri and Cortana could be personal assistants, but Google was just about getting to the information fast, not about personalities or conversations. The name “assistant” implies this might be changing.
  5. Google: a marketing company? Seems like Google used to pride itself on being devoid of marketing. They had engineers. Who needs marketing? This thinking came through loud and clear in the naming of their voice recognizer. Was it Google Voice, Google Now, OK Google? Nobody knew. This historical lack of marketing and market focus was probably harmful. It would be fatal in an era of moving more heavily into hardware. That’s probably why they brought on Rick Osterloh, who understands hardware and marketing. Rick, did you approve that John Denver song?
  6. Data. Deep learning is all about data. Data that’s representative and labeled is the key. Google has been collecting and classifying all sorts of data for a very long time. Google will have a huge leg up on data for speech recognition, dialogs, pictures, video, searching, etc. Amazon is relatively new to the voice game, and it is at quite a disadvantage in the data game.
  7. Shopping. The point of all these assistants isn’t about making our lives better; it’s about getting our money. Google and Amazon are businesses with a profit motive, right? Google is very good at getting advertising dollars through search. Amazon is, among other things, very good at getting shoppers’ money (and they probably have a good amount of shopping data). If Amazon knows our buying habits and preferences and has the review system to know what’s best, then who wants ads? Just ship me what I need and if you get it wrong, let me return it hassle free. I don’t blame Google for trying to diversify. The ad model is under attack by Amazon through Alexa, Dash, Echo, Dot, Tap, etc.
  8. Personalization, privacy, embedded. Sundar talked a bit about personalization. He’s absolutely right that this is the direction assistants need to move (even if speaker verification isn’t built into the first Home units). Personalization occurs by collecting a lot of data about each individual user – what you sound like, how you say things, what music you listen to, what you control in your house, etc. Sundar didn’t talk much about privacy, but if you read user commentary on these home devices, the top issue by far relates to an invasion of privacy, which directly goes against personalization. The more privacy you give up, the more personalization you get. Unless… What if your data isn’t going to the cloud? What if it’s stored on your device in your home? Then privacy is at less risk, but the benefits of personalization can still exist. Maybe this is why Google briefly hit on the Embedded Assistant! Google gets it. More of the smarts need to move onto the device to ensure more privacy!

Sensory Earns Two Coveted 2016 Speech Tech Magazine Awards

August 22, 2016

Sensory is proud to announce that it has been awarded two 2016 Speech Tech Magazine Awards. With some stiff competition in the speech industry, Sensory continues to excel in offering the industry’s most advanced embedded speech recognition and speech-based security solutions for today’s voice-enabled consumer electronics movement.

The 2016 Speech Technology Awards include:

Speech Luminary Award – Awarded to Sensory’s CEO, Todd Mozer

“What really impresses me about Todd is his long commitment to speech technology, and specifically, his focus on embedded and small-footprint speech recognition,” says Deborah Dahl, principal at Conversational Technologies and chair of the World Wide Web Consortium’s Multimodal Interactions Working Group. “He focuses on what he does best and excels at that.”

Star Performers Award – Awarded to Sensory for its contributions in enabling voice-enabled IoT products via embedded technologies

“Sensory has always been in the forefront of embedded speech recognition, with its TrulyHandsfree product, a fast, accurate, and small-footprint speech recognition system. Its newer product, TrulyNatural, is groundbreaking because it supports large vocabulary speech recognition and natural language understanding on embedded devices, removing the dependence on the cloud,” said Deborah Dahl, principal at Conversational Technologies and chair of the World Wide Web Consortium’s Multimodal Interactions Working Group. “While cloud-based recognition is the right solution for many applications, if the application must work regardless of connectivity, embedded technology is required. The availability of TrulyNatural embedded natural language understanding should make many new types of applications possible.”

– Guest Blog by Michael Farino

Speaking the language of the voice assistant

June 17, 2016

Hey Siri, Cortana, Google, Assistant, Alexa, BlueGenie, Hound, Galaxy, Ivee, Samantha, Jarvis, or any other voice-recognition assistant out there.

Now that Google and Apple have announced that they’ll be following Amazon into the home far-field voice assistant business, I’m wondering how many things in my home will always be on, listening for voice wakeup phrases. And how will they work together (if at all)? Let’s look at some possible alternatives:

Co-existence. We’re heading down a path where we as consumers will have multiple devices on and listening in our homes and each device will respond to its name when spoken to. This works well with my family; we just talk to each other, and if we need to, we use each other’s names to differentiate. I can have friends and family over or even a big party, and it doesn’t become problematic calling different people by different names.

The issue for household computer assistants all being on simultaneously is that false fires will grow in direct proportion to the number of devices on and listening. With Amazon’s Echo, I get a false fire about every other day, and Alexa does a great job of listening to what I say after the false fire and ignoring it if it doesn’t seem to be an intended command. It’s actually the best performing system I’ve used, and the fact that it starts playing music or talking only every other week is a testament to what a good job they have done. However, interrupting my family every other week is not good enough. And if I have five always-listening devices interrupting us 10 times a month, that becomes unacceptable. And if they don’t do as good a job as Alexa, and interrupt more frequently, it becomes quite problematic.
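To put rough numbers on that scaling, here is a minimal Python sketch. It assumes every device interrupts independently at the same average rate, and it rounds a month to 30 days; the function names and constant are made up for illustration, not taken from any product.

```python
DAYS_PER_MONTH = 30  # rounding assumption for this sketch


def household_interruptions_per_month(per_device_per_month, num_devices):
    """Total interruptions per month, assuming each device misfires
    independently at the same average rate (linear scaling)."""
    return per_device_per_month * num_devices


def days_between_interruptions(per_device_per_month, num_devices):
    """Average number of days between interruptions across the household."""
    total = household_interruptions_per_month(per_device_per_month, num_devices)
    if total == 0:
        return float("inf")  # no interruptions at all
    return DAYS_PER_MONTH / total


# One device that interrupts every other week is roughly 2 per month.
print(household_interruptions_per_month(2, 5))  # 10
print(days_between_interruptions(2, 5))         # 3.0
```

Under those assumptions, five devices that are each as good as the Echo would still interrupt the household about once every three days.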

Functional winners. Maybe each device could own a functional category. For example, all my music systems could use Alexa, my TVs use Hi Galaxy, and all appliances are Bosch. Then I’d have fewer “names” to call out to and there would be some big benefits: 1) The devices using the same trigger phrase could communicate and compare what they heard to improve performance; 2) More relevant data could be collected on the specific usage models, thus further improving performance; and 3) With fewer names to call out, I’d have fewer false fires. Of course, this would force me as a consumer to decide on certain brands to stick to in certain categories.

Winner take all. Amazon is adopting a multi-pronged strategy of developing its own products (Echo, Dot, Tap, etc.) and also letting its products control other products. In addition, Amazon is offering the backend Alexa voice service to independent product developers. It’s unclear whether competitors will follow suit, but one thing is clear—the big guys want to own the home, not share it.

Amazon has a nice lead as it gets other products to be controlled by Echo. The company even launched an investment fund to spur more startups writing to Alexa. Consumers might choose an assistant we like (and we think performs well) and just stick with that across the household. The more we share with that assistant, the better it knows us, and the better it serves us. This knowledge base could carry across products and make our lives easier.

Just Talk. In the “co-existence” case previously mentioned, there are six people in my household, so it can be a busy place. But when I speak to someone, I don’t always start with their name. In fact, I usually don’t. If there’s just one other person in the room, it’s obvious who I’m speaking to. If there are multiple people in the room, I tend to look at or gesture toward the person I’m addressing. This is more natural than speaking their name.

An “always listening” device should have other sensors to know things like how many people are in the room, where they’re standing, where they’re looking, how they’re gesturing, and so on. These are the subconscious cues humans use to know who is talking to us, and our devices would be smarter and more capable if they could use them too.

Google Assistant vs. Amazon’s Alexa

June 15, 2016

“Credit to the team at Amazon for creating a lot of excitement in this space,” said Google CEO Sundar Pichai. He made this comment during his Google I/O speech last week when introducing Google’s new voice-controlled home speaker, Google Home, whose description sounds similar to Amazon’s Echo. Many interpreted this as a “thanks for getting it started, now we’ll take over” kind of comment.

Google has always been somewhat marketing challenged in naming its voice assistant. Everyone knows Apple has Siri, Microsoft has Cortana, and Amazon has Alexa. But what is Google’s voice assistant called? Is it Google Voice, Google Now, OK Google, Voice Actions? Even those of us in the speech industry have found Google’s branding to be confusing. Maybe they’re clearing that up now by calling their assistant “Google Assistant.” Maybe that’s the Google way of admitting it’s an assistant without admitting they were wrong by not giving it a human sounding name.

The combination of the early announcement of Google Home and Google Assistant has caused some to comment that Amazon has BIG competition at best, and at worst, Amazon’s Alexa is in BIG trouble.

Forbes called Google’s offering the Echo Killer, while Slate said it was smarter than Amazon’s Echo.

I thought I’d point out a few good reasons why Amazon is in pretty good shape:

  1. Google Home is not shipping. Google has a bit of a chicken-and-egg issue in that it needs to roll out a product that has industry support (for controlling third-party products by voice). How do you get industry partners without a product? You announce early! That was a smart move; now they just need to design it and ship it…not always an easy task.
  2. It’s about Voice Commerce. This is REALLY important. Many people think Google will own this home market because it has a better speech recognizer. Speech recognition capabilities are nice but not the end game. The value here is having a device that’s smart and trusted enough to take money out of our bank accounts and deliver us goods and services that we want when we want them. Amazon has a huge infrastructure lead here in products, reviews, shipping, and other key components of Internet commerce. Adding a convenient voice front end isn’t easy, but it’s also NOT the hardest part of enabling big revenue voice commerce systems.
  3. Amazon has far-field working and devices that always “talk back.” I admit the speech recognition is important, and Google has a lot of data, experience, and technologists in machine learning, AI, and speech recognition. But most of the Google experience is through Android and mobile-phone hardware. Where Amazon has made a mark is in far-field or longer distance recognition that really works, which is not easy to do. Speech recognition has always been about signal/noise ratios and far-field makes the task more difficult and requires acoustic echo cancellation, multiple microphones, plus various gain control and noise filtering/speech focusing approaches. Also, the Google recognizer was established around finding data through voice queries, most of such data being displayed on-screen (and often through search). The Google Home and Amazon Echo are no-screen devices. Having them intelligently talk back means more than just reading the text off a search. Google can handle this, of course, but it’s one more technical barrier that needs to be done right.
  4. Amazon has a head start and already is an industry standard. Amazon’s done a nice job with the Echo. Its follow-on products, Tap and Dot, were intelligent offshoots. Even its Fire TV took advantage of in-house voice capabilities. The Alexa Voice Services work well and already are acting like a standard for voice control. Roughly three million Amazon devices have already sold, and I’d guess that in the next year, the number of Alexa connected devices will double through both Amazon sales and third parties using AVS. This is not to mention the tens of millions of devices on the market that can be controlled by Echo or other Amazon hardware. Amazon is pretty well entrenched!
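As a rough illustration of why far-field (point 3 above) is harder on signal/noise ratio, here is a small Python sketch. It is a simplification, not a description of how Echo or Home work: it assumes a free-field point source, where speech level falls by 20·log10(far/near) dB with distance, a constant noise level at the microphone, and no reverberation. The distances and the 30 dB starting SNR are hypothetical values chosen only for the example.

```python
import math


def speech_level_drop_db(near_m, far_m):
    """Drop in speech level (dB) when the talker moves from near_m to far_m
    away from the microphone, assuming free-field spreading (about 6 dB per
    doubling of distance)."""
    return 20.0 * math.log10(far_m / near_m)


def far_field_snr_db(near_snr_db, near_m, far_m):
    """SNR at the far distance, assuming the noise level at the microphone
    stays the same while the speech level drops with distance."""
    return near_snr_db - speech_level_drop_db(near_m, far_m)


# Hypothetical example: phone held 0.3 m away vs. a speaker 3 m across the room.
drop = speech_level_drop_db(0.3, 3.0)
print(round(drop, 1))                               # 20.0
print(round(far_field_snr_db(30.0, 0.3, 3.0), 1))   # 10.0
```

Real rooms add reverberation and the device’s own audio playback on top of this, which is where the echo cancellation, multiple microphones, and noise filtering mentioned above come in.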

Of course, Amazon has its challenges as well, but I’ll leave that for another blog.

IoT Roadshow with Open Systems Media

May 6, 2016

Rich Nass and Barbara Quinlan from Open Systems Media visited Sensory on their “IoT Roadshow”.

IoT is a very interesting area. About 10 years ago we saw voice controlled IoT on the way, and we started calling the market SCIDs – Speech Controlled Internet Devices. I like IoT better; it’s certainly a more popular name for the segment! ;-)

I started our meeting off by talking about Sensory’s three products – TrulyHandsfree Voice Control, TrulySecure Authentication, and TrulyNatural large vocabulary embedded speech recognition.

Although TrulyHandsfree is best known for its “always on” capabilities, ideal for listening for key phrases (like OK Google, Hey Cortana, and Alexa), it can be used in a ton of other ways. One of them is hands-free photo taking, so no selfie stick is required. To demonstrate, I put my camera on the table and took pictures of Barbara and Rich. (Normally I might have joined the pictures, but their healthy hair, naturally good looks, and formal attire outclassed me.)

There’s a lot of hype about IoT and Wearables and I’m a big believer in both. That said, I think Amazon’s Echo is the perfect example of a revolutionary product that showcases the use of speech recognition in the IoT space, and I am looking forward to some innovative uses of speech in Wearables!

Here’s the article they wrote on their visit to Sensory and an impromptu video showing TrulyNatural performing on-device navigation, as well as a demo of TrulySecure via our AppLock Face/Voice Recognition app.

IoT Roadshow, Santa Clara – Sensory: Look ma, no hands!

Rich Nass, Embedded Computing Brand Director

If you’re an IoT device that requires hands-free operation, check out Sensory, just like I did while I was on OpenSystems Media’s IoT Roadshow. Sensory’s technology worked flawlessly running through the demo, as you can see in the video. We ran through two different products, one for input and one for security.

Sensory’s CEO, Todd Mozer, interviewed on FutureTalk

October 1, 2015

Todd Mozer’s interview with Martin Wasserman on FutureTalk

TrulyHandsfree 4.0… Maintaining the big lead!

August 6, 2015

We first came out with TrulyHandsfree about five years ago. I remember talking to speech tech executives at MobileVoice as well as other industry tradeshows, and when talking about always-on hands-free voice control, everybody said it couldn’t be done. Many had attempted it, but their offerings suffered from too many false fires, or not working in noise, or consuming too much power to be always listening. Seems that everyone thought a button was necessary to be usable!

In fact, I remember the irony of being on an automotive panel, and giving a presentation about how we’ve eliminated the need for a trigger button, while the guy from Microsoft presented on the same panel the importance of where to put the trigger button in the car.

Now, five years later, voice activation is the norm… we see it all over the place with OK Google, Hey Siri, Hey Cortana, Alexa, Hey Jibo, and of course if you’ve been watching Sensory’s demos over the years, Hello BlueGenie!

Sensory pioneered the button free, touch free, always-on voice trigger approach with TrulyHandsfree 1.0 using a unique, patented keyword spotting technology we developed in-house, and from its inception it was highly robust to noise and ultra-low power. Over the years we have ported it to dozens of platforms, including DSP/MCU IP cores from ARM, Cadence, CEVA, NXP CoolFlux, Synopsys and Verisilicon, as well as integrated circuits from Audience, Avnera, Cirrus Logic, Conexant, DSPG, Fortemedia, Intel, Invensense, NXP, Qualcomm, QuickLogic, Realtek, STMicroelectronics, TI and Yamaha.

This vast platform compatibility has allowed us to work with numerous OEMs to ship TrulyHandsfree in over a billion products!

Sensory didn’t just innovate a novel keyword spotting approach; we’ve continually improved it by adding features like speaker verification and user defined triggers. Working with partners, we lowered the draw on the battery to less than 1mA, and Sensory introduced hardware and software IP to enable ultra-low-power voice wakeup of TrulyHandsfree. All the while, our accuracy has remained the best in the industry for voice wakeup.

We believe the bigger, more capable companies trying to make voice triggers have been forced to use deep learning speech techniques to try and catch up with Sensory in the accuracy department. They have yet to catch up. Through deep learning they have brought their products to a very usable accuracy level, but they have lost many of the advantages of small footprint and low power in the process.

Sensory has been architecting solutions for neural nets in consumer electronics since we opened the doors more than 20 years ago. With TrulyHandsfree 4.0 we are applying deep learning to improve accuracy even further, pushing the technology even more ahead of all other approaches, yet enabling an architecture that has the ability to remain small and ultra-low power. We are enabling new feature extraction approaches, as well as improved training in reverb and echo. The end result is a 60-80% boost in what was already considered industry-leading accuracy.

I can’t wait for TrulyHandsfree 5.0… we have been working on it in parallel with 4.0, and although it’s still a long way off, I am confident we will make the same massive improvements in speaker verification with 5.0 that we are making for speech recognition in 4.0, once again advancing the state of the art in embedded speech technologies!

Sensory Talks AI and Speech Recognition With Popular Science Radio Host Alan Taylor

June 11, 2015

Guest post by: Michael Farino

Sensory’s CEO, Todd Mozer, joined Alan Taylor, host of Popular Science Radio, in a fun discussion about artificial intelligence and Sensory’s involvement with the Jibo robot development team, and gave the show’s listeners a look into the past 20 years of speech recognition. Todd and Alan also discussed some of the latest advancements in speech technology, and Todd provided an update on Sensory’s most recent achievements in the field of speech recognition as well as a brief look into what the future holds.

Listen to the full radio show at the link below:

Big Bang Theory, Science, and Robots | FULL EPISODE | Popular Science Radio #269
Ever wondered how accurate the science of the Big Bang Theory TV series is? Curious about how well speech recognition technology and robots are advancing? We interview two great minds to probe for these answers

Going Deep Series – Part 3 of 3

May 1, 2015

Winning on Accuracy & Speed… How can a tiny player like Sensory compete in deep learning technology with giants like Microsoft, Google, Facebook, Baidu and others?

There are a number of ways, and let me address them specifically:

  1. Personnel: We all know it’s about quality, not quantity. I’d like to think that at Sensory we hire higher-caliber engineers than they do at Google and Microsoft; and maybe to an extent that is true, but probably not true when comparing their best with our best. We probably do, however, have less turnover. Less turnover means our experience and knowledge base is more likely to stay in house rather than walk off to our competitors, or get lost because it wasn’t documented.
  2. Focus and strategy: Sensory’s ability to stay ahead in the field of speech recognition and vision is because we have remained quite focused and consistent from our start. We pioneered the use of neural networks for speech recognition in consumer products. We were focused on consumer electronics before anyone thought it was a market…more than a dozen years before Siri!
  3. “Specialized” learning: Deep learning works. But Sensory has a theory that it can also be destructive when individual users fall outside the learned norms. Sensory learns deep on a general usage model, but once we go on device, we learn shallow through a specialized adaptive process. We learn to the specifics of the individual users of the device, rather than to a generalized population.

These three items together have provided Sensory with the highest quality embedded speech engines in the world. It’s worth reiterating why embedded is needed, even if speech recognition can all be done in the cloud:

  1. Privacy: Privacy is at the forefront of today’s most heated topics. There is growing concern about “big brother” organizations (and governments) that know the intimate details of our lives. Using embedded speech recognition can help improve privacy by not sending personal data for analysis into the cloud.
  2. Speed: Embedded speech recognition can be ripping fast and consistently available. Online or cloud-based recognition services can be spotty when Internet connections are unstable, and they are not always available.
  3. Accuracy: Embedded speech systems have the potential advantage of a superior signal to noise ratio and don’t risk data loss or performance issues due to a poor or non-existent connection.

Going Deep Series – Part 2 of 3

April 22, 2015

How do Big Data and Privacy fit into the whole Deep Learning Puzzle?

Privacy and Big Data have become big concerns in the world of Deep Learning. However, there is an interesting relationship between the Privacy of personal data and information, Big Data, and Deep Learning. That’s because a lot of the Big Data is personal information used as the data source for Deep Learning. That’s right, to make vision, speech and other systems better, many companies invade users’ personal information and the acquired data is used to train their neural networks. So basically, Deep Learning is neural nets learning from your personal data, stats, and usage information. This is why when you sign a EULA (end user license agreement) you typically give up the rights to your data, whether it’s usage data, voice data, image data, personal demographic info, or other data supplied through the “free” software or service.

Recently, it was brought to consumers’ attention that some TVs and even children’s toys were listening in on consumers, and/or sharing and storing that information to the cloud. A few editors called me to get my input, and I explained that there are a few possible reasons for devices to do this kind of “spying,” none of which is the least bit nefarious. The two most common reasons are: 1) the speech recognition technology being used needs the voice data to train better models, so it gets sent to the cloud to be stored and used for Deep Learning; and/or 2) the speech recognition needs to process the voice data in the cloud because it is unable to do so on the device. (Sensory will change this second point with our upcoming TrulyNatural release!)

The first reason is exactly what I’ve been blogging about when we say Deep Learning. More data is better! The more data that gets collected, the better the Deep Learning can be. The benefits can be applied across all users, and as long as the data is well protected and not released, then it only has beneficial consequences.

Therein lies the challenge: “as long as the data is well protected and not released…” If banks, billion dollar companies and governments can’t protect personal data in the cloud, then who can, and why should people ever assume their data is safe, especially from systems where there is no EULA in place and data is being collected without consent (which happens all the time BTW)?

Having devices listen in on people and share their voice data with the cloud for Deep Learning or speech recognition processing is an invasion of privacy. If we could just keep all of the deep neural net and recognition processing on device, then there would be no need to risk the security of people’s personal data by sharing and storing it in the cloud… and it’s with this philosophy that Sensory pioneered an entirely different, “embedded” approach to deep neural net based speech recognition, which we will soon be bringing to market. Sensory actually uses Deep Learning approaches to train our nets with data collected from EULA consenting and often paid subjects. We then take the recognizer built from that research and run it on our OEM customers’ devices, and because of that, we never have to collect personal data; so, the consumers who buy products from Sensory’s OEM customers can rest assured that Sensory is never putting their personal data at risk!

In my next blog, I’ll address the question about how accurate Sensory can be using deep nets on device without continuing data collection in the cloud. There are actually a lot of advantages for running on device beyond privacy, and it can include not only response time but accuracy as well!
