Recognizing Kids’ Voices: Advancements in Speech Recognition for Children

February 28, 2023

Since the early 2000s, a number of universities have conducted cutting-edge speech technology research aimed at improving reading fluency and comprehension in children. At Carnegie Mellon University (CMU), Project LISTEN’s virtual tutor allowed children and computers to take turns reading stories aloud. Similarly, the University of Colorado’s Literacy Tutor tracked the child’s reading position on screen while measuring and monitoring events such as oral reading miscues. Complex research systems such as these have offered the possibility of a tireless and engaging tutor focused on a variety of foundational literacy skills. More recently, researchers have focused on even richer and more advanced interactions. For example, Boulder Learning’s My Science Tutor (MyST) uses a virtual character to help teach science experiments, and CMU’s RoboTutor, an entry in the XPRIZE literacy challenge, aims to improve reading, writing and math skills across the world.

Despite the opportunities to use speech technology to improve education, children’s voices have always presented difficulty for automatic speech recognition. Higher error rates are often attributed to variability in vocal tract length, formant frequencies and pronunciation. Children often have a smaller vocabulary and less developed grammar, in addition to different pitch and prosody, all of which make accurate computer interpretation difficult. Children are also more likely to have verbal stumbles, including pauses, repetitions, and stutters.

As part of the Colorado Literacy Tutor Project in the mid-2000s, Drs. Bryan Pellom (now VP, Emerging Technologies @ Sensory) and Andreas Hagen (now Director, Machine Learning @ Sensory) studied the errors made by speech recognition systems when processing highly disfluent and spontaneous children’s speech. Their work proposed a novel speech recognition approach based on a more granular set of subword units. They hypothesized that such units would allow better modeling of the disfluencies, mispronunciations and restarts (e.g., “it was the fi- first day of sum- sum-mer vacation.”) that are often encountered as a child first learns to read. In their work, they determined optimal subword pieces using a data-driven analysis of grade-appropriate books. Their research was later described in the 2007 paper entitled “Highly Accurate Children’s Speech Recognition for Interactive Reading Tutors using Sub Word Units”. Hagen and Pellom found children’s speech recognition based on subword units to be more accurate and more robust to a wide range of variability. It’s important to note that the best speech recognition systems today have universally adopted the use of subword units!
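To make the subword idea concrete, here is a minimal Python sketch. It is not the method from the 2007 paper: the unit inventory (UNITS) and the greedy longest-match rule in segment() are assumptions invented purely for illustration. It only shows why a restart such as “sum-” is easy to represent once words are broken into smaller pieces: the partial word maps to a unit that the full word also uses.

```python
# Illustrative only: UNITS is a made-up inventory, not the data-driven
# subword set derived by Hagen and Pellom.
UNITS = {"sum", "mer", "fir", "st", "fi", "va", "ca", "tion"}


def segment(word, units):
    """Split `word` into subword units using greedy longest-match.

    Any character not covered by a unit is emitted on its own, so the
    function always returns a complete segmentation.
    """
    pieces = []
    i = 0
    while i < len(word):
        # Try the longest remaining substring first, then shrink it.
        for j in range(len(word), i, -1):
            if word[i:j] in units:
                pieces.append(word[i:j])
                i = j
                break
        else:
            # No unit matched at position i: fall back to one character.
            pieces.append(word[i])
            i += 1
    return pieces


if __name__ == "__main__":
    # A child's restart ("sum- sum-mer") and the complete word.
    for token in ["sum", "summer", "fi", "first", "vacation"]:
        print(token, "->", segment(token, UNITS))
```

In this toy setup the fragment “sum” and the word “summer” share the unit “sum”, so a recognizer whose vocabulary is built from such units can, in principle, score the false start directly instead of forcing it onto some unrelated whole word.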

At Sensory, we are continuing to advance the state of the art in children’s speech recognition. Our SensoryCloud.ai speech-to-text solution offers custom-tailored models targeting children’s voices. Such models produce 30% to 50% fewer errors than traditional adult-centric speech models. In fact, in a recent test of our solution on the Boulder My Science Tutor (MyST) corpus, we achieved a word error rate of 11.5%, or roughly one-third of the errors reported by other systems (Southwell et al., 2022).

Based on a comparison with earlier work from the Colorado Literacy Tutor Project, we estimate that Sensory’s solution today provides nearly a 75% relative reduction in word error rate compared to earlier research in kids’ speech recognition.
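For readers less familiar with these metrics, the following Python sketch shows how word error rate and a relative reduction are commonly computed. It is a generic illustration, not Sensory’s scoring pipeline: the function names, the sample transcripts, and the baseline figure used below are all assumptions chosen to keep the arithmetic easy to follow.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words.

    Substitutions, deletions and insertions each count as one error.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


def relative_reduction(baseline, improved):
    """Fraction of the baseline error rate removed by the improved system."""
    return (baseline - improved) / baseline


if __name__ == "__main__":
    # Made-up transcripts: one substituted word and one dropped word.
    ref = "it was the first day of summer vacation"
    hyp = "it was the fist day of summer"
    print(word_error_rate(ref, hyp))

    # Hypothetical baseline of 40% WER versus an improved 10% WER.
    print(relative_reduction(40.0, 10.0))
```

With the made-up transcripts above, two errors against eight reference words give a word error rate of 25%. Going from a hypothetical 40% to 10% is a 75% relative reduction, even though the absolute difference is 30 percentage points.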

As we look to the future, privacy and security for children’s speech applications are paramount. Regulations such as COPPA, the EU’s GDPR, and the California Consumer Privacy Act highlight the importance and value that parents and educators place on privacy and security in products designed for children. To that end, we pride ourselves on creating speech technologies that will meet this growing market need. Our cloud platform, for example, has been built from the ground up with data security and consumer privacy in mind and has been SOC2 Type 2 certified for data and security compliance. Our cloud technologies can even be deployed in a wide variety of configurations (Sensory-hosted, customer-hosted, or even on-premises) and in ways that ensure and enhance voice data privacy. Teams at Sensory have also addressed privacy and security needs by shrinking the size of today’s speech recognition systems to fit on a wide variety of platforms ranging from embedded operating systems to microcontrollers and DSPs.

Speech recognition technology has the potential to be a valuable tool for children in a variety of settings, from education to healthcare. However, it is important to recognize the unique challenges and privacy concerns associated with recognizing children’s speech. By developing accurate and transparent speech recognition systems that prioritize privacy, Sensory can help ensure that this technology is used in a responsible and effective way, while providing the most accurate results achievable today!