The Speech-to-Text Revolution

First, many schools decided to stop teaching cursive, since text on digital devices appears in print. Now, some parents are campaigning to end handwriting lessons entirely in favor of earlier typing education. Soon enough, kids might not learn any form of writing; they’ll just speak instead.

This might seem like a regression – after all, written language is widely believed to be what catapulted humanity into civilization. However, thanks to advances in speech recognition software, a return to purely spoken communication may be the next big step into the future.

The History of Speech Recognition

As is always the case with technology, the first speech recognition machines were extremely limited in what they could understand. In 1952, “Audrey” – the earliest ancestor of today’s “Siri” and “Alexa” – could recognize spoken digits, but only from a single, familiar voice. Ten years later, “Shoebox” could pick out a total of 16 English words. More than a decade after that, the DARPA-funded “Harpy” had roughly the vocabulary of a 3-year-old – but could search faster and more efficiently than any system before it.

Indeed, advances in speech recognition have largely tracked advances in search technology and methods, because a recognizer must match perceived sounds to possible meanings exceedingly quickly. Google has excelled at producing speech recognition software for mobile devices because its core product is a powerful web search that can discern meaning regardless of spelling or ambiguous phrasing.

During the 1970s and 1980s, innovation in speech recognition came thick and fast. Bell Laboratories developed a system that could interpret multiple voices, and mathematicians applied a statistical technique called the hidden Markov model, which scores the probability of sound sequences rather than matching them against fixed word templates. With this innovation, speech recognition machines began entering the consumer marketplace, as dictation aids (for adults) or responsive toys (for kids).
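To make the shift concrete, here is a minimal sketch of the core hidden Markov model computation – the forward algorithm, which scores how probable an observed sound sequence is under the model instead of comparing it to a stored template. The states, symbols, and probabilities below are toy values invented purely for illustration, not taken from any real recognizer.

```python
def forward(obs, states, start_p, trans_p, emit_p):
    """Forward algorithm: total probability of an observation sequence."""
    # alpha[s] = probability of the observations so far, ending in state s
    alpha = {s: start_p[s] * emit_p[s][obs[0]] for s in states}
    for o in obs[1:]:
        # Sum over all paths into each state, then weight by emission probability
        alpha = {
            s: sum(alpha[prev] * trans_p[prev][s] for prev in states) * emit_p[s][o]
            for s in states
        }
    return sum(alpha.values())

# Toy model: two hidden "phoneme" states emitting two acoustic symbols.
states = ("ph1", "ph2")
start_p = {"ph1": 0.6, "ph2": 0.4}
trans_p = {"ph1": {"ph1": 0.7, "ph2": 0.3},
           "ph2": {"ph1": 0.4, "ph2": 0.6}}
emit_p = {"ph1": {"a": 0.9, "b": 0.1},
          "ph2": {"a": 0.2, "b": 0.8}}

# Higher scores mean the sound sequence is more plausible under the model.
p = forward(("a", "b", "a"), states, start_p, trans_p, emit_p)
```

Because the model deals in probabilities, a noisy or slightly mispronounced input still gets a score rather than an outright rejection – the key advantage over rigid template matching.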

However, these systems were significantly hampered by one serious flaw in most people’s speech: poor enunciation. For the machines to understand them, speakers had to talk unbearably slowly, which made manual writing or dictation to another person more practical. This improved only gradually: in the ‘90s, Dragon’s “NaturallySpeaking” dictation software allowed speakers to talk at a rate of about 100 words per minute. Yet by the mid-‘00s progress had stalled, and demand for speech-to-text programs remained low.

Then came the smartphone. One of the primary constraints on the development of speech recognition technology had been the scarcity of speech data: machines had little material from which to learn what speakers were probably saying. With smartphones, Google and other speech recognition developers gained an abundance of data, and the addition of voice search on computers added further to the trove of sound files machines could analyze. Today, advanced voice-to-text software draws on more than 230 billion words – a massive jump from the original vocabulary of 16.

Voice Tech of the Future

Speech recognition has improved enough to become a useful everyday technology, and the masses are now clamoring for more voice-controlled options on every device. Developers seem to be complying with enthusiasm: Samsung, Apple, Google, and other smartphone and mobile device manufacturers are racing to produce the smoothest speech recognition apps on the market, to help users avoid the labor of typing once and for all.

If voice technology isn’t already ubiquitous, it will be fairly soon. Speech recognition software is becoming increasingly natural and intelligent: it can function in noisy settings, comprehend multiple languages, distinguish different speakers, and respond with lifelike (and customizable) speech of its own. Alongside speech recognition, engineers have worked diligently to build smart networks, so voice may soon become the primary means by which users interact with and change their physical environments: close the blinds, raise the temperature, play a new song, lock the doors, and so on. As processors shrink, powerful wearable tech will begin recognizing and reacting to speech. Even cars, which will soon be autonomous anyway, will likely respond to voice commands rather than manual controls.

Voice is the oldest of humankind’s myriad tools – and arguably the most influential. It should come as no surprise that, after centuries of emphasizing the written word, we are now returning to a natural and easy means of communicating with and influencing the world around us.