After months of speculation, last week Amazon finally announced its entry into the living room with the Fire TV. It arrived with plenty of fanfare, at a price point equal to that of the Apple TV and Roku 3 but with a number of extras, including (moderately) high-performance gaming. From my perspective, the most notable feature was not one that you saw but one that you heard. See, unlike the simple voice commands of electronics past, the Fire TV includes a voice interface that combines complex command interpretation with full speech-to-text capabilities. It can do amazing things like search for all titles featuring Mila Kunis when you simply speak her name.
Voice interfaces this sophisticated have been a long time in the making, in a story that has played out over three distinct phases.
Phase 1: The PC Era
The early days of consumer-grade speech recognition came in the mid-1990s with Dragon NaturallySpeaking. For $695, a fully loaded PC in 1997 could just barely hear what you were saying in a perfect environment and translate roughly 90% of it into text, after a 45-minute training session. As processor speeds increased and voice recognition research advanced, so did the ability of full-blown computers to turn voice into text. But given the ubiquity of the mouse and keyboard in the computer world, voice as an interface never really took off. In the same amount of time or less that you could say “search for all movies directed by Wes Anderson” to your computer with a headset on, you could type “Wes Anderson movies” into a search box and get the same results.
Phase 2: The Mobile Era
By 2011, compute technology and cloud access had advanced such that voice could finally take on a new form factor while maintaining the accuracy needed for consumer adoption. That form factor: the mobile phone. In the interface-light world of a smartphone, voice as an interface made lots of sense. With Apple’s internally developed dual-core A5 processor and massive North Carolina data center, a mobile device finally had enough power and connectivity to handle complex voice tasks, and Siri was born. Siri could listen for a set of keywords, like “call” or “text,” immediately transcribe the words that followed, and pass them to the appropriate application. Texting your mom from the car no longer felt like flirting with death; it felt like magic. Google quickly realized this as well, releasing Google Search (now Google Now) on Android phones using similar high-power processors and cloud connectivity. Even Microsoft is coming to the party with Cortana.
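The dispatch pattern described above, where a recognized keyword routes the rest of the transcription to an app, can be sketched roughly as follows. This is a minimal illustration, not Siri's actual architecture; the handler names and sample utterances are hypothetical.

```python
# Hypothetical sketch of keyword-based voice command dispatch:
# the first recognized keyword selects an application, and the
# remainder of the transcription is handed to it as freeform text.

def send_text(recipient_and_message: str) -> str:
    # Stand-in for handing the text off to a messaging app.
    return f"TEXT -> {recipient_and_message}"

def place_call(recipient: str) -> str:
    # Stand-in for handing the name off to the phone app.
    return f"CALL -> {recipient}"

# Keyword -> application handler
HANDLERS = {
    "text": send_text,
    "call": place_call,
}

def dispatch(transcription: str) -> str:
    """Route a transcribed utterance to the matching handler."""
    keyword, _, remainder = transcription.strip().partition(" ")
    handler = HANDLERS.get(keyword.lower())
    if handler is None:
        return f"Unrecognized command: {keyword!r}"
    return handler(remainder)

print(dispatch("text Mom running ten minutes late"))
# -> TEXT -> Mom running ten minutes late
```

The hard part in 2011 was not this routing logic, of course, but producing an accurate transcription to feed it, which is exactly what the new processors and cloud back ends made feasible.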
Phase 3: The Standalone Era
The latest phase of voice interface advancement arrived just a year ago. Thanks to the accelerating sophistication of processors and low-cost cloud connectivity, even inexpensive hardware devices are now beginning to support advanced voice control and interfaces. Google Glass was one of the first examples of this, allowing users to say “OK Glass” (a set trigger) followed by one of a variety of set commands (like “search” or “email”) followed by any freeform text. At $1,500, however, it was hardly low cost. Soon Microsoft followed suit with the Xbox One at a more reasonable $499. With the Xbox One, users can not only control key features of the system (like tuning TV channels and recording shows) but also interact with games in new ways. Dead Rising 3, for example, requires voice commands to play.
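The Glass-style grammar, a fixed trigger phrase, then one of a small command set, then freeform text, lends itself to a simple parser. The sketch below is an assumption-laden illustration of that three-part structure; the trigger and commands follow the examples above, while the function name and normalization details are my own.

```python
# Hypothetical parser for a "trigger + command + freeform" grammar,
# as in "OK Glass, search Wes Anderson movies".

TRIGGER = "ok glass"
COMMANDS = {"search", "email"}  # small fixed command set, per the examples above

def parse_utterance(utterance: str):
    """Return (command, freeform_text), or None if the trigger or command is absent."""
    text = utterance.lower().strip().replace(",", "")
    if not text.startswith(TRIGGER):
        return None  # without the trigger phrase, the device stays idle
    rest = text[len(TRIGGER):].strip()
    command, _, freeform = rest.partition(" ")
    if command not in COMMANDS:
        return None  # trigger heard, but no recognized command followed
    return command, freeform

print(parse_utterance("OK Glass, search Wes Anderson movies"))
# -> ('search', 'wes anderson movies')
```

Constraining the user to a trigger and a fixed command set is what lets a modest device do this reliably: only the trailing freeform portion needs open-ended speech-to-text, and that work can be shipped off to the cloud.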
The Fire TV brings this sophisticated voice interface to a whole new price point. At $99, complex voice-interface technology is finally within easy reach. Powered by a Qualcomm Snapdragon 600-series processor and wired/wireless internet connectivity to AWS, the Fire TV is the latest innovation bringing the voice interface to a whole new class of devices.
With the Fire TV, the speed and connectivity needed for complex voice has finally hit mainstream, and that’s something that has the potential to change the way we interact with devices all around us.