Speech recognition and artificial intelligence are changing how we interact with everything. Voice as a user interface has been a topic of discussion for years, but we believe the technology is finally accurate enough, with sufficiently low latency, to provide an experience comparable to other forms of interaction. A leading indicator is the mainstream adoption of smart assistants from Apple, Google, Amazon, and others, with smart speaker penetration at 34% in the US. Smartphones are already embedded with algorithms enabling voice-based interaction.
Around half of adult Internet users are now using voice technology in some way – whether it’s through voice assistants on their smartphones, in their vehicles, or in their homes.
The advantages of voice
We think the addition of a voice user interface draws many parallels with prior UI changes, for example, the ways in which our interactions with technology have changed with the addition of touch screens, graphical interfaces, and keyboards. Voice has a few key advantages over these technologies, and we expect voice to become a core user interface, coexisting with existing technologies.
- Productivity: Americans type at an average rate of 40 words per minute, but speak at an average of 150 words per minute – more than three times faster. This means that voice-enabling tasks that would normally require a keyboard can result in huge productivity increases, not to mention the simplification and streamlining of numerous daily tasks.
- Improved user experience: Enabling analytics, for example on customer and sales calls, can significantly improve processes as well as the overall user experience.
- Safety: Enabling hands-free engagement is particularly important in environments like industrial work sites and vehicles, where screen engagement is limited or impossible.
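To make the productivity figures above concrete, here is a rough back-of-the-envelope sketch. It uses only the 40 and 150 words-per-minute averages quoted in the first bullet; the 600-word report length is a made-up example, and the calculation ignores recognition errors and editing time, so treat the result as an upper bound rather than a measured saving.

```python
TYPING_WPM = 40     # average typing rate quoted in the text
SPEAKING_WPM = 150  # average speaking rate quoted in the text


def minutes_needed(words: int, words_per_minute: float) -> float:
    """Minutes required to produce `words` at a steady rate."""
    return words / words_per_minute


def speedup(typing_wpm: float = TYPING_WPM, speaking_wpm: float = SPEAKING_WPM) -> float:
    """How many times faster speaking is than typing."""
    return speaking_wpm / typing_wpm


if __name__ == "__main__":
    report_words = 600  # hypothetical document length, chosen for illustration
    typed = minutes_needed(report_words, TYPING_WPM)
    spoken = minutes_needed(report_words, SPEAKING_WPM)
    print(f"Typed: {typed:.1f} min, spoken: {spoken:.1f} min")
    print(f"Speaking is {speedup():.2f}x faster")
```

Under these assumptions, a 600-word report takes 15 minutes to type and 4 minutes to dictate.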
Spoken versus conversational language
Key speech and natural language processing technologies include: attention detection, a lightweight, locally processed command (e.g., "Alexa" or "Hey Siri") that 'wakes up' the application or technology; speech recognition, which recognizes spoken language and converts it to text; natural-language understanding (NLU) and/or interpretation (NLI), which deals with 'intent' processing or comprehension; and speech synthesis, which is the ability to produce human speech.
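The way these four stages hand off to each other can be sketched in a few lines. This is a toy illustration, not how any real assistant is built: the "audio" is plain text, every function name is hypothetical, and each stage is a stand-in for what would be a trained model in practice.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical wake words, taken from the examples in the text.
WAKE_WORDS = ("alexa", "hey siri")


@dataclass
class Intent:
    name: str
    utterance: str


def detect_wake_word(audio: str) -> Optional[str]:
    """Stage 1 (attention detection): return what follows the wake word, or None."""
    lowered = audio.lower().strip()
    for wake in WAKE_WORDS:
        if lowered.startswith(wake):
            return lowered[len(wake):].lstrip(" ,")
    return None


def recognize_speech(audio: str) -> str:
    """Stage 2 (speech recognition): stand-in that only normalizes whitespace."""
    return " ".join(audio.split())


def understand(text: str) -> Intent:
    """Stage 3 (NLU): toy keyword matching in place of a real intent model."""
    if "weather" in text:
        return Intent("get_weather", text)
    if "timer" in text:
        return Intent("set_timer", text)
    return Intent("unknown", text)


def synthesize(intent: Intent) -> str:
    """Stage 4 (speech synthesis): return the text that would be spoken."""
    replies = {
        "get_weather": "Here is the weather.",
        "set_timer": "Your timer is set.",
    }
    return replies.get(intent.name, "Sorry, I did not understand that.")


def handle(audio: str) -> Optional[str]:
    """Run all four stages; stay silent if no wake word was heard."""
    command = detect_wake_word(audio)
    if command is None:
        return None
    return synthesize(understand(recognize_speech(command)))
```

Calling `handle("Alexa, what is the weather")` passes through all four stages, while `handle("what is the weather")` returns `None` because the first stage never wakes the pipeline.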
Currently, even the most sophisticated smart assistants only understand a small subset of user intent. For example, Amazon Alexa’s skills are, for the most part, manually programmed and solve one specific command or 'intent' at a time. Conversational language is much more complex, and natural language processing is considered a particularly 'AI-hard' problem to solve.
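A minimal sketch of what "manually programmed, one intent at a time" looks like in practice, and why it covers only a small subset of what users say. This is not the Alexa Skills Kit API; the `intent` decorator, the `HANDLERS` registry, and the phrase patterns are all invented for illustration.

```python
import re
from typing import Callable, Dict, Optional

# Hypothetical registry: one hand-written pattern per supported intent.
HANDLERS: Dict[re.Pattern, Callable[[re.Match], str]] = {}


def intent(pattern: str):
    """Register a handler for utterances that match `pattern` exactly."""
    def register(func: Callable[[re.Match], str]) -> Callable[[re.Match], str]:
        HANDLERS[re.compile(pattern, re.IGNORECASE)] = func
        return func
    return register


@intent(r"set a timer for (\d+) minutes?")
def set_timer(match: re.Match) -> str:
    return f"Timer set: {match.group(1)} min."


@intent(r"turn (on|off) the lights")
def switch_lights(match: re.Match) -> str:
    return f"Lights {match.group(1).lower()}."


def dispatch(utterance: str) -> Optional[str]:
    """Return the first matching handler's reply, or None if nothing matches."""
    for pattern, handler in HANDLERS.items():
        match = pattern.fullmatch(utterance.strip())
        if match:
            return handler(match)
    return None
```

The scripted phrasing `dispatch("Set a timer for 10 minutes")` is handled, but a natural paraphrase such as `dispatch("Could you wake me in ten minutes?")` returns `None`. Closing that gap without writing a pattern for every possible phrasing is the hard part of conversational language.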
Nevertheless, deep learning and increased computational power and connectivity have significantly improved voice recognition. In 2017, Google's machine learning word accuracy in US English reached 95%, the threshold for human accuracy. By late 2019, Amazon's Alexa smart assistant had racked up more than 100,000 'skills'. And the number of voice assistants in use is expected to triple over the next few years, to 8 billion by 2023.
Value-added use cases for voice-enabled technology
Despite the hype around smart speakers and smartphone-embedded voice assistants, we believe that a majority of value is in B2B use cases, where significant economic value can be unlocked through voice. The vast majority of consumer use cases remain ‘nice to have’ rather than ‘must have’ and do not necessarily improve our daily lives in a meaningful way. A few examples of what we believe are use cases with strong value propositions include:
- Customer service. A voice interface can provide automatic, conversational responses to a growing subset of customer service calls, such as appointment bookings and common support/helpdesk queries, reducing the need for human interaction to handle repetitive tasks and calls. Conversational transparency can be a powerful tool for discovering new customer-driven product recommendations and improving processes, which is an opportunity that companies like Chorus, Speechmatics, and i2x are addressing.
- Integration of voice with AR/VR solutions. This provides an even deeper level of productivity improvement. One example of this is the Varjo VR display, which is being integrated with voice-assistant technology to provide a totally immersive environment for industrial design and engineering. Another is Realwear, whose flagship product, a head-mounted, wearable, Android-class tablet computer, is safely controlled with voice and thus frees a worker’s hands for dangerous jobs.
- Workflow augmentation. Consider professions like healthcare and field services. These two professions are very different in nature, but both involve diagnosis and treatment/maintenance. A lot of time is spent diagnosing the problem and recording the issue. Combining these workflows saves a considerable amount of time and reduces the risk of human error. Medical professionals currently spend 1-2 hours per day manually entering data into health record systems. Companies like Corti are combining voice with AI, providing a digital assistant that can analyze patient interviews, and provide support for emergency calls. On the consumer side of healthcare, one example of re-engineering an existing application for voice is a feature in the Lifesum Health app that enables Google Assistant users to log meals into the app using voice only (Lifesum is an NGP Capital portfolio company).
Voice – the new future of interaction?
Technology challenges (and opportunities) with voice remain in a number of areas. While training general-purpose speech-recognition solutions requires thousands of hours of data, low-resource languages (e.g., Haitian, Zulu, Assamese) and certain domain-specific applications have significantly lower data requirements, providing an opportunity for applying deep learning techniques to build solutions for these scenarios. Small-footprint devices will need technologies that ensure industry-level accuracy and voice quality in devices with lightweight processors, memory, and power sources.
Conversational UIs have the potential to change the way people interact with technology at home and in businesses, but it will take years before we get to a generalized conversational interface.
In the meantime, a host of companies, large and small, are focusing their efforts on the difficult technical challenges that still need to be overcome, and on the specific needs of those vertical and horizontal applications that would benefit most from a voice-enabled user interface.
We believe that voice, as a user interface, is here to stay, along with touch screens, GUIs, keyboards, and the like. However, we still lack the killer apps. Maybe that is why Alibaba announced last month a further $1.4B investment in its smart speaker ecosystem.