Beginners Guide to Voice Services In the Age of Digital Assistants

October 10, 2017

Leveraging your voice to initiate a command, start a service or even change the channel has been around for years. I remember in my grade school days hearing the voice of my parents or grandparents to fetch them something from the other room, change the channel on the television or even just to be quiet. These days, kids have a lot of distractions, from tablets to noise canceling headphones, rendering those ‘old school’ personal assistants no longer available.

All jokes aside, digital assistants and voice driven interactions are becoming commonplace as voice services increasingly improve. BlueFletch has been fortunate enough to work with clients across an array of industry verticals to explore how voice services will impact their organizations. We’ve helped our clients leverage voice services for more impactful engagement with their customers, allowing them to engage customers in new ways and capturing metadata that would be lost during a normal human interaction. Other clients are increasing the efficiency of their workforce by adding voice services in order to allow a worker to be hands free, keeping attention on critical actions and creating a safer work environment.

A (Recent) History of Voice Services

In recent years, we are yelling less at kids and more at Alexa, Bixby, Cortana, Google and Siri. Every major tech company is working on its own digital assistant that is powered by a set of voice services.

- - Apple released Siri in 2011 and now Siri is apart of most of Apple’s operating systems (i.e. iOS, watchOS, macOS and tvOS)
  - Amazon Alexa was made popular by the Amazon Echo and the Amazon Echo Dot. Fun fact, the name Alexa was chosen due to the fact that it has a hard consonant with the X and therefore could be recognized with higher precision.
  - Microsoft lifted its character from the Halo video game franchise to launch their intelligent personal assistant Cortana.
  - Google Assistant was unveiled just last year. Different from Google Now, the Google Assistant can engage in two-way conversations.

This year has been the year of the voice for BlueFletch. We have been a part of many workshops and brainstorming sessions where voice is the centerpiece of the User’s experience. For those new to the space, I wanted to talk about the three technologies that are key to bringing these voice experiences to life.

Three Parts of Voice Interaction, An Explanation

Voice/Speech Processing

The first part of recognition is to efficiently and cleanly capture one’s voice. Overcoming barriers like speaking style, background noise and tones. Advancements in sound processing, thanks to more powerful devices, has allowed for tremendous improvements in sound processing. Most modern devices today have at least two microphones which allows for real time signal processing and noise canceling.

It is important to first capture a clean audio sample so that it can be processed with a higher degree of accuracy. Also, with the advancement of network speed voice processing can happen in real time. Meaning that as you record, you are sending your voice data directly to the server, which translates into faster interactions with systems.

Voice/Speech Recognition

Speech recognition has been around for quite some time. When we speak our voices produce sound waves. These sound waves are made up from the tones and sounds that are a part of the language we are speaking. The inflections in our voices, e.g. the way you use your bottom lip and tongue to pronounce the word ‘flip’ have patterns that are used to understand what you are saying.

Deep learning has finally made speech recognition accurate enough to be useful in broad everyday scenarios. Combining these speech patterns with patterns in dialect and grammar rules feed algorithms that continue to learn and improve the accuracy of transcriptions. In short, our voices or speech are converted into text/words. Natural Language Services

Natural Language Processing or NLP is a way for computers to analyze, understand, and derive meaning from human language in a smart and useful way. The words that are captured as part of the transcription service need to be understood by the application. To do that, the words are broken down into two parts; intents and entities.

- - - Intents are mean to define the purpose of the phase.
    - Entities provide context. If intents are the ‘what’ of the phrase the entities are the ‘who’ and some times ‘where’.

A simple command such as, “Tell me about today’s weather” is separated into intents and entities. The intent is weather. The entities are today and current location. With this information the application understands to provide today’s weather information at a specific location.

This seems easy but there are a lot of power algorithms that are able to quickly process this information. These algorithms and the data that is available are the differentiators between the available platforms. When you combine the processing/capture, recognition and natural language of one’s voice things can start to happen.

Hopefully without being too technical I was able to explain what is happening behind the scenes when you are yelling at your favorite voice assistant.