Sunday, November 13, 2016

Voice Assisted Web Applications

Building Voice Assisted Web Applications

Speech Recognition and Speech Synthesis are interesting topics in Artificial Intelligence (AI). Twelve years back I used to wonder, "How practical are these solutions going to be?" and "How quickly can audio be processed and understood so that a response can be given?" That is no longer a problem, as hardware throughput is much higher than it was a decade ago. Here comes the age of voice assisted devices and applications.
I started experimenting with Amazon Echo when it launched and tried writing some skills for it. Amazon Echo has its own hardware and software to provide voice assistance. I started thinking about how this could be used in Web Applications, which use a thin client and have no control over the hardware. Then I saw that Google's "Ok Google" service is available on its web page. "How does it do the processing?" "Does it send the audio all the way to the server to recognize it and do the search?" Then I found a clue: this Google service is available only in the Chrome browser and not in any other browser.
On further research into the subject, I got to know that there is a Web Speech API specification by the W3C, still in draft. Google is ahead of the game: Chrome already supports this specification and has a working implementation. Firefox and Safari are working on their Web Speech API implementations, but they are not ready yet. So currently Chrome is the only browser that supports both SpeechRecognition and SpeechSynthesis. Though the specification is not final and the API implementations have bugs, they are good enough to experiment with and to plan on this advancement.

Here, I'm going to show you what I was able to achieve with these APIs and how Web Applications can leverage them. Here is a demo of some features I experimented with:
The examples are available in my github repository for your reference.
You can try the examples yourself with the following URLs:
1. Web Based Voice Assisted Conversation
You could ask the following:
  1. Hey, how are you doing today?
  2. Open google search
  3. What is the time?
  4. What is the date?
  5. Take down my phone number
  6. Take down my address
  7. What is my phone number? 
  8. What is my address?
2. Voice Assisted Web Page
You could do the following:
  1. To fill a form field, say the form field label followed by the value. For example, to fill
    i. phone number, say "phone number 756-567-5676"
    ii. gender, say "gender male" or "gender female"
  2. To switch menus, say "open [menu-name] menu" or "navigate to [menu-name]".
  3. To agree to the Terms and Conditions, say "I agree to the Terms and Conditions".
  4. To submit, say "submit".
  5. To reset or clear the form, say "clear" or "reset".
  6. To read out a selected text, select the text and say "read selected text", "read selected", or "read".

How to Build a Voice Assisted Web Application?

When I tried to use the Web Speech API, I found it raw: SpeechRecognition and SpeechSynthesis are independent APIs, and when I used both of them I faced problems synchronizing and controlling them. There were also some minor bugs, for example: SpeechSynthesis cannot speak longer utterances and crashes on them.

For the above examples I had to write repeated code for handling voice requests and responses. I then tried to generalize and reuse some of that code, and it led me to a general voice assistant controller. I'm calling it VoiceAssistant; it's a controller for handling voice requests and responses in a web page and uses the Web Speech API to do the actual job. It makes building voice assisted Web Applications simple and quick.

VoiceAssistant combines the SpeechRecognition and SpeechSynthesis parts of the Web Speech API to simplify common Web Application usage and acts as a controller. It takes care of the following:
  1. Initializes SpeechRecognition and SpeechSynthesis and synchronizes the calls to both APIs to ensure they don't overlap.
  2. Lets you configure requests and actions, similar to an MVC controller, taking care of matching a request and calling the corresponding action.
  3. Applications just have to provide String-based utterances to speak or to listen for. VoiceAssistant feeds them to the Web Speech API along with other common information and objects.
  4. On load of VoiceAssistant.js, it initializes and creates a global object with the variable name voiceAssistant, which can be used throughout the page to control the voice assistant.
  5. Works around some of the bugs in these APIs, for example the SpeechSynthesis crash on longer utterances.
Let's see how to build Voice Assisted Web Applications.

Hello World with VoiceAssistant

Add VoiceAssistant.js to your web page like any other JavaScript file:
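A minimal sketch of the include; the src path here is illustrative, so point it at wherever you host the file:

<script type="text/javascript" src="js/VoiceAssistant.js"></script>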

On load, VoiceAssistant.js creates a global voiceAssistant object, which can be used across the page. You can configure and customize voiceAssistant by updating config variables, providing callback methods, and setting VoiceRequestHandlers for your application-specific requests.
Following is a simple configuration for a Hello World program:
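A minimal sketch of such a configuration, using the handler described below; the speak() call inside the handler is my assumption for how a response is produced (refer to the Voice Assistant README for the exact response API):

voiceAssistant.configure({
    listenContinuously: true,
    requestHandlers: [
        new VoiceRequestHandler(
            ["Hello", "Hey", "How are you"],
            function (matchGroups) {
                // speak() is assumed here; check the README for the actual response API.
                voiceAssistant.speak("Hello! Happy to hear from you! how are you?");
            }
        )
    ],
    callBackAfterReady: function () {
        console.log("voiceAssistant is ready");
    }
});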
The above piece of code passes configurations to voiceAssistant.configure() to set up voiceAssistant for the current Web Application.
Let's see what these configurations are:
  1. listenContinuously - when set to true, makes voiceAssistant listen continuously.
  2. requestHandlers - a list of VoiceRequestHandlers to handle the user's voice commands or utterances.
  3. callBackAfterReady - a callback method called once the application-specific configuration is complete and the voiceAssistant object is ready for use by the application.
Refer to the Voice Assistant README page for the complete list of available configurations.

VoiceRequestHandler allows you to define a voice request and a callback method to be called on encountering such a request. The VoiceRequestHandler constructor requires two arguments:
utterances - a list of utterances reflecting the same voice request.
action - a callback method to be called on matching one of the utterances in the list.
In the above example, on saying one of the statements "Hello", "Hey", or "How are you", voiceAssistant will respond with the message "Hello! Happy to hear from you! how are you?".
requestHandlers is a list, so you can define a VoiceRequestHandler for each of an application's specific requests and the corresponding action for that request. Let's see how to set up multiple request handlers.
Now we will add another request handler to the above configuration to get the date when the user asks for it:
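A sketch of that handler, to be appended to the requestHandlers list above (the speak() call is assumed, as before):

new VoiceRequestHandler(
    ["What is the date", "Get me the date", "Tell me the date", "date please"],
    function (matchGroups) {
        // speak() is assumed; see the README for the actual response API.
        voiceAssistant.speak("Today is " + new Date().toDateString());
    }
)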
This will make voiceAssistant respond with the date when the user says one of the following:
  1. What is the date
  2. Get me the date
  3. Tell me the date
  4. date please

Defining a VoiceRequestHandler for your requests

VoiceRequestHandler is how you configure VoiceAssistant to react to your application's specific voice request utterances. A VoiceRequestHandler encapsulates an Array of utterances a user can say to convey an intent and a function to act on that intent. Apart from these two, VoiceRequestHandler also encapsulates a default function, match(), which matches the user's speech against the Array of utterances. This function can be overridden by passing your own matcher function as a third argument; however, the default match() is good enough for almost all cases.
new VoiceRequestHandler([utterance1, utterance2, ...], function(matchGroups){}[, function(requestUtterance){}]);
The constructor arguments represent:
  1. An Array of utterances to be matched to identify the request handler for the user's speech utterance.
  2. An action function to be executed to act on the user's intent when one of the utterances matches successfully.
  3. An optional custom request-matcher function to override the default String or RegExp based match() function.
Let's see each of these arguments in detail:

The Array of utterances [utterance1, utterance2, ...] can contain Strings or RegExps.

For example:
[
    "What is 5 plus 2",
    "What is 10 plus 20"
]
OR
[
    /^What is (\d+) plus (\d+)$/i,
    /^What is (\d+) minus (\d+)$/i
]
OR
[
    /^What is (\d+) plus (\d+)$/i,
    "What is infinity plus infinity"
]

The action function function(matchGroups){} will be called with an argument named matchGroups. The matchGroups provide the action with information from the actual utterance the user spoke.

For example:

The utterances configured in the handler are:

[
    /^What is (\d+) plus (\d+)$/i,
    "What is infinity plus infinity"
]

If the user says "What is 10 plus 20", the callback will pass matchGroups to the action function as:

action(["What is 10 plus 20", "10", "20"]);

If the user says "What is infinity plus infinity", the callback will pass matchGroups to the action function as:

action(["What is infinity plus infinity"]);

The default match() function identifies the utterance type, does an exact String match or a RegExp pattern match, and calls the action with matchGroups.
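To make that concrete, here is a minimal sketch of the kind of matching described above; this is illustrative, not the library's actual implementation:

function match(requestUtterance) {
    // this.utterances is the Array configured on the VoiceRequestHandler.
    for (var i = 0; i < this.utterances.length; i++) {
        var utterance = this.utterances[i];
        if (utterance instanceof RegExp) {
            // RegExp pattern match: returns the full match plus capture groups.
            var matchGroups = requestUtterance.match(utterance);
            if (matchGroups) return matchGroups; // e.g. ["What is 10 plus 20", "10", "20"]
        } else if (utterance === requestUtterance) {
            // Exact String match: no capture groups, just the utterance itself.
            return [requestUtterance];
        }
    }
    return null; // no utterance matched
}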

Now let's see how to create an instance of VoiceRequestHandler with the above:
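A sketch of such an instance; as before, the speak() call is my assumption for how the response is produced:

new VoiceRequestHandler(
    [
        /^What is (\d+) plus (\d+)$/i,
        "What is infinity plus infinity"
    ],
    function (matchGroups) {
        if (matchGroups.length === 3) {
            // RegExp match: matchGroups[1] and matchGroups[2] are the captured numbers.
            var sum = parseInt(matchGroups[1], 10) + parseInt(matchGroups[2], 10);
            voiceAssistant.speak("That is " + sum);
        } else {
            // Exact String match for the infinity question.
            voiceAssistant.speak("That is still infinity");
        }
    }
);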
Refer to the Voice Assistant README page for the latest details on the VoiceAssistant API definition.