Friday, 14 September 2018

Connecting IBM Watson Speech services to the public phone system

Many use cases for IBM Watson speech services involve connecting phone calls. This can be tricky, so I decided it might be useful to publish a sample which shows such a connection in action. The application is very simple: it uses Speech to Text (STT) to understand what the caller says and, when the caller pauses, it uses Text to Speech (TTS) to read it back to them. It can easily be used as a starting point for building a more complex application which does more with the received speech.

Flow Diagram


I chose to use the Nexmo service because it is the easiest way to connect a phone call to a websocket. You can visit their documentation site if you want to learn the details of how this works. The short summary is that Nexmo acts as a broker between the phone system and a web application of your choice. You need to provide two URLs that define the interface. Firstly, Nexmo will do a GET on the '/answer' URL every time a call comes in to the number you configure; the way your application handles this request is the key part of the application experience. Secondly, Nexmo will do a POST to your '/events' URL any time anything happens on your number (e.g. a call comes in or the person hangs up); in our case we don't do anything interesting with these events except write them to the log for debugging purposes.
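
To make the division of labour concrete, here is a minimal sketch of the two handlers, based on the Nexmo NCCO documentation rather than an extract from my application (the hostname in the websocket URI is a placeholder):

const express = require('express');
const app = express();
app.use(express.json());

// Nexmo does a GET on /answer when a call arrives; we reply with an NCCO that
// tells it to connect the call audio to a websocket hosted by the application.
app.get('/answer', (req, res) => {
  res.json([{
    action: 'connect',
    endpoint: [{
      type: 'websocket',
      uri: 'wss://my-app.mybluemix.net/socket',
      'content-type': 'audio/l16;rate=16000'
    }]
  }]);
});

// Nexmo POSTs call events here; we just write them to the log.
app.post('/events', (req, res) => {
  console.log('Nexmo event: ' + JSON.stringify(req.body));
  res.sendStatus(200);
});

app.listen(process.env.PORT || 3000);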

To get this working for yourself, the first thing you need to do is deploy my sample application. You can get the code from this Git repository. Before you deploy it to IBM Bluemix, you need to edit the manifest.yml file and choose a unique URL for your instance of the application. You also need to create instances of the IBM Watson STT and TTS services and bind them to your application.

Next you need to configure Nexmo to connect to your application. Log in to the Nexmo website, click on the 'Voice' menu on the left side and then 'Create Application'. This pops up a form where you can enter the details of the /events and /answer URLs for the web application you just deployed. After you fill in this form, you will get a Nexmo application id.


Unfortunately, connecting to the phone system costs money. Nexmo charges different amounts for numbers depending upon which country they are associated with. In my case I bought the number +35315134721, which is a local number in Dublin, Ireland. This costs me €2.50 per month, so I might not leave it live for too long, or I might swap it for a US based number at a reduced cost of US$0.67 per month.

Once you get out your credit card and buy a number, you must tell Nexmo which application id you want to associate with the number. Visit the 'Your numbers' page and enter the details (as you see below).



Having done this, you can now ring the number and see it in action. When we receive a call, we open a websocket interface to the STT service and start echoing all audio received from the phone line to the STT service. Although the TTS service also supports a websocket interface, we don't use it because the chunks of audio data from the TTS service won't necessarily be returned evenly spaced, and so the Nexmo service would produce crackly audio output. Instead we use the REST interface and write the returned audio into a temporary file before streaming it back into the phone call at a smooth rate.

The bulk of the code is in a file named index.js and it is fairly well explained in the comments, but here are a few more explanatory notes:

  • The first 70 lines or so are boilerplate code which should be familiar to anyone who has experience of deploying node.js applications to Bluemix. First we import the required libraries, and then we try to figure out the details of the Watson service instances that we are using. If running on the cloud, these will be parsed from the environment variables. However, if you want to run it locally, you will need a file named vcap-local.json that contains the same information. I have included a file named vcap-local-sample.json in the repository to show you the required structure of the file.
  • Next comes a function named tts_stream which acts as an interface to the TTS service. It takes two parameters: the text to synthesise and the socket on which to play the result. We use the REST interface instead of opening a websocket to the TTS service (like we do with the STT service), because the websocket approach results in crackly audio, as the audio chunks coming back from the TTS service are not evenly spaced. Instead, tts_stream saves the audio to a temporary file and then pipes the file smoothly into the Nexmo socket before deleting the temporary file. This approach introduces a slight delay because we need to wait for the entire response to be synthesised before we start playing. However, the problem is not as bad as you might think, because a 15 second audio response might be sent back in under a second.
  • Next come the two functions which respond to the /events and /answer URLs. As mentioned earlier, the /events handler is very simple because it just echoes the POST data into the log. The /answer function is surprisingly simple too. Firstly it creates a websocket, and then it sends a specially formatted message back to Nexmo to tell it that you want to connect the incoming phone call to the new websocket.
  • The real meat of the code is in the on connect method which we associate with the websocket that we created.
    • The first thing we do is stream audio from a file named greeting.wav which explains to the user what to do. While this message is helpful to the user, it also gives the application some breathing room because it might take some time to initialise the various services and the greeting will stop the user talking before we are ready.
    • Next we create a websocket stt_ws which is connected to the Watson STT service. 
      • As soon as the connection is established, we send a special JSON message to the service to let it know what type of audio we will send and what features we want to enable (see the sketch after this list).
      • When the connection to STT is started, a special first message is sent back saying that the service is ready to receive audio. We use a boolean variable stt_connected to record whether or not this message has been received, because attempting to send audio data to the websocket before it is ready will cause errors.
      • When starting the STT service, we specify that we would like to receive interim results, i.e. results sent while it is transcribing some audio and thinks it knows what was said, but does not yet consider the transcription final (because it might change its mind when it hears what comes next). We do this because we want to speed up responses, but we don't want to echo back a transcription which might later be revised. For this reason we check the value of the final field in the returned JSON and only call the tts_stream function when the results are final.
    • For the Nexmo websocket, we simply say that all input received should automatically be echoed to the STT websocket (once we have received confirmation that the STT service link has been initialised properly).
    • When the Nexmo websocket closes, we also try to close the STT websocket.
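
To make the start message and the interim-results check concrete, here is a short sketch using the ws package and the pre-IAM watsonplatform.net endpoint with a placeholder token; it is not a verbatim extract from index.js:

const WebSocket = require('ws');

// Placeholder: in the real application the token comes from the Watson token service
const token = 'WATSON_AUTH_TOKEN';
const stt_ws = new WebSocket(
  'wss://stream.watsonplatform.net/speech-to-text/api/v1/recognize?watson-token=' + token);

let stt_connected = false;

stt_ws.on('open', () => {
  // Tell the service what audio we will send and that we want interim results
  stt_ws.send(JSON.stringify({
    action: 'start',
    'content-type': 'audio/l16;rate=16000',
    interim_results: true
  }));
});

stt_ws.on('message', (data) => {
  const msg = JSON.parse(data);
  if (msg.state === 'listening') {
    stt_connected = true;   // now it is safe to start streaming audio from the call
  } else if (msg.results && msg.results[0] && msg.results[0].final) {
    const text = msg.results[0].alternatives[0].transcript;
    console.log('Final transcript: ' + text);
    // tts_stream(text, nexmo_socket);   // read the final transcription back to the caller
  }
});
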
To give credit, I should point out that my starting point was this sample from Nexmo. I should also point out that the code is currently only able to deal with one call at a time. It should be possible to solve this problem, but I will leave it as a learning exercise for the reader.

Friday, 7 September 2018

Watson Speech to Text supports language and acoustic customizations, but do you need both?

The Watson Speech to Text service has recently released a feature whereby customers can customize the service so that it works better for their specific domain. These customizations can take the form of a Customized Language Model and/or a Customized Acoustic Model. Customers sometimes get confused by these two different model types and wonder which one they need, or whether they need both.

The quick summary is that a language customization model tells Watson how the words spoken in your domain differ from normal English (or whatever other base language you are using); for example, you might be transcribing speech where the speakers use specialised terminology. An acoustic customization model, on the other hand, tells Watson that the words spoken in your domain might sound quite different from the corpus initially used to train Watson STT; for example, you might have audio samples where the speakers have a strong regional accent.

Depending upon your domain, you may need both types of customization, but let's look at them in more detail first.

Customized Language Models

Customized language models tell Watson what words are likely to occur in your domain. For example, when you specify that you want to use the en-US language, Watson has a very large (but fixed) list of possible words that can be spoken in US English. However, in your domain the users might use a specialised vocabulary. The purpose of the customized language model is to teach Watson how the language in your domain differs from normal English.

The way you build a customized language model is that you provide one or more corpora which are simple text files containing a single utterance per line. It is important to give complete utterances, because accurate speech to text transcription requires that the service knows not only what words might be seen but also in what context the words are likely to occur.
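
As an illustration (a sketch only, assuming the pre-IAM username/password credentials and the watsonplatform.net endpoint, with a made-up model name and corpus file), creating a model and adding a corpus looks roughly like this:

const https = require('https');
const fs = require('fs');

const auth = 'Basic ' + Buffer.from('STT_USERNAME:STT_PASSWORD').toString('base64');
const host = 'stream.watsonplatform.net';

// Step 1: create an empty custom language model based on the US English broadband model
const req = https.request({
  host: host,
  path: '/speech-to-text/api/v1/customizations',
  method: 'POST',
  headers: { 'Authorization': auth, 'Content-Type': 'application/json' }
}, (res) => {
  let data = '';
  res.on('data', (chunk) => data += chunk);
  res.on('end', () => {
    const customization_id = JSON.parse(data).customization_id;
    console.log('Created model ' + customization_id);

    // Step 2: add a corpus file (one utterance per line) to the new model
    const corpusReq = https.request({
      host: host,
      path: '/speech-to-text/api/v1/customizations/' + customization_id + '/corpora/reviews-corpus',
      method: 'POST',
      headers: { 'Authorization': auth, 'Content-Type': 'text/plain' }
    }, (corpusRes) => console.log('Corpus upload status: ' + corpusRes.statusCode));
    fs.createReadStream('reviews-corpus.txt').pipe(corpusReq);
  });
});
req.write(JSON.stringify({
  name: 'movie-reviews-model',
  base_model_name: 'en-US_BroadbandModel',
  description: 'Custom language model for film review transcription'
}));
req.end();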

If you are building a model to be used for transcribing film reviews, your corpus might include words like movie, release, star and blockbuster. These words are already in the standard Watson dictionary, but including them in your model tells Watson that they are more likely to occur in your domain than normal (which increases the chance that they will be recognised).

  • You might also include the word umpa lumpas in your corpus, since people will be discussing them and you need to tell Watson it is a valid word. Since this word is pronounced as it is written, telling Watson that it is a valid word is all you need to do.
  • However, if you are interested in Irish movies, it is likely that people will speak about an actress named Ailbhe or Caoimhe. These common Irish forenames wouldn't be in the Watson dictionary, and it is not enough to tell Watson that they exist. You also need to tell Watson that Ailbhe is pronounced like Alva and Caoimhe is pronounced like Keeva (see the sketch below).
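
A hedged sketch of what that looks like against the customization API (placeholder credentials and customization id):

const https = require('https');

const auth = 'Basic ' + Buffer.from('STT_USERNAME:STT_PASSWORD').toString('base64');
const customization_id = 'YOUR_CUSTOMIZATION_ID';

// Tell Watson that "Ailbhe" is a valid word and that it sounds like "Alva"
const wordReq = https.request({
  host: 'stream.watsonplatform.net',
  path: '/speech-to-text/api/v1/customizations/' + customization_id + '/words/Ailbhe',
  method: 'PUT',
  headers: { 'Authorization': auth, 'Content-Type': 'application/json' }
}, (res) => {
  console.log('Add word status: ' + res.statusCode);

  // Kick off training (in practice, wait until the model status is 'ready' first)
  const trainReq = https.request({
    host: 'stream.watsonplatform.net',
    path: '/speech-to-text/api/v1/customizations/' + customization_id + '/train',
    method: 'POST',
    headers: { 'Authorization': auth }
  }, (trainRes) => console.log('Train status: ' + trainRes.statusCode));
  trainReq.end();
});
wordReq.write(JSON.stringify({ sounds_like: ['Alva'] }));
wordReq.end();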

Building this customized language model is probably a relatively simple job. Nevertheless, it will probably bring about a dramatic reduction in word error rate. If your audio contains examples of people saying words that are not in the standard Watson dictionary, you will never transcribe them properly without a customized language model. In addition, when your speakers say words that are in the Watson dictionary, the language customization model increases the chances of these being properly transcribed.

Many users find that the language customization by itself will meet their needs and there is not necessarily any need to combine it with an acoustic model.

Customized Acoustic Models

Customized acoustic models allow you to tell Watson what words sound like in your domain. For example, if speakers in your region consistently pronounce the word there as if they are saying dare, you might need to build a customized acoustic model to account for this.

At one level, building a customized acoustic model is even easier than building a customized language model. All you need to do is upload between 10 minutes and 50 hours of sample audio which is typical of the type of speech that you will be trying to transcribe with the model you are building, and then train the model.

However, if you read the documentation carefully you will see that it says "you will get especially good results if you train with a language model built from a transcription of the audio". Transcribing 50 hours of speech is a lot of work, so many people ignore this advice. However, I think the advice should read "you are extremely unlikely to get good results unless you train with a language model built from a transcription of the audio". In my experience, training without a language model containing the transcription can very often produce a model whose word error rate (WER) is significantly worse than having no model at all.
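
To make the link between the two model types explicit, here is a sketch (placeholder ids, credentials and file name) of adding an audio resource to an existing acoustic model and then training it against the language model built from the transcriptions:

const https = require('https');
const fs = require('fs');

const auth = 'Basic ' + Buffer.from('STT_USERNAME:STT_PASSWORD').toString('base64');
const acoustic_id = 'YOUR_ACOUSTIC_CUSTOMIZATION_ID';
const language_model_id = 'YOUR_LANGUAGE_CUSTOMIZATION_ID';

// Upload one sample recording (repeat for each file; 10 minutes to 50 hours in total)
const audioReq = https.request({
  host: 'stream.watsonplatform.net',
  path: '/speech-to-text/api/v1/acoustic_customizations/' + acoustic_id + '/audio/sample1',
  method: 'POST',
  headers: { 'Authorization': auth, 'Content-Type': 'audio/wav' }
}, (res) => {
  console.log('Audio upload status: ' + res.statusCode);

  // In a real script you would poll until the audio has been processed; here we
  // simply issue the training request, pointing at the transcription language model
  const trainReq = https.request({
    host: 'stream.watsonplatform.net',
    path: '/speech-to-text/api/v1/acoustic_customizations/' + acoustic_id +
          '/train?custom_language_model_id=' + language_model_id,
    method: 'POST',
    headers: { 'Authorization': auth }
  }, (trainRes) => console.log('Train status: ' + trainRes.statusCode));
  trainReq.end();
});
fs.createReadStream('sample1.wav').pipe(audioReq);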

To understand why this is the case, you need to look a little closer at how acoustic model training works. For illustration purposes, assume that the problem you are trying to solve is that the utterance a-b-c is sometimes being erroneously transcribed as x-y-z.

  • If you train with a language model, Watson will encounter an ambiguous utterance in your training data which it thinks is 70% likely to be x-y-z and 55% likely to be a-b-c. Since your language model doesn't contain x-y-z, it will know that the utterance must be a-b-c, and it will adjust the neural network to make it more likely that this utterance will be transcribed as a-b-c in the future. Hence, the model gets better.
  • On the other hand, if you train without a language model, Watson will encounter the same ambiguous utterance which it thinks is 70% likely to be x-y-z and 55% likely to be a-b-c. Since it has no other information, it will assume that the utterance must be x-y-z. However, the confidence score is not very high, so it will adjust the neural network to make it even more likely that this utterance will be transcribed as x-y-z in the future. Hence, the model gets worse.

Of course, the chance of such an error happening is related to the word error rate. In my experience, users rarely put the effort into building a customized model when the WER is already low. Mostly people build customized models when they are seeing a very high WER, and hence they often see carelessly built acoustic models making the problem even worse.

Another problem people encounter is building an acoustic model from speech which is not typical of their domain. For example, users might be tempted to get a single actor to read out their entire script and to record the samples in a recording studio with a good microphone. When their application goes live, they might have to deal with audio recorded over poor phone lines, in noisy environments, by people with different regional accents.

Summary

Language and acoustic customizations serve different purposes: the first deals with non-standard vocabulary while the second deals with non-standard speech sounds. It is possible that you can build a customized language model very easily and that this may be enough for your domain. An acoustic model can improve your WER even further, but you should be careful to ensure you build a good one. In particular, you should train it with transcribed data rather than just collecting random samples.

Monday, 23 July 2018

Watson Asynchronous Speech to Text

Speech to Text (STT) transcription can take a long time. For this reason the Watson Speech to Text service offers an Asynchronous API where the caller doesn't need to wait around while transcription is happening. Instead the person requesting the transcription provides details of a callback server which should be notified when the transcription is complete.

This programming style is not too difficult once you get used to it, but there are a number of concepts to learn, and one thing that slows people down is that you need a fully functional callback server before you can see any of the other components in action.

In order to help people get started, I decided to write a very basic callback server which interacts with the Watson STT service. It can help you understand the interaction between the various components and can also serve as a starting point for a fully functional callback server.

My callback server is implemented in node.js and, because it is small, all of the code is in a single file called app.js. Like all node.js programs, it starts with a list of the dependencies which we will use:

const express = require('express');
const crypto = require('crypto');
const bp = require("body-parser");
var jsonParser = bp.json()
const app = express();
const port = process.env.PORT || 3000;
Next we define a variable called secret_key. For security reasons, you probably don't want any random person to be able to send notifications to your callback server. Therefore the Watson STT asynchronous API allows you to specify a secret key that should be used to sign all requests from the Watson STT service to your callback server. A secret which is published in a blog post is not really a secret so you should edit this variable to some secret value which is unique to your deployment. If you don't want to use this security feature, just set the variable value to null.

// var secret_key = null
var secret_key = 'my_secret_key';
This callback server doesn't do much other than write messages to the log to help you understand the flow of messages to and from your callback server. The log_request() function does that key task for each request.


// record details of the request in the log (for debugging purposes)
function log_request (request) {
  console.log('verb='+request.method);
  console.log('url='+ request.originalUrl);
  console.log("Query: "+JSON.stringify(request.query));
  console.log("Body: "+JSON.stringify(request.body));
  console.log("Headers: "+JSON.stringify(request.headers));
}
The only thing about this server which is moderately complex is the way it handles signatures. The following function checks whether or not the request contains a valid signature. If the secret_key variable is set to null then no checking is done. When the signature is not valid, it puts a message in the log telling you what the signature should have contained and then throws an error. This behaviour is intended to be helpful for developers debugging interactions, but you would probably want to turn it off for production systems because it would also be helpful to hackers.

// check if the signature is valid
function check_signature(request, in_text) {

  // check the request has a signature if we are configured to expect one
  if (secret_key) {
    var this_signature = request.get('x-callback-signature');
    if (!this_signature) {
      console.log("No signature provided despite the fact that this server expects one");
      throw new Error("No signature provided despite the fact that this server expects one");
    } else {
      console.log("Signature: "+this_signature);
    }

    // Calculate what we think the signature should be to make sure it matches
    var hmac = crypto.createHmac('sha1', secret_key);
    hmac.update(in_text);
    hmac.end();
    var hout = hmac.read();
    var expected_signature = hout.toString('base64');
    console.log("Expected signature: "+expected_signature);

    if (this_signature != expected_signature) {
      var err_str = "Actual signature \""+this_signature+"\" does not match what we expected \""+expected_signature+"\"";
      console.log(err_str);
      throw new Error(err_str);
    }
  } 
}
The server needs to handle POST requests coming from the Watson STT service when the status of any transcription job changes. All we do is log the request for debugging purposes. If the signature matches the body of the POST, we return a status of 200 and respond with OK. Obviously a production server would be expected to do something more useful.

// Handle POST requests with STT job status notification
app.post('/results', jsonParser, (request, response) => {
  log_request (request);
  if (!request.body) {
    var err_text = 'Invalid POST request with no body';
    console.log(err_text);
    response.status(400).send(err_text);
    return;
  }
  check_signature(request, JSON.stringify(request.body));

  // for now just record the event in the log
  console.log('Event id:'+request.body.id+' event:'+request.body.event+' user_token:'+request.body.user_token);

  // The spec is not clear about what we should respond, so we just say OK
  response.type('text/plain');
  response.send("OK");
})
When registering your callback server, Watson STT issues a GET request with a random challenge_string to see if your server is up and running. If the signature on the request matches the content of the challenge_string, we simply echo back the challenge_string to let the Watson server know we are functioning OK. If the signature is wrong, we issue an error response and the registration of the callback server will fail.

// Deal with the initial request checking if this is a valid STT callback URL
app.get('/results', (request, response) => {
  log_request (request);

  if (!request.query.challenge_string) {
    console.log("No challenge_string specified in GET request");
    throw new Error("No challenge_string specified in GET request");
  }

  check_signature(request, request.query.challenge_string);

  response.type('text/plain');
  response.send(request.query.challenge_string);
})
Finally the app starts listening for incoming requests:

app.listen(port, (err) => {
  if (err) {
    return console.log('something bad happened', err);
  }
  console.log(`server is listening on ${port}`);
})
I have an instance of this callback processor running at https://stt-async.eu-gb.mybluemix.net/results, but it is not really any use to you since you won't be able to see the console log messages. You can also download the complete sample code from GitHub and host it either in Bluemix or on the hosting platform of your choice.
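
For completeness, here is a hedged sketch of the client side: registering the callback URL (which triggers the challenge GET described above) and then submitting an asynchronous recognition job that will POST its status to /results. The credentials and audio file are placeholders and, again, this assumes the pre-IAM watsonplatform.net endpoint.

const https = require('https');
const fs = require('fs');
const querystring = require('querystring');

const auth = 'Basic ' + Buffer.from('STT_USERNAME:STT_PASSWORD').toString('base64');
const callback_url = 'https://stt-async.eu-gb.mybluemix.net/results';

// Step 1: register the callback server (Watson will GET it with a challenge_string)
https.request({
  host: 'stream.watsonplatform.net',
  path: '/speech-to-text/api/v1/register_callback?' +
        querystring.stringify({ callback_url: callback_url, user_secret: 'my_secret_key' }),
  method: 'POST',
  headers: { 'Authorization': auth }
}, (res) => {
  console.log('register_callback status: ' + res.statusCode);

  // Step 2: submit an audio file for asynchronous recognition
  const jobReq = https.request({
    host: 'stream.watsonplatform.net',
    path: '/speech-to-text/api/v1/recognitions?' + querystring.stringify({
      callback_url: callback_url,
      events: 'recognitions.completed',
      user_token: 'job-42'
    }),
    method: 'POST',
    headers: { 'Authorization': auth, 'Content-Type': 'audio/wav' }
  }, (jobRes) => console.log('recognitions status: ' + jobRes.statusCode));
  fs.createReadStream('sample.wav').pipe(jobReq);
}).end();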

Thursday, 5 April 2018

Naming Intents

How should you name intents? Here's one way, and an explanation of why.

In this post we described clustering a topic into intents. The naming scheme I used was TopicIntent.

When you go to improve accuracy you will merge and split intents. You tend not to do this across Topics. I find that if you have the topic name in the intent, it is easier to keep your brain in one context when you make these changes.

Cluster Topics

"Happy families are all alike; every unhappy family is unhappy in its own way." the Anna Karenina principle

Some Topics cover a lot, but you don't really care about the individual intents inside them. For example, you might have a Complaints topic that covers all sorts of things people moan about.

No one wants a message back saying "This robot cares that we have lost your bags". A complaint will have to be passed on to a person. If we can tell that person that we have a complaint, they can then decide what to do next. If you do not break the Complaints topic down into intents, though, all sorts of questions will be in one intent. It will deal with damage, delays, queues, lost items, dirty conditions etc. This giant, varied intent will suck in other questions, damaging your overall system accuracy.

A varied topic like complaints is something your chatbot cannot handle by itself, and if you make one giant intent it will damage your overall accuracy. But because complaints tend to be about a few things at once ('The food was terrible and the portions were small'), there is often not one solid intent anyway. By labelling all complaints ComplaintIntent it is possible to ignore the intent part, as getting the topic right is good enough.

In our accuracy tests we can strip the intent part off and say that if we land in Complaint, that is good enough, without creating one giant intent that covers too much and sucks in all other questions.
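
As a small illustration of what I mean (hypothetical intent names, and only a sketch of the scoring logic), the accuracy script can compare just the topic part of the intent name:

// Treat a prediction as correct if the topic prefix of the intent name matches
function topicOf(intentName) {
  // With TopicIntent naming the topic is the leading capitalised word,
  // e.g. "ComplaintLostBag" -> "Complaint"
  const match = intentName.match(/^[A-Z][a-z]+/);
  return match ? match[0] : intentName;
}

function topicLevelCorrect(expectedIntent, predictedIntent) {
  return topicOf(expectedIntent) === topicOf(predictedIntent);
}

console.log(topicLevelCorrect('ComplaintLostBag', 'ComplaintDelay'));  // true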

This issue of big topics particularly happens with Off Topic topics, where questions are out of scope, silly, or just cover large areas that you can't really answer.

There are other ways to label intents. This TopicIntent method is what I use. If you have a different way please mention it in the comments.

Wednesday, 4 April 2018

Clustering Questions Part 2: Intentions

Once you have divided your questions into Topics, the next step is to divide them into Intents. This is how I would find the intents inside a topic.

An Intent is a purpose or goal expressed by a user’s input such as finding contact information or booking a trip.

Imagine you had an airline booking chatbot, and you had these questions in the Booking topic.

There is a dataset of travel questions here; I will take some questions from it and invent some myself.

A booking topic could have

Question | Intent
I'd like to book a trip to Atlantis from Caprica on May 13 | BookTicket
I'd like to book a trip from Chicago to San Diego between Aug 26th and Sept 5th | BookTicket
i wanna go to Kobe whats available? | BookTicket
Can I get information for a trip from Toluca to Paris on August 25th? | BookTicket
I'd like to book a trip to Tel Aviv from Tijuana. I was wondering if there are any packages from August 23rd to 26th | BookTicket
I want to know how far in advance I can book a flight | BookFuture
When do bookings open for 6 months time | BookFuture
I want to get a ticket for my christmas flight home | BookFuture
Can I check my booking? | BookCheck
can i check my booking status | BookCheck
can i check the status of my booking | BookCheck
how do i check the status of my booking | BookCheck
i need to check my booking status | BookCheck
Let me know the status of my Booking | BookCheck
Can I book now pay later | BookPay
How can I pay for a booking? | BookPay

In this topic, the verbs Check and Pay each seem to form an intent. There is one intent about when bookings can happen, and a few unknowns that might make more sense when we have more questions later.

At this stage, realise that you are going to make mistakes and will have to go back over your intents as you learn by doing. I will come back to fixing your intents once you have had a first cut at defining them.

One good way to find intents in a topic is to look for verbs. Unrelated actions tend to have different verbs. In this case the topic Booking is both a verb and a noun, which is common enough. Dual meanings like this can be a nightmare with Entities, but that is another blogpost.

In this topic, something like 'cancel a booking' is likely to be an intention. Here Cancel is the verb and booking is the object of the sentence.

Other clues to the intention are the Lexical Answer Type (LAT), the subject and the object. The LAT is the type of question: Who questions have different types of answers to When questions. In practice, I don't find that you commonly use the LAT to define intentions.

One possible exception to this is definitional questions, where users ask "What is a..." for a domain term to be explained. If more than 5% of your questions are definitional, you may not have collected representative questions, because manufactured questions from non-real users, or from real users forced to ask questions, tend to be definitional. When someone runs out of real questions they will ask 'What is a booking'.

The Subject of the sentence is also rarely useful. Sometimes who is doing an action changes the answer, but usually there is a set scheme to buy, book, cancel etc. and who is doing it doesn't matter.

The Object of the sentence is more often useful. Frequently an intention is a combination of the verb and what it is being done to. Whichever of the two isn't the Topic is usually the intent: Booking might be a topic and the various things you do with a booking would be intents.

In summary, go through each topic. If there are verbs shared across questions, those questions might go together in an intent. But you have to use the domain expert's knowledge of which questions have the same intention; this step cannot be automated.

Tuesday, 3 April 2018

Clustering Questions into Topics

IBM Watson used to claim it took 15 minutes to match up a question with an intent. The technique described here halves that time. Context switching is mentally draining and wastes a lot of time. Concentrating on one part of a job until it is done is much more efficient than switching between tasks.

In a similar way once we have our questions collected the next task is to divide them into topics. Then these topics will be looked at individually.

A topic is a category of types of questions people will ask your chatbot.

In an Airline these might be Checkin, Booking, Airmiles

In an Insurance company Renewal, Claim, Coverage

Before looking at the questions, try to think of 5 topics that might occur in customer questions to your business.

How many topics?

Roughly 20. You might have ten or 30. A rule of thumb used in K Nearest Neighbour classification is that if you have N documents you can expect about Sqrt(N) clusters. This works out at around 44 clusters for 2,000 questions, but you won't have 2,000 questions at this stage; more likely under 1,000.

Can you automate discovering topics?

Yes, you can, using a clustering algorithm such as k-means with the number of clusters given above. But you really should not. You learn a hell of a lot by clustering 500 questions yourself. You will have to read all these questions eventually anyway, so you might as well learn this stuff now.

Process of Marking up topics

Say you have 500 questions in a spreadsheet. What we are trying to do here is mark up a new column 'Topic' that puts each of these questions in a topic.

Go through your 500 questions, looking for the 5 topics you listed above. You may find that what you thought was one cluster is actually two, or that a topic you expected is missing. If I were looking for the clothesReturn topic, I would search for the key words 'return' and 'bring back'. I would look for the obvious words in each of the topics I expect.

Once I had marked up the questions from my list of 500 that matched the obvious clothesReturn keywords, if I found a new question in that topic I would look for the word in it that showed me it was that topic but was not in my original search list. For example:

Can I exchange a jumper I bought yesterday for a new one

I would then search for other uses of 'exchange'. It is a word likely to be used in clothesReturn but one I missed earlier.

If you know the domain, roughly half of the questions will be classified by your obvious keywords.
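
If you want a little tooling support for this keyword pass, a first cut might look like the sketch below (the keyword lists are made up; it is an aid to the manual pass, not a replacement for reading the questions):

// First-pass topic tagging of questions by obvious keywords
const keywordTopics = {
  clothesReturn: ['return', 'bring back', 'exchange'],
  Booking: ['book', 'booking'],
  Checkin: ['check in', 'check-in', 'boarding']
};

function guessTopic(question) {
  const q = question.toLowerCase();
  for (const topic of Object.keys(keywordTopics)) {
    if (keywordTopics[topic].some((kw) => q.includes(kw))) {
      return topic;
    }
  }
  return 'unknown';   // to be reviewed by hand
}

console.log(guessTopic('Can I exchange a jumper I bought yesterday for a new one'));
// -> clothesReturn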

I would read through the remaining questions with my 5 expected topics in my head. If I see something that is obviously a new topic I add that to the topic list.

Feel free to mark 5-10% of questions as unknown. These might make more sense when you have more questions, or they might be part of the long tail, out of scope, or off topic questions that your chatbot will not handle.

What Next

Once you have a spreadsheet with a column marked up with the Topic of each question the next step is to find the intent of each question. But now you are reviewing a series of questions in one topic which makes it much easier to concentrate and work in a batch mode.

I will describe this step of marking up intents in a later blogpost.

Monday, 5 February 2018

An alternative way of training Watson Discovery Service

Watson Discovery Service (WDS) provides an excellent natural language query service. This service works well out of the box, but many users like to improve the results for their particular domain by training the service. In order to train the service to better rank the results of a natural language query, you need to provide it with some sample queries and, for each query, indicate which documents are good results and, equally importantly, which documents would be bad results.

The standard user interface to the training capability allows you to view the potential results in a browser and then click on a button to indicate if the result is good or bad. Clicking on the results is easy for a small sample of queries, but it quickly becomes tedious. For this reason, many users prefer to use the API for the training service which gives additional control and capabilities.

Unfortunately the WDS training service only works well with large amounts of training data, and in many cases it is not feasible to collect this volume of data. Luckily there is an alternative (homegrown) way of training WDS which works significantly better for small amounts of training data. The method (which is known as hinting) is amazingly simple: all you need to do is add a new field to your target documents (e.g. named hints) containing the text of the questions for which you want the document to be selected as an answer. Obviously, when you ask this question (or a similar question), the natural language query engine will select your target document and rank it highly, since it is clearly a good match.
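
For example, a document uploaded to the WDS collection might look something like this (the field name hints is just a convention; any indexed text field will do, and the airline content is made up):

// A document with a "hints" field listing the questions it should be returned for
const document = {
  title: 'Baggage allowance policy',
  text: 'Each passenger may check in one bag weighing up to 23kg...',
  hints: 'How many bags can I bring? What is the baggage allowance? How heavy can my suitcase be?'
};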

This training method is called hinting because you are providing hints to WDS about which questions the document answers. An additional benefit of this method is that it helps find matches where the question and the answer document don't have any words in common, whereas the standard WDS training only affects the ranking of results: if the document you want to be selected is not even in the list of the top 100 answers fetched for the query, the normal training will not help.