Friday 14 September 2018

Connecting IBM Watson Speech services to the public phone system

Many use cases for IBM Watson speech services involve connecting phone calls. This can be tricky, so I decided it might be useful to publish a sample which shows such a connection in action. The application is very simple: it uses Speech to Text (STT) to understand what the caller says, and when the caller pauses it uses Text to Speech (TTS) to read the transcription back to them. This simple application can easily be used as a starting point for building a more complex application which does more with the received speech.

Flow Diagram


I chose to use the Nexmo service because it is the easiest way to connect a phone call to a websocket. You can visit their documentation site if you want to learn the details of how this works. The short summary is that Nexmo acts as a broker between the phone system and a web application of your choice. You need to provide two URLs that define the interface. Firstly, Nexmo will do a GET on the '/answer' URL every time a call comes in to the number you configure; the way your application handles this request is the key part of the application experience. Secondly, Nexmo will do a POST to your '/events' URL any time anything happens on your number (e.g. a call comes in or the caller hangs up); in our case we don't do anything interesting with these events except write them to the log for debugging purposes.
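
For illustration, here is roughly what those two handlers might look like in an Express app. This is a minimal sketch, not the exact code from index.js; in particular the socket path and the audio format are assumptions I am making for the example:

    const express = require('express');
    const app = express();
    app.use(express.json());

    // Nexmo does a GET on /answer when a call arrives; we reply with an
    // NCCO document telling it to connect the call audio to our websocket.
    app.get('/answer', (req, res) => {
      res.json([{
        action: 'connect',
        endpoint: [{
          type: 'websocket',
          uri: 'wss://' + req.hostname + '/socket',  // hypothetical socket path
          'content-type': 'audio/l16;rate=16000'     // 16 kHz linear PCM
        }]
      }]);
    });

    // Nexmo POSTs call events (ringing, answered, completed, ...) here;
    // we just write them to the log for debugging.
    app.post('/events', (req, res) => {
      console.log('event: ' + JSON.stringify(req.body));
      res.sendStatus(200);
    });

    app.listen(process.env.PORT || 3000);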

To get this working for yourself, the first thing you need to do is deploy my sample application. You can get the code from this Git repository. Before you deploy it to IBM Bluemix, you need to edit the manifest.yml file and choose a unique URL for your instance of the application. You also need to create instances of the IBM Watson STT and TTS services and bind them to your application.
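
For reference, a manifest.yml for this kind of Bluemix deployment typically looks something like the following; the name and host values below are placeholders, so substitute your own:

    applications:
    - name: watson-phone-sample      # placeholder application name
      host: my-unique-watson-demo    # pick a unique host for your instance
      memory: 256M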

Next you need to configure Nexmo to connect to your application. Log in to the Nexmo website, click on the 'Voice' menu on the left side and then 'Create Application'. This pops up a form where you can enter the /answer and /events URLs for the web application you just deployed. After you fill in this form, you will get a Nexmo application ID.


Unfortunately, connecting to the phone system costs money. Nexmo charges different amounts for numbers depending upon which country they are associated with. In my case I bought the number +35315134721, which is a local number in Dublin, Ireland. This costs me €2.50 per month, so I might not leave it live for too long, or I might swap it for a US-based number at a reduced cost of US$0.67 per month.

Once you get out your credit card and buy a number, you must tell Nexmo which application ID you want to associate with the number. Visit the 'Your Numbers' page and enter the details (like you see below).



Having done this, you can now ring the number and see it in action. When we receive a call, we open a websocket interface to the STT service and start echoing all audio received from the phone line to the STT service. Although the TTS service also supports a websocket interface, we don't use it, because the chunks of audio data from the TTS service won't necessarily be returned evenly spaced, and feeding them straight to the Nexmo service would produce crackly audio output. Instead we use the REST interface and write the returned audio into a temporary file before streaming it back into the phone call at a smooth rate.
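
To make the idea concrete, here is a rough sketch of that approach, assuming the watson-developer-cloud node library. The temporary file path and the 20 ms frame size are illustrative choices, and the real tts_stream function in index.js differs in detail:

    const fs = require('fs');
    const TextToSpeechV1 = require('watson-developer-cloud/text-to-speech/v1');

    const tts = new TextToSpeechV1({ username: 'user', password: 'pass' }); // from VCAP

    // Synthesise `text` via REST, buffer the audio in a temporary file,
    // then feed it into the Nexmo websocket one 20 ms frame at a time.
    function tts_stream(text, socket) {
      const tmp = '/tmp/tts-' + Date.now() + '.raw'; // hypothetical temp path
      tts.synthesize({ text: text, accept: 'audio/l16;rate=16000' })
        .pipe(fs.createWriteStream(tmp))
        .on('finish', function () {
          const audio = fs.readFileSync(tmp);
          fs.unlinkSync(tmp);
          let pos = 0;
          // 640 bytes = 20 ms of 16 kHz, 16-bit mono PCM
          const timer = setInterval(function () {
            if (pos >= audio.length) return clearInterval(timer);
            socket.send(audio.slice(pos, pos + 640));
            pos += 640;
          }, 20);
        });
    }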

The bulk of the code is in a file named index.js and it is fairly well explained in the comments, but here are a few more explanatory notes:

  • The first 70 lines or so are boilerplate code which should be familiar to anyone who has experience of deploying node.js applications to Bluemix. First we import the required libraries, and then we try to figure out the details of the Watson service instances that we are using. If running on the cloud, this will be parsed from the environment variables. However, if you want to run it locally, you will need a file named vcap-local.json that contains the same information. I have included a file named vcap-local-sample.json in the repository to show you the required structure of the file (a sketch of this lookup appears after this list).
  • Next comes a function named tts_stream which acts as an interface to the TTS service. It takes two parameters: the text to synthesise and the socket on which to play the result. We use the REST interface instead of opening a websocket to the TTS service (like we do with the STT service), because the websocket interface results in crackly audio, as the audio chunks coming back from the TTS service are not evenly spaced. The function saves the audio to a temporary file and then pipes the file smoothly into the Nexmo socket before deleting the temporary file (as in the sketch above). This approach introduces a slight delay, because we need to wait for the entire response to be synthesised before we start playing. However, the problem is not as bad as you might think, because a 15 second audio response might get sent back in under a second.
  • Next come the two functions which respond to the /events and /answer URLs, similar to the handler sketch earlier. As mentioned above, the /events handler is very simple: it just echoes the POST data into the log. The /answer function is surprisingly simple too. It creates a websocket and then sends a specially formatted message back to Nexmo to tell it you want to connect the incoming phone call to the new websocket.
  • The real meat of the code is in the on-connect handler which we associate with the websocket that we created.
    • The first thing we do is stream audio from a file named greeting.wav which explains to the caller what to do. While this message is helpful to the caller, it also gives the application some breathing room: it might take some time to initialise the various services, and the greeting stops the user talking before we are ready.
    • Next we create a websocket stt_ws which is connected to the Watson STT service (a condensed sketch of this whole flow appears after this list).
      • As soon as the connection is established, we send a special JSON message to the service to let it know what type of audio we will send and what features we want to enable.
      • When the connection to STT is started, the service sends back a special first message saying that it is ready to receive audio. We use a boolean variable stt_connected to record whether or not this message has been received, because attempting to send audio data to the websocket before it is ready will cause errors.
      • When starting the STT service, we specify that we would like to receive interim results, i.e. results sent while the service is transcribing some audio and thinks it knows what was said, but does not yet consider the transcription final (because it might change its mind when it hears what comes next). We do this because we want to speed up responses, but we don't want to echo back a transcription which might later be revised. For this reason we check the value of the final variable in the returned JSON and only call the tts_stream function when the results are final.
    • For the Nexmo websocket, we simply say that all input received should automatically be echoed to the STT websocket (once we have received confirmation that the STT service link has been initialised properly).
    • When the Nexmo websocket closes, we also try to close the STT websocket.
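
As mentioned in the first note above, the credential lookup is boilerplate. A simplified sketch of it, assuming the standard Bluemix VCAP_SERVICES layout rather than the exact code from index.js, looks like this:

    // Use VCAP_SERVICES when running on Bluemix; fall back to a local
    // vcap-local.json (see vcap-local-sample.json for the structure).
    let vcap;
    if (process.env.VCAP_SERVICES) {
      vcap = JSON.parse(process.env.VCAP_SERVICES);
    } else {
      vcap = require('./vcap-local.json');
    }
    const sttCreds = vcap['speech_to_text'][0].credentials; // username/password
    const ttsCreds = vcap['text_to_speech'][0].credentials;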
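Pulling the last few notes together, the shape of the STT side is roughly as follows. This is a condensed sketch rather than the actual code: the helper name open_stt is hypothetical, and authentication is reduced to a pre-fetched Watson token:

    const WebSocket = require('ws');

    // Hypothetical helper: wire one phone call to the Watson STT service.
    // `token` is a pre-fetched Watson authentication token.
    function open_stt(token, nexmo_ws) {
      let stt_connected = false;
      const stt_ws = new WebSocket(
        'wss://stream.watsonplatform.net/speech-to-text/api/v1/recognize' +
        '?watson-token=' + token);

      stt_ws.on('open', function () {
        // Describe the audio we will send and ask for interim results.
        stt_ws.send(JSON.stringify({
          action: 'start',
          'content-type': 'audio/l16;rate=16000',
          interim_results: true
        }));
      });

      stt_ws.on('message', function (data) {
        const msg = JSON.parse(data);
        if (msg.state === 'listening') {
          stt_connected = true;          // safe to start sending audio now
        } else if (msg.results && msg.results[0].final) {
          // Only read back transcriptions the service considers final.
          tts_stream(msg.results[0].alternatives[0].transcript, nexmo_ws);
        }
      });

      // Echo phone audio to STT once the link is ready; tidy up on hangup.
      nexmo_ws.on('message', function (audio) {
        if (stt_connected && Buffer.isBuffer(audio)) stt_ws.send(audio);
      });
      nexmo_ws.on('close', function () { stt_ws.close(); });
    }
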
To give credit, I should point out that my starting point was this sample from Nexmo. I should also point out that the code is currently only able to deal with one call at a time. It should be possible to solve this problem, but I will leave it as a learning exercise for the reader.
