Wednesday 20 September 2017

Adding a speech interface to the Watson Conversation Service

The IBM Watson Conversation Service does a great job of providing an interface that closely resembles a conversation with a real human being. However, with the advent of products like the Amazon Echo, Microsoft Cortana and the Google Home, people increasingly prefer to interact with services by speaking rather than typing. Luckily IBM Watson also has Text to Speech and Speech to Text services. In this post we show how to hook these services together to provide a unified speech interface to Watson's capabilities.

In this post we will build upon the existing SpeechToSpeech sample, which captures speech spoken in one language and then leverages Watson's machine translation service to speak it back to you in another language. You can try the application described here on Bluemix, or access the code on GitHub to see how you can customise it and/or deploy it on your own server.

This application has only one page and it is quite simple from the user's point of view.
  • At the top there is some header text introducing the sample and telling users how to use it. 
  • The sample uses some browser audio interfaces that are only available in recent browser versions. If we detect that these features are not present we put up a message telling the user that they need to choose a more modern browser. Hopefully you won't ever see this message.
  • In the original sample there are two drop down selection boxes which allow you to specify the source and target language. We removed these drop downs since they are not relevant to our modified use case.
  • The next block of the UI gives the user a number of different ways to enter speech samples:
    • There is a microphone button which allows you to start capturing audio directly from the microphone. Whatever you say is buffered and then passed directly to the transcription service. While audio is being captured, the button turns red and its icon changes as a visual indication that recording is in progress. When you have finished talking, click the button again to stop audio capture.
    • If you are working in a noisy environment, or if you don't have a good quality microphone, it might be difficult for you to speak clearly to Watson. To help solve this problem we have provided some sample files hosted in the web app. Click one of the sample buttons to play the associated file and use it as input.
    • If you have your own recording, you can click on the file-selection button and choose the audio file that you want to send to the speech-to-text service.
    • Last, but not least, you can drag and drop an audio file onto the page to have it instantly uploaded.
  • The transcribed text is displayed in an input box (so you can see whether Watson is hearing you properly) and sent to either the translation service (in the original version) or the conversation service (in our updated version). If there is a problem with the way your voice is being transcribed, see this previous article on how to improve it.
  • When we get a response from the conversation or translation service, we place the received text in an output text box and we also call the text-to-speech service to read the response aloud, saving you the bother of reading it. A sketch of this round trip, as seen from the browser, appears just after this list.
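To make that flow concrete, here is a minimal sketch of the round trip as seen from the browser, using the /message and /synthesize endpoints and the converse and TTS function names from main.js (described in the next section). It assumes the /message proxy returns the Conversation service's JSON unchanged; the element id and the use of fetch() are illustrative, not the actual code.

// Minimal sketch of the browser-side round trip - not the actual main.js code.
// Assumes the /message proxy returns the Conversation service's JSON unchanged.
let conversationContext = {}; // empty on the first call, then echoed back each time

async function converse(transcribedText) {
  // Send the transcription plus the running context to the /message proxy
  const response = await fetch('/message', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ context: conversationContext, input: { text: transcribedText } })
  });
  const data = await response.json();

  // Keep the returned context so the service can track the state of the conversation
  conversationContext = data.context;

  const replyText = data.output.text.join(' ');
  document.getElementById('responseText').value = replyText; // illustrative element id
  TTS(replyText);
}

function TTS(text) {
  // Ask the /synthesize proxy for a .wav of the reply and play it
  const audio = new Audio('/synthesize?voice=en-US_MichaelVoice&text=' + encodeURIComponent(text));
  audio.play();
}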
I know that you want to understand what is going on under the covers so here is a brief overview:
  • The app.js file is the core of the web application. It implements the connections between the front-end code that runs in the browser and the various Watson services. This involves establishing three back-end REST services. This indirection is needed because you don't want to include your service credentials in the code sent to the browser, and because your browser's cross-site scripting protections will prevent the page from calling the Watson services directly. A simplified sketch of these three routes appears after this list. The services are:
    • /message - this REST service implements the interface to the Watson Conversation service. Every time we have a text utterance transcribed, we do a POST on this URL with a JSON payload like {"context":{...},"input":{"text":"<transcribed_text>"}}. The first time we call the service we specify an empty context {} and in each subsequent call we supply the context object that the server sent back to us the last time. This allows the server to keep track of the state of the conversation.
      Most conversation flows are programmed to give a trite greeting in response to the first message. To get this out of the way, the client code sends an initial blank message when the page loads.
    • /synthesize - this REST service is used to convert the response text into audio. All it does is convert a GET on http://localhost:3000/synthesize?voice=en-US_MichaelVoice&text=Some%20response into a GET on https://watson-api-explorer.mybluemix.net/text-to-speech/api/v1/synthesize?accept=audio%2Fwav&voice=en-US_MichaelVoice&text=Some%20response, which returns a .wav file of the text "Some response" spoken in US English by the voice "Michael".
    • /token - the speech-to-text transcription is an exception to the normal rule that your browser shouldn't connect directly to the Watson service. For performance reasons we chose to use the websocket interface to the speech-to-text service. At page load time, the browser will do a GET on this /token REST service and it will respond with a token that can then be included in the URL used to open the websocket. After this, all sound information captured from the microphone (or read from a sample file) is sent via the websocket directly from the browser to the Watson speech-to-text service.
  • The index.html file is the UI that the user sees. 
    • As well as defining the main UI elements which appear on the page, it also includes main.js, which is the client-side code that handles all interaction in your browser.
    • It also includes the jQuery and Bootstrap libraries, but I won't cover these in detail.
  • You might want to have a closer look at the client side code, which is contained in the file public/js/main.js:
    • The first 260 lines of code are concerned with how to capture audio from the client's microphone (if the user allows it - there are tight controls on when/if browser applications are allowed to capture audio). Some of the complexity of this code is due to the different ways that different browsers deal with audio. Hopefully it will become easier in the future. 
    • Regardless of what quality of audio your computer is capable of capturing, we downsample it to 16-bit mono at 16 kHz, because this is what the speech recognition service expects (a sketch of this downsampling is shown after this list).
    • Next we declare which language model we want to use for speech recognition. We have hardcoded this to a model named "en-GB_BroadbandModel", which is tuned to work with high-fidelity captures of speakers of UK English (sadly there is no language model available for Irish English). However, we have left a few other language models commented out to make it easy for you to change to another language. Consult the Watson documentation for a full list of the language models available.
    • The handleFileUpload function deals with file uploads, whether they happen as a result of explicitly clicking on the "Select File" button or as a result of a drag-and-drop event.
    • The initSocket function manages the interface to the websocket that we use to communicate to/from the speech-to-text service. It declares that the showResult function should be called when a response is received. Since it is not always clear when a speaker is finished talking, the speech-to-text service can return several times, so the msg.results[0].final variable is used to determine whether the current transcription is final. If it is an intermediate result, we just update the resultsText field with what we heard. If it is the final result, the msg.results[0].alternatives[0].transcript variable is taken as the most likely transcription of what the user said and it is passed on to the converse function (see the websocket sketch after this list).
    • The converse function handles sending the detected text to the Watson Conversation Service (WCS) via the /message REST interface described above. When the service gives a response to the question, we pass it to the text-to-speech service via the TTS function and we write it in the response textarea, so it can be read as well as listened to.
  • In addition there are many other files which control the look and feel of the web page, but they won't be described in detail here, e.g.:
    • Style sheets in the /public/css directory
    • Audio sample files in the /public/audio directory
    • Images in the /public/images directory
    • etc.
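For reference, here is a much-simplified sketch of what the three back-end routes in app.js do, written with Express, body-parser and the request module. The Watson URLs, the version date and the environment-variable names are illustrative assumptions; in the real application the credentials come from your Bluemix service instances (via VCAP_SERVICES) and the code is organised differently.

// Simplified sketch of the three proxy routes in app.js - not the actual code.
// URLs, version date and environment variable names are illustrative assumptions.
const express = require('express');
const bodyParser = require('body-parser');
const request = require('request');

const app = express();
app.use(bodyParser.json());

// /message - forward the utterance and the running context to the Conversation workspace
app.post('/message', (req, res) => {
  request.post({
    url: 'https://gateway.watsonplatform.net/conversation/api/v1/workspaces/' +
         process.env.WORKSPACE_ID + '/message?version=2017-05-26',
    auth: { user: process.env.CONVERSATION_USERNAME, pass: process.env.CONVERSATION_PASSWORD },
    json: { context: req.body.context, input: req.body.input }
  }, (err, response, body) => (err ? res.status(500).json(err) : res.json(body)));
});

// /synthesize - turn the response text into audio by proxying to the text-to-speech service
app.get('/synthesize', (req, res) => {
  request.get({
    url: 'https://stream.watsonplatform.net/text-to-speech/api/v1/synthesize',
    qs: { accept: 'audio/wav', voice: req.query.voice, text: req.query.text },
    auth: { user: process.env.TTS_USERNAME, pass: process.env.TTS_PASSWORD }
  }).pipe(res); // stream the .wav straight back to the browser
});

// /token - give the browser a short-lived token so it can open the
// speech-to-text websocket connection directly
app.get('/token', (req, res) => {
  request.get({
    url: 'https://stream.watsonplatform.net/authorization/api/v1/token',
    qs: { url: 'https://stream.watsonplatform.net/speech-to-text/api' },
    auth: { user: process.env.STT_USERNAME, pass: process.env.STT_PASSWORD }
  }).pipe(res);
});

app.listen(process.env.PORT || 3000);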
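The downsampling mentioned above is conceptually simple. Here is a naive sketch of converting captured audio to 16-bit mono at 16 kHz; it picks the nearest sample rather than filtering, which is fine for a speech demo but is not the exact code in main.js (which also has to cope with browser differences).

// Naive sketch of downsampling captured audio to 16-bit mono PCM at 16 kHz.
// 'input' is a Float32Array of samples in the range [-1, 1] from the Web Audio API;
// 'inputRate' is the AudioContext sample rate (typically 44100 or 48000).
// For illustration only - the real main.js is more careful than this.
function downsampleTo16k(input, inputRate) {
  const targetRate = 16000;
  const ratio = inputRate / targetRate;
  const outLength = Math.floor(input.length / ratio);
  const out = new Int16Array(outLength);

  for (let i = 0; i < outLength; i++) {
    // Take the nearest source sample (no low-pass filtering)
    const sample = input[Math.floor(i * ratio)];
    // Clamp to [-1, 1] and scale to a signed 16-bit integer
    const clamped = Math.max(-1, Math.min(1, sample));
    out[i] = clamped < 0 ? clamped * 0x8000 : clamped * 0x7FFF;
  }
  return out; // ready to be sent over the websocket as audio/l16;rate=16000
}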
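Finally, here is a sketch of the websocket side of the transcription, showing how the token from /token, the interim results and the msg.results[0].final check described above fit together. The URL, the start message and the element id are assumptions based on the speech-to-text websocket interface as it was at the time of writing; the real initSocket/showResult code in main.js is structured differently. The converse call at the end ties back to the round-trip sketch shown earlier in this post.

// Sketch of the client-side websocket handling - placeholder wiring, not the real initSocket.
// 'token' comes from a GET on the /token route; 'model' is e.g. 'en-GB_BroadbandModel'.
function initSocket(token, model) {
  const socket = new WebSocket(
    'wss://stream.watsonplatform.net/speech-to-text/api/v1/recognize' +
    '?watson-token=' + token + '&model=' + model
  );

  socket.onopen = () => {
    // Describe the audio we are about to stream and ask for interim results
    socket.send(JSON.stringify({
      action: 'start',
      'content-type': 'audio/l16;rate=16000',
      interim_results: true
    }));
  };

  socket.onmessage = (event) => {
    const msg = JSON.parse(event.data);
    if (!msg.results || msg.results.length === 0) return;

    // Show the current best guess so the user can see what Watson is hearing
    const transcript = msg.results[0].alternatives[0].transcript;
    document.getElementById('resultsText').value = transcript;

    // Only a final transcription gets passed on to the Conversation service
    if (msg.results[0].final) {
      converse(transcript);
    }
  };

  return socket;
}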
Anyone with a knowledge of how web applications work should be able to figure out the rest. If you have any trouble, post your question as a comment on this blog.
At the time of writing, there is an instance of this application running at https://speak-to-watson-app.au-syd.mybluemix.net/ so you can see it running even if you are having trouble with your local deployment. However, I can't guarantee that this instance will stay running due to limits on my personal Bluemix account.
