Friday, 7 September 2018

Watson Speech to Text supports language and acoustic customizations, but do you need both?

The Watson Speech to Text service has recently released a feature whereby customers can customize the service so that it works better for their specific domain. These customizations can be in the form of a Customized Language Model and/or a Customized Acoustic Model. Customers sometimes get confused by these two different model types and wonder which one they need, or whether they need both.

The quick summary is that the language customization model tells Watson how the words spoken in your domain are different from normal English (or whatever other base language you are using). For example, you might be transcribing speech where the speakers use a specialised terminology. On the other hand, the acoustic customization model tells Watson that the words in your domain might be spoken quite differently than they were in the corpus initially used to train Watson STT. For example, you might have audio samples where the speakers have a strong regional accent.

Depending upon your domain, you may need both types of customization, but let's look at them in more detail first.

Customized Language Models

Customized language models tell Watson what words are likely to occur in your domain. For example, when you specify that you want to use the en-US language, Watson will have a very large (but fixed) list of possible words that can be spoken in US English. However, in your domain, the users might use a specialised vocabulary. The purpose of the customized language model is to teach Watson how the language in your domain is different from normal English.

The way you build a customized language model is that you provide one or more corpora which are simple text files containing a single utterance per line. It is important to give complete utterances, because accurate speech to text transcription requires that the service knows not only what words might be seen but also in what context the words are likely to occur.
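
As a minimal sketch of what preparing such a corpus file could look like, the snippet below writes one complete utterance per line to a plain text file. The sample utterances, the file name and the write_corpus helper are all made up for illustration; they are not part of the Watson service or its SDK.

```python
import tempfile
from pathlib import Path


def write_corpus(utterances, path):
    """Write one complete utterance per line to a plain text corpus file."""
    lines = []
    for utterance in utterances:
        # Collapse runs of whitespace (including stray newlines) so that
        # each utterance ends up on exactly one line.
        cleaned = " ".join(utterance.split())
        if cleaned:  # skip entries that are empty after cleaning
            lines.append(cleaned)
    Path(path).write_text("\n".join(lines) + "\n", encoding="utf-8")
    return lines


# Hypothetical film-review utterances, invented for this example.
sample_utterances = [
    "the movie was a summer blockbuster",
    "the star of the film gave  a great performance",
    "   ",
    "the release date was pushed back\nto next year",
]

# Hypothetical output location; any writable path would do.
corpus_path = Path(tempfile.gettempdir()) / "film_reviews_corpus.txt"
corpus_lines = write_corpus(sample_utterances, corpus_path)
```

The point of keeping whole utterances rather than a bare word list is the one made above: the context in which a word appears matters as much as the word itself.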

If you are building a model to be used in transcribing film reviews, your corpus might include words like movie, release, star and blockbuster. These words are already in the Watson standard dictionary, but including them in your model tells Watson that these words are more likely to occur in your domain than normal (which increases the chance that they will be recognised).

  • You might also include the term umpa lumpas in your corpus, since people will be discussing them. Because it is pronounced the way it is written, all you need to do is tell Watson that it is valid.
  • However, if you are interested in Irish movies, it is likely that people will speak about an actress named Ailbhe or Caoimhe. These common Irish forenames wouldn't be in the Watson dictionary, but it is not enough to tell Watson that they exist. You also need to tell Watson that Ailbhe is pronounced like Alva and Caoimhe is pronounced like Keeva.
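
The pronunciation hints for the Irish names could be expressed along these lines. This is only a sketch: the custom_word helper is hypothetical, and the field names word and sounds_like reflect my understanding of the JSON the service accepts for custom words, so check them against the current API reference before relying on them.

```python
import json


def custom_word(word, sounds_like=None):
    """Build one custom-word entry.

    sounds_like is only needed when the spelling does not match
    the pronunciation.
    """
    entry = {"word": word}
    if sounds_like:
        entry["sounds_like"] = list(sounds_like)
    return entry


# Field names are an assumption based on the documented custom words format.
words_payload = {
    "words": [
        custom_word("Ailbhe", ["Alva"]),
        custom_word("Caoimhe", ["Keeva"]),
    ]
}

print(json.dumps(words_payload, indent=2))
```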

Building this customized language model is usually a relatively simple job. Nevertheless, this customization will probably bring about a dramatic reduction in word error rate. If your audio contains examples of people saying words not in the standard Watson dictionary, then you will never transcribe these properly without a customized language model. In addition, when your speakers say words in the Watson dictionary, the language customization model will increase the chances of these being properly transcribed.

Many users find that the language customization by itself will meet their needs and there is not necessarily any need to combine it with an acoustic model.

Customized Acoustic Models

Customized acoustic models allow you to tell Watson what words sound like in your domain. For example, if speakers in your region consistently pronounce the word "there" as if they are saying "dare", you might need to build a customized acoustic model to account for this.

At one level, building a customized acoustic model is even easier than building a customized language model. All you need to do is upload between 10 minutes and 50 hours of sample audio which is typical of the type of speech that you will be trying to transcribe with the model that you are building. And then you train the model.

However, if you read the documentation carefully you will see that it says "you will get especially good results if you train with a language model built from a transcription of the audio". Transcribing 50 hours of speech is a lot of work and so many people ignore this advice. However, I think the advice should read "you are extremely unlikely to get good results unless you train with a language model built from a transcription of the audio". In my experience, training without a language model containing the transcription can very often produce a model whose word error rate (WER) is significantly worse than having no model at all.

To understand why this is the case, you need to look a little closer at how acoustic model training works. For illustration purposes, assume that the problem you are trying to solve is that the utterance a-b-c is sometimes being erroneously transcribed as x-y-z.

  • If you train with a language model, Watson will encounter an ambiguous utterance in your training data which it thinks is 70% likely to be x-y-z and 55% likely to be a-b-c. Since your language model doesn't contain x-y-z it will know that it must be a-b-c and it will make adjustments to the neural network to make it more likely that this utterance will be transcribed as a-b-c in the future. Hence, the model gets better.
  • On the other hand, if you train without a language model, Watson will encounter an ambiguous utterance in your training data which it thinks is 70% likely to be x-y-z and 55% likely to be a-b-c. Since it has no other information, it will assume that it must be x-y-z. However, the confidence score is not very high, so it will make adjustments to the neural network to make it even more likely that this utterance will be transcribed as x-y-z in the future. Hence, the model gets worse.
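
The two cases can be mimicked with a toy example. To be clear, this is not how Watson implements acoustic training internally; it only illustrates how restricting the candidates to what the language model contains changes which transcription gets reinforced. The pick_training_label function and the data are invented for the illustration, using the scores from the example above.

```python
def pick_training_label(hypotheses, allowed=None):
    """Return the highest-scoring hypothesis.

    If `allowed` is given, only hypotheses found in it are considered
    (falling back to all of them if none match).
    """
    candidates = hypotheses
    if allowed is not None:
        filtered = {text: score for text, score in hypotheses.items()
                    if text in allowed}
        if filtered:
            candidates = filtered
    return max(candidates, key=candidates.get)


# Scores for one ambiguous utterance, as in the example above.
hypotheses = {"x-y-z": 0.70, "a-b-c": 0.55}

# Stand-in for a language model built from the transcription:
# it contains a-b-c but not x-y-z.
language_model = {"a-b-c"}

with_lm = pick_training_label(hypotheses, allowed=language_model)
without_lm = pick_training_label(hypotheses)

print(with_lm)     # a-b-c (the correct transcription is reinforced)
print(without_lm)  # x-y-z (the mistake is reinforced)
```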

Of course, the chance of such an error happening is related to the word error rate. In my experience, users rarely put the effort into building a customized model when the WER is low. Mostly people build customized models when they are seeing a very high WER, and hence they often find that a carelessly built acoustic model makes the problem even worse.
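
If you want to check whether a customization has helped or hurt, you can measure WER on a sample you have transcribed by hand. The sketch below uses the standard definition (word-level edit distance divided by the number of words in the reference). The word_error_rate function and the sample sentences are mine, for illustration only, and no text normalisation beyond splitting on whitespace is attempted.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # previous[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1] / len(ref)


# Hypothetical hand transcription versus service output.
reference = "ailbhe stars in the movie"
hypothesis = "alva stars in a movie"
print(word_error_rate(reference, hypothesis))  # 0.4 (2 wrong words out of 5)
```

Comparing this figure before and after training, on audio that was not used for training, is a simple way to spot a model that has made things worse.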

Another problem people encounter is building an acoustic model from speech which is not typical of their domain. For example, a user might be tempted to get a single actor to read out their entire script and get them to record the samples in a recording studio with a good microphone. When their application goes live, they might have to deal with audio recorded on poor phone lines, in a noisy environment, by people with different regional accents.


Language and acoustic customizations serve different purposes - the first deals with non-standard vocabulary while the other deals with non-standard speech sounds. It is possible that you can build a customized language model very easily and this may be enough for your domain. An acoustic model can improve your WER even further, but you should be careful to ensure you build a good one. In particular, you should use transcribed data rather than just collecting random samples.
