Wednesday, 3 July 2019

Watson is starting to sound much more natural

IBM has recently implemented a very significant change in the technology that they use for speech synthesis.

To simplify, the traditional technology involves splitting the training audio into chunks of roughly half a phoneme. When given a snippet of speech to synthesise, the engine picks the most suitable chunks and combines them. Sometimes it gets lucky and finds a large part of the desired speech already in the training corpus, in which case it can generate very realistic output (because it is essentially replaying a recorded sample). More often, however, Watson needs to combine chunks from different utterances in the training data. While there are techniques to fuse the different chunks together as seamlessly as possible, users frequently complain that they can hear a choppiness and that the voice sounds more robotic than human.
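For readers who like to see the idea in code, below is a minimal, illustrative sketch of this kind of unit selection in Python. It is not IBM's implementation, just the general dynamic-programming idea of balancing how well each recorded chunk matches the target against how smoothly adjacent chunks join.

# Illustrative sketch of unit-selection synthesis (not IBM's implementation):
# choose one recorded unit per target half-phone, trading off how well each
# candidate matches the target (target cost) against how smoothly adjacent
# units join (join cost), using dynamic programming.

def unit_selection(targets, candidates, target_cost, join_cost):
    # best[i][j] = (cheapest total cost ending in candidates[i][j], backpointer)
    best = [[(target_cost(targets[0], u), None) for u in candidates[0]]]
    for i in range(1, len(targets)):
        row = []
        for u in candidates[i]:
            tc = target_cost(targets[i], u)
            # cheapest way to arrive at this unit from any previous unit
            prev_cost, prev_j = min(
                (best[i - 1][k][0] + join_cost(candidates[i - 1][k], u), k)
                for k in range(len(candidates[i - 1]))
            )
            row.append((prev_cost + tc, prev_j))
        best.append(row)
    # trace back the lowest-cost path of units
    j = min(range(len(best[-1])), key=lambda k: best[-1][k][0])
    path = []
    for i in range(len(targets) - 1, -1, -1):
        path.append(candidates[i][j])
        j = best[i][j][1]
    return list(reversed(path))

# toy usage: units are plain numbers, costs are absolute differences
picked = unit_selection(
    targets=[1.0, 2.0, 3.0],
    candidates=[[0.9, 1.5], [1.8, 2.6], [2.9, 3.4]],
    target_cost=lambda t, u: abs(t - u),
    join_cost=lambda a, b: abs(b - a - 1.0),
)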

The newly released technology generates the synthesised speech from scratch rather than stitching together recorded chunks of speech. It makes use of three different Deep Neural Networks (DNNs) that look after prosody, acoustic features and voice signal creation. The result is a much more natural-sounding voice. Another advantage is that it is much easier to adapt the engine to a new voice, because the amount of speech required from the voice actor is far smaller (since we no longer need a large corpus to pick samples from).
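To make that three-stage flow concrete, here is a toy Python sketch of the pipeline as I understand it. The stand-in "networks" just produce dummy arrays so the data flow is visible; the shapes, feature choices and function names are my own illustrative assumptions, not IBM's actual architecture.

import numpy as np

def prosody_net(phonemes):
    # stage 1: predict a duration (in frames) and a pitch value per phoneme
    return np.array([[5, 120.0] for _ in phonemes])

def acoustic_net(phonemes, prosody):
    # stage 2: predict acoustic feature frames (a spectrogram-like matrix)
    n_frames = int(prosody[:, 0].sum())
    return np.random.rand(n_frames, 80)

def vocoder_net(features):
    # stage 3: turn the feature frames into raw audio samples
    return np.random.rand(features.shape[0] * 200)

def synthesise(phonemes):
    prosody = prosody_net(phonemes)
    features = acoustic_net(phonemes, prosody)
    return vocoder_net(features)

audio = synthesise(["HH", "AH", "L", "OW"])  # dummy waveform for this toy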

You can read an academic description of the research here and a more end-user-focused description here.

Most users agree that this new technology sounds much better. You can try it out for yourself at the normal Watson TTS demo page here. When you select a voice to use, the ones with the new technology are identified by 'dnn technology' written after their name. I am sure that you will agree that these sound better than the traditional voices (which are still available).
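If you would rather try the new voices programmatically, here is a small sketch using the ibm-watson Python SDK. The API key and service URL depend on your own IBM Cloud instance, and the voice name used here (en-US_AllisonV3Voice) is just one example of the newer DNN voices.

# Minimal sketch: synthesise a sentence with one of the newer voices and save
# it as a WAV file. Credentials and URL below are placeholders for your own.
from ibm_watson import TextToSpeechV1
from ibm_cloud_sdk_core.authenticators import IAMAuthenticator

authenticator = IAMAuthenticator('YOUR_API_KEY')
text_to_speech = TextToSpeechV1(authenticator=authenticator)
text_to_speech.set_service_url(
    'https://api.us-south.text-to-speech.watson.cloud.ibm.com')

with open('hello.wav', 'wb') as audio_file:
    result = text_to_speech.synthesize(
        'Watson is starting to sound much more natural.',
        voice='en-US_AllisonV3Voice',  # a DNN-based voice
        accept='audio/wav'
    ).get_result()
    audio_file.write(result.content)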