Tuesday, 3 April 2018

Clustering Questions into Topics

IBM Watson used to claim it took 15 minutes to match up a question with an intent. The technique described here halves that time. Context switching is mentally draining and wastes a lot of time. Concentrating on one part of a job until it is done is much more efficient than switching between tasks.

In a similar way once we have our questions collected the next task is to divide them into topics. Then these topics will be looked at individually.

A topic is a category of types of questions people will ask your chatbot.

In an Airline these might be Checkin, Booking, Airmiles

In an Insurance company Renewal, Claim, Coverage

Before looking at the questions try think of 5 topics that might occur in customer questions to your business.

How many topics?

Roughly 20. You might have ten or 30. A rule of thumb used in K Nearest Neighbour classification is if you have N documents you expect to have Sqrt(N) clusters. This works out as 44 for 2000 clusters. You won't have 2000 questions at this stage more likely under 1000.

Can you automate discovering topics

Yes you can using a KNN algorithm with the number of clusters given above. No you really should not. You learn a hell of a lot clustering 500 questions. You will have to read all these questions eventually anyway so you might as well learn this stuff now.

Process of Marking up topics

Say you have 500 questions in a spreadsheet. What we are trying to do here is mark up a new column 'Topic' that puts each of these questions in a topic.

Go through your 500 questions. Looking for the 5 topics you listed in question above. You may find that actually what you thought was one cluster is two. Or that a topic you expected is missing If you are looking for the clothesReturn topic I would search for the key words 'return' and 'bring back'. I would look for the obvious words in each of the topics I expect.

Once I had marked up the obvious keywords from my list of 500 that were clothesReturn if I found a new question in that topic I would look for the word it had that showed me it was that topic but was not in my original search list

Can I exchange a jumper I bought yesterday for a new one

I would then search for other uses of 'exchange'. It is a word likely to be used in clothesReturn but one I missed earlier.

If you know the domain roughly half of the questions will be classified by your obvious keywords.

I would read through the remaining questions with my 5 expected topics in my head. If I see something that is obviously a new topic I add that to the topic list.

Feel free with marking 5-10% of questions with unknown. these might make more sense when you have more questions or might be part of the long tail, out of scope or off topic that your chatbot will not handle.

What Next

Once you have a spreadsheet with a column marked up with the Topic of each question the next step is to find the intent of each question. But now you are reviewing a series of questions in one topic which makes it much easier to concentrate and work in a batch mode.

I will describe this step of marking up intents in a later blogpost

No comments:

Post a Comment