Tuesday 16 January 2018

Combining the annotation capabilities of both Watson Knowledge Studio and Watson Discovery Service

Watson Discovery Service (WDS) provides a capability to automatically annotate the documents being ingested. This capability is available in several languages and it is able to recognize a wide range of entity types commonly found in typical texts written in these languages.

Unfortunately, many users of WDS have to deal with documents that are not typical. For example, they could be dealing with medical documents that contain unusual drug and disease names, or with a business domain whose obscure terminology would not be understood by WDS (or indeed by most speakers of the language in question).

Luckily, Watson Knowledge Studio (WKS) can be used to create a language model that understands the specialized terminology of any domain. However, many document collections contain a mixture of specialized terminology and normal text. By default, when users specify that a customized WKS domain model is to be used instead of the generic WDS model, it acts as a replacement and none of the normal entities will be annotated by WDS.

It is not feasible for users to build a complete WKS model that incorporates all of the normal language dictionaries as well as the specialized domain terminology. However, there is a trick that gets WDS to use both the domain-specific annotator from WKS and the generic language annotator from WDS.

Unfortunately, this trick is not possible with the normal WDS UI; it requires the use of the REST API. Hopefully you are already familiar with this and can export your configuration to a JSON file. Assuming that you have configured a number of enrichments for the field named "text", you will see that your configuration contains a fragment that looks something like the following:

  "enrichments": [
    {
      "enrichment": "natural_language_understanding",
      "source_field": "text",
      "destination_field": "enriched_text",
      "options": {
        "features": {
          "keywords": {},
          "entities": {
            "model": "a3398f8b-2282-4fdc-b062-227a162dc0eb"
          },
          "sentiment": {},
          "emotion": {},
          "categories": {},
          "relations": {},
          "concepts": {},
          "semantic_roles": {}
        }
      }
    }
  ],

This fragment means that you have selected a number of different enrichment types to be computed for the text field and the results to be placed in the field named "enriched_text". For most of these enrichments you will use the language model which is provided with the natural language understanding unit that is built into WDS, but for entities it will instead rely upon the WKS model ID "a3398f8b-2282-4fdc-b062-227a162dc0eb".
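To check which enrichments in an exported configuration rely on a custom model, the fragment can be inspected programmatically. The following is a minimal Python sketch, not part of WDS itself: the helper name find_custom_entity_models is hypothetical, and it assumes the exported JSON exposes the list under an "enrichments" key exactly as in the fragment above.

```python
import json


# Hypothetical helper: list the enrichments whose "entities" feature names a
# custom (WKS) model instead of relying on the built-in one.
def find_custom_entity_models(config):
    found = []
    for enrichment in config.get("enrichments", []):
        features = enrichment.get("options", {}).get("features", {})
        model_id = features.get("entities", {}).get("model")
        if model_id:
            found.append((enrichment.get("destination_field"), model_id))
    return found


# A cut-down version of the fragment shown above.
exported = json.loads("""
{
  "enrichments": [
    {
      "enrichment": "natural_language_understanding",
      "source_field": "text",
      "destination_field": "enriched_text",
      "options": {
        "features": {
          "keywords": {},
          "entities": {"model": "a3398f8b-2282-4fdc-b062-227a162dc0eb"}
        }
      }
    }
  ]
}
""")

for field, model_id in find_custom_entity_models(exported):
    print(f"{field}: entities come from custom model {model_id}")
```

Features with an empty options object, such as "keywords" here, are skipped because they carry no "model" key and therefore use the built-in language model.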

If you want to have the core WDS detected entities as well as the WKS detected ones, you need to define an additional enrichment entry in your configuration file that places its results in a differently named field, e.g. wds_enriched_text. The fragment of JSON above needs to be replaced with the fragment below, and then the new configuration should be uploaded via the API.

  "enrichments": [
    {
      "enrichment": "natural_language_understanding",
      "source_field": "text",
      "destination_field": "enriched_text",
      "options": {
        "features": {
          "keywords": {},
          "entities": {
            "model": "a3398f8b-2282-4fdc-b062-227a162dc0eb"
          },
          "sentiment": {},
          "emotion": {},
          "categories": {},
          "relations": {},
          "concepts": {},
          "semantic_roles": {}
        }
      }
    },
    {
      "enrichment": "natural_language_understanding",
      "source_field": "text",
      "destination_field": "wds_enriched_text",
      "options": {
        "features": {
          "entities": {}
        }
      }
    }
  ],
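If you would rather not edit the exported file by hand, the same change can be scripted. The following is a minimal Python sketch under stated assumptions: the helper name add_generic_entities is hypothetical, the configuration is assumed to have been loaded into a dict with an "enrichments" list as in the fragment above, and the upload through the REST API is left out.

```python
import copy


# Hypothetical helper: return a copy of the configuration with one extra
# enrichment that runs the built-in entity annotator (no "model" key) and
# writes its results to a separate field.
def add_generic_entities(config, source_field="text",
                         destination_field="wds_enriched_text"):
    updated = copy.deepcopy(config)  # leave the caller's dict untouched
    enrichments = updated.setdefault("enrichments", [])
    # Do nothing if an enrichment already writes to the destination field.
    if any(e.get("destination_field") == destination_field for e in enrichments):
        return updated
    enrichments.append({
        "enrichment": "natural_language_understanding",
        "source_field": source_field,
        "destination_field": destination_field,
        "options": {"features": {"entities": {}}},
    })
    return updated


config = {
    "enrichments": [
        {
            "enrichment": "natural_language_understanding",
            "source_field": "text",
            "destination_field": "enriched_text",
            "options": {
                "features": {
                    "entities": {"model": "a3398f8b-2282-4fdc-b062-227a162dc0eb"}
                }
            },
        }
    ]
}

new_config = add_generic_entities(config)
print(len(new_config["enrichments"]))
```

Working on a deep copy keeps the originally exported configuration available in case the upload has to be rolled back, and the duplicate check makes the helper safe to run more than once.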

This configuration produces two separate enrichment fields: enriched_text, which contains the entities detected by the WKS model, and wds_enriched_text, which contains the entities detected by WDS. However, it is likely that you want to have all of the detected entities available in a single field. Luckily, this is possible by configuring the collection to merge the two fields during the "Normalize" phase.
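The merge step can be added to an exported configuration in the same scripted way. The following is a minimal Python sketch, assuming that the configuration keeps its normalization rules in a top-level "normalizations" list and that a merge rule is described by an "operation", a "source_field" and a "destination_field"; the helper name add_merge_rule is hypothetical.

```python
import copy


# Hypothetical helper: return a copy of the configuration with a rule that
# merges the generic WDS entities into the field holding the WKS entities.
def add_merge_rule(config,
                   source_field="wds_enriched_text.entities",
                   destination_field="enriched_text.entities"):
    updated = copy.deepcopy(config)  # leave the caller's dict untouched
    rule = {
        "operation": "merge",
        "source_field": source_field,
        "destination_field": destination_field,
    }
    rules = updated.setdefault("normalizations", [])
    if rule not in rules:  # avoid adding the same rule twice
        rules.append(rule)
    return updated


merged_config = add_merge_rule({"enrichments": [], "normalizations": []})
print(merged_config["normalizations"])
```

The sketch only edits the configuration; how WDS combines the two entity lists at ingestion time is decided by the service, so it is worth checking a few ingested documents after uploading the change.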

4 comments:

  1. I meant to mention that the merge rule can be defined either through the API or through the WDS user interface. If you want to define it through the API, this is the JSON fragment you will need:

    "normalizations": [
      {
        "operation": "merge",
        "source_field": "wds_enriched_text.entities",
        "destination_field": "enriched_text.entities"
      }
    ]

  2. Thanks, this helped a lot in filling the conceptual gaps between the two.


    It seems they added this to their official documentation:

    https://console.bluemix.net/docs/services/discovery/integrate-wks.html#integrating-with-watson-knowledge-studio

    I am now trying to figure out how to do the opposite. I wanted WDS to pre-annotate documents in WKS for a type system which happens to have common entities (Company, Person, Location).

    Replies
    1. I can see that what you are asking should be a relatively common use case, but I don't know enough about WKS to advise you. I will consult with a colleague who is more expert in WKS and get back to you.

  3. Thanks for checking.

    I see that they added something very close to that function in the WKS version on Bluemix: they allow NLU and AlchemyLanguage to pre-annotate on common types (e.g. Company, Organization, Location, JobTitle, Person)

    Workspace -> Assets & Tools -> Pre-annotators -> Natural Language Understanding (or Alchemy Language)

    You can map their native entity types to entity types in the WKS model. Each has some 20 entity types (AlchemyLanguage has a few more than NLU), and they did a pretty good job on many of them.

    Unfortunately, it seems one cannot combine pre-annotators: each pre-annotator deletes the previous pre-annotations. Still, a great improvement from what I remember.
