Many developers ask me for additional advice and insights on building and using custom models with Knowledge Studio, so I have curated a collection of them here!
Watson Knowledge Studio [WKS] is used within Watson Developer Cloud by the Natural Language Understanding and Discovery services. Natural Language Understanding provides out-of-the-box annotation using deep learning, but sometimes that is not enough for a company to get the insights it needs. So, what can we do about this? We can build a model of your information, or the company information you need to extract insights from.
This will be your own domain model, which you can deploy for use within Natural Language Understanding and Discovery to leverage machine learning for the mentions and relations in your data. Watson Knowledge Studio is the tool you use to build that domain model. You can try it for free for thirty days, here.
WKS lets you create machine learning annotations and rule-based annotations, and pre-annotate documents using dictionaries to reduce the overall footwork of human annotation.
Customized Natural Language Processing [NLP] powered the original Watson from the DeepQA project, which competed against humans and won on the game show Jeopardy!. Now we want to bring this to your fingertips with offerings such as Watson Knowledge Studio and the Watson Developer Cloud services.
I wanted to write an article collecting best practices for Watson Knowledge Studio beyond those in the documentation. Here is a direct link to the documentation, which explains how to get started with the tool. I relied on it when getting started; it covers the file formats WKS accepts, configuring the service for different human languages, and much more.
Here are the tutorials, which provide example files and the KLUE system for getting up and running. I have referred to them many times, even after becoming familiar with the software. They walk step by step through each section: creating a project, adding documents, building annotators, and so on.
There are also two video series on YouTube. They are dry, but seeing someone step through the process visually was helpful. The first series, “Getting Started,” is hands-on; the second, “Deep Dive,” is more high level, meant to convey what WKS can be used for in a business context and how WKS works overall. If you have a trial of WKS and are just starting out, I would watch the “Getting Started” videos.
This is where the release notes for the tool live. If you are an active user or waiting on an update, I would bookmark this page.
Tips and Tricks
Now, here is a list I wrote out myself and sanitized for using Watson Knowledge Studio as efficiently as possible. This product was made for applying customizable natural language processing quickly, so I hope this helps people understand what they need to do to hit the ground running with it.
Disclaimer: this advice comes from people who have used the tool. Your first point of reference should always be the offering’s documentation. The tips and tricks below should be taken as recommendations drawn from experience with WKS.
- Entity and relation type names can NOT contain spaces. It is best to stick with alphanumeric characters and underscores.
- A successful training run of a machine learning annotator requires at least 2 entity types and at least 2 relation types, each with 2 example mentions in the ground truth.
- If defining coreference chain links, provide at least 2 coreference links.
- Rule of thumb: aim for 50 mentions of each type (entity type or relation type) in the training data.
- Distribute the training data across all possible subtypes and roles for entities to help the system train better.
- Do not create overlapping mentions and assign them to different entity types; this confuses the training of the system.
- When defining the type system and document size, make sure the type system is not so complex, and the documents not so large, that human annotators cannot efficiently follow the guidelines. Keep entity types under 50 and documents to no more than a few paragraphs.
- Get the SMEs up to speed on what is needed from them as quickly as possible when beginning with WKS. Everyone has a different learning curve, and if you are under time constraints, focus on shortening that curve for the people important to the project at hand. As stated earlier in this article, there is a video series to help them understand what WKS is.
- Our SMEs found it easier to first create spreadsheets of entity types and then map keywords from the documents to those entity types.
- Type System creation can take time. Think about what is necessary to be accomplished with the type system.
- Creating snippets of text from documents and uploading them takes more time than uploading an entire document into WKS.
- Be careful about defining multiple domains within one domain model; this can lower the accuracy of the machine learning annotator. For example, if you combine a real estate domain and an airplane domain, the “LOCATION” entity type will likely be represented differently in each.
- More often than not, if an entity type’s definition is similar to another’s, give each a distinct, descriptive name to keep them separate.
- Review documents in WKS after importing them, before assigning them to human annotators. Keep an eye out for bad sentence breaks, and consider pre-processing the files to remove problematic periods and punctuation, for example converting U.S.A. to USA.
- Provide the human annotators a quick reference sheet that summarizes entity types, relations, what they mean, color codes for annotating and the keyboard shortcuts for the annotation process.
- Use the adjudication feature to surface discrepancies between the human annotators’ annotations and tackle sources of confusion, driving up inter-annotator agreement. Ask your annotators to compare their versions of the same documents and summarize where they differ; this opens communication about accuracy.
- A grey check mark does not indicate that the text is selected.
- It is helpful to include each human annotator’s name in the name of their document sets. This makes it easier to keep track of the sets and who is assigned when setting up annotation tasks.
- Plan out the type system as best you can before any annotation starts. The type system will likely change as you go, but some revisions can add time to human annotation, since documents must be revised to match the changes.
- For our project, it was much easier to work on mentions first, then move to relations, and do the coreference chains last.
- We found our groove by taking 50 documents, annotating them for mentions, refining our annotation guidelines, submitting, adjudicating, and then going back to do relations and finally coreferences.
- Set up your keyboard hotkeys for annotating and use them to save time!
- Use existing ontologies to define your starting points in the annotation guidelines.
- Try to utilize the pre-annotation dictionaries as much as possible to save you some time.
- Try to remove symbols from the documents you are going to annotate.
- Annotating abstracts (such as the concluding summary of a scientific research study) rather than entire papers can be very helpful in building your best domain model.
- Use your smaller documents first if at all possible; human annotation takes longer on large documents.
- Avoid trailing commas when you make dictionaries in Excel.
- Annotate consistently. If you annotate an entity in one document but do not annotate that same entity uniformly in other documents, the precision of the machine learning model can suffer.
- Helpful tips from the WKS Release Notes:
“Documents and dictionaries issues:
- If you try to import a large ZIP file that contains UIMA CAS XMI files and experience issues due to network performance, consider splitting your ZIP file into smaller ZIP files and upload the ZIP files one by one.
- If you use a Natural Language Understanding, dictionary, or machine-learning annotator to pre-annotate documents, the annotations appear only in tasks that you create after you run pre-annotation.
- If compound words that include a hyphen are not being pre-annotated, add a surface form that includes spaces around the hyphen. For example, ensure that “pre-jurassic” and “pre - jurassic” are defined in the dictionary before you use the dictionary to pre-annotate the documents.
- If you choose to use the dictionary-based tokenizer with a project, and you notice that some punctuation is causing incorrect sentence breaks, then you can add a dictionary to address the issue. For example, the punctuation in the abbreviation of the word Figure (Fig.) or the company name, Yahoo! might be misinterpreted by the tokenizer as indicating the end of a sentence. As a workaround, you can create a dictionary that includes potentially problematic terms like these, and then use it to pre-annotate the documents. While you cannot adjust sentence breaks using dictionary pre-annotation with the default machine learning-based tokenizer, it is also less likely to misinterpret the punctuation to begin with.”
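The pre-processing tip above (removing problematic periods and stray symbols before import) can be sketched in a few lines of Python. This is a minimal example, not a WKS feature: the abbreviation map and the set of punctuation kept are assumptions you would tailor to your own corpus.

```python
import re

# Illustrative abbreviation map; build your own from the problematic
# patterns you actually see in your documents.
PROBLEM_ABBREVIATIONS = {
    "U.S.A.": "USA",
    "U.K.": "UK",
    "Fig.": "Figure",
}

def preprocess_for_wks(text: str) -> str:
    """Normalize punctuation that can cause bad sentence breaks."""
    for abbrev, replacement in PROBLEM_ABBREVIATIONS.items():
        text = text.replace(abbrev, replacement)
    # Strip stray symbols, keeping letters, digits, whitespace,
    # and basic punctuation.
    return re.sub(r"[^\w\s.,;:!?'\"()\-]", "", text)
```

Run this over your files before importing them, then spot-check the sentence breaks in WKS as suggested above.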
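For the tip about trailing commas in Excel-made dictionaries: a small script can strip the empty trailing cells those commas leave behind before you upload the CSV. The file names and column layout here are illustrative, not the required WKS dictionary format.

```python
import csv

def clean_dictionary_csv(in_path: str, out_path: str) -> None:
    """Drop empty trailing cells (left by trailing commas in Excel
    exports) from each row of a dictionary CSV."""
    with open(in_path, newline="") as src, open(out_path, "w", newline="") as dst:
        writer = csv.writer(dst)
        for row in csv.reader(src):
            # Pop empty cells off the end of the row.
            while row and row[-1].strip() == "":
                row.pop()
            if row:  # skip rows that were entirely empty
                writer.writerow(row)
```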
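For the release-note advice about large ZIP uploads failing on slow networks, here is a sketch that repackages one big ZIP into smaller ZIPs you can upload one by one. The output naming and chunk size are assumptions; adjust them to your situation.

```python
import zipfile
from pathlib import Path

def split_zip(source_zip: str, out_dir: str, files_per_chunk: int = 100) -> list:
    """Repackage a large ZIP of documents into smaller ZIPs."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    chunks = []
    with zipfile.ZipFile(source_zip) as src:
        names = src.namelist()
        for i in range(0, len(names), files_per_chunk):
            chunk_path = out / f"part_{i // files_per_chunk + 1}.zip"
            with zipfile.ZipFile(chunk_path, "w", zipfile.ZIP_DEFLATED) as dst:
                for name in names[i:i + files_per_chunk]:
                    # Copy each member's bytes into the smaller archive.
                    dst.writestr(name, src.read(name))
            chunks.append(chunk_path)
    return chunks
```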
Remember, the best source of practices is the documentation itself. If the recommendations above do not fit what you are attempting to accomplish, go with your gut and the documentation before anyone’s opinions based on their experiences with the tool.
This is the DeveloperWorks link for Watson Knowledge Studio. The experts are on there and answer questions.
Are there any application examples where WKS is being utilized?
Yes, the Voice-Of-A-Customer application on our Watson Developer Cloud GitHub utilizes WKS.
Remember to have fun! ❤ Julia