Good Categories
The FAQ dataset is divided into categories, each with a set of questions attached to it. Our NLP engine compares every incoming question with all questions in the dataset. When a similarity is detected, the incoming question is matched to the appropriate category, and the chatbot gives the user that category’s answer.
For this process to function as well as possible, it is important to have a clean FAQ dataset with well-defined categories. The categories and the questions within them should not overlap. This means very similar questions should not belong to different categories.
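A quick way to spot overlaps before uploading a dataset is to compare the questions of every pair of categories. The sketch below is only an illustration: it assumes the dataset is available as a plain mapping from category name to a list of questions (a hypothetical format), and it uses simple string similarity from Python's standard library as a stand-in, not the similarity measure of the NLP engine itself. The threshold of 0.85 is likewise an arbitrary assumption.

```python
from difflib import SequenceMatcher
from itertools import combinations


def normalize(question):
    # Lowercase and drop punctuation so trivial differences are ignored.
    kept = (ch for ch in question.lower() if ch.isalnum() or ch.isspace())
    return "".join(kept).strip()


def find_overlaps(dataset, threshold=0.85):
    """Return (category_a, question_a, category_b, question_b) tuples for
    questions in different categories whose string similarity is at or
    above the threshold (assumed value, tune it for your data)."""
    overlaps = []
    for (cat_a, qs_a), (cat_b, qs_b) in combinations(dataset.items(), 2):
        for qa in qs_a:
            for qb in qs_b:
                ratio = SequenceMatcher(None, normalize(qa), normalize(qb)).ratio()
                if ratio >= threshold:
                    overlaps.append((cat_a, qa, cat_b, qb))
    return overlaps


# Hypothetical example data: the same question sits in two categories.
faq = {
    "Parking": ["Where can I park?", "Is parking free?"],
    "Office": ["Where can I park", "Where is the office located?"],
}
overlaps = find_overlaps(faq)
```

Each reported pair is a candidate for moving one of the questions, or for merging the two categories.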
How to create good categories?
Do not create overlaps.
The questions in two different categories should be clearly distinct from one another. This means avoiding the same question appearing in two different categories.
Enrich your categories.
If possible, add at least 15 questions to each category to avoid weak, poorly performing categories.
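The 15-question guideline is easy to check automatically. The following minimal sketch assumes, again hypothetically, that the dataset is a mapping from category name to a list of questions. It lists every category that falls below the recommended minimum.

```python
MIN_QUESTIONS = 15  # minimum per category recommended in this article


def thin_categories(dataset, minimum=MIN_QUESTIONS):
    """Return {category: question_count} for categories below the minimum."""
    return {
        category: len(questions)
        for category, questions in dataset.items()
        if len(questions) < minimum
    }


# Hypothetical example data with placeholder questions.
faq = {
    "Parking": [f"parking question {i}" for i in range(20)],
    "Salary general": [f"salary question {i}" for i in range(4)],
}
too_small = thin_categories(faq)
```

Categories returned by this check are candidates for adding more questions or for merging into a related category.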
Have fewer categories with more questions in them.
It is useful to combine all questions on a topic into one category with a well-rounded answer, instead of spreading one topic across several smaller categories. For example, combine all parking questions into one Parking category (instead of having three categories: Parking allowed, Parking costs, Parking limit).
Keep in mind that with bigger topics it can make sense to split them into several categories, for example:
Benefits: Can be divided into Benefits Health insurance, Benefits general, Benefits internship etc.
Salary: Can be divided into Salary general, Salary internship, Salary dual studies etc.
It is important in this case to have the category-defining keywords (like internship, insurance etc.) in the questions. That way the categories stay clean and distinct, as in the examples below:
“Do you offer healthcare?” → Benefits Health insurance category
“What benefits do you have?” → Benefits general category
“What benefits will I have as an intern?” → Benefits internship category
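To verify that the questions in a split category actually contain its defining keywords, a small check like the one below can help. It is a sketch under stated assumptions: the dataset format (category name mapped to a list of questions) and the keyword lists are hypothetical, and keywords are matched as plain lowercase substrings, so "health" also matches "healthcare".

```python
def missing_keywords(dataset, keywords):
    """Return {category: [questions]} for questions that contain none of
    the keywords defined for their category. Categories without an entry
    in `keywords` (for example general ones) are not checked."""
    flagged = {}
    for category, words in keywords.items():
        for question in dataset.get(category, []):
            text = question.lower()
            if not any(word in text for word in words):
                flagged.setdefault(category, []).append(question)
    return flagged


# Hypothetical example data and keyword lists.
faq = {
    "Benefits Health insurance": ["Do you offer healthcare?"],
    "Benefits general": ["What benefits do you have?"],
    "Benefits internship": [
        "What benefits will I have as an intern?",
        "What perks do you offer?",
    ],
}
category_keywords = {
    "Benefits Health insurance": ["health", "insurance"],
    "Benefits internship": ["intern"],
}
flagged = missing_keywords(faq, category_keywords)
```

A flagged question such as "What perks do you offer?" probably belongs in the general category, or should be rephrased to include the keyword.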
Also keep in mind that for job-specific answers you can use contexts instead of creating separate categories.
Learn more about creating good datasets in the next articles: