Recently I got this question via my chatbot:
How to get some sample training data for Dialogflow?
This is actually an important question.
The article below assumes you are building an FAQ style bot. If you are building a different kind of bot, please leave a comment below and I will write a different article based on your use case.
Now, the obvious first thing to ask is: what does your chatbot replace? Is it acting as the first step for your customer support? Then just look at your existing customer conversation logs. If you don't store them in a structured way, just analyze them manually.
Sometimes it isn't possible to do that, because you are trying to create something new. That's the scenario this article covers.
Step 1 First create a pre-NLU bot
Pre-NLU bots are rules-based bots driven by conditional logic. I used to call them "dumb bots", but now I feel there is a LOT that NLU bots can learn from them.
I use a service called collect.chat and have found it to be pretty good so far, but you can use whichever service works for you. There are plenty of alternatives, and a simple Google search should turn them up.
You can ask a broad question such as "Do you have any questions for us?", and just use a text field to collect the responses. Do this for a few weeks, until you have a reasonable number of responses (say 100 total).
Do you not get enough traffic to collect 100 responses in a month? Then it might be a good idea to ask yourself exactly what you are trying to automate. If your bot ends up servicing fewer than 3 requests a day, you could probably install a Drift live chat widget and answer questions live when you are online. When you are away, you can just collect the user's question and email address and reply later. I see many solo founders of SaaS (software as a service) businesses do this already.
Interestingly, adding a pre-NLU bot is a pretty good way to find out if you really need a bot to answer customer questions. Put another way, would your visitors actually ask your bot questions if you had one? The pre-NLU bot is an excellent way to do a trial run without wasting your time and resources.
Step 2 Figure out the top 5-10 intents and their training phrases
Dump all the questions you received via your pre-NLU bot into a spreadsheet, and look for patterns in them.
Add a column to your spreadsheet called Category, and assign exactly one category to each question. I would really recommend Airtable for this because it has selectable categories (buttons), which makes this exercise a lot easier.
Once you have identified the 5-10 most frequent categories, turn them into Dialogflow intents.
- The category name (or a variant) becomes the name of the intent.
- All questions which fall under that category become its training phrases.
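If your spreadsheet is large, the grouping can be done with a few lines of Python. This is only a rough sketch: the `rows` data, the category names, and the `build_intents` helper are all made up for illustration, and it assumes you have exported your sheet as (question, category) pairs. It does not talk to Dialogflow at all; it just tells you which intents to create and which phrases go under each one.

```python
from collections import Counter, defaultdict

# Hypothetical rows exported from the spreadsheet: (question, category).
rows = [
    ("How much does the pro plan cost?", "Pricing"),
    ("Is there a free trial?", "Pricing"),
    ("Do you have a monthly plan?", "Pricing"),
    ("How do I reset my password?", "Account"),
    ("Can I change my email address?", "Account"),
    ("Do you integrate with Slack?", "Integrations"),
]


def build_intents(rows, top_n=10):
    """Group questions by category and keep only the top_n most frequent categories.

    Returns a dict mapping intent name -> list of training phrases,
    ordered from the most to the least frequent category.
    """
    counts = Counter(category for _, category in rows)
    phrases = defaultdict(list)
    for question, category in rows:
        phrases[category].append(question)
    return {category: phrases[category] for category, _ in counts.most_common(top_n)}


# Keep only the 2 most frequent categories in this tiny example.
intents = build_intents(rows, top_n=2)
for name, training_phrases in intents.items():
    print(name, len(training_phrases))
```

With real data you would set `top_n` somewhere between 5 and 10 and then create the intents by hand in the Dialogflow console.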
Now publish this agent to your website, but be sure to mention that it is still in BETA and under development.
Also, remember to add a Default Fallback Intent to your chatbot (it is already there when you first create your agent, so you just need to make sure you don't delete it).
Step 3 Look at the phrases in the Training tab
After a few more weeks (say, once your NLU bot has received about 100 responses), take a look at the Training tab. The Default Fallback Intent will have been assigned to all the questions your bot "missed". You will be able to take some of these missed questions and simply add them to the appropriate existing intents. In other cases, you might notice that you don't yet have an intent to answer the user's question, in which case you can add a new intent to your Dialogflow agent.
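If you prefer to do this review outside the console, the same triage can be sketched in Python. Everything here is an assumption for illustration: the `history` list stands in for phrases you have copied by hand from the Training tab, `decisions` records your own manual judgment for each missed phrase, and `triage` is a made-up helper, not part of any Dialogflow library.

```python
from collections import defaultdict

FALLBACK = "Default Fallback Intent"

# Hypothetical entries copied by hand from the Training tab: (user phrase, matched intent).
history = [
    ("what's the price", "Pricing"),
    ("do you offer student discounts", FALLBACK),
    ("can I export my data", FALLBACK),
    ("forgot my password", "Account"),
    ("asdf", FALLBACK),
]

# Your manual decision for each missed phrase: the intent it should belong to.
decisions = {
    "do you offer student discounts": "Pricing",
    "can I export my data": "Data Export",
}


def triage(history, decisions, existing_intents):
    """Split missed phrases into: add to existing intent, create new intent, or unreviewed."""
    add_to_existing = defaultdict(list)
    new_intents = defaultdict(list)
    unreviewed = []
    for phrase, matched in history:
        if matched != FALLBACK:
            continue  # the bot handled this one, nothing to do
        target = decisions.get(phrase)
        if target is None:
            unreviewed.append(phrase)
        elif target in existing_intents:
            add_to_existing[target].append(phrase)
        else:
            new_intents[target].append(phrase)
    return dict(add_to_existing), dict(new_intents), unreviewed


existing, new, todo = triage(history, decisions, existing_intents={"Pricing", "Account"})
print("Add to existing intents:", existing)
print("New intents to create:", new)
print("Still to review:", todo)
```

The point of keeping an "unreviewed" bucket is that some missed phrases are just noise, and you don't want to force those into an intent.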
In some domains, it also makes sense to enter your topic keyword into a tool like AnswerThePublic and generate questions and phrases from it. This may not apply to your use case, but if it does, it will make the process a lot less manual.
So that is how I would go about generating training phrases for my website bot.
Note: Links to Collect.chat and Airtable are affiliate links, but I have only added them because I have personally used them and think they are among the best-in-class in their respective categories.