Many clients and students have asked me if there is any way they can start from a list of sentences and "automatically" convert it into a Dialogflow agent.
I have created a course which can help you do that.
The course also includes a Python script which implements this concept. It is a very practical course as you can actually use this script on your existing sentences.txt file and generate a CSV file which can then be used within my CSV Importer tool. In other words, you will be able to start with nothing more than a list of sentences in your text file and get an actual Dialogflow agent ZIP file at the end of the process.
The idea for this approach was inspired by Google's recent article on assessing the quality of your intent training phrases.
How it works
At a very high level, this is how the script works:
First, we will create a sentence vector for each sentence in the text file using a word embedding (I use Numberbatch).
After that, we will create a graph G and add nodes to the graph corresponding to each sentence.
Then we will calculate the pairwise cosine similarity between all the sentence vectors. If the cosine similarity is greater than a certain threshold, it means the sentences are very similar to each other. In that case, we will add an edge between those respective nodes in the graph.
Finally, we will find maximal cliques in this graph. The maximal clique represents an interesting concept in Dialogflow because all nodes in a clique are connected to each other (that is the definition of a clique). This means every sentence (represented by a node in the clique) is very similar to all the other sentences in the clique, which is precisely what we want when defining Dialogflow intents.
We construct intents based on the clique membership of various nodes.
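Here is a minimal sketch of the steps above, written with only the Python standard library. It is not the course's script.py. The toy word vectors, the function names, the averaging of word vectors into a sentence vector, and the rule that only cliques with two or more sentences become intents are all assumptions made for illustration. A real run would load Numberbatch vectors instead of the hand-made dictionary.

```python
import math
from itertools import combinations

# Toy 3-dimensional word vectors standing in for a real embedding such as
# Numberbatch. The words and numbers are invented for illustration only.
TOY_VECTORS = {
    "reset": [0.9, 0.1, 0.0], "change": [0.8, 0.2, 0.1],
    "password": [0.7, 0.3, 0.0], "my": [0.1, 0.1, 0.1],
    "track": [0.0, 0.9, 0.2], "order": [0.1, 0.8, 0.3],
    "where": [0.0, 0.7, 0.4], "is": [0.1, 0.1, 0.1],
    "hello": [0.0, 0.1, 0.9],
}

def sentence_vector(sentence):
    """Average the vectors of the known words (an assumed pooling choice)."""
    vectors = [TOY_VECTORS[w] for w in sentence.lower().split() if w in TOY_VECTORS]
    if not vectors:
        return None
    return [sum(column) / len(vectors) for column in zip(*vectors)]

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def build_graph(sentences, threshold):
    """One node per sentence; an edge when cosine similarity exceeds threshold."""
    vectors = [sentence_vector(s) for s in sentences]
    graph = {i: set() for i in range(len(sentences))}
    for i, j in combinations(range(len(sentences)), 2):
        if vectors[i] is None or vectors[j] is None:
            continue
        if cosine_similarity(vectors[i], vectors[j]) > threshold:
            graph[i].add(j)
            graph[j].add(i)
    return graph

def maximal_cliques(graph):
    """Bron-Kerbosch algorithm without pivoting; returns lists of node ids."""
    cliques = []

    def expand(r, p, x):
        if not p and not x:
            cliques.append(sorted(r))
            return
        for v in list(p):
            expand(r | {v}, p & graph[v], x & graph[v])
            p = p - {v}
            x = x | {v}

    expand(set(), set(graph), set())
    return cliques

def group_into_intents(sentences, threshold, min_size=2):
    """Treat every maximal clique with at least min_size nodes as one intent."""
    graph = build_graph(sentences, threshold)
    cliques = sorted(c for c in maximal_cliques(graph) if len(c) >= min_size)
    matched = {i for c in cliques for i in c}
    intents = [[sentences[i] for i in c] for c in cliques]
    unmatched = [s for i, s in enumerate(sentences) if i not in matched]
    return intents, unmatched

if __name__ == "__main__":
    demo = ["reset my password", "change my password",
            "track my order", "where is my order", "hello"]
    intents, unmatched = group_into_intents(demo, threshold=0.9)
    print(intents)
    print(unmatched)
```

The threshold of 0.9 is arbitrary and only suits the toy vectors. If you prefer a graph library, NetworkX offers `find_cliques` for enumerating maximal cliques, which would replace the hand-written `maximal_cliques` function here.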
Note that this is really a Pareto strategy which quite literally gives you about an 80% solution. That is, when you use this idea, about 20% of the sentences will not belong to ANY clique and will not be part of any intent. I provide some suggestions for handling these leftover sentences later in the article.
I have created a partial Python script based on the actual code and released it as sample.py (see the code here).
You can use this script to check if this approach works well for your use case.
The sample script takes a sentences text file and provides you with another text file which gives you the classification into various intents.
To see an example, check out this lesson in my course.
The full script is called script.py, which you can access when you purchase the full course.
Here is what you get in addition to the stuff already in sample.py:
First, you will automatically get a CSV file which you can use within my CSV Importer tool.
In addition, you will get an ExclusionScore for each sentence - the higher this score, the less similar this sentence is to the other sentences in that intent.
Lastly, you will also get an InclusionScore for each sentence which didn't belong in any clique. The higher this score, the more similar this sentence is to one of the existing sentences in the intent.
By combining the ExclusionScore and InclusionScore, you should be able to define a pretty good Dialogflow agent which covers a large percentage of sentences within your text file.
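To make the two scores more concrete, here is one hypothetical way such scores could be computed. The formulas in the full script are not described in this article, so everything below is an assumption: the exclusion score is taken as one minus the mean cosine similarity to the other sentences in the intent, and the inclusion score is the highest cosine similarity to any sentence in an intent. The function names are made up for this sketch.

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def exclusion_score(index, members, vectors):
    """Assumed formula: 1 - mean similarity to the other members of the intent.

    A higher value means the sentence is less similar to the rest of the intent.
    """
    others = [m for m in members if m != index]
    if not others:
        return 0.0
    mean_similarity = sum(
        cosine_similarity(vectors[index], vectors[m]) for m in others
    ) / len(others)
    return 1.0 - mean_similarity

def best_intent(vector, intents, vectors):
    """Assumed formula: inclusion score = best similarity to any intent member.

    Returns (intent_id, inclusion_score) for the closest intent.
    """
    best_id, best_score = None, -1.0
    for intent_id, members in intents.items():
        score = max(cosine_similarity(vector, vectors[m]) for m in members)
        if score > best_score:
            best_id, best_score = intent_id, score
    return best_id, best_score
```

With scores like these, an unmatched sentence would be offered to the intent returned by `best_intent`, and a matched sentence with a high `exclusion_score` would be flagged for review.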
I got my sample sentences using a tool called Ubersuggest, which gives you a list of keywords which are being used by people to search for a particular topic on Google.
Why did I choose this tool?
There are a few reasons:
- you get access to shortish sentences which are more typical of what people type into chatbots
- it is already like an FAQ (note that the keywords are sorted by volume in descending order)
- the typical training phrase inside a Dialogflow intent is only a few words long
Here is an example of how the script works.
I took a text file which looks like this:
As you can already see, it is based on the CSV file I downloaded from Ubersuggest.
The Python script converts it into a CSV file.
I then loaded the CSV file into Airtable. Note that the Response field is empty, because that is something you will be filling in within the CSV file. Also note the ExclusionScore and InclusionScore fields I mentioned earlier. An ExclusionScore of -1 means that it is a sentence which didn't belong to the clique (and intent) originally, but was later mapped as a "best match". An InclusionScore of -1, on the other hand, means a sentence which did belong to the clique.
Personally, I think Airtable makes an excellent visual inspection tool for your intents.
First, after importing the CSV file, you will make the ExclusionScore and InclusionScore number fields (they will be imported as strings). You can do this conversion quite easily by editing the field type.
Then add a new field called Unmatched (meaning the phrase/sentence wasn't originally matched to an intent). You can do this by defining a Formula field.
Then group the entire sheet by ID first, and then by Unmatched.
This is what the output will look like at that point. Notice the very obvious difference between sentences/phrases which got mapped into an intent, and those that didn't.
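If you would rather do the same inspection in code than in Airtable, the three steps can be sketched in Python as follows. The ID, ExclusionScore and InclusionScore column names come from this article. The Phrase column name and the sample rows are invented for illustration, and the Unmatched rule (ExclusionScore equal to -1) is my reading of the -1 convention described earlier, not a copy of the Airtable formula.

```python
import csv
import io
from collections import defaultdict

# Invented sample data; a real file would come from the script's CSV output.
SAMPLE_CSV = """ID,Phrase,ExclusionScore,InclusionScore
1,reset my password,0.05,-1
1,change my password,0.07,-1
1,forgot password help,-1,0.82
2,track my order,0.10,-1
"""

def load_rows(csv_text):
    """Parse the CSV, convert the scores to numbers and add Unmatched."""
    rows = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        # The scores arrive as strings, just as they do in Airtable.
        row["ExclusionScore"] = float(row["ExclusionScore"])
        row["InclusionScore"] = float(row["InclusionScore"])
        # Assumed rule: -1 marks a sentence that was not in the clique.
        row["Unmatched"] = "Yes" if row["ExclusionScore"] == -1 else "No"
        rows.append(row)
    return rows

def group_rows(rows):
    """Group phrases by ID first, then by Unmatched."""
    groups = defaultdict(lambda: defaultdict(list))
    for row in rows:
        groups[row["ID"]][row["Unmatched"]].append(row["Phrase"])
    return groups

if __name__ == "__main__":
    for intent_id, by_flag in group_rows(load_rows(SAMPLE_CSV)).items():
        for flag, phrases in by_flag.items():
            print(intent_id, "Unmatched =", flag, phrases)
```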
Now, I will point out here that the screenshot above is a somewhat "best case" scenario. You will get sentences/phrases which do not belong to that intent under Unmatched = No sometimes. Conversely, sometimes you will also get sentences which can easily be added into a specific intent under Unmatched = Yes.
That is where the ExclusionScore and InclusionScore come in.
We will now color code the sheet to give us some visual hints about phrases which should be excluded and included.
Let us first color code sentences which should be included.
We will add a Green color to those sentences which are Unmatched and whose InclusionScore is high.
Then we will add a Red color to sentences which are Matched (Unmatched = No) and whose ExclusionScore is high.
Here is what the output looks like:
Clearly, these hints could be improved (and I am working on some ideas to do that).
A better choice of threshold values for the Exclusion and Inclusion scores can also make the suggestions better. Good thresholds will depend quite a bit on the type of agent you are creating and are probably domain specific.
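The color-coding rules described above can be written down as a small function, which makes it easy to experiment with different thresholds. This is only a sketch: the function name and the default threshold values of 0.3 and 0.8 are placeholders I picked for illustration, not recommended settings, and the test for "unmatched" again relies on the -1 convention for ExclusionScore.

```python
def color_hint(exclusion_score, inclusion_score,
               exclude_above=0.3, include_above=0.8):
    """Return "green", "red" or None for one row of the sheet.

    green: unmatched sentence whose InclusionScore is high (consider adding it)
    red:   matched sentence whose ExclusionScore is high (consider removing it)
    The threshold defaults are arbitrary placeholders.
    """
    unmatched = exclusion_score == -1  # assumed meaning of the -1 marker
    if unmatched and inclusion_score > include_above:
        return "green"
    if not unmatched and exclusion_score > exclude_above:
        return "red"
    return None
```

Running this over every row with a few different threshold pairs is a quick way to see how sensitive the hints are for your own domain.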
But you can already notice that the structure provided by this Airtable visual inspection is a big help when you actually create the intents.
You will then take this CSV file and use it as input to my Dialogflow CSV Importer tool. Remember to only select the first four columns so that you can use the file within the CSV Importer. You can do this by hiding the extra fields and downloading the rest as a CSV from inside Airtable.
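If you are working with the CSV file directly rather than exporting from Airtable, dropping the extra columns takes only a few lines of Python. This sketch keeps columns by position, because the exact column names expected by the CSV Importer are not listed in this article. The function name is made up.

```python
import csv
import io

def keep_first_columns(csv_text, count=4):
    """Return the CSV text with only the first `count` columns of each row."""
    output = io.StringIO()
    writer = csv.writer(output, lineterminator="\n")
    for row in csv.reader(io.StringIO(csv_text)):
        writer.writerow(row[:count])
    return output.getvalue()
```

To use it on a file, read the file's contents into a string, pass it through `keep_first_columns`, and write the result to a new file that you then upload to the importer.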
Here is what the final agent looks like:
I didn't add Responses into the CSV file, but you can see that the match is accurate by looking at the Intent name.
Get the course here.