Several clients have recently asked me how to choose the ML Threshold for a Dialogflow agent, and I have been thinking about the question for a while. I finally found some time to research the problem, and here are my findings.
The quick and dirty answer
If you don't have time to read the whole post and just need a quick and dirty answer: use the default value you get when you create an agent. It is very likely tuned to work with the other recommendations provided by the Dialogflow team, so if you choose this option you should follow those recommendations too. Here are the defaults when you create an agent:
Notice that you are already presented with some default options and some default recommendations:
- Use the Hybrid mode if your agent has a small number of examples/templates in intents, and especially the ones using composite entities
- Use the ML mode for agents with a large number of examples in intents, especially the ones using @sys.any
- The Hybrid mode is chosen for you by default, since they don't assume you have a large number of examples. The recommended number of userSays examples per intent is 15.
- The ML Classification Threshold is set at 0.3
If you have a background in Machine Learning, the first two points make sense.
Without sufficient examples, training the ML model is harder, so they recommend the Hybrid mode. When you use composite entities, your matching has to be a little more "pattern based", and again it is best to use the Hybrid mode.
Similarly, when you have a large number of examples, the ML mode becomes practical. The Hybrid mode uses predefined entities for its intent matching, so you don't give it anything useful when you use the @sys.any wildcard. Here again, the ML mode makes more sense.
An alternative option - use Automated Conversation Testing
Each time a user sends a message to your agent, Dialogflow assigns a score to the intent it maps the message to. You can access this score when you call Dialogflow's REST API. By combining this score with some automation scripts, you can choose a good value for the ML Threshold. Let us call this Automated Conversation Testing (ACT).
What is Automated Conversation Testing?
Choose the most important phrases which you want your agent to handle correctly. Then, using Dialogflow's REST API, set up automated scripts which periodically send these phrases to the API and verify that the mapped intent is the one you expect.
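Here is a minimal sketch of what such a script could look like. It is not tied to any particular Dialogflow client: `detect` is a hypothetical function you would write yourself to wrap the REST call and return the mapped intent name and its score. The `fake_detect` stand-in and its phrases and scores are made up so the sketch can run offline.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical signature: takes a user phrase, returns (intent_name, score).
# In a real setup this would wrap a call to Dialogflow's REST API.
DetectFn = Callable[[str], Tuple[str, float]]


def run_act(expected: Dict[str, str], detect: DetectFn) -> List[dict]:
    """Send each phrase through `detect` and record whether the intent matched."""
    report = []
    for phrase, expected_intent in expected.items():
        intent, score = detect(phrase)
        report.append({
            "phrase": phrase,
            "expected": expected_intent,
            "actual": intent,
            "score": score,
            "ok": intent == expected_intent,
        })
    return report


def failures(report: List[dict]) -> List[dict]:
    """Return only the rows where the mapped intent was not the expected one."""
    return [row for row in report if not row["ok"]]


def fake_detect(phrase: str) -> Tuple[str, float]:
    """Offline stand-in for the real API call (made-up sample data)."""
    canned = {
        "what is the atomic number of carbon": ("GetAtomicNumberOfCarbon", 1.0),
        "tell me about carbon": ("Default Fallback Intent", 1.0),
    }
    return canned[phrase]
```

Running `run_act` on a schedule and alerting whenever `failures` is non-empty gives you the periodic check described above.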
The happy path
In testing, the happy path is the default scenario which has no exceptions. For chatbots, the happy path is the most commonly used user phrase or phrases. Another way to think about it is - if you get an unexpected phrase from the user, it is not in the happy path and you don't expect to handle it correctly.
Choose the most commonly used phrases for the ACT - these should be the ones which you absolutely expect your chatbot to handle. Now for each call to the Dialogflow REST API, make a note of the score which is assigned to the given phrase.
Normally, if you have specified your chatbot very well, all these scores should be 1 or very close to 1.
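A quick way to check this is to list the happy path phrases whose score is lower than you would like. This is only a sketch: the function name is mine, and the cut-off of 0.95 is my own assumption for what "very close to 1" means, so adjust it to taste.

```python
from typing import Dict, List


def below_expected(scores: Dict[str, float], minimum: float = 0.95) -> List[str]:
    """Happy path phrases whose score is below `minimum`, lowest score first.

    `minimum` is an assumed cut-off, not a value recommended by Dialogflow.
    """
    weak = [(score, phrase) for phrase, score in scores.items() if score < minimum]
    return [phrase for score, phrase in sorted(weak)]
```

Any phrase this returns is a candidate for adding to the userSays of the intent you expect it to match.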
Add more user phrases
Once you have done the step described above, select other user phrases which you haven't yet added to your agent's userSays. An important strength of Dialogflow is that its NLP is good enough to handle these variations.
When you use these phrases for testing, you will notice the scores starting to drop towards the 0.3-0.5 range.
You should also add some phrases which fail and get mapped to the Default Fallback Intent (or, if there is a context, to a context-driven fallback intent). Notice that when a phrase gets mapped to a fallback intent, you don't get a score close to zero. Instead, Dialogflow predicts that it is a fallback phrase and assigns a score of 1. Let us call these two buckets of user phrases the "Mapped" and "Unmapped" buckets.
Note: whether a phrase falls into the Mapped or Unmapped bucket depends on the current ML Threshold.
Repeat the process for multiple ML Threshold values
Choose increments of 0.1 for your ML Threshold. Start at 0.1 and go all the way up to 0.9 or whatever value you feel comfortable with.
As you do this, you will see the different userSays phrases fall into Mapped and Unmapped buckets based on where the threshold is set.
Let me go all mathematical on you for a minute. Suppose Mapped(0.1) and Unmapped(0.1) represent the lists of phrases which get mapped and unmapped (fallback) at an ML Threshold of 0.1. You should notice the following:
- The combination of the two buckets should give you the entire list of user phrases
- As you increase the ML Threshold score, some phrases should move from the Mapped to the Unmapped bucket
- You will probably have nothing in the Unmapped bucket when the ML Threshold is close to zero
- You will probably have only exact match phrases as you move close to an ML Threshold of 1.
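If you record the mapped intent for every phrase at each threshold, the first two observations above can be checked by a script. The sketch below assumes you have collected the runs into a dictionary of threshold -> {phrase: mapped intent name}. The function names are mine, and the fallback intent name is an assumption you should change to match your agent (add your context-driven fallback intents too).

```python
from typing import Dict, FrozenSet, Set, Tuple

# Assumed display name(s) of the fallback intent(s) in your agent.
DEFAULT_FALLBACKS: FrozenSet[str] = frozenset({"Default Fallback Intent"})


def buckets(
    run: Dict[str, str], fallbacks: FrozenSet[str] = DEFAULT_FALLBACKS
) -> Tuple[Set[str], Set[str]]:
    """Split one run (phrase -> mapped intent) into (Mapped, Unmapped)."""
    unmapped = {phrase for phrase, intent in run.items() if intent in fallbacks}
    mapped = set(run) - unmapped
    return mapped, unmapped


def check_sweep(
    runs: Dict[float, Dict[str, str]],
    fallbacks: FrozenSet[str] = DEFAULT_FALLBACKS,
) -> None:
    """Raise ValueError if a sweep of threshold -> run breaks either property."""
    all_phrases = None
    previous_mapped = None
    for threshold in sorted(runs):
        mapped, unmapped = buckets(runs[threshold], fallbacks)
        if all_phrases is None:
            all_phrases = mapped | unmapped
        # The two buckets together must give the entire list of phrases.
        if mapped | unmapped != all_phrases:
            raise ValueError(f"run at {threshold} used a different phrase list")
        # Raising the threshold may only move phrases from Mapped to Unmapped.
        if previous_mapped is not None and not mapped <= previous_mapped:
            raise ValueError(f"a phrase became mapped again at {threshold}")
        previous_mapped = mapped
```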
Arrange the user says phrases by their score
Now sort the user says phrases by their score. You can also make your automation produce a report where you can immediately see the two buckets (Mapped and Unmapped) based on the ML Threshold.
At this point, you can manually inspect your userSays phrases and decide which ones you definitely wish to map and which you don't care about as much. You are likely to see a score which separates the phrases into these two groups. Use that as your ML Threshold.
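Once you have marked the phrases you definitely want mapped, finding the separating score can be automated too. The sketch below is my own helper, not something Dialogflow provides. It relies on the rule that a phrase stays mapped only if its score is greater than the threshold, and it assumes the scores come from a run with a very low threshold and that at least one phrase is marked as must-map.

```python
from typing import Dict, Iterable, List, Optional, Tuple


def sorted_report(scores: Dict[str, float]) -> List[Tuple[str, float]]:
    """Phrases with their scores, highest score first."""
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


def separating_range(
    scores: Dict[str, float], must_map: Iterable[str]
) -> Optional[Tuple[float, float]]:
    """Return (low, high) so that any threshold t with low <= t < high keeps
    every must-map phrase mapped and sends every other phrase to fallback.
    Returns None when the two groups overlap and no such threshold exists."""
    must = set(must_map)
    lowest_must = min(scores[phrase] for phrase in must)
    others = [score for phrase, score in scores.items() if phrase not in must]
    highest_other = max(others, default=0.0)
    if highest_other >= lowest_must:
        return None
    return highest_other, lowest_must
```

If the function returns None, the two groups are interleaved, and the better fix is usually to add userSays examples for the low-scoring phrases you care about rather than to move the threshold.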
Pros and Cons of using Testing to choose the ML Threshold
As with any other process, using this approach to set the ML Threshold has some pros and cons.
Pros
- You will have a much more systematic approach to deciding your ML Threshold. Even though it is just a simple fractional number, the ML Threshold you choose has a big impact on the performance of your agent. With a process like the one I have described, you have a systematic and explainable way of choosing the threshold, rather than one based on intuition or gut feeling.
- It is easier to get stakeholders to participate in the process. For example, you can have multiple stakeholders review the phrases in the Mapped and Unmapped buckets for a given ML Threshold value and decide which value you want to use.
- I can't stress this enough: if you follow a systematic process, you will get a very good intuition about why your chatbot behaves in a certain way.
- It will also simplify your deployment process. With a system like this in place, when you make changes to your test agent you can make sure the new changes don't push some messages from the Mapped to the Unmapped bucket.
Cons
- Clearly, this requires some upfront work to set up.
- As you make changes to your agent, you need to repeat this review process each time. On the other hand, automated conversation testing is like a harness you can use as you make changes to your chatbot. You can make them more confidently, knowing that if you break something you will get a notification/alert right away.
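The deployment check can be as small as comparing two runs of the same phrases: one against the agent as it was, and one against the agent after your changes. A rough sketch, assuming each run is a dictionary of phrase -> mapped intent name (the function name and the fallback intent name are my assumptions):

```python
from typing import Dict, List

FALLBACK = "Default Fallback Intent"  # assumed name; change it to match your agent


def newly_unmapped(
    baseline: Dict[str, str], candidate: Dict[str, str], fallback: str = FALLBACK
) -> List[str]:
    """Phrases mapped to a real intent in `baseline` that fall back in `candidate`.

    Phrases missing from either run are ignored.
    """
    return sorted(
        phrase
        for phrase, intent in baseline.items()
        if intent != fallback
        and phrase in candidate
        and candidate[phrase] == fallback
    )
```

An empty list means no phrase moved from the Mapped to the Unmapped bucket. Anything else is worth a look before you deploy.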
Let us go over an example so it is easier to understand the process I am describing.
I created a super simple agent which gives some answers about some chemical properties of Carbon.
Here are the intents in the agent:
Let us take a closer look at the GetAtomicNumberOfCarbon intent:
As you can see it is a very simple intent with a single user says message and a single response.
Now, we use the process I described in the previous section. I have created a script which calls Dialogflow's REST API and sends a set of user messages (as queries) and then outputs the response. The following pieces of information are of interest to us:
- the text response coming back from the agent
- the intent which was mapped
- the score assigned to the mapping
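To give an idea of how these three fields could be pulled out of a response, here is a sketch. It assumes the JSON body has the shape returned by the V2 detectIntent endpoint, where the score is called `intentDetectionConfidence`. If you are on a different API version, the field names will differ and you will need to adjust them. The `QueryOutcome` class and `parse_outcome` function are names I made up for this illustration.

```python
from dataclasses import dataclass
from typing import Any, Dict


@dataclass
class QueryOutcome:
    text: str     # the text response coming back from the agent
    intent: str   # the intent which was mapped
    score: float  # the score assigned to the mapping


def parse_outcome(response: Dict[str, Any]) -> QueryOutcome:
    """Extract the three fields from a V2-style detectIntent body (assumed shape)."""
    result = response.get("queryResult", {})
    return QueryOutcome(
        text=result.get("fulfillmentText", ""),
        intent=result.get("intent", {}).get("displayName", ""),
        # Fields holding a default value can be missing from the JSON,
        # so a missing score is read as 0.0 here.
        score=float(result.get("intentDetectionConfidence", 0.0)),
    )
```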
ML Threshold 0.1
I set the ML Threshold very low to begin with, at just 0.1. As I ran the set of user requests I also recorded the response coming back for each request and displayed it in a table.
As expected, almost everything gets mapped when the ML Threshold is very low.
Notice that just one phrase (tell me about carbon) is being mapped to the Default Fallback Intent.
Let us increase the ML Threshold a little and see what happens.
ML Threshold 0.5
I set the ML Threshold at 0.5 and reran the same queries. This is what the output looks like for this case:
With an increased ML Threshold, you now have 5 phrases being mapped to the fallback intent. Not surprisingly, all of those phrases had a score below 0.5 when they were mapped to an intent in the previous run. Since the ML Threshold was raised to 0.5, they now get mapped to the Default Fallback Intent.
ML Threshold 0.9
Now let us crank it up a bit. We will set the ML Threshold to 0.9.
In Dialogflow-speak, we are telling Dialogflow to only match to an intent if what the user says is really close to one of the phrases we have already declared.
And as you might expect, everything except two phrases has now been mapped to the Default Fallback Intent. Those two phrases are very close to what we already have in the userSays.
How to use this in your agent
Now, one thing you might have noticed is that you didn't really have to run tests 2 and 3 once you had all the scores from test 1. This is because Dialogflow's intent mapping is very predictable in this way: once you have all the scores from a run with a very low threshold (such as 0.1), you can work out automatically which phrases will still be mapped to an intent for any ML Threshold value you might choose. (To stay mapped, a phrase's score must be greater than the ML Threshold.)
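In other words, a single low-threshold run is enough to predict the buckets for every other threshold. A minimal sketch of that prediction, assuming the run is stored as phrase -> (mapped intent name, score). The function name and the fallback intent name are my assumptions.

```python
from typing import Dict, List, Tuple

FALLBACK = "Default Fallback Intent"  # assumed name; change it to match your agent


def predict_buckets(
    low_run: Dict[str, Tuple[str, float]],
    threshold: float,
    fallback: str = FALLBACK,
) -> Tuple[List[str], List[str]]:
    """Predict (mapped, unmapped) phrases at `threshold` from one low-threshold run."""
    mapped, unmapped = [], []
    for phrase, (intent, score) in low_run.items():
        # A phrase that already hit the fallback intent reports a score of 1,
        # so its score says nothing about the threshold; it stays unmapped.
        if intent != fallback and score > threshold:
            mapped.append(phrase)
        else:
            unmapped.append(phrase)
    return mapped, unmapped
```

Calling this for 0.1, 0.2, and so on up to 0.9 gives you the whole sweep without sending any further requests to the agent.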