I am working on creating a new service – tentatively called Dialogflow conversation audit – where I plan to analyze the client’s agent and make suggestions to improve its accuracy.
But this means there must be a measurable way to determine the bot’s accuracy in the first place.
The UMM method
A while back, I wrote a post linking to Chatbase's UMM method, which provides a way to reason about your chatbot's accuracy. While it is a good idea and I do borrow some ideas from it, it is not particularly practical, because the UMM method does not give you a way to actually measure accuracy.
What I am proposing here is much simpler.
The confusion matrix
You may be familiar with the term error matrix or confusion matrix. If not, don’t worry!
It is a way to measure if classification techniques work well, and it is quite appropriate in this case because under the hood, Dialogflow takes the user’s input and classifies it to the nearest matching intent.
So let us define the following terms:
Regular intent = An intent which is not a fallback intent
Correct mapping = the user’s phrase was mapped to the expected, appropriate intent. By the way, if you find the idea of “correct mapping” subjective rather than objective, you probably need to improve the way you are defining your intents.
True Positive (TP) = A user phrase is mapped to a regular intent correctly
True Negative (TN) = A user phrase is mapped to a fallback intent correctly (this means, we haven’t yet declared an intent to handle the user’s phrase)
False Positive (FP) = A regular intent is triggered, but the phrase should have been mapped either to a different regular intent, or to the fallback intent because we don’t yet handle it. Instead, it is wrongly mapped to a regular intent.
False Negative (FN) = A fallback intent is triggered, but we actually have already defined a regular intent which should have been mapped to the user’s phrase. An excellent example of this is when the user types a message which is nearly identical to a training phrase except for a small typo.
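The four definitions above can be sketched as a small labeling function. This assumes you have logged both the intent Dialogflow actually matched and the intent you expected; the intent names and the fallback label here are illustrative, not part of any Dialogflow API.

```python
def classify_mapping(matched_intent: str, expected_intent: str,
                     fallback_name: str = "Default Fallback Intent") -> str:
    """Label a single user message as TP, TN, FP or FN.

    matched_intent  -- the intent Dialogflow actually triggered
    expected_intent -- the intent a human reviewer says it should have triggered
    """
    matched_fallback = matched_intent == fallback_name
    expected_fallback = expected_intent == fallback_name

    if not matched_fallback and matched_intent == expected_intent:
        return "TP"  # regular intent, correctly matched
    if matched_fallback and expected_fallback:
        return "TN"  # fallback fired, and no regular intent should handle the phrase
    if not matched_fallback:
        return "FP"  # a regular intent fired, but it was the wrong mapping
    return "FN"      # fallback fired, but a regular intent should have matched
```

For example, `classify_mapping("Default Fallback Intent", "BookFlight")` returns `"FN"`: the fallback fired even though a regular intent (here, a hypothetical `BookFlight`) should have handled the phrase.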
Consider the last 100 user messages to your bot. If you don’t have that many, ask 10 or 15 beta testers to try out your bot for a few minutes.
Let TP be the number (out of the 100) messages which were true positive mappings.
Similarly, TN = number of true negative mappings
FP = number of false positive mappings
FN = number of false negative mappings
Let Correct Mapping (CM) = TP + TN
Let Incorrect Mapping (IM) = FP + FN
Accuracy = CM / (CM + IM)
Since CM + IM = 100 (if you got the correct sample size), the value is already a percentage.
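The calculation above can be written out in a few lines. This is a minimal sketch assuming you have already tallied the four counts for your sample; the example counts are made up for illustration.

```python
def bot_accuracy(tp: int, tn: int, fp: int, fn: int) -> float:
    """Accuracy = (TP + TN) / (TP + TN + FP + FN), expressed as a percentage."""
    cm = tp + tn  # correct mappings
    im = fp + fn  # incorrect mappings
    return 100 * cm / (cm + im)

# With a sample of exactly 100 messages, CM is already the percentage:
print(bot_accuracy(tp=70, tn=15, fp=10, fn=5))  # → 85.0
```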
Why measure bot accuracy
If you are trying to improve your bot, a good first step is to measure its accuracy with an objective metric. The systematic approach I have explained here gives you such a numerical measure. It works best when combined with my existing recommendations, such as avoiding slot filling and using a context lifespan of 1, which make the bot much easier to analyze. Accuracy isn’t the only useful metric, though. An even more useful measure might be how often users actually accomplish their goals, and for that, Chatbase’s funnels feature is a lot more helpful.