We (me & @tylerjdunn) had a great question today in office hours that I think is worth a thread. It seems pretty straightforward but reveals a lot about how you can go wrong when building NLP systems.

The person asked how you know if your NLP model is overfitting.
Very common question: a lot of ML training is about hitting that sweet spot between under- and overfitting.

But if you're building NLP systems I think it can be a bit of a red herring. It's a lossy approximation of the thing you really need to care about, which is...
... how well will this system work for the people who use it?

Your test data is an estimate of what sort of interactions you think your users will have with the system. But what's better than an estimate is the real thing.

In the case of conversational AI, actual conversations.
For me, the better question is: how successful are users when using this system to do something?

Can they actually book a taxi, or file their insurance claim, or reschedule an appointment? Can they quickly and consistently do what they need to do?
And test data will only tell you so much. It's far better to get your system in front of test users ASAP, see where it fails, and fix it right away than to spend extra time tweaking hyperparameters on an NLU model. It's no use having great recall on an intent no one ever uses. 🤷‍♀️
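As a rough illustration of what "evaluate on the real thing" can look like, here's a minimal sketch. It assumes you log each conversation with the intent the user was after and whether they got their task done; `predict`, `conversations`, and the field names are placeholders for whatever your stack actually provides, not any particular library's API.

```python
from collections import Counter

def evaluate_from_logs(conversations, predict):
    """Task success rate plus per-intent recall, weighted by real traffic.

    `conversations` is a list of dicts like
    {"utterance": ..., "intent": ..., "task_completed": True/False}
    pulled from real usage logs; `predict` maps an utterance to an intent.
    Both are placeholders for whatever your NLU stack exposes.
    """
    usage = Counter(c["intent"] for c in conversations)   # real traffic per intent
    hits = Counter()                                       # correct intent predictions
    completed = sum(1 for c in conversations if c["task_completed"])

    for c in conversations:
        if predict(c["utterance"]) == c["intent"]:
            hits[c["intent"]] += 1

    total = len(conversations)
    return {
        # The number users actually feel: did they get the thing done?
        "task_success_rate": completed / total,
        # Per-intent recall next to how much traffic that intent really gets:
        # great recall on an intent with ~0% traffic barely moves this picture.
        "per_intent": {
            intent: {"traffic_share": n / total, "recall": hits[intent] / n}
            for intent, n in usage.items()
        },
    }
```

Sorting `per_intent` by `traffic_share` makes the trade-off obvious: mediocre recall on the intent that carries 40% of real conversations matters far more than perfect recall on one nobody asks for.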
To sum up: ML methods and model evaluation are just one (albeit flexible and useful) tool. But if you're solving a problem that's slightly perpendicular to the one your users actually have, all the ML engineering in the world won't make your system good to use.