I'm glad this phenomenon has a name! The last time I observed this was while working on the Semantic Scholar search engine (feature-based LightGBM LambdaMART model). It occurred in two different ways. 1/n https://twitter.com/alexdamour/status/1325921856738701312
First, I observed that adding perfectly sensible features that improved held-out NDCG performance would destroy qualitative performance. The solution was to narrow down and refine the feature space to give the model fewer ways to go wonky. 2/n
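For context, "held-out NDCG" is the standard ranking metric being optimized here. A minimal pure-Python sketch of NDCG@k (the real pipeline used LightGBM's LambdaMART objective on top of this metric; the example relevance lists below are made up):

```python
import math

def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the top-k results, in ranked order."""
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances[:k]))

def ndcg_at_k(relevances, k):
    """NDCG: DCG normalized by the DCG of the ideal (relevance-sorted) ranking."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0

# A ranking that puts the most relevant result first scores higher
# than one that buries it, even with the same result set.
near_ideal = ndcg_at_k([3, 2, 0, 1], k=4)
reversed_rank = ndcg_at_k([0, 1, 2, 3], k=4)
```

The point of the thread is that many feature sets and hyperparameters can tie on this one number while behaving very differently on real queries.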
Second, different hyperparameter sets that had equivalent held-out NDCG behaved in totally different ways when actually deployed. The solution was to construct a custom validation test suite that tested models for correct behavior instead of NDCG. 3/n
What is "correct behavior"? It's what I believe users of a scholarly search engine expect when they issue a query, and it was constructed by hand. It looked like a giant set of unit tests, except each test wasn't pass/fail: it produced a 0-to-1 score, and we picked the model that maximized the aggregate. 4/n
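A minimal sketch of that kind of behavioral suite. The specific checks and names here are invented for illustration, not taken from the actual Semantic Scholar suite; the structural idea is that each check scores a ranker from 0 to 1, and model selection is an argmax over the aggregate score rather than over NDCG:

```python
def score_exact_title_match(ranker):
    """Querying an exact paper title should put that paper at rank 1."""
    results = ranker("attention is all you need")
    return 1.0 if results and results[0] == "attention is all you need" else 0.0

def score_author_query(ranker):
    """Querying an author name should surface that author's papers near
    the top. Partial credit: fraction of the top 3 results that match."""
    results = ranker("hypothetical author")
    return sum(1 for r in results[:3] if "hypothetical author" in r) / 3.0

BEHAVIORAL_TESTS = [score_exact_title_match, score_author_query]

def suite_score(ranker):
    """Aggregate 0-to-1 score across all behavioral tests."""
    return sum(test(ranker) for test in BEHAVIORAL_TESTS) / len(BEHAVIORAL_TESTS)

def pick_model(candidate_rankers):
    """Deploy the candidate that maximizes the suite score, not NDCG."""
    return max(candidate_rankers, key=suite_score)
```

Because the checks encode expected behavior directly, two models with identical held-out NDCG can get very different suite scores.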
This made for a bizarre situation, from the perspective of standard ML. The training data was ordinary: queries, results, clicks from real users. But the test set was non-iid on purpose: real queries, constructed results, idealized clicks. But it worked. 5/n
We are now working on overhauling the author disambiguation (way harder than search BTW), and seeing the same sort of thing. But here we are using data augmentation to reduce underspecification. 6/n
This requires a slow cycle of (a) fit the model, (b) manually examine results, (c) try to understand which errors can be attributed to the modeling, (d) add augmented data that tilts the model away from that type of error. 7/n
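One concrete, hypothetical instance of step (d) for author disambiguation: suppose error analysis shows the model keeps splitting authors whose names appear with and without a middle initial. You can synthesize "same author" pairs that differ only in that surface variation (the helper names and labels below are illustrative, not the actual pipeline):

```python
def drop_middle_initial(name):
    """'Sergey A. Feldman' -> 'Sergey Feldman' (no-op without a middle initial)."""
    parts = name.split()
    if len(parts) == 3 and parts[1].endswith("."):
        return f"{parts[0]} {parts[2]}"
    return name

def augment_positive_pairs(mentions):
    """For each mention with a middle initial, emit a synthetic pair labeled
    'same author' (1) that differs only in the initial. Added to training data,
    these pairs tilt the model away from splitting on this variation."""
    pairs = []
    for name in mentions:
        variant = drop_middle_initial(name)
        if variant != name:
            pairs.append((name, variant, 1))
    return pairs
```

Each error type found in step (c) gets its own small generator like this, which is what makes the approach more systematic than hand-building a validation set.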
Data augmentation is more systematic than custom val-set construction, and is likely more widely applicable. But all solutions to underspecification that I've implemented require squinting at model outputs repeatedly for weeks or months. But hey that's the work. 8/8
You can follow @SergeyFeldman.