In festive🎄angst news: SuperGLUE is officially "solved". Twice.

Not sure if a 0.7% improvement over T5 means NLP got radically better. But it IS a mental goalpost: we're now back to defining "NLU" & how to test it 😱

A biased summary of my fav benchmark ideas from 2020👇:
/1 https://twitter.com/sleepinyourhat/status/1344382025986437122
- 'diversity' benchmarks with target data partitions *within* a single task (so you can tell, e.g., which questions the system can/can't handle; rough sketch of the idea below). It can be a dataset collection like ORB by @ddua17 @nlpmattg @AlonTalmor et al...

https://leaderboard.allenai.org/orb/submissions/get-started
/5
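To make the "partitions within a single task" idea concrete, here's a minimal Python sketch of per-partition evaluation. The partition labels, examples, and `predict` function are all hypothetical placeholders, not ORB's actual format or API:

```python
# Sketch: score a system per question type, not just with one aggregate
# number, so you can see which partitions it fails on.
from collections import defaultdict

def per_partition_accuracy(examples, predict):
    """examples: iterable of (question, gold_answer, partition_label)."""
    correct, total = defaultdict(int), defaultdict(int)
    for question, gold, partition in examples:
        total[partition] += 1
        if predict(question) == gold:
            correct[partition] += 1
    return {p: correct[p] / total[p] for p in total}

# Hypothetical data, partitioned by reasoning type.
examples = [
    ("Who wrote Hamlet?", "Shakespeare", "factoid"),
    ("Is 7 greater than 3?", "yes", "numeric_comparison"),
    ("What happens if you drop a glass?", "it breaks", "commonsense"),
]
# A dummy model that only ever answers one thing: the per-partition
# breakdown exposes exactly where it fails.
scores = per_partition_accuracy(examples, predict=lambda q: "Shakespeare")
for partition, acc in sorted(scores.items()):
    print(f"{partition}: {acc:.0%}")
```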
Note that all the above approaches DEFINE things to test (adversarial patterns, reasoning types, etc.). This means more sophisticated data work. Yet papers are still max 8 pages, students can take NLP without Ling101, and Reviewer 2 is unconvinced that data work is 'research'🤷
/7
Last but not least: even if we somehow create the perfect data, we have SO MANY problems with our leaderboards!
https://twitter.com/annargrs/status/1338791638722875392?s=20
/9