In festive angst news: SuperGLUE is officially "solved". Twice.

Not sure if 0.7% improvement over T5 means NLP got radically better. But it IS a mental goalpost: we're now back to defining "NLU" & how to test it.

A biased summary of my fav benchmark ideas from 2020:

/1 https://twitter.com/sleepinyourhat/status/1344382025986437122

- @tallinzen's test-only datasets, deliberately not coming from the same distribution as the training data:
https://www.aclweb.org/anthology/2020.acl-main.465/
There's been a lot of great work on adversarial eval and challenge sets by his and other groups.
/2
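To make the idea concrete, here is a minimal sketch of out-of-distribution evaluation with a test-only challenge set: score the model both on the usual in-domain test set and on data drawn from a different distribution. The `model.predict` interface and the dataset variables are hypothetical placeholders, not any particular paper's setup.

```python
# Rough sketch of "test-only" / out-of-distribution evaluation:
# the model is trained on one distribution and scored both on the
# matching test set and on a challenge set it never saw examples from.
# `model`, `in_domain_test`, and `challenge_test` are hypothetical.
from sklearn.metrics import accuracy_score

def evaluate_with_challenge_set(model, in_domain_test, challenge_test):
    """Each test set is a list of (text, gold_label) pairs."""
    results = {}
    for name, dataset in [("in_domain", in_domain_test),
                          ("challenge", challenge_test)]:
        texts = [x for x, _ in dataset]
        gold = [y for _, y in dataset]
        preds = model.predict(texts)  # assumes a sklearn-style predict()
        results[name] = accuracy_score(gold, preds)
    # A large in_domain vs. challenge gap suggests the model is exploiting
    # dataset-specific artifacts rather than the underlying capability.
    return results
```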
- testing "basic linguistic capabilities": robustness to basic perturbations like negation. E.g. @sameer_ @guestrin and colleagues propose their "checklist":
https://www.aclweb.org/anthology/2020.acl-main.442/
There's also much BERTology probing work that could be viewed from this perspective.
/3
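As a toy illustration of the CheckList idea, here is a plain-Python sketch of a "minimum functionality test" for negation. It deliberately does not use the actual `checklist` library API, and `sentiment_model` is a hypothetical callable returning "pos" or "neg".

```python
# Toy CheckList-style "minimum functionality test" for negation, written
# in plain Python (not the actual `checklist` library API).
# `sentiment_model` is a hypothetical callable: text -> "pos" | "neg".

def negation_mft(sentiment_model):
    templates = [
        ("The food was {adj}.", "pos"),      # plain positive statement
        ("The food was not {adj}.", "neg"),  # negation flips the label
    ]
    adjectives = ["good", "great", "amazing"]
    failures = []
    for template, expected in templates:
        for adj in adjectives:
            text = template.format(adj=adj)
            if sentiment_model(text) != expected:
                failures.append((text, expected))
    # The capability "passes" only if every templated example is handled.
    return failures
```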
- trying to define what it is that needs to be understood, and then testing specifically that. E.g. @jdunietz @GregHBurnham @OwenRambow @jchucarroll et al argue that narrative understanding involves spatiotemporal & causal information.
https://www.aclweb.org/anthology/2020.acl-main.701/
/4
- 'diversity' benchmarks with target data partitions *within* a single task (so you could tell e.g. what questions the system can/can't do). It can be a dataset collection like ORB by @ddua17 @nlpmattg @AlonTalmor et al...
https://leaderboard.allenai.org/orb/submissions/get-started
/5
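A sketch of what per-partition reporting might look like: one accuracy number per question type instead of a single aggregate score. The field names ("question_type", "answer") are illustrative, not the actual ORB or QuAIL schema.

```python
# Sketch of per-partition scoring for a "diversity" benchmark: accuracy is
# broken down by question type so you can see which phenomena a system
# handles. The field names below are illustrative, not a real schema.
from collections import defaultdict

def accuracy_by_partition(predictions, examples):
    """`examples`: dicts with 'id', 'question_type', 'answer';
    `predictions`: dict mapping example id -> predicted answer."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for ex in examples:
        qtype = ex["question_type"]
        total[qtype] += 1
        if predictions.get(ex["id"]) == ex["answer"]:
            correct[qtype] += 1
    return {qtype: correct[qtype] / total[qtype] for qtype in total}
```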
... or it can be data systematically collected so as to cover a given set of target phenomena across domains in a balanced way, as in our QuAIL:
https://text-machine-lab.github.io/blog/2020/quail/
/6
Note that all the above approaches DEFINE things to test (adversarial patterns, reasoning types, etc.). This means more sophisticated data work. Yet papers are still max 8 pages, students can take NLP without Ling101, and Reviewer 2 is unconvinced that data work is 'research'.
/7
Now, these are all ideas for making better benchmarks. That's assuming that what we're doing can *in principle* lead to anything that could be called 'NLU'. That is disputed by @alkoller & @emilymbender, and also @ybisk et al.
https://aclweb.org/anthology/2020.acl-main.463/
https://aclweb.org/anthology/2020.emnlp-main.703/
/8
Last but not least: even if we somehow create the perfect data - we have SO MANY problems with our leaderboards!
https://twitter.com/annargrs/status/1338791638722875392?s=20
/9