In festive angst news: SuperGLUE is officially "solved". Twice.

Not sure if 0.7% improvement over T5 means NLP got radically better. But it IS a mental goalpost: we're now back to defining "NLU" & how to test it.

A biased summary of my fav benchmark ideas from 2020:

/1 https://twitter.com/sleepinyourhat/status/1344382025986437122

- @tallinzen's test-only datasets, deliberately not coming from the same distribution as the training data:
https://www.aclweb.org/anthology/2020.acl-main.465/
There's been a lot of great work on adversarial eval and challenge sets by his and other groups.
/2
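To make the idea concrete, here is a minimal sketch of out-of-distribution evaluation with a test-only challenge set: score the model both on the usual in-domain test set and on data drawn from a different distribution. The `model.predict` interface and the dataset variables are hypothetical placeholders, not any particular paper's setup.

```python
# Rough sketch of "test-only" / out-of-distribution evaluation:
# the model is trained on one distribution and scored both on the
# matching test set and on a challenge set it never saw examples from.
# `model`, `in_domain_test`, and `challenge_test` are hypothetical.
from sklearn.metrics import accuracy_score

def evaluate_with_challenge_set(model, in_domain_test, challenge_test):
    """Each test set is a list of (text, gold_label) pairs."""
    results = {}
    for name, dataset in [("in_domain", in_domain_test),
                          ("challenge", challenge_test)]:
        texts = [x for x, _ in dataset]
        gold = [y for _, y in dataset]
        preds = model.predict(texts)  # assumes a sklearn-style predict()
        results[name] = accuracy_score(gold, preds)
    # A large in_domain vs. challenge gap suggests the model is exploiting
    # dataset-specific artifacts rather than the underlying capability.
    return results
```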
- testing "basic linguistic capabilities": robustness to basic perturbations like negation. E.g. @sameer_ @guestrin and colleagues propose their "checklist":
https://www.aclweb.org/anthology/2020.acl-main.442/
There's also much BERTology probing work that could be viewed from this perspective.
/3
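As a toy illustration of the CheckList idea, here is a plain-Python sketch of a "minimum functionality test" for negation. It deliberately does not use the actual `checklist` library API, and `sentiment_model` is a hypothetical callable returning "pos" or "neg".

```python
# Toy CheckList-style "minimum functionality test" for negation, written
# in plain Python (not the actual `checklist` library API).
# `sentiment_model` is a hypothetical callable: text -> "pos" | "neg".

def negation_mft(sentiment_model):
    templates = [
        ("The food was {adj}.", "pos"),      # plain positive statement
        ("The food was not {adj}.", "neg"),  # negation flips the label
    ]
    adjectives = ["good", "great", "amazing"]
    failures = []
    for template, expected in templates:
        for adj in adjectives:
            text = template.format(adj=adj)
            if sentiment_model(text) != expected:
                failures.append((text, expected))
    # The capability "passes" only if every templated example is handled.
    return failures
```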
- trying to define what it is that needs to be understood, and then testing specifically that. E.g. @jdunietz @GregHBurnham @OwenRambow @jchucarroll et al argue that narrative understanding involves spatiotemporal & causal information.
https://www.aclweb.org/anthology/2020.acl-main.701/
/4
- 'diversity' benchmarks with target data partitions *within* a single task (so you could tell e.g. what questions the system can/can't do). It can be a dataset collection like ORB by @ddua17 @nlpmattg @AlonTalmor et al...
https://leaderboard.allenai.org/orb/submissions/get-started
/5
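A sketch of what per-partition reporting might look like: one accuracy number per question type instead of a single aggregate score. The field names ("question_type", "answer") are illustrative, not the actual ORB or QuAIL schema.

```python
# Sketch of per-partition scoring for a "diversity" benchmark: accuracy is
# broken down by question type so you can see which phenomena a system
# handles. The field names below are illustrative, not a real schema.
from collections import defaultdict

def accuracy_by_partition(predictions, examples):
    """`examples`: dicts with 'id', 'question_type', 'answer';
    `predictions`: dict mapping example id -> predicted answer."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for ex in examples:
        qtype = ex["question_type"]
        total[qtype] += 1
        if predictions.get(ex["id"]) == ex["answer"]:
            correct[qtype] += 1
    return {qtype: correct[qtype] / total[qtype] for qtype in total}
```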
... or it can be data systematically collected so as to cover a given set of target phenomena across domains in a balanced way, as in our QuAIL:
https://text-machine-lab.github.io/blog/2020/quail/
/6
Note that all the above approaches DEFINE things to test (adversarial patterns, reasoning types, etc.). This means more sophisticated data work. Yet papers are still max 8 pages, students can take NLP without Ling101, and Reviewer 2 is unconvinced that data work is 'research'.
/7
Now, these are all ideas for making better benchmarks. That's assuming that what we're doing can *in principle* lead to anything that could be called 'NLU'. That is disputed by @alkoller & @emilymbender, and also @ybisk et al.
https://aclweb.org/anthology/2020.acl-main.463/
https://aclweb.org/anthology/2020.emnlp-main.703/
/8
Last but not least: even if we somehow create the perfect data - we have SO MANY problems with our leaderboards!
https://twitter.com/annargrs/status/1338791638722875392?s=20
/9