What does it take to force a neural network to represent the syntactic structure of natural language? For those of us who take both decades of linguistic research and the recent successes in deep learning seriously, this is a key question.
One approach is to use an expressive enough neural architecture, massive data and massive compute, and cross your fingers that syntactic structure, to the extent that it is ‘real’ in the first place, is learned. E.g., https://twitter.com/gg42554/status/1295739316308672512
Recent empirical results (incl ‘bertology’, GPT3, etc) give this approach much momentum, but i.m.o. we’re still far from a convincing demonstration that the *productive* grammar rules that any native speaker knows can be learned in this manner from reasonable amounts of data.
A 2nd approach is to use a hybrid architecture, such as recursive NNs, graph NNs or recurrent NN grammars. This is the approach we and others explored (mostly 2012-2015). See, e.g., our 2015 TreeLSTM and Neural Chart Parser ("FCN") papers. https://twitter.com/wzuidema/status/1217143686460473345
Here, the symbolic component is typically provided by a symbolic parser. Bogin et al (2020) have some amazing new results w/ neural chart parsing, showing this approach is still viable when data is scarce and the task is challenging and syntax-dependent. https://twitter.com/ben_bogin/status/1278295395169435648
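To make the hybrid idea a bit more concrete, here is a minimal sketch of a child-sum TreeLSTM-style cell (a generic formulation in the spirit of approach 2, not the exact model from our 2015 papers) that recursively composes a node's representation from its children, following a parse tree delivered by an external symbolic parser. All names, dimensions and the little `encode` helper are illustrative, and the tree format assumes every node carries a word (e.g., a dependency tree).

```python
import torch
import torch.nn as nn

class ChildSumTreeLSTMCell(nn.Module):
    def __init__(self, x_dim, h_dim):
        super().__init__()
        self.h_dim = h_dim
        self.W_iou = nn.Linear(x_dim, 3 * h_dim)              # input/output/update gates, from the word embedding...
        self.U_iou = nn.Linear(h_dim, 3 * h_dim, bias=False)  # ...and from the summed child states
        self.W_f = nn.Linear(x_dim, h_dim)                     # forget gates, one per child
        self.U_f = nn.Linear(h_dim, h_dim, bias=False)

    def forward(self, x, child_h, child_c):
        # x: (x_dim,) embedding of the node's word; child_h, child_c: (n_children, h_dim)
        h_sum = child_h.sum(dim=0)
        i, o, u = torch.chunk(self.W_iou(x) + self.U_iou(h_sum), 3)
        i, o, u = torch.sigmoid(i), torch.sigmoid(o), torch.tanh(u)
        f = torch.sigmoid(self.W_f(x) + self.U_f(child_h))    # broadcasts over the children
        c = i * u + (f * child_c).sum(dim=0)
        h = o * torch.tanh(c)
        return h, c

def encode(tree, cell, embed):
    # tree: (word_id, [subtrees]) as produced by a symbolic parser; returns (h, c) for the root
    word_id, children = tree
    states = [encode(child, cell, embed) for child in children]
    child_h = torch.stack([h for h, _ in states]) if states else torch.zeros(0, cell.h_dim)
    child_c = torch.stack([c for _, c in states]) if states else torch.zeros(0, cell.h_dim)
    return cell(embed(torch.tensor(word_id)), child_h, child_c)
```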
A 3rd approach, for which we coined the term ‘symbolic guidance’, is to use a common NN architecture, but change the loss function so that the NN is ‘guided’ to desired regions in the hypothesis space, where solutions are similar to a symbolic grammar (see also [1,2]).
http://www.interpretable-ml.org/nips2017workshop/papers/12.pdf
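To give a flavour of what ‘symbolic guidance’ can mean in practice, here is a minimal sketch: the architecture stays an ordinary sequence model, but an auxiliary loss term pulls the geometry of its hidden states towards distances in a symbolic parse tree. This pairwise-distance formulation is just one illustrative choice, not the specific objective of [1] or [2]; `guidance_loss` and `lam` are made-up names.

```python
import torch

def guidance_loss(hidden, tree_dist):
    # hidden: (seq_len, h_dim) hidden states of an ordinary LM for one sentence
    # tree_dist: (seq_len, seq_len) pairwise distances between words in the symbolic parse tree
    pred_dist = (hidden.unsqueeze(0) - hidden.unsqueeze(1)).norm(dim=-1)  # distances in representation space
    return ((pred_dist - tree_dist) ** 2).mean()

def guided_loss(lm_loss, hidden, tree_dist, lam=0.1):
    # the usual language-modelling loss, plus a term steering the model towards tree-like solutions
    return lm_loss + lam * guidance_loss(hidden, tree_dist)
```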
A 4th approach would combine the best bits from approach 1 (expressive architecture), 2 (hybrid teacher model) & 3 (symbolically-guided student model), & glue them together using Knowledge Distillation. In 2015, I almost convinced a funding agency that this was the way forward...
The proposal was rejected, however, and the model never materialized. We did extensive work on knowledge distillation in another project, but without the ‘symbolic guidance’ twist. https://twitter.com/wzuidema/status/1275913841700937731
But, it turns out, a fully-fledged approach-4 model is already out there, described in a beautiful recent paper by Kuncoro et al. (incl. Chris Dyer @redpony), using RNNG teachers and BERT students. (I somehow missed the earlier Kuncoro et al. 2019 ACL paper.) https://twitter.com/DeepMind/status/1266338278350913537
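For concreteness, here is a hedged sketch of the distillation recipe behind approach 4: a syntax-aware teacher (e.g., an RNNG) provides soft word-prediction targets, and a BERT-style student is trained to match them alongside the usual data loss. This is the generic knowledge-distillation objective, not the exact loss used by Kuncoro et al.; `alpha`, `T` and the tensor shapes are illustrative.

```python
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, gold_ids, alpha=0.5, T=2.0):
    # student_logits, teacher_logits: (n_positions, vocab_size) predictions at the masked positions
    # gold_ids: (n_positions,) the observed words
    hard_loss = F.cross_entropy(student_logits, gold_ids)        # ordinary data loss
    soft_loss = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)                                                   # temperature-scaled KL to the syntactic teacher
    return alpha * soft_loss + (1 - alpha) * hard_loss
```

Raising the temperature T softens the teacher distribution, so the student also learns which alternatives the syntax-aware teacher considers plausible, not just its single top prediction.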

Results again show improvements when syntax is arguably crucial and data is scarce (i.e., coreference resolution, semantic role labeling and an interesting diagnostic ‘supertagging’ probe), and slightly degraded performance when they are not.
So, although we currently look in amazement at what an unsupervised model can do when given 1000B words, 175B parameters and >$10M worth of computing time, when we go back to inserting syntax into neural nets, there are plenty of promising ideas there! https://twitter.com/wzuidema/status/1286714583642705921
[1] Chrupała & Alishahi (2019), Correlating Neural and Symbolic Representations of Language, ACL 2019: https://www.aclweb.org/anthology/P19-1283/
[2] Du, Lin, Shen, O'Donnell, Bengio, & Zhang (2020), Exploiting Syntactic Structure for Better Language Modeling https://arxiv.org/abs/2005.05864