Arguments made by @emilymbender & @timnitGebru seem controversial to the English-NLP model-building community because the scale of available data can give you the intuition that it's large enough to actually resemble real-world use of language. Thankfully, working on Polish taught
me the exact opposite. Not because Polish is a heavily under-resourced language: we're in the middle, both under-resourced and privileged to have more resources than many others. What we have definitely qualifies as large data, and we have modern models available for Polish, but training a GPT-2 and claiming it understands conversational Polish would feel like a joke. We actually have enough funding to do it, but knowing the quality of the available data it's absurd, since most of that data has been generated either in our group or in one of two others, or scraped. If we
trained a model on Polish Twitter, parliamentary speeches, Wikipedia discussions, some internet forums, and classical dramas, and then actually wrote in a paper that we now have a model that understands how Polish is actually used, when most Poles are 42+ and don't really talk the same IRL as they shitpost on the internet, I'm pretty sure we'd win some kind of comedy prize for that GPT-2's conversational capabilities. So in the end we're sitting down with a team of linguists trying to get somewhat better data in place, and it slows our progress.
It's just slow to build, and you'll sometimes face a comment on the internet that Polish AI scientists are slow and less efficient than researchers tackling English problems. But hey, we didn't build a Twitter bot that became a Nazi and wanted to have sex with everyone.
So it's slow: generating examples, collecting verified data, using an ISO norm for annotation, learning that we don't understand something, making small progress. There's no best-paper prize for that. We'll probably take some big modelling approaches too, since there's demand in the
industry, but we're probably safe from claiming Polish language understanding :D That was a nice read. Sure, it's somewhat biased towards theoretical idealism over experimental bonanza, but it's not like making models is now illegal, and it's great food for thought, useful for avoiding the mistakes others made. As a computer scientist working on a daily basis with linguists, I felt enlightened, not offended or attacked.