Buried in the recent trillion-parameter language model paper is how its training dataset was created. Any page that contained one of the words on this list was excluded: https://github.com/LDNOOBW/List-of-Dirty-Naughty-Obscene-and-Otherwise-Bad-Words/blob/master/en Two sample banned words: "twink" and "sex"
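To make concrete what a filter like this does, here's a minimal sketch (my own reconstruction, not the paper's actual pipeline) of page-level blocklist filtering: any page containing any listed word is dropped entirely, with no regard for context. The filename and sample pages are assumptions for illustration.

```python
# Minimal sketch of page-level blocklist filtering (a reconstruction,
# not the paper's actual code). Assumes the LDNOOBW "en" list has been
# saved locally as bad_words_en.txt; multi-word entries are ignored
# here for brevity.
import re

with open("bad_words_en.txt", encoding="utf-8") as f:
    bad_words = {line.strip().lower() for line in f if line.strip()}

def is_clean(page_text: str) -> bool:
    """Return False if the page contains any listed word as a token."""
    tokens = re.findall(r"[a-z]+", page_text.lower())
    return not any(tok in bad_words for tok in tokens)

# Hypothetical pages, just to show the effect of the filter.
pages = [
    "Queer in AI works on inclusion around gender, sex, and sexuality.",
    "A completely unrelated article about gardening.",
]
kept = [p for p in pages if is_clean(p)]
# The first page is dropped solely for containing the word "sex",
# regardless of context -- which is exactly the problem.
```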
Almost all of these words are *contextual*: they can be offensive, or they can be part of normal, respectful communication. For example, the @QueerinAI website is excluded from this dataset because it uses the word "sex" in relation to human bodies https://sites.google.com/view/queer-in-ai/diversity-inclusion?authuser=0
This ban list excludes many texts and online communities that carry our understanding of, and dialogue about, gender, sex, sexuality, and race. How is a language model supposed to understand the queer community if it only ever sees texts without the word "sex"?
18 people, software engineers from what I can tell, assembled this list with little or no public discussion, and now it's being used by one of the most powerful and pervasive companies in the world to train its language models.
The NLP research community bears a lot of responsibility for this: spending millions to train a big version of a standard architecture gets a best paper award, but an essential part of how the training dataset is built (and thereby who is excluded from it) is literally a footnote left to SWEs.
Shoutout to @emilymbender @timnitGebru and others for their paper "On the Dangers of Stochastic Parrots: Can Language Models Be Too Big?" which dives into this issue and others!