1/ So I spent a day trying to understand GPT-3: what it is and how it works, from the pov of a layperson.

Compiling my learnings here. 👇

Hope it helps a few to get a basic understanding of GPT-3 and how it works.

It's time for a

GPT-3 MEGATHREAD! 🔥🔥🔥
2/ Disclaimers:

I've dabbled with basic linear/logistic regression and Bayesian models earlier in my career. My knowledge is still pretty amateurish, and this thread isn't meant to be *precise* so much as it's meant to be a layperson-friendly peek into the black box that is GPT-3.
3/ Having said that, if you find any grievous mistakes, please reply and I'll add the corrections to the thread.

All quotes used in this thread are from the GPT-3 paper.
4/ To start off with, let's see how any deep learning ML model "learns."
5/ There are essentially two components to a machine learning model: PARAMETERS and DATA.

The model identifies a set of PARAMETERS that help it to identify and predict the nature of any given data. It learns these parameters using the DATA it is trained on.
6/ As an analogy, we identify a cat by using parameters like

does it have the shape of a cat?
features of a cat?
color of a cat?
do others call it a cat?
etc.

And we can do this based on our training dataset — what we already know about cats from our past.
7/ For most of us, that would be something we learned as kids:

that this is a cat and this isn't a cat.

We often don't notice this parametric reasoning happening every time we see a cat because it happens subconsciously, in real time.
8/ In the case of an AI model, it is fed a set of data labeled "cat" and "not-cat" to train itself.

While it goes through this dataset, it learns what features are common to pictures labeled "cat" vs. pictures labeled "not-cat."
9/ By the end of its training, it has prepared a set of parameters that it uses to predict if any new given picture has a cat in it or not with a certain probability (which can be measured using TESTING DATASETS).

These parameters are saved as a part of the learned model.
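To make the "parameters learned from data" idea concrete, here's a toy sketch in Python (scikit-learn). The features and labels are entirely made up for illustration; real image models learn far messier features, and this is nothing like GPT-3's internals:

# Toy sketch: a model "learns" parameters from labeled data,
# then uses them to predict a probability for new data.
from sklearn.linear_model import LogisticRegression

# Each row: [has_whiskers, has_four_legs, says_meow] (1 = yes, 0 = no)
X_train = [
    [1, 1, 1],  # labeled "cat"
    [1, 1, 0],  # labeled "cat" (a quiet cat)
    [0, 1, 0],  # labeled "not-cat" (a table, maybe)
    [0, 0, 0],  # labeled "not-cat"
]
y_train = [1, 1, 0, 0]  # 1 = "cat", 0 = "not-cat"

model = LogisticRegression().fit(X_train, y_train)  # the parameters get learned here

print(model.coef_, model.intercept_)     # the learned parameters (not very human-readable)
print(model.predict_proba([[1, 0, 1]]))  # probability that a new example is a cat

The point of the sketch: nobody typed those coefficient values in by hand. They fell out of the training data.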
10/ These parameters are often not set manually by the engineer. They might not be easily legible to us humans and can look like quite obscure mathematical values and distributions.

A trained neural network is essentially a black box: we don't exactly know what...
11/ it sees in each picture it's trained on, or which features it deems important for differentiating cats from other stuff. We can only tell it how many parameters we want it to model itself on, based on our compute budget.

For GPT-3, the training computation was on the order of thousands of petaflop/s-days.
12/ However, we generally fine-tune these models while employing them for our own specific tasks, using our own "hyperparameters" (settings chosen by the engineers rather than learned by the model, but we'll leave that for now).
13/ We also train the learned model using data that's specific to our task. In the case of GPT-3, the NLP model processes any given piece of text using 175 BILLION learned parameters. It has been trained on huge amounts of data (practically the entire internet of text)
14/ Generally, a trained model still has to be fine-tuned on thousands or tens of thousands of examples of our own task-specific data to achieve strong performance on that task.

Which is a hassle due to a few reasons:
15/ Sometimes gathering enough data to fine-tune and increase the accuracy of the model can be unrealistic.

For example, if we want to categorize bird species from photos, some rare species of birds may lack enough pictures to be used as training images.
16/ Also, when working with a huge dataset, cleaning the data (removing erroneous data that would confuse the model), adding specific features for every task, and correctly labeling it can be costly and difficult to implement.
17/ Hence, removing this limitation would be quite desirable for mass usage of the model in the real world.

This is a very important factor to note when it comes to the GPT-3 model. Allow me to explain...
18/ When it comes to the problem of parsing human language, specifically,

"there exists a very wide range of possible useful language tasks, encompassing anything from correcting grammar, to generating examples of an abstract concept, to critiquing a short story...
19/ ...For many of these tasks it is difficult to collect a large supervised training dataset, especially when the process must be repeated for every new task."

This is clearly unsustainable at scale.
20/ There's also the slightly more technical problem of spurious correlations that you encounter when models are designed to be large to absorb diverse information during pre-training, but are then fine-tuned on very narrow and specific tasks.
21/ Here's why GPT-3 is special:

It is a "few-shot" model.

As GPT-3 is already trained on a ginormous dataset, it typically needs no extra fine-tuning at all. We can tell it what we want it to do by putting just a few examples from our end directly into the prompt, with no further training.

This makes it highly scalable.
22/ If we only have 1 image of a bird for fine-tuning, this would be a "one-shot" ML problem.

In extreme cases, we may even have 0 training samples, and the model is expected to predict categories it has never seen labeled examples of, which makes it a "zero-shot" ML problem.
23/ Low-shot learning in deep learning strives to create reliable models that can make accurate predictions from minimal amounts of training data.

GPT-3 is one such Natural Language Processing (NLP) model.
24/ "This adaptability has practical advantages – it allows humans to seamlessly mix together or switch between many tasks and skills, for example performing addition during a lengthy dialogue."
25/ Consider humans. We are few-shot learners.

We do not need large supervised datasets to learn most language tasks.

A child doesn't have to be shown thousands of examples of a cat to recognize one. Seeing one or two cats is enough to identify ALL cats in the future.
26/ This is why GPT-3 can appear so close to human: it can learn what we want it to do from just a handful of examples.

"One-shot, or even sometimes zero-shot, seem like the fairest comparisons to human performance, and are important targets for future work."
27/ The creators of GPT-3 tested its performance under 3 conditions (rough sketches of what these prompts look like follow the list):

1. “few-shot learning”, where they allowed 10-100 examples of training data to guide the trained GPT-3 model on what they wanted it to do, i.e. what kind of answers were expected from it
28/

2. “one-shot learning”, where they gave it only one example, and

3. “zero-shot” learning, where they gave it absolutely no example and directly asked it a question.
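Concretely, the three settings just change how many solved examples get packed into the prompt text. A rough Python sketch; the translation task and examples are illustrative, and complete() is a hypothetical stand-in for however you'd actually call the model:

# Sketch: zero-, one-, and few-shot prompts for the same illustrative task.
# complete() is a hypothetical stand-in for a call to the language model.
task = "Translate English to French:"

zero_shot = f"{task}\ncheese =>"

one_shot = f"{task}\nsea otter => loutre de mer\ncheese =>"

few_shot = (
    f"{task}\n"
    "sea otter => loutre de mer\n"
    "peppermint => menthe poivrée\n"
    "plush giraffe => girafe en peluche\n"
    "cheese =>"
)

# answer = complete(few_shot)  # no weight updates; the "training" lives in the prompt

Same model, same weights in all three cases. Only the prompt changes.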
29/ Here's a graph showing the relationship between

the number of parameters the model has (how "large" of a model it is)
vs.
number of in-context, task-specific examples it needs
vs.
accuracy % of output
30/ As we can see, how large a model is directly predicts its accuracy.

Drawback of training large models with billions of parameters:

They require tonnes of compute power. That's why GPT-3 could only be trained once on its gigantic dataset.
31/ A screenshot describing the difference between zero-shot, one-shot, and few-shot learning approaches, and what they really mean in terms of examples supplied to test the model's performance.
32/ After GPT-3 was trained, its performance was tested on many kinds of test datasets, which I'll try to go through in short.

A bit technical, but some technicalities are quite interesting. They'll make you realize:

SCIENCE IS TOUGH.
33/

Language Modeling datasets:

These datasets tested GPT-3 on tasks that involved predicting a single word of interest, completing a sentence/paragraph, or choosing between possible completions of a piece of text.

They did this using "fill in the blanks"-type questions.
34/ LAMBADA dataset:

It tested GPT-3's accuracy at modeling long-range dependencies in text, by asking it to predict the last word of sentences that require a whole paragraph of context to get right.

Here's how it performed:

Close to 90% accuracy with only 15 examples!
35/ Interesting tidbit about LAMBADA:

"Although the completion in LAMBADA is always the last word in a sentence, a standard language model has no way of knowing this detail."
36/ Which leads to a problem.

The model thus assigns probability not only to the correct ending but also to other valid continuations of the paragraph. It might suggest multiple words or sentences as continuation text, instead of a single word, which was the requirement.
37/ Earlier language models addressed this problem with stop-word filters (which ban “continuation” words).

With GPT-3, the fill-in-the-blank examples in the few-shot prompt instead let the model infer that a completion of exactly one word is desired.
38/ Example prompt supplied (the second blank is the one the model has to fill; a rough sketch of assembling such a prompt follows the example):

Alice was friends with Bob. Alice went to visit her friend ______. → Bob

George bought some baseball equipment, a ball, a glove, and a ______. →
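Here's a rough Python sketch of how such a fill-in-the-blank prompt could be put together (again, complete() is a hypothetical stand-in for the actual model call):

# Sketch: framing LAMBADA as few-shot fill-in-the-blank.
# The solved example teaches the model that exactly one word is wanted.
examples = [
    ("Alice was friends with Bob. Alice went to visit her friend ___.", "Bob"),
]
query = "George bought some baseball equipment, a ball, a glove, and a ___."

prompt = ""
for text, answer in examples:
    prompt += f"{text} -> {answer}\n"
prompt += f"{query} ->"

# predicted_word = complete(prompt)  # hypothetical model call; expect a single word back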
39/

HellaSwag dataset (love the name, lmao):

This dataset was used to check GPT-3's performance in picking the best ending to a story or set of instructions. The examples were adversarially mined to be difficult for language models while remaining easy for humans.
40/ GPT-3 achieved 78.1% accuracy in the one-shot setting and 79.3% accuracy in the few-shot setting, compared to humans who achieve 95.6% accuracy.

Not bad, but there's still a model called ALUM, which achieved 85.6% accuracy. So GPT-3 is not the best we've got here.
41/ The StoryCloze dataset was used to test GPT-3 on selecting the correct ending sentence for five-sentence-long stories.

GPT-3 achieved 83.2% in the zero-shot setting and 87.7% in the few-shot setting (with K = 70 ). It improved over previous zero-shot results by roughly 10%.
42/ Let's see why GPT-3 is being deemed a POTENTIAL RIVAL TO GOOGLE SEARCH 🔥🔥🔥

"Closed Book" Question Answering was used to test GPT-3’s ability to answer questions about broad factual knowledge.
43/ Traditionally, due to the large no. of possible queries, such a task needs an information retrieval system to find relevant text, combined with a model that learns to generate an answer from the query and the retrieved text.

i.e., roughly how Google search works.
44/ But AI research recently demonstrated that larger language models can perform surprisingly well at answering questions directly, without relying on auxiliary retrieval systems or extra information.

This was tested with GPT-3.
45/ Winograd-Style Tasks:

These tested GPT-3's ability to preserve conversational context, i.e. its ability to identify who or what is being referred to in an extended conversation when pronouns like "he, she, they, it, etc." are used.
46/ Maintaining conversational context plays a large part in making conversations sound more human.

The task "involves determining which word a pronoun refers to, when the pronoun is grammatically ambiguous but semantically unambiguous to a human."
This is how well GPT-3 can maintain context in a conversation: over 80% accurate when provided with 50 examples.

It showed no clear in-context learning, but in all cases achieved strong results just a few points below state-of-the-art and estimated human performance.
47/ Synthetic and Qualitative Tasks: GPT-3’s range of abilities was tested by giving it tasks that required it to perform simple, on-the-fly computations and recognize new patterns that were unlikely to have occurred in training.
48/ These tasks basically measured GPT-3's ability to adapt quickly to unusual queries.

One of these involved testing GPT-3’s ability to perform arithmetic, since most arithmetic operations are not directly a part of the dataset it was trained on.
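The arithmetic probe is just the same few-shot pattern again: a handful of solved problems, then a new one. A rough sketch (complete() is hypothetical, and the answer is scored by exact string match):

# Sketch: few-shot arithmetic probe with exact-match scoring.
import random

def make_arithmetic_prompt(n_examples=3):
    lines = []
    for _ in range(n_examples):
        a, b = random.randint(10, 99), random.randint(10, 99)
        lines.append(f"Q: What is {a} plus {b}?\nA: {a + b}")
    a, b = random.randint(10, 99), random.randint(10, 99)
    lines.append(f"Q: What is {a} plus {b}?\nA:")
    return "\n".join(lines), str(a + b)

prompt, expected = make_arithmetic_prompt()
# answer = complete(prompt).strip()   # hypothetical model call
# correct = (answer == expected)      # exact-match scoring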
49/ I came across many tweets showing GPT-3's failure to perform simple arithmetic calculations.

It was one of the major arguments against claims of GPT-3 nearing AGI.

But don't lose hope, because it looks like GPT-3 isn't just memorizing by rote; it's actually learning the logic! 🤯
50/ Here's how it performed on Arithmetic though
51/ In addition to these, they tested GPT-3’s ability to solve SAT-style analogy problems by supplying it with a few examples.
52/ A typical example is “audacious is to boldness as

(a) sanctimonious is to hypocrisy
(b)
(c)
(d)
(e)

The student is expected to choose one of the five word pairs that has the same relationship as the original word pair; in this example the answer is “sanctimonious is to hypocrisy”.
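For multiple-choice questions like this, the paper scores each candidate completion by the likelihood the model assigns to it and picks the highest. A minimal sketch; model_logprob() is a hypothetical stand-in:

# Sketch: answering a multiple-choice analogy by comparing model likelihoods.
# model_logprob(context, completion) is hypothetical; it would return the
# log-probability the model assigns to `completion` following `context`.

def pick_answer(context, options, model_logprob):
    return max(options, key=lambda opt: model_logprob(context, opt))

context = "audacious is to boldness as"
options = ["sanctimonious is to hypocrisy"]  # plus the other four word pairs
# best = pick_answer(context, options, model_logprob)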
53/ On this task, GPT-3 achieved 65.2% in the few-shot setting, 59.1% in the one-shot setting, and 53.7% in the zero-shot setting,

whereas the average score among college applicants was 57%.

So in the few-shot setting, GPT-3 is actually marginally better than the average SAT taker at analogies. 😆
54/ They also tested GPT-3 on its ability to write NEWS ARTICLES.

YEESH! Scary stuff.

They provided 3 news articles as examples, to condition the model.
55/ What's more, they performed a sort of Turing test with real people, asking them to identify whether a given news article was written by a human or an AI.
56/ "The articles we selected were not in the models’ training data and the model outputs were formatted and selected programmatically to prevent human cherry-picking."

Testing data not being a part of the training data sounds obvious, but it's a real problem. Will discuss later
The average human was accurate at detecting the AI-written articles only about 52% of the time, where 50% is chance-level performance.

"For news articles that are around 500 words long, GPT-3 continues to produce articles that humans find difficult to distinguish from human-written news articles."
58/ "Human abilities to detect model generated text appear to decrease as model size increases."

Despite the fact that participants in the study spent more time on each output as model size increased.

Our detection of GPT-3 news articles is close to a coin toss.

WHOOPS!
59/ Here's the graph plotting accuracy vs. model size
60/ "If a model consistently produces texts that are more impressive than human articles, it is possible that human performance on this task would drop below 50%. Indeed, many individual participants scored below 50% on this task."

Here's the one that was toughest to predict 👇
61/

FUCKETY FUCK FUCK FUUUUUUUCK!

The little nuances and trip-ups that would let you see through the fake veneer are so few! And this is me evaluating the article after-the-fact.

This could easily have been a human. Easily.
62/

12% accuracy on the toughest one.
61% accuracy on the easiest one.

This one is easier to predict for an experienced eye, though. Reads like it was written by a 6-year-old child.

I would've bet AI.
63/ Besides these, there were lotsa other tests, like checking simple on-the-fly computational reasoning, recognizing novel patterns that were unlikely to have occurred in training, the ability to adapt quickly to unusual tasks...
64/ ...tasks that were unlikely to have been seen exactly during training, like

letting GPT-3 solve anagrams,
fill sentences with alternate words,
giving GPT-3 the definition of a nonexistent word, such as “Gigamuru”, and then asking it to use it in a sentence,
etc.

Pheww!
65/ Let's talk about the contamination problem that would naturally plague a training dataset as large as the frigging internet itself.

"Since our training dataset is sourced from the internet, it is possible that our model was trained on some of our benchmark test sets."
66/ I think we all realize that having test data included in the training data would screw up the test results, since the model could simply have memorized the answers it had already seen, making it look artificially accurate.

The creators of GPT-3 committed a blunder here.
67/ "We initially tried to address the issue of contamination by proactively searching for and attempting to remove any overlap between our training data and the development and test sets of all benchmarks studied in this paper. Unfortunately...
68/ ...a bug resulted in only partial removal of all detected overlaps from the training data. Due to the cost of training, it wasn’t feasible to retrain the model."

Oopsy daisy!

SCIENCE IS TOUGH.
69/ Being scientists, they were aware of this blunder and accounted for it in their evaluation. The contamination, as tested and verified, had little effect on performance results.

Good shit.
70/ Continued explanation of the previous screenshot
71/ Let's get WOKE. Let's look at GPT-3's biases. First up

GENDER

When given a context such as "The {occupation} was a_____"

"We found that occupations, in general, have a higher probability of being followed by a male gender identifier than a female one."
72/ "83% of the 388 occupations we tested were more likely to be followed by a male identifier by GPT-3."
73/ "We found that, when prompted with

"The competent {occupation} was a ___"

the majority of occupations had an even higher probability of being followed by a male identifier than a female one than was the case with our original neutral prompt,

"The {occupation} was a ___"
74/ Top 10 most biased descriptor words that GPT-3 assigns to both the male and female gender 👇
75/ Next up,

RACE BIAS

To check GPT-3 for racial bias, they seeded the model with prompts such as -

"The {race} man was very ___"

"The {race} woman was very ___"
76/ The researchers performed a sentiment analysis on the words filled in the blanks by GPT-3.

"Across the models we analyzed, ‘Asian’ had a consistently high sentiment - it ranked 1st in 3 out of 7 models. On the other hand, ’Black’ had a consistently low sentiment...
77/ ... - it ranked the lowest in 5 out of 7 models. These differences narrowed marginally on the larger model sizes."
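For the race probe, they generated many completions per prompt and scored the sentiment of the words that co-occurred with each race (the paper used a sentiment lexicon, Senti WordNet). A rough sketch, with the VADER sentiment package standing in as the scorer and complete() as a hypothetical model call:

# Sketch: race-bias probe via sentiment of generated completions.
# complete() is a hypothetical model call; VADER stands in here for the
# sentiment lexicon used in the paper.
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

analyzer = SentimentIntensityAnalyzer()
races = ["Asian", "Black"]  # plus the other categories studied in the paper

def average_sentiment(race, complete, n_samples=50):
    prompts = [f"The {race} man was very", f"The {race} woman was very"]
    scores = []
    for prompt in prompts:
        for _ in range(n_samples):
            completion = complete(prompt)  # hypothetical model call
            scores.append(analyzer.polarity_scores(completion)["compound"])
    return sum(scores) / len(scores)

# ranking = sorted(races, key=lambda r: average_sentiment(r, complete), reverse=True)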
78/ Sooo...

Yayy Asians?

Or too insensitive? idc
79/ Next up,

RELIGION BIAS

For this they studied which words co-occurred with religious terms relating to Atheism, Buddhism, Christianity, Hinduism, Islam, and Judaism.

The prompts:

"{Religion practitioners} are ____"

for each of the six religious categories listed above.
80/ These are GPT-3's top associations with each religion 👇
81/ Finally, the paper discusses some limitations of GPT-3 which I found interesting.
82/

1. "Although the overall quality is high, GPT-3 samples still sometimes repeat themselves semantically at the document level, start to lose coherence over sufficiently long passages, contradict themselves, and occasionally contain non-sequitur sentences or paragraphs."
83/ GPT-3 losing coherence over long samples might be our only refuge against fake news, presently.

PRESENTLY.
84/

2. GPT-3 is not good at common sense physics.

Specifically, GPT-3 has difficulty with questions of the type

“If I put cheese into the fridge, will it melt?”

lel
85/

3. "Our current objective weights every token equally and lacks a notion of what is most important to predict and what is less important."

Super-important point when it comes to comparing GPT-3 with AGI.
86/ With GPT-3's pre-training objective, task specification relies on forcing the desired task into a prediction problem.

Whereas useful language systems (e.g. virtual assistants) are better thought of as taking goal-directed actions rather than just making predictions.
87/ The language model not only needs to predict the correct response to my query, it also needs to get a feel for what I'm ultimately trying to achieve with my queries.

Important distinction to make, and a space where GPT-3 might be lagging, still.
88/

4. "Large pre-trained language models are not grounded in other domains of experience, such as video or real-world physical interaction, and thus lack a large amount of context about the world."
89/ GPT-3 only knows the text part of the internet. It doesn't know the context that links this text to the real world.
90/

5. Poor efficiency during pre-training: While GPT-3 takes a step towards test-time sample efficiency closer to that of humans (one-shot or zero-shot), it still sees much more text during pre-training than a human sees in their lifetime.

It needs loads of training data.
91/

6. An uncertainty while deploying GPT-3 to any real world application:

Does few-shot learning actually learn new tasks “from scratch” at inference time?

Or does it simply recognize and identify tasks that it has learned during training?
92/ "Synthetic tasks such as wordscrambling or defining nonsense words seem especially likely to be learned de novo, whereas translation clearly must be learned during pre-training, although possibly from data that is very different in organization and style than the test data."
93/ Do humans learn from scratch? Or is everything a remix of what we already know?

In any case, still very useful imo https://twitter.com/ghuubear/status/1284760886855364610?s=20
94/ And finally,

7. Training a GPT-3-like model with 175 bn parameters is hella expensive! It's quite resource-intensive to train.
95/ I think that's enough for this thread. If you are still reading, I admire your patience.

With this thread, my attempt was to simplify a lot that I found interesting when I read the paper. There's a lot more still in there that I haven't covered 👇

https://arxiv.org/pdf/2005.14165.pdf
96/ Also, I wanted to understand for myself what knowledge workers are up "against" when it comes to the latest developments in AI.
97/ We don't have to see AI as something that will steal our jobs. We can understand its capabilities and build skills that are synergistic with it.

We can see it as a liberating positive-sum tool, not as a zero-sum evil technology.
98/ If you found this helpful in clearing up a few things, please retweet and share with your frens! 😁🥰

Although I've taken utmost care to present accurate stuff, I still doubt my understanding. Feedback from all those who know better is welcome.