Today let's talk about why we keep "splitting the data" into different sets.

Besides machine learning people being quirky, what else is going on here?

Grab your coffee ☕️, and let's do it!

🧵👇
Imagine you are teaching a class.

Your students are getting ready for the exam, and you give them 100 answered questions so they can prepare.

You now need to design the exam.

What's the best way to evaluate the students?

(2 / 19)
If you evaluate the students on the same questions you gave them to prepare, you'll reward those who just memorized the questions.

That won't give you a good measure of how much they learned.

😑

(3 / 19)
Instead, you decide to use different questions.

Only students who learned the material will be able to get a good score. Those who just memorized the initial set of questions will have no luck.

🤓

(4 / 19)
When building machine learning models, we follow a similar strategy.

We take a portion of the data and use it to train our model (the student).

We call this portion the "train set."

(5 / 19)
But we don't use all of the data for training!

Instead, we leave a portion of it to evaluate how much our model learned after training.

We call this portion of the data the "validation set."

(6 / 19)
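Here's a minimal sketch of that split. Python, scikit-learn, and the iris dataset are just illustrative choices on my part, not something this thread prescribes:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)  # any labeled dataset works here

# Keep 20% of the data aside as the validation set.
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print(len(X_train), len(X_val))  # 120 samples to train on, 30 to validate on
```

The model only ever trains on X_train and y_train. X_val and y_val are the "exam questions" it has never studied.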
What do you think would happen if we evaluate the model on the same data we used to train it?

Just like in our analogy, the score of the model will probably be very high.

Even if it just memorized the data, it will still score well!

This is not good.

(7 / 19)
Machine learning people usually talk about "training and validation accuracy."

Which one do you think would be higher?

The training accuracy will probably be higher: that's the model evaluated on the same data it was trained on!

(8 / 19)
Sometimes, the training accuracy is excellent, while the validation accuracy is not.

When this happens, we say the model "overfit."

This means the model memorized the training data, and when presented with the real exam (the validation set), it failed miserably.

(9 / 19)
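Here's a rough sketch of how that gap shows up in practice. The decision tree and the dataset are arbitrary choices for the example, again assuming scikit-learn:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# An unconstrained decision tree is happy to memorize the training data.
model = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)

train_acc = model.score(X_train, y_train)  # accuracy on data it has seen
val_acc = model.score(X_val, y_val)        # accuracy on data it has never seen

print(f"training accuracy:   {train_acc:.2f}")
print(f"validation accuracy: {val_acc:.2f}")
# A training accuracy far above the validation accuracy is the classic sign of overfitting.
```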
There's more.

We use the results of evaluating our models to improve them.

This is no different from a teacher pointing students in the right direction after analyzing their exam results.

(10 / 19)
We do this over and over again:

▫️ Train
▫️ Evaluate
▫️ Tweak
▫️ Repeat

What do you think will happen after we repeat this cycle too many times?

(11 / 19)
Repeat the cycle too many times, and the model will get really good at acing the evaluation.

Slowly, it will start "overfitting" to the validation set.

At some point, we will get excellent scores that don't represent the model's actual performance.

(12 / 19)
You can probably imagine the solution: we need a new validation set.

In practice, we add the old validation set to the training data, and we get a new, fresh validation set.

Remember the teacher giving you the previous year's tests for practice? Same thing.

(13 / 19)
There's something else we do.

We take another portion of the data and set it aside. We call this the "test set," and we never look at it during training.

Then we go and train and validate our model until we are happy with it.

(14 / 19)
When we finish, we use the test set for a final, proper evaluation of the model's performance.

The advantage is that the model has never seen this data, neither directly (during training) nor indirectly (during validation).

(15 / 19)
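If you want to see how the three sets can come out of the original data, here's one way to do it, again assuming scikit-learn (the 60/20/20 proportions are just a common convention, more on that below):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# First cut: set 20% aside as the test set and don't touch it again
# until the final evaluation.
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Second cut: split the remaining 80% into train and validation sets.
# 25% of the remaining 80% is 20% of the original data, so we end up
# with a 60/20/20 split overall.
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=42
)
```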
This is the best evaluation to understand the true capabilities of our model.

Once we use the test set, we never use it again to evaluate the model. We fold it back into the train set and find new data to test the model in future iterations.

(16 / 19)
I always felt that splitting the original data into multiple parts was arbitrary until I understood its importance.

Hopefully, this thread helps it click for you, too.

To finish, here are a few more notes about this.

(17 / 19)
1. In practice, the sizes of the train, validation, and test sets vary. Think of them as roughly a 60% - 20% - 20% split.

2. There are multiple ways to validate a model. Here I explained a simple split, but there are other techniques, like k-fold cross-validation (there's a quick sketch after this tweet).

(18 / 19)
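For completeness, here's what k-fold cross-validation can look like. The logistic regression model and the 5 folds are arbitrary choices for illustration, assuming scikit-learn:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)

# 5-fold cross-validation: split the data into 5 parts and train the model
# 5 times, each time validating on a different part.
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)

print(scores)                       # one validation score per fold
print(scores.mean(), scores.std())  # average performance and its spread
```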
3. The size of the dataset influences the split and the techniques I presented here. Some of them may not be practical without enough data.

4. This thread is not a paper or a scientific presentation. I'm aiming to build intuition among those who are learning this stuff.

(19 / 19)
If you enjoy these attempts to make machine learning a little more intuitive, stay tuned and check out @svpino for more of these threads.

I'm really enjoying hearing from those who tell me these explanations hit home for them. Thanks for the feedback!
Thanks to @gusthema for the inspiration to write this thread.