Everything you need to know about the batch size when training a neural network.

(Because it really matters, and understanding it makes a huge difference.)

A thread.
Gradient Descent is an optimization algorithm to train neural networks.

On every iteration, the algorithm computes how much we need to adjust the model to get closer to the results we want.

2/
We take samples from the training dataset, run them through the model, and determine how far away our results are from the ones we expect.

We call this "error," and using it, we compute how much we need to update the model weights to improve the results.
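
Here is a minimal sketch of that computation, using a toy linear model and made-up data (every name and number below is illustrative, not tied to any specific library):

```python
# A sketch of a single iteration: predict, measure the error, adjust the weights.
# Toy linear model and made-up data; every name here is illustrative.
import numpy as np

rng = np.random.default_rng(0)
samples = rng.normal(size=(10, 3))               # a handful of training samples
targets = samples @ np.array([1.0, -2.0, 0.5])   # the results we expect

weights = np.zeros(3)                            # the model we are training
learning_rate = 0.1

predictions = samples @ weights                  # run the samples through the model
error = predictions - targets                    # how far we are from what we expect
gradient = samples.T @ error / len(samples)      # how much each weight should change
weights -= learning_rate * gradient              # update the model
```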

3/
A critical decision we need to make is how many samples we use on every iteration.

We have three choices:

▫️ Use a single sample of data.
▫️ Use all of the data at once.
▫️ Use some of the data.

4/
Using a single sample of data on every iteration is called "Stochastic Gradient Descent" (SGD.)

The algorithm uses one sample at a time to compute the updates.
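
A sketch of what that loop looks like on the same kind of toy linear model (illustrative names and data):

```python
# A sketch of Stochastic Gradient Descent: one sample per weight update.
# Toy linear model and made-up data; every name here is illustrative.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5])

w = np.zeros(3)
lr = 0.01

for epoch in range(10):
    for i in rng.permutation(len(X)):   # shuffle, then go one sample at a time
        error = X[i] @ w - y[i]         # error on this single sample
        w -= lr * error * X[i]          # immediate (and noisy) update
```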

5/
Advantages of Stochastic Gradient Descent:

▫️ Faster learning on some problems.
▫️ The algorithm is simple to understand.
▫️ Avoids getting stuck in local minima.
▫️ Provides immediate feedback.

6/
Disadvantages of Stochastic Gradient Descent:

▫️ Computationally inefficient (one update per sample, no vectorization).
▫️ May not settle in the global minimum.
▫️ The error will jump around a lot (noisy updates.)

7/
Using all the data at once is called "Batch Gradient Descent."

The algorithm runs the entire dataset through the model and computes a single update after processing all of the samples.
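
The same toy problem, but now computing one update from the whole dataset (a sketch, with illustrative names and data):

```python
# A sketch of Batch Gradient Descent: the entire dataset for every update.
# Toy linear model and made-up data; every name here is illustrative.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5])

w = np.zeros(3)
lr = 0.1

for epoch in range(100):
    error = X @ w - y                # error over every sample at once
    gradient = X.T @ error / len(X)  # average gradient across the whole dataset
    w -= lr * gradient               # a single, stable update per pass
```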

8/
Advantages of Batch Gradient Descent:

▫️ Computationally efficient.
▫️ Stable performance (less noise.)

9/
Disadvantages of Batch Gradient Descent:

▫️ Requires a lot of memory.
▫️ Slower learning process.
▫️ May get stuck in local minima.

10/
Using some of the data (more than one sample but fewer than the entire dataset) is called "Mini-Batch Gradient Descent."

The algorithm works like Batch Gradient Descent, with the only difference being that we use fewer samples on every update.
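
The same toy problem once more, now updating after every batch_size samples (a sketch, with illustrative names and data):

```python
# A sketch of Mini-Batch Gradient Descent: a fixed-size slice of data per update.
# Toy linear model and made-up data; batch_size is the new hyperparameter.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5])

w = np.zeros(3)
lr = 0.05
batch_size = 32

for epoch in range(20):
    order = rng.permutation(len(X))              # shuffle every epoch
    for start in range(0, len(X), batch_size):
        batch = order[start:start + batch_size]  # the next mini-batch of samples
        error = X[batch] @ w - y[batch]
        gradient = X[batch].T @ error / len(batch)
        w -= lr * gradient                       # one update per mini-batch
```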

11/
Advantages of Mini-Batch Gradient Descent:

▫️ Avoids getting stuck in local minima.
▫️ More computationally efficient than SGD.
▫️ Doesn't need as much memory as Batch Gradient Descent.

12/
Disadvantages of Mini-Batch Gradient Descent:

▫️ New hyperparameter to worry about.

We usually call this hyperparameter "batch_size."

This is one of the most influential hyperparameters in how well a neural network trains.
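
In practice, you rarely write the update loop yourself. You just hand batch_size to your framework. Here is a sketch of how that typically looks in Keras (the model and data are made up, and this assumes TensorFlow is installed):

```python
# A sketch (assumes TensorFlow/Keras is installed): batch_size decides how many
# samples feed each weight update. Model and data here are made up.
import numpy as np
import tensorflow as tf

x = np.random.rand(1000, 20).astype("float32")
y = np.random.rand(1000, 1).astype("float32")

model = tf.keras.Sequential([
    tf.keras.Input(shape=(20,)),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(1),
])
model.compile(optimizer="sgd", loss="mse")

# 1,000 samples with batch_size=32 -> 32 weight updates per epoch.
model.fit(x, y, epochs=5, batch_size=32)
```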

13/
Batch Gradient Descent is rarely used in practice, especially in deep learning, where datasets are huge.

Stochastic Gradient Descent (using a single sample at a time) is not that popular either.

Instead, Mini-Batch Gradient Descent is the most widely used.

14/
But of course, we like to make things confusing.

In practice, we call it "Stochastic Gradient Descent" regardless of the batch size.

When you hear somebody mention "SGD," keep in mind that they are probably referring to mini-batches of samples (not just a single one.)
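
PyTorch is a good example of this naming: the optimizer is literally called SGD, even though it almost always receives mini-batches. A sketch, assuming PyTorch is installed (the model and data are made up):

```python
# A sketch (assumes PyTorch is installed): the optimizer is called "SGD" even
# though the DataLoader feeds it mini-batches of 32 samples, not single samples.
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

X = torch.randn(1000, 20)
y = torch.randn(1000, 1)
loader = DataLoader(TensorDataset(X, y), batch_size=32, shuffle=True)

model = nn.Linear(20, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = nn.MSELoss()

for epoch in range(5):
    for batch_x, batch_y in loader:              # 32 samples at a time
        optimizer.zero_grad()
        loss = loss_fn(model(batch_x), batch_y)
        loss.backward()
        optimizer.step()                         # "SGD", applied to a mini-batch
```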

15/
There's a lot of research around the optimal batch size.

Every problem is different, but empirical evidence suggests that smaller batches tend to perform better.

(Small as in fewer than a hundred samples or so.)

16/
To make it even more concrete:

▫️ "(...) 32 is a good default value."

Here is the paper I'm quoting. A good read if you want more of this: https://arxiv.org/abs/1206.5533 

17/
TL;DR, to finish the thread:

▫️ Mini-Batch Gradient Descent is the way to go.
▫️ batch_size = 32 is a good starting point.
▫️ Don't be lazy. Read the thread.
For more "breaking machine learning down" threads, stay tuned and check out @svpino.