Learning Rate 📉

During training of a neural network, we minimize a loss function using gradient descent.

In each iteration we adapt the weights by changing them a little in the direction opposite to the gradient. The size of this change is defined by the learning rate.
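
In code, one gradient descent step looks roughly like this (a minimal sketch in plain NumPy; the numbers are made up just to show the update):

```python
import numpy as np

def gradient_descent_step(weights, gradients, learning_rate=0.01):
    # Step against the gradient; the learning rate scales how big the step is
    return weights - learning_rate * gradients

# Made-up weights and gradients, just to illustrate the update
w = np.array([0.5, -1.2])
g = np.array([0.1, -0.3])
w = gradient_descent_step(w, g, learning_rate=0.1)
```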

Details 👇
Imagine you are on the top of a mountain and want to get down to the lodge in the valley. The learning rate defines the speed at which you will be going down. ⛷️

A high speed means that you may breeze past the lodge and need to go back. Going too slowly will cost you lots of time.
It is similar with the learning rate...

🐘 If it is too high, the optimization will be unstable, jumping around the optimum.

🐁 If it is too low, the optimization will take too long to converge.

So how do you find the best learning rate... 🤔

You have several options 👇
As usual with hyperparameter tuning, it will require some trial and error. You try different approaches on your training dataset and check how well they perform on the validation set.

You can use the following strategies:
- Parameter search
- Schedule
- Adaptive learning rate
Parameter search 🔎

The easiest way is to just try different values and see how it goes. You should select them on an exponential scale: 1, 0.1, 0.01, 0.001... you get the idea...

If the loss goes down, but slowly - increase ⬆️

If the loss starts oscillating - decrease ⬇️
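
A minimal, self-contained sketch of such a search on a made-up toy regression problem (in practice you would train your real model and measure the loss on a held-out validation set):

```python
import numpy as np

rng = np.random.default_rng(0)
X, y = rng.normal(size=(100, 3)), rng.normal(size=100)    # toy data, illustrative only

def train_and_evaluate(lr, epochs=50):
    # Tiny linear model trained with gradient descent on the toy data above
    w = np.zeros(3)
    for _ in range(epochs):
        grad = 2 * X.T @ (X @ w - y) / len(y)              # gradient of the mean squared error
        w -= lr * grad
    loss = np.mean((X @ w - y) ** 2)
    return loss if np.isfinite(loss) else float("inf")     # diverged runs count as "bad"

# Try exponentially spaced values and keep the one with the lowest loss
learning_rates = [1.0, 0.1, 0.01, 0.001, 0.0001]
best_lr = min(learning_rates, key=train_and_evaluate)
print("Best learning rate:", best_lr)
```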
Schedule ⏰

A schedule allows you to decay the learning rate over time. You start with a high rate to get close to the optimum fast, but decrease it after that to be more precise.

The LR can be reduced based on the epoch or based on another criterion, like the validation loss hitting a plateau.
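
With PyTorch's built-in schedulers, this could look roughly like this (a minimal sketch; the model and the training loop are just placeholders):

```python
import torch

model = torch.nn.Linear(10, 1)                           # stand-in model
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

# Epoch-based decay: multiply the LR by 0.1 every 30 epochs
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=30, gamma=0.1)

# Criterion-based decay: reduce the LR when the validation loss plateaus
# scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, factor=0.1, patience=10)

for epoch in range(100):
    # ... forward pass, loss.backward() and optimizer.step() would go here ...
    scheduler.step()    # StepLR; ReduceLROnPlateau would need scheduler.step(val_loss)
```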
Momentum 🎢

You can also apply momentum. Just like a ball rolling down a hill, the optimization will accelerate if it is steadily going down.

Momentum will also help the optimizer jump over small local minima and find a better optimum.
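
In PyTorch, momentum is just an extra argument to SGD (a minimal sketch; 0.9 is a common choice, not a universal recommendation):

```python
import torch

model = torch.nn.Linear(10, 1)                           # stand-in model

# The optimizer keeps a running "velocity" of past gradients, roughly:
#   velocity = momentum * velocity + gradient
#   weights  = weights - lr * velocity
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
```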
Adaptive Learning Rate 📉

There are also optimizers that adapt the learning rate automatically based on the gradients. They introduce other hyperparameters (e.g. the forgetting factor), but they are usually easier to tune.

Prominent examples:
- Adagrad
- RMSProp
- Adam ⭐️
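
A minimal sketch of how these look in PyTorch (the values shown are the library defaults, just for illustration):

```python
import torch

model = torch.nn.Linear(10, 1)                                                   # stand-in model

# Pick one; each adapts the step size per parameter based on past gradients
optimizer = torch.optim.Adagrad(model.parameters(), lr=0.01)
optimizer = torch.optim.RMSprop(model.parameters(), lr=0.01, alpha=0.99)          # alpha ≈ forgetting factor
optimizer = torch.optim.Adam(model.parameters(), lr=0.001, betas=(0.9, 0.999))
```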
You can follow @haltakov.