I just had coffee and I'm waiting for lunch to cook, so let's talk about jitter and exponential backoff. 🧵
Background: "backoff" is when you wait before retrying a failed action (say, re-making a request if the backend was unreachable). The generally recommended approach is exponential backoff, where you double the delay after each failure, until you hit some max delay.
Exponential backoff reduces the pressure buildup on a system. If a backend goes down, you don't want to be completely hammering it in every consumer when it comes back.
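To make that concrete, here's a minimal sketch of capped exponential backoff. The base delay, growth factor, and cap are made-up numbers for illustration, not from any particular library:

```python
def backoff_delays(base=1.0, factor=2.0, max_delay=60.0):
    """Yield exponentially growing delays, capped at max_delay."""
    delay = base
    while True:
        yield min(delay, max_delay)
        delay *= factor

# Usage sketch (try_request is hypothetical):
# for delay in backoff_delays():
#     if try_request():
#         break
#     time.sleep(delay)
```

With these defaults the delays go 1, 2, 4, 8, 16, 32 seconds, then sit at the 60-second cap.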
But this can still easily result in a "thundering herd" - when retries overwhelm the system and delay recovery. Suppose the backend fails more or less all at once. Every client starts a backoff cycle at about the same time... and retries at the same time.
Instead of smearing those requests over the time period (say, a client makes a request every minute on average, which means $numClients/60 requests a second), you get those requests all bunched up (say, closer to $numClients requests in specific seconds).
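Quick back-of-envelope with invented numbers (600 clients, one request per minute each):

```python
num_clients = 600   # hypothetical fleet size
period_s = 60       # each client makes one request per minute on average

# Spread evenly, the backend sees a steady trickle:
smeared_rate = num_clients / period_s
print(smeared_rate)  # 10.0 requests/sec

# After a synchronized failure with no jitter, those same 600 requests
# can all land within the same one-second window instead.
```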
You solve for this with jitter - adding random delays to the backoff. A typical implementation might be "each delay is double the last, +/- 10%". With jitter, retries form more of a smooth curve than a sharp up-and-down spike.
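A sketch of that "double the last, +/- 10%" scheme (again, the numbers and function name are mine, not from any specific library):

```python
import random

def jittered_backoff(prev_delay, max_delay=60.0, jitter=0.10):
    """Double the previous delay, cap it, then randomize by +/- jitter."""
    delay = min(prev_delay * 2, max_delay)
    return delay * random.uniform(1 - jitter, 1 + jitter)
```

Because each client draws its own random factor, their retry times drift apart instead of staying in lockstep.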

This is fairly well known (but if it's new to you, no shame!).
The risk of backoff without jitter that most people _don't_ realize is coordinated recovery.

Say we have a "backend fails roughly all at once" scenario. If it lasts for a while, our clients will eventually hit their max backoff (probably on the order of minutes between retries).
With no jitter, they're retrying at the same time. Let's say that the load of those retries is a non-issue (which can easily be the case).

The backend is restored after a while.

But all clients still take minutes to retry and succeed, because they still have $n mins of delay.
This introduces an artificial recovery delay (on average, half of the max backoff time). I've seen this happen in the real world plenty of times (e.g. with Kubernetes CrashLoopBackOff and requests of death or hard dependencies).
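You can see where that "half of max backoff" figure comes from with a toy simulation (numbers invented: a 5-minute max backoff, clients retrying in perfect lockstep):

```python
import random

def avg_recovery_delay(max_backoff=300.0, trials=100_000, seed=1):
    """Simulate coordinated recovery: all clients retry in lockstep,
    once every max_backoff seconds. The backend comes back at a uniformly
    random point in that cycle, but clients only succeed at the *next*
    synchronized retry - so the mean extra wait is ~max_backoff / 2."""
    rng = random.Random(seed)
    total_wait = 0.0
    for _ in range(trials):
        recovery_point = rng.uniform(0, max_backoff)
        total_wait += max_backoff - recovery_point
    return total_wait / trials
```

With a 300-second max backoff, the simulated average lands right around 150 seconds of avoidable downtime.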

The higher the max backoff, the worse it is.
Jitter avoids this too, by smearing out retries. You see a gradual recovery right away, instead of a sudden recovery after an artificial delay.

Takeaway: think about what a system will be doing in a failure state.