We studied this effect a few years ago in the context of tail latency of individual servers (https://drkp.net/papers/latency-socc14.pdf). Both the basic effect and some of the things that interfere with it are counter-intuitive or unexpected! 1/ https://twitter.com/MarcJBrooker/status/1291419061742497792
Warm-up: what's the tail latency of a single-CPU service where processing each request takes 50 us? Hint: it's not just 50 us.
Sometimes a request arrives while the CPU is busy, so it has to wait.
We can model this as an M/M/1 queue. 2/
The answer, as anyone who's run a large service knows, depends on the utilization.
At 70% utilization, the p90 latency is about 200 us, the p99 is about 400 us, and the p99.9 is about 600 us.
At 95% utilization, they're all about 10x worse. 3/
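If you want to play with the warm-up yourself, here's a rough simulation sketch (mine, not from the paper): Poisson arrivals, a fixed 50 us service time, one worker, FIFO. Swap in random.expovariate(1/50.0) for the service time if you want a strict M/M/1; exact percentiles will vary with the seed, but the effect of utilization shows up clearly.

```python
import random

def simulate_fifo(num_requests=200_000, service_us=50.0, utilization=0.7, seed=1):
    """Single worker, FIFO queue: Poisson arrivals, fixed 50 us service time."""
    random.seed(seed)
    arrival_rate = utilization / service_us   # requests per microsecond
    t = 0.0          # arrival clock
    free_at = 0.0    # time the worker next becomes idle
    latencies = []
    for _ in range(num_requests):
        t += random.expovariate(arrival_rate)   # next Poisson arrival
        start = max(t, free_at)                 # wait if the worker is busy
        free_at = start + service_us
        latencies.append(free_at - t)           # queueing delay + service time
    latencies.sort()
    return {p: round(latencies[int(p * len(latencies))], 1)
            for p in (0.50, 0.90, 0.99, 0.999)}

print(simulate_fifo(utilization=0.70))
print(simulate_fifo(utilization=0.95))
```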
What if we have more than one CPU? Queuing theory says that tail latency should improve -- now it's an M/M/k queue instead of M/M/1.
With k = 8 cores, the p99 latency should drop by about 4x, even as we process 8x as many requests. 4/
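Again not from the paper, but you can extend the same toy simulation to k workers pulling from one shared queue, scaling the arrival rate by k so per-core utilization stays the same. The tail percentiles should drop noticeably as k grows, in line with the theory.

```python
import heapq, random

def simulate_shared_queue(workers=8, num_requests=400_000, service_us=50.0,
                          utilization=0.7, seed=1):
    """One shared FIFO queue feeding `workers` identical workers."""
    random.seed(seed)
    arrival_rate = workers * utilization / service_us   # keep per-core load fixed
    free_at = [0.0] * workers       # min-heap of times each worker becomes idle
    heapq.heapify(free_at)
    t, latencies = 0.0, []
    for _ in range(num_requests):
        t += random.expovariate(arrival_rate)
        soonest = heapq.heappop(free_at)    # earliest-idle worker takes the next request
        start = max(t, soonest)
        heapq.heappush(free_at, start + service_us)
        latencies.append(start + service_us - t)
    latencies.sort()
    return {p: round(latencies[int(p * len(latencies))], 1)
            for p in (0.90, 0.99, 0.999)}

for k in (1, 8):
    print(k, "cores:", simulate_shared_queue(workers=k))
```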
But that's just theory. What if we actually try this with memcached? 5/
The ideal model says we should get a ~3x improvement going from 1 core (red line) to 4 (green).
This happens with the real measurements too! (1 core = blue, 4 = orange).
In both cases, the real numbers are still ~3x worse than the ideal; more on that later. 6/
Here's a surprise: this only happens when memcached is receiving requests over UDP. Using TCP, adding more cores doesn't help at all! 7/
Why? The M/M/k queuing model assumes that there's one queue: any core can process any request. For UDP sockets, there is only one queue! But each TCP connection is assigned to a specific core when it's opened.
That means we have 4 copies of an M/M/1 system, not an M/M/4. 8/
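Here's a hedged toy comparison of the two disciplines (my sketch, not the paper's measurement setup): the same Poisson arrival stream fed either to four workers sharing one queue (like the UDP case) or to four private queues with each request pinned to a random worker (roughly how per-connection assignment behaves). The shared queue wins badly at the tail.

```python
import heapq, random

def percentile(xs, p):
    xs = sorted(xs)
    return xs[int(p * len(xs))]

def shared_queue(workers, arrivals, service_us):
    """All workers pull from one queue (the UDP case)."""
    free_at = [0.0] * workers
    heapq.heapify(free_at)
    lat = []
    for t in arrivals:
        start = max(t, heapq.heappop(free_at))
        heapq.heappush(free_at, start + service_us)
        lat.append(start + service_us - t)
    return lat

def partitioned_queues(workers, arrivals, service_us, rng):
    """Each request is pinned to one worker's private queue (the TCP case)."""
    free_at = [0.0] * workers
    lat = []
    for t in arrivals:
        w = rng.randrange(workers)          # like a connection hashed to a core
        start = max(t, free_at[w])
        free_at[w] = start + service_us
        lat.append(start + service_us - t)
    return lat

rng = random.Random(1)
workers, service_us, util = 4, 50.0, 0.7
rate = workers * util / service_us
t, arrivals = 0.0, []
for _ in range(400_000):
    t += rng.expovariate(rate)
    arrivals.append(t)

print("shared p99:", round(percentile(shared_queue(workers, arrivals, service_us), 0.99), 1), "us")
print("partitioned p99:", round(percentile(partitioned_queues(workers, arrivals, service_us, rng), 0.99), 1), "us")
```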
Now, what about that ~3x difference from the ideal before? The queuing model assumes requests are processed FIFO without interruption. A real system has, well, interrupts.
Network activity interrupts user-space processing, causing context switching and violating FIFO. 9/
We can compensate by dedicating some cores to handle interrupts and run the kernel network stack, leaving the others to run memcached. (1:3 is about right for this cluster.)
This improves our tail latency distribution (green -> blue) and gets it close to the ideal! (red) 10/
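For the curious, here's roughly how you could set up that split on Linux. This is an illustrative sketch, not the paper's exact configuration: the NIC name (eth0), the 2+6 core split, and the memcached flags are assumptions. It needs root.

```python
import re, subprocess

# Assumptions (not from the paper): NIC is "eth0", 8 cores total,
# cores 0-1 handle NIC interrupts, cores 2-7 run memcached.
IRQ_CORES_MASK = "03"    # hex cpumask for CPUs 0-1
APP_CORES = "2-7"

# Find the IRQ numbers that belong to the NIC by scanning /proc/interrupts.
nic_irqs = []
with open("/proc/interrupts") as f:
    for line in f:
        if "eth0" in line:
            m = re.match(r"\s*(\d+):", line)
            if m:
                nic_irqs.append(m.group(1))

# Steer those interrupts onto the dedicated interrupt cores.
for irq in nic_irqs:
    with open(f"/proc/irq/{irq}/smp_affinity", "w") as f:
        f.write(IRQ_CORES_MASK)

# Run memcached only on the remaining cores.
subprocess.run(["taskset", "-c", APP_CORES, "memcached", "-u", "nobody"])
```

In practice you'd also want to stop irqbalance first, so it doesn't move the IRQs back.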
Of course, there are plenty of other tricky factors too -- background processes, scheduling policies, NUMA effects, power saving. We looked at some of these too, but I won't go into those now. 11/
All of this (especially the graphs) is from
"Tales of the Tail: Hardware, OS, and Application-level Sources of Tail Latency", with @JialinLi14, @nkrsharma, and Steve Gribble, SOCC 2014.
https://drkp.net/papers/latency-socc14.pdf
12/12
"Tales of the Tail: Hardware, OS, and Application-level Sources of Tail Latency", with @JialinLi14, @nkrsharma, and Steve Gribble, SOCC 2014.
https://drkp.net/papers/latency-socc14.pdf
12/12