We studied this effect a few years ago in the context of tail latency of individual servers (https://drkp.net/papers/latency-socc14.pdf). Both the basic effect and some of the things that interfere with it are counter-intuitive or unexpected! 1/ https://twitter.com/MarcJBrooker/status/1291419061742497792
Warm-up: what's the tail latency of a single-CPU service where processing each request takes 50 us? Hint: it's not just 50 us.
Sometimes a request arrives while the CPU is busy, so it has to wait.
We can model this as an M/M/1 queue. 2/
The answer, as anyone who's run a large service knows, depends on the utilization.
At 70% utilization, the p90 latency is about 200 us, the p99 is about 400 us, and the p99.9 is about 600 us.
At 95% utilization, they're all about 10x worse. 3/
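If you want to play with the warm-up yourself, here's a rough simulation sketch (mine, not from the paper): Poisson arrivals, a fixed 50 us service time, one worker, FIFO. Swap in random.expovariate(1/50.0) for the service time if you want a strict M/M/1; exact percentiles will vary with the seed, but the effect of utilization shows up clearly.

```python
import random

def simulate_fifo(num_requests=200_000, service_us=50.0, utilization=0.7, seed=1):
    """Single worker, FIFO queue: Poisson arrivals, fixed 50 us service time."""
    random.seed(seed)
    arrival_rate = utilization / service_us   # requests per microsecond
    t = 0.0          # arrival clock
    free_at = 0.0    # time the worker next becomes idle
    latencies = []
    for _ in range(num_requests):
        t += random.expovariate(arrival_rate)   # next Poisson arrival
        start = max(t, free_at)                 # wait if the worker is busy
        free_at = start + service_us
        latencies.append(free_at - t)           # queueing delay + service time
    latencies.sort()
    return {p: round(latencies[int(p * len(latencies))], 1)
            for p in (0.50, 0.90, 0.99, 0.999)}

print(simulate_fifo(utilization=0.70))
print(simulate_fifo(utilization=0.95))
```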
What if we have more than one CPU? Queuing theory says that tail latency should improve -- now it's an M/M/k queue instead of M/M/1.
With k = 8 cores, the p99 latency should drop by about 4x, even as we process 8x as many requests. 4/
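Again not from the paper, but you can extend the same toy simulation to k workers pulling from one shared queue, scaling the arrival rate by k so per-core utilization stays the same. The tail percentiles should drop noticeably as k grows, in line with the theory.

```python
import heapq, random

def simulate_shared_queue(workers=8, num_requests=400_000, service_us=50.0,
                          utilization=0.7, seed=1):
    """One shared FIFO queue feeding `workers` identical workers."""
    random.seed(seed)
    arrival_rate = workers * utilization / service_us   # keep per-core load fixed
    free_at = [0.0] * workers       # min-heap of times each worker becomes idle
    heapq.heapify(free_at)
    t, latencies = 0.0, []
    for _ in range(num_requests):
        t += random.expovariate(arrival_rate)
        soonest = heapq.heappop(free_at)    # earliest-idle worker takes the next request
        start = max(t, soonest)
        heapq.heappush(free_at, start + service_us)
        latencies.append(start + service_us - t)
    latencies.sort()
    return {p: round(latencies[int(p * len(latencies))], 1)
            for p in (0.90, 0.99, 0.999)}

for k in (1, 8):
    print(k, "cores:", simulate_shared_queue(workers=k))
```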
But that's just theory. What if we actually try this with memcached? 5/
The ideal model says we should get a ~3x improvement going from 1 core (red line) to 4 (green).
This happens with the real measurements too! (1 core = blue, 4 = orange).
In both cases, the real numbers are still ~3x worse than the ideal; more on that later. 6/
Here's a surprise: this only happens when memcached is receiving requests over UDP. Using TCP, adding more cores doesn't help at all! 7/
Why? The M/M/k queuing model assumes that there's one queue: any core can process any request. For UDP sockets, there is only one queue! But each TCP connection is assigned to a specific core when it's opened.
That means we have 4 copies of an M/M/1 system, not an M/M/4. 8/
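Here's a hedged toy comparison of the two disciplines (my sketch, not the paper's measurement setup): the same Poisson arrival stream fed either to four workers sharing one queue (like the UDP case) or to four private queues with each request pinned to a random worker (roughly how per-connection assignment behaves). The shared queue wins badly at the tail.

```python
import heapq, random

def percentile(xs, p):
    xs = sorted(xs)
    return xs[int(p * len(xs))]

def shared_queue(workers, arrivals, service_us):
    """All workers pull from one queue (the UDP case)."""
    free_at = [0.0] * workers
    heapq.heapify(free_at)
    lat = []
    for t in arrivals:
        start = max(t, heapq.heappop(free_at))
        heapq.heappush(free_at, start + service_us)
        lat.append(start + service_us - t)
    return lat

def partitioned_queues(workers, arrivals, service_us, rng):
    """Each request is pinned to one worker's private queue (the TCP case)."""
    free_at = [0.0] * workers
    lat = []
    for t in arrivals:
        w = rng.randrange(workers)          # like a connection hashed to a core
        start = max(t, free_at[w])
        free_at[w] = start + service_us
        lat.append(start + service_us - t)
    return lat

rng = random.Random(1)
workers, service_us, util = 4, 50.0, 0.7
rate = workers * util / service_us
t, arrivals = 0.0, []
for _ in range(400_000):
    t += rng.expovariate(rate)
    arrivals.append(t)

print("shared p99:", round(percentile(shared_queue(workers, arrivals, service_us), 0.99), 1), "us")
print("partitioned p99:", round(percentile(partitioned_queues(workers, arrivals, service_us, rng), 0.99), 1), "us")
```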
Now, what about that ~3x difference from the ideal before? The queuing model assumes requests are processed FIFO without interruption. A real system has, well, interrupts.
Network activity interrupts user-space processing, causing context switching and violating FIFO. 9/
We can compensate by dedicating some cores to handle interrupts and run the kernel network stack, leaving the others to run memcached. (1:3 is about right for this cluster.)
This improves our tail latency distribution (green -> blue) and gets it close to the ideal! (red) 10/
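For the curious, here's roughly how you could set up that split on Linux. This is an illustrative sketch, not the paper's exact configuration: the NIC name (eth0), the 2+6 core split, and the memcached flags are assumptions. It needs root.

```python
import re, subprocess

# Assumptions (not from the paper): NIC is "eth0", 8 cores total,
# cores 0-1 handle NIC interrupts, cores 2-7 run memcached.
IRQ_CORES_MASK = "03"    # hex cpumask for CPUs 0-1
APP_CORES = "2-7"

# Find the IRQ numbers that belong to the NIC by scanning /proc/interrupts.
nic_irqs = []
with open("/proc/interrupts") as f:
    for line in f:
        if "eth0" in line:
            m = re.match(r"\s*(\d+):", line)
            if m:
                nic_irqs.append(m.group(1))

# Steer those interrupts onto the dedicated interrupt cores.
for irq in nic_irqs:
    with open(f"/proc/irq/{irq}/smp_affinity", "w") as f:
        f.write(IRQ_CORES_MASK)

# Run memcached only on the remaining cores.
subprocess.run(["taskset", "-c", APP_CORES, "memcached", "-u", "nobody"])
```

In practice you'd also want to stop irqbalance first, so it doesn't move the IRQs back.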
Of course, there are plenty of other tricky factors too -- background processes, scheduling policies, NUMA effects, power saving. We looked at some of these too, but I won't go into those now. 11/
All of this (especially the graphs) is from
"Tales of the Tail: Hardware, OS, and Application-level Sources of Tail Latency", with @JialinLi14, @nkrsharma, and Steve Gribble, SOCC 2014.
https://drkp.net/papers/latency-socc14.pdf
12/12
"Tales of the Tail: Hardware, OS, and Application-level Sources of Tail Latency", with @JialinLi14, @nkrsharma, and Steve Gribble, SOCC 2014.
https://drkp.net/papers/latency-socc14.pdf
12/12