The emergent behaviour of distributed systems is complex and amazing and so much fun to debug.

At work I just debugged and fixed a problem which someone introduced to our codebase back in August 2017 - nearly 3 years ago!
The behaviour of this one was the result of an intricate interplay between:
* Our distributed locking system
* Our service config distribution system
* An event queue system
* Its client library
* The Go sync.RWMutex, and
* Go defer statements
Occasionally, the service would switch from operating normally to attempting to acquire an increasing number of distributed locks in parallel. This implied contention - something was taking locks and never letting go, and everything else trying to acquire one was blocking.
It always affected every instance of the service at once - which made some local state change inside one instance seem a less likely cause, because normally you'd expect local state corruption to only affect one pod.

But you could also always fix it by restarting all the pods
The other obvious symptom was that the number of concurrent goroutines in each of the service instances would start increasing.

In fact, once it started misbehaving, both goroutines and locks being acquired would start increasing linearly.
This, of course, isn't too much of a surprise when you think about it - acquiring a lock requires at least one goroutine, so you'd expect this sort of correlation.
The first time we noticed the bug happen was last year. It started leaking; this didn't fire any alerts, and it eventually caused problems for the distributed locking system. Once we identified the service causing the excess load, we restarted it. This resolved the incident.
The on-call team breathed a sigh of relief and went back to bed.

A tricky thing about incidents, of course, is that if things are on fire, your first priority is extinguishing the fire. Unfortunately, sometimes dousing the flames with water can cover up the source of ignition.
We still had logs and metrics, of course, but they were inconclusive; and unfortunately you can't just go and pull debug information from a dead process.

A whole bunch of people stared at the code, but nobody spotted anything obvious.
We added better alerting around these cases. The issue recurred a couple of times, but mostly out of hours. People did a bit of investigating, didn't spot anything obvious, and restarted the service to ensure things kept working.
One thing we could definitely invest in (and in fact I'd suggest anyone do so) would be a tool that, before restarting any service pods, grabs as much debugging information as it quickly can (possibly even including a single profile).
Today, however, an alert that this service appeared to be leaking goroutines popped up in one of our monitoring channels.

This meant that, for a change, we spotted it in hours, and also before it got a chance to endanger any other systems.

Time to debug!
The graphs showed:
* Increasing goroutines, as expected
* A growing backlog on our event queue, and
* More lock acquisitions going on

As expected, but not very conclusive.
I pulled a pprof goroutine trace from a pod. It showed thousands of goroutines piled up inside our distributed locking client library, as expected, and their stack traces included the event consumers. This explained why event consumption had slowed, though not much else.
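As an aside: if a service exposes the standard net/http/pprof handlers, pulling such a dump is just an HTTP GET. A minimal sketch (the host and port here are placeholders for however you reach the pod):

package main

import (
    "io"
    "net/http"
    "os"
)

func main() {
    // debug=2 asks for full, unaggregated stack traces for every goroutine.
    resp, err := http.Get("http://some-pod:6060/debug/pprof/goroutine?debug=2")
    if err != nil {
        panic(err)
    }
    defer resp.Body.Close()
    io.Copy(os.Stdout, resp.Body)
}

In practice you'd probably port-forward and use curl or go tool pprof, but it's the same endpoint either way.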
This event queue system has an important behaviour: Messages which aren't ACKed, NAKed or touched within 30 seconds get automatically placed back on the queue.

i.e. if processing a message takes longer than 30s, your handler needs to explicitly touch it at regular intervals.
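For a handler that might run long, that touching looks roughly like the sketch below. The Message interface and Touch method are stand-ins for whatever the real client library exposes, and the 10 second interval is just a comfortable margin inside the 30 second window:

// Message is a hypothetical stand-in for the client library's message type.
type Message interface {
    Touch() // resets the server's 30-second redelivery timer
}

// handleSlowly runs work() while keeping msg alive on the queue.
func handleSlowly(msg Message, work func() error) error {
    done := make(chan struct{})
    defer close(done)

    go func() {
        t := time.NewTicker(10 * time.Second)
        defer t.Stop()
        for {
            select {
            case <-t.C:
                msg.Touch() // tell the server we're still working on it
            case <-done:
                return
            }
        }
    }()

    return work() // the actual (potentially slow) processing
}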
Our client library doesn't cancel the handler's request context when a message times out like this. This is probably a bug - it probably should - but it doesn't. Part of the reason is likely that the server doesn't give you any indication that the message has timed out.
So I formed an initial hypothesis:

* Some messages took >30s to process
* The server timed them out and re-queued them to another instance
* This didn't cause the lock to be released
* The other pod was now blocked on the same lock
* This process repeated
I didn't know whether messages were utterly stuck, or just being processed very slowly. If they were being processed slowly enough, then this could still cause cascading issues across the pods

I just needed to explain why the first message got stuck.
So, what happens in the code while the distributed lock is held?

Part of it was that the code obtained an instance of the client for a 3rd party API:

cli, release := client.Obtain(...)
defer release()
Obtaining a client took a reader lock (RWMutex.RLock()). The release function did the corresponding unlock (RWMutex.RUnlock())
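Under the hood that presumably looks something like this simplified sketch (not the real code; clientHolder and apiClient are stand-ins for illustration):

type clientHolder struct {
    mu  sync.RWMutex
    cli *apiClient // hypothetical 3rd party API client type
}

// Obtain hands out the current client under a read lock. The caller must call
// the returned release function exactly once when finished with the client.
func (h *clientHolder) Obtain() (*apiClient, func()) {
    h.mu.RLock()
    return h.cli, h.mu.RUnlock
}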

All good and balanced, except... this code was inside a loop

for _, foo := range bar {
    c, release := client.Obtain(...)
    defer release()
    // use c
}
And where and when do Go defer statements execute? When you exit the enclosing function.
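A tiny standalone demo of that semantic:

package main

import "fmt"

func main() {
    for i := 0; i < 3; i++ {
        defer fmt.Println("released", i)
    }
    fmt.Println("loop finished")
    // Prints "loop finished" first, then "released 2", "released 1",
    // "released 0": none of the deferred calls run until main returns.
}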

And so if multiple trips through this loop obtained the same client, one goroutine would take the same RLock twice.

And in fact they always returned the same client
The docs for RWMutex state that, once someone is blocked acquiring it for writing, new readers are blocked from acquiring it for reading. Or, more bluntly: recursively read locking the same RWMutex is prone to deadlock.

But it'll only hit you if something tries to take the mutex for writing
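The whole failure mode fits in a toy program (a standalone reproduction, not our code):

package main

import (
    "sync"
    "time"
)

func main() {
    var mu sync.RWMutex

    mu.RLock() // first trip through the loop: read lock held

    // "Something" asks for the write lock; it blocks behind the outstanding reader.
    go func() { mu.Lock() }()
    time.Sleep(100 * time.Millisecond) // crude, but lets the writer start waiting

    // Second trip through the loop: this RLock blocks behind the pending writer,
    // the writer blocks behind the first RLock, and nothing can ever progress.
    // The runtime kills this toy with
    // "fatal error: all goroutines are asleep - deadlock!"
    mu.RLock()
}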
And that was the last bit. The client library portion of the service watched for changes to the service configuration and, when they happened, updated the client, taking the write lock.
So an instance was processing a message and had taken the read lock; the config watcher then observed a config change and tried to take the write lock. The message handler attempted to take the read lock again, and deadlocked.

Every other message handler in the same instance then crashed into the same deadlock
Their messages got redistributed to other service pods by the message queue. The other pods then attempted to acquire the distributed lock on a resource, but the original pod was still holding onto it. Their handlers were now also deadlocked
The service has an event consumption rate limit, but the maximum number of messages that may be in flight at once is managed by the queue server.

But it doesn't count messages which it has timed out, so the number could still creep up.
As for the config behaviour: Surely we would have noticed that changing the service's config caused it to break? Yes, if we'd been changing the service's config.

But someone else changed platform-wide config that the service inherited.
We don't change platform config often, so the issues were only triggered rarely (and even then it required losing a race!); and the team who generally change platform config are not the same ones who maintain this service.
The final mystery, of course, is why my goroutine dump didn't tell me any of this.

The unfortunate but obvious answer is that the service runs 3 replicas and I'd randomly grabbed a profile from a replica which was fine.
In fact, on this occasion the platform team were doing work which required making multiple changes in quick succession, causing two of the three instances to deadlock shortly after one another.
I'd gotten an unlucky roll of the dice, and it sent me down a couple of other paths of investigation. When they proved fruitless, I decided to revisit the possibility that I'd simply profiled a healthy pod, and look for other data.
Our general service dashboard has a graph of event consumption rate, but doesn't break it down by pod.

But once you did break it down, it became really obvious that something dramatic had happened to two of the pods when things started going wrong.
And if you pulled goroutine dumps from one of those two pods, well... you found thousands of goroutines piled up calling RLock on the client, plus a single one in the change-observing loop trying to take the write lock.
So the root cause was a deadlock in a single instance of the service (or, here, in two of the three), and this caused lock contention in our distributed locking system when another instance tried to take a lock on the same resource
But that alone wasn't enough to cause sudden and substantial degradation and evolve this into an incident. If the queue client library had cancelled contexts, these pods would have eventually given up on trying to acquire those locks.
If the re-queueing behaviour hadn't existed, then those first few cases would have gotten completely stuck but had little further impact.

It took the combined behaviour and (mis)features of the three to cause the problem to grow into an incident
And finally, if inherited config didn't exist, or the client library had checked that its configuration had actually changed before it tried to update the client object, then changes to the platform-wide config wouldn't have had any impact.
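That check is cheap. A hedged sketch, extending the earlier clientHolder example with a lastRelevant field, and assuming the relevant slice of config is a comparable value updated by a single watcher goroutine (Config, extractRelevant and newClientFrom are all hypothetical):

func (h *clientHolder) onConfigChange(newCfg Config) {
    relevant := extractRelevant(newCfg)
    if relevant == h.lastRelevant {
        return // nothing we actually depend on changed; don't go near the lock
    }
    h.mu.Lock()
    defer h.mu.Unlock()
    h.cli = newClientFrom(relevant)
    h.lastRelevant = relevant
}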
This bug had existed for 3 years and puzzled us for 1. Once the ideal opportunity to debug it arose, it took me about an hour to an hour and a half to track down, and required extensive knowledge of, and investigation into, the behaviour of various platform components.
Once the cause was identified, the fix was trivial, as they often are: just moving a couple of lines outside the loop. Embarrassingly, it could probably have been caught if `go vet` warned about defer statements in loops.
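In other words, roughly this shape (a sketch of the change, not the literal diff):

cli, release := client.Obtain(...)
defer release()
for _, foo := range bar {
    // use cli (and foo)
}

With Obtain hoisted out of the loop, each handler takes the read lock exactly once, and the matching deferred release still runs when the function returns.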
But I think it's a really neat example of how big systems are complicated, and of how hard it is to figure out how everything will interact until you see what they do in practice.
~ FIN ~
(p.p.s. I'm currently on the job market; if debugging these sorts of problems is among the skills you're looking for, perhaps drop me a message?) https://twitter.com/erincandescent/status/1268581437667774464