I want to talk a bit about what this was like.
tl;dr: it was *long* and the timing was inconvenient, but as an operations team, not particularly stressful. The questions were about when systems would come back, not how or whether. A lot of waiting and watching, and that's desirable. https://twitter.com/ben11kehoe/status/1332028868740354048
Wednesday, there was not much to do. AWS IoT was hit hard by the Kinesis outage, which meant lots of stuff was simply not going to work. And the CloudWatch outage meant we couldn’t see what was and wasn’t working cloud-side.
The resolution to the main outage came late in the evening. This meant we were up late because we needed to be there when we had visibility again into Lambda logs and DynamoDB metrics. If I had known that would be 6 hours later, I would have sent everyone to bed first.
Once we had that, we had to up-provision a couple of DDB tables for the extra backfilling traffic, and there was one issue we needed developer feedback for. So we were up early to meet with them to resolve it before higher traffic later in the day.
That ended up being a one-line code change and a tweak to an env var.
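The up-provisioning itself is routine: bump the provisioned throughput on the affected tables ahead of the backfill. A minimal sketch of that kind of call, assuming the tables use provisioned (not on-demand) billing; the table names and capacity numbers here are made up, not our real values:

```python
# Hypothetical sketch of up-provisioning DynamoDB tables before backfill traffic.
# Table names and capacity units are illustrative only.
import boto3

dynamodb = boto3.client("dynamodb")

TABLES = {
    "device-state": {"read": 500, "write": 2000},
    "event-log": {"read": 200, "write": 1000},
}

for table_name, capacity in TABLES.items():
    # Raise provisioned throughput ahead of the extra traffic; this only
    # applies to tables in PROVISIONED billing mode.
    dynamodb.update_table(
        TableName=table_name,
        ProvisionedThroughput={
            "ReadCapacityUnits": capacity["read"],
            "WriteCapacityUnits": capacity["write"],
        },
    )
    print(f"Requested capacity update for {table_name}")
```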
Then it was waiting on follow-on effects in AWS services to get worked out, which took another 8 hours before everything was back to normal.
The point here is that we were never feverishly working away trying to fix something that was broken. We were calmly and methodically evaluating the state of the system, and being on hand to take the next steps when something changed.
If this happened often, we’d probably have automated detection and notification of “something changed” for an incident and gone to sleep while waiting for that. This is the first incident of any kind that’s kept us up all night.
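If we ever built that, it would probably look something like a CloudWatch alarm on the relevant error metrics wired to an on-call SNS topic, firing both when things break and when they recover. A rough sketch only; the function name, topic ARN, and thresholds below are all hypothetical:

```python
# Hypothetical "something changed" watcher: alarm on Lambda errors, notify
# an on-call SNS topic on both ALARM and OK transitions.
import boto3

cloudwatch = boto3.client("cloudwatch")

ON_CALL_TOPIC = "arn:aws:sns:us-east-1:123456789012:on-call"  # hypothetical ARN

cloudwatch.put_metric_alarm(
    AlarmName="incident-watch-backfill-errors",
    Namespace="AWS/Lambda",
    MetricName="Errors",
    Dimensions=[{"Name": "FunctionName", "Value": "backfill-processor"}],
    Statistic="Sum",
    Period=300,                  # evaluate 5-minute windows
    EvaluationPeriods=1,
    Threshold=0,
    ComparisonOperator="GreaterThanThreshold",
    TreatMissingData="notBreaching",
    # Notify when the alarm fires *and* when it clears, since "back to
    # normal" is the state change we were actually waiting on.
    AlarmActions=[ON_CALL_TOPIC],
    OKActions=[ON_CALL_TOPIC],
)
```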
All of this is to say that this experience, unpleasant as it was, changes none of my thinking about how to build or who to build it on.
With #serverless, you’re trading the ability to take positive action in any given incident *for vastly fewer incidents*
For the operations team during an incident, you’re getting rid of a major source of stress, because the majority of the responsibility for fixing what’s broken is on the experts in the given system that’s broken. They are good at fixing it, better than we would be.
I’m absolutely not saying you shouldn’t make changes in your architecture in response to this outage. But be deliberate about it, and always focus on the total cost of ownership (TCO) of those decisions.