Listening to @KoltonAndrus ask "How many 9s do you need?" got me thinking: the way most people reason about 9s is flawed, because it assumes all time is equal and ignores outage length beyond aggregates. The problem is that this thinking is getting codified into SLOs.

i.e. most orgs aren't 24/7. All orgs have peak times. Short incidents are clearly less impactful than long ones. All of this is lost when you declare an SLO like "99.9% of requests over the past 2 weeks return 200 OK".
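To make that concrete, here's a toy sketch (made-up numbers, assuming a simple per-minute availability model): two services with identical two-week aggregates that look completely different to users.

```python
# Toy example: two services over a 2-week window (20,160 minutes),
# same aggregate availability, very different user impact.

WINDOW_MIN = 14 * 24 * 60  # 20,160 minutes in two weeks

downtime_a = [100]       # A: one continuous 100-minute outage during Monday peak
downtime_b = [1] * 100   # B: 100 one-minute blips scattered across nights/weekends

def availability(outages_min, window_min=WINDOW_MIN):
    """Aggregate availability: fraction of the window with no outage."""
    return 1 - sum(outages_min) / window_min

print(f"A: {availability(downtime_a):.4%}, longest outage: {max(downtime_a)} min")
print(f"B: {availability(downtime_b):.4%}, longest outage: {max(downtime_b)} min")
# Both print 99.5040% -- the aggregate can't tell a headline incident from noise.
```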

Or take our friends at Slack. They're still at 99.9% for the quarter, but the Jan 4 incident clearly made headlines... and it likely wouldn't have (or would at least have been small news) if it had happened on May 4 at midnight PST.

So one common perspective is "you don't need that many 9s". If your app is only used during business hours in the USA, then it can go down at night and on weekends... so you only need 23.8% availability (261 work days * 8 hours; math below).
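The arithmetic behind that 23.8% figure (assuming ~261 US work days a year and an 8-hour business day):

```python
# Business-hours coverage as a fraction of the whole year.
WORK_DAYS = 261            # approx. US work days per year
HOURS_PER_DAY = 8
HOURS_PER_YEAR = 365 * 24  # 8,760

business_hours = WORK_DAYS * HOURS_PER_DAY  # 2,088 hours
print(f"{business_hours / HOURS_PER_YEAR:.1%}")  # 23.8%
```
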
But here's the thing:

YOU DO NOT CONTROL WHEN OUTAGES OCCUR! You only have some influence, but you can never preclude incidents (e.g. how are those no-deploy Fridays working out for you?).
Whatever 9s you choose, you need to accept that the worst incident can happen at the least opportune time.

So when you say your goal is 99.9% and you can have 8.76 hours of downtime for the year, do you really think your execs would be OK if you had an otherwise perfect year but were down for all 8.76 hours on Black Friday or some other important peak day?
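For reference, the downtime budgets behind those 9s (straightforward arithmetic over an 8,760-hour year):

```python
# Yearly downtime budget for a given number of 9s.
HOURS_PER_YEAR = 365 * 24  # 8,760

for slo in (0.99, 0.999, 0.9999):
    print(f"{slo:.2%} -> {(1 - slo) * HOURS_PER_YEAR:.2f} hours/year")
# 99.00% -> 87.60 hours/year
# 99.90% -> 8.76 hours/year
# 99.99% -> 0.88 hours/year
```

Nothing in an aggregate yearly SLO stops that 8.76-hour budget from being spent in one contiguous block on your busiest day.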

If not, then maybe you need more 9s. Or better, maybe you should start rethinking what SLOs should look like: e.g. consider segmentation/slicing (sketch below).
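Here's one hedged sketch of what slicing could look like. This isn't any standard scheme, just an illustration: split the window into peak and off-peak slices with separate targets, so downtime is judged by when it happens (the windows, targets, and numbers are all made up).

```python
# Hypothetical sliced SLO: separate availability targets for peak vs. off-peak
# minutes, instead of one aggregate number over the whole window.
from dataclasses import dataclass

@dataclass
class Slice:
    name: str
    target: float      # required availability within this slice
    total_min: int     # minutes of the window belonging to this slice
    downtime_min: int  # observed downtime within this slice

    def ok(self) -> bool:
        return 1 - self.downtime_min / self.total_min >= self.target

# Made-up numbers: a 2-week window split into business hours vs. everything else.
slices = [
    Slice("peak (business hours)", target=0.9999, total_min=4_800, downtime_min=43),
    Slice("off-peak", target=0.99, total_min=15_360, downtime_min=43),
]

for s in slices:
    avail = 1 - s.downtime_min / s.total_min
    verdict = "OK" if s.ok() else "VIOLATED"
    print(f"{s.name}: {avail:.4%} vs target {s.target:.2%} -> {verdict}")
# The same 43 minutes of downtime passes off-peak but blows the peak slice.
```

The point isn't this exact scheme; it's that an SLO could know that 43 minutes at Monday 10am and 43 minutes at Sunday 3am are not the same event.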

I don't know what the solution is, but there's always going to be a disconnect between your pragmatic exec doing "worst outage * $ lost" calculations and you doing statistics on small, occasional incidents over time to determine what is/isn't acceptable.