i close my laptop at 5pm and go home https://twitter.com/walfieee/status/953848431184875520
i know i say "the purpose of a system is what it does" a whole bunch, but most of the production outages i have lived through have a simple cause: the system is doing what it was designed to do under failure
under the guise of "simplicity", or eschewing complexity, or even troubling aesthetics, programmers tend to avoid implementing error handling like fail fast and back pressure, and so predictably build distributed systems that ddos themselves
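for what it's worth, fail fast plus back pressure can be very little code. a rough python sketch, not anyone's production system: `BoundedInbox`, `Overloaded` and the capacity are made-up names for illustration. the only real api here is the standard library's `queue.Queue`, whose `put_nowait` raises `queue.Full` when the queue is at `maxsize`

```python
import queue


class Overloaded(Exception):
    """raised when we refuse new work instead of queueing it forever"""


class BoundedInbox:
    """hypothetical work queue that rejects jobs once it is full"""

    def __init__(self, capacity):
        # a bounded queue is the back pressure: it cannot grow without limit
        self._q = queue.Queue(maxsize=capacity)

    def submit(self, job):
        try:
            # fail fast: never block the caller, never buffer past capacity
            self._q.put_nowait(job)
        except queue.Full:
            raise Overloaded("inbox full, slow down and try later") from None

    def take(self):
        # raises queue.Empty if there is nothing to do
        return self._q.get_nowait()
```

the point is that the caller finds out straight away that the system is full, and gets to decide whether to slow down, shed load, or give up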
i remember reading about an akamai or erlang outage where one of the nodes failed, so the engineers just turned it off properly, and left it overnight, and the system continued as-is, just under capacity

it was almost revelatory, systems built with operations in mind
meanwhile at my last company, i was confronted with the opposite, something was hammering our system, latencies had spiked, and no-one could decode the graphs on the dashboard to divine the source of the problem
in the end it was discovered that another team was performing bulk loads, and when the system began to error, the bulk load retried, instantly
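the usual alternative to retrying instantly is to wait longer after each failure, with some randomness so the clients don't all come back at once. a minimal python sketch, assuming a capped exponential backoff with full jitter: `call_with_retries`, `backoff_delays` and the default numbers are hypothetical, i have no idea what that team's loader actually looked like

```python
import random
import time


def backoff_delays(count, base=0.1, cap=10.0, rng=random.random):
    """yield `count` delays: a random fraction of a capped, doubling ceiling"""
    for attempt in range(count):
        yield rng() * min(cap, base * 2 ** attempt)


def call_with_retries(op, attempts=5, base=0.1, cap=10.0,
                      sleep=time.sleep, rng=random.random):
    """call op() up to `attempts` times, sleeping between failures"""
    delays = backoff_delays(attempts - 1, base, cap, rng)
    while True:
        try:
            return op()
        except Exception:
            delay = next(delays, None)
            if delay is None:
                # out of retries: give up loudly instead of hammering on
                raise
            sleep(delay)
```

a real client would only retry errors that are worth retrying, but even this much means a struggling system gets some room to recover instead of more load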

our system was falling over because a cache was getting exhausted, something the bulk import contributed to
i guess, in the end, the escape room metaphor works because it's often not by your own volition that you're confronted with these problems

that said, the next day you're back in the puzzle box but there's a new graph on the dashboard as an achievement trophy
You can follow @tef_ebooks.