1/ No one ever says “thank god we’ve got *so many* dashboards.” New @LightstepHQ features out today mean that you can toss most (but not all) of them in the trash!
(Though you should scroll below)
https://twitter.com/LightstepHQ/status/1357364821830733827

2/ I’ve mentioned before: “AIOps” (and #AIOps) makes me go grrrr. I think part of the frustration is that there’s a grain of truth in AIOps – algorithms *should* be able to help us triage pages, qualify releases, and help us understand what’s happening in production.
3/ With the features we launched today, you can toss most of them in the trash, because software can show you what matters... wait, does that sound like "AIOps"? Yes and no. First – and maybe you already knew – automation _must_ be used to *assist* human response, not replace the humans.


4/ Second, the other issue with AIOps-as-usual is that we aren't thinking hard enough about the data: we don’t have enough *labeled* data or a good objective function.
(If you don’t believe me, here’s a great talk by @tmu to walk you through it:
)
5/ “Labeled data”? “Objective function”? Effective AI in <280 chars: algorithms need to “understand” categories like good and bad. Labeling is a way of teaching that. Objective functions measure how close an algorithm is to achieving it (and guide refinement of the model).
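
(If a sketch helps: here’s a toy illustration of “labeled data” and an “objective function.” The data, the one-threshold “model,” and the numbers are all made up for illustration; this isn’t any vendor’s real code.)

```python
# Toy illustration only: "labeled data" is examples paired with a known
# good/bad judgment; an "objective function" scores how wrong a model is.

# Hypothetical labeled data: (latency_ms, label) where label 1 means "bad".
labeled = [(120, 0), (95, 0), (2400, 1), (150, 0), (1800, 1)]

def objective(threshold_ms: float) -> float:
    """Fraction of examples a one-threshold 'model' classifies incorrectly."""
    errors = sum(
        1 for latency, label in labeled
        if (latency > threshold_ms) != bool(label)
    )
    return errors / len(labeled)

# "Training" = searching for the threshold that minimizes the objective.
best = min(range(0, 3000, 50), key=objective)
print(best, objective(best))
```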
6/ What’s an “objective function” for the performance of production software systems? Whatever it is, it should measure whether or not your application is meeting the expectations of your users... (I can see @ahidalgosre nodding over there!) Yes, it’s your SLOs!
7/ (Side note: SLOs – service level objectives – are a way of measuring the quality of your service against a predetermined threshold of success. And a way of communicating with your customers about that quality and with leadership about priorities! They’re great!)
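
(For the concrete-minded, a minimal sketch of “measuring quality against a predetermined threshold of success.” The 300 ms threshold and 99% target are invented for illustration; real SLOs come from your users’ expectations.)

```python
from typing import Sequence

# Toy SLO: "99% of requests complete in under 300 ms over the window".

def slo_compliance(latencies_ms: Sequence[float], threshold_ms: float = 300.0) -> float:
    """Fraction of requests in the window that met the latency threshold."""
    good = sum(1 for latency in latencies_ms if latency < threshold_ms)
    return good / len(latencies_ms)

window = [120.0, 95.0, 210.0, 2400.0, 150.0]
print(slo_compliance(window) >= 0.99)  # are we meeting the objective right now?
```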
8/ How can we use SLOs to guide automated analysis? How do we get the computers to focus on how deployments, config changes, infrastructure rollouts, and changes in workload are affecting those SLOs?
9/ Simple: focus on those changes and how they affect SLOs. That is, use *changes in SLOs* as the way of evaluating the impact of deployments, config changes, infrastructure rollouts, and changes in workload. And *if it didn’t change, don’t fix it!*
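
(Sketching that idea in code: compare SLO compliance before and after a change, and only flag the change if the SLO actually moved. The helper names and the 1% tolerance are mine, purely illustrative; this is not how Change Intelligence is implemented.)

```python
from typing import Sequence

def slo_compliance(latencies_ms: Sequence[float], threshold_ms: float = 300.0) -> float:
    """Fraction of requests that met the latency threshold."""
    return sum(1 for latency in latencies_ms if latency < threshold_ms) / len(latencies_ms)

def change_matters(before_ms: Sequence[float], after_ms: Sequence[float],
                   tolerance: float = 0.01) -> bool:
    """Flag a change only if SLO compliance moved by more than the tolerance."""
    delta = slo_compliance(after_ms) - slo_compliance(before_ms)
    return abs(delta) > tolerance

before = [120.0, 95.0, 210.0, 150.0, 180.0]   # latencies before deployment Y
after = [450.0, 390.0, 210.0, 520.0, 180.0]   # latencies after deployment Y
if change_matters(before, after):
    print("Deployment Y moved the SLO: worth rolling back and finding the owning team.")
else:
    print("No meaningful SLO change: don't fix it!")
```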

10/ That is, rather than seeing “X is an anomaly in latency,” what I really want is for tools to say “deployment Y caused a change in latency.” That way, I immediately know what release to roll back, what code to look at, and what team to reach out to.
11/ For me, the idea of focusing on *change* has been powerful. Not just advantageous from an algorithmic perspective, but putting on my SRE hat for a moment, it also just feels reassuring to frame the problem as “what changed” and – most importantly – what changes *matter*.
12/ Yes, AIOps is still mostly a dream^H nightmare (and lots of good comments on why in my rant, er, thread on this last time!) but I think we’re actually making progress toward automating the most painful parts with Change Intelligence. (Check it out!) https://twitter.com/save_spoons/status/1352379999144218625