1/ “AIOps” – Not sure I’ve ever had a stronger reaction to a buzzword than to this union of snake oil (“AI”) and crufty tradition (“Ops”). But this isn’t just an emotional response to being replaced by robots: *it’s simply not going to work as promised.*
Thread

2/ First, false positives are particularly harmful in ops. Ask anyone who’s carried a pager and gotten woken up at 3am for something that didn’t need to be fixed. Not only do false positives create more work, they also erode trust in the tools themselves.
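(A back-of-the-envelope sketch of why this bites, with purely hypothetical numbers: when real incidents are rare, even an accurate detector pages mostly on false alarms.)

```python
# Illustrative base-rate arithmetic -- all numbers are assumptions, not data
# from the thread: with rare incidents, most pages are false alarms.

checks_per_week = 10_000      # assumed: health checks evaluated per week
real_incidents = 5            # assumed: genuine incidents in that window
true_positive_rate = 0.99     # detector catches 99% of real incidents
false_positive_rate = 0.01    # detector misfires on 1% of healthy checks

true_alerts = real_incidents * true_positive_rate
false_alerts = (checks_per_week - real_incidents) * false_positive_rate

precision = true_alerts / (true_alerts + false_alerts)
print(f"pages that are real incidents: {precision:.1%}")  # ~4.7%
```

Under those assumptions, fewer than 1 page in 20 is a real incident – and the 3am wake-ups do the rest.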
3/ Second, the software itself is constantly changing. I’ve seen organizations that make 100s or even 1000s of changes to their application every week! An algorithm trained on _last_ week’s releases is unlikely to perform in any reasonable way on _this_ week's software.
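(A minimal sketch of that failure mode, with hypothetical latencies and a simple z-score detector standing in for whatever “trained algorithm” a vendor ships: a deploy shifts the healthy baseline, and every healthy data point pages.)

```python
import statistics

# Hypothetical latencies (ms). After a deploy, the healthy baseline shifts.
last_week = [100, 102, 98, 101, 99, 103, 100, 97]  # training window
this_week = [130, 132, 128, 131, 129]              # new, healthy baseline

mean = statistics.mean(last_week)    # 100 ms
stdev = statistics.stdev(last_week)  # 2 ms

# A basic z-score check stands in for the model trained on last week's data.
for latency in this_week:
    z = (latency - mean) / stdev
    if abs(z) > 3:
        print(f"{latency} ms -> ALERT (z={z:.1f})")  # every point pages; none is a problem
```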
4/ Third, it’s critical for humans to understand the *reasons* an alert was triggered. Sometimes simple actions can be taken without knowing why, but nearly every mitigation, and every true resolution, requires a deeper understanding of what’s happened.
5/ Black-box approaches don’t help operators understand what’s happening or how to address it. Building response systems that are completely automatic – or worse, auto*magic* – is anathema to building *reliable* systems.
6/ (Don’t get me started on why any sort of automation and analysis needs to be tightly integrated into your ingestion pipeline! And it's not _just_ the economics!) https://twitter.com/lizthegrey/status/1308889038221385732
7/ Finally, whether a change is "good" or "bad" is a value judgement that requires business context – context that only a human has. For example, a 5% increase in latency may be totally acceptable if it’s part of launching a new feature on time… then again, maybe not.
8/ Is AIOps _complete_ trash? Whether you like the term or not, there certainly *is* a lot of data, so it seems like computers can and should help. But it's far from understood (at this point) what sorts of algorithms can be applied, or how they can be used to _support_ engineers.
9/ More next time on how I think AIOps *can* work, but for now, just a reminder: we all know what happens when we take the humans out of the loop (ask Matthew Broderick).