The 97 Things Every SRE Should Know book (https://www.oreilly.com/library/view/97-things-every/9781492081487) looks interesting and covers a number of short, 1-3 page topics. I'm going to try to read a few a week and make some notes. #sre-97
The random number generator came up with "42 - Why I Hate Our Playbooks" by Frances Rees first, and it's a good overview of playbooks.
Quotables include - "Any playbook that can describe the exact steps to resolve an exact circumstance should be an automated script instead."
and
"We escalate to humans for a complex response, not a fast response."
It also has some guidance:
Ideally, a playbook should only contain:
* Why do I care? Severity and qualification of the user-visible impact.
* What can I look at? Consoles, logs, and inspection tools.
* What can I do? Mitigation tooling.
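As a rough sketch of how those three questions could become a page template, here is a small Python structure. The class name `Playbook`, its field names, and the `missing_sections` helper are my own assumptions for illustration and are not from the book.

```python
from dataclasses import dataclass, field


@dataclass
class Playbook:
    """Hypothetical playbook skeleton holding only the three sections above."""

    alert: str
    why_do_i_care: str = ""  # severity and user-visible impact
    what_can_i_look_at: list[str] = field(default_factory=list)  # consoles, logs, tools
    what_can_i_do: list[str] = field(default_factory=list)  # mitigation tooling

    def missing_sections(self) -> list[str]:
        """Return the headings of any sections that are still empty."""
        sections = {
            "Why do I care?": self.why_do_i_care.strip(),
            "What can I look at?": self.what_can_i_look_at,
            "What can I do?": self.what_can_i_do,
        }
        return [name for name, value in sections.items() if not value]

    def render(self) -> str:
        """Render the playbook as a short plain-text page."""
        return "\n".join(
            [
                f"Playbook: {self.alert}",
                f"* Why do I care? {self.why_do_i_care}",
                "* What can I look at? " + ", ".join(self.what_can_i_look_at),
                "* What can I do? " + ", ".join(self.what_can_i_do),
            ]
        )
```

Something like `missing_sections()` could run as a check when a page is saved, so a playbook with no mitigation tooling listed gets flagged before anyone is paged with it.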
And a great summary workflow:
|-> 1 Identify issue
| 2 Debug
| 3 Add alerts
| 4 Write documentation
| |-> 5 Automate resolution
| |<- 6 Update documentation
|<- 7 Have a different problem
My own view on playbooks is that you get out what you put in, and they are often a last-minute band-aid rather than a full part of the product.
Runbooks should have a lifecycle. A time to be useful and a time to die / be automated away. The trick is knowing what stage one's at.
I like to capture usage and relevance information on documentation like this. A simple thumbs up / thumbs down on each page gets you started, but ideally you'd track a little more.
Has the page ever been read? If so, how long ago? Was it actually used? When was it last reviewed? A "helped" / "didn't help" checkbox and a comment box cover the basics.
It's important that any feedback you capture is as frictionless as possible and immediately actionable. Don't make people switch tabs to a doc review system, for example.
Anything that adds friction will stop people responding and IMHO it's better to gather some data than none.
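To make the tracking idea concrete, here is a minimal sketch of a per-page record in Python. Everything in it is an assumption for illustration: the `PageFeedback` name, its fields, and especially the 180-day staleness threshold, which is an arbitrary placeholder rather than a recommendation.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class PageFeedback:
    """Hypothetical usage record for a single runbook page."""

    page: str
    last_read: Optional[datetime] = None
    last_reviewed: Optional[datetime] = None
    helped: int = 0
    did_not_help: int = 0
    comments: list[str] = field(default_factory=list)

    def record_read(self, when: datetime) -> None:
        """Note that someone opened the page."""
        self.last_read = when

    def record_vote(self, helped: bool, comment: str = "") -> None:
        """Store a one-click "helped" / "didn't help" vote and an optional comment."""
        if helped:
            self.helped += 1
        else:
            self.did_not_help += 1
        if comment:
            self.comments.append(comment)

    def needs_attention(
        self, now: datetime, max_age: timedelta = timedelta(days=180)
    ) -> list[str]:
        """List the reasons this page deserves a look (assumed 180-day threshold)."""
        reasons = []
        if self.last_read is None:
            reasons.append("never read")
        elif now - self.last_read > max_age:
            reasons.append("not read recently")
        if self.last_reviewed is None or now - self.last_reviewed > max_age:
            reasons.append("review overdue")
        if self.did_not_help > self.helped:
            reasons.append("more 'didn't help' than 'helped' votes")
        return reasons
```

The vote is a single call with an optional comment, which keeps the capture side low-friction, and `needs_attention()` turns the raw data into a short list someone can act on.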