I’ve diagnosed and resolved hundreds of bugs in my 5 years at Amazon.
As a junior engineer, I found bug diagnosis in large-scale software systems challenging.
I didn’t have a good thought process for it.
Here’s the one I use today.



Half-baked resolutions allow problems to linger. These subtle flaws are easy for future developers to overlook.
Diagnosis is the first step towards fixing bugs, and keeping them fixed.


I ask questions:
- What behavior was expected?
- Is there an error code or screenshot?
- Are there known identifiers I can use for diagnosis?
- Exactly what time (and time zone)?
This clarity gives me a specific starting point.


Problem descriptions tell a story of what happened. System documentation tells a story of how things work.
Stories aren't 100% truth. I read with skepticism.
The truth is in the code and logs.


I avoid these common, frustrating mistakes:
- Reading the wrong logs
- Reading logs in the wrong timespan (see the time-window sketch below)
- Browsing source code at a commit that doesn’t match the time of the event
This saves time (and my sanity).
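
The timespan trap is the easiest one to fall into when a time is reported in a local time zone but the logs are stored in UTC. A minimal Python sketch of computing a search window, with the reported time, zone, and padding made up for illustration:

```python
from datetime import datetime, timedelta
from zoneinfo import ZoneInfo

# Hypothetical report: "it failed around 2:30 PM on March 3rd, US Pacific time".
reported = datetime(2024, 3, 3, 14, 30, tzinfo=ZoneInfo("America/Los_Angeles"))

# Logs are usually indexed in UTC, so convert before querying.
reported_utc = reported.astimezone(ZoneInfo("UTC"))

# Pad both sides to absorb clock skew and the vagueness of "around 2:30".
window_start = reported_utc - timedelta(minutes=15)
window_end = reported_utc + timedelta(minutes=15)

print(f"Search logs from {window_start.isoformat()} to {window_end.isoformat()}")
```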


The logs tell me what code executed. The code tells me which log statements to look for.
I filter logs by the known identifiers: request IDs, entity IDs, user IDs.
I’m patient. This process takes time.
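
The filtering itself can be simple. A minimal Python sketch, where the log file name and identifiers are made up for illustration:

```python
import re

# Hypothetical identifiers pulled from the problem report.
KNOWN_IDS = {"req-8f3a21", "user-104582"}

def matching_lines(log_path: str):
    """Yield log lines that mention any of the known identifiers."""
    pattern = re.compile("|".join(re.escape(i) for i in KNOWN_IDS))
    with open(log_path, encoding="utf-8") as log_file:
        for line in log_file:
            if pattern.search(line):
                yield line.rstrip()

for line in matching_lines("service.log"):
    print(line)
```

A grep over the same identifiers works just as well; the point is to narrow thousands of lines down to the handful tied to the failing request.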


Stack traces and error messages are useful, as long as they’re relevant to the problem.
Sometimes red herrings trick me. I jump in and pull myself out of rabbit holes.
After a few iterations, I formulate a hypothesis about what happened.


Testing the hypothesis is necessary to confirm what happened.
I use the alpha or beta environment. Depending on the situation, this can be a manual test and/or a unit test.
If it’s proven false, it’s back to the logs.
If it’s proven true, it’s time to fix the bug.


Bug fixes are the perfect opportunity to use a strict TDD approach.
I write a failing test, make the code change, and observe the passing test.
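
As an illustration, suppose the root cause was an off-by-one in a hypothetical pagination helper (the function and numbers are made up). The test is written first, against the buggy code, so it fails; the fix makes it pass:

```python
import unittest

def page_count(total_items: int, page_size: int) -> int:
    """Return how many pages are needed to show total_items."""
    # Buggy version dropped the final partial page:
    # return total_items // page_size
    # Fixed version rounds up so a partial page still counts.
    return -(-total_items // page_size)

class PageCountTest(unittest.TestCase):
    def test_partial_last_page_is_counted(self):
        # Written first, while the bug is still present, so it fails.
        self.assertEqual(page_count(25, 10), 3)

if __name__ == "__main__":
    unittest.main()
```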


I record verbose information in the ticket:
- The root causes
- Links to logs
- Filter expressions or Linux commands
- Relevant metrics
- How I fixed it
This leaves a track record. It helps peers follow a similar process in the future.


Afterwards, I reflect on how the diagnosis went:
- Was it difficult to find logs?
- Were logs inadequate or incorrect?
- Was diagnosis tedious?
- Should we write operational scripts for convenience? (see the sketch below)
This improves long-term operational excellence.
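
When the answer to that last question is yes, the ad-hoc filters from the ticket can be promoted into a small tool. A hypothetical Python sketch; the flag names and log paths are assumptions, not a real internal tool:

```python
import argparse
import glob
import re

def main() -> None:
    # Convenience script: given an identifier, scan log files for it.
    parser = argparse.ArgumentParser(description="Find log lines mentioning an identifier.")
    parser.add_argument("identifier", help="request, entity, or user ID to search for")
    parser.add_argument("--log-glob", default="logs/*.log", help="which log files to scan")
    args = parser.parse_args()

    pattern = re.compile(re.escape(args.identifier))
    for path in sorted(glob.glob(args.log_glob)):
        with open(path, encoding="utf-8") as log_file:
            for line in log_file:
                if pattern.search(line):
                    print(f"{path}: {line.rstrip()}")

if __name__ == "__main__":
    main()
```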


Do you follow a similar process? What steps would you add?