I’ve diagnosed and resolved hundreds of bugs in my 5 years at Amazon.
As a junior engineer, I found bug diagnosis in large-scale software systems challenging.
I didn’t have a good thought process for it.
Here’s the one I use today.



Half-baked resolutions allow problems to linger. These subtle flaws are easy for future developers to overlook.
Diagnosis is the first step towards fixing bugs, and keeping them fixed.


I ask questions:
- What behavior was expected?
- Is there an error code or screenshot?
- Are there known identifiers I can use for diagnosis?
- Exactly what time (and time zone)?
This clarity gives me a specific starting point.


Problem descriptions tell a story of what happened. System documentation tells a story of how things work.
Stories aren't 100% truth. I read with skepticism.
The truth is in the code and logs.


I avoid these common, frustrating mistakes:
- Reading the wrong logs
- Reading logs in the wrong timespan (see the time-window sketch below)
- Browsing source code at a commit that doesn’t match the time of the event
This saves time (and my sanity).
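
The timespan trap is the easiest one to fall into when a time is reported in a local time zone but the logs are stored in UTC. A minimal Python sketch of computing a search window, with the reported time, zone, and padding made up for illustration:

```python
from datetime import datetime, timedelta
from zoneinfo import ZoneInfo

# Hypothetical report: "it failed around 2:30 PM on March 3rd, US Pacific time".
reported = datetime(2024, 3, 3, 14, 30, tzinfo=ZoneInfo("America/Los_Angeles"))

# Logs are usually indexed in UTC, so convert before querying.
reported_utc = reported.astimezone(ZoneInfo("UTC"))

# Pad both sides to absorb clock skew and the vagueness of "around 2:30".
window_start = reported_utc - timedelta(minutes=15)
window_end = reported_utc + timedelta(minutes=15)

print(f"Search logs from {window_start.isoformat()} to {window_end.isoformat()}")
```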


The logs tell me what code executed. The code tells me which log statements to look for.
I filter logs by the known identifiers: request IDs, entity IDs, user IDs.
I’m patient. This process takes time.
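
The filtering itself can be simple. A minimal Python sketch, where the log file name and identifiers are made up for illustration:

```python
import re

# Hypothetical identifiers pulled from the problem report.
KNOWN_IDS = {"req-8f3a21", "user-104582"}

def matching_lines(log_path: str):
    """Yield log lines that mention any of the known identifiers."""
    pattern = re.compile("|".join(re.escape(i) for i in KNOWN_IDS))
    with open(log_path, encoding="utf-8") as log_file:
        for line in log_file:
            if pattern.search(line):
                yield line.rstrip()

for line in matching_lines("service.log"):
    print(line)
```

A grep over the same identifiers works just as well; the point is to narrow thousands of lines down to the handful tied to the failing request.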


Stack traces and error messages are useful, as long as they’re relevant to the problem.
Sometimes red herrings trick me. I jump in and pull myself out of rabbit holes.
After a few iterations, I formulate a hypothesis about what happened.


Testing the hypothesis is necessary to confirm what happened.
I use the alpha or beta environment. Depending on the situation, this can be a manual test and/or a unit test.
If it’s proven false, it’s back to the logs.
If it’s proven true, it’s time to fix the bug.


Bug fixes are the perfect opportunity to use a strict TDD approach.
I write a failing test, make the code change, and observe the passing test.
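
As an illustration, suppose the root cause was an off-by-one in a hypothetical pagination helper (the function and numbers are made up). The test is written first, against the buggy code, so it fails; the fix makes it pass:

```python
import unittest

def page_count(total_items: int, page_size: int) -> int:
    """Return how many pages are needed to show total_items."""
    # Buggy version dropped the final partial page:
    # return total_items // page_size
    # Fixed version rounds up so a partial page still counts.
    return -(-total_items // page_size)

class PageCountTest(unittest.TestCase):
    def test_partial_last_page_is_counted(self):
        # Written first, while the bug is still present, so it fails.
        self.assertEqual(page_count(25, 10), 3)

if __name__ == "__main__":
    unittest.main()
```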


I record verbose information in the ticket:
- The root causes
- Links to logs
- Filter expressions or Linux commands
- Relevant metrics
- How I fixed it
This leaves a track record. It helps peers follow a similar process in the future.


Afterwards, I reflect on how the diagnosis went:
- Was it difficult to find logs?
- Were logs inadequate or incorrect?
- Was diagnosis tedious?
- Should we write operational scripts for convenience? (see the sketch below)
This improves long-term operational excellence.
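
When the answer to that last question is yes, the ad-hoc filters from the ticket can be promoted into a small tool. A hypothetical Python sketch; the flag names and log paths are assumptions, not a real internal tool:

```python
import argparse
import glob
import re

def main() -> None:
    # Convenience script: given an identifier, scan log files for it.
    parser = argparse.ArgumentParser(description="Find log lines mentioning an identifier.")
    parser.add_argument("identifier", help="request, entity, or user ID to search for")
    parser.add_argument("--log-glob", default="logs/*.log", help="which log files to scan")
    args = parser.parse_args()

    pattern = re.compile(re.escape(args.identifier))
    for path in sorted(glob.glob(args.log_glob)):
        with open(path, encoding="utf-8") as log_file:
            for line in log_file:
                if pattern.search(line):
                    print(f"{path}: {line.rstrip()}")

if __name__ == "__main__":
    main()
```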


Do you follow a similar process? What steps would you add?