0/ This is a thread about my experiences building both the Distributed Tracing and Metrics infra at Google.
And, particularly, my regrets. :)
Here goes:

1/ Dapper certainly did some fancy tricks, and I'm sure it still does. If it's possible to fall in love with an idea or a piece of technology, that's what happened with me and Dapper. It wasn't just new data, it was a new *type* of data, and lots of it. So. Much. Fun. …
2/ … And yet: early on, *hardly anybody actually used it.*
Dapper was certainly _valuable_ (we saved GOOG untold $10Ms in latency improvements alone) but not "day-to-day essential." Why?
3/ Dapper's early-days usage issues boiled down to two core challenges:
a) The insights were *restricted to the tracing (span) telemetry*
b) Those insights could only be accessed *from Dapper.* (And hardly anybody "started" in Dapper from a UX standpoint)
4/ Eventually we did make some progress (e.g., @jmacdee1's brilliant integration into Google's /requestz, the ancestor of OpenCensus zpages: https://opencensus.io/zpages/#tracez ).
Yet it still didn't feel like Dapper was vital, can't-live-without-it technology for most SREs and devs.
5/ Now, rather than step back to think about *how* we might harvest the insights from Dapper and integrate them into daily workflows, we let the project "evolve in place," and I regret that.
Anyway, I wanted to work on something "more P0," so I talked with lots of SREs.
6/ At the time, what tool *did* every SRE at Google use every day?
Borgmon.
And what tool caused every SRE at Google endless frustration and pain?
Also Borgmon.
So we created Monarch: scalable, HA monitoring that was also, well, *usable*. https://www.vldb.org/pvldb/vol13/p3181-adams.pdf
7/ The complete story about Monarch's early days is an interesting one, but it will have to wait for a different thread/post (too long!). What I would emphasize, though, is that Monarch only tried to solve the *monitoring* problem, not the *observability* problem.
8/ And while I am proud of the team's technical accomplishments (Monarch is a *vast* system: over 220,000 (!) processes in steady-state), I regret that we stopped at "monitoring." Why did such an *expensive* system have such limitations?
Correction: I *really* regret that.
9/ So what would a scalable, HA monitoring product look like if observability was built into its fabric, into its very infrastructure? If monitoring was there to measure critical signals, and observability was there to explain changes to those signals?
10/ (TBH, we never even *tried* to build that at Google… though admittedly it would have been very difficult to take on given all of the hurdles that large companies bring to any and every development process.)
11/ So, ultimately, of course there were ∞ small regrets, and two *big* regrets:
I) We built Dapper to find patterns in traces, but we failed to make those findings *discoverable.*
II) We built Monarch for core monitoring, but we failed to make that monitoring *actionable.*
12/ Why am I telling this story now?
Well, this week, after years of effort and experimentation, we at @LightstepHQ are ready to share some news.
And this time, I have no regrets. :)
Take a look…
Architecture: https://thenewstack.io/observability-wont-replace-monitoring-because-it-shouldnt/
Product: https://lightstep.com/blog/announcing-lightsteps-change-intelligence/