Thread by @shelbyspees, was just thinking like, "I have so many war stories but I [...]

was just thinking like, "I have so many war stories but I never shipped anything useful as an SRE/DevOps engineer" but that's not true, I have a couple really cool success stories from my last job.

the biggest one was figuring out how to safely deploy and migrate traffic to a new service. it started out as a refactor branch and eventually became a complete rewrite of the core business logic over the course of several years.

I joined the company about 2.5 years into this rewrite effort.

the original codebase from 2009ish(?) was written in two months by an 18-year-old wunderkind. it was a spaghetti mess.

thankfully I never had to touch it

I joined the company in 2018. I was brand new to Chef, Terraform, Jenkins, fairly new to AWS. Got thrown in the deep end of ops, infra/config as code, Linux sysadmin stuff (which was the hardest for me).

took me about six months to get a handle on things, and then I started asking questions.

like, why is there only one person working on the rewrite? why does he seem so burnt out? why does his manager keep saying it'll be done X date and it never is?

I became friends with the platform engineer who was working on the rewrite and I started learning more about the history and context. he kept downplaying his work like, "I just like writing pretty code" and I was suspicious.

I started asking like, "What's the deployment plan? We don't need to wait until it's done to start planning for that" and the platform engineer kinda just shrugged. He didn't really care about opsy stuff, plus the DevOps team didn't make it easy for him to make changes.

This pissed me off. Like we don't need the platform team to learn about networking or AMI builds, but they should still be able to do some stuff self-serve.

Nope, everything from "We need to increase the ASG size" to "please add these credentials to secrets" was a Jira ticket.

Don't get me wrong, I love closing tickets and PRs. I love helping people. But I could see other project work that would make things better for everybody, and I couldn't get it done because we were basically the dev-facing IT department.

So in an effort to be more proactive, I opened a ticket like, "Hey let's discuss what it's gonna look like to deploy the rewrite. Here are my thoughts." and quickly got shot down by the platform engineer's manager.

We ended up implementing my plan almost exactly lol. Eventually. But apparently I was getting too involved and now I was getting distracted and distracting the platform engineer as well.

I was thinking like, "What can possibly be more important than the core money-making service for the entire company? Btw your SPOF platform engineer who's the only person who understands that business logic is burnt out and making his exit plans."

but I was less than a year removed from being broke and unemployed so I didn't want to rock the boat too much. https://twitter.com/shelbyspees/status/1286444858429775874


I had only just learned about canary deploys, blue/green deploys, feature flags. I had also just productionized my first new service, and created a go-to-prod checklist to document all the shit I discovered in the process.

the company completed their migration to VPC soon after I joined, and they were able to do it just by updating DNS. once I learned what all those things meant I was like, "Oh hey you can do that, neat."

the system architecture was actually pretty nice, even if the Terraform code was hard to navigate. I was like, "can we just make a second cluster (even with just one server for now) and deploy to that and then add some configuration to use that for some subset of traffic?"

in retrospect, I think my manager wanted to keep costs down, which is why he didn't want to spin up all the resources of a proper cluster right away (ALB, ASG, etc.).

so instead we got a hand-configured EC2 box that he spent like a week configuring on his own.

which meant that I'd have to redo the same work a few months later when I created a new Chef cookbook for the rewrite, with none of his work documented. ughhh.

anyway my favorite bit happened a few months after that. I didn't know what it would look like to migrate traffic over to the new service, but I talked to the platform engineer some more and helped him figure out what he needed to figure out.

after talking to the lead application engineer, the platform engineer realized that with the business logic already complete for certain kinds of requests, he could create a config to send traffic to the new service for only those requests.

the platform engineer wrote a proxy in the old service to redirect traffic to the new service based on the request type and customer, and had the config live in a DynamoDB table, fetched every 1 min or 5 mins or something.

the config would just tell the proxy what % of requests to send to the new service.

I said, "kinda like a switchboard?" and so it was henceforth christened.

it was great because like a feature flag, the platform engineer could redirect traffic without waiting on a deploy--which took like 30 mins total across the legacy cluster. he could switch back at the drop of a hat.
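for anyone who wants to picture it, here's a minimal sketch of how a switchboard like this could work. it is not the actual code: the names (`Switchboard`, `fetch_rules`, `refresh_seconds`) are hypothetical, the injected `fetch_rules` callable stands in for the DynamoDB read, and picking requests with a random roll against the percentage is an assumption about how the split was applied.

```python
import random
import time
from typing import Callable, Dict, Optional, Tuple

# Hypothetical rule key: (request_type, customer).
RuleKey = Tuple[str, str]


class Switchboard:
    """Decides per request whether to send traffic to the new service.

    Rules map (request_type, customer) to a percentage from 0 to 100.
    They are re-fetched at most once per refresh interval, so changing
    the stored config shifts traffic without a deploy.
    """

    def __init__(
        self,
        fetch_rules: Callable[[], Dict[RuleKey, float]],
        refresh_seconds: float = 60.0,
        clock: Callable[[], float] = time.monotonic,
        rng: Callable[[], float] = random.random,
    ) -> None:
        # fetch_rules is an assumption: in the thread, the config
        # lived in a DynamoDB table.
        self._fetch_rules = fetch_rules
        self._refresh_seconds = refresh_seconds
        self._clock = clock
        self._rng = rng
        self._rules: Dict[RuleKey, float] = {}
        self._loaded_at: Optional[float] = None

    def _maybe_refresh(self) -> None:
        """Reload the rules if they were never loaded or are stale."""
        now = self._clock()
        stale = (
            self._loaded_at is None
            or now - self._loaded_at >= self._refresh_seconds
        )
        if stale:
            self._rules = dict(self._fetch_rules())
            self._loaded_at = now

    def use_new_service(self, request_type: str, customer: str) -> bool:
        """Return True if this request should go to the new service."""
        self._maybe_refresh()
        # Unknown request types and customers stay on the legacy service.
        percent = self._rules.get((request_type, customer), 0.0)
        return self._rng() * 100.0 < percent
```

setting a rule back to 0 is the "switch back at the drop of a hat" part: the next refresh picks it up and every matching request stays on the legacy service.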

the only code I wrote for these services was the Chef and Terraform. but in a few months of side conversations I took the team from a scary gigantor branch deploy to a much less scary config change for migrating traffic.

I don't have a million years of experience like many of my colleagues do, but I do like to ask questions. "Why is it this way? Does it have to be this way? What's the human impact? What's a reasonable step in a healthier direction?"

it's easy for me to look back on that job and think of all the things I wanted to accomplish and didn't. all the big plans I dropped.

so it feels good to reflect on doing something that had a positive impact.

because of these conversations I had, the rewrite project was no longer a giant question mark--QA could start testing it. when they started migrating production traffic (after I left), it was almost disappointing in the lack of fanfare.

and being able to migrate traffic one customer at a time meant that the roadmap was much clearer.

the platform engineer was allergic to Jira so I suggested using a GitHub board, and it ended up working great for him. his manager was thrilled to be able to see it too.

I won't take credit for the years of nights and weekends that this platform engineer spent rewriting the core service. (the new architecture is fantastic btw, I'm hoping he'll open source it as a framework once he recovers from burnout).

but he needed help solving operational problems. they weren't front-of-mind for him.

and I honestly believe that if I hadn't said anything, he'd still be there chipping away at that rewrite branch in the legacy repo.

because nobody else on that team cared about his (obvious-to-me) burnout. nobody else wanted to ensure a safe migration to the new service. they still thought of it as a branch to be merged.

one of my career goals is to be the kind of SRE that I see in Liz and Paul and Amy.

not like the "we need someone to manage Jenkins" "SRE", but the "have you tried talking to each other?" SRE.

I know I'm not there yet. the gaps in my knowledge and experience are too big. I haven't been burned enough times. I haven't learned enough ops lessons the hard way.

but I want to get there.
