at a company i shall not name, one where i spent less than ninety days employed, i had been hired to "fix things" and "not take the pager", which turned out to be total lies

according to the gossip, in those ninety days i "did nothing", but i'm also to blame for everything
this is a story about a company afraid of load balancers

we had a frontend and a backend, and cpu-intensive jobs would be farmed out from one to the other, but there had been several problems with failed jobs, and i was sent to investigate
the company used a background work scheduling system to distribute jobs

the frontend would enqueue a job, the backend would poll for available work, and the frontend would poll for the status
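
the shape of it, more or less, as a toy sketch in python, with an in-memory queue standing in for the real scheduler (which i'm also not naming):

    import queue, threading, time

    jobs = queue.Queue()   # shared work queue, standing in for the scheduler
    status = {}            # job id -> "queued" / "done"

    def backend_worker():
        # backend: poll the shared queue for available work
        while True:
            job_id, fn = jobs.get()   # blocks until a job turns up
            fn()                      # the cpu-intensive bit
            status[job_id] = "done"

    threading.Thread(target=backend_worker, daemon=True).start()

    # frontend: enqueue a job, then poll until it's finished
    status[1] = "queued"
    jobs.put((1, lambda: time.sleep(0.1)))   # pretend this is real work
    while status[1] != "done":
        time.sleep(0.05)
    print("job 1 finished")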

this worked when things were over-provisioned, i.e. well under capacity
the problem with this system, one of many, was that load was not being balanced

the background scheduler was designed for batch processing, optimising for throughput over latency, with long-running background jobs

of course, we were using it for the opposite: short, latency-sensitive jobs
we had two job schedulers, shared between the frontend and backend for distributing tasks, and this turned out to be the cause of the latency/error spikes

although work was being distributed evenly across the two schedulers, workers tended to stick to one or the other
we ended up with ten workers in the pool, where seven would be emptying one work queue and three would be emptying the other

there wasn't an easy fix, but there were several options, and all but one got shot down
the first fix was obvious: fork the open-source project and change the behaviour to suit our needs, picking a queue at random every time a worker was available
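
roughly what that patch would have looked like, sketched here in python with toy stand-ins rather than the scheduler's real api:

    import random

    # toy stand-ins: each "scheduler" is just a list of pending jobs
    scheduler_a = ["job1", "job2", "job3"]
    scheduler_b = ["job4"]

    def next_job(queues):
        # the proposed change: every time a worker frees up, try the queues
        # in a fresh random order instead of sticking to a favourite
        for q in random.sample(queues, k=len(queues)):
            if q:
                return q.pop(0)
        return None   # nothing anywhere, sleep and poll again later

    print(next_job([scheduler_a, scheduler_b]))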

this wasn't my preferred approach
the second fix was to put an http load balancer between the frontend and backend

this seemed the best idea, as the backend tasks needed to complete before the frontend finished rendering the page
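
something like this on the frontend side, instead of going through the queue at all (the hostname and endpoint here are made up for illustration):

    import requests

    # frontend calls the backend synchronously, through a load balancer
    # that spreads requests across the backend pool
    resp = requests.post(
        "http://backend-lb.internal/render",   # hypothetical lb + endpoint
        json={"user": 42},
        timeout=5,                             # fail fast instead of queueing forever
    )
    resp.raise_for_status()
    result = resp.json()   # ready before the frontend finishes rendering the page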

this was "too risky"
the third fix was "well, the frontend is cpu-bound and the backend is memory-bound, we can probably run them on the same machine, and get more of them"

this was seen as too much work, even though i'm told this is eventually what happened
the fix that got chosen? turn off one of the job schedulers, thus forcing all of the workers to share from the same pool

it was the easiest route, but the most dangerous one. obviously it was the one i got asked to implement
the problems i raised were numerous: it would slow down all requests, and if it ever got overwhelmed, very bad things would happen

but the server hadn't crashed yet, so, into production went the fix

the graphs did look a lot nicer afterwards though
about eighteen months after i left, i got a message: "everything's on fire, do you remember what you did?"

the scheduler had filled up and been killed for exhausting memory, and for one reason or another it kept happening, over and over and over
i have to admit, i laughed.

remember how i said workers polled for work? well, what happened was that the frontend workers filled up the queue faster than the backend could empty it, and on a cold start the backend workers were still asleep and polled even more slowly
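
the arithmetic of the death spiral, with numbers i've invented since i don't remember the real ones:

    # frontend keeps enqueueing at full speed; a cold backend drains slowly
    # because its workers wake up on a long polling interval
    enqueue_rate = 50   # jobs/sec going in
    drain_rate = 10     # jobs/sec coming out while the workers are still groggy
    backlog = 0
    for second in range(1, 11):
        backlog += enqueue_rate - drain_rate
        print(f"t={second}s backlog={backlog} jobs")
    # the backlog only grows; eventually the scheduler exhausts memory,
    # gets killed, comes back cold, and does it all again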

ha ha
i felt a little vindicated. i'd warned them about that exact scenario. it was one of the reasons i'd pushed to move to an http load balancer of some form

i did at least tell the coworker how to fix things: over-provision the workers to hell and back and cross your fingers
i did wonder: would i have fixed that bug properly if i'd spent more time at the company? and the answer was: absolutely not

the only way things could get fixed was after all hell broke loose, and even then, you got shouted at by the founder for fixing them
during my last week there, i'd set up an autoscaling group to handle most of the load spikes automatically

something that previously took four engineers shouting at each other in a panic, desperately trying to avoid clicking on the same row in the aws console
instead of a five-hour outage, there was a five-minute outage, but that was all the excuse the founder needed to scream at me for an hour

despite someone being paged & fixing the problem by themselves, by changing a number in a box from 10 to 20
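
for the record, that "number in a box" is the autoscaling group's desired capacity; the whole fix amounts to something like this (group name invented, boto3 standing in for the console):

    import boto3

    autoscaling = boto3.client("autoscaling")
    autoscaling.set_desired_capacity(
        AutoScalingGroupName="backend-workers",   # hypothetical group name
        DesiredCapacity=20,                       # was 10
        HonorCooldown=False,
    )
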
in the end, it wasn't being screamed at that made me decide to eject; i'd worked in startups before and had become numb to the abuse that passes for leadership

it was my line manager saying "did he do that thing he does" and brushing it off as routine that made me quit
hearing from my manager that such shitty behaviour was just part-and-parcel of the job, in a you-know-how-it-is way

made me realise that absolutely nothing would get fixed, nothing would change: not just the code, but the toxic environment that led to it
it was vindicating to hear things had fallen over as i had predicted, and kinda lolsob to hear that i had a reputation for doing nothing & also being the cause of all the bad decisions in my eight-week tenure

the cognitive dissonance needed to continue working there was a lot
uptime was important only because it served as a means to punish the engineers, and guilt them into more overtime for feature dev

in other words, the reason you end up fighting fires at startups is that you work for arsonists
to go back to this joke tweet that set me off down memory lane

the solution? put on your mask before helping others with theirs. you close your laptop and call it a day.

the company's on fire, too https://twitter.com/walfieee/status/953848431184875520