Change Management gone very, very wrong
An extremely belated postmortem
My favorite interview question of all time is “tell me about a disaster”. Disasters provide rich learning opportunities and software careers are lousy with them. If I meet a senior engineer who can’t offer a single disaster story, I am instantly skeptical. During those first 5+ years something somewhere should have gone “boom” loud enough to knock them off their feet.
Usually when a candidate digs deep they will think of a system that crashed or a deadline that whizzed by. Disasters apply to other things too, though, like inter-team relationships and processes gone wrong.
My favorite process disaster of all time was change management for our internal software releases at Big Bank.
Normally when I write about a disaster I focus on telling a story and then try to draw some actionable insights out of it. Engineering teams have a specific tool for learning from disasters, though, called the postmortem. There are a bunch of different styles out there but all good postmortems push participants to go deep, uncover multiple causes, and identify next steps to prevent an encore.
I feel like we don’t do this a whole lot when talking about process disasters. We establish what’s bad about the process and the negative impacts it has. Maybe we’ll even propose alternatives. I think it’s rare, though, to dig into all the things that caused the bad process to exist in the first place.
So today I’m going to dust off some old skills, pretend I’m sitting in a room with a bunch of coworkers too many years ago and postmortem our gnarly old change management process.
But I like telling stories so I’ll do that first
My memory is fuzzy (this is always the case with my Big Bank stories, sorry!) but I’m pretty sure it all started with decent intentions: release stability and standardization[1]. The process was created in some other part of the bank where it was presumably successful before it was pushed to ours.
It went like this: Each team had to submit a ticket several days before their release was scheduled to go through. You had to answer a few dozen highly repetitive and interdependent questions which the ticketing platform used to calculate a risk level for the change. Higher risk releases were assigned longer timelines and greater scrutiny. Various artifacts had to be attached to the release ticket as proof that you had followed your QA process before advancing the ticket to review.
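I never saw the platform’s internals, so here is a purely hypothetical sketch of how a questionnaire-driven risk calculator like that might work. Every field name, weight, and threshold below is invented for illustration; the real system’s logic was opaque to us:

```python
# Hypothetical sketch of a questionnaire-driven risk calculator.
# All field names, weights, and thresholds are invented for illustration;
# the real platform's scoring logic was never visible to us.

def risk_level(answers: dict) -> str:
    """Map questionnaire answers to a risk tier via a weighted score."""
    score = 0
    if answers.get("touches_production_data"):
        score += 3
    if answers.get("requires_downtime"):
        score += 2
    if not answers.get("has_rollback_plan"):
        score += 4
    # Cap the fan-out term so one field can't dominate the score.
    score += min(answers.get("systems_affected", 1), 5)

    if score >= 8:
        return "high"
    if score >= 4:
        return "medium"
    return "low"

# Teams quickly learn which combination of answers yields "low" --
# exactly the incentive problem described later in this post.
print(risk_level({"touches_production_data": False,
                  "requires_downtime": False,
                  "has_rollback_plan": True,
                  "systems_affected": 1}))  # -> low
```

The gameability is structural: once the mapping from answers to tiers is deterministic and the questions are self-reported, the score measures how well you fill in tickets, not how risky the release is.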
You then had to sit in on an hours-long meeting while a small team from outside the org reviewed each release ticket and asked follow-up questions.
It was mind-numbing; there were usually several dozen tickets to review and it was very, very hard not to zone out. If you missed the meeting, if they didn’t like your answers, or if your ticket wasn’t at a particular stage of completeness, then you were out of luck: no release for you. If you were fortunate enough to receive the green light during this call then you had the privilege of chasing each of your stakeholders for additional written approval over the course of the next few days. I have vivid memories of calling our desk head, lead quant, and other senior leaders begging them to please please PLEASE reply to the email chain saying “yes” right now because the release would be automatically canceled if I didn’t have everyone’s signature uploaded into the system three hours before the scheduled rollout time.
This was, as you might imagine, a source of much discontent. A few people actually cited this process when departing the company, and for good reason: wrangling the change management process easily accounted for a third of my time most weeks. A third of my time! Multiply and sum that across the salary of each engineer going through this flow and that is one expensive process. It was also pretty annoying for my stakeholders; no one wanted to get that whiny call from me Friday lunchtime to sign off on a thing that impacted some other stakeholder but not them.
So what do you think happens when you present folks with an unbelievably toilsome and largely irrelevant process? What do they do when you give them a mountain to climb? They go around it, of course! Over time we learned exactly what to say on the ticket to make the platform spit out a “low risk” rating. We pinged each other awake during the big weekly call. We layered one hot patch on top of another because sometimes we had no time left to do a full build and deploy. The massive pressure we were under to deliver week after week forced us through the cracks of our highly restrictive change management process, giving birth to a whole new class of anti-patterns. Chronic instability maintained its tight grip on our systems and people.
Hopefully I’ve established our change management process as sufficiently disastrous to merit a postmortem: I don’t think it did the thing it was supposed to do (improve release stability) and was so high friction that some people actually cited it when leaving the company.
So let’s get to it.
The postmortem
There are a bunch of postmortem styles out there. I’m most familiar with a kind of loosey-goosey take on the 5-whys method. You’ll note as you go through my write-up that some of the “Why”s are left hanging. These are places where I don’t know or remember enough to speculate. Please also remember that I was a junior → senior engineer during this time and that it happened a good long while ago.
What happened: Our org was required to follow a centralized change management process which was intended to improve release stability through a series of standardized checks. In practice it introduced tremendous levels of toil and did little to actually reduce instability overall.
Why was it so toilsome?
Because there were a lot of things that the team running this process wanted to check for when assessing risk. Unfortunately the most obvious way to do that is to tack more stuff onto your existing process until it collapses under its own bulk.
Because it was one small external team running the process for our entire org (plus several others) and it was simpler for them to review everything if 1) we all followed the same flow and 2) we all showed up at the same giant meeting.
Why didn’t it move the needle on instability for our org?
Because it was one size fits all and came from outside our org. Why?
I don’t know! I was an IC for most of that time and wasn’t part of those discussions so it was confusing.
Because our project roadmap was extremely aggressive. Why?
More on this further down.
Because the process was so burdensome that it squeezed out time for more meaningful development and even proper build & deploys. I think this may have actually ended up contributing more instability to our systems over time because we were busy supporting messily layered hot patches vs a nice clean build.
Why did we have such an onerous version of this process in the first place?
Because our releases were historically unstable; something always went wrong. Why?
Because our ecosystem had been completely rearchitected into a collection of tightly interwoven components at varying levels of maturity. We heard about services and data lakes and got very, very excited without thinking through contracts, nines, or error budgets. We actually did some remarkably innovative things for the time but it came at the expense of system reliability.
Because we had really low bus factor. Why?
Because management had replaced all of our consultants, who had constituted the majority of the org, with new full time hires right before rearchitecting everything. Why?
I have thoughts but no answers! Fodder for a future post.
Because, even after that, we had a TON of turnover year over year. Why?
I think it was a mix of dissatisfaction with compensation, heavy workloads, stressful/unhappy work culture and, in a few cases, extremely toilsome processes like this one.
Because some leads refused to prioritize stability/infra work and kept pushing incredibly aggressive product timelines. Why?
Because they were more PM (Product Manager) than they were EM (Engineering Manager) but we didn’t distinguish between the two within the team lead role.
Because desk-facing deliverables were rewarded more highly at performance review and budget time than stability was. Why?
It is always easier to sell user-facing wins than it is to sell just about any other kind of engineering work.
Why couldn’t we provide upward feedback and improve/replace/remove the process?
Because our technology organization was large enough to populate several towns and we moved at the speed of slow. I think there was a committee to revise the process that my managing director was on but this thing moved on an annual basis or something. The people who actually participated in the change management process itself were largely disenfranchised.
Reflection and Next Steps
Good postmortems don’t close without reflecting on what has been learned and suggesting next steps. This is hard, though, and not just because it was so long ago. What do you do when your entire org is saddled with a beast like this?
We can consider some surface-level action items. Streamlining the existing process via iteration would be a decent first step. Let’s dig a little deeper, though. This process was rolled out to our org partly in response to intolerable levels of instability. If we had addressed the instability then the process requirements might have actually lightened up. I vaguely remember being told that we were under a particularly heavy level of scrutiny because we were just that bad. So next steps should have reasonably included a bottom-up initiative to actually stabilize our systems and releases.
Why were we so unstable? Our ecosystem was architected in such a way that a fire in one system quickly spread to another with few checks on request volume in between. We had made the move from self-contained monoliths to highly interdependent components intermediated by web calls in the space of a year. Building in contract enforcement, stress testing, production-parallel environments, SLOs, and error budgets could have gone a long way toward improving the resilience of our ecosystem.
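To make the SLO/error-budget idea concrete, here is the standard arithmetic (illustrative numbers, not anything we actually ran at the bank). The key property is that a target like 99.9% converts into a concrete, spendable allowance of downtime per window, which gives a team an objective trigger for pausing feature work:

```python
# Standard error-budget arithmetic for an availability SLO.
# The SLO values and windows below are illustrative only.

def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Minutes of allowed downtime in the window for a given SLO target."""
    total_minutes = window_days * 24 * 60
    # Rounded to one decimal place to keep the result tidy.
    return round(total_minutes * (1 - slo), 1)

print(error_budget_minutes(0.999))  # 99.9% over 30 days -> 43.2 minutes
print(error_budget_minutes(0.99))   # 99%   over 30 days -> 432.0 minutes
```

A rule such as “when the budget for the window is spent, the next release slot goes to stability work” is the kind of team-owned check that could have replaced a chunk of the centralized questionnaire.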
Why didn’t we do that? Well some teams did! In fact some teams did a great job of this and they were really, really annoyed when our team… didn’t. Or, rather, they were annoyed when we did some stability work but still maintained a breakneck pace toward features while our system was actively on fire.
So what was up with us? One piece of it was that we were very much a product team. The teams that did make solid stability gains were mostly infrastructure teams; we were just a lot closer to the trading desk than they were. They had very few user facing initiatives whereas that was pretty much all we did. We also had a ton of regulatory work on our plate with non-negotiable deadlines. There’s still a big gap, though, between the amount of time we should have spent on stability vs the time we actually did.
I genuinely believe that most of the engineers on our team would have preferred to shore up our system given the chance because on-call was hell. That wasn’t true for some of our local team leads, though; the people who owned the roadmap. They ate 2AM support calls for breakfast (or late-night snack). These folks probably would have absolutely killed it as PMs if they’d been paired with an experienced EM or TL (Tech Lead). Alas, we didn’t have this kind of setup. Imagine giving the most aggressive PM you’d ever met full control of the engineering roadmap! I wish there had been more cultural pressure from above to prioritize stability work but I wonder if we might also have benefitted from a formal PM/TL or PM/EM division.
My team was also heavily siloed; we maintained an N:1 mapping of projects to engineers. Keeping us isolated like that meant that we didn’t make roadmap decisions as a team - that was all done by the local lead who, again, was more focused on the product end of things. I believe that less siloing would have led to a more empowered team.
On the topic of team empowerment, I don’t quite understand why we needed one change management process to rule them all. Maybe this was simpler from a regulatory perspective. Maybe there was some kind of economy of scale going on. I don’t know. If, however, we could have had our own process that delivered against a set of required outcomes then I think that would have been better. Teams should own their processes wherever possible.
Finally, all of these improvements taken together may have made the experience of developing and rolling out software in our org less miserable. I like to think this would have reduced turnover and promoted a virtuous cycle: Less turnover → greater stability → less misery → less turnover → ...
Closing thoughts and caveats
I’m writing about something that happened a really long time ago with none of the other participants present! These are big no-no’s for a good postmortem and so I am sure that I have warped the narrative here. Please take all of this with a very large grain of salt; there is no way I got all the details right or have provided a full picture of what went on and why. If someone I worked with reads this and points out the things I got wrong then I will be grateful.
All that being said, I stand by my original thesis that a terrible process is a disaster worth postmorteming. I also strongly believe that we could have avoided a whole lot of pain by instilling a bottom-up culture of stability, designing for resilience, and trusting teams to maintain their own processes.
In short: do not suffer a lousy change management setup, and keep asking “why”.
Read next
If you liked this piece then you may also enjoy:
A Tale of Two Migrations
Another disaster story! This time it's a system migration that missed the deadline by over a year.
[1] Questionable whether standardization is actually a “decent intention.” More on this later.