A Tale of Two Migrations
How our brute-force project plan delayed delivery by over a year, and what we did differently the next time around
I joined Big Bank as a junior engineer on the front office side during the spring of 2011. I was quickly assigned to a massive migration effort. We were developing a new system to calculate risk on a trade-by-trade basis for a host of different credit instruments (bonds, CDSs, and stuff built on top of bonds and CDSs). This new system was supposed to replace an existing legacy one. It would mirror the original system’s nightly batch compute functionality while adding some new real-time features for our traders.
The New York team (my team) was responsible for migrating synthetic CDOs. These are complex credit instruments that required a lot of input data and some intense number crunching. Fortunately, the number crunching was handled by a black-box analytics package owned by our quant team, and the new platform itself was already largely developed. We just needed to implement the new business logic plus data queries, apply any relevant analytics package upgrades, run the batches, and get sign-off on the results. Not so bad, we thought.
It would be well over a year before the first synthetic CDO batch was approved to go live in the new production system. What happened?
We had inadvertently brute-forced our project plan. Usually when folks talk about migrations they lean into the technical end of things, like strangler patterns and routing records to new databases. Today I’m going to talk about the operational bits instead: how you phase changes, work out dependencies, verify results, and work with other teams.
That first migration was a nightmare. Luckily (?) we didn’t stop at synthetic CDOs and had the opportunity to iterate on our approach as we took on several more credit instrument types. I will cover what I remember as the three major iterations of our migration plan before wrapping things up with some key takeaways.
Some background
It helps to know a little more about the changes we were making and who was involved before jumping into those three phases.
Our migration required changes to:
The risk platform itself. Brand new codebase in a different language under a different architecture.
Where we got our input data from. Ours was not the only system in flux. The entire ecosystem was getting an overhaul and we were expected to pull input data from a new place.
Where we published our output data to. See above re: ecosystem in flux.
The analytics package. There were a bunch of new numbers the quants wanted to expose to our traders and other players on the front office side so that they could make more informed decisions. This ended up requiring several new versions of the package.
The migration required major participation from:
The NY-based risk platform team. Participants were my manager and me. We were responsible for building the new business logic and data queries required for handling synthetic CDOs. We also had a London-based team with whom we collaborated. They were largely focused on other credit instruments (bonds and CDSs) and had developed much of the new platform. London didn’t participate in this specific project (migrating synthetic CDOs to the new platform), but we did have to work closely with them to ensure we weren’t stepping on each other’s toes.
A data pipeline team[1]. Participants were the project lead and a senior IC. For the purposes of this post they: 1) published much of the input data we needed to its new home and 2) pulled our output data into another system where the desk could actually use it.
The quants. Super duper math-y types who interacted closely with the traders. They owned the analytics package, a C++ library which did all the number crunching. Our risk platform would make library calls to the analytics package and then publish the output to its home. The package was a black box: only the quants were permitted to have visibility into the code.
The front office support team. These folks sat on the trading floor and worked with the traders when there was a problem. They would route issues to the appropriate team when they couldn’t solve the problem themselves or when a system appeared to be on fire[2].
Phase 1: A naive brute-force solution
We felt the problem was fairly straightforward. Our initial project plan mirrored that perceived simplicity. It looked like this:
Given M changes and N trades:
1. Make M changes ← bug #1
2. Run the batch job
3. Foreach trade in N trades:
    a. Compare output to legacy output
    b. If different:
        i. Pull the new and legacy inputs to the analytics package then send everything to the quants to debug ← bug #2
        ii. Fix whatever issue we identified with the quants, either by fixing our code or applying a new version of the analytics package
        iii. Goto Step 2 (Run the batch job) ← bug #3
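For the programmers in the audience, here is roughly what that plan amounts to in Python. This is only an illustrative sketch, not our actual code; the callables (apply_all_changes, run_batch, debug_with_quants, apply_fix) are hypothetical stand-ins for the real batch and analytics machinery.

from typing import Callable, Dict, Hashable, Sequence

# A rough sketch of the Phase 1 plan. The callables are hypothetical stand-ins
# for the real machinery: the full batch run, the quants' debugging, our fixes.
def migrate_phase_1(
    trades: Sequence[Hashable],
    apply_all_changes: Callable[[], None],         # bug #1: every change lands at once
    run_batch: Callable[[], Dict],                 # expensive full batch run
    legacy_outputs: Dict,                          # the numbers we must match
    debug_with_quants: Callable[[Hashable], str],  # bug #2: escalate without triage
    apply_fix: Callable[[str], None],
) -> None:
    apply_all_changes()
    while True:
        new_outputs = run_batch()
        mismatched = next(
            (t for t in trades if new_outputs[t] != legacy_outputs[t]), None
        )
        if mismatched is None:
            return  # parity achieved; the batch can go live
        issue = debug_with_quants(mismatched)  # more often than not, a bad input of ours
        apply_fix(issue)
        # bug #3: stop at the first discrepancy and rerun the entire batch

Written out like this, the flaws are easy to spot: all M changes land before the first validation, there is no triage step at all, and the loop restarts on the very first mismatch.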
I think we implicitly assumed that we’d hit on the best-case scenario: get everything right after just a few batch runs. Batch runs were expensive and so was the act of comparing outputs between the new and legacy systems (it was Big Bank, after all, so we were talking about a ton of trades). If we got things right early on then this approach made sense, because we’d only run the batch a few times after making all the changes upfront.
This is not, of course, what happened. We weren’t just changing a couple of data pipelines but several, PLUS the analytics package, PLUS the risk platform itself, PLUS where we published our numbers to. Issues scale in debugging complexity with the number of changes you stuff into a single release. We quickly found ourselves in a position where a given trade would spit out the wrong number for more than one reason. We’d discover one of the inputs was incorrect and fix it, only to discover that the same trade now yielded an entirely different wrong number for some second reason. Our first several batches were full of these nasty little onions; each trade was wrapped in multiple layers of bugginess.
So that was Bug #1: too many changes all up front with too little validation early on.
Bug #2 was that, when we found a discrepancy in our output, we blindly sent all the inputs and outputs for the affected trade from both systems to the quants to debug. “The analytics package is a black box!” we cried. “We can’t debug this! If only we had access to the code…” More often than not, though, the quants found that the inputs themselves differed and kicked the trade back to us. In the beginning they wasted hours and sometimes days trying to debug their analytics package for a nonexistent issue. Soon they realized that we were not checking our inputs for parity before sending things along to their team. We blew through our goodwill budget with them very quickly.
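In hindsight, the fix was a tiny bit of triage before anyone pinged the quants. Something like the hypothetical helper below; our actual checks were manual, not this function:

from typing import Dict, Hashable

# Hypothetical triage helper, not our actual tooling: compare the analytics
# inputs between the two systems before escalating. If the inputs differ, the
# bug is on our side of the fence (or the pipeline's); only when the inputs
# match and the outputs still disagree is the black box the prime suspect.
def triage(trade: Hashable,
           new_inputs: Dict[Hashable, dict],
           legacy_inputs: Dict[Hashable, dict]) -> str:
    if new_inputs[trade] != legacy_inputs[trade]:
        return "debug locally"        # input mismatch: ours to fix
    return "escalate to quants"       # inputs match, outputs differ

That is the entire check we were skipping.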
Bug #3 was that we pulled the emergency brake as soon as we identified the first issue with the batch, be it a bad input or a bug in the analytics package. We focused on fixing that one thing without searching for more problems and then repeated the whole dreadful process of running the batch, checking the outputs, and annoying the shit out of our quants.
Phase 2: A tweak
It took an embarrassing amount of time for us to change our approach. I have to imagine there was a lot of yelling going on during meetings with leadership as we pushed our deadlines further and further out into the future[3]. Whatever the impetus was, we did eventually recognize that we needed to do better. The tweaked plan looked kind of like this:
Given M changes and N trades:
1. Make M changes // We already did this part so there was no opportunity for improvement here
2. Run the batch job
3. issues = {} // A tracker listing trades impacted by a given issue
4. Foreach trade in N trades:
    a. Compare output to legacy output
    b. If different:
        i. Check the inputs
        ii. If inputs differ:
            Identify issue, i
            If i not in issues:
                issues[i] = []
            issues[i] += trade
        iii. Else:
            Send the trade to the quants, including inputs and outputs from new and legacy systems
            Have quants identify issue, i
            If i not in issues:
                issues[i] = []
            issues[i] += trade
5. If issues not empty:
    a. Sort issues by number of trades in descending order
    b. Fix as many issues as we can, hitting the most impactful issues first
    c. Goto Step 2 (Run the batch job)
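The bookkeeping half of this plan translates to Python pretty directly. Another hedged sketch: identify_input_issue and ask_quants are hypothetical stand-ins for what was actually human debugging, on our side and the quants’ respectively.

from collections import defaultdict
from typing import Callable, Dict, Hashable, List, Sequence, Tuple

# Sketch of the Phase 2 issue tracker: group mismatched trades by root cause,
# then rank causes by how many trades they touch so we fix the big ones first.
def collect_issues(
    trades: Sequence[Hashable],
    new_outputs: Dict, legacy_outputs: Dict,
    new_inputs: Dict, legacy_inputs: Dict,
    identify_input_issue: Callable[[Hashable], str],  # stand-in: our own debugging
    ask_quants: Callable[[Hashable], str],            # stand-in: the quants' debugging
) -> List[Tuple[str, List[Hashable]]]:
    issues: Dict[str, List[Hashable]] = defaultdict(list)
    for trade in trades:
        if new_outputs[trade] == legacy_outputs[trade]:
            continue                                  # parity: nothing to do
        if new_inputs[trade] != legacy_inputs[trade]:
            issue = identify_input_issue(trade)       # bad input: ours to fix
        else:
            issue = ask_quants(trade)                 # genuine analytics discrepancy
        issues[issue].append(trade)
    # Most impactful first: sort by number of affected trades, descending.
    return sorted(issues.items(), key=lambda kv: len(kv[1]), reverse=True)

Grouping by issue before fixing anything was the whole trick: one batch run could now surface many problems at once instead of just the first one we tripped over.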
Though it was far from ideal, this was still better than what we had been doing. Bug #1 (too many changes in the release) could not be helped, since we had front-loaded development at the beginning of the project, but Bug #2 (annoy the quants) was much improved and so, to a lesser extent, was Bug #3 (pull the emergency brake). We ran fewer batches overall and only reached out to the quants with things that were actionable to them. This enabled us to finally go live with synthetic CDOs in the new platform.
Phase 3: A refactor
Soon enough it was time to migrate the next credit instrument. This provided an opportunity to revisit the migration plan from before. By the time we took on the next credit instrument after that we had landed on the following:
// 1: Change platforms
    a. Keeping all else constant (analytics package, data sources), write the new business logic in the new platform
    b. Implement new data queries against the legacy data sources
    c. Run the batch
    d. Check outputs and track issues
    e. If issues not empty:
        Fix issues
        Goto 1.c (Run the batch)
// 2: Change analytics package
    a. Upgrade the analytics package
    b. Run the batch
    c. Check outputs and track issues
    d. If issues not empty:
        Send issues to quants and await fix
        Goto 2.a (Upgrade the analytics package)
// 3: Change data inputs
    a. For each intended data input change:
        i. Update data query to pull from new data source
        ii. Run the batch
        iii. Check outputs and track issues
        iv. If issues not empty:
            Work with data pipeline team to fix issues
            Goto 3.a.ii (Run the batch)
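Expressed the same way as the earlier sketches, the difference is just where the loop sits: the batch-and-check cycle now lives inside each phase, and a phase has to reach parity before the next one begins. A hypothetical Python sketch, with run_batch, check_parity, and fix as stand-ins, as before:

from typing import Callable, Dict, List, Tuple

# Sketch of the Phase 3 structure: change sets are applied one at a time, and
# each must reach parity with the legacy system before the next phase starts.
def migrate_in_phases(
    phases: List[Tuple[str, Callable[[], None]]],  # e.g. ("platform", ...), ("analytics", ...), ("inputs", ...)
    run_batch: Callable[[], Dict],                 # stand-in for the nightly batch
    check_parity: Callable[[Dict], List[str]],     # stand-in: returns outstanding issues
    fix: Callable[[str, List[str]], None],         # route issues to whoever owns this phase
) -> None:
    for name, apply_change in phases:
        apply_change()                             # one chunk of change at a time
        while True:
            outputs = run_batch()
            issues = check_parity(outputs)
            if not issues:
                break                              # parity reached; move to the next phase
            fix(name, issues)                      # we know exactly where to look: `name`

Each phase also has an obvious owner for its issues, which kept the quants and the data pipeline team out of each other’s (and our) hair.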
This last approach finally allowed us to address Bug #1, too many changes in the release, and gave us our biggest performance boost to date. Because we were changing one chunk of things at a time, we knew where to look when debugging new issues. TTR (Time To Resolution) for each issue shrank dramatically. No more onions. The final migration was delivered at breakneck speed compared to our first adventure with synthetic CDOs.
There are more things we could have improved on. Soon enough, though, we ran out of credit instruments. The migration was complete.
Other contributing factors
This post has been very focused on the impact of our project plan. There were other forces at play, though. I don’t remember if we ever did a formal retro on the synthetic CDO migration to the new risk platform. If we had, and if we’d been familiar with Sailboat retros, then I might have proposed the following additional tailwinds and anchors:
Tailwinds
In the Sailboat analogy, tailwinds are the things that speed you up and get you to your destination faster:
Batch size. Synthetic CDOs were, by far, the largest trade population the NY-based team migrated. Everything else was significantly smaller. The final version of our project plan reduced the number of batch runs, but the batches themselves were also smaller.
The Differ. My manager from the beginning of the project moved on a few years in but left me a gift: the Large Data Differ. This was a tool he had developed in his spare time to diff large column-based data. It was very helpful for migrating credit instruments as it allowed us to compare results much faster than we had before (a rough sketch of the idea follows this list).
Ecosystem stabilized (somewhat). A few years in, our ecosystem was no longer experiencing anything like the level of flux it had at the beginning. By the last migration most of the systems we relied on were fully mature, even if not terribly stable.
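I don’t have the Differ’s code, but the core idea looks something like this pandas sketch. The key column name, the numeric value columns, and the tolerance are all assumptions made for illustration:

import pandas as pd

# Not the actual Large Data Differ, just an illustration of the idea: align two
# column-based result sets on a key and report only the values that disagree.
# Assumes every non-key column is numeric.
def diff_results(new: pd.DataFrame, legacy: pd.DataFrame,
                 key: str = "trade_id", tol: float = 1e-9) -> pd.DataFrame:
    merged = new.merge(legacy, on=key, suffixes=("_new", "_legacy"))
    reports = []
    for col in (c for c in new.columns if c != key):
        delta = (merged[f"{col}_new"] - merged[f"{col}_legacy"]).abs()
        bad = merged.loc[delta > tol, [key, f"{col}_new", f"{col}_legacy"]]
        reports.append(
            bad.rename(columns={f"{col}_new": "new", f"{col}_legacy": "legacy"})
               .assign(column=col)
        )
    return pd.concat(reports, ignore_index=True) if reports else merged.iloc[0:0]

The real Differ presumably had to worry about scale in ways this sketch does not; the point is the shape of the comparison: key on the trade, report only what disagrees.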
Anchors, Headwinds and Rocky Shoals
The things that slowed us down and tripped us up:
Scope creep. We took so long to deliver synthetic CDOs that leadership decided to add more and more features to make the wait worthwhile for our stakeholders. This, of course, had the effect of delaying the migration further as we split focus and reallocated engineers to those additional features.
Massive turnover. Right before I joined Big Bank, the majority of the folks working in my organization were consultants. The incoming leader opted to make the drastic move of replacing these folks with full-time employees. As a result, most of the people who had developed the original legacy risk platform were gone. Luckily my manager was one of the few folks who stayed on and had deep knowledge of the system. He was just one person, though; the rest of us were brand new. This settled after a while as we newbs learned the ropes, but a few years in, attrition went back through the roof and remained high. It was incredibly hard to maintain any kind of project velocity when we were constantly training new hires only to have them leave after 6 months.
Production instability. London beat us to the punch: they delivered their first live functionality into production before we did, the new real-time stuff that the desk had originally been sold on. There was a ton of instability in the beginning and then again periodically whenever trading activity spiked. As both teams delivered more stuff into production, the support needs multiplied to the point where, in my final year, I was easily spending 80% of my business hours on firefighting. I did so under the fog of significant sleep deprivation thanks to desperate calls from our heroic and incredibly hard-working support team at 2am every other night.
Untenable processes. That last year or so was also characterized by a new release approval process implemented with ServiceNow. We naturally gave it the nickname ServiceNever. This was a massively bureaucratic, centrally managed process intended to battle our increasing levels of instability through standardized checks and built-in accountability. Designing and applying a release approval process at this scale was insane: hours-long weekly meetings with dozens of attendees. Endless forms chock full of seemingly repetitive questions which, if not answered in just the right way, would send you to the wrong next set of forms. Begging stakeholders for written sign-off which then had to be uploaded 24 hours before the release window, otherwise the whole thing was canceled until the next week… It was bad; very, very bad.
Organizational dysfunction. I worked with a lot of good people for the most part, but we were set up to compete with each other rather than collaborate. This dynamic showed up in every interaction and it impacted everything from missed deadlines to instability in production to sky-high, sustained attrition.
There’s so much more to say about each of the above themes. I’d like to dive more deeply into some of them in future posts because they strongly colored my time at Big Bank. I’ll end things here, though, for the sake of maintaining the scope of this post.
Conclusion
Brute-force algorithms work well enough most of the time for our machines. Engineers are warned time and time again to avoid premature optimization. Project plans are executed by people, though, who move far more slowly than machines and who need to maintain positive relationships with each other. Spend some time upfront designing your migration plan, keeping in mind:
How you will isolate changes to spot and fix bugs sooner, keeping the dev cycle tight
How your plan will perform under the worst-case scenarios (lots of bugs, delays in upstream data, immature system dependencies, etc.)
How you will measure system parity
How you will interact with other teams and what your friction points will be
Our human compute cycles are precious and so are our relationships with one another. Optimize accordingly.
[1] These aren’t the real team names. I didn’t actually encounter the term ‘data pipeline’ until I made the switch from finance to health tech 4.5 years later. ‘Data pipeline’ makes the most sense in retrospect: they moved data from here to there, transforming it and combining it with other data along the way.
[2] This last bit happened with depressing frequency, particularly toward the end of my time with Big Bank.
[3] I vividly remember the status meetings for this project. There was a big Gantt chart with changes in the rows and dates in the columns. After some discussion we would determine there was no way we’d meet our current deadline and change it. We’d then update the background color from an angry red to a happy green. Phew! Our Director started sitting in on these meetings at some point and was Not Pleased.