Revisiting Spolsky's "Things You Should Never Do, Part I"
Is the best time to do a rewrite truly "never"?
I fleetingly referenced an old Joel Spolsky essay, Where there’s muck, there’s brass, in my last post and it reminded me of just how impactful his writing was for my early development as a software engineer. This prompted me to go back and reread some more of his old essays. Although many of the technologies he referenced have naturally fallen out of favor, the stories he told continue to resonate today. Even if you haven’t read his blog yourself (again, most of his posts are quite old so there probably is a whole generation of engineers now who haven’t seen them), you have probably been influenced by it indirectly because so many others have.
Another one of his essays made a strong enough impression on me that I continue to think about it fourteen years after my first read-through and twenty-four years after it was actually published: Things You Should Never Do, Part I.
In it, Spolsky dissects what he views as a terrible mistake on the part of Netscape’s leadership. Do you remember Netscape? Netscape was the alternative to Internet Explorer when I first started using the internet. What happened to those guys? Spolsky’s essay would have us believe that a poor decision to rewrite the entire codebase from scratch played a significant part; he noted that this decision fatally delayed the release of the next major version by almost three years while Netscape watched its market share take a nose dive. Let’s take a moment to note the timeline: Spolsky’s essay was published back in 2000. According to Failory.com, Netscape was ‘obsolete’ by 1998 when it was acquired by AOL and then finally bit the dust in 2008.
I find Spolsky’s argument to be compelling. It certainly sounds like Netscape made the wrong call, and that the outcome should have been foreseeable. Is it always a terrible idea to rewrite the code from scratch, though? Today I will reflect on a few rewrites that did work out and then noodle on what makes a rewrite a good idea or a total disaster.
Spolsky’s argument
First, though, let’s go over the reasons why a total rewrite is supposedly a terrible idea. You’ll note that most of Spolsky’s argument starts from the assumption that the rewrite is happening in response to perceived tech debt and a visceral reaction on the part of incoming engineers to the hairiness of the existing codebase.
Old code is hairy for a reason. Each one of those hairs represents a weird bug that was detected, fixed, and vetted in real life.
You can’t ship new code during a rewrite1. If this goes on for long enough your competitors will eat your lunch.
You spend money implementing functionality that already exists. This could otherwise be spent tackling tech debt in place while continuing to deliver user-facing improvements.
You might not have access to the original developers. This is especially true for long-lived codebases. Hopefully everyone documented as they went. Even if that’s the case, though, you just don’t have all the context that they did when they decided to do things the way that they did.
Please do go back and actually read Spolsky’s essay. I listed out the major points above just for the sake of having things in one place but this obviously isn’t a substitute for the original.
Successful Rewrite #1: A failed product
The first successful rewrite that came to mind for me was a small product that bolted on to the much larger flagship product at my company. Our users lived in the larger one where they performed the vast majority of their workflows. Our little product took a workflow that they would otherwise need to perform somewhere else and brought it home.
The concept itself was a good one. Unfortunately, the workflow we were bringing in was powered by third-party content which our clients really didn’t like. They would have preferred we use different content, which wasn’t yet available in a digital format that we could ingest. We were trying to sell something nobody wanted.
Low adoption eventually triggered the decision to cease investment and we tore it down. We tried to revive the product a couple times with different content vendors but never quite hit the right note.
A few years later I was surprised and honestly frustrated to hear that we were at it again. A product manager from another part of the business had outsourced development of the prototype and a team I led was expected to take it over. Worse, a few members of my team had participated in one of the past failed attempts and were thoroughly demoralized. Argh. This time turned out to be different, though. The business partnerships were spot on and we had our preferred content. The team slowly took ownership of the code and this product went on to become highly successful.
Successful Rewrite #2: Another failed product
Another product at the same company also faced the chopping block. This one was underused, under-supported, and implemented in a tech stack we didn’t like. The product was torn down, its source code archived.
A few years later the business finally decided that it was time to solve the problem this product had originally been designed for. This time the company spun up a new, fast moving team to create a modern solution built on our preferred tech stack. They operated like a mini-startup and delivered their solution at record speed. It went on to become another highly successful product.
(Painful but) Successful Rewrite #3: A fundamental re-architecting
I wrote about this next one in a previous post, A Tale of Two Migrations2. We had a legacy platform which calculated risk measures on credit instruments (bonds, CDSs, etc.) in these big nightly batches. New leadership for the org rightly observed that innovation had stalled for quite some time, so they made a bunch of sweeping changes. This included replacing our legacy risk platform with a new one which would also have all kinds of exciting real-time functionality.
The legacy risk platform was big and hairy. Worse, almost everyone who had contributed to it was gone with the very important exception of my manager and maybe one or two others. It took a hellaciously long time for our traders to get the real-time functionality they were promised and even longer for us to migrate the original batch functionality over.
Teams who were closer to the desk basically ended up running interference by providing them with new reporting functionality and fancy spreadsheets while we slogged through our years-long rewrite. We did finally get there, though. The desk got their real-time functionality and our nightly batches moved over, allowing us to put the legacy platform to bed. Was the new thing actually better? Not from a tech debt perspective; if anything, the new platform felt significantly hairier than the legacy one had. Fresh hires would immediately bemoan the state of the ‘new’ codebase. The rewrite had taken long enough that we had acquired a legacy sheen by the time we went live. From a product perspective, though, yes: it delivered new functionality that would not have fit into the design of the legacy batch approach, and new applications proliferated once we hit our inflection point. The risk platform proved flexible enough to handle a bunch of different use cases and moved the needle for our desks and other stakeholders.
(Painful but) Successful Refactor #1: The beast
This is another one that I wrote about recently. This story is a refactor instead of a rewrite and I’m including it for the sake of comparison. This was, by far, the hairiest and largest codebase that I’ve worked on to date. It was also a critical part of the company’s business strategy.
A small part of why I joined the company and accepted my role working with the beast was their decision to refactor instead of rewrite. I thought this was a very smart move given my experience with Rewrite #3. Spolsky’s essay was definitely top of mind during those conversations.
So I took the role, and the refactor was indeed very hard. The team responsible for it, though, was incredibly intentional in how they went about their modernization efforts. There were two streams that would flow in parallel. The first was composed of engineering hygiene-type improvements like automated testing, CI/CD, observability tooling, and integrated code review. The other was an in-place restructuring of the code that allowed us to move to a more modern framework with clear separation of concerns.
During this time we were allowed to release smaller user-facing improvements and, of course, bug fixes. Larger deliverables, however, had to wait until that second stream, the major code restructuring, was complete. This took a long time but we made it. The pace of delivery after this point took off. The refactor may never truly be done (there’s always more to do), but it all runs in parallel with user-facing improvements. I feel strongly that refactoring was the right call. Knowing what I know now from my time with the beast, I am convinced that a rewrite would have spelled disaster for the product.
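To make that second stream a little more concrete, here is a minimal sketch of the general “restructure in place” pattern: pin the current behavior with a characterization test, then carve the tangled logic out behind a clean seam so that newer code can depend on the smaller pieces while the old entry point keeps working. This is purely illustrative; the function names, the pricing logic, and the choice of Python are my own stand-ins, not anything from the beast.

```python
# Hypothetical sketch of one in-place restructuring step.
# Names and logic are invented for illustration only.

# Step 1: pin down what the hairy legacy function does today.
def legacy_invoice_total(order: dict) -> float:
    """Original tangled entry point: parsing, pricing, and rounding mixed together."""
    total = 0.0
    for line in order.get("lines", []):
        price = float(line["unit_price"]) * int(line["qty"])
        if order.get("customer_tier") == "gold":
            price *= 0.9  # long-forgotten discount rule, kept because it's load-bearing
        total += price
    return round(total, 2)


def test_characterization_matches_current_behavior():
    # Whatever the legacy code does today is treated as "correct".
    order = {"customer_tier": "gold",
             "lines": [{"unit_price": "10.00", "qty": 3}]}
    assert legacy_invoice_total(order) == 27.0


# Step 2: extract the behavior behind a clean seam. The legacy entry point
# becomes a thin wrapper, so existing callers keep working while new code
# depends on the smaller, testable pieces.
def price_line(line: dict, tier) -> float:
    price = float(line["unit_price"]) * int(line["qty"])
    return price * 0.9 if tier == "gold" else price


def invoice_total(order: dict) -> float:
    tier = order.get("customer_tier")
    return round(sum(price_line(line, tier) for line in order.get("lines", [])), 2)


def legacy_invoice_total_v2(order: dict) -> float:
    # Behavior is unchanged, which the characterization test continues to verify.
    return invoice_total(order)
```

The characterization test is what makes this kind of restructuring safe to run alongside bug fixes and small features: as long as it keeps passing, the observable behavior hasn’t changed, no matter how much the internals move around.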
Comparison time
I didn’t finish the Netscape story. Maybe you already know what happened after they shut down but I didn’t. Luckily the internet (Failory.com again) told me: they open-sourced their code. That code went on to form the backbone of Mozilla/Firefox, which gave IE a run for its money for many years until Chrome came along and blasted them both out of the water3. There is a more colorful version of the story, published by Mozilla itself, available here.
Netscape the company never saw the fruits of their rewrite but we, the users, did. They didn’t fail to create a stronger product; they failed to survive the consequences of taking too long to ship while battling a competitor that came pre-installed with the dominant OS of the day. When I look back at Rewrite #3, the risk platform, I have no doubt that we would have gone belly up if not for the fact that 1) we had another team running interference with a reporting product and 2) all our stuff was in-house. We had a captive user base. They had nowhere to go.
So how can we predict whether a rewrite will work out vs crash and burn? Looking at my little stories, success seems more likely if:
You can survive the time it will take to complete the rewrite. This one seems obvious, but accurately estimating the time a rewrite will take is very hard. Netscape probably wouldn’t have chosen to do the rewrite if they had known how long it was going to take. Rewrites seem to scale exponentially4 in time, pain, and suffering with the size and hairiness of the codebase. We wouldn’t have survived our long and arduous Rewrite #3 if our users could have up and left at any time5. It’s worth pointing out that Refactor #1 also took a very long time; it was technically still going when I moved on. Spolsky predicted that this would be manageable, though, and it was. During the refactor we were still able to deliver customer-facing improvements. Things slowed down a bit and our clients did not like it, but they were still getting new things. They didn’t leave, we made it through the worst of the initial refactor, and the pace of feature delivery subsequently increased.
The complexity of the codebase is low. Again, it’s really hard to estimate the time a rewrite will take for a larger, hairier codebase. Your estimate will naturally be shorter and more accurate for a smaller project. Rewrites #1 and #2 were both for small and relatively simple codebases. There wasn’t a whole lot of logic to rewrite by comparison.
The product you are rewriting is not the primary breadwinner for your business. When I refer to Netscape as a product, I mean the browser. I do this because the browser was their big thing. Taking multiple years to rewrite your flagship product without shipping any user-facing improvements is a business killer. One of the reasons that Rewrite #3 proved to be non-fatal to the org was that we had other teams who were busy shipping highly visible improvements via their own products. Our team didn’t look great but as an org we continued to deliver for the desk. This is related to the next point:
There just isn’t all that much to lose. The products in Rewrite #1 and Rewrite #2 were pretty unsuccessful in their original incarnations; one had extremely low traffic while the other was basically rejected outright by our clients. It’s worth pointing out the timelines: neither of these smaller rewrites started until a few years after the original products had been shut down. We had already determined that those products were not worth supporting in their current forms. We tore down the code and didn’t reinvest until the business case made sense again, which brings me to:
There is a clear and highly compelling business case for the rewrite. Wanting to eliminate a mountain of tech debt purely for engineering reasons feels good and right, but it isn’t enough. The rewrite either needs to directly serve the business or it needs to strongly accelerate business objectives in the near-to-mid term. Spolsky’s analysis of the Netscape rewrite makes it sound like it was driven solely by engineering intolerance toward tech debt and hairy code. I don’t know if that’s a fair representation (he got pretty snarky with this one6), but I certainly agree that this is not a strong enough reason to rewrite a large and successful product.
Conclusion
I’m hardly the first person to respond to Things You Should Never Do, Part I! It’s been around since 2000 and it’s a provocative piece, so there’s a lot of great writing out there. For me, though, I would boil it down to this:
If your only reason for completely rewriting your codebase is tech debt and visceral ick, don’t do it. If the business case is strong, then think critically about the size and complexity of the code as it stands today. If it’s small, then a rewrite may be manageable. Otherwise, stick to a carefully planned in-place refactor with incremental improvements. The refactor may take a very long time, but remember that the cost of a rewrite can quickly spiral out of control for large, hairy codebases. Either path is slow going, but at least with a refactor you get to deliver value along the way and stay alive.
Technically you can make feature changes in parallel: one version in the old, one in the new. This essentially doubles the work and creates its own engineering ick.
I am sorry for the odd graphic in that post. It’s an onion from my pantry. I shouldn’t be trusted with image editing.
My robust hand-waving parallels the story-pointing practice some teams use where tickets are assigned points from the Fibonacci sequence. The smallest ticket you can imagine is a 1-pointer. Next smallest is a 2-pointer. Next 3, then 5, then 8, etc. Most teams I know stop once you hit 5 or 8 and break the ticket down into smaller ones. This threshold tells us that the ticket is 1) too big/complex and 2) too hard to estimate well.
To clarify, I am not actually suggesting that you torture your users with long timelines just because they are stuck with you. Torturing users is Bad.
Spolsky’s closing sentence is brutal: “…throwing away the whole program is a dangerous folly, and if Netscape actually had some adult supervision with software industry experience, they might not have shot themselves in the foot so bad” [emphasis mine]. Eep. But wait, there’s more! Netscape was led by Marc Andreessen, who went on to do many very big things, including co-founding the VC firm Andreessen Horowitz. Spolsky, meanwhile, founded and co-founded several startups, including Stack Exchange. Do you know who provided a lot of the investment for Stack Exchange? Andreessen Horowitz! Check out the announcement here. Spolsky had nicer things to say at that point: “We’re ecstatic to have Andreessen Horowitz on board. The partners there believe in our idea of programmers taking over (it was Marc Andreessen who coined the phrase “Software is eating the world”)”.