The Software Rewrite

Greg Utas

4 Mar 2022 · GPL3 · 17 min read

Open-heart surgery on your Big Ball of Mud

Software rewrites often fail, so this article discusses how to avoid them. But if the situation has gotten out of hand, how do you decide if a rewrite is warranted? And if it is, how do you improve its chances for success?

Introduction

A Big Ball of Mud continues to cause suffering because rewriting it is daunting and because stories of failed rewrites abound. If the suffering reaches the point where everyone finds it intolerable, the question of whether to rewrite the software finally arises. This article discusses

how to reduce the risk of eventually having to rewrite your software;
if you are considering a rewrite, the prerequisites for embarking on one; and
once you've decided to rewrite, how to improve your chances of success.

Case study. Much of this article is based on experience with a product that was headed for cancellation. It didn't provide the throughput that customers needed, and they wanted new features that would be hard to implement. It was a latecomer to its market and needed to increase its market share, so it would probably get cancelled if its throughput and the productivity of its developers did not improve significantly. It was therefore decided that the product had to be rewritten. The rewrite saved the product, which is still undergoing development over 20 years later.

On the Rewrite Road

Avoiding a Rewrite

Unlike other engineering products, the essence of software systems is change. A request to lengthen a bridge by 50 meters as part of its "next release" would provoke derisive laughter, but such expectations are commonplace for software products.

If your system's original design did not anticipate a new requirement, adding it can be hard because it won't mesh cleanly with your existing architecture. If you continually ram these types of requirements into your system, it will degrade into a Big Ball of Mud. The term technical debt describes this situation. Productivity declines. Bugs arise more often and become harder to fix. Eventually, people start to float the idea of a rewrite.

Rewrites can usually be avoided by refactoring, which is the process of continually reworking software to maintain a clean overall design. At the limit, refactoring and rewriting are similar. But refactoring is gradual and spreads its costs over a product's lifetime, whereas a rewrite is disruptive and imposes a much greater overhead while it is underway. Refactoring is usually seen as evolutionary, but a rewrite will be seen as revolutionary.

Although the importance of refactoring seems to be gaining acceptance, it still often meets resistance. There are two major reasons for this:

Managers want everyone to work on new features.
If it ain't broke, don't fix it.

The first reason is understandable because new features generate revenue; refactoring doesn't. However, this is a blinkered view because it only focuses on what is seen, while ignoring what is not seen. What is not seen are the unintended consequences of not refactoring: the system degenerating to the point where, even though everyone is working on new features, progressively fewer features get delivered as productivity and quality decline.

This is not to say that refactoring is always paramount. If customers need a bug fixed now, then fix it however you can, and damn the architecture. But this should happen in the release branch. A proper fix, which includes whatever refactoring is called for, should later be merged into the development branch.

The second reason for not refactoring—not wanting to fix something that isn't broken—is also understandable. The "fix" might introduce a bug, and bugs need to be found and repaired. And because it isn't broken per se, fixing it is also seen as wasting time that should be spent working on a new feature. So again, refactoring doesn't happen. A former colleague referred to this as verification inertia: once software has been tested and is known to work, no one wants to change it unless doing so is unavoidable while fixing a bug or developing a new feature.

To overcome verification inertia, automated testing is vital. The cost of developing an automated testing capability will soon be recovered. It will eliminate the cost of manually executing an ever-growing set of regression tests before each software release. And just as importantly, it will encourage refactoring by making it easy to retest modified software to confirm that it still works properly.
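As a minimal sketch of how an automated test makes refactoring safe, the example below pins the behavior of a tangled routine and checks that a cleaner replacement returns the same results. Both functions and the shipping-cost rules are hypothetical, invented only for illustration; the test uses Python's standard unittest module.

```python
import unittest


def legacy_shipping_cost(weight_kg, express):
    """Hypothetical legacy routine: clumsy, but known to work."""
    cost = 5.0
    if weight_kg > 10:
        cost = cost + (weight_kg - 10) * 0.5
    if express == True:
        cost = cost * 2
    return cost


def refactored_shipping_cost(weight_kg, express):
    """Hypothetical refactored version that must preserve behavior."""
    surcharge = max(weight_kg - 10, 0) * 0.5
    multiplier = 2 if express else 1
    return (5.0 + surcharge) * multiplier


class ShippingCostRegressionTest(unittest.TestCase):
    # Inputs cover both branches and the boundary at 10 kg.
    CASES = [(0, False), (10, False), (10.5, True), (25, False), (25, True)]

    def test_refactoring_preserves_behavior(self):
        for weight_kg, express in self.CASES:
            with self.subTest(weight_kg=weight_kg, express=express):
                self.assertEqual(
                    refactored_shipping_cost(weight_kg, express),
                    legacy_shipping_cost(weight_kg, express),
                )
```

Saved in a module, this runs with `python -m unittest`. Once such a test exists, retesting after each change costs seconds, which removes most of the excuse for leaving working code alone.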

Rewrites are best avoided, so you need to refactor. And to refactor, you need to automate your tests.

Case study. Although the product had automated tests, it had been developed quickly, using a prototype application framework. This got it to market, but it soon became clear that new requirements would be hard to implement. Refactoring alone couldn't address the challenges, so a more aggressive rewrite was needed.

Contemplating a Rewrite

Once a system has become a Big Ball of Mud, should it be rewritten? Before committing to a rewrite, you need to satisfy several prerequisites.

Your product's requirements must be well understood, and you must be able to efficiently determine if the new software satisfies them.

If your product adheres to industry standards, such as those published by the IETF, that will go a long way to satisfying this criterion. If not, you should have a product specification that was written before development originally started, and other documents should describe the capabilities that were added later.

If your product's requirements are not documented, you must document them and verify them by having people familiar with the product review them. These reviewers should include customers. Embarking on a rewrite without knowing what needs to be delivered is clinically insane.

If your product has a good set of tests, they will help to confirm that the product is meeting requirements. If the tests are not automated, include the cost of automating them in the cost of the rewrite. You're probably in this mess because no one wanted to refactor software that had to be manually retested.

Case study. The product had industry standards to follow, and it also had a large suite of automated tests.

Your architect must be confident that the rewrite will significantly improve productivity.

If your reaction was "What architect?", that's one reason you're in a bad place. Forget rewriting until you find an architect with experience in your problem domain. The architect will then need time to study your product's requirements and its existing software before estimating how much a rewrite could improve productivity.

If your architect states that they could lead a rewrite that would significantly improve productivity, review the proposed design with your senior developers to see if they generally agree with it, because it is important that they commit to the rewrite.

The larger your system, the more likely it is that you will need an application framework to significantly improve productivity. Ideally, the framework should be developed internally so that it can evolve to cleanly support new capabilities. However, an external framework could be selected if the architect is confident that it meets your product's needs and if its vendor is responsive to its users.

Your framework should define base classes and many of the collaborations between them. It is actually the embodiment of your system's object model. Its base classes should often provide default behaviors that developers can customize as appropriate. A well-designed framework

provides reusable components, speeding development time;
eliminates superfluous diversity, making it easier for developers to understand each other's code; and
is plug-and-play, allowing new capabilities to be added without modifying existing code.
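The three properties above can be illustrated with a small sketch. It is not taken from the product in the case study: the class names, the message types, and the registration mechanism are all assumptions chosen to keep the example short. The base class owns the collaboration (look up, validate, process), supplies a default that subclasses may override, and lets a new capability be added by defining a new subclass without editing existing code.

```python
from abc import ABC, abstractmethod


class MessageHandler(ABC):
    """Hypothetical framework base class for one kind of message."""

    _registry = {}  # maps message type -> handler class

    def __init_subclass__(cls, message_type=None, **kwargs):
        # Plug-and-play: defining a subclass registers it automatically.
        super().__init_subclass__(**kwargs)
        if message_type is not None:
            MessageHandler._registry[message_type] = cls

    def validate(self, payload):
        """Default behavior: accept any non-empty payload."""
        return bool(payload)

    @abstractmethod
    def process(self, payload):
        """Subclasses supply the capability-specific work."""

    @classmethod
    def dispatch(cls, message_type, payload):
        """Framework-owned collaboration: look up, validate, process."""
        handler_cls = cls._registry.get(message_type)
        if handler_cls is None:
            return f"no handler for {message_type}"
        handler = handler_cls()
        if not handler.validate(payload):
            return "rejected"
        return handler.process(payload)


class EchoHandler(MessageHandler, message_type="echo"):
    # Relies on the default validate().
    def process(self, payload):
        return payload


class SumHandler(MessageHandler, message_type="sum"):
    def validate(self, payload):
        # Customizes the default: every item must be a number.
        return bool(payload) and all(isinstance(x, (int, float)) for x in payload)

    def process(self, payload):
        return sum(payload)
```

Calling `MessageHandler.dispatch("sum", [1, 2, 3])` returns 6, while `MessageHandler.dispatch("sum", ["a"])` returns "rejected". Because every handler follows the same shape, a developer who understands one of them can read the others.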

Case study. The architect had spent lots of time thinking about an application framework that would double, and possibly even triple, productivity. A more precise estimate was irrelevant because the product would probably get cancelled if the rewrite failed or fell short of what was needed to save the product.

You must recover the cost of the rewrite over your product's lifetime.

To determine whether this requirement will be met, you need to answer a number of questions.

How much more productive will developers become if the product is rewritten? Let's say that productivity will improve by a factor of p. It is likely that p needs to be at least 2, and maybe more, to recover the costs of the rewrite.

How many staff-years have been spent developing the software to be rewritten? Let's call this number k, in which case the cost of the rewrite should be about k/p. But it's not quite that simple, because there is also an opportunity cost when developers are rewriting existing features instead of developing new ones. During that time, your product will generate less revenue.

What is the anticipated lifetime of the product if it is—or is not—rewritten? Rewriting a product that will soon be obsolete is a non-starter. Is a competitor likely to displace the product before the rewrite can be completed? Or could the rewrite significantly extend your product's lifetime? If it could, it starts to look very appealing.

Now we can start to estimate whether the rewrite will recoup its cost. Let's say that

the anticipated lifetime of the product is m years without a rewrite and n years with one,
the number of developers working on the product is d, and
the rewrite will take r years.

We could also introduce f, for the number of features delivered per staff-year, but this is unnecessary because f can be scaled to 1 before the rewrite and p after it. So we'll just say that d developers produce d features per year before the rewrite and pd features thereafter.

Thus, not rewriting yields dm more features. Rewriting yields dr - k/p features during the rewrite, and pd(n-r) more features after it is finished. So for the rewrite to recoup its cost, it must hold that

d(r - m + p(n - r)) > k/p

That is, the extra features that will be delivered because of the rewrite must more than cover its cost.
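For readers who want the intermediate steps, the inequality follows from comparing the two feature totals, using only the symbols defined above:

```latex
\begin{align*}
\text{features without a rewrite} &= dm \\
\text{features with a rewrite} &= \left(dr - \frac{k}{p}\right) + pd(n - r) \\
dr - \frac{k}{p} + pd(n - r) &> dm \\
d\bigl(r - m + p(n - r)\bigr) &> \frac{k}{p}
\end{align*}
```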

However, we have assumed a couple of things that aren't necessarily true.

First, we've assumed that revenue is directly proportional to the number of features delivered. This is dubious because, in any release, the value of the lowest priority feature is usually less than that of the highest priority one. Doubling the number of features delivered, for example, is unlikely to double revenue.

Second, we've assumed that the number of developers is constant. However, it is almost certain that more developers could be added once an application framework is in place. This is because a good framework will allow more developers to work in parallel without getting in each other's way.

It makes sense to deliver more features so long as the revenue from adding another feature is greater than the cost of adding another developer. Given that customers usually demand more features in a release than can be delivered, this will often be the case.

If you think you will be able to add more developers after the rewrite, we can replace d with d₀, the number of developers before, and d₁, the number of developers after, in which case the formula becomes

d₀(r - m) + pd₁(n - r) > k/p

If it still isn't clear whether the rewrite makes sense, put together a spreadsheet that assesses both options by estimating the revenues in each year. To do this, you need a high-level project plan for the rewrite, which is something that we will outline in the next section. And let's not forget that there's also a third option: stop new development, put the product in sustaining mode, and milk it until it runs dry. Even after no more features are delivered, the product will continue to generate revenue for a time.
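A rough sketch of what the rows of such a spreadsheet might compute is shown below. It makes assumptions that go beyond the article: the rewrite cost k/p is spread evenly over the r rewrite years, feature counts serve as a crude stand-in for revenue, and the sustaining-mode option is left out because it delivers no features. The input values are hypothetical and chosen only to keep the table easy to follow.

```python
import math


def yearly_features(d0, d1, p, k, r, m, n):
    """Return a list of (year, no_rewrite, rewrite) feature counts.

    Assumptions of this sketch, not of the article: the rewrite cost k/p
    is spread evenly over the r rewrite years, and feature counts are a
    crude stand-in for revenue.
    """
    rows = []
    for start in range(math.ceil(max(m, n))):
        end = start + 1
        # Portion of this year in which the un-rewritten product is alive.
        old_alive = max(0.0, min(end, m) - start)
        # Portions of this year spent during and after the rewrite.
        during = max(0.0, min(end, r) - start)
        after = max(0.0, min(end, n) - max(start, r))
        no_rewrite = d0 * old_alive
        rewrite = d0 * during - (k / p) * (during / r) + p * d1 * after
        rows.append((end, no_rewrite, rewrite))
    return rows


# Hypothetical inputs, chosen only to make the table easy to follow.
rows = yearly_features(d0=10, d1=10, p=2, k=20, r=1, m=3, n=5)
total_old = total_new = 0.0
for year, no_rewrite, rewrite in rows:
    total_old += no_rewrite
    total_new += rewrite
    print(f"year {year}: no rewrite {total_old:6.1f}  rewrite {total_new:6.1f}")
```

With these inputs, the rewrite's cumulative total starts behind, draws level at the end of year 2, and is ahead from year 3 onward. A real spreadsheet would replace the feature counts with revenue estimates, since the value of each additional feature tends to diminish.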

Case study. No one ran this formula for the product. It was clear that it had to be rewritten and that it would see many more years of development if it survived. Had the analysis been done, approximate numbers would have been

d₀=50, r=1.5, m=3, p=2, d₁=100, n=12, k=80

Plugging these numbers into the above formula produces

50(1.5 - 3) + 2 × 100(12 - 1.5) > 80/2 ⇒ 2025 > 40

Today, the value of n has reached 22, though the size of the development group (d₁) is now far smaller because the product has matured and has fewer feature requests. But the formula would have reinforced the decision to rewrite.
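The arithmetic in this case study can be checked with a few lines of Python. The function names are illustrative; the inputs are the approximate figures given above.

```python
def rewrite_gain(d0, d1, p, r, m, n):
    """Left-hand side: extra features delivered because of the rewrite."""
    return d0 * (r - m) + p * d1 * (n - r)


def rewrite_cost(k, p):
    """Right-hand side: staff-years needed to redo k staff-years of work."""
    return k / p


gain = rewrite_gain(d0=50, d1=100, p=2, r=1.5, m=3, n=12)
cost = rewrite_cost(k=80, p=2)
print(gain, cost, gain > cost)  # 2025.0 40.0 True
```

Keeping the two sides as separate functions makes it easy to vary one estimate at a time and see how sensitive the decision is to it.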

Managing a Rewrite

This article doesn't advocate specific project management or software development methodologies. It simply assumes that you will continue to use your existing processes or adopt new ones if you think it necessary. The article is limited to recommendations that are especially relevant to rewrites.

Consider the branding.

To obtain approval for the rewrite, it might be better to refer to it as reengineering, which will sound more professional and a little less scary to some senior management types.

Limit the scope.

Strive to reuse some things from your Big Ball of Mud. Entire subsystems might fit into the new architecture without compromising it, particularly if you build wrappers around them. Limiting the rewrite to the areas that have become quagmires significantly improves its chances for success. That's why the tagline of this article uses the metaphor of open-heart surgery on your Big Ball of Mud, not shooting it in the head.

Managers will be fearful of the rewrite and may therefore try to overly restrict its scope. As an architect, you may need to downplay how much is being rewritten. Managers will eventually learn the truth, but by then it will be too late to divert the rewrite from its proper path.

Case study. The product was built on a proprietary platform, whose operating system was reused in its entirety. So too were several subsystems involved with configuring and operating the product. The rewrite focused on applications. Its scope was downplayed, and the product's VP was surprised when the full story eventually came out.

Avoid customer-visible changes.

If you plan to change how your product operates—its GUI, for example—it is essential that you consult with the people who operate it. Not their management, but the people who actually operate it. It is easy to fall into the trap of delivering what you are certain is a better user interface, only to meet resistance from those who already know how to operate your product and don't want to have to learn how all over again.

Case study. The rewrite focused on applications and reused subsystems involved with configuring and operating the product. This avoided introducing customer-visible changes.

Keep the train moving.

You may have noticed the assumption, so far unstated, that your product already has customers. If it doesn't, you've built a prototype, not a product, so follow Fred Brooks' advice and throw it away.¹ Or if marketing it now is just too important, commit to starting the rewrite immediately after your initial release. If your product is that innovative, it should be able to keep its customers captivated while you're busy with the rewrite.

Customers always want new features, so they will be unhappy if you tell them that you're halting development for a rewrite. They might even start to wonder if you're incompetent and whether they should replace your product. So it's crucial that you keep giving them the important features that they need, which will also bring in revenue during the rewrite.

You therefore need to split your development team in two. One group will continue to deliver features on the existing code base while the other group works on the rewrite. When the rewrite has reimplemented enough of your product's capabilities, move everyone onto the new software for the next release, during which you finish rewriting any missing features while also delivering new ones.

Case study. The development team was split in two, as described, so that new features could still be delivered during the rewrite.

Build it in stages.

Splitting the development team in two during the rewrite, so that you can continue to deliver new capabilities, has another benefit. It prevents you from trying to rewrite too much at once, which would significantly increase the risk of failure.

Particularly when the rewrite involves developing an application framework, time will be needed to prepare it for widespread development. It is therefore prudent to start small, by assigning a small group of skilled developers to the rewrite. Their task is to implement the framework, along with some capabilities that cut across all of the system's layers and components. Only after a stable foundation exists should you assign a much larger group to the new software.

Case study. The rewrite spanned three releases—about 18 months—and involved a team of about 50 developers.

In the first release, six developers implemented the new framework and a handful of applications, with the rest of the team continuing to work on the old code base.
During the second release, half of the team reimplemented existing features on the new code base, while the other half continued to deliver features on the old code base.
In the third release, the entire team moved onto the new code base, with about half of them still rewriting previously existing features.

Hold a weekly design meeting.

As an architect leading the rewrite, schedule a weekly meeting that can last up to three hours. The focus of this meeting will change over time. At first, its main purpose will be to teach developers who have recently joined the rewrite about the application framework. Later, as these developers begin to rewrite various capabilities, they will start to ask how to implement challenging requirements. Sometimes, you will learn about a requirement that was overlooked during the framework's design. If that requirement recurs, you must seriously consider evolving the framework so that the requirement can be implemented cleanly.

Start the meeting at 9:00 a.m. This helps to impose a noon deadline for its conclusion and allows developers to get work done in the afternoon, when discussions held during the meeting are still fresh in their minds. And schedule it for Wednesdays. No one wants to begin their week in a meeting, and too many Tuesdays fall into that category because of Monday holidays. Fridays are out because, unless everyone is working to meet a deadline, some people will be taking the day off. Wednesdays are therefore best, although Thursdays are also reasonable.

Publish minutes of the meeting so that those who couldn't attend also benefit from the discussions. Write the minutes yourself, because they need to include summaries of how to use the framework and how it will evolve to accommodate things that were overlooked.

Meetings can be unproductive and tedious, but this one has a clear purpose—design—that developers enjoy. Its benefit to you is that, without it, you will be constantly interrupted for consulting. The meeting cuts this to a manageable level by consolidating much of your consulting time and by reducing the number of times that you answer the same questions. It is popular with developers because they learn about the entire system: they hear what colleagues are working on, what kinds of issues are arising, and how to implement challenging requirements or actually evolve the framework to support them.

Case study. During the rewrite, the weekly design meeting was well attended even though it was optional. Sometimes it ran the full three hours, and sometimes it ended in under two. It ran largely as just described.

Summary

Refactor to avoid the eventual need for a rewrite. To encourage refactoring, automate your tests.

Before deciding to rewrite, be confident that

your product's requirements are well understood;
you will be able to efficiently determine if the new software satisfies these requirements;
your architect knows that the rewrite will significantly improve productivity; and
you will recover the cost of the rewrite over your product's lifetime.

After you decide to rewrite, improve your chances of success by

limiting the rewrite's scope;
avoiding customer-visible changes;
continuing to deliver features on the existing code base;
building the new software in stages; and
scheduling a weekly design meeting.

Notes

¹ "Plan to throw one away; you will, anyhow." In The Mythical Man-Month, 1975 (first edition), page 116. Note that Brooks uses architect to mean a product architect who specifies a system's capabilities and behaviors, whereas this article uses it to mean a software architect who determines its high-level design.

History

1st November, 2020: Initial version

License

This article, along with any associated source code and files, is licensed under The GNU General Public License (GPLv3)

Written By

Greg Utas

Architect

United States

Author of Robust Services Core (GitHub) and Robust Communications Software (Wiley). Formerly Chief Software Architect of the core network servers that handle the calls in AT&T's wireless network.
