4
Why Is It Hard?

On the surface, each legacy modernization project starts off feeling easy. After all, a working system did exist at one point. Somehow the organization managed to figure out enough to put something into production and keep it running for years. All the modernizing team should need to do is simply repeat that process using better technology, the benefit of hindsight, and improved tooling. It should be easy.

But, because people do not see the hidden technical challenges they are about to uncover, they also assume the work will be boring. There’s little glory to be had reimplementing a solved problem. An organization about to embark on such an undertaking craves new features, new functionality, and new benefits. Modernization projects are typically the ones organizations just want to get out of the way, so they usually launch into them unprepared for the time and resource commitments they require.

I tell my engineers that the biggest problems we have to solve are not technical problems, but people problems. Modernization projects take months, if not years, of work. Keeping a team of engineers focused, inspired, and motivated from beginning to end is difficult. Keeping their senior leadership prepared to invest over and over in what is, in effect, something they already have is a huge challenge. Creating momentum and sustaining it are where most modernization projects fail.

By far, the biggest momentum killers are the assumptions that tell us the project should be easy in the first place. They are, in no particular order, the following:

  • We can build on the lessons learned from the old system.
  • We understand the boundaries of the old system.
  • We can use tools to speed things up.

Let’s spend a little time discussing why these obvious truths might not be as useful as they seem.

The Curse of Hindsight

In poker, people call it resulting. It’s the habit of confusing the quality of the outcome with the quality of the decision. In psychology, people call it a self-serving bias. When things go well, we overestimate the roles of skill and ability and underestimate the role of luck. When things go poorly, on the other hand, it’s all bad luck or external forces.

One of the main reasons legacy modernization projects are hard is because people overvalue the hindsight an existing system offers them. They assume that the existing system’s success was a matter of skill and that they discovered all the potential problems and resolved them the best possible way in the process of building it initially. They look at the results and don’t pay any attention to the quality of the decisions or the elements of luck that produced those results.

Of course, more often than not, very little documentation regarding the original decisions remains for them to study in the first place. Still, overlooking the role that plain luck plays in the success of any project means the team thinks they have room for extra innovations on top of the original challenge.

Software can have serious bugs and still be wildly successful. Lotus 1-2-3 famously mistook 1900 for a leap year, but it was so popular that versions of Excel to this day have to be programmed to honor that mistake to ensure backward compatibility. And because Excel’s popularity ultimately dwarfed that of Lotus 1-2-3, the bug is now part of the ECMA Office Open XML specification.
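
To see what honoring that mistake looks like in practice, here is a minimal Python sketch (an illustration, not Excel’s actual code) of converting serial numbers in the 1900 date system: serial 60 has to be treated as the nonexistent February 29, 1900, and every later serial shifted by one day.

    from datetime import date, timedelta

    def excel_serial_to_date(serial: int) -> date:
        # The 1900 date system preserves Lotus 1-2-3's bug: 1900 is treated
        # as a leap year, so serial 60 is the phantom date Feb 29, 1900.
        if serial == 60:
            raise ValueError("serial 60 is the nonexistent 1900-02-29")
        if serial > 60:
            serial -= 1  # skip the phantom day for everything after it
        return date(1899, 12, 31) + timedelta(days=serial)

    print(excel_serial_to_date(59))  # 1900-02-28
    print(excel_serial_to_date(61))  # 1900-03-01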

Success and quality are not necessarily connected. Legacy systems are successful systems, but that does not mean every decision made in designing and implementing them was the right decision. Most people think they know that, but they go in the wrong direction with it. They’re cynical about the system, but despite that, they overload the road map with new features and functionality. No matter how critical of the system they appear to be, they still assume the underlying problem has been solved.

We struggle to modernize legacy systems because we fail to pay the proper attention and respect to the real challenge of legacy systems: the context has been lost. We have forgotten the web of compromises that created the final design and are blind to the years of modifications that increased its complexity. We don’t realize that at least a few design choices were bad choices and that it was only through good luck the system performed well for so long. We oversimplify and ultimately commit to new challenges before we discover our mistakes.

Being dismissive of legacy systems is no guarantee that we won’t also fall into the trap of relying on context that is lost. Remember the game I described in Chapter 3 when looking at what parts of the system shouldn’t be in COBOL? It’s a useful technique even when COBOL is not a factor. By challenging my team to design a system with the same requirements as our legacy system, using only technology available at the time the legacy system was built, we’re forced to recover some context. Many of the “stupid” technical choices from the legacy system suddenly seem very different. Once forced to look directly at the context, we realize how innovative some of those systems really were. This gives us a little insight into which decisions were skill and foresight and which were luck.

A successful system could have a design pattern that will not survive past a certain scale of usage but that was able to achieve its operational goals without ever crossing that threshold. Is that skill or luck? If the designers knew the system would not scale but also knew the system would never reach the point where it would need to scale that way, we could assume the design was a deliberate decision. For example, perhaps the system is accessible only to certain people for internal purposes. Scaling to millions of requests was not necessary, because it would only ever get hundreds of requests per second at most.

On the other hand, if the system was designed with the idea that its usage would continue to grow indefinitely and the designers chose a pattern that will survive only up to a certain point, their success is a matter of luck. They simply did not reach that tipping point. Twitter was a well-designed system until it became so popular it started falling apart, serving users the notorious “fail whale” cartoon instead of their content. Overnight, the engineers who built the social media platform and the technology it used went from being perceived as skillful operators with superior code to a bunch of rank amateurs with an overhyped, dumbed-down programming language. They were neither geniuses nor dummies.

Scale always involves some luck. You can plan for a certain number of transactions or users, but you can’t really control those factors, especially if you’re building anything that involves the public internet. Software systems tend to incorporate multiple technologies working together to complete some task. I don’t know anyone who can predict how multiple technologies will behave in every potential scale condition, especially not when they are combined. Engineering teams do their best to mitigate potential problems, but they will never be able to foresee every possible combination of events. For that reason, whether a service works at its initial scale and then continues to work as it grows is always a mix of skill and luck.

Easy and Also Impossible

In 1988, computer scientist Hans Moravec observed that it was really hard to teach computers to do very basic things, but much easier to program them to do seemingly complex things. Skills that had been evolving for thousands of years to solve problems like walking, answering questions, and identifying objects were intuitive, subconscious, and impossibly difficult to teach a computer how to do. Meanwhile, skills that had not been a part of the human experience for thousands of years—like playing chess or geolocation—were relatively straightforward. Other contemporaneous AI researchers had noted the same paradox, but Moravec’s theory connecting it to evolution gained enough traction that the paradox was named after him.

In Moravec’s own words, “It is comparatively easy to make computers exhibit adult-level performance on intelligence tests or playing checkers, and difficult or impossible to give them the skills of a one-year-old when it comes to perception and mobility.”1

Those wishing to upgrade large complex systems would do well to keep Moravec’s paradox in mind. Systems evolve much faster than nature, but just as in nature, as the system evolves, more and more of its underlying logic becomes obscured. When we get used to something just working a certain way, we tend to forget about it. Once we’ve stopped thinking about it, we fail to factor it into our plans to modernize.

We assume that successful systems solved their core problems well, but we also assume things that just work without any thought or effort are simple when they may in fact bear the complexity of years of iteration we’ve forgotten about.

This is especially true when the system has multiple layers of abstraction, and even more so when those abstractions grow past the boundaries of the application itself—when they leverage operating system APIs or even hardware interfaces. When was the last time you thought about whether your favorite software is compatible with the chip architecture on your computer? When was the last time you needed to hunt down a specific driver to get a new accessory to work with your operating system? If you were born after the 1990s, you might never have thought about these things at all. Hardware and software interfaces haven’t gotten simpler in the last two decades; we’ve just abstracted away a lot of annoying differences that once made issues like x86 versus x64 or downloading drivers a normal part of working even casually with computers.

With very old legacy systems, the abstraction layers might not be there, or worse, they themselves might be out-of-date. I like to call this problem overgrowth, and it’s worth describing in detail.

Overgrowth: The Application and Its Dependencies

Overgrowth is a particular type of coupling between the software and the layers of abstraction making up the platform on which it runs. The perils of dependency management are well known, but with legacy systems, dependency management is about more than just what a package manager might install. The older a system is, the more likely the platform on which it runs is itself a dependency. Most modernization projects do not think about the platform this way and, therefore, leave the issue as an unpleasant surprise to be discovered later.

We’ve made huge leaps in cross-compatibility, but we’ve not yet reached the state where applications are 100 percent platform-agnostic, nor are we ever likely to achieve that completely.

For that reason, we cannot modernize a system without considering the underlying platform. Which features of that platform are unique, and which are found in other options? How old is the platform, and has it since been supplanted by a completely different way of doing things?

What makes major migrations so tricky is that as software ages, elements of the platform on which it was designed to run fall out of fashion, and support for those elements on other platforms becomes less and less common. This means that on our oldest systems, there is typically logic that either must be written out of the system or must be reproduced on a modern platform. The existing platform becomes auxiliary software that grows around whatever is being migrated. If you’re switching databases, for example, you’re not just moving the data. You might have to rewrite your queries in a different language or a different implementation of SQL. You may need to rethink hooks or stored procedures. One primary software language is often accompanied by any number of minor languages that facilitate specific functions. There are command processors like bash or JCL that trigger jobs, templating languages to build interfaces, querying languages to access data, and so on. How well is business logic separated out between these layers? Does logic stay where it is sensible, or is it injected wherever it is convenient?
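
As a small, hypothetical illustration of that query-rewriting burden, here is the same “first 50 open invoices” request expressed for two different SQL implementations (an older Oracle release and PostgreSQL). The table and column names are invented; the dialect gap is the point.

    # Hypothetical queries an application might carry through a database
    # migration. Same intent, different dialects; neither runs on the other.
    ORACLE_LEGACY_QUERY = """
        SELECT * FROM (
            SELECT t.*, ROWNUM AS rn
            FROM invoices t
            WHERE t.status = 'OPEN'
        )
        WHERE rn <= 50
    """

    POSTGRESQL_QUERY = """
        SELECT *
        FROM invoices
        WHERE status = 'OPEN'
        LIMIT 50
    """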

Most web development projects, for example, run on Linux machines. Therefore, it is not uncommon for web applications to include shell scripts as part of their code base—particularly as part of the setup/installation routine. Imagine what migrating those applications would feel like 20 years in the future if Linux were supplanted by a different operating system. We would potentially have to rewrite all the shell scripts as well as migrate the actual application.

Smart engineers will point out that with containerization and configuration management tools, such scripts should be a thing of the past, but that’s precisely why overgrowth is an issue for legacy code. At one point, doing certain tasks via shell script was commonplace; this has since been overtaken by a different approach. If we want to migrate an older application, we may find that this older approach is not supported by the technology we want to use. We must migrate the auxiliary software first.

For modern applications, overgrowth is not usually a significant blocker. Languages from the same general era of computing tend to share ecosystems, so it is easier to pull out one language and replace it with another while making only minimal changes to the auxiliary software around it. Remember, overgrowth is just another form of coupling, and coupling is not necessarily a bad thing if the value add is there.

In older applications, however, people seem to have trouble seeing where this type of coupling is. We tend to forget about auxiliary software, just as we forget the complex processes behind the simple tasks Moravec struggled to program computers to do. The longer a piece of auxiliary software goes without being upgraded, the less likely it is that modern platforms and tools will support it. As auxiliary software slides out of support, the challenge of modernizing the actual code becomes more complicated.

Look for overgrowth at integration points, places where the communication layer changes. There are a few different transitions where you are likely to find it.

Shifting Vertically: Moving from One Abstraction Layer to Another

Many layers exist between modern software and the physical voltage moving through circuits in a machine. On the most basic level, we can define three layers: the software, the hardware, and an operating system between them. Overgrowth when shifting up or down these layers typically takes the form of proprietary standards, especially with older technology where the manufacturer of the hardware would also provide the software. Look out for situations where your application code depends on APIs specific to your operating system or, worse, when it’s specific to the chip architecture of the physical machine on which it runs. This was a common problem with old mainframes. Software was written in a variant of Assembly specific to both the company that built the mainframe and usually the model of the machine itself.
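
A modern-day example of that kind of vertical coupling, sketched in Python under the assumption of a hypothetical application that stores its settings in the Windows registry: the key path is invented, but the dependency on an operating-system-specific API is exactly the overgrowth to look for.

    # Vertical overgrowth in miniature: this function only runs on Windows,
    # because winreg wraps the Windows registry. Porting the application to
    # another operating system means replacing the whole idea of "read the
    # config from the registry," not just swapping one call. Key path invented.
    import winreg

    def read_install_dir() -> str:
        key = winreg.OpenKey(winreg.HKEY_LOCAL_MACHINE, r"SOFTWARE\LegacyApp")
        try:
            value, _type = winreg.QueryValueEx(key, "InstallDir")
        finally:
            winreg.CloseKey(key)
        return value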

Shifting Horizontally: Moving from One Application to Another

Just as there is legacy code, there are also legacy protocols. When two applications pass data back and forth between each other, if they are running on machines or communicating on networking equipment developed by a corporation with proprietary protocols, you may see some overgrowth around the connection. This is less of a concern with web development, because the decentralized nature of the internet pushed things toward standard protocols like TCP/IP, FTP, and SMTP—all of which have a robust ecosystem of tooling and broad support across multiple platforms. In other areas of software development, proprietary protocols have a larger footprint. How difficult these protocols are depends on how common the technology in question is. Proprietary protocols from large vendors are probably supported by other options. For example, Microsoft Exchange Server protocols are proprietary but well supported, while an application dependent on AppleTalk might prove difficult to migrate.

Shifting from Client to Server

This shift can take the form of specific software development kits (SDKs) for specific tools and integrations, drivers for specific database connections, or frontend to backend movement. It might horrify some engineers to know this, but internal web applications are still sometimes built to run on certain web browsers and rely on features or functions not available in others. Internet Explorer is the most likely culprit. Whenever you see IE as the preferred default browser for internal applications, double-check that the frontend of these applications is not using IE-specific JavaScript features. We also see this frequently with Adobe Acrobat. Early-generation digital forms frequently were built to leverage Acrobat-specific PDF features and may be difficult to move between versions of Acrobat. A famous story about this comes from my time at US Digital Service where one of the Department of Veterans Affairs’ websites refused to work unless you downgraded your version of Acrobat.2

Shifting Down the Dependency Tree

As programming languages mature, they occasionally introduce breaking changes to their syntax or internal logic. Not all dependencies upgrade to handle those changes at the same pace, creating a mess where the application cannot be upgraded until the dependencies are upgraded. In applications that are very old, it is likely that some of those dependencies are no longer in active development. For instance, perhaps the maintainers never rolled out a version that is compatible with the newest version of Java or Node.js, and to get that support, the application must switch to a completely different option.

Cutting Back the Overgrowth

Cutting back overgrowth is not technically hard; it is just frustrating and demoralizing. Overgrowth slows things down, and if not accurately assessed, it creates unfortunate surprises that shake a team’s confidence. To minimize its impact, start off by mapping the application’s context. What does it run on? What is the process for creating a new instance of it? Map its dependencies two levels down.3 Attempt to trace the flow of data through the application to complete one request. This should give you a clearer picture of where there are likely to be problems. If you can put these problems on a road map, they have a less dramatic impact on morale.
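
If the application, or even just part of it, happens to be written in Python, a first pass at that dependency map can be scripted. The sketch below is one way to do it under that assumption, using the standard library’s importlib.metadata to walk a package’s declared dependencies two levels down; the package name in the usage comment is only an example.

    # A minimal sketch (Python 3.8+), assuming the application's packages are
    # installed in the current environment: walk declared dependencies two
    # levels down so overgrowth shows up on the road map, not as a surprise.
    import re
    from importlib.metadata import PackageNotFoundError, requires

    def map_dependencies(package, depth=2, seen=None):
        """Return {package: [direct dependencies]} down to `depth` levels."""
        seen = {} if seen is None else seen
        if depth == 0 or package in seen:
            return seen
        try:
            declared = requires(package) or []
        except PackageNotFoundError:
            declared = []  # not installed here; flag it for manual review
        names = sorted({m.group(0) for d in declared
                        if (m := re.match(r"[A-Za-z0-9._-]+", d))})
        seen[package] = names
        for name in names:
            map_dependencies(name, depth - 1, seen)
        return seen

    # Example usage: print(map_dependencies("requests"))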

You might be tempted to think that modern software development is improving this situation. Cross-compatibility is much better than it used to be, that’s true, but the growth of the platform as a service (PaaS) market for commercial cloud is multiplying the opportunities to program against platform-specific features. For example, the more you build things with Amazon’s managed services, the more the application will conform to fit Amazon-specific characteristics, and the more overgrowth there will be to contend with if the organization later wants to migrate away.
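
Here is a hedged sketch of what that conformity looks like at the code level, assuming an application that persists orders in DynamoDB through boto3 (the table and attribute names are invented). Every call site like this is platform-specific overgrowth that a later migration will have to unwind.

    # Platform coupling in a PaaS world: the persistence logic below is
    # written against DynamoDB's item model via boto3, so the application
    # conforms to Amazon-specific characteristics. Names are invented.
    import boto3

    _orders = boto3.resource("dynamodb").Table("orders")

    def save_order(order_id: str, total_cents: int) -> None:
        # Moving to a relational database later means rethinking the data
        # model and access patterns, not just pointing at a new endpoint.
        _orders.put_item(Item={"order_id": order_id, "total_cents": total_cents})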

Automation and Conversion

The last assumption people make about legacy systems is that because computers can read the code they are trying to modernize, there must be some way to automate the process. They introduce tools like transpilers and static analysis with the intention of making modernization faster and more efficient.

Those tools are useful, but only if the expectations for them are realistic. If you use them as guides to help inform the process, your modernization team can move strategically, side-stepping critical mistakes and maybe reducing some costs. However, if you use them as shortcuts and skimp on making a true investment in modernization, they will likely let you down. Organizations that think the tools are the solution typically end up with longer, more painful, and more expensive modernizations.

So, what do these tools do exactly, and what’s the right way to use them?

Transpiling Code

Transpiling is the process of automatically translating code written in one programming language into another programming language. It makes sense to use a transpiler when the difference between the language being read and the language in which the output will be written is not significant. For example, Python 3 had enough breaking changes in it that the transition actually required engineers to migrate their code bases rather than simply upgrade them. At the same time, Python 3 did not change any of the fundamental philosophies of Python itself, just some implementation details. Transpiling worked so well that a Python 2 to Python 3 conversion tool (2to3) shipped with Python 3 itself, and a companion tool exists for converting in the other direction.
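
To make the Python example concrete, here is the kind of purely mechanical rewrite 2to3 performs; the function is invented for the illustration.

    # Python 2 original (left as comments so this snippet stays runnable):
    #
    #     def report(counts):
    #         for name, value in counts.iteritems():
    #             print name, value
    #
    # Python 3 version, as 2to3 would rewrite it: print becomes a function
    # call and iteritems() becomes items(). The structure and philosophy of
    # the program are untouched; only implementation details change.
    def report(counts):
        for name, value in counts.items():
            print(name, value)

    report({"widgets": 12, "gadgets": 3})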

Another great use case for transpilers is when the language the transpiler reads was specifically designed to enforce good practices on the language it writes. JavaScript has attracted many languages that take this approach, such as CoffeeScript and TypeScript.

When the differences between the input and output languages are significant, transpiling becomes more problematic, and time-saving expectations need to be managed properly to ensure a successful outcome. The classic example of this use case is COBOL to Java. COBOL is procedural, imperative, and fixed-point by default. Java is object-oriented and floating-point by default. Transpiling COBOL to Java may produce code that works, but it will not be maintainable unless engineers go over the code and fine-tune it. Often this means rewriting parts of it.

If you are going to use a transpiler for that kind of upgrade, it is absolutely essential that the application have well-designed and comprehensive test suites, preferably automated ones. The bugs created by automatically translating one language to another, completely different language can be subtle and difficult to track down. For example, when you try to put an eight-digit number into a variable defined as having seven digits, COBOL silently truncates the value to fit and moves on. Java, on the other hand, throws an exception. The transpiler will not add code to handle these exceptions.
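
The gap is easier to see in a toy sketch. The Python code below is an illustration of the semantic difference only, not transpiler output: one function quietly fits an oversized value into a seven-digit field the way a COBOL MOVE would, and the other refuses it the way a checked Java conversion might.

    # Illustration of the semantic gap. Exact COBOL behavior depends on the
    # picture clause; the point is the contrast between silently fitting the
    # value and raising an exception that nothing downstream handles.
    def cobol_style_move(value: int, digits: int = 7) -> int:
        return value % (10 ** digits)  # keep only the digits that fit

    def java_style_assign(value: int, digits: int = 7) -> int:
        if abs(value) >= 10 ** digits:
            raise OverflowError(f"{value} does not fit in {digits} digits")
        return value

    print(cobol_style_move(12345678))        # 2345678 -- quietly truncated
    try:
        print(java_style_assign(12345678))
    except OverflowError as err:
        print("unhandled in transpiled code:", err)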

People often invest in transpilers to help upgrade their legacy code because they think it will save engineering time to have a computer program do the first pass, or they think it will replace the need for experts in the original language to assist altogether. But when the two languages have significant differences, the output of such transpilers doesn’t usually follow the structure and conventions of the language in which it writes. Transpilers are not capable of rethinking how you organize your code. Transpiled COBOL is Java written as if it were COBOL, and therefore, it’s unintelligible to most Java programmers.

The success stories around this kind of transpiling typically come from companies that use their transpiling solution as a gateway to consulting services. That is to say, first you buy licenses to use the transpiler, and then you buy the talent to rewrite the transpiler’s output into something workable. This is a fine strategy, as long as you know that’s what you’re getting into.

Static Analysis

Although it hasn’t gained much traction outside a theoretical context, some interesting work in academia has been done around deploying various forms of static analysis to explore and ultimately improve legacy systems. So-called software renovation combines techniques from compiler design and reverse engineering to steer the refactoring process. Software renovation is intended to be semi-automatic: the analysis is automatic, but software engineers do the actual work of restructuring the code.

Some common types of static analysis used for software renovation include the following:

  1. Dependency graphs: In this style of software renovation, the dependency graph is mapped, and clustering algorithms are used to determine where there is overlap, redundancy, unused libraries, or circular dependencies.4
  2. Grammars: These are language-specific tools that produce analysis by parsing the abstract syntax tree. Typically they look for duplicate code or specific practices that are considered anti-patterns (like goto statements).
  3. Control flow/data flow graphs: These are tools that track how software executes. Control flow graphs map the order in which lines of code are executed, while data flow graphs map the variable assignments and references. You can use such analysis to discover lost business requirements or track down dead code (see the sketch after this list).
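
Here is a minimal sketch of that last kind of analysis, using Python’s built-in ast module on an invented module. A real renovation tool would handle methods, imports, and dynamic dispatch; a human still has to decide which flagged functions are genuinely dead, since entry points show up as false positives.

    # Parse a module, collect the functions it defines and the names it
    # calls, and flag functions never called within the module itself.
    import ast

    SOURCE = """
    def active(x):
        return helper(x) * 2

    def helper(x):
        return x + 1

    def forgotten(x):
        return x - 1
    """

    tree = ast.parse(SOURCE)
    defined = {node.name for node in ast.walk(tree)
               if isinstance(node, ast.FunctionDef)}
    called = {node.func.id for node in ast.walk(tree)
              if isinstance(node, ast.Call) and isinstance(node.func, ast.Name)}

    # "active" appears here too: it is an entry point, so an engineer has to
    # separate genuine dead code from code that is simply called elsewhere.
    print("never called in this module:", sorted(defined - called))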

Software renovation methodology hasn’t quite broken out of theoretical studies, even though static analysis tools are available both as stand-alone products and as features of larger integrated development environments or continuous integration and deployment solutions. That’s unfortunate, because the methodology is what drives the bulk of the impact. The tools themselves are not as important as the phases of excavating, understanding, documenting, and ultimately rewriting and replacing legacy systems. Tools will come and go.

A Guide to Not Making Things Harder

Expectation management is really important. Typically organizations make the mistakes described in this chapter because they believe they are making the process more efficient. They misjudge how long modernization projects take, and they misjudge how much time they can save and how to save it.

Modernization projects have better outcomes when we replace the false assumptions described at the opening of this chapter with the following guidelines:

  • Keep it simple. Don’t add new problems to solve just because the old system was successful. Success does not mean the old system completely solved its problem. Some of those technical decisions were wrong, but never caused any problems.
  • Spend some time trying to recover context. Treat the platform as a dependency and look for coupling that won’t transfer easily to a modern platform.
  • Tools and automation should supplement human effort, not replace it.

Individual contributors often find the barrier to following that advice is not convincing themselves, but convincing others. Particularly when the organization is big, the pressure to run projects the same way everyone else does, so that they look correct even at the expense of being successful, is significant. In later chapters, we’ll tackle navigating the organization and strategies to advance your goals.
