10
Future-Proofing

The best way to complete a modernization project is by ensuring that you won’t have to go through the whole process again in a few years. Future-proofing isn’t about preventing mistakes; it’s about knowing how to maintain and evolve technology gradually.

Two types of problems will cause us to rethink a working system as it ages. The first is usage changes. The second is deteriorations. Scaling challenges are usage changes: we have more traffic, or a different kind of traffic, than we had before. Maybe more people are using the system than before, or we’ve added a bunch of features that over time have changed what people use the technology for.

Usage changes do not have a constant pace and are, therefore, hard to predict. A system might never have scaling challenges. A system can reach a certain level of usage and never go any further. Or it can double or triple in size in a brief period. Or it can slowly increase in scale for years. What scaling challenges will look like if they do happen will depend on a number of factors. Because changes to the system’s usage are hard to anticipate, they are hard to normalize. This is an advantage. When we normalize something, we stop thinking about it, stop factoring it into our decisions, and sometimes even forget it exists.

Deteriorations, on the other hand, are inevitable. They represent a natural linear progression toward an unavoidable end state. Other factors may speed them up or slow them down, but eventually, we know what the final outcome will be. For example, no change in usage was going to eliminate the 9th of September from the calendar year of 1999. That date was going to arrive regardless of how the machines programmed to use 9/9/99 as a null value for missing dates behaved.

Memory leaks are another good example of this kind of change. System usage might influence exactly when the leak creates a major problem, but low system usage will not change the fact that a memory leak exists that will eventually be a problem. The only way to escape the problem is to fix it.

Hardware lifecycles are another example. Eventually chips, disks, and circuit boards all fail and have to be replaced.

These kinds of deteriorations are dangerous because people forget about them. For a long time, their effects go unnoticed, until one day they finally and completely break. If the organization is particularly unlucky, the problem is deeply embedded in the system, and it’s not immediately clear what has even broken in the first place.

Consider, for example, Y2K. An alarming number of computer programs were designed with a two-digit year, which became a problem in the year 2000 when the missing first two digits were different from what the program assumed they were. Most technical people know the Y2K story, but did you know that Y2K wasn’t the first short-sighted programming mistake of this nature? Nor will it be the last.

Time

It’s unbelievable how often software engineers have screwed up time in programs. In the 1960s, some programs had only one-digit years. The TOPS-10 operating system had only enough bits to represent dates between January 1, 1964, and January 4, 1975. Engineers patched this problem, adding three more bits so that TOPS-10 could represent dates up to February 1, 2052, but they took those bits from existing data structures, thinking they were unused. It turns out that some programs on TOPS-10 had already repurposed those areas of storage, which led to some wonky bugs.1

How much storage should be dedicated to dates is a constant problem. It would be unwise and impractical to allocate unlimited storage for time, and yet any amount of storage eventually will run out. Programmers must decide how many years can pass before it seems unlikely that their program will still be functioning. At least in the early days of computers, the tendency was to underestimate the lifecycle of software. It’s easy for a functioning piece of software to remain in place for 10, 20, 30 years, or more. But in the early days of computing, two or three decades seemed like a long time. If time was given only enough space to reach 1975, the fix might carry it over to 1986. Certain operating systems in 1989 programmed limits to reach maturity in 1997—and so on, and on, and on.

These programs are still with us, and we haven’t reached all of their maturity dates just yet. In , a date format created by the World Computer Corporation will reach its storage limit, and we have no idea whether any existing systems use it. Of greater concern is the year 2038 when Unix’s 32-bit dates reach their limit. While most modern Unix implementations have switched to 64-bit dates instead, the Network Time Protocol’s (NTP) 32-bit date components will overflow on February 7, 2036, giving us a potential preview. NTP handles syncing the clocks of computers that talk with each other over the internet. Computer clocks that are too badly out of sync—typically five minutes or more—have trouble creating secure connections. This requirement goes back to MIT’s Kerberos version 5 spec in 2005, which used time to keep attackers from resetting their clocks to continue using expired tickets.
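To make the arithmetic concrete, here is a minimal sketch (an illustration of these well-known limits, not code from any system discussed above) showing how both rollover dates fall out of the size of the counters involved: Unix time is a signed 32-bit count of seconds since 1970, and NTP’s seconds field is an unsigned 32-bit count of seconds since 1900.

    from datetime import datetime, timedelta, timezone

    # Unix time: signed 32-bit count of seconds since January 1, 1970 (UTC).
    unix_epoch = datetime(1970, 1, 1, tzinfo=timezone.utc)
    print(unix_epoch + timedelta(seconds=2**31 - 1))  # 2038-01-19 03:14:07+00:00

    # NTP era 0: unsigned 32-bit count of seconds since January 1, 1900 (UTC).
    ntp_epoch = datetime(1900, 1, 1, tzinfo=timezone.utc)
    print(ntp_epoch + timedelta(seconds=2**32))       # 2036-02-07 06:28:16+00:00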

We don’t know what kinds of problems NTP and Unix rollovers will cause. Most computers are probably long upgraded and will be unaffected. With any luck, the 2038 milestone will pass us by with little fanfare, just as Y2K did before it. But time bugs don’t need to trigger global meltdowns to have dramatic and expensive impacts. Past time bugs have temporarily cleared pension funds, messed with text messages, crashed video games, and disabled parking meters. In 2010, 20 million chip and PIN bank cards became unusable in Germany thanks to a time bug.2 In 2013, NASA lost control of the $330 million Deep Impact probe thanks to a time bug similar to the 2038 issue.

Time bugs are tricky because they detonate decades, or sometimes centuries, after they were introduced. IBM mainframes built in the 1970s reach a rollover point on September 18, 2042. Some Texas Instruments calculators do not accept dates beyond December 31, 2049. Some Nokia phones accept dates only up to December 31, 2079. Several programming languages and frameworks use timestamp objects that overflow on April 11, 2262.

It’s not that programmers don’t know these bugs exist. It’s just hard to imagine the technology of today sticking around until 2262. At the same time, people who were programming room-sized mainframes in the 1960s never thought their code would last for decades, but we now know programs this old are still in production. By the time the year 2000 came around, that old software (and sometimes the machines that came with it) had not only not been retired but was also being maintained by technologists two or three generations divorced from its creation.

Resolving time bugs is usually fairly straightforward—when we know about them. The problem is we tend to forget that they’re approaching. We have already seen systems fail thanks to the 2038 bug. Programs in financial institutions that must calculate interest payments 20 or 30 years into the future act like early warning systems for these types of errors. Still, organizations must know the state of their legacy systems (in other words, whether they’ve been patched) and be aware that these incidents are happening.

Inescapable Migrations

Future-proofing systems does not mean building them so that you never have to redesign them or migrate them. That is impossible. It means building, and more important maintaining, systems in a way that avoids a lengthy modernization project in which normal operations have to be reorganized to make progress. The secret to future-proofing is making migrations and redesigns normal routines that don’t require heavy lifting.

Most modern engineering organizations already know how to do this with usage changes—they monitor for increased activity and scale infrastructure up or down as needed. If given proper time and prioritization, they will refactor and redesign components of the system to better reflect the most likely long-term usage patterns. Making updates to the system early and often is just a matter of discipline. Those that neglect to devote a little bit of time to paying down their technical debt will be forced into cumbersome and risky legacy modernization efforts instead.

One of my favorite metaphors for setting a cadence for early and often updates comes from the podcast Legacy Code Rocks (https://www.legacycode.rocks/). Launching a new feature is like having a house party. The more house parties you have in your house before you clean things up, the worse condition your house will be in. Although there isn’t a hard-and-fast rule here that will work for everyone, automatically scheduling some time to reevaluate usage changes and technical debt after every n feature launches will normalize the process of updating the system in ways that will ensure its long-term health. When people associate refactoring and necessary migrations with a system somehow being built wrong or breaking, they will put off doing them until things are falling apart. When the process of managing these changes is part of the routine, like getting a haircut or changing the oil in your car, the system can be future-proofed by gradually modernizing it.

Deteriorations require a different tack. Sometimes they can be monitored. As batteries age, for example, their performance slides in a way that can be captured and tracked. Some deteriorations are more sudden. Time bugs don’t give any warning before they explode. If the organization has forgotten about them, there’s nothing to monitor.

It would be naive to say that you should never build a deteriorating change into your system; those issues are often unavoidable. The mistake is assuming the system could not possibly still be operational when the issue matures. Technology has a way of extending its life for much longer than people realize. Some of the control panels for switches on the New York City subway date back to the 1930s. The Salisbury cathedral clock started running in 1386. There’s a lightbulb over Livermore, California’s Fire Station 6 that has been on since 1901. All around the world, our day-to-day lives are governed by machines long past their assumed expiration dates.

Instead, managing deteriorations comes down to these two practices:

  • If you’re introducing something that will deteriorate, build it to fail gracefully.
  • Shorten the time between upgrades so that people have plenty of practice doing them.

Failing Gracefully

The reason Y2K and similar bugs do not trigger the end of human civilization is that they do not impact every affected system with uniform intensity. There is a lot of variation in how different machines, different programming languages, and different software will handle the same problem. Some systems will panic; some will simply move on. Whether it is better for the system to panic and crash or to ignore the issue and move on largely depends on whether the failure is in the critical path of a transaction.

Failing gracefully does not always mean the system avoids crashing. If a bug breaks a daily batch job calculating accrued interest on bank accounts, the system recovering from the error by defaulting to zero and moving on is not failing gracefully. That’s a failure that, if allowed to happen silently, will upset a lot of people very quickly, whereas a panic would alert the engineering team immediately so the problem could be resolved and the batch job rerun.

How close is the error to a user interface? If the error is something potentially triggered by user input, failing gracefully means catching the error and logging the event but ultimately directing the user to try again with a useful message explaining the problem.

Will the error block other independent processes? Why is it blocking other processes? Blocking implies shared resources, which would suggest that processes are not as independent as originally thought. For truly independent processes, it is probably okay to log the error but ultimately let the system move on.

Is the error in one step of a larger algorithm? If so, you likely have no choice but to trigger a panic. If you could eliminate a step in a multistep process and not affect the final outcome, you should probably rethink whether those steps are necessary.

Will the error corrupt data? In most cases, bad data is worse than no data. If a certain error is likely to corrupt data, you must panic upon the error so the problem can be resolved.

These are good things to consider when programming in unavoidable deteriorations. This thought exercise is less useful when you don’t realize you’re programming in a potential bug at all. You can’t know what you don’t know.

But, it’s worthwhile to take some time to consider how your software would handle issues like the date being 20 years off, time moving backward for a second, numbers appearing that are technically impossible (like 11:59:60 pm), or storage drives suddenly disappearing.

When in doubt, default to panicking. It’s more important that errors get noticed so that they can be resolved.
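As a rough illustration of those trade-offs, here is a minimal sketch. The scenario and names (a user-supplied date and a batch interest job) are hypothetical, chosen only to mirror the examples above: an error near user input gets caught, logged, and reported back to the user, while an error that would corrupt financial data refuses to continue.

    import logging
    from datetime import datetime

    logger = logging.getLogger("billing")

    class DataCorruptionError(Exception):
        """Raised when continuing would write bad data."""

    def parse_user_date(raw: str):
        """Error close to a user interface: catch it, log it, and let the
        user try again with a useful message."""
        try:
            return datetime.strptime(raw, "%Y-%m-%d").date()
        except ValueError:
            logger.info("rejected user-supplied date %r", raw)
            return None  # caller shows "please enter a date like 2025-01-31"

    def accrued_interest(balance: float, rate: float) -> float:
        """Error inside a batch job that feeds financial records: defaulting
        to zero and moving on would silently corrupt data, so panic instead."""
        if rate is None or rate < 0:
            # Bad data is worse than no data; stop the batch job so the
            # problem can be fixed and the job rerun.
            raise DataCorruptionError(f"invalid interest rate: {rate!r}")
        return balance * rate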

Less Time Between Upgrades, Not More

A few years ago, I got one of those cheesy letter boards for my kitchen—you know, the ones you put inspirational messages on like “Live life in full bloom” or “Love makes this house a home.” Except, mine says “The truth is counterintuitive.” Our gut instinct with deteriorations is to push them as far back as possible if we cannot eliminate them altogether. Personally, I feel this is a mistake. I know from experience that the more often engineers have to do things, the better they get at doing them, and the more likely they are to remember that they need to be done and plan accordingly.

For example, in 2019 there were two important time bugs. The first was a rollover of GPS’s epoch; the second was a leap second.

The GPS rollover is a problem identical to the time bugs already described. GPS represents weeks in a storage block of 10 bits. That means it can store up to 1024 values, and 1024 weeks is about 19.6 years. As with Y2K, when GPS reaches week number 1024, the counter resets to zero, and the computer has no way of knowing that it shouldn’t backdate everything by 20 years.
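A minimal sketch of that wraparound (my own illustration), assuming a naive receiver that sees only the 10-bit week field and knows nothing about which 1024-week era it is living in:

    from datetime import datetime, timedelta, timezone

    GPS_EPOCH = datetime(1980, 1, 6, tzinfo=timezone.utc)  # start of GPS week 0
    WEEK_BITS = 10                                         # 2**10 = 1024 possible week numbers

    def naive_gps_date(true_weeks_since_1980: int) -> datetime:
        # A receiver that stores only 10 bits silently loses the high bits.
        truncated = true_weeks_since_1980 % (2 ** WEEK_BITS)
        return GPS_EPOCH + timedelta(weeks=truncated)

    print(naive_gps_date(1023))  # 1999-08-15, the last week before the first rollover
    print(naive_gps_date(1024))  # 1980-01-06, suddenly backdated by roughly 19.6 years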

This had happened only once before, in 1999. Although commercial GPS had been available since the 1980s, it had not really caught on by 1999. The chips that powered the receivers were too expensive, and their convenience would not be realized until computers became fast enough to overlay that data with calculations determining routes or associating physical landmarks with their coordinates. As the helpful bits of GPS were not yet market-ready, consumers were more sensitive to the privacy concerns of the technology. In 1997, employees of United Parcel Service (UPS) famously went on strike after UPS tried to install GPS receivers in all of their trucks.

So, the impacts of the first GPS rollover were minor, because GPS was not popular. By 2019, however, the world was a completely different place. Twenty years is a long time in technology. Not only were virtually all cellphones equipped with GPS chips, but any number of applications had been built on top of GPS.

As it turns out, people replace GPS-enabled devices a lot. Mobile app updates for many users are seamless and automatic. We are so used to getting new phones every two or three years that the rollover of 2019 was mainly uneventful. Users with older-model mobile phones experienced some problems but were encouraged to buy new phones from their vendors instead.

The second time bug of 2019, a leap second, went a slightly different way. A leap second is exactly what it sounds like: an extra second tacked on to the year to keep computer clocks in sync with the solar cycle. Unlike a leap year, leap seconds are not predictable. How many seconds pass between sunup and sundown depends on the earth’s rotational speed, which is changing. Some forces push the earth to speed up, and others push it to slow down.

Here’s a fun fact: one of the many forces changing the speed of the earth’s rotation is climate change. Ice weighs down the land masses on Earth, and when it melts, those land masses start to drift up toward the poles. This makes the earth spin faster and days fractions of a second shorter.

There have been 27 leap seconds between 1972 and 2020, but as some forces slow the earth down and some forces speed it up, there can be significant gaps between years with leap seconds. After the leap second in 1999, it was six years before another was needed. There were no leap seconds between 2009 and 2012. There was a leap second in both 2015 and 2016, but nothing in the next three years.

Leap seconds are never fun, but if the reports of problems experienced during each recent leap second can be considered comprehensive, they are worse after a long gap than they would be otherwise. Even gaps as short as three years are long enough for new technologies either to be developed or to get much more traction than they had before. Abstractions and assumptions are made, and they settle into working systems and then are forgotten.

The industries around cloud computing and smartphones started to grow just as a multiyear gap in leap seconds was approaching. By the time the next leap second event occurred, huge platforms were running on technologies that had not existed during the last one. These technologies were built by engineers who may not even have been familiar with the concept of a leap second in the first place. Some service owners failed to patch updates to manage the leap second in a timely manner. Reddit, Gawker Media, Mozilla, and Qantas Airways all experienced problems.

This was followed by another multiyear gap before the leap second of 2015 created issues for Twitter, Instagram, Pinterest, Netflix, Amazon, and Beats 1 (now Apple Music 1). By comparison, 2016’s leap second went out with a whimper. With just an 18-month gap, it seems to have triggered problems only in a small number of machines across Cloudflare’s 102 data centers.

And the 2019 leap second at the end of another multiyear gap? It cancelled more than 400 flights when Collins Aerospace’s Automatic dependent surveillance–broadcast (ADS–B) system failed to adjust correctly. ADS–B was not new, but the FAA had released a rule requiring it on planes by 2020, so its adoption was much greater than it had been at the time of the previous leap second.

As a general rule, we get better at dealing with problems the more often we have to deal with them. The longer the gap between the maturity date of deteriorations, the more likely knowledge has been lost and critical functionality has been built without taking the inevitable into account. Although the GPS rollover came at the end of a 20-year gap, it benefited from the accelerated upgrade cycle of devices most likely to be affected. Few people have 20-year-old cellphones or tablets. Leap seconds, on the other hand, have pretty consistently caused chaos when there’s a gap between the current one and the last one.

Some deteriorations recur on such short cycles at scale that the organization doesn’t need to do any extra meddling. For example, the average storage drive has a lifespan of three to five years. If you have one drive—for example, the one in your computer—you can mitigate the risks of this inevitable failure by regularly backing things up and just replacing the computer when the drive ultimately fails.

If you are running a data center, you need a strategy to keep drive failure from crippling operations. You need to back up regularly and restore almost instantaneously. That might seem like a huge engineering challenge, but the architecture to create such resilience is built in to the scale. Data centers don’t have just a few hard drives and three- to five-year gaps when they need to be replaced. Data centers often have thousands to hundreds of thousands of drives that are failing constantly. In 2008, Google announced it had sorted a petabyte of data in six hours with 4,000 computers using 48,000 storage drives. A single run always resulted in at least one of the 48,000 drives dying.3 A formal study of the issue done at about the same time pegged the annual drive failure rate at 3 percent.4 At 3 percent failure rate, once you get into the hundreds of thousands of drives, you start seeing multiple drives failing every day.
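The arithmetic behind that is simple enough to sketch. The fleet sizes below are arbitrary; the 3 percent annual rate is the one cited above:

    def expected_daily_failures(drive_count: int, annual_failure_rate: float = 0.03) -> float:
        """Expected number of drive failures per day for a fleet of a given size."""
        return drive_count * annual_failure_rate / 365

    for fleet in (1, 48_000, 300_000):
        print(f"{fleet:>7} drives: ~{expected_daily_failures(fleet):.2f} failures per day")
    # A single drive fails roughly once every 33 years on average; 48,000 drives
    # lose about 4 a day; at 300,000 drives you replace a couple dozen every day.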

While no one would argue that drive failures are pleasant, they do not trigger outages once data centers reach a scale where handling drive failure is a regular occurrence. So rather than lengthening the period between inevitable changes, it might be better to shorten it to ensure engineering teams are building with the assumption of the inevitable at the forefront of their thoughts and that the teams that would have to resolve the issue understand what to do.

A Word of Caution About Automation

The second solution people gravitate to if a deteriorating change cannot be eliminated altogether is to automate its resolution. In some cases, this kind of automation adds a lot of value with relatively little risk. For example, failing to renew TLS/SSL certificates could cause an entire system to grind to a halt suddenly and without warning. Automating the process of renewing them means the certificates themselves can have shorter lifespans, which increases the security benefit of using them.

The main thing to consider when thinking about automating a problem away is this: If the automation fails, will it be clear what has gone wrong? In most cases, expired TLS/SSL certificates trigger obvious alerts. Either the connection is refused, at which point the validity of the certificate should be on the checklist of likely culprits, or the user receives a warning that the connection is insecure.
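One way to keep that failure state visible is to monitor certificate expiry independently of whatever does the renewing. Here is a minimal sketch using Python’s standard library; the hostname and the 14-day threshold are arbitrary assumptions, and the alert itself is left as a placeholder:

    import socket
    import ssl
    from datetime import datetime, timezone

    def days_until_cert_expiry(hostname: str, port: int = 443) -> int:
        """Return how many days remain before a host's TLS certificate expires."""
        context = ssl.create_default_context()
        with socket.create_connection((hostname, port), timeout=10) as sock:
            with context.wrap_socket(sock, server_hostname=hostname) as tls:
                cert = tls.getpeercert()
        # notAfter looks like 'Jun  1 12:00:00 2025 GMT'
        expires = datetime.strptime(cert["notAfter"], "%b %d %H:%M:%S %Y %Z")
        return (expires.replace(tzinfo=timezone.utc) - datetime.now(timezone.utc)).days

    if days_until_cert_expiry("example.com") < 14:
        print("certificate expires soon; check the renewal automation")  # page a human here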

Automation is more problematic when it obscures or otherwise encourages engineers to forget what the system is actually doing under the hood. Once that knowledge is lost, nothing built on top of those automated activities will include fail-safes in case of automation failure. The automated tasks become part of the platform, which is fine if the engineers in charge of the platform are aware of them and take responsibility for them.

Few programmers consider what would happen should garbage collection suddenly fail to execute correctly. Memory management used to be a critical part of programming, but now the responsibility is largely automated away. This works because the concern is always top of mind for software engineers who develop programming languages that have automated garbage collection.

In other words, automation is beneficial when it’s clear who is responsible for the automation working in the first place and when failure states give users enough information to understand how they should triage the issue. Automation that encourages people to forget about it creates responsibility gaps. Automation that fails either silently or with unclear error messages at best wastes a lot of valuable engineering time and at worst triggers unpredictable and dangerous side effects.

Building Something Wrong the Correct Way

Throughout this book and in this chapter especially, the message has been don’t build for scale before you have scale. Build something that you will have to redesign later, even if that means building a monolith to start. Build something wrong and update it often.

The secret to building technology “wrong” but in the correct way is to understand that successful complex systems are made up of stable simple systems. Before your system can be scaled with practical approaches, like load balancers, mesh networks, queues, and other trappings of distributed computing, the simple systems have to be stable and performant. Disaster comes from trying to build complex systems right away and neglecting the foundation that determines all system behavior, planned and unexpected.

A good way to estimate how much complexity your system can handle is to ask yourself: How large is the team for this? Each additional layer of complexity will require a monitoring strategy and ultimately human beings to interpret what the monitors are telling them. Figure a minimum of three people per service. For the purposes of this discussion, a service is a subsystem that has its own repository of code (although Google famously keeps all its source code in a monolith repository), has dedicated resources (either VMs or separate containers), and is assumed to be loosely coupled from other components of the system.

The minimum on-call rotation is six people. So, a large service with a team of six can have a separate on-call rotation, or two small services can share a rotation among their teams. People can, of course, be on multiple teams, or the same team can run multiple services, but a person cannot be well versed in an infinite number of topics, so for every additional service, expect the level of expertise to be cut in half. In general, I prefer engineers not take on more than two services, but I will make exceptions when services are related.

I lay out these restrictions only to give you a framework from which to think about the capacity of the human beings on which your system relies to future-proof it. You can change the exact numbers to fit what you think is realistic if you like. The tendency among engineers is to build with an eye toward infinite scale. Lots of teams model their systems after white papers from Google or Amazon when they do not have the team to maintain a Google or an Amazon. What the human resources on a team can support is the upper bound on the level of system complexity. Inevitably the team will grow, the usage of the system will grow, and many of these architectural decisions will have to be revised. That’s fine.
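If it helps to make that framework explicit, here is a small sketch that encodes these rules of thumb as a quick check. It is my own restatement of the heuristics above, with the thresholds exposed as parameters so you can substitute your own numbers:

    def complexity_warnings(team_size: int, service_count: int,
                            min_people_per_service: int = 3,
                            min_oncall_rotation: int = 6,
                            max_services_per_person: int = 2) -> list:
        """Flag ways a proposed architecture may exceed what the team can maintain."""
        warnings = []
        # Each service needs a minimum number of people paying attention to it,
        # and each person can pay real attention to only so many services.
        if service_count * min_people_per_service > team_size * max_services_per_person:
            warnings.append("not enough people to give every service minimum coverage")
        if team_size < min_oncall_rotation:
            warnings.append("team too small for its own on-call rotation; share one")
        return warnings

    print(complexity_warnings(team_size=7, service_count=5))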

Here’s an example: Service A needs to send data to Service B. The team maintaining the complete system has about 11 people on it. Four people are on operations, maintaining the servers and building tooling to help enforce standards. Four people are on the data science team, designing models and writing the code to implement them, and the remaining three people build the web services. That three-person team maintains Service B but also another service elsewhere in the system. The data science team maintains Service A, but also two other services.

Both of those teams are a bit overloaded for their staffing levels, but the usage of the system is low, so the pressure isn’t too great.

So, how should Service A talk to Service B?

The first suggestion is to set up a message queue so that communication between A and B is decoupled and resilient. That would be the most scalable solution, but it would also require someone to set up the message queue and the workers, monitor them, and respond when something goes wrong. Which team is responsible for that? Cynical engineers will probably say operations. This is usually what happens when teams cannot support what they are building. Certain parts of the system get abandoned, and the only people who pay attention to them are the teams that are in charge of the infrastructure itself (and usually only when something is on fire).

Although a message queue is more scalable, a simpler solution with tighter coupling would probably get better results to start. Service A could just send an HTTP request to Service B. Delegation of responsibilities on triage is built in. If the error is thrown on the Service B side, the team that owns Service B is alerted. If it’s thrown on the Service A side, the team that owns Service A is alerted.
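A minimal sketch of what that tightly coupled version might look like from Service A’s side. The endpoint URL and payload shape are hypothetical; the point is only that each side’s failures land in that side’s own logs and alerts:

    import logging
    import requests  # third-party HTTP client: pip install requests

    logger = logging.getLogger("service_a")

    SERVICE_B_URL = "https://service-b.internal/ingest"  # hypothetical endpoint owned by Service B's team

    def send_to_service_b(record: dict) -> None:
        """Push one record from Service A to Service B over plain HTTP."""
        try:
            response = requests.post(SERVICE_B_URL, json=record, timeout=5)
            response.raise_for_status()
        except requests.RequestException:
            # Server-side errors surface in Service B's own logs and monitoring;
            # timeouts and connection failures surface here, on Service A's side.
            logger.exception("failed to deliver record to Service B")
            raise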

But what about network issues? It’s true that networks sometimes fail, but if we assume that both of these services are hosted on a major cloud provider, a one-off network issue that causes no other problems is unlikely. Networking issues are not subtle, and they are generally a product of misconfiguration rather than gremlins.

The HTTP request solution is wrong in the correct way because migrating from an HTTP request between Service A and Service B to a message queue later is straightforward. While we are temporarily losing built-in fault tolerance and accepting a higher scaling burden, it creates a system that is easier for the current teams to maintain.

The counterexample would be if we swapped the direction of the HTTP request and had Service B poll Service A for new data. While this is also less complex than a message queue, it is unnecessarily resource-intensive. Service A does not produce a constant stream of new data, so by polling Service A, Service B may spend hours or even days sending meaningless requests. Moving from this to a queue would require significant changes to the code. There’s little value to building things wrong this way.

Feedback Loops

Another way to think about this is to sketch out how maintaining this system will create feedback loops across engineering. Thinking about how work gets done in terms of flows, delays, stocks, and goals can help clarify whether the level of work required to maintain a system of a given complexity is feasible.

Let’s take another look at the question of Service A and Service B. We know we have seven people working on these two services and that each person has an eight-hour workday. Service B’s team is split between that and another service, so we can assume they have a budget of four hours per service they own. With three people, that’s about 12 hours per day. Service A’s team is maintaining a total of three services, so they have a budget of 2.5 hours per person and 10 hours per service per day. A model like this might have the following characteristics:

  1. Stocks: A stock is any element that can accrue or drain over time. The traditional example of a system model is a bathtub filling with water. The water is a stock. In this model, technical debt will accrue for each service constantly regardless of the level of work. Debt will be paid down by spending work hours. The tasks in our workweek are also a stock that our teams will burn down as they operate. That eight-hour day is also a stock. When the system is stable, the eight hours are fully spent and fully restored each day.
  2. Flows: A flow is a change that either increases or decreases a stock. In the bathtub example, the rate of water coming out of the faucet is a flow, and if the drain is open, the rate of water draining out of the bathtub is another flow. In our model, at any time, people can work more than eight hours a day, but doing so will decrease their ultimate productivity and require them to work less than eight hours a day later. We can represent this by assuming that we’re borrowing the extra hours from the next day’s budget. Tasks are completed by spending work hours; we might keep our model simple and say every task is worth an hour, or we might separate tasks into small, medium, and large sizes with a different number-of-hours cost for each option. Spending work hours decreases the stock of technical debt or work tasks, depending on how those hours are applied.
  3. Delays: Good systems models acknowledge that not everything is instantaneous. Delays represent gaps of time in how flows respond. In our model, new work does not immediately replace old work; it is planned and assigned in one-week increments. We can view the period between each task assignment as a delay.
  4. Feedback: Feedback loops form when the change in a stock affects the nature of the flow, either positively or negatively. In our model, when people work more than their total eight-hour budget, they lose future hours. The more hours they work, the more hours they have to borrow to maintain a stretch of eight-hour days in a row. Eventually, they have to take time off to normalize. Alternatively, they could borrow hours by spending more of their budget on Service A or Service B, but that means the other services they are responsible for will be neglected, and their technical debt will accrue unchecked.

Visually, we might represent that model like in Figure 10-1. The solid lines represent flows, and the dotted lines represent variables that influence the rate of flows.

Work hours come into the model via our schedule but are affected by a stock representing burnout. If burnout is high, work hours fall; if work hours are high, burnout rises. How much of our available work hours on any given day is devoted to tasks on one service depends on the size of our team and the number of services or projects the team is trying to maintain at the same time. The more we are able to devote to work tasks, the more we ship. When work tasks are completed, whatever extra time is left is directed to improving our technical debt.


Figure 10-1: Feedback loops in the team’s workload

Although this visual model might just look like an illustration, we can actually program it for real and use it to explore how our team manages its work in various conditions. Two tools popular with system thinkers for these kinds of models are Loopy (https://ncase.me/loopy/) and InsightMaker (https://insightmaker.com/). Both are free and open source, and both allow you to experiment with different configurations and interactions.

For now, let’s just think through a couple scenarios. Suppose we have a sprint with 24 hours of work tasks for Service A and for Service B. That shouldn’t be a problem; Service A’s team has a weekly capacity of 50 hours a week for Service A, and Service B’s team has a weekly capacity of 60 hours a week for Service B. With 24 hours of sprint tasks, each team has plenty of extra time to burn down technical debt.

But what happens if a sprint has 70 hours of work? Service A’s team could handle that if every one of the team’s four people borrowed five hours that week from the next week, but the team would have no time to manage technical debt and would have only 30 hours of time for Service A the following week.

What if 70 hours of work were the norm for sprints? The teams would slowly burn out while having no ability to rethink the system design or manage their debt. The model is unstable, but we can restore equilibrium by doing one of the following:

  • The team transfers ownership of one of their services to another team, giving them more hours a day to spend on their tasks for Service A or Service B.
  • The team allows technical debt to accrue on one or all of their services until a service fails.
  • The team works more and more until individuals burn out, at which point they become unavailable for a period of time.

One of the things that the teams might do to try to reestablish equilibrium is change the design so that the integration pattern means less work for Service A’s lower-capacity team. Suppose that instead of connecting over HTTP, Service B connected directly to Service A’s database to get the data it needs. Service A’s team would no longer have to build an endpoint to receive requests from Service B, which means they could better balance their workload and manage their maintenance responsibilities, but the model would reach equilibrium at the expense of the quality of the overall architecture.

If you’re a student of Fred Brooks’s The Mythical Man-Month, you might object to the premise of this model. It suggests that one possible solution is to add more people to the team, and we know that software is not successfully built in man-hours. More people do not make software projects go faster.

But the point of this type of model is not to plan a road map or budget head count. It’s to help people consider the engineering team as a system of interconnected parts. Bad software is unmaintained software. Future-proofing means constantly rethinking and iterating on the existing system. People don’t go into building a service thinking that they will neglect it until it’s a huge liability for their organization. People fail to maintain services because they are not given the time or resources to maintain them.

If you know approximately how much work is in an average sprint and how many people are on the team, you can reason about whether a team of that size will be able to successfully maintain X number of services. If the answer is no, the design of the architecture is probably too complex for the current team.
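You can even run that reasoning as a toy simulation. The sketch below is my own simplification of the model above, not the author’s: the 50-hour weekly capacity comes from the Service A scenario, debt is assumed to accrue at a flat 5 hours a week, and overwork is modeled as hours borrowed from the next week’s capacity.

    def simulate(weeks: int, sprint_hours: int, weekly_capacity: int = 50) -> None:
        """Toy stock-and-flow model for one team and one service.

        Stocks: technical debt (hours of cleanup owed) and borrowed hours
        (overwork). Flow: each week, sprint tasks are paid for first; leftover
        hours pay down debt; any shortfall is borrowed from the next week.
        """
        debt, borrowed = 0, 0
        for week in range(1, weeks + 1):
            available = weekly_capacity - borrowed        # overwork owed reduces this week's hours
            borrowed = max(0, sprint_hours - available)   # shortfall carried into next week
            slack = max(0, available - sprint_hours)      # time left over for cleanup
            debt = max(0, debt + 5 - slack)               # assumed: 5 hours of new debt per week
            print(f"week {week}: borrowed={borrowed:>3} debt={debt:>3}")

    simulate(weeks=6, sprint_hours=24)  # stable: debt stays at zero, no borrowed hours
    simulate(weeks=6, sprint_hours=70)  # unstable: borrowed hours and debt climb every week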

Don’t Stop the Bus

In summary, systems age in two different ways. Their usage patterns change, which requires them to be scaled up and down, or the resources that back them deteriorate to the point where they fail. Legacy modernizations themselves are anti-patterns. A healthy organization running a healthy system should be able to evolve it over time without rerouting resources to a formal modernization effort.

To achieve that healthy state, though, we have to be able to see the levels and hierarchy of the systems of systems we’re building. Our technical systems are made up of smaller systems that must be stable. Our engineering team behaves as another system, establishing feedback loops that determine how much time and energy they can spend on the upgrades necessary to evolve a technology. The engineering system and the technical system are not separate from each other.

I once had a senior executive tell me, “You’re right about the seriousness of this security vulnerability, Marianne, but we can’t stop the bus.” What he meant by this was that he didn’t want to devote resources to fixing it because he was worried it would slow down new development. He was right, but he was right only because the organization had been ignoring the problem in question for two or three years. Had they addressed it when it was discovered, they could have done so with minimum investment. Instead, the problem multiplied as engineers copied the bad code into other systems and built more things on top of it. They had a choice: slow down the bus or wait for the wheels to fall off the bus.

Organizations choose to keep the bus moving as fast as possible because they can’t see all the feedback loops. Shipping new code gets attention, while technical debt accrues silently and without fanfare. It’s not the age of a system that causes it to fail, but the pressure of what the organization has forgotten about it slowly building toward an explosion.
