10

Maintainability

“Temet nosce” (Know thyself)

- Ancient Greek aphorism, here given in Latin.

In the last chapter, we examined security testing and the unique checks related to securing your application. These are vital tests that need special consideration because they do not form part of the core functionality of your product or feature. This chapter considers another area that can be overlooked: maintainability. Unlike the other methods of functional testing, maintainability is not customer-facing, so it is a lower priority when time is short. However, getting this area right makes working with this application easier for everyone involved – testers, the support team, and the developers assigned to fix and improve the code.

You need to consider your code’s instrumentation carefully – what events should it report on its behavior? Only by gathering this information can the product managers know which features are important and which are rarely used in practice. It can be tough to set the level of logging and events so that you can easily filter the important messages. This needs to be tested too.

In this chapter, you will learn about the following:

  • The requirements for monitoring new features
  • How to use monitoring to enhance your testing
  • The testing of maintenance operations
  • The requirements for logging
  • Ideal logging usage

Out of all the areas of testing described in this book, this one is closest to my heart. If you get this right, debugging all other testing areas becomes simpler and, more importantly, quicker. Good logging means that testers and junior developers can resolve issues without needing help from senior team members, making the entire development process faster.

Know thyself and know thy app! Great maintainability lets you understand your application, so spend the time learning how to get this foundation in place. However, it requires special consideration, so this chapter explores its possibilities and pitfalls. First, we consider the maintainability use cases for the internal teams using your product.

Understanding maintainability use cases

Testing maintainability features is a form of black-box testing that checks functions that a specific group of users needs, so it is quite separate from code maintainability.

Code maintainability requires that your code is well architected, with minimal dependencies and clearly defined interfaces between modules. It needs a clear directory structure and comments to allow developers to easily find and understand the code, among many other considerations. However, these are outside the scope of this book and aren’t the focus of this chapter.

Maintainability, here, refers to product features for users inside your company. That includes groups such as the following:

  • The operations team ensuring a hosted application runs reliably
  • The support team triaging customer issues
  • The product owner checking feature usage
  • The test and development teams debugging errors

You might release a brilliant new feature that your customers love, but it writes no logs and is completely opaque. When it fails, you have no way to see why. In that case, you must first update the code to add log lines before you can even begin to debug. This lack of logging is a bug you should identify during your testing. Other examples might be as follows:

  • Your next release might pass all its tests with flying colors, but the upgrade is missing a database migration so the operations team can’t roll it out live.
  • You release a feature and your product owner thinks it is a big success, but it has no metrics to see if customers have actually used it.
  • You release a feature and it seems to be running well, but later you realize it has been broken for a week and you had no monitoring alarms to warn you.

These classes of problems are unique to internal users and can often be neglected or misunderstood. However, out of everyone, internal users will spend the most time interacting with your product, and the problems they encounter cost your company time and money.

This is a distinct set of requirements that need dedicated tests. Each of the previous problems is a bug, either in the specification or implementation, and it’s your job as a tester to find them. With that in mind, next, we consider the advantages and disadvantages of maintainability testing.

Advantages and disadvantages of maintainability testing

An alternative to explicitly performing maintainability testing is to try to maintain your code and see what happens. You’ll soon discover which logs are unclear and which problems are impossible to debug due to missing information. Your task is to speed that process up by predicting what you need. The advantages and disadvantages of maintainability testing are as follows:

Advantages:

  • The best way to discover maintainability issues
  • It makes your life easier
  • Speeds up the development process

Disadvantages:

  • Not a priority for product development
  • Requires knowledge of user experience design
  • Hard to predict what information you will need
  • Benefits from feedback late in the development cycle

Table 10.1 – Advantages and disadvantages of maintainability testing

The biggest issue for maintainability is its low priority. It doesn’t directly help customers, so there is always a challenge to give it sufficient consideration. Testers can be less willing to raise maintainability issues, developers are less inclined to work on them, and product managers are unwilling to spend time on them. Maintainability improvements often need to come for free – by being included as part of other work that is considered a high priority. As with so many aspects of development, excellent maintainability can be achieved very quickly if you know what is required and don’t have to think it through from first principles or learn as you go along. In the long run, great maintainability testing is a huge benefit for your product, but it takes time in the short term. This chapter aims to make great maintainability as easy to achieve as possible.

However, designing good maintainability requires a good knowledge of user experience design. You need to consider different use cases, as testers’ tools and approaches might differ from developers’. As a tester, you will need to let the engineering team know what you need and argue for it.

This is especially hard because it can take time to determine the most important information. It’s far easier to see what you need when using the product, but by then, the development team may have moved on to other work. Often, requests for improvements to maintainability will come sometime after that code is live and the team is working on other features. The product manager may be reluctant to devote time to an area that should already be finished. There’s no easy way around this other than planning or arguing your case that the feature is not yet complete. Ideally, you should add time to revisit functions after they have shipped to make changes based on live feedback. If nothing needs changing, that is a bonus, and you can use that time elsewhere.

Planning maintainability work is challenging because every issue you debug is unique. You are not polishing a single happy-path scenario for users – every problem is different, probably because it wasn’t considered fully during the development process. Maintainability is one unhappy-path scenario after another, so predicting what you need is always tricky. However, some guiding principles apply to many situations, and we’ll consider those here.

Good maintainability helps everyone on the team. You can debug issues faster, understand the code better, and get greater visibility of what the code is doing and why. It’s well worth putting the time in to ensure your product is easy to maintain. Next, we will consider the three main goals of maintainability.

Goals of maintainability

The maintainability of your product includes three main use cases, which we will discuss in this chapter:

  • To quickly know if your system is suffering from degraded performance, for example:
    • Checking on hardware health
    • Checking on system resources
    • Identifying patterns in software issues
  • To easily maintain and improve your system, for example:
    • Seeing the current running state
    • Performing upgrades and improvements
    • Enabling new features and capabilities
  • To debug specific failures, for example:
    • Why didn’t a user receive their password reset email?
    • Why did a web page fail to load?
    • Why did the application fail to update to the latest version?

Maintainability is closely tied to observability because you can only make improvements if you thoroughly understand your product’s current running state. This section considers these three use cases in more detail.

Tools for observability

Complex products with multiple interacting systems will generate extensive data indicating their performance. Rapidly diagnosing issues depends on using various data types to gain an overall view of your system. The four main data types for observability are as follows:

  • Logs: Logs are the most traditional method of observability, being statements your application outputs as part of its operation. There is a spectrum of formats, from plain text logs stored locally in simple files to the structured output you can search and filter on dedicated web interfaces. As well as a particular message, logs will almost always include the following:
    • Timestamps
    • Severity
    • Which part of the system they were written by

The challenge with logs is managing the huge volume that can be generated and sifting them to find the key lines. Logs will be a major part of your life as a tester, so it’s in your best interest to ensure they are written well. Like no other part of the system, when it comes to testing logs, it’s personal.

  • Traces: Traces let you track a single request through your system to see the services it interacted with, for instance, from an Edge server through a Gateway, to a core machine and then the Database:
Figure 10.1 – Tracing a command through layers of a system

You can see the relevant loglines and timings and quickly zoom in on the source of a problem if one of the steps fails. On the downside, you’ll need to set up tracing across your system to avoid having gaps in the analysis. This might be an enormous task in a mature, distributed system, but is hugely valuable.

  • Metrics: Metrics are measurements of your system’s performance that are sampled regularly, then gathered and presented in a central location. They can cover everything from disk, memory, and CPU utilization to database latency and response times, through to counts of configured entities and rates of application operations. These measures and their graphs provide an invaluable view of the workings of your system, so you’ll need to check that they produce correct output and then use them to identify issues while you test. If a measure crosses a configured threshold or shows an anomaly, it can raise an error event to indicate the issue.
  • Events: The final type of observability data is events. These are more abstract than logs. Whereas logs might record output from each function involved in creating a user, an event will simply record that user X has been created. This lets you plot data points easily to measure trends over time and correlate additional data. If your user creation event includes information about which country they are in or what method they used to sign up, you can monitor that information and how it varies too.

Generating events from loglines is a shortcut, but it isn’t ideal. Events should be generated deliberately so that they are independent of code refactoring and can include all the information you need. An event derived from a log line may break if the log wording changes and might not record the right details, although it can be quicker to set up if you can configure event generation more easily than you can change code.
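To make the distinction concrete, here is a minimal Python sketch of emitting a deliberate, structured event rather than parsing a log line after the fact. The emit_event helper and the collector endpoint are hypothetical, standing in for whatever event pipeline your product uses.

```python
import json
import time
import uuid

import requests  # assumed: events are posted to an internal HTTP collector

EVENT_COLLECTOR_URL = "https://events.example.internal/ingest"  # hypothetical endpoint


def emit_event(name, **attributes):
    """Send a deliberately structured event, independent of any log wording."""
    event = {
        "id": str(uuid.uuid4()),
        "name": name,
        "timestamp": time.time(),
        "attributes": attributes,
    }
    requests.post(EVENT_COLLECTOR_URL, json=event, timeout=2)


# A refactor can rename functions or rewrite log messages without breaking
# this event, and it carries exactly the fields you want to trend over time.
emit_event("user_created", country="GB", signup_method="email")
```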

An important subclass of events is alarms, which indicate issues with your system. These alerts get the operations team out of bed at three in the morning or, if they are less severe warnings, give developers a to-do list of problems to investigate. Sometimes alarms are generated from other data types, such as error log lines or metrics showing issues; in that sense, they are outputs from other forms of observability. At other times there is logic in their generation – if a certain number of log lines are seen within a specific period, for example – or they can be written entirely independently of any other logs.

Each form of observability works with the others, solving a different problem and providing another part of the picture. Metrics will tell you about the health of your hardware and the load on your application, whereas events will give the headlines of operations that have occurred. Traces will tie related functions together across disparate machines, and logs will detail exactly what happened. You will need each to debug issues you find in your system.

As a tester faced with a new feature, the questions are: Are the right tools in place to ensure this feature can be maintained and upgraded? When there is a problem, will we be alerted quickly? Based on that alert, can we quickly diagnose the issue? Failures in any of those areas are bugs you need to raise. We will consider each of those three key use cases next.

Identifying system degradation

Consider this scenario: to improve system performance, your product has added a queue in front of its core servers. You’ve tested its functionality and security, applied stress and load testing, and everything looks good. But your queuing feature isn’t ready to go live until you can measure its performance and detect when it has failed.

Of the four types of observability data listed previously, metrics and events are the most important for identifying system degradation. Standard metrics should be measured for every standalone machine, including the following:

  • CPU
  • Memory
  • Disk usage
  • The number of threads or processes running

These should be monitored automatically for any new machine you start to run. Then, add load metrics specific to this application based on the events it generates. For our queue, for example, measure the following:

  • The rate of requests arriving
  • The rate of requests being serviced
  • The response time
  • The current queue length
  • Dead letters (requests that can’t be processed)

You will need to design the metrics for the applications in your system and the events to track its primary functions. Based on those, you can choose what alarms you want to raise. For the queuing system, these will be issues such as high rates of requests arriving, slow servicing of the queue, increasing queue length, and large numbers of dead letters.
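As an illustration of what that instrumentation could look like, the sketch below uses the Python prometheus_client library to expose the queue measurements listed above; the metric names are invented and the queue object is a stand-in, since your system may use a different metrics stack entirely.

```python
import time

from prometheus_client import Counter, Gauge, Histogram, start_http_server

REQUESTS_ARRIVED = Counter("queue_requests_arrived_total", "Requests arriving on the queue")
REQUESTS_SERVICED = Counter("queue_requests_serviced_total", "Requests taken off the queue")
DEAD_LETTERS = Counter("queue_dead_letters_total", "Requests that could not be processed")
QUEUE_LENGTH = Gauge("queue_length", "Current number of queued requests")
RESPONSE_TIME = Histogram("queue_response_seconds", "Time taken to service a request")


def handle_request(queue, request):
    """Instrumented wrapper around servicing one request (the queue object is hypothetical)."""
    REQUESTS_ARRIVED.inc()
    QUEUE_LENGTH.set(queue.length())
    start = time.time()
    try:
        queue.service(request)
        REQUESTS_SERVICED.inc()
    except Exception:
        DEAD_LETTERS.inc()
        raise
    finally:
        RESPONSE_TIME.observe(time.time() - start)


# Expose the metrics for the monitoring system to scrape; alarms on high arrival
# rates, growing queue length, or dead letters are then configured in that system.
start_http_server(9100)
```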

When you have identified the relevant metrics and placed alarms on them, you are ready to enable that part of your system. For more on detecting degradation, see the Designing monitoring section.

Improving your system

As we saw in Chapter 8, User Experience Testing, you aim to let users complete their tasks in just a few obvious steps. The same applies to monitoring and maintaining your system. You need to make it as simple as possible to roll out changes and upgrades, ideally automating the entire process from a change being approved. Other core maintenance operations include the following:

  • Testing the upgrade process
  • Configuration transitions
  • Security updates
  • Database migrations
  • Capacity expansion

Those topics are considered in more detail in the Testing maintenance operations section.

Debugging issues

If your system is running the correct version with the proper configuration and you still hit a problem, then you need to go down to the next level and look through the logs. Work through common failure modes for your product and explicitly plan out how you would diagnose them. What information do you start with? How do you know where to look? How many steps does it take to get there? Traces can be an invaluable way to link processes in disparate parts of a system and identify the sources of issues.

At some point in this process, you cannot specify debugging steps any further. You have no idea what kind of issues might be reported next or what failures you might find in your testing today, but you know the sort of issues you’ve hit in the past and the errors you’ve had to investigate. By extrapolating from them, you can see the kinds of problems you might face in the future and take as many steps as possible toward a solution. See the Using logging section for more detail on what you should require from logging on your system.

In the following sections, we’ll walk through the primary solutions to those three main use cases: what logging you need to debug issues, how to test the maintenance operations that improve your system, and the monitoring systems you need to identify system degradation.

Designing monitoring

Monitoring involves checking system performance metrics and regularly running simple automated checks. These overlap with tests you should have run before release, but they fulfill a different function since they find test escapes and operational issues affecting your live environment. Since monitoring checks are run after a system’s release, they are not tests as such but have so much overlap with your test plan that they’re worth considering here.

Being a tester, you’ll interact with the monitoring in two ways:

  • Firstly, you’re a user, using monitoring to check for errors on your test system. New code must complete its test run without causing errors in the monitoring system.
  • Secondly, you need to ensure that every new feature has relevant monitoring. If your system can now fail in a new way that you’re blind to, that’s a bug that needs fixing.

Monitoring requires a dedicated system that gathers all monitoring data so it can be aggregated, processed, and presented to users. Programs such as Nagios, Datadog, and Amazon CloudWatch perform this function, and a prerequisite for monitoring is that you have set up your chosen system ready to receive this information.

Instrumentation

The requirements for each new feature should specify what measurements you need to take from it. How often is that function invoked? From which interface? What was the outcome of the request, and what did the user do next? The product owner should routinely ask for this information, so if they don’t, double-check with them. This information can be provided with events, and checking they are correctly generated should be part of your testing.

Careful instrumentation and analysis can provide fascinating insights into your product’s usage without having to ask your customer for feedback. See Chapter 8, User Experience Testing, for more details.

Filtering alerts

Aim to have three levels of alerts in your system: critical, error, and warning. Instead of basing these on the state of your system, it’s more pragmatic and useful to define these based on the action you should take:

  • Critical: requires immediate attention. Examples: system unavailable; users unable to sign up; data loss in progress; a minor function is unavailable; an error preventing internal users from doing their jobs; consistent rejections from third-party systems; security issues.
  • Error: requires attention. Examples: high disk usage; a transient error not currently occurring; intermittent rejections from third-party systems; an internal problem with a workaround; unhandled exceptions.
  • Warning: does not require engineering attention. Examples: an expected failure; an outgoing API request fails; customer mistakes (incorrect passwords, misconfiguration, invalid API queries, and so on); temporary loss of connection; attempting to access an invalid address on your server.

Table 10.2 – Alert levels and their definitions

The most important part of monitoring is critical errors that indicate an ongoing outage. These require constant, careful tuning to ensure that you check every relevant metric, but you have to ruthlessly remove spurious errors so that alerts correlate to genuine issues. In monitoring, adding checks is relatively easy; avoiding false alarms is a much tougher job.

Every critical and error-level alert must require manual intervention to either fix or downgrade it. You can’t leave any of them alone. If it doesn’t need someone’s attention, it should be downgraded so you can see other errors more clearly. See Chapter 7, Testing of Error Cases, for more on the distinction between expected and unexpected errors.

Possibly the easiest fix for an alert is simply to downgrade it. Suboptimal but possible scenarios should be logged as warnings, for instance, occasional rejections on a third-party API. You can’t stop them from happening and want to record that there was a problem, but they’ll be retried so no action is needed from the development team. That’s a warning.

There is a gray area between warnings and errors. A one-off rejection by a third-party API is only a warning, but 100 in 5 minutes is an error, or possibly a critical ongoing outage. Some warnings need to be promoted to errors based on their frequency.
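Here is a minimal sketch of that promotion logic: rejections are counted in a sliding window and escalated from a warning to an error once they cross a threshold. The thresholds and the raise_alert helper are illustrative; most monitoring systems can express this kind of rule natively, so you would rarely hand-roll it.

```python
import time
from collections import deque

WINDOW_SECONDS = 300      # 5-minute sliding window
ERROR_THRESHOLD = 100     # 100 rejections in the window becomes an error

_recent_rejections = deque()


def raise_alert(level, message):
    """Placeholder for however your monitoring system receives alerts."""
    print(f"[{level.upper()}] {message}")


def record_third_party_rejection(now=None):
    """Log one rejection as a warning, escalating if they become frequent."""
    now = now or time.time()
    _recent_rejections.append(now)

    # Drop rejections that fall outside the sliding window.
    while _recent_rejections and _recent_rejections[0] < now - WINDOW_SECONDS:
        _recent_rejections.popleft()

    if len(_recent_rejections) >= ERROR_THRESHOLD:
        raise_alert("error", f"{len(_recent_rejections)} third-party rejections "
                             f"in the last {WINDOW_SECONDS // 60} minutes")
    else:
        raise_alert("warning", "Third-party API rejected a request; it will be retried")
```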

When a developer writes the code for one part of the system and it fails, they may be tempted to raise it as an error since their function was unsuccessful. However, that may be a minor part of the system, or there may be retries or fallbacks designed to correct that state without intervention. They mustn’t fill the logs with error messages, which may obscure far more important problems. Errors should only be for impossible states of the system. Anything bad but possible is just a warning.

More problematic are the cases where a single error has two causes; a common, boring one and an important, rare one.

Real-world example – It’s always packet loss

In my video conferencing days, we regularly received reports of poor video quality. However, they were almost always due to packet loss on poor networks. The support team patiently guided our customers through troubleshooting, generally finding a different network that performed better, to show that our equipment wasn’t at fault, which was almost always the case.

Because there were so many reports and there was a well-known cause, when we did have a bug in our decoder, it took us a long time to isolate and fix it. We needed more than a report of an issue – we needed a consistent pattern of problems, even on reliable networks, before it was escalated to the development team to investigate.

The solution for serious issues hiding behind a common one is to monitor them separately, to find a difference in the conditions, which lets you raise two different alarms. Then, one can be downgraded to a warning, and the other left as a critical alert.

When filtering errors, expect to be under constant attack. On a public web server, amateur hackers will regularly try to reach configuration files for common servers in an attempt to access your system. So long as your security configuration is correct, these will be rejected with a 400 error. Since this is a predictable problem, you can’t add an alarm for it. If you want to raise alarms about incorrect access, you need to find a different way.

Real-world example – Raising client-side errors

One company I worked for wanted warnings if any of our systems received 400 errors when communicating internally with other parts of our system. Unfortunately, the servers they contacted were public and constantly returned 400 errors to attackers trying to access invalid addresses.

We couldn’t add an alarm on the server because of all the attacks, but we could raise a critical alarm if one of our clients received a 400 error while querying an internal system. Since we controlled both the client and server sides of those requests, an error response showed a misconfiguration in our system that we needed to fix urgently.

Performing active checks

In addition to passively checking for errors in your system, your monitoring should actively perform actions to ensure that key functions are working. Typical monitoring checks include the following:

  • API queries to ensure the API is running and returning responses correctly
  • Loading landing pages
  • Connecting clients
  • Signing in
  • Signing up
  • Performing a simple standard operation, such as creating a new entity

You should run these checks on both user-facing and administrator interfaces, raising different alerts for each.

Monitoring checks will typically run every few minutes, with frequencies varying from every 30 seconds for brief, lightweight checks, to once an hour for non-critical systems or checks that involve significant complexity or cost. You should run checks as often as possible to detect outages quickly. The goal is to identify problems before your users. The worse the outage is, the easier it is to find with monitoring, but the less time you have before your customers hit it too.
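As a sketch of the shape an active check can take, the following Python script loads a landing page and queries an API endpoint, raising a critical alert if either fails. The URLs and the raise_alert helper are assumptions; in practice, these probes usually run from your monitoring service rather than a standalone script.

```python
import requests

LANDING_PAGE = "https://app.example.com/"            # hypothetical user-facing page
HEALTH_API = "https://api.example.com/v1/status"     # hypothetical API endpoint
TIMEOUT_SECONDS = 10


def raise_alert(level, message):
    """Placeholder for pushing an alert into your monitoring system."""
    print(f"[{level.upper()}] {message}")


def run_active_checks():
    """Perform customer-facing operations and alert if any of them fail."""
    checks = {
        "landing page": LANDING_PAGE,
        "API status": HEALTH_API,
    }
    for name, url in checks.items():
        try:
            response = requests.get(url, timeout=TIMEOUT_SECONDS)
            response.raise_for_status()
        except requests.RequestException as exc:
            raise_alert("critical", f"Active check failed for {name}: {exc}")
        else:
            print(f"{name} responded with {response.status_code}")


if __name__ == "__main__":
    run_active_checks()   # scheduled every few minutes by cron or the monitoring tool
```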

Monitoring creates a dull background noise you must filter out of the logs. You’ll need to remove any users and other entities your monitoring makes, and this deletion is a useful check that removing configuration is successful. Ideally, mark all entities created by monitoring as test entries that can be deleted. Filtering monitoring usage out of logs can be harder but is worthwhile to make the actual activity easier to see.

System resources versus customer outcomes

The most vital checks are of customer-facing outcomes: can users actually log in? These checks are far more useful than checks on the internal system state. Outages fail to correlate with resource monitoring in both directions: a machine may have an unacceptably high CPU load, but the service is still available and working. Conversely, your application may be down even though resource loads are fine.

Consider customer-facing metrics, such as the following:

  • Screen loading success rates and times
  • User action success rates and times
  • Connection success rates and times
  • Download success rates and times
  • Upload success rates and times

If your application lets users upload data, for instance, then there should be metrics tracking every upload attempt, whether it worked, and how long it took. You should be able to produce graphs to easily see anomalies, and to run anomaly detection software to produce automatic alarms if there are problems. All of this relies on having metrics in the first place, so it is important to test that they are present and working.
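For example, the upload path could be wrapped as in the sketch below, again assuming a Prometheus-style client. The counter is labelled by outcome so success rates can be graphed directly, and the histogram records how long each attempt took; the do_upload function stands in for your real upload code.

```python
import time

from prometheus_client import Counter, Histogram

UPLOAD_ATTEMPTS = Counter("upload_attempts_total", "Upload attempts", ["outcome"])
UPLOAD_DURATION = Histogram("upload_duration_seconds", "Time taken per upload attempt")


def instrumented_upload(do_upload, data):
    """Record every upload attempt, whether it worked, and how long it took."""
    start = time.time()
    try:
        result = do_upload(data)
    except Exception:
        UPLOAD_ATTEMPTS.labels(outcome="failure").inc()
        raise
    else:
        UPLOAD_ATTEMPTS.labels(outcome="success").inc()
        return result
    finally:
        UPLOAD_DURATION.observe(time.time() - start)
```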

As defined previously, approaching system resource limits is an error-level event: it requires intervention but is not a current outage. This event should warn you in time to mitigate it. In practice, if your system is awash with error-level events that you don’t triage regularly, you will need to escalate approaching resource limits into critical alarms so you can act on them, but this sort of inflation in error classification isn’t sustainable. You need to downgrade your current, spurious errors to let you see the real ones clearly.

Hierarchies of system failures

One failure can have many expected effects. For instance, if a hardware server fails, it’s not a surprise that none of the machines it hosts are running. In that case, you’ll be flooded with errors, listing all the different machines that are unavailable, and the knock-on effects throughout your system. To quickly identify the hardware failure’s root cause, you’ll need to set up hierarchies of dependencies within your monitoring.

By letting your system know that specific machines rely on that hardware, if it fails, your monitoring can ignore those alarms and only report the root cause, speeding up your diagnostics.
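A sketch of that suppression logic is shown below: given a map from each hardware server to the machines it hosts, alarms that are explained by a failed parent are folded into the single root-cause alarm. The data structures are invented for illustration; mature monitoring tools provide dependency or parent-host configuration for exactly this purpose.

```python
# Hypothetical dependency map: each hardware server and the machines it hosts.
DEPENDENCIES = {
    "hw-server-01": ["db-primary", "core-app-1", "core-app-2"],
}


def filter_alarms(active_alarms):
    """Report only root causes, suppressing alarms explained by a failed parent."""
    suppressed = set()
    for parent, children in DEPENDENCIES.items():
        if parent in active_alarms:
            suppressed.update(children)
    return [alarm for alarm in active_alarms if alarm not in suppressed]


# If the hardware server fails, only its alarm is reported, not the
# knock-on failures of every machine it hosts.
alarms = ["hw-server-01", "db-primary", "core-app-1", "core-app-2"]
print(filter_alarms(alarms))   # ['hw-server-01']
```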

Automatic problem reporting

As well as errors from your servers, you should ensure that your clients or any other remote programs report issues to you. Clients connected to cloud infrastructure should regularly upload their logs and inject events into your event management system, whatever that may be, including errors that require attention, such as crashes.

Even software following the traditional model of an application installed on-site or on users’ computers should also report crashes back to a central server. These reports should include a complete dump of their logs and state to allow for thorough debugging without needing any extra information from the customer.
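As one possible shape for client-side crash reporting, the Python sketch below installs an exception hook that bundles the stack trace with recent logs and posts them to a central collector. The collector URL and log path are hypothetical.

```python
import sys
import traceback

import requests

CRASH_REPORT_URL = "https://reports.example.com/crashes"    # hypothetical collector
LOG_PATH = "/var/log/myapp/client.log"                       # hypothetical log location


def report_crash(exc_type, exc_value, exc_tb):
    """Upload the stack trace and recent logs, then fall back to the default handler."""
    try:
        with open(LOG_PATH) as log_file:
            recent_logs = log_file.read()[-64_000:]           # last ~64 KB of logs
    except OSError:
        recent_logs = "<logs unavailable>"

    report = {
        "traceback": "".join(traceback.format_exception(exc_type, exc_value, exc_tb)),
        "logs": recent_logs,
    }
    try:
        requests.post(CRASH_REPORT_URL, json=report, timeout=5)
    except requests.RequestException:
        pass   # never let crash reporting mask the original failure

    sys.__excepthook__(exc_type, exc_value, exc_tb)


sys.excepthook = report_crash
```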

Monitoring overview

Monitoring is not a replacement for testing; it’s a tool you can use to detect issues while testing and it’s a functional area new features must consider. If a new function introduces a failure mode – a new subsystem that can crash, for instance – you need to ensure those issues are reported back into your main monitoring system. A new function that can’t be monitored, or requires its own unique process, can’t be easily maintained and needs to be changed.

In the next section, we will move on from monitoring your system overall to internal operations, which are another crucial part of ensuring your system is maintainable.

Testing maintenance operations

While only internal staff perform maintenance operations, they are often among the riskiest on your system. Upgrades and configuration changes are the sources of most outages, in my experience. You’ll need to test the content of each modification, but this section considers changes in the abstract – what should you try for every upgrade?

Real-world example – Which version is that?

I once tested a minor plugin that supported our main product, which was installed locally on a desktop. We shipped the first version, then a couple of maintenance releases. Only after a couple of versions did we realize there was a piece of functionality that we’d missed. Users kept reporting issues that we thought we’d already fixed. Were our fixes failing, or had the user not upgraded? The first question that the support team asked was what version they were running.

Unfortunately, the plugin didn’t report its version number anywhere, not on the user interface and not in the logs. Instead of being able to tell what was happening, we had to advise users to upgrade just in case they weren’t up to date. Not reporting the version was a slight oversight, but made the support team’s job much harder.

Firstly, all parts of your system need to report their current version and running configuration. That’s not the version that’s supposed to be running and the configuration it’s supposed to have – your system needs to report what is actually in use. Showing the intended configuration is straightforward because you just read the contents of the configuration files; it takes more effort to check the machines and see what was actually applied. However, pushing configuration can fail, leaving those files out of sync with reality. If you don’t have a quick way to check your system’s real state, that’s an important feature request.
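One lightweight way to meet that requirement is a status endpoint that reports the version and configuration the running process actually loaded, as sketched below with Flask. The route, fields, and values are assumptions about how your application stores this information.

```python
from flask import Flask, jsonify

app = Flask(__name__)

# Hypothetical values populated at build and startup time.
BUILD_VERSION = "2.14.3+b1789"
ACTIVE_CONFIG = {"feature_flags": {"new_queue": True}, "region": "eu-west-1"}


@app.route("/internal/status")
def status():
    """Report what is actually running, not what was supposed to be deployed."""
    return jsonify({
        "version": BUILD_VERSION,    # read from the running binary, not the deploy plan
        "config": ACTIVE_CONFIG,     # the configuration this process actually loaded
    })


if __name__ == "__main__":
    app.run(port=8080)
```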

Worst-case maintenance operations

Carefully consider maintenance operations, as they can have a multiplicative effect on your test plan. Backup and restore is one such process, covered in Chapter 11, Destructive Testing; other examples are upgrades or migrations between servers. For every new feature that you add, you’ll need to check it works correctly for those operations.

To test these, it’s helpful to prepare the most convoluted, complex configuration possible for a customer, a user, or whatever entities your application deals with. Ours were known as evil organizations. Add all possible types of extra details, with multiple instances if possible. Then, perform upgrades, migrations, and restores to see if all the configuration survives. Where settings are mutually exclusive, you will need a set of evil organizations to cover them all and have at least one massive evil organization. This tests loading on the system, even though it may be too slow for regular use. If those evil organizations work, then simpler entities should be fine.

Real-world example – Losing our largest customer

Often your most important customers are also the largest ones, who push your system to its limits. At one place I worked, we needed to move our largest customer to a new geographic node. This meant a migration, copying all its data from a source node database to the destination. That would involve downtime, so we scheduled the maintenance for overnight. I started the command and waited. These operations were slow and would take minutes at the best of times, so with that many users, it wasn’t a surprise when this took over an hour.

Then it hit an error. After all that waiting, the operation had timed out. I was frustrated but not worried yet; our system was designed to only delete information from the source once the destination copy was complete. A rollback should have left us back where we started, but unfortunately, it didn’t. The failure meant the configuration never arrived at the destination, but the source thought the operation was a success and deleted it. I frantically triple-checked both locations, but the customer had vanished.

We’d tested failures where the destination rejected the transfer but not where the destination accepted it but took too long. I scrambled to restore the configuration from the backup and got the customer back online in their original location before their users arrived in the morning.

Centralized commands

Rollouts should be centrally controlled with commands to completely upgrade the independent parts of your product. This can be difficult in a varied environment with legacy systems, but you will be upgrading it a lot, so it’s well worth the effort to improve the process. Likewise, the configuration should be centrally managed to provide easy visibility and control.

In Chapter 8, User Experience Testing, I discussed the problems I’d encountered with locally configured feature flags. When you have a local configuration, there is no easy way to set it and check it across your entire infrastructure. The wrong way to solve this is to add a checking mechanism – you’ll still have thousands of individually configured settings, but you can search for discrepancies. This is a waste of your time, as it needs constant checking and fixing. To solve the problem for good, a better fix is to make the configuration centrally controlled. Better yet, use an external service that your application queries, which gives you both visibility and control.

Watch out for backsliding in the development team. Possibly your system already has excellent centralized management, so your job as a tester is to ensure that you maintain those high standards. On a rushed project I once worked on, a new feature was enabled with a database field. We had a perfectly good system for allowing new features, but the developer hadn’t used it. That’s a maintainability issue for you to catch.

Testing upgrade types

Upgrades are some of the riskiest operations on a live system, both at the point when you ship bugs live and due to the risk of the upgrade process going wrong. You need to test this process, checking for failures for each upgrade type in your product. In a heterogeneous system, there will be a different process for each kind of server, and various upgrades even within the same server. Consider cases such as the following:

  • Edge servers/load balancers
  • Gateways
  • Core servers
  • Databases
  • Web server versions
  • Serverless cloud infrastructure
  • OS upgrades of all the systems listed previously

Upgrades are generally the first operation you test. They get the new code running, ready for all the other tests, but ensure you perform upgrades realistically. You may have development shortcuts to get you running faster, and those are fine for everyday use, but make sure you have at least one test that goes through the customer steps.

Testing upgrade control

Upgrades should be tracked, with a complete history available for running versions and configuration. This control also lets you roll out versions easily and see the requested settings. Remember that the running version may differ, so don’t simply trust that your configuration has been applied.

Real-world example – What was running on the 16th?

In one company, we carefully tracked the version history of our servers, but our client’s software version was just controlled with a configuration setting.

After one release, a new error appeared. Looking back through the logs, we had first hit it several days before, around the time of the upgrade. Had this new release introduced an issue? If we rolled back to the previous version, would that fix it? That depended on when exactly we had performed the upgrade. Without a version history, we had to trawl through the logs to piece together what had happened.

While core servers and configuration may be tracked, are all your settings controlled that way? New settings or features, such as DNS, may simply be set rather than version controlled. Is the configuration described all the way to the server level, or does it assume a set of servers are in place and you must keep them synchronized? For maintainability, get as much configuration as possible into your settings files.

Once configuration files are ready, you can run extensive checks on them before they go live. Does your system correctly reject invalid inputs such as the following?

  • Invalid syntax
  • Code versions being unavailable
  • Duplicate addresses and IDs
  • References to other machines and configurations being unavailable

It’s vital to get this feedback as early as possible instead of having to loop through the entire cycle of upgrading in practice, which can take minutes or hours.
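A validation pass over a configuration file might look like the following sketch. The JSON schema is invented for illustration, but the intent is the same as the list above: catch invalid syntax, unavailable versions, duplicate IDs, and dangling references before a slow upgrade cycle does.

```python
import json


def validate_config(path, available_versions):
    """Return a list of problems found in a (hypothetical) JSON configuration file."""
    problems = []
    try:
        with open(path) as config_file:
            config = json.load(config_file)
    except (OSError, json.JSONDecodeError) as exc:
        return [f"Invalid syntax or unreadable file: {exc}"]

    machines = config.get("machines", [])

    # Duplicate addresses and IDs
    ids = [m.get("id") for m in machines]
    addresses = [m.get("address") for m in machines]
    if len(ids) != len(set(ids)):
        problems.append("Duplicate machine IDs")
    if len(addresses) != len(set(addresses)):
        problems.append("Duplicate addresses")

    for machine in machines:
        # Code versions being unavailable
        if machine.get("version") not in available_versions:
            problems.append(f"{machine.get('id')}: version {machine.get('version')} is unavailable")
        # References to other machines being unavailable
        for dependency in machine.get("depends_on", []):
            if dependency not in ids:
                problems.append(f"{machine.get('id')}: depends on unknown machine {dependency}")

    return problems


print(validate_config("rollout.json", available_versions={"2.14.3", "2.14.4"}))
```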

Testing upgrade processes

Certain upgrade types can avoid downtime by running the old and new versions simultaneously and directing traffic to the new instance. If that’s the case, you’ll need to verify that the mechanism is in place and working. See the Testing transitions section for more details.

If there’s downtime, you’ll need to check its duration and watch for any errors. Upgrades should be as automated as possible, with checking, scheduling, uploading code, restarts, and further checking performed without manual intervention. This automation is a massive benefit, but you need to check its functionality, including the following:

  • Upgrade scheduling:
    • Refer to the time tests in Chapter 5, Black-Box Functional Testing
  • Downtime during upgrades:
    • Duration
    • Errors while coming back online
  • Dependencies are upgraded in the correct order
  • Upgrades are as conservative as possible
  • Upgrades under heavy load
  • Upgrade failure cases such as the following:
    • An invalid upgrade image due to the following:
      • Incorrect sizes
      • Corrupt contents
      • Hash validation failing
    • A failure mid-upgrade due to the following:
      • Lack of system resources
      • Communication failure
    • Failure to start the new image
  • Recovery checks such as the following:
    • Machines are running
    • All processes have restarted
    • Network access is available
    • The application is responding

Ensure upgrades are run in the proper order. If an image needs to be uploaded to intermediate servers, check that it’s there before it’s required, for instance. Upgrades should be as conservative as possible. It’s tempting to add an upgrade or restart for a service just in case, which adds unnecessary load on the system. If one service isn’t being developed but is upgraded at the same time as a rapidly developing service, it can have orders of magnitude more restarts than necessary. This slows down the upgrade process for no reason.

Client upgrades

So far, we’ve considered server-side upgrades, but some products also have to control client rollouts. If you have a web application, your clients are browsers, and you don’t need to worry about their upgrades. Similarly, if you’re testing an app, then the App Store or Play Store will perform upgrades, and be grateful they handle that complexity. If you have dedicated clients connecting to your infrastructure, however, you will need to manage those upgrades, which can be a challenging task.

Real-world example – Breaking down upgrades

At one place I worked, we specified a single version of client software that could connect to our cloud. When we pushed upgrades, clients went offline until they had downloaded the new code and could run it. We performed upgrades at night, so if users left their clients online, they wouldn’t even notice, but if they connected in the morning, then they’d have to wait for the upgrade. That was fine until we gained large numbers of users.

We only supported a few simultaneous downloads, so clients had to wait until their turn, all while being offline. It caused many support cases, so we fixed it by letting clients connect to older versions of the software and download the new code in the background while running. Until then, the poor operations team had to stagger upgrades, only doing a small fraction at a time so we could complete them quickly.

When testing client upgrades, consider the following:

  • Slow internet connections, causing long downloads:
    • What is the timeout?
  • Connections with significant packet loss
  • Interrupted connections
  • Mass simultaneous requests for upgrades
  • Server restarts during downloads
  • Limited client resources preventing the upgrade:
    • Insufficient disk or memory

Both client and server upgrades should be backward compatible whenever possible, but if there’s a breaking change, you’ll need to test the process that forces clients to upgrade. Where you support previous versions, you’ll need to check that the oldest supported clients still work with the latest infrastructure to ensure that backward compatibility works in practice.

Recovery and rollback

Testing rollback mechanisms is usually low down the priority list since it’s an operation you never plan to use. However, on the day you need it, rolling back becomes the most critical task in the world, so make sure you try it in advance.

There are several ways of performing a rollback; you’ll only need to test the one you plan to use in practice:

  • Roll back to a fixed state
  • Revert a change
  • Manually undo a change
  • Make further changes to affect a fix

If you can take a snapshot of your system code, you can roll back by reverting to that snapshot. You’ll need to separately fix your code to make sure the bug isn’t deployed again. That automates returning to a known-good state and is the best recovery method if it’s available. Reverting the change is also a controlled way to return to the previous state, although the code needs to be rebuilt and redeployed, so this can take longer.

Unfortunately, simply reverting the code to a previous state isn’t possible if there are stateful changes such as database migrations or cache updates. Then, you’ll need manual steps to roll back those data stores, as well as move to older code.

Finally, instead of reverting the change that caused a breakage, you can update the rest of the system to support the altered behavior. This may be tempting if you want the breaking change but had released it too early when its dependencies weren’t in place. Instead of moving backward, you could update those dependencies and ship the change correctly.

This is the riskiest form of fix because you start running the new code in an untested configuration. Having shown that your testing did not catch the initial problem, shipping untried code even faster to fix it is dangerous. However, if the change is simple enough and you can perform sanity testing in advance, this route might be best.

These tasks are headaches for the operations team. As a tester, needing a rollback probably shows that your testing was inadequate and needs to be improved. Your job is to learn from any test escapes and test that steps such as reverting to snapshots and reversing database migrations will work correctly if needed.

Testing transitions

Suppose that your current live version is working well, and the next version with a new feature fully implemented and rolled out has passed testing. But how do you get from here to there? While some features are simple and only require a single upgrade or feature flag, others require several steps in their rollout.

The classic example is removing a database column, which must be deprecated first. Simply removing a database field in a single step risks system failures until you upgrade the database and all processes that access it. Depending on how long your rollouts take, this could take a significant amount of time.
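A hedged sketch of that multi-step rollout is shown below, with each phase described in comments and only the final schema change expressed as SQL. The table and column names are invented, the connection object is assumed to be DB-API style, and real projects would normally drive this through their migration framework.

```python
# Removing a column safely takes several releases, not one.

# Release 1: stop writing to the column, but keep reading it so old code still works.
# Release 2: stop reading the column everywhere; the schema is untouched but unused.
# Release 3: only once no running process touches the column, drop it.

DROP_COLUMN_MIGRATION = """
ALTER TABLE users DROP COLUMN legacy_signup_source;
"""


def apply_final_step(db_connection):
    """Run the schema change only after the deprecation releases are fully rolled out."""
    with db_connection.cursor() as cursor:
        cursor.execute(DROP_COLUMN_MIGRATION)
    db_connection.commit()
```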

To catch these transient issues, you’ll need to monitor your system during the upgrade and follow up on any problems, even if they appear to resolve themselves. Rollouts can be far quicker on test systems with the same logical setup as the live system but at a smaller scale. A brief problem on the test system might translate to a far longer one in production, so monitor your upgrades for any errors and then follow them up.

Where an upgrade has several distinct steps, you will need to check the functionality of each one. When handing over between blue/green systems, this has three phases:

  1. Run the upgraded system alongside the live system:
     A. Check that both work as expected
  2. Direct traffic to the upgraded system:
     A. Check that the new system is processing requests correctly
     B. Check that the old system is unused
  3. Decommission the old system:
     A. Check that its removal hasn’t affected live traffic on the new system

You will need to break down each step for more complex transitions and design test plans for each phase.

Testing maintenance operations overview

Maintenance operations are some of the riskiest changes performed on an application. They can involve fundamental changes to how a system works, and they are generally rarer than customer operations. Configuration management, upgrades, and system transitions all need to be carefully tested for each release to check that they work correctly for new functions.

When there are failures and you need more information to diagnose the issues, you will need to turn to the logs. As your primary debugging tool, you need to ensure that each new feature is writing logs well, as described in the next section.

Using logging

Logging is the most useful form of output from your system for diagnosing issues and understanding the details of its behavior. This section considers how logs should be written so that you can use them to the best effect.

Finding “the” log

It’s vital that you get the correct information to the developer to help diagnose an issue in their code. While I always try to include the relevant information, I’ve lost count of the number of times a developer has asked me for the log. It’s a simple enough request; the only problem is the word the. Which log do they need for any given issue? Within that log, which line is the clue? Other sections of this chapter will recommend approaches to finding problems within logs; here, we consider finding the correct log in the first place.

Real-world example – Logging on dual redundant hardware

In two of my jobs, I’ve improved the presentation of logs, and in both cases, I’ve been incredibly glad that I did. At one company, we made redundant hardware systems for processing SMS text messages. Two blades ran live simultaneously, both handling the system load, each writing logs of the messages they sent and received.

This made debugging a nightmare because the messages were interleaved between the two logs. I wrote a program that downloaded the logs from both blades and arranged them in time order so that you could see the overall flow of messages through the system. This wasn’t the most complicated program ever written, but it made our lives much easier.

Users of the system must have a minimum level of knowledge, such as where the logs are and how the different parts of the code interact. Where necessary, you can document the steps users should take, for instance, by writing a troubleshooting guide that considers the most likely failures and the areas to check in each case.

For instance, if a user didn’t receive their password reset email, you would first check the system that sends out the emails, which is likely to be a third party. Was the email sent at all? Did the destination acknowledge receipt? Or did the request not make it that far? If not, you need to check the part of your system that sends those requests. If that is also missing this request, then look at the web page itself – did the page send the information correctly?

Even better than documenting the systems involved in different processes is automatically gathering the logs you need. Adding traces to your system lets you link together transactions from different parts of your system. A little work in this area can have a transformative effect, as described in the next section.

Understanding the debugging steps

Debugging can have the kind of user experience that would make a professional UX designer break down in tears. While the user experience may be atrocious, that’s okay because very few people carry out debugging, and all of them are paid by your company. The difficulty of debugging – both the complexity of the tasks it requires and the problems you are trying to solve – highlights some of the fundamental ideas behind user experience design, for instance, the number of actions a user has to perform to reach their goal.

Once you know which logs contain the information you need, you can work out how many steps it takes to retrieve it. What information do you start with – for instance, the email address of the user who experienced the issue? Does the log you need contain that email address so you can search for it? Or does the log contain a unique identifier that you’ll need to look up?

Break down each step to see what a tester or support engineer would do in practice, and look for areas that can be improved. Do you need to add identifiers into a log to make it easier to search? Is there anywhere you need to do a database lookup to convert between different identifiers? Can you write a tool to make that step easier, or use the same identifier everywhere, so you don’t have to convert it?

When it comes to identifiers, make sure you have as few as possible. For example, for a customer name, you might have one user-facing, editable identifier and a second unique identifier used throughout the system. It is surprisingly easy for identifiers to proliferate, especially in a complex product. Part of maintainability is making sure that doesn’t happen.

One product I worked on had six different identifiers for our customers: the frontend had a database ID and a different ID when loading web pages, as well as the primary identifier, which was shared with the backend. The backend had another database with a different ID, as well as a final ID (I was never sure of its purpose), not counting the actual customer name. Wherever related logs use different IDs, you need to translate between them to track a command through the system. Every translation is more work for you and another place to make mistakes, so if you see a new identifier creeping into usage, raise that as a maintainability problem.

In a distributed system with multiple servers handling each request, check how easy it is to track messages through the system. If you have 10 different edge servers, do your core servers record which server they received messages from so you can check back through the system? How obvious is that step? Do you have to look up IP addresses, or are servers identified by name? Is the source written in the logline with the user identifier, or do you have to search back in the log for the transaction? Each answer requires a different step, and each step takes time to perform and is a chance to make a mistake.

For example, if a user complains that a particular web page is failing to load, but it works fine for you, you’ll need to check the logs to investigate their failure. A typical process might be as follows:

  1. Using the user’s email address, look up their ID in the database.
  2. Use that ID to find which server answered that request.
  3. On that server, find the correct log.
  4. Search in the log for the ID, and find the lines processing that request.
  5. The request was handled successfully and sent on to a subsystem for processing; look up which subsystem took the request.
  6. In that subsystem, find the correct log.
  7. The subsystem log only includes a different ID, so convert the main ID to the subsystem ID.
  8. Search the subsystem log for the subsystem ID and find the lines processing that request.
  9. Find the error and start to investigate why it happened.

Sound familiar? This is the problem that traces are designed to solve. The first block of processing, or span, generates the identifier and then includes it in its message to subsequent services. You’ll need to translate from the user or customer you’re interested in and obtain that transaction ID, but with it, you can instantly retrieve the relevant logs from across your system. If you are suffering from these issues, this is the best solution.
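Here is a sketch of the underlying mechanism: the first service generates a trace identifier, logs it, and passes it to each downstream call, so a single search retrieves the whole request. The header name and downstream URL are illustrative, and in practice you would normally use a tracing library such as OpenTelemetry rather than hand-rolling this.

```python
import logging
import uuid

import requests

logging.basicConfig(level=logging.INFO,
                    format="%(asctime)s %(levelname)s trace=%(trace_id)s %(message)s")
logger = logging.getLogger("edge")

DOWNSTREAM_URL = "https://core.example.internal/handle"   # hypothetical next service


def handle_incoming_request(payload):
    """First span: create the trace ID, log with it, and forward it downstream."""
    trace_id = str(uuid.uuid4())
    extra = {"trace_id": trace_id}

    logger.info("Received request", extra=extra)
    response = requests.post(
        DOWNSTREAM_URL,
        json=payload,
        headers={"X-Trace-Id": trace_id},   # downstream services log the same ID
        timeout=5,
    )
    logger.info("Downstream responded with %s", response.status_code, extra=extra)
    return response
```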

If you can’t implement complete tracing, steps 1 to 5 could be handled by a single tool. Enter the user’s email address, and it will convert it to the ID and find all server logs for that user, including the subsystems that handled each request. Similarly, for steps 6 to 9 – you could specify a transaction, and a tool could look up the relevant lines for you. So you need at least two steps – to look up the user in the main system and then look up lines on the subsystem involved. Even without that automation, you can avoid step 7 by using the same ID everywhere. You could avoid step 6 by having one log link to the other.

How many steps does that kind of debugging take on your system, and how could you improve it?

The importance of messages

Different logs are written for different types of users. In particular, the developers will need records that detail the exact working of the code, so they can trace where problems came from. On the other hand, testers and support engineers only need to identify where a problem occurred, and within which system. A simple rule of thumb is that non-developers only need to worry about the messages sent within the system. Give them a log where they can see the messages, and that should be enough to isolate issues.

Logging the messages first lets you see all the inputs and outputs to your system. Whether API commands, web requests, or information from client applications, one of the most important interfaces is where your system interacts with the outside world. A whole host of problems start there, so make sure your logging records exactly what you sent and received.

Split the logging into different levels for the developers and testers. The testers only need to worry about the messages at first, and that will let them track down the source of issues. Did we receive an invalid request? Or was the request valid, but one subsystem failed to handle it? The message between the systems will indicate this.

Real-world example – Displaying messages

The second tool I wrote to help debug messages was for a hardware system for video conferencing. Each call required dozens of messages to be set up. There was a great interface that recorded every message, its title, where it was sent from and to, and when. But even that interface struggled when you had many people simultaneously dialing into a call.

You could download an .xml file with the messages, so I wrote a web page that loaded those files and split up the messages into a table, where each conference participant had their own column. You could see at a glance which message came from each caller, including us, so you could focus on the ones with a problem. Again, it was a straightforward program but it really clarified what was happening.

On the other hand, the developers need to know how a message is handled. What is the state machine for the subsystem, and how far through did the message get? This level of detail is vital but should be hidden away from the testers. As a tester, you only need to trace the problem based on the messages sent. The best testers will get to know some systems they have worked on and may use the developer logs too, but it shouldn’t usually be necessary for a tester to trawl through the developer logs.

It should be possible to refactor an area of code and not change the messages it sends. The message logs should be completely unchanged, even if the developer logs are different. Only the developers need to know about the implementation, so long as the behavior is correct when testers use it. This distinction in responsibilities should also be present in the logs that the different groups need to work with.

Note that for this purpose, timeouts also count as messages. Failing to receive a response causes the next stage of processing to start as if a message had arrived, so timeouts should also be logged at this level.
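
A minimal sketch of that idea, reusing the hypothetical message logger from the earlier example: when no response arrives within the deadline, the timeout is written to the same message log, with the same kind of detail as a real message.

    import logging
    import queue

    message_log = logging.getLogger("myapp.messages")  # same hypothetical logger as above

    def wait_for_response(responses, request_name, peer, timeout_seconds=5.0):
        """Return the response, or None after logging the timeout as a message-level event."""
        try:
            return responses.get(timeout=timeout_seconds)
        except queue.Empty:
            message_log.info("Timeout waiting for response to %s from %s after %.1fs",
                             request_name, peer, timeout_seconds)
            return None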

By recording only the messages and not every function call and state transition, testers can learn what a healthy message sequence looks like and rapidly spot any problems or divergence from it. If a message log has 50 entries, for instance, a tester has some hope of remembering them; if a log runs to 3,000 lines, that won’t be possible. Focusing only on messages lets testers learn the system successfully.

Simple logs also help newcomers to learn faster. This might be a new member of the test or development team or someone from another team whose investigations have led them in this direction. How quickly can they understand what this code is doing? Again, having a simple, short log recording the messages in and out is an excellent first step; then, they can check the full logs when they know the location of the problem.

Having a message log brings many benefits, but only if done well. The following section considers exactly how messages should be logged.

How to log messages

It’s not enough just to log the messages in your system; you have to log them well. Here is a short checklist to make sure they include the key information, with a sketch of a suitable log entry after the list:

  • Are all the messages in and out of the system recorded?
  • Are messages between subsystems recorded, but not messages within subsystems?
  • Does each message include its name, a timestamp, and a clear description?
  • Is the content of each message available?
  • Does each message clearly state where it is sent from and to?
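
To make the checklist concrete, here is a sketch of a helper that writes one structured line per message, covering the name, timestamp, description, content, and both endpoints. The field names and example values are illustrative, not a required schema.

    import json
    import logging
    from datetime import datetime, timezone

    logging.basicConfig(level=logging.INFO)
    message_log = logging.getLogger("myapp.messages")  # hypothetical logger name

    def log_message(name, description, content, sent_from, sent_to):
        """Record one message as a single structured log line."""
        entry = {
            "name": name,
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "description": description,
            "content": content,
            "from": sent_from,
            "to": sent_to,
        }
        message_log.info(json.dumps(entry))

    log_message("CancelTransaction", "Request to cancel an in-progress transaction",
                {"transaction_id": "12345"}, "partner-gateway", "core-controller")

One structured line per message keeps the overview readable while still carrying the full content for when you need it.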

These may seem obvious, but it’s easy to leave gaps. Standard components, such as web servers, should log the messages they receive, but you need to ensure that any custom protocols are also logged. Custom protocols are more likely to have issues and to need that logging, since they’ve had less testing and usage overall.

It’s easy to miss messages such as DNS lookups, which are vital for sending a message but aren’t part of the main flow. Many protocols are text-based, making them easy to record in logs, but binary protocols also need to be recorded, and that means translating them into a readable form. This is more work, but it’s vital to be able to read them too.
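
For example, a binary message can be unpacked into labeled fields before it is logged. The three-field wire format below is entirely made up, standing in for whatever your protocol actually defines.

    import struct

    def describe_binary_message(raw: bytes) -> str:
        """Translate a (hypothetical) binary header into a readable log string."""
        # Assumed format: 2-byte type, 4-byte sequence number, 2-byte payload length,
        # all big-endian. Replace with your real protocol definition.
        msg_type, sequence, length = struct.unpack("!HIH", raw[:8])
        return f"type={msg_type} sequence={sequence} payload_length={length}"

    print(describe_binary_message(b"\x00\x07\x00\x00\x00\x2a\x00\x10"))
    # type=7 sequence=42 payload_length=16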

While external messages and messages between your subsystems should be recorded, internal messages within those subsystems shouldn’t be added to the log. They would add too much information and aren’t necessary at this level. The message log is there to identify which subsystem has a problem; once you have determined that, you can go down to the next level and examine that subsystem’s internal messages.

Each message should have a title so that you can see the entire sequence of requests just by reading their names. You should be able to take that in at a glance, so if message 37 was an error, you don’t have to walk through 36 successful steps to find it. Features such as these transform the workflow of everyone using the logs.

While it should be possible to see just the names of the messages, once you have identified which one is the problem, you need to see its entire contents and those of the messages before it, which may have triggered the error. Summaries or headers are no good at this point; you need to see the value of every field. Each message should also have a timestamp synchronized across the system. With the Network Time Protocol (NTP) in widespread use, the days of badly mismatched clocks are largely behind us.

The wording for each message should be clear. Ideally, an engineer who has never read this log line before should be able to tell what is happening. That may sound like a low threshold, but logs are sometimes more like notes that are highly dependent on context.

For instance, in one system I used, when a server received a request, it produced a log line saying, “Dispatched message ABC…”. A message arriving caused a log line saying it had been sent! It wasn’t completely wrong – the request arrived from a third-party system and was sent on to an internal system for processing. However, I was much more interested in the external message flow than the internal one. You need to raise issues like these with the development team, or train testers so that they can fix them themselves.

It may seem like a small change, but making your log lines understandable will help all your future debugging.

Finally, each message should record where it came from and where it was sent. This sounds like a simple requirement, but it isn’t always met. Being able to deduce the direction from context isn’t enough: maybe some message types are always sent in a particular direction, but others might not be, and stating this information explicitly helps people learn and remember what is going on. For external messages, you might log IP addresses, resolved to names with a reverse DNS lookup where possible. Messages between internal subsystems should always name the subsystems involved.
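
A reverse lookup is only a few lines in most languages; here is one way it might look, falling back to the raw address when the lookup fails (the address shown is from a documentation range, not a real host).

    import socket

    def name_for_address(ip_address: str) -> str:
        """Return 'hostname (ip)' where a reverse DNS entry exists, else the bare IP."""
        try:
            hostname, _, _ = socket.gethostbyaddr(ip_address)
            return f"{hostname} ({ip_address})"
        except OSError:
            return ip_address

    print(name_for_address("192.0.2.10"))  # documentation range, so this prints the bare IP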

Real-world example – Subsystem in the middle

The architecture of one system I tested involved a core controller, a subsystem for sending and receiving messages, and then interactions with third parties. Unfortunately, its logging was hard to use, and it didn’t clearly record the sources and destinations of messages.

When debugging one issue, we could see that the messaging subsystem had received a command to cancel a transaction. But had it arrived from outside, or had it come from our own controller? Missing such a basic piece of information meant we couldn’t tell whether the problem was ours or the third party’s, and we couldn’t debug any further.

The Goldilocks zone of logging

You will need to tune the level of logging from your system to get it exactly right since it can be tough to predict in advance. Some logs are too noisy, producing millions of useless lines; others omit vital information, which you only notice when you can’t debug a critical issue. That delays fixes until you can release extra logging and wait for the problem to happen again. Both too much and too little logging are problems you need to fix, and the Goldilocks zone of sufficient logging lies somewhere in the middle.

The massive benefit of tuning the level of logging from your system is that users of that system can learn precisely what a successful sequence should look like. If there are tens of lines, you have a hope of remembering them all so that you can spot absences or anomalies. If there are hundreds of lines written for a single action, that becomes impossible.

Logging is such an essential part of your job as a tester that it must be written well. If you can’t change the logs yourself, go on whatever courses you need and set up mentorships so that you can in the future. Changing log lines is simple and safe to do in the code, but it can transform the experience of debugging an issue. Combining that motivation with the ability to make changes empowers those who care: testers are the biggest users of the logs, so you should be able to change them as needed.

Log lines should each be a readable account of a single action. I’ve seen log lines that simply stated OK! or User added. But what was okay? Which user was added? In structured logging, metadata sometimes provides more context, but the text itself should be comprehensible. The goal is for someone new to that area of code to be able to read the log and understand what has happened without needing extra details.

Any numeric fields should be labeled in the logs. While the developer might know what they all mean, newcomers won’t. I’ve seen log lines with the following form:

Message received. 4: 243, 7: 32, 14: 23833

Numeric field identifiers like these are only useful to developers who have worked on the code for so long that they’ve memorized them. Naming the fields is another small change that can transform the experience of using the logs, as you no longer have to look up what each field means.
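
The fix is to name every field at the point where the line is written. Only the original developers know what fields 4, 7, and 14 actually meant, so the names below are invented purely for illustration.

    import logging

    logging.basicConfig(level=logging.INFO)
    message_log = logging.getLogger("myapp.messages")  # hypothetical logger name

    def log_received(retries: int, queue_depth: int, payload_bytes: int) -> None:
        # Named fields instead of bare numbers; these names are hypothetical examples.
        message_log.info("Message received. retries=%d queue_depth=%d payload_bytes=%d",
                         retries, queue_depth, payload_bytes)

    log_received(243, 32, 23833)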

Examine the logging for each new feature, checking that it is neither spammy nor missing information and that each line is an understandable sentence. If possible, fix these issues yourself.

Logging usability

The interface you use to read logs is also important to get right. The ideal is an interface that is aware that these entries are messages and that displays only their names, expanding them where necessary. This lets you quickly see an overview of what happened and then examine the critical sections in more detail.

Visibility is vital. Seeing an overview of messages at a glance lets you instantly spot oddities. Maybe you were investigating a problem with a particular hypothesis in mind; if it takes effort to look through the logs, you will only realize you are wrong after you’ve spent the time checking that idea. Easily seeing summaries of events, especially the messages in and out, gives you the chance to notice anomalies and find other leads. If accessing the logs requires work, you are less likely to see problems.

Logs can be displayed in various ways:

  • Network traces with no formatting
  • Network traces formatted by a plugin
  • Text logs
  • Formatted logs
  • Dedicated tools

While network traces will show you the messages being sent, a raw trace requires significant effort to use. As well as capturing it on the correct interface, you need to filter the key messages from all the other traffic. Each message then needs to be decoded to understand its content; if the fields are text, some may be readable, but you’ll still need to know the meaning of each one. Crucially, a raw trace won’t show you the names of the messages, so you can’t use it to quickly narrow down the failure points.

One way to improve that is to write a plugin. If network captures are your best tool, make sure that you at least have a plugin to make them readable. When properly filtered and decoded, network traces can help track messages in and out of your system. However, they still need to be captured in the right place and may not cover messages between internal subsystems. It’s much better to log higher up in the application and to use network captures only as a last resort.

Most simply, applications write out text logs of their actions. This is better than nothing, but logs are most valuable when they can be filtered and searched. For standard formats, such as web server logs, there are numerous tools to read and process them. For more custom logs, you need to do extra work to make them easily searchable, and plain text logs can’t provide a summary of just the message names, leaving you to scan through all their contents. Structured logging fixes both of those problems, so if you are still using simple text logs, that is an important feature request.

When logs are structured, for instance as JSON or XML, tools can filter on key fields such as time and severity. For more customized handling, such as distinguishing log lines that name messages from those describing internal processing, you need tools that read the logs in customized ways. Dedicated tools for viewing structured logs are the best choice; these might be custom-made within your company, or one of the numerous commercial options can be configured for your logs. With this in place, you can easily see the messages.
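
As a sketch of that kind of customized handling, assuming the JSON-per-line format from the earlier example: a few lines of code can print just the message names for an overview, then dump the full content of a single message once you know which one to examine.

    import json

    def summarize(log_path: str) -> None:
        """Print one line per message: its position, timestamp, and name."""
        with open(log_path) as log_file:
            for index, line in enumerate(log_file, start=1):
                try:
                    entry = json.loads(line)
                except json.JSONDecodeError:
                    continue  # skip non-JSON lines, such as developer-level detail
                print(f"{index:4d} {entry.get('timestamp', '')} {entry.get('name', '?')}")

    def expand(log_path: str, wanted_index: int) -> None:
        """Print the full content of one message once it has been identified."""
        with open(log_path) as log_file:
            for index, line in enumerate(log_file, start=1):
                if index == wanted_index:
                    print(json.dumps(json.loads(line), indent=2))
                    return

A commercial log viewer will do the same job with more polish, but even this level of tooling gives you the names-first overview described above.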

This is all very well, I hear you say. It would be lovely to have logs that clearly presented all and only the messages between the different systems in our product, but we don’t have them. Our logs are a mixture of messages and implementation, scattered across various servers and accessed by hand.

This section showed what you should aim for. If your logging isn’t like this today, plan to improve it by doing a little work alongside each release. Start with the most pressing issues – maybe your log lines are incomprehensible and need to be rewritten, or perhaps you need a new tool to view the logs. If you are adding a new feature, make sure its log lines are clear and understandable. All these investments pay back when investigating future issues.

Logging overview

Logs are a major part of your role as a tester, so pay close attention to getting them right. Check them as you use a new feature, even if everything is working and you don’t need their output, to see how the code reports its activity. You need to check that the logs are written with the right details, are easy to find, and contain sufficient information to debug issues. There should be black-box-level logging that records the messages sent between systems for testers to use, and white-box-level logging detailing the code’s behavior for developers. This will need constant tuning as you see how it behaves in practice. With high-quality information displayed in easy-to-use tools, you have the best view of your application’s behavior, ready to understand any surprises and fix issues.

Summary

In this chapter, we’ve seen the importance of testing maintainability: maintenance operations, system monitoring, and logging. None of these functions are customer-facing, but each has a massive impact on the engineering team working on the product and, through them, the improvements they can deliver to customers. How fast can your company fix bugs, recover from outages, and roll out changes?

Those things affect customers, and they need maintainable code. That doesn’t happen on its own; if you don’t check that the requirements for maintainability are written and implemented, it’s possible no one else will. This piece of testing is in your own best interest.

In the next chapter, we’ll consider destructive testing, in which you deliberately degrade your system to see how it copes with internal issues. Recovering from such errors is vital but hard to plan for, making this a critical area to test.

What we learned from Part 2, Functional Testing

Part 2, Functional Testing, has considered the behavior of your application in a wide range of situations, from security to usability to core application functions. These are all functional tests: when you do X, your application does Y. In that simple view, you check if your program's output is correct for all the relevant inputs.

That may sound exhaustive, but there are still types of testing we have not covered yet. In Part 3, Non-Functional Testing, we will consider destructive testing, in which you deliberately disable part of the working system to check its resilience and ability to recover. Load testing ensures that your application performs consistently within its specified limits, with no unexpected latency or intermittent errors.

Finally, stress testing checks what happens when your system is pushed beyond its limits, including its ability to protect its core functions even when given unreasonable workloads. Some of the most serious bugs can only be found with these stringent tests, so they are also vitally important.
