7. Testing

image

Chapter opener graphic: Tester and Test Manager (the roles addressed in this chapter)

“Lesson 1: You are the headlights of the project.

A project is like a road trip. Some projects are simple and routine, like driving to the store in broad daylight. But most projects worth doing are more like driving a truck off-road, in the mountains, at night. Those projects need headlights. As the tester, you light the way. You illuminate the road ahead so that the programmers and managers, however they bicker over the map, can at least see where they are, what they’re about to run over, and how close they are to the cliff. The detailed mission of the testing group varies from company to company. Behind those details, though, is a common factor. Testing is done to find information. Critical decisions about the project or the product are made on the basis of that information.”1

—Cem Kaner, James Bach, and Bret Pettichord,
Lessons Learned in Software Testing

image

Figure 7.1 Thinking of testing as the headlights of a project can keep your activities focused on providing useful information.

For any professional software application, there are an infinite number of possible tests. Accordingly, most software development and IT organizations devote a significant amount of their budget to testing, whether done in-house or outsourced. Yet surprisingly, most test managers and project managers express much frustration at their inability to answer fairly basic questions about the effectiveness of their testing.

As in earlier chapters, I start with a couple of pages on the value-up tenets of the discipline, in this case testing. The rest of the chapter drills into those frustrating basic questions and illustrates how VSTS helps you answer them.

A Value-Up View of Testing

Probably no discipline acts as more of a lightning rod for confused discussions of software process paradigms than testing. A great many books discuss testing in isolation and assume a work-down paradigm.2 The negative effect of these works has been so great that in the early days of the Agile movement, it was unclear whether there was a role for testing at all, because developers were responsible for their own unit testing.3 Lots of confusion remains.

The first source of confusion is that testing activities have two purposes:

To support development activities. In MSF, these tests belong to the development role, and I described them in the previous chapter. I’m not going to repeat them here.

To assess customer value. In MSF, this is the responsibility of testers, and it is the subject of this chapter.

The next source of confusion concerns the appropriate output of testing activity. Before continuing to describe value-up testing, indulge me in a small exercise that brings some of this confusion to light.

What Makes a Good Tester?

Consider this exercise.4 Caliban and Ariel are testers on a project. They work independently. It is close to the release date. Their project has five subsystems, all equally important, and all bugs that they might find are of equal priority. (Of course, this oversimplification never happens, but for the exercise, please suspend disbelief.)

Prospero, their project manager, is reviewing the state of the project. He sees that Caliban reported 100 bugs, and Ariel reported 74. Then, Prospero decides to look at the breakdown of their bug reports (see Table 7.1).

Table 7.1 Caliban’s and Ariel’s Found Bug Counts

image

Here’s what he sees: Caliban has found 100 bugs in one component and none in the others. Ariel has found some in all components. Prospero asks the two testers to explain their work.

Caliban: I have tested Subsystem 1 very thoroughly, and we believe we’ve found almost all of the priority 1 bugs. Unfortunately, I didn’t have time to spend on the remaining four subsystems.

Ariel: I’ve tested all subsystems in moderate depth. Subsystem 1 is still very buggy. The other subsystems are about 1/10th as buggy, though I’m sure bugs remain.

Now consider whose information is more useful. Many professional testers believe that the only role of testing is to find bugs, and that the more bugs found, regardless of context, the better the testing. Where organizations ferociously track bug rate curves, this assumption can be reinforced. Caliban behaves perfectly for such a situation. He focused on the buggiest component and reported the highest number of bugs. On the other hand, he only reported 20 percent of the information that Ariel reported.

Prospero, however, must decide whether it would be better to hold on to the solution for more work or to ship it now. If more work is needed, where should the investment be made? Prospero can make much more informed decisions from Ariel’s work than from Caliban’s, even though she found far fewer bugs. Indeed, Prospero might decide based on Ariel’s investigation to scrap Subsystem 1 and replace it. In that case, 20 percent of her work will need to be redone, whereas all of Caliban’s work will be thrown away.

Ariel also follows some useful tenets that are captured in three great value-up heuristics:5

1. Important problems fast. Testing should be optimized to find important problems fast rather than attempting to find all problems with equal urgency.

2. Focus on risk. Test strategy should focus most effort on areas of potential technical risk while still putting some effort into low-risk areas just in case the risk analysis is wrong.

3. Maximize diversity. Test strategy should be diversified in terms of test techniques and perspectives. Methods of evaluating test coverage should take into account multiple dimensions of coverage, including structural, functional, data, platform, operations, and requirements.

In a value-up view of testing, Ariel’s work is far more useful. Ariel and Caliban optimize for different values—information versus bug counts—because they are working toward different missions. (Of course, in a work-down paradigm, Caliban might be considered more productive.) As the opening quote of this chapter puts it, Ariel’s informational approach acts as “the headlights of the project.”

Basic Questions

What is the information you need to assess customer value? It comes down to some pretty basic questions:

• Are we delivering the customer value?

• Are the qualities of service, such as performance and security, fit for use?

• Have we tested the changes?

• What haven’t we tested?

• Does it work in production as well as in the lab?

• Are we testing enough?

• When should we be running the tests?

• Which tests should be automated?

• How efficient is our team, or our outsourced team?

These are fundamental questions that members of a well-run project should be able to answer, and they are a good way to think of the testing that you do. In the rest of this chapter, I’ll show how VSTS helps with the answers.

Are We Delivering the Customer Value?

As discussed in Chapter 3, “Requirements,” scenarios are a primary statement of customer value. Good scenario testing requires putting yourself in the role of customer advocate. This means testing well beyond the stated scenario requirements. Cem Kaner has a good checklist to put you in the right mindset:6

Designing scenario tests is much like doing a requirements analysis, but is not requirements analysis. They rely on similar information but use it differently.

• The requirements analyst tries to foster agreement about the system to be built. The tester exploits disagreements to predict problems with the system.

• The tester doesn’t have to reach conclusions or make recommendations about how the solution should work. Her task is to expose credible concerns to the stakeholders.

• The tester doesn’t have to make the solution design tradeoffs. She exposes the consequences of those tradeoffs, especially unanticipated or more serious consequences than expected.

• The tester doesn’t have to respect prior agreements. (Caution: testers who belabor the wrong issues lose credibility.)

• The scenario tester’s work need not be exhaustive, just useful.

Scenario tests need to represent the primary business flows that you expect to cover 80% of the software’s usage. Typically, they use broad data sets, representing a realistic mix of business cases that you expect your solution to handle.

As much as testing should cover identified requirements, such as known scenarios, good testing also discovers new ones. A value-up tenet is that your knowledge grows throughout the project, often emerging from use and discovery. Do not hesitate to recognize and capture the new scenarios that you find in planning and executing tests.

Scenario tests can be manual or automated. If you have experienced testers who can play customer advocates, you may want to document the manual tests only to the extent of customer goals and appropriate data. In this case, you do not need to enumerate manual tests step by step, but can provide a more general test ideas list. On the other hand, if you delegate manual test execution to people who do not understand the customer well, you may need to document the steps in detail (see Figure 7.2).

image

Figure 7.2 In VSTS, you can describe manual tests in documents like this. The test results are captured, tracked, and fed to the warehouse in the same manner as automated tests.

Automated Scenario Tests

The primary way to automate scenario tests in VSTS is as web tests (see Figure 7.3). These test any application that presents itself through a web browser. Whether you create web tests by recording or programming, you can enhance and maintain them in Visual Basic or C#.

image

Figure 7.3 When you add a web test in VSTS, you drive the scenario as a user would—through an instrumented web browser. This captures the interaction not just at the GUI level but also at the HTTP protocol level and produces a parameterized test.

Although web tests are typically created by recording, they are not dependent on the browser UI for running, because they exercise the software under test where it counts—at the server level. During playback, you can see both the browser interaction and the HTTP or HTTPS traffic (see Figure 7.4).

image

Figure 7.4 The playback of the test shows you both what was rendered in the browser and what happened in the HTTP stream so that you can watch the full server interaction, including the invisible parts of the traffic.

Using Test Data

Varying test data to represent a mix of realistic inputs is an important part of scenario testing (see Figure 7.5). Seeing a test pass for one data set does not mean it will pass for all, especially when you have paid careful attention to designing your equivalence classes. Accordingly, you can have your web tests access external test data from any OLEDB data source, including .csv files, Excel, Access, and SQL Server databases.

image

Figure 7.5 In almost all cases, you should vary the data used for testing either to cover different combinations based on different equivalence classes or to apply unique values for each transaction present in a multiuser workload.
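For recorded web tests, the data source binding is declared in the web test editor rather than in code. For the unit tests that VSTS can run alongside them, the same idea is available through the MSTest DataSource attribute. Here is a minimal sketch, assuming a hypothetical orders.csv file with an Amount column and a hypothetical OrderProcessor class under test; the connection string shown uses the Jet OLEDB text driver, one of several OLEDB options.

using Microsoft.VisualStudio.TestTools.UnitTesting;

[TestClass]
public class OrderScenarioTests
{
    private TestContext testContextInstance;

    // MSTest sets this property and exposes the current data row through it.
    public TestContext TestContext
    {
        get { return testContextInstance; }
        set { testContextInstance = value; }
    }

    // Bind the test to an external .csv file; the test method runs once per
    // data row. The folder, file, and column names are hypothetical.
    [DataSource("System.Data.OleDb",
        @"Provider=Microsoft.Jet.OLEDB.4.0;Data Source=C:\TestData;Extended Properties='text;HDR=Yes;FMT=Delimited'",
        "orders#csv",
        DataAccessMethod.Sequential)]
    [TestMethod]
    public void SubmitOrder_AcceptsEachDataRow()
    {
        decimal amount = decimal.Parse(TestContext.DataRow["Amount"].ToString());

        OrderProcessor processor = new OrderProcessor();   // hypothetical system under test
        OrderResult result = processor.Submit(amount);     // hypothetical API

        Assert.IsTrue(result.Accepted, "Order should be accepted for amount " + amount);
    }
}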

Because web tests are easy to create, they are also easy to recreate when a scenario changes.

Insulating Your Tests from UI Changes

There’s an old joke about Mr. Smith kneeling on a deserted sidewalk at night under a street lamp. A policeman comes by and asks if there’s a problem.

Mr. Smith: I lost my car keys.
Policeman: Where did you lose them?
Mr. Smith: Around the corner in the parking lot, near my car.
Policeman: Then why are you looking here?
Mr. Smith: Because it’s dark there and the light’s so much better over here!

Most software is tested through the user interface, and the “light” is indeed brightest there. However, in modern distributed architectures, most of the program’s logic is actually implemented on a server, and the user interface is a thin veneer on top of that logic. Accordingly, most of the potential failures are on the server side. The UI is also likely to change considerably based on usability testing and other customer feedback. You don’t want those changes to break the tests, and still less do you want test maintenance cost to become a reason not to improve the UI.

If you limit your tests to user interface tests, you will almost certainly have to rewrite them over the course of the project as the UI changes. Fortunately, VSTS web tests exercise server-side APIs (see Figure 7.6). They can be turned into C# or Visual Basic programs that enable you to maintain the tests independently of UI changes. Of course, the test data that they use is already separately bound and maintained.

image

Figure 7.6 An alternate view of the test shown in Figure 7.3 as generated code. Note that the test code is really driving a set of server interactions, not clicking in the user interface.
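To give a flavor of what that generated code drives, here is a minimal hand-written sketch of a coded web test. It assumes the Microsoft.VisualStudio.TestTools.WebTesting namespace and a hypothetical shopping site; the code that VSTS actually generates from a recording is more elaborate, but it has the same shape: a class deriving from WebTest that yields one WebTestRequest per server interaction, with no dependency on the browser UI.

using System.Collections.Generic;
using Microsoft.VisualStudio.TestTools.WebTesting;

public class PlaceOrderWebTest : WebTest   // scenario: browse the catalog, then post an order
{
    public PlaceOrderWebTest()
    {
        this.PreAuthenticate = true;
    }

    public override IEnumerator<WebTestRequest> GetRequestEnumerator()
    {
        // First server interaction: request the catalog page (hypothetical URL).
        WebTestRequest browse = new WebTestRequest("http://testserver/catalog.aspx");
        yield return browse;

        // Second interaction: post the order form. The form fields, not the UI
        // controls, are what travels over HTTP, so UI changes do not break this.
        WebTestRequest order = new WebTestRequest("http://testserver/order.aspx");
        order.Method = "POST";
        FormPostHttpBody body = new FormPostHttpBody();
        body.FormPostParameters.Add("ItemId", "1234");
        body.FormPostParameters.Add("Quantity", "2");
        order.Body = body;
        yield return order;
    }
}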

Are the Qualities of Service Fit for Use?

As discussed in Chapter 3, qualities of service (QoS) are captured in the work item database. Tests specialize according to the different qualities of service. For example, load tests, configuration tests, security tests, and usability tests are all radically different.

Load Testing

Load testing aims to answer two primary questions:

1. Does the software respond appropriately under expected load conditions? To answer this, you compose performance tests that combine reasonable scenario tests, data, and workloads.

2. Under what stress conditions does the software stop responding well? For this, you take the same scenarios and data and crank up the workload progressively, watching the corresponding effect on performance and system indicators.

All the automated tests managed by VSTS—web tests, unit tests, and any additional test types you create—can be used for load testing (see Figures 7.7 through 7.10). With VSTS, you can model the workload to represent a realistic mix of users, each running different tests. Finally, VSTS automatically collects diagnostic data from the servers under test to highlight problems for you.

image

Figure 7.7 In VSTS, a load test is a container for any arbitrary set of tests with workload settings. First, you choose how to ramp the load. Often you want to observe the system with gradually increasing user load so that you can spot any “hockey stick” effect in the response time as the user load increases.

image

Figure 7.8 Next, you choose the tests (unit, web, or other) and the percentage of load to create from each of the atomic tests.

image

Figure 7.9 The next steps are to choose the browser and network mixes that best reflect your end-user population.

image

Figure 7.10 Load tests can generate huge amounts of data from the servers under test, and it’s often hard to know what’s relevant. VSTS simplifies this by asking you to choose only which services to watch on which machines and automating the rest of the decisions.

Understanding the Output

While a load test runs, and after it completes, you need to look at two levels of data (see Figure 7.11). Average response time shows you the end-to-end response time for a page to finish loading, exactly as a user would experience it. That’s straightforward, and you can assess whether the range is within acceptable limits. At the same time, while the test runs, all the relevant performance data is collected from the chosen servers, and these counters give you clues as to where the bottlenecks are in the running system.

image

Figure 7.11 This graph shows two kinds of data together. Average Response Time is the page load time as a user would experience it. Requests/Sec is a measurement of the server under test, indicating a cause of the slowdown. Note additionally the warning and error icons that flag problems among the tree of counters in the upper left. Some of these may lead you to configuration problems that can be tuned in the server settings; others may point to application errors that need to be fixed in code.

Diagnosing

When a load test points to a likely application performance problem, the developer of the suspect code is usually the best person to diagnose the problem. As a tester, you can attach the test result to a bug directly to forward it to an appropriate teammate, and when your teammate opens the bug, the same graphs will be available for viewing. He or she can then use the Performance Wizard to instrument the application and rerun the test that you ran, as shown in Figure 7.12.

image

Figure 7.12 In addition to the information offered by the perfmon counters, you can rerun the test with profiling (or attach the test result to a bug and have a colleague open it and rerun with profiling). This takes you from the system view to the code view of the application and lets you drill into the specific methods and call sequences that may be involved during the slowdown.

The profiling report can rank the actual suspect functions and lead you straight to the code that may need optimizing. This sequence of load testing to profiling is a very efficient way to determine how to tune an application. You can use it in any iteration as soon as enough of the system is available to drive under load.

Security Testing

Security testing is a specialized type of negative testing. In security testing, you are trying to prove that the software under test is vulnerable to attack in ways that it should not be. The essence of security testing is to use a fault model, based on vulnerabilities observed on other systems, and a series of attacks to exploit the vulnerabilities.

There are many published attack patterns that can identify the vast majority of vulnerabilities.7 Many companies provide penetration testing services, and many community tools are available to facilitate security testing. You can drive the tools from VSTS test suites, but they are not delivered as part of the VSTS product itself.
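Even without specialized tools, individual entries in your fault model can be pinned down as ordinary automated tests. The sketch below assumes a hypothetical AccountRepository class backed by a database; it probes one classic attack pattern, SQL injection through a search field, and belongs in the regression suite so the vulnerability cannot quietly reappear.

using System.Collections.Generic;
using Microsoft.VisualStudio.TestTools.UnitTesting;

[TestClass]
public class InjectionAttackTests
{
    [TestMethod]
    public void FindByName_DoesNotWidenQueryForInjectionInput()
    {
        // Classic injection payload: if the underlying query is built by string
        // concatenation, this input would match every row in the table.
        string attack = "' OR '1'='1";

        AccountRepository repository = new AccountRepository();   // hypothetical system under test
        IList<Account> results = repository.FindByName(attack);   // hypothetical API

        // The attack string is not a legitimate account name, so a correctly
        // parameterized query must return no rows at all.
        Assert.AreEqual(0, results.Count, "Injection input must not widen the query.");
    }
}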

Usability Testing

In Chapter 3, I discussed usability testing as part of the requirements process, so I won’t repeat it here. It is equally relevant, of course, throughout the lifecycle.

Have We Tested the Changes?

Throughout the lifecycle, the application changes. Regressions are bugs in the software under test that did not appear in previous versions. Regression testing is the term for testing a new version of the software to find those bugs. Almost all types of tests can be used as regression tests, but in keeping with the tenet of “Important problems fast,” your regression testing strategy must be very efficient.

Ideally, you should test the most recent changes first. Not only does this mitigate the risk of unforeseen side effects of the changes, but also, if you do find and report bugs, the recent changes are still fresh in everyone’s memory.

One of the challenges in most test teams is identifying what exactly the changes are. Fortunately, the daily build report shows you exactly what changesets have made it into the build and what work items (scenarios, QoS, tasks, and bugs) have been resolved, thereby identifying the functionality that should be tested first (see Figure 7.13). Moreover, if you have reasonable build verification tests (BVTs), then you can check their results and code coverage.

image

Figure 7.13 One of the many purposes of the daily build report is to focus testing activity on the newly built functionality. This list of work items resolved in the build is like an automatic release note, showing what new functionality needs to be tested.

What Haven’t We Tested?

Identifying gaps in testing involves looking at the test results from multiple dimensions:

• Requirements. Have the tests demonstrated that the intended functionality was implemented?

• Code. What code has been exercised during testing?

• Risks. What blind spots might we need to guard against? What events do we need to be prepared for?

Requirements

One view of coverage in VSTS is the results of tests against requirements such as scenarios or QoS (see Figure 7.14). Tests and their results can be tracked to these specific work items.

image

Figure 7.14 One way of measuring test coverage is by scenario or other requirement, based on the useful discipline of clearly identifying the scenario that a test tests. The aggregation, by requirement, shows how many requirements are in which state of testing.

Code

A second view of coverage is against source code (see Figure 7.15). This is just a second, orthogonal measure that can be taken from the same test runs that demonstrate the testing against requirements shown in Figure 7.14.

image

Figure 7.15 Code coverage is the second main meaning of “coverage.” Here you see the test results viewer expanded to show the coverage statistics against the source code.

From this view, you can also paint the source code under test to see exactly which lines were and were not exercised, as shown previously in Figure 6.9.

As discussed in Chapter 4, “Project Management,” use these metrics descriptively, not prescriptively. Coverage metrics against code and requirements are very useful, but they are only two dimensions. Do not let coverage statistics lull you into a false sense of confidence. “100% coverage” merely means that this metric does not reveal any more gaps. It does not tell you anything about the extent to which you have tested conditions for which no code or no requirements have been written. Indeed, the fastest way to increase a code coverage measure is usually to remove error-handling source code, which is probably the last behavior you want to encourage.

Assume that there are gaps in other dimensions that these coverage measures do not reveal.

Risks

In Chapter 2, “Value-Up Processes,” I described the MSF views of constituency-based and event-driven risk management. Both sets of risk need to be considered for testing.

Most risk testing is negative testing, that is, “tests aimed at showing that the software does not work.”8 These tests attempt to do things that should not be possible to do, such as spending money beyond a credit limit, revealing someone else’s credit card number, or raising an airplane’s landing gear before takeoff.
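Negative tests like these are straightforward to automate once the expected refusal behavior is specified. A minimal sketch of the credit-limit case, assuming a hypothetical CreditAccount class that is specified to throw InvalidOperationException when a charge would exceed its limit:

using System;
using Microsoft.VisualStudio.TestTools.UnitTesting;

[TestClass]
public class CreditLimitTests
{
    // The test passes only if the expected exception is thrown; silently
    // accepting the charge is exactly the failure this negative test probes for.
    [TestMethod]
    [ExpectedException(typeof(InvalidOperationException))]
    public void Charge_BeyondCreditLimit_IsRefused()
    {
        CreditAccount account = new CreditAccount(500m);   // hypothetical class with a 500.00 limit
        account.Charge(499m);   // within the limit: allowed
        account.Charge(2m);     // pushes the balance past the limit: must throw
    }
}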

Note that coverage testing does not provide any clue about the amount of negative testing that has been done, and requirements-based coverage helps only to the extent that QoS requirements capture error prevention, which they usually do at much too cursory a level. In testing for risks, you are typically looking for errors of omission, such as an unwritten error handler (no code to cover) or an implicit (and hence untraceable) requirement.

To design effective negative tests, you need a good idea of what could go wrong. This is sometimes called a “fault model.” You can derive your fault model from any number of knowledge sources. Example sources of a fault model illustrating constituency-based knowledge are listed in Table 7.2.

Table 7.2 Typical Sources and Examples for a Fault Model

image

VSTS lets you capture these potential faults in the work item database as risks. Typically, you would start during early test planning, and you would review and update the risk list in planning every iteration and probably more frequently. The same traceability that tracks test cases to scenario work items enables you to trace tests to risk work items so that you can report on testing against risks in the same way (see Figures 7.16 and 7.17).

image

Figure 7.16 Risks are captured as work items so that they can be managed in the same backlog, tracked to test cases, and reported in the same way as other work item types.

image

Figure 7.17 Because risks are a type of work item, you can measure test coverage against risks in a manner similar to the coverage against scenarios.

Does It Work in Production as Well as in the Lab?

Have you ever filed a bug and heard the response, “But it works on my machine”? Or have you ever heard the datacenter complain about the cost of staging because the software as tested never works without reconfiguration for the production environment?

These are symptoms of inadequate configuration testing. Configuration testing is critical in three cases:

1. Datacenters lock down their servers with very specific settings and often have a defined number of managed configurations. It’s essential that the settings of the test environment match the datacenter environment in all applicable ways.

2. Software vendors and other organizations that cannot precisely control customers’ configurations need to be able to validate their software across the breadth of configurations that will actually be used.

3. Software that is used internationally will encounter different operating system settings in different countries, including different character sets, hardware, and input methods, which will require specific testing.

Fortunately, VSTS supports explicit configuration testing in two ways: by enabling you to set up test labs with virtual machines and by explicitly tracking test run configurations and recording all test results against them.

Test Lab Setup

When you have combinations to test, cycling test lab machines among them can be a huge drain on time. Normally, you must clean each machine after a previous installation by restoring it to the base operating system, install the components, and then configure them. If you are rotating many configurations, this preparation time can dwarf the actual time available for testing.

An alternative is to set up the configurations on “virtual machines” using Microsoft Virtual Server, included in VSTS (see Figure 7.18). Rather than installing and configuring physical machines, you install and configure a virtual machine. When the virtual machine is running, it appears to the software and network to be identical to a physical machine, but you can save the entire machine image as a disk file and reload it on command.

image

Figure 7.18 Your solution may need to run in different target environments. These might be different localized versions of the OS, different versions of supporting components, such as databases and web servers, or different configurations of your solution. Virtual machines are a low-overhead way of capturing the environments in software so that you can run tests in a self-contained image for the specified configuration.

Setting up a library of virtual machines means that you will go through the setup and configuration time once, not with every test cycle.

Reporting

The other major issue with testing configurations is tracking and reporting what has been tested so that you can identify gaps in configuration coverage and prioritize your next testing appropriately. Fortunately, VSTS tracks the configuration used on every test run (see Figure 7.19). Reports make it easy to track the configurations that have been used and those that lack good test coverage.

image

Figure 7.19 Run configurations can capture the representative target environments of the systems under test. The metrics warehouse accumulates test results by configuration so that you can build a picture over time of the test coverage against configurations.

It’s usually a good idea to vary the configurations with every round of testing so that you cycle through the different configurations as a matter of course. Because the test results are always tracked against the test run configuration, you will also have the information for reproducing any results, and you will improve your coverage of configurations this way.

Are We Testing Enough?

Defining “Good Enough”

In Chapter 3, I presented Kano Analysis as a technique for thinking holistically in terms of satisfiers and dissatisfiers for a project, and in Chapter 4, I discussed planning an iteration. The iteration’s objectives should determine the test objectives.

Although it may be hard to determine what constitutes “good enough” testing for a project as a whole, it’s not that hard for the iteration. The whole team should know what “good enough” means for the current iteration, and that should be captured in the scenarios and QoS planned for implementation and the risks to be addressed in the iteration. Testing should verify the accomplishment of that target through the course of the iteration.9

A value-up practice for planning “good enough” is to keep the bug backlog as close to zero as possible. Every time you defer resolving or closing a bug, you impose additional future liability on the project for three reasons: The bug itself will have to be handled multiple times, someone (usually a developer) will have a longer lag before returning to the code for analysis and fixing, and you’ll create a “Broken Windows” effect. The Broken Windows theory holds that in neighborhoods where small details, such as broken windows, go unaddressed, other acts of crime are more likely to be ignored. Cem Kaner, a software testing professor and former public prosecutor, describes this well:10

The challenge with graffiti and broken windows is that they identify a community standard. If the community can’t even keep itself moderately clean, then: (1) Problems like these are not worth reporting, and so citizens will stop reporting them. (We also see the converse of this, as a well-established phenomenon. In communities that start actually prosecuting domestic violence or rape, the reported incidence of these crimes rises substantially—presumably, the visible enforcement causes a higher probability of a report of a crime, rather than more crime). In software, many bugs are kept off the lists as not worth reporting. (2) People will be less likely to clean these bugs up on their own because their small effort won’t make much of a difference. (3) Some people will feel it is acceptable (socially tolerated in this community) to commit more graffiti or to break more windows. (4) Many people will feel that if these are tolerated, there probably isn’t much bandwidth available to enforce laws against more serious street crimes.

Similarly, in projects with large bug backlogs, overall attention to quality issues may decline. This is one of many reasons to keep the bug backlog as close to zero as possible.

Exploratory Testing

Most testing I’ve discussed so far is either automated or highly scripted manual testing. These are good for finding the things that you know to look for but weak for finding bugs or issues where you don’t know to look. Exploratory testing, also called ad hoc testing, is an important mindset to bring to all of the testing that you do. In exploratory testing, the tester assumes the persona of the user and exercises the software as that persona would. Kaner, Bach, and Pettichord describe exploratory testing this way:

By exploration, we mean purposeful wandering; navigating through a space with a general mission, but without a prescripted route. Exploration involves continuous learning and experimenting. There’s a lot of backtracking, repetition, and other processes that look like waste to the untrained eye.11

Exploratory testing can be a very important source of discovery, not just of bugs but also of unforeseen (or not yet described) scenarios and QoS requirements. Capture these in the backlog of the work item database so that you can use them in planning the current and future iterations. As a manager, plan for a certain level of exploratory testing in every iteration. Define charters for these testing sessions according to the goals of the iteration. Tune the charters and the resource level according to the value you get from these sessions. In short, plan capacity for exploratory testing.

Testing as Discovery

Embrace testing that discovers new scenarios, QoS requirements, and risks in addition of course to finding bugs. Capture the new scenarios, QoS, and risks as work items in the product backlog. This is vital information. It makes the quantitative coverage measurement a little harder in that you’re increasing the denominator, but that’s a small price to pay for helping the project deliver more customer value.

A particular type of scenario test is the “soap opera.” Hans Buwalda describes the technique in his article “Soap Opera Testing” as follows:

Soap operas are dramatic daytime television shows that were originally sponsored by soap vendors. They depict life in a way that viewers can relate to, but the situations portrayed are typically condensed and exaggerated. In one episode, more things happen to the characters than most of us will experience in a lifetime. Opinions may differ about whether soap operas are fun to watch, but it must be great fun to write them. Soap opera testing is similar to a soap opera in that tests are based on real life, exaggerated, and condensed.12

Soap operas are harsh, complex tests—they test many features using intricate and perhaps unforeseen sequences. The essence of soap operas is that they present cases that are relevant to the domain and important to the stakeholders but that cannot be tested in isolation. They are a good test of robustness in iterations where the software is mature enough to handle long test sequences.

False Confidence

When you have automated or highly scripted testing, and you do not balance it with exploration, you run the risk of what Boris Beizer coined as the “Pesticide Paradox”:13

Every method you use to prevent or find bugs leaves a residue of subtler bugs against which those methods are ineffectual.

In other words, you can make your software immune to the tests that you already have. This pattern is especially a concern when the only testing being done is regression testing, and the test pool is very stable. There are three ways to mitigate the Pesticide Paradox:

1. Make sure that the tests the software faces continually include fresh ones, including good negative tests.

2. Look at gaps in test coverage against scenarios, QoS, risks, and code. Prioritize the gaps and think about tests that can close them.

3. Use progressively harsher tests, notably soap operas and exploratory testing, to confirm, from a knowledgeable domain expert’s perspective, that the software doesn’t have undiscovered vulnerabilities.

Exploratory testing, soap operas, and risk identification all help guard against a false sense of confidence.

When Should We Test?

In principle, it seems obvious that at any time in a development project, you should test the functionality that needs to be tested at that time. In practice, however, teams historically experience tremendous churn because testers do not know when to test. In these cases, testers claim they’re blocked, developers feel the testers are wasting their time, and collaboration quickly disintegrates.

VSTS alleviates this issue in three ways:

1. The iteration structure makes clear what functionality is planned when, and course corrections are frequent enough to enable plans to be fine-tuned (refer to Chapter 4).

2. The common product backlog makes clear what scenarios and QoS have actually been resolved and are therefore ready for test at a particular time.

3. The build report makes clear for every daily build what tasks and scenarios have actually been delivered and integrated in the build and whether the build itself has passed BVTs and is ready for further testing (see Figure 7.20).

image

Figure 7.20 The daily build produces an automated report that includes the work items that have been resolved in the build. In other words, the report accumulates all the items that were marked resolved when code was checked in. This lets you see immediately whether intended functionality can be expected to work in the build and correspondingly which tests can now be run.

In addition to using the transparency provided by VSTS, a test manager should carefully think about which tests are appropriate for which cycles (see Figure 7.21).

image

Figure 7.21 MSF uses the concept of interlocking cycles to group activities to the appropriate frequency and to allow the right things to happen first. Testing activities vary according to these cycles.

Check-In Cycle

The atomic programming cycle is the set of activities leading to checking in source code. In VSTS, source should be delivered with unit tests and resolution of the corresponding work items. As a developer, before checking in, you should run the unit tests for any functionality that you deliver and any that is dependent on your delivery. Depending on the software project, it might be appropriate to supplement unit tests with component integration tests. (You should also run static code analysis.) Check-in policy can enforce this practice.
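As a concrete illustration, the unit tests run before check-in (and that a check-in policy can require) should be small, fast, and free of external dependencies so that they add almost nothing to the cost of checking in. A minimal sketch in MSTest, assuming a hypothetical DiscountCalculator class and CustomerTier enumeration delivered with the changeset:

using Microsoft.VisualStudio.TestTools.UnitTesting;

[TestClass]
public class DiscountCalculatorTests
{
    [TestMethod]
    public void GoldCustomer_ReceivesTenPercentDiscount()
    {
        DiscountCalculator calculator = new DiscountCalculator();   // hypothetical class under test

        decimal discounted = calculator.Apply(100m, CustomerTier.Gold);   // hypothetical API

        // Deterministic and in-memory, so it is cheap to run at every check-in
        // and again as part of the nightly build verification tests.
        Assert.AreEqual(90m, discounted);
    }
}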

Daily Build Cycle

BVTs run with the nightly build automation, as discussed in the previous chapter. For BVTs, it is best to have lots of tests, each for relatively fine-grained scenarios, rather than a few, more complex tests. Typically, these are a superset of the pre-check-in unit tests that the individual team members ran before delivering new source code, plus resilient scenario tests that other team members have delivered. Having small, self-contained tests makes it possible to quickly isolate the failure cases and drill into them. It also makes it possible to update the tests when the corresponding scenarios change.

It is important that the suite of BVTs grow from iteration to iteration. As the software functionality matures, completed work from every prior iteration should be verified continually with BVTs.

Accepted Build Cycle

After a software build has passed the BVTs, it is ready for further testing. Now you can test the new scenarios and QoS that have been implemented in the current iteration. This list should be obvious from both the work items of the product backlog and the build reports. As scenarios and QoS first become available, you should run supportive tests. As these easier tests pass, you should run ever more challenging tests.

Which builds should be promoted for further testing? The answer depends on the goals of the iteration and the length of time it takes to test a build thoroughly. When there is significant new functionality from one nightly build to the next, the new work should be tested and feedback should be given to the developers as soon as practical. This prioritization is a “Last In, First Out” stack, and it ensures that the developer receives bugs on recent work while it is still fresh in mind.

On the other hand, some testing cycles for broad business functionality or many configurations take days or (unfortunately) weeks to complete. Setup requirements, such as configuring and loading a database, can make changing builds disruptive during a test cycle. If (and only if) that’s true, and the particular testing is not dependent on the new work, it may be worth delaying the acceptance of the new build into test. Even better is to find a way to shorten the amount of time needed to run the full test pass.

Iteration Cycle

The scenarios and QoS scheduled for an iteration should be considered exit criteria that need to be verified by testing. The work items that track these scenarios and QoS should not be closed until the corresponding functionality has passed testing in the software as a whole. If you reach the end of an iteration and discover that you have not completed the planned scenarios and QoS, this is essential information for planning the next iterations. (See the discussion of project velocity in Chapter 4.)

However, you should not let an exit criterion be declared satisfied until testing has confirmed that the expected code works. This makes the iteration cycle the most visible testing cycle. In a well-run project, where testing happens concurrently with development, there will not be a large testing bottleneck at the end of the iteration. On the other hand, if testing is not kept in lockstep with development, there may be a big bulge. This growing bulge will be visible during the iteration in the Remaining Work chart. It is undesirable, creates a significant risk, and will disrupt the rhythm of the project substantially. As a test manager or project manager, you should keep your eye on this chart daily to avoid becoming trapped at iteration end (see Figure 7.22).

image

Figure 7.22 The bulge in this Remaining Work chart indicates that although the coding of scenarios is progressing well, there is a bottleneck in testing. This may be due to inadequate testing resources or inadequate quality of scenarios delivered in the builds. (You would check the Reactivations report next if you suspected poor quality.)

Project Cycle

At the end of an iteration, the software should be passing the tests planned for that iteration, and in subsequent iterations, it should continue to pass them. When the software keeps passing the tests put to it, you develop great momentum. The more automated the tests, the more frequently they are run, and the more frequently this confidence is reinforced.

Of course, the more mature the project, the harsher the tests should be. Continue to use soap operas, negative tests, and exploration to check for blind spots. Use configuration testing to make sure that production constraints have been addressed. Check trends on coverage against code, requirements, and risk.

Which Tests Should Be Automated?

In the last ten years, a lot has been written about the pitfalls and benefits of test automation.14 I will simplify the arguments here. Automation is useful when it achieves high coverage and when the tests will be used many times across changes in the software under test (SUT). However, automation is expensive and hard to maintain, especially if based on the SUT’s user interface. Moreover, automation often leads to a false sense of security, especially when its results are not balanced against views of coverage and when its test cases are not balanced with harsh exploratory and negative testing.

These considerations lead to some guidelines:

1. Automate tests that support programming, such as unit tests, component/service integration tests, and BVTs, and make sure that they achieve very high code coverage, as discussed in Chapter 6, “Development.”

2. Automate configuration tests whenever you can.

If you expect your software to be long-lived, then

3. Automate scenario tests when possible, but expect that they will need maintenance. Where possible, do not rely on the UI for testing and instead code more durable tests against the appropriate APIs.

4. Automate load tests, but again, expect that they will need maintenance.

And. . .

5. Guard against a false sense of confidence with exploratory testing, negative tests, soap operas, and tests against risks. Most of these will be manual because you will be more interested in maximizing the diversity of your testing than repeating the same tests every time.

How Efficient Is Our Team, Or Our Outsourced Team?

Together, the reports from the metrics warehouse answer the question of team effectiveness and efficiency. Specific patterns of problems are covered in Chapter 9, “Troubleshooting the Project,” but here is a general guide to using the reports to answer this question.

The Remaining Work report, as illustrated previously in Figure 4.4, shows you the cumulative flow of intended work through testing. The middle band, from Resolved to Closed, is the work in process of testing. If you see a relatively consistent width, as in Figure 4.4, then you know that you have a smooth flow. The smooth flow gives you a clear indication of the match of your resources to capacity. (This contrasts with the bottleneck shown in Figure 7.22, indicating a resource mismatch or problem with quality at the time of resolution.) Velocity (Figure 4.5) in the Resolved series drills into the details of the capacity and its variance.

Requirements Test History (Figure 7.14) shows the progression of the testing against scenarios and QoS, while Quality Indicators (Figure 4.7) puts test results, bugs, and code coverage together. By putting these series together, you can make sure that the independent dimensions are progressing as expected.

Of course, in judging testing, you need to look at upstream quality that is being passed into testing. The Build History report (Figure 9.11) shows the status of daily builds, which should be completing successfully and passing their BVTs with rare exceptions. Quality Indicators provides two series that should be watched together—code churn, the indicator of how much new code needs to be tested, and code coverage, the measure of how much of it actually is being tested. Reactivations (Figure 4.9) are bugs that have been reopened after being resolved, that is, reported fixed. If these are high or rising, it’s a clear indicator of upstream problems.

Summary

This chapter has covered testing in a value-up paradigm. I started with an exercise to illustrate the importance of testing as a key source of information, along with the tenets of finding important problems fast, focusing on risk, and maximizing diversity.

Next, I went through basic questions that testing should answer and illustrated how to use VSTS to help answer them:

• Are we delivering the customer value?

• Are the qualities of service, such as performance and security, fit for use?

• Have we tested the changes?

• What haven’t we tested?

• Does it work in production as well as in the lab?

• Are we testing enough?

• When should we be running the tests?

• Which tests should be automated?

• How efficient is our team, or our outsourced team?

These are the simple value-up questions that testing needs to address.

Endnotes

1. Cem Kaner, James Bach, and Bret Pettichord, Lessons Learned in Software Testing (New York: Wiley, 2002), 1.

2. The root of many of these is IEEE STD 829-1983.

3. For example, Beck 2000, op. cit.

4. Adapted from Brian Marick, “Classic Testing Mistakes,” 1997, available at http://www.testing.com/writings/classic/mistakes.pdf.

5. These heuristics were originally presented as part of context-driven testing by Kaner, Bach, and Pettichord (257); I’ve included them as part of the Value-Up Paradigm.

6. Cem Kaner, “Cem Kaner on Scenario Testing,” STQE Magazine, September/October 2003, 22, available at www.stickyminds.com. For a detailed view of Kaner’s course on this subject, see http://www.testingeducation.org/BBST/ScenarioTesting.html.

7. James A. Whittaker and Herbert H. Thompson, How to Break Software Security: Effective Techniques for Security Testing (Boston: Addison-Wesley, 2004). Whittaker and Thompson have identified 19 attack patterns that are standard approaches to hacking systems.

8. Boris Beizer, Software Testing Techniques (Boston: International Thomson Computer Press, 1990), 535.

9. James Bach has written extensively on the heuristics for defining when our software is good enough for its purpose. See http://www.satisfice.com/articles.shtml for a collection of his essays.

10. Cem Kaner, private email. Malcolm Gladwell, The Tipping Point (Little Brown & Co., 2000), 141, has popularized the discussion, based on Mayor Giuliani’s use in New York City. The statistical evidence supporting the theory is disputable; see Steven D. Levitt and Stephen J. Dubner, Freakonomics: A Rogue Economist Explores the Hidden Side of Everything (New York: HarperCollins, 2005). Nonetheless, the psychological argument that communities, including software teams, can become habituated to conditions of disrepair is widely consistent with experience.

11. Kaner, Bach, and Pettichord 2002, 18.

12. Hans Buwalda, “Soap Opera Testing,” Better Software, February 2004, 30–37, available at www.stickyminds.com.

13. Boris Beizer, Software Testing Techniques (Boston: International Thomson Computer Press, 1990), 9.

14. For a classic discussion of the risks of bad automation, see James Bach, “Test Automation Snake Oil,” originally published in Windows Tech Journal (November 1996), available at http://www.satisfice.com/articles/test_automation_snake_oil.pdf.
