Chapter 9. Use Examples

In the first two parts of this book, the actual work started with examples derived from the business goals. For good examples, you need domain knowledge—or at least easy access to someone who has it. In the first example, the team did not have enough domain knowledge—and they had received that feedback through their earlier failures. The team at Major International Airport Corp. therefore sat together in a workshop. To compensate for their lack of domain knowledge, they invited business expert Bill to join them. With Bill’s help, Phyllis and Tony were able to derive examples for the business rules regarding airport parking lot costs.

In the second example, we worked together through a problem where I felt I had enough domain knowledge. Had that assumption been wrong, we could have found ourselves trapped until it was too late. For the depth that we covered with this example, we had sufficient domain knowledge, so we could start after organizing some of our thoughts.

Examples are the first pillar in working with acceptance test-driven development, specification by example, behavior-driven development, or whatever you might call it. You can always work with examples of any kind, regardless of whether you automate the examples later.

I once applied this approach at a company that was doing product development using a derivation of waterfall development. I was working in the System Integration department. The system we were integrating was a mobile phone rating and billing solution. We configured that system for use by the end customer while the Product Development department developed the product for a larger customer base. The product was highly configurable for different tariffs in the domain of mobile phones. One day the Product Development department came up with a redesign of the GUIs used to configure the product. Unfortunately, a feature from an earlier release dealing with different workflows and approvals for new tariffs had become an obstruction. Since we were the customer and the domain experts on configuring the product, our colleagues from product development reached out to us.

We held three meetings to clarify the problem and come up with a redesign that would benefit both departments. Previously, the system had a workflow for bringing new tariffs into the system. The old workflow had never worked for a real customer. Now that the product was supposed to support modular pieces instead of the single configuration file it had used up to that point, this old workflow would have made things worse. The configuration component would have needed to support the validation of multiple working sets and to validate changes and dependencies across multiple working sets. This would eventually have ended in a mess for the product’s testing department as well.

Within a three-hour workshop we discussed the whole workflow. With the vision of the new solution in mind, we came up with examples describing the future workflow with the split-up configuration. When the updated version arrived six months later, everyone was happy with it.

The examples we identified in the workshop were very high level. The testing department at that time did not have the ability to automate these tests at the level at which we had discussed them. Still, the conversation about the requirements alone led to a better implementation of the product and a better customer experience on our side.

Regardless of whether you automate your examples, there are some things to watch out for, and this chapter is about them. The key factors that influence the success of your approach are the format the examples are written in, how detailed the examples are, and whether you have considered any gaps in your overall approach.

Use a Proper Format

Expressing your tests using examples raises the question of how to express these examples. Although you might think that expressing examples is pretty easy, some patterns have emerged over the past decade within teams applying ATDD.

The proper format for you depends on factors in your team, your project, and your organization. First of all, you need to consider who is going to read the tests after you create them. This probably includes yourself, the programmer who is going to develop the feature described by the examples, and your product owner or customer. You may also need to consider a future test maintainer if you hand over your tests to a different team, department, or even company. All these future test readers may want to get a quick picture of the feature that you describe today. That’s where a common way to express your examples becomes relevant.

Keeping all these possible target groups in mind, you should strive for examples that can be understood by any of these people. This means that you shouldn’t bother writing technical descriptions of the workflow unless everyone involved can make sense of them. We could have written the examples for our traffic light controller in the following way:

If the green signal circuits for direction A and direction B are turned on, shut down the traffic light controller to a failsafe state. In the failsafe state, the yellow lights will blink constantly. Only a technician is allowed to bring the traffic lights out of this failsafe state, after inspecting them. The controller has to log all state changes for technical analysis in such cases.

This is a more narrative style, such as you might find in a traditional specification. Now, reconsider the way we wrote down our examples. The first thing you might notice is that we didn’t bother with the logging requirement at all—maybe this will be part of some future development. But we also noted all the circumstances under which the failsafe mode shall be enabled. This more declarative style of writing down examples provides a lot more information in the future, when we have to read and extend the examples. And this information will also serve as a communication device for real customers—and even users of the system such as you and me.

ATDD-friendly test automation frameworks split up the test data and the test automation code necessary for executing the tests. Since the test automation code glues together the development of your tests and the development of your application, I deliberately call it glue code. The glue code may be implemented in different languages within the different available tools. Most tools are built upon some convention for calling functions in the glue code. By splitting the test data from the glue code, you can define the test examples independently of any particular implementation of the application.

There are many ways to write down examples for your application requirements. The most popular approach in the past few years has been the one from behavior-driven development (BDD). The BDD format is built around the terms Given-When-Then, which help to structure the expectations of a particular feature.

Another popular way to express examples is tabulated formats. There are some variations, but all have three forms in common: one for input-output mappings, one for queries, and one for actions as in workflows.

A third way to express examples is keywords, or data-driven tests. With keywords you can combine different levels of abstraction in your test language. You can decide to use lower-level functions for tests focused on lower levels of your application, but you can also combine lower-level functions to yield a higher level of abstraction. With keywords, these different levels of abstraction blend fluently in your test description, thereby forming a language for your domain in your tests.

Behavior-Driven Development

BDD was first described by Dan North [Nor06]. North mentioned the Given-When-Then paradigm in that article for the first time. While BDD consists of more than using Given-When-Then as a method to express the behavior of features, this particular syntax has become almost a synonym for BDD in recent years. We saw the Given-When-Then format in the airport example. The three steps in the BDD format are arranged around the keywords Given, When, and Then. As an example, consider Listing 9.1, which searches Google for the term ATDD and expects to find lots of results.

Listing 9.1. A basic search on the Internet

1 Feature: Google Search
2   As an Internet user I want to use Google for searches
3
4   Scenario: Search for ATDD
5     Given the default Google page
6     When I search for 'ATDD'
7     Then I will find lots of results

In the Given part you express all preconditions for the particular behavior. The preconditions form the context for the description of the behavior. This includes any particular settings in an expressive manner. In the example above, line 5 ensures the precondition that I have a browser open and that it shows the default Google page. In the airport example we didn’t have any Givens, since all tests were based upon the parking lot calculator being open in a browser window. There was no reason to deviate from this default. In general, you want to express in the Given part anything that differs from a default case. For a system with different types of accounts, you will express which account you are working on in this particular example. For a system of various web pages, you may want to express the particular page on which your example operates. For a workflow in an application, you may want to express all the data that is relevant and that you expect to have been entered in the previous steps of the workflow, right before beginning the operation on the described part of the workflow. There may be more than one Given part; they are usually combined through consecutive And statements.

In the When part, you describe the operation that you are triggering and all its parameters. In the example above, we express in line 6 that we would like to search for a specific term. The operation that we want to trigger is basically entering some text in a text field on a web page and hitting the default button. Other examples might be that you reloaded an account with some money, that you entered different values into a web page form, or that you proceeded to the checkout of an online shop after filling your cart. The When part should describe something that happens [dFAdF10]. This may be a particular trigger from a user, an event outside the current subsystem—such as an asynchronous message from a third-party Internet service—or a function within the computer system, like a time-out. Additionally, there should be exactly one action step in any scenario description. This gives the described scenario a small enough focus.

The Then part describes post-conditions after executing the action from the When part. In most cases these are assertions against the system. In the Google search example, we check in line 7 that there are lots of search results. As another example, after reloading an account with some amount, you can check the bonus amount that your system shall provide. For a web page form, such as the parking lot calculator, you can assert the calculated parking costs against your own expectations. Like Given statements, several Then statements may be chained by using And as a connector.
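To make the mechanics concrete, here is a minimal sketch in Python of how a BDD tool maps the Given-When-Then lines of Listing 9.1 onto glue code. The step registry, the SearchWorld class, and the fake result count are illustrative assumptions of mine—real frameworks do the pattern matching for you, typically with regular expressions much like these.

```python
import re

# A hypothetical, minimal step registry: maps regex patterns to glue functions.
STEPS = []

def step(pattern):
    """Register a glue function for step text matching the pattern."""
    def decorator(func):
        STEPS.append((re.compile(pattern), func))
        return func
    return decorator

class SearchWorld:
    """Holds scenario state between steps (a stand-in for a real browser)."""
    def __init__(self):
        self.page = None
        self.results = 0

@step(r"the default Google page")
def open_page(world):
    world.page = "google.com"

@step(r"I search for '(.+)'")
def search(world, term):
    # A real step would drive a browser; here we fake a result count.
    world.results = 1000 if term == "ATDD" else 0

@step(r"I will find lots of results")
def check_results(world):
    assert world.results > 100

def run_scenario(lines):
    """Strip the Given/When/Then keyword, then dispatch to the glue code."""
    world = SearchWorld()
    for line in lines:
        text = line.split(None, 1)[1]  # drop the leading keyword
        for pattern, func in STEPS:
            match = pattern.fullmatch(text)
            if match:
                func(world, *match.groups())
                break
        else:
            raise LookupError("no step definition for: " + text)
    return world

world = run_scenario([
    "Given the default Google page",
    "When I search for 'ATDD'",
    "Then I will find lots of results",
])
```

Note how the test data stays plain text: swapping in a different set of step definitions would run the same scenario against a different implementation.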

One of the mantras in behavior-driven development is outside-in development. BDD favors the development of code based upon requirements from the customer. The purpose of the workshop in the airport parking lot example was to derive the scope for the calculator based upon the business goals. The team achieved this by expressing the requirements without constraining the implementation of the code to a predefined solution. The purpose of requirements elicitation is to explore the space of possible solutions [GW89]—that is, any solutions that fulfill the business rules. During the design process the designer seeks the best trade-off between possible solution parameters. The team at Major International Airport Corp. achieved this by abstracting the examples they identified without having a particular implementation of the user interface in mind. In fact, the examples can be hooked up to a new user interface by replacing the step definitions with definitions for the new user interface. Since the business rules probably will not change when using a different user interface, examples that are based upon the business rules remain valid.

Tabulated Formats

One of the most popular approaches using tabulated formats is the Framework for Integrated Tests (FIT) [MC05]. In 2008 Robert C. Martin introduced the Simple List Invocation Method (SLiM) in FitNesse [Mar08b], which provides similar table structures to express requirements. In both approaches there are three commonly used styles: tables that take inputs and check outputs from the system, tables that query the system and check collections of values against the ones obtained from the system under test, and tables that support workflows.

Decision Tables

The first kind of table takes input values to the system, executes something in the system, and then checks some values from the system against the ones provided. We saw such an example in the traffic lights example. In SLiM these are called DecisionTables; in FIT they are called ColumnFixture; and in FitLibrary, an extension for FIT, these tables are called CalculateFixture.

You can express the majority of requirements using a decision table format. The tables for the airport parking lot example are decision tables. The parking duration is the input to the tested system; the parking costs are the output values from the system, which the test runner will check. Listing 9.2 shows the airport examples for valet parking formatted as a decision table. Do they look familiar to you?

Listing 9.2. The valet parking tests expressed as a decision table in SLiM

 1 !|Parking costs for|Valet Parking |
 2 |parking duration  |parking costs?|
 3 |30 minutes        |$12.00        |
 4 |3 hours           |$12.00        |
 5 |5 hours           |$12.00        |
 6 |5 hours 1 minute  |$18.00        |
 7 |12 hours          |$18.00        |
 8 |24 hours          |$18.00        |
 9 |1 day 1 minute    |$36.00        |
10 |3 days            |$54.00        |
11 |1 week            |$126.00       |
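The business rule behind this table can be reconstructed from its rows alone. The following Python sketch shows one possible reading (an assumption for illustration, not the book’s actual code): up to five hours cost $12.00, and beyond that every started day costs $18.00. A test runner for decision tables does essentially this—feed each input cell to the system and compare the result against the output cell.

```python
# A sketch of the system under test behind Listing 9.2: the valet parking
# business rule, reconstructed from the table rows (not the book's actual code).
def valet_parking_costs(minutes):
    if minutes <= 5 * 60:
        return 12.00
    started_days = -(-minutes // (24 * 60))  # ceiling division
    return 18.00 * started_days

# The rows of the decision table, as (duration in minutes, expected costs).
table = [
    (30, 12.00),
    (3 * 60, 12.00),
    (5 * 60, 12.00),
    (5 * 60 + 1, 18.00),
    (12 * 60, 18.00),
    (24 * 60, 18.00),
    (24 * 60 + 1, 36.00),
    (3 * 24 * 60, 54.00),
    (7 * 24 * 60, 126.00),
]

# Check every row, just as a decision table runner would.
for duration, expected in table:
    assert valet_parking_costs(duration) == expected
```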

There is no restriction to a single input value, nor to checking a single output value. In fact, if you leave out the output values, you get a special table for setting up values in the system. This may be handy if you want to describe three accounts that your tests are later going to operate on. See Listing 9.3 for an example. This special decision table without outputs is often called a setup table, or a SetUpFixture.

Listing 9.3. A setup table preparing three different accounts

1 !|Account Creator                         |
2 |account name  |account state|role        |
3 |Susi Service  |active       |service user|
4 |Tim Technician|active       |technician  |
5 |Uwe Unemployed|unemployed   |service user|

While there is no restriction on how many columns your table may have, in practice readability suffers heavily if you come up with a table with more than 10 columns. I have seen tests expressed using 30 or so columns and hundreds of rows.1 The problem with these tests is that they are not easy to understand. A future test maintainer—remember, this could be you—is unlikely to grasp in a few seconds what a particular row is about. This is a test smell that you may want to avoid.2

Query Tables

Query tables are necessary for checking collections of values from the system. In SLiM these are called Query Tables; in FIT they are called RowFixture; and in FitLibrary there are ArrayFixture, SetFixture, and SubsetFixture. Often there is a need to check the order of a collection of entries, or to ensure that a collection contains a subset of entries. In SLiM, query tables may be prefixed with the word Subset or Ordered to yield these special variants. In FitLibrary there are different classes that you need to subclass. Listing 9.4 shows an example checking all the users in a system.

Listing 9.4. A query table checking the existence of some previously set up data

1 !|Query:Users in the System                     |
2 |user name           |role           |user state|
3 |Tim Tester          |Tester         |active    |
4 |Paul Programmer     |Programmer     |active    |
5 |Petra Projectmanager|Project Manager|active    |

Query tables are used after gathering data from the system, when you receive a list of different things stored in the system. They come in handy for a collection of accounts in the system, for example, after a search operation. For an online shopping site you can query the shopping cart of your customer and check the contents of the cart against an expected list.

You can check the entries in the collection against just one of their attributes, like the name of an item, or against a combination of name, price, and amount for a shopping cart. Usually the set of attributes denoted in the table determines which attributes will be checked. The granularity of your tables then dictates how thoroughly you check data from the system.
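A sketch of what such a check boils down to: only the attributes named in the table header are compared, and any further attributes of the system’s rows are ignored. The helper and the data below are hypothetical; they mimic an ordered comparison like ArrayFixture, whereas the Subset and Set variants would compare sets instead.

```python
# A minimal sketch of how a query-table check might work: only the attributes
# named in the table header are compared; extra attributes in the system's
# rows are ignored. (Hypothetical helper, not a real framework API.)
def check_query_table(header, expected_rows, actual_rows):
    projected = [tuple(row[attr] for attr in header) for row in actual_rows]
    return [tuple(row) for row in expected_rows] == projected

actual = [  # what the system under test returns, with extra attributes
    {"user name": "Tim Tester", "role": "Tester", "user state": "active", "id": 7},
    {"user name": "Paul Programmer", "role": "Programmer", "user state": "active", "id": 9},
]

# The table header decides which attributes are checked; "user state" and
# "id" play no role here.
header = ["user name", "role"]
expected = [["Tim Tester", "Tester"], ["Paul Programmer", "Programmer"]]

assert check_query_table(header, expected, actual)
```

Dropping a column from the header is exactly the trade-off discussed next: less to maintain, but also less checked.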

The more data you express in your tables, the more places you may have to adapt if, for example, the name of an attribute changes. For the design of your tests, there is a trade-off between how thoroughly you want to test the application and how much effort you will put into future maintenance of your test suite. On one hand, you may include fewer details in the query tables, leading to a higher risk that something breaks unnoticed by this test. You can tackle this risk by extending the test suite with more examples for the details you left out, or by setting time aside for exploratory testing to address the risk. Of course, each of these decisions has side effects as well. Another decision could be to include all the details you can think of. This may lead to fragile tests if, for example, attributes are renamed. Between these two extremes there are intermediate approaches as well. But this also means that you will have to decide in your particular situation how many details to include in your table structure.

Probably the easiest thing is to go with your first gut feeling and reflect on your decision from time to time. If you find yourself maintaining a lot of tests in your test suite, you should revisit some of your decisions and refactor your tests.

Script Tables

You can use script tables to express workflows in your system. You can also use script tables to combine decision tables with query tables in one test. An alternative to the Given-When-Then style of BDD in a tabulated format is a pattern called Setup-Execute-Verify, or Arrange-Act-Assert. For setups I often use DecisionTables without checked outputs in order to establish the preconditions in the system. Then I invoke a single operation in a script table, like reloading the balance of the account I just prepared. As a final verification step, I can use a query table to gather all the balances stored in my account and check the values after the reload. You can see a full example in Listing 9.5.

Listing 9.5. A script table with a whole flow through the system

 1 !|script|Account Reload|
 2
 3 !|AccountCreator        |
 4 |account name   |tariff |
 5 |prepaid account|prepaid|
 6
 7 |reload|50.00 EUR|on|prepaid account|
 8
 9 !|Query:BalanceChecker|prepaid account|
10 |balance name         |balance value  |
11 |Main balance         |50.00 EUR      |
12 |Bonus balance        |5.00 EUR       |
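The same Setup-Execute-Verify flow can be sketched as plain code against a toy account model. The Account class and the 10% bonus rule are assumptions made up for illustration; amounts are kept in euro cents to avoid floating-point issues.

```python
# The Setup-Execute-Verify flow of Listing 9.5, sketched as plain code against
# a toy account model (hypothetical names, not the book's actual glue code).
class Account:
    def __init__(self, name, tariff):
        self.name = name
        self.tariff = tariff
        self.balances = {"Main balance": 0, "Bonus balance": 0}

    def reload(self, amount_cents):
        # Assumed business rule for this sketch: a reload earns a 10% bonus.
        self.balances["Main balance"] += amount_cents
        self.balances["Bonus balance"] += amount_cents // 10

# Setup: create the account (the AccountCreator table).
account = Account("prepaid account", "prepaid")

# Execute: trigger a single operation (the reload row of the script table).
account.reload(5000)  # 50.00 EUR in cents

# Verify: check all balances (the BalanceChecker query table).
assert account.balances == {"Main balance": 5000, "Bonus balance": 500}
```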

Script tables from SLiM are called ActionFixture in FIT, or DoFixture in FitLibrary. In general, you can express anything you want in your test table with them. The tools use a convention for calling particular functions in your glue code based upon how you wrote your examples. This is highly dependent on the particular tool you chose. Usually there is a way to separate the parameters to the function from the text. The text is concatenated—camel-cased3—to derive the name of the function to call.
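As a sketch of this convention, the following snippet derives a function name from a script row roughly the way SLiM-style tools do it: the text cells are camel-cased into the name, and the cells in between become the parameters. The exact rules vary by tool, so treat this as an illustration.

```python
# A sketch of the naming convention many tools use: the plain text of a script
# row is camel-cased into a function name in the glue code, and the table
# cells in between become the parameters. (Illustrative; tool conventions vary.)
def camel_case(text):
    words = text.split()
    return words[0].lower() + "".join(word.capitalize() for word in words[1:])

# A script row like |reload|50.00 EUR|on|prepaid account| interleaves text and
# parameter cells; the text cells form the name, the remaining cells the arguments.
cells = ["reload", "50.00 EUR", "on", "prepaid account"]
name = camel_case(" ".join(cells[0::2]))   # text cells: "reload" + "on"
args = cells[1::2]                          # parameter cells

assert name == "reloadOn"
assert args == ["50.00 EUR", "prepaid account"]
```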

Script tables are useful whenever you want to express larger sets of operations. For example, using script tables you can start a system, input some values, click a button, and then check some values before finally shutting down the system again. You may often want to use a combination of different tables in your tests.

Keyword-Driven Automation

You can use keywords to combine several operations into a single one. The operation that combines several keywords thereby becomes a higher-level operation. You can combine multiple levels of keywords to match the right level of abstraction for your tests.

In the traffic lights example we used the invalid combination scenario to express a higher-level concept in its own keyword (see Listing 9.6). In the test execution report we could also dive into the trace of this keyword. Another popular framework built on keywords is Robot Framework.4 Robot Framework comes with a rich set of test libraries and provides add-ons for web-based tests, SSH support, databases, and Swing.

Listing 9.6. The usage of a scenario table as a keyword in the traffic lights example (repeat of Listing 7.15)

 1 ...
 2 !2 Invalid combinations
 3
 4 !|scenario       |invalid combination|firstLight||secondLight|
 5 |set first light |@firstLight                                |
 6 |set second light|@secondLight                               |
 7 |execute                                                     |
 8 |check           |first light        |yellow blink           |
 9 |check           |second light       |yellow blink           |
10
11 !|script|FirstLightSwitchingCrossingController|
12
13 !|invalid combination    |
14 |firstLight  |secondLight|
15 |green       |red, yellow|
16 |green       |green      |
17 |green       |yellow     |
18 |yellow      |red, yellow|
19 |yellow      |green      |
20 |yellow      |yellow     |
21 |red, yellow |red, yellow|
22 |red, yellow |green      |
23 |red, yellow |yellow     |
24 |red         |red, yellow|
25 |red         |green      |
26 |red         |yellow     |
27 |yellow blink|red        |
28 |yellow blink|red, yellow|
29 |yellow blink|green      |
30 |yellow blink|yellow     |

When using keywords, the border between test data representation and test automation code becomes blurred. You may have keywords implemented in a programming language, driving the application or interacting with the operating system on a lower level. You may as well have keywords that are constructed by combining lower-level keywords. And you may also combine these higher-level keywords in new ways. With this organization you get multiple levels of keywords.
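A minimal sketch of such keyword layering: higher-level keywords are defined as sequences of lower-level ones, and execution recurses until it reaches a primitive implemented in a programming language. The registry structure and the traffic-light keywords below are hypothetical, loosely modeled on the scenario in Listing 9.6.

```python
# A sketch of multiple keyword levels: higher-level keywords are composed of
# lower-level ones, and execution recurses down to the primitives.
# (Hypothetical structure, not a specific tool's format.)
primitives = {}       # keyword name -> Python function
compositions = {}     # keyword name -> list of (keyword, argument templates)

log = []  # records what the primitives did, standing in for a real driver

def primitive(name):
    def decorator(func):
        primitives[name] = func
        return func
    return decorator

@primitive("set light")
def set_light(which, color):
    log.append(f"{which} light set to {color}")

@primitive("check light")
def check_light(which, color):
    log.append(f"{which} light checked for {color}")

# A higher-level keyword composed of lower-level ones; {0} and {1} are
# placeholders for the keyword's own arguments.
compositions["invalid combination"] = [
    ("set light", ["first", "{0}"]),
    ("set light", ["second", "{1}"]),
    ("check light", ["first", "yellow blink"]),
    ("check light", ["second", "yellow blink"]),
]

def run(keyword, args):
    if keyword in primitives:
        primitives[keyword](*args)
    else:
        for sub_keyword, templates in compositions[keyword]:
            run(sub_keyword, [t.format(*args) for t in templates])

run("invalid combination", ["green", "green"])
```

Each row of the decision table in Listing 9.6 corresponds to one such `run` call with different arguments.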

This approach is a powerful way to organize your test data around a set of low-level test automation drivers. With new combinations of keywords, the language for your test automation evolves and becomes even more powerful. On the downside, this approach may lead to deep and complex levels of test automation keywords. At the time of this writing there were few refactoring tools available for keywords, so maintaining several levels of keywords can become a mess. Although it should be simple to search and replace text in some files, the fluent style of this test data makes it complicated to find every occurrence. Renaming a keyword, for example, or extending the parameter list of a keyword becomes difficult. Most modern IDEs support such refactorings for programming languages in a safe way. In the long run we will hopefully see more refactoring tools for keyword-driven automation as well—eventually even refactoring across lower-level and higher-level keywords at the same time.

Glue Code and Support Code

How your examples get automated depends heavily on the framework in use. For example, for Java some frameworks rely on Java annotations to find the code to execute for a given phrase in the example. Other frameworks use camel-cased function names. Some frameworks support the usage of bean methods—getters and setters for properties in your classes, like the first name and last name of an account.

Regardless of the framework or tool you use for automating your examples, you should keep in mind that you are developing code when you write your glue and support code. Some teams seem oblivious to this fact. When you wire up your application to the tests, the code you write may seem simple and easy to maintain. But over time your growing code base will become hard to maintain if you don’t apply to the design of your glue and support code the same principles you would apply to your production code.

For me this usually means starting with a simple code base. I want to be able to start automated acceptance tests fast. For one client we were able to come up with some automated tests on the acceptance level after two people spent roughly two hours on it. Two more hours, and we had a basic understanding of the driver code for our automated tests, as well as a basic glue code layer through which we could run most of our acceptance tests. We were aware that this code would be extended once we got more complex automated tests later.

Over time, as more and more functionality grows, I consider actively extracting new components from the existing code. As we saw in Part II with the LightStateEditor, we added these components using test-driven development. We actually tested our test code. At one company where I worked, we even applied several code metrics, like static code analysis and code coverage, to our automation code. We had a separate project with a separate build in our continuous integration system. When we compared the metrics to the ones from our production code, our test code was outperforming the production code. That test automation code was still in use and under development two years later.

If this sounds rather extreme to you, consider the following story I heard in 2009. A team had a visit from an external contractor for software test automation. The contractor felt uncomfortable adding more support code with complex logic, so he used test-driven development for the complex code that he was adding. After the contractor left, the code coverage of the whole application had risen by 10% to a total of 15%.

Developing test automation code is software development. Therefore, you should apply the same principles to your glue and support code that you apply to developing your production code. In the past I have seen test automation code that lacked design, was not well documented, and was sometimes even fragile, with bugs in it. When we developed a new test automation approach at one company, we overcame all these drawbacks by applying incremental design together with test-driven development. Test-driven development helped us to understand the code while also providing the necessary documentation. By using TDD we also ensured that our test automation code did not suffer from bugs that would lead to false positives—tests that pass although the software is broken—or false negatives—tests that fail although the software is working—when running our automated tests.

The necessity of adding unit tests to your glue and support code becomes obvious if you use ATDD in the way we used it in Part II. While working with the acceptance tests, we discovered the domain of the application. With our knowledge of and experience with the domain, we could easily extract or derive the domain code from the glue code. If the glue code lacked unit tests at that point, we would add them either by extracting first and adding tests after the fact, or by developing the new domain code using test-driven development.

The Right Format

There is something to be said for the right format. The formats that I have shown—behavior-driven development, tabulated formats, and keyword-based examples—represent the variety of formats that were around when I wrote this book. In the past decade most of the tools available today have evolved to support tests in any of the formats shown here. For example, you can use a tabulated format in FitNesse’s SLiM framework with decision and query tables. Alternatively, you can use a Given-When-Then format with SLiM by using script tables. For Robot Framework there are similar ways to use any of the three presented formats. For your particular combination of test data format—BDD, tables, or keywords—and test automation language, there is probably at least one framework around that you can get started with.

This is great, since you can focus your attention on using the right format to express examples for your application in the most natural style and delay the tool decision until later. Most teams that started with the implementation of a particular tool got just that: an implementation of a tool, but not an implementation of a successful approach [Adz11]. Obviously, an automated testing tool is not a vital test strategy [KBP01].

For your examples, as well as for your tests once you have automated the examples, there are several stakeholders. First, there is the programmer of the feature, who needs to understand the example from the written description. Then there is the programmer who will automate the particular example. Your customer or product owner is also going to read through the test during the review once it’s automated. Your whole team might also consult your tests when the next extension is planned several iterations later, in order to understand side effects. Finally, there is the future maintainer of the system, who will adapt the test cases when necessary.

All these stakeholders should be able to quickly understand the intent of your examples and your tests. To do so, they need to be able to read the tests and understand their context and specific focus. The longer it takes to read and understand a test, the more time is wasted in the sense of lean thinking. From a systemic point of view, more wasted time leads to less time for productive work. The less productive work, the more perceived pressure you will feel from the outside. The more pressure you feel, the more likely you are to take shortcuts in your tests. The more shortcuts you take, the less understandable your tests become (see Figure 9.1). At this point, you have formed a vicious cycle, or downward spiral. You seem doomed.


Figure 9.1. A systems view on intent-revealing tests

You can break this vicious cycle by realizing the role you play in this system. There is one obvious decision point in this cycle that determines whether the described system is a vicious cycle or a positive feedback loop leading to a balanced system: the decision to waste time on reading acceptance tests [Wei91].

If you reverse this decision, you spend less time trying to understand your tests. The less time you spend, the more time you gain for productive work. This relieves you from outside pressure and keeps you from taking shortcuts in your tests.

Teams use acceptance tests as a communication device. The variety of stakeholders makes this transparent. While the application is under development, there is a permanent handover from one team member to another. Successful teams ensure the readability of their tests to reduce the friction that such handovers cause. In 2009 I heard a story from Enrique Comba-Riepenhausen, who said that his on-site customer could read not only his acceptance tests, but also his unit tests and even his domain code. His team had successfully implemented a concept that Eric Evans calls ubiquitous language [Eva03].

Every project shares a common language, a ubiquitous language. The more translation between different languages that is necessary in a project, the more potential there is for confusion. If the business domain is expressed using the same terms in your acceptance tests, you ensure that your customer will also understand your tests. If you also model your domain code around the same language, you ensure that everyone reaches the same understanding, avoiding confusion that would otherwise surface once it’s too late.

Refine the Examples

Getting essential examples from your customer or product owner helps you get started with the functionality. Unfortunately, a first discussion such as the one we saw in Part I might not give you all the examples that you need in order to build the software. You need to refine your examples after having identified a first set of them [Adz11].

Refinement of examples may happen in several ways. Usually, testers know how to flesh out hidden assumptions and boundary conditions from their first examples. Depending on the domain of the application, there may be constraints for a maximum length of a string, different validation rules, or combinations of business rules that exclude each other. If you miss these conditions in the initial Specification Workshop meeting, it pays to ask a tester to refine the examples.

Another case of refinement of examples is a particular business rule that the product owner was not sure about. After getting clarification about the underlying business rules, the product owner, a tester, and a programmer get together to refine the examples they identified in the first sitting. With the additional information, they can now extend the rough examples to form a more complete picture of the functionality.

As a tester, you will naturally apply boundary conditions, domain partitioning, or equivalence classes to refine examples. In fact, any testing technique may help to refine the examples and spot gaps. The following section presents a brief discussion of testing techniques you may want to keep in mind while refining your examples. A full discussion of the techniques is well beyond the scope of this book. If you are looking for more details on testing techniques, you may want to take a closer look at Lee Copeland’s A Practitioner’s Guide to Software Test Design [Cop04] or Kaner’s Testing Computer Software [KFN99].

Domain Testing

In domain testing, tests are broken down from a larger domain into smaller subdomains. Values that are fed into the system are partitioned into several classes where the system behaves the same way. These classes are called equivalence classes. We saw this in the airport example. The first equivalence class was the split among the different parking lots. The next classes were derived from the business rules. Usually, there was one class for the first few hours, one class for the first few days, and finally one class for weeks. Within each of these classes we picked one example exactly at the boundary, at least one right in the middle, and one value right before the next boundary and put them into the set of examples.

Let’s consider the example of the long-term surface parking lot. The business rules describe costs of $2.00 per hour, a $10.00 daily maximum, and $60.00 per week. The first equivalence class within this parking lot is the behavior for the first day up to the fifth hour. We pick examples for one hour, three hours, and five hours. From the fifth hour onward we have the second equivalence class, which extends up to the end of that day. We may pick five hours and one minute, ten hours, and twenty-four hours from this class. We may also choose to repeat similar values for the second day, so that we can observe that this condition holds there as well. This then becomes a combination of the first two equivalence classes. However, the behavior should be the same for the second day as for the third or fourth day in this class. Finally, when we hit the weekly maximum at the end of the sixth day, we get another equivalence class for the whole seventh day, which in turn may be combined with multiple weeks as well.
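A minimal sketch of these business rules in code can make the equivalence classes tangible. This is a hypothetical implementation, not the airport's actual pricing code; in particular, charging every started hour at $2.00 is an assumption that the team's examples would have to confirm:

```python
from math import ceil

def long_term_surface_cost(minutes):
    """Sketch of the long-term surface lot rules: $2.00 per started hour,
    capped at $10.00 per day, capped at $60.00 per week (assumed rounding)."""
    DAY, WEEK = 24 * 60, 7 * 24 * 60
    weeks, rest = divmod(minutes, WEEK)
    days, rest = divmod(rest, DAY)
    started_hours = ceil(rest / 60)
    # Partial day: hourly rate up to the daily maximum.
    partial_day = min(2.00 * started_hours, 10.00)
    # Full days plus the partial day are capped at the weekly maximum.
    return weeks * 60.00 + min(days * 10.00 + partial_day, 60.00)
```

Each equivalence class corresponds to one stretch of this function: the hourly class is the rising part up to five hours, the daily class is the flat $10.00 plateau, and the weekly class is the flat $60.00 plateau after the sixth day. The boundary values are exactly where the slope changes.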

Taking a look back at the results after the workshop, the values reflect this partitioning (see Table 9.1). The first five values fall into the first equivalence class for hourly rates. The next four illustrations represent examples from the second equivalence class, sometimes combined with the first equivalence class of hourly payment. The final six illustrations represent examples from the third equivalence class, sometimes combined with the former two equivalence classes.

Table 9.1. Long-Term Surface Parking Examples at the End of the Workshop

Image

Boundary Values

Testing for boundary values becomes easy when combined with a domain analysis beforehand. The idea is that errors are likely to occur at each of the boundaries of the equivalence classes. Traditional theory therefore suggests testing one value right in front of the boundary, one at the boundary, and one beyond the boundary. To me this seems to be basic math, and looking at the example from before, we seem to have applied it in a reasonable manner with some freedom.

Consider the airport example. In the case of the valet parking costs, there were two equivalence classes—one in which you would be charged $18.00, and another one in which you get charged $12.00. The boundary value here is five hours. Using this information, we should consider one test for exactly five hours (the boundary value), one case for less than five hours, i.e., four hours and 59 minutes, and one just above this boundary value, i.e., five hours and one minute (see Table 9.2).
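The two flat-rate classes can be sketched as a tiny function. Note that placing exactly five hours in the $12.00 class is an assumption of this sketch; it is precisely the kind of question a boundary-value example forces the team to pin down with the business expert:

```python
def valet_parking_cost(minutes):
    """Sketch: $12.00 for up to five hours, $18.00 beyond.
    The behavior at exactly five hours is an assumed interpretation."""
    FIVE_HOURS = 5 * 60
    return 12.00 if minutes <= FIVE_HOURS else 18.00

# Three boundary-value examples around the five-hour boundary:
boundary_examples = {
    "4h59m (just below)": valet_parking_cost(4 * 60 + 59),
    "5h00m (on the boundary)": valet_parking_cost(5 * 60),
    "5h01m (just above)": valet_parking_cost(5 * 60 + 1),
}
```

If the business expert decides that exactly five hours already costs $18.00, only the comparison operator changes—which is exactly why the boundary deserves its own example.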

Table 9.2. Boundary Values for Valet Parking

Image

Of course, there are more (hidden) boundary values, like a parking duration of zero minutes. The value of testers in specification workshops comes from seeing such hidden boundaries and equivalence classes that would otherwise remain unstated assumptions between business and development.

Pairwise Testing

Pairwise Testing is based on the observation that errors often happen for a combination of two different input values. If our application has an input A with values a1, a2, and a3 and an input B with values b1, b2, and b3, we can achieve great confidence with the test cases in Table 9.3.

Table 9.3. Pairwise Testing Examples

Image

If we add a third variable to our system C with the values c1 and c2, we end up with the test cases listed in Table 9.4.

Table 9.4. Pairwise Testing Examples with Three Variables

Image

The algorithm ensures that every combination of values for each pair of variables appears in at least one test case. For combinatorial problems, this approach helps to reduce the number of tests that you have to run while ensuring a basic level of coverage.
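A greedy all-pairs generator is enough to illustrate the idea. This is a sketch, not an optimized algorithm such as those in dedicated pairwise tools; it simply keeps picking the candidate test that covers the most still-uncovered value pairs:

```python
from itertools import combinations, product

def pairwise_suite(parameters):
    """Greedy sketch of pairwise test selection.

    parameters: a list of value lists, one per input variable.
    Returns a list of test tuples in which every pair of values
    across any two variables appears at least once.
    """
    # Every (variable i, variable j, value a, value b) pair that must occur.
    uncovered = set()
    for (i, vals_i), (j, vals_j) in combinations(enumerate(parameters), 2):
        uncovered.update((i, j, a, b) for a in vals_i for b in vals_j)

    candidates = list(product(*parameters))
    suite = []
    while uncovered:
        # Pick the candidate covering the most still-uncovered pairs.
        best = max(candidates, key=lambda t: sum(
            (i, j, t[i], t[j]) in uncovered
            for i, j in combinations(range(len(t)), 2)))
        suite.append(best)
        for i, j in combinations(range(len(best)), 2):
            uncovered.discard((i, j, best[i], best[j]))
    return suite
```

For the A/B/C setup above, the full cross product contains 3 × 3 × 2 = 18 combinations, while a pairwise suite needs only around nine tests to cover all 21 value pairs.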

For the airport parking lot example, we can apply this approach to the equivalence classes. For example, I might take examples for 0, 1, and 3 weeks, for 0, 1, 3, 6, and 7 days, and for 0, 1, 3, 5, and 6 hours. The pairwise algorithm will then come up with the test cases as listed in Table 9.5.

Table 9.5. Pairwise Testing Examples for the Parking Lot Example

Image

You might note that we still have to calculate the output values for the parking costs in this example. I will leave this as an exercise for the reader.

Cut Examples

Over time your test suite will grow larger and larger. There seem to be two magic boundaries, depending on how fast your test suites execute.

The first boundary exists when you reach an overall duration of more than several minutes. Before this happens, you feel comfortable getting feedback from your test suite regularly. You execute your tests nearly as often as your unit tests. But when the acceptance test suite starts to take more than about ten(-ish) minutes, you start to get annoyed by the long execution times. You run your tests less regularly, most often once for every check-in, but seldom more. By doing so, you actively delay the feedback you can get from your tests. This comes with the risk of introducing bugs into your code base that go unnoticed for some time. When working in a development team, this may even lead to confusion if someone integrates the faulty code into his or her latest changes.

The second magic boundary exists at an execution time of about two to three hours for your regression test suite. Lisa Crispin commented that her team strives to keep their automated non-unit tests below a 45-minute boundary; otherwise programmers stop paying attention to the results. Up to this point you can safely execute your tests. Beyond this boundary, tests seem to start to degenerate. More and more tests fail in the continuous integration system [DMG07]. At this point, you slowly start to rerun more and more failed tests from the nightly build as part of your daily work. This eats up more and more of your time, giving you less and less time for new tests or for fixing problems in the existing ones. Just like slowly boiled frogs, you become more and more subject to the Backward Blindfold [Wei01, page 125]:

The fish is always the last to see the water.

In order to prevent this, you should seek opportunities to speed up your tests. This might mean reorganizing your acceptance tests so that they execute against a lower-level interface—for example, using the model in an MVC architecture instead of the user interface or view. You may also seek opportunities to mock out some portions of the slower subsystems, like a database or a third-party component. This works just fine for some time but comes with at least two serious trade-offs. First of all, you create a larger gap between the application that your tests exercise and the application that users are facing. In the next section, “Consider Gaps,” we will cover this risk. The second risk is that you will have fewer options to cut your examples the next time you hit a boundary for your test execution time. When you have run out of options to speed up your executable examples, the only option you may have left is the Queen of Hearts’ option in Lewis Carroll’s Alice’s Adventures in Wonderland: Off with their heads [Car65].

While this may sound extreme, regression testing typically finds just 23% of the problems [Jon07]. Automated regression tests find a minority of the bugs [KBP01]. Since your examples might serve as regression tests, you are probably fine keeping the 23% that actually catch the regressions you’re interested in. Of course, the problem is that you will seldom know in advance which of the tests give you those 23% of regression failures. That’s why teams start building up a large set of tests. While this might give you a warm and cozy feeling of confidence for some time, once you hit the second boundary of execution time with your regression test suite, it is surely getting in your way.

I remember a project where we had a regression test suite that executed for about 36 hours. We didn’t know exactly how long it took to execute, since we were not able to run the whole test suite in a single run. While the number of regression tests seemed overwhelming, the sheer number of tests was no longer serving any purpose at all. Two months later we had cut the regression suite down to the 10% that provided us with quick feedback if something was broken, and thereby shortened the feedback loop for our regression test suite. Needless to say, we would have been doomed without such a step.

When cutting down examples initially, standard testing techniques might serve you well. In retrospect, with the 36-hour test suite, we might have used a pairwise approach to cut down the examples. To recall the earlier section “Pairwise Testing,” the pairwise approach relies on combining any two factors together once. The number of tests to execute may be reduced drastically while still providing a warm and cozy coverage of the underlying business rules.

If you have already applied some cut-down techniques like combinatorial reduction and still face a five-hour test execution time, you may want to separate out even more tests. Some teams start to organize their automated tests into smaller suites by using tags for certain risk areas. You may tag some of your tests for smoke testing and apply a staged build [DMG07] to your test suites. With a staged build you execute a small set of tests that address the highest-risk areas first. Only if all of them succeed do you execute the larger set of tests that cover the remaining areas.
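A staged build can be sketched as a simple two-stage runner. The Test class, the "smoke" tag, and the boolean pass/fail protocol below are illustrative assumptions, not the API of any particular CI tool:

```python
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class Test:
    name: str
    run: Callable[[], bool]          # returns True on pass, False on failure
    tags: set = field(default_factory=set)

def staged_build(tests):
    """Run the 'smoke'-tagged tests first; execute the remaining,
    slower tests only if every smoke test passed."""
    smoke = [t for t in tests if "smoke" in t.tags]
    rest = [t for t in tests if "smoke" not in t.tags]
    for stage in (smoke, rest):
        results = {t.name: t.run() for t in stage}
        if not all(results.values()):
            return False  # stop early: don't spend time on later stages
    return True
```

The payoff is the early exit: a broken smoke test stops the build within minutes instead of after the full multi-hour run.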

Finally, some teams even automate all the examples in the iteration in parallel to the production code. Once the tests pass, they delete most of the tests, eventually leaving a very basic set of regression tests for the automated test suite. Gojko Adzic found several such examples when he interviewed 50 teams in 2010 on their usage of Specification by Example [Adz11]. Effectively, these teams pushed their test execution times back below the first boundary. They enabled rapid feedback while sacrificing the coverage they got from their automated tests, or rather stopped pretending to have such coverage in the first place. This approach is fine as long as you consider all the gaps.

Consider Gaps

When automating testing, we should consider which bugs we are not finding while automating [KBP01]. This lesson from software testing is mostly concerned about opportunity costs of test automation. For any test that we automate, there is a set of tests that we will not run—no matter whether they are automated or manual tests.

Speaking of gaps in your ATDD approach, you have to find the right balance for automating tests versus running manual tests for areas and risks that your automated tests will not cover. Test automation alone leads to an unbalanced approach. Consider, for example, an application where you can enter personal information for your clients. Your automated tests cover the creation of one client, the business rules for creating the client, and field interdependencies like invalid zip codes or a mismatch between zip code and city.

In this hypothetical scenario you don’t run any manual tests, and when your application gets released to your customers, and they start using it, you get lots of complaints. Almost all of them state that your application is hard to use since the order for navigating through your application is unnatural. If you enter a single data value, the problem does not become apparent, but if you have to enter thousands of client records, you will get screwed.

Once I had to use such an application. The problem was the tab order for the fields on the data entry page. When I entered the last name and hit Tab, the cursor would jump to the birthday field before switching to the first name. Previously, we had a DOS-based application in use, where the tab ordering was not a problem at all. When switching to an application under Windows, and migrating the 300 accounts, I noticed the problem and got fed up with it pretty quickly.

What should you do if you were the vendor of that application? Of course, you could describe the tab-ordering feature using examples and automate these examples. With every change to the code base you could catch this sort of bug. The problem is that such a test would be heavily bound to the particular implementation of the user interface. With every new field in your input screen you would probably have to change that test.

The alternative would be to schedule a time-boxed session of exploratory testing [KFN99]. In such a session you could explore the usability of your interface design. With every change to the user interface you would schedule a similar session to explore the usability of the new screen layout again.

Facing these two alternatives, almost all attendees of my courses and workshops explained that they would go for the exploratory alternative. Reasons for this include the creation and maintenance costs for the automated tests and the return on that investment.

While this may seem like an exceptional example, Jurgen Appelo teaches the Unknown-Unknowns Fallacy in his course on Management and Leadership. Pretending to have covered all risks is a trap that many managers fall into [App11]. Scientists believed for centuries that all swans were white. The first occurrence of a black swan proved them wrong. Hard-to-predict and rare events can have high impact [Tal10]. In fact, thinking that disaster is impossible more often than not leads to an unanticipated disaster. Jerry Weinberg coined the term Titanic Effect for this observation [Wei91].

If you think that you have considered all the gaps, you have probably forgotten the black swan, the unknown unknown. Rather than relying on a single approach to test your software, compensate for the possibility that test automation alone will not solve all your problems. There will be gaps in your testing concert, and you as the conductor should bring the right orchestra to the audience.

Build Your Testing Orchestra

The testing quadrants as discussed in detail in Lisa Crispin and Janet Gregory’s Agile Testing [CG09] book and first described by Brian Marick [Mar03] might reveal additional testing activities to yield a balanced approach to your application. In brief, testing activities can be split along two dimensions. The first dimension covers tests that either support your team on your project or critique the product’s value to the customer. The second dimension spreads orthogonally between technical tests versus business-facing tests (see Figure 9.2).

Image

Figure 9.2. The testing quadrants can help you realize gaps in your testing approach

The testing quadrants separate activities into four quadrants. The first quadrant covers technical tests that support the team. Unit tests and intermodule integration tests on a low level usually fall into this category. In the second quadrant are tests that support the team and have a business focus. These tests include automated acceptance tests as covered in this book. The third quadrant concerns business-facing tests that critique the product. Some examples of these are alpha and beta tests, but also user acceptance tests, usability tests, and exploratory testing. The fourth quadrant reminds us of those tests that critique the product on a more technical level. The most prominent test techniques in this quadrant include performance, load, and stress tests.

As you can see, the acceptance tests that you create as a side effect of ATDD are concerned with the second quadrant. If you focus solely on the business-facing tests that support your team, you are likely to miss performance problems, usability issues, or simple problems in your code quality that will cause you pain in the long run.5

Not covering these three areas leaves your application and your project vulnerable to a lot of risks.

Finally, if you don’t agree with me on this, you may check out the parking lot calculator page. It is a real page based on real requirements. The requirements are fulfilled in the version I reference in this book. Although all the acceptance tests pass, there are subtle and even obvious bugs in the calculator. See how many you can find within half an hour, then come back and read this section again, and see if the experience changed your mind about it.

Summary

When thinking about getting started with ATDD, play around with different ways to express your data. Try writing down your examples in tabulated form, then change them to BDD style. Only when you come up with a representation that makes sense to you, your team, and your business representative should you start to look for a framework that could suit your purpose. Make the choice of framework a team decision. Keep in mind as well that different drivers for your application (for example, Selenium or Watir for web pages, SWTBot for SWT applications) have different side effects. Start with the most promising guess, implement one test from end to end, and then revisit your decision. After getting some initial experience, you will see benefits and drawbacks, and can make a better-informed decision.

Getting down the first set of examples usually is not enough. Use test design knowledge to refine your initial set of examples. Boundary values, pairwise approaches, and domain testing can help you with that, and your testers should know about such techniques. If you want to learn more about test design techniques, take a look at A Practitioner’s Guide to Software Test Design [Cop04].

Over time your test suite should grow. At some point you will face the problem that it takes too long to execute all the tests. For some time you might be able to stick with separating the tests into different sets or running them overnight. But eventually you will find that you have to get rid of some of your tests. Ninety minutes of execution time seems to be a magic boundary for this. Get together with the whole team and talk about ways to get the feedback from your acceptance tests in a more timely manner.

There might be gaps in your overall testing strategy. Consider the four testing quadrants and whether you have covered every necessary point in them: Do you run exploratory tests regularly? When did you last invite a user for some usability testing? What about load and performance testing? ATDD addresses just the business-facing quadrant that helps your team move forward. Acceptance tests are an essential part that most teams leave out, but you shouldn’t sacrifice too much from the other quadrants either.
