Chapter Four

Advances in Techniques for Test Prioritization

Hadi Hemmati    University of Calgary, Calgary, AB, Canada

Abstract

With the increasing size of software systems and the continuous changes that are committed to the software's codebase, regression testing has become very expensive for real-world software applications. Test case prioritization is a classic solution in this context. Test case prioritization is the process of ranking existing test cases for execution with the goal of finding defects sooner. It is useful when the testing budget is limited and one needs to bound test execution cost by running only the top n test cases, where n is determined by the budget. There are many heuristics and algorithms for ranking test cases. In this chapter, we will see some of the most common test case prioritization techniques from the software testing literature as well as trends and advances in this domain.

Keywords

Test case prioritization; Code coverage; Diversity; Fault detection; Regression testing; Search-based testing; Multiobjective; Execution cost; Scalability

1 Introduction

Testing is one of the most practical methods of software quality assurance in industry [1,2]. The demand for testing is growing with the widespread application of Agile methodologies and the emphasis on continuous integration and delivery [3]. Such methodologies and approaches ideally suggest running all test cases of the system after every modification to the code (i.e., after fixing a defect or adding a new feature to the software under test) to make sure that (a) the new changes are being tested with some newly developed test cases, as soon as possible, and (b) the new modifications do not break any of the already correct functionalities (regression testing [4]). However, frequent rerunning of all test cases on large-scale systems with many test suites is very expensive. It is often impossible to rerun all test cases after every commit to the code. To get a sense of how many test executions large software companies are dealing with, let's look at the code commits at Google in 2013–14 reported in Ankit Mehta's opening talk at the Google Test Automation Conference in 2014 [5]. In that talk, Mehta explains that there were around 30,000 commits per day in early October 2014 in the Google codebase. This resulted in around 100 million test runs per day at Google. A newer statistic from a keynote speech by John Micco from Google at the International Conference on Software Testing, Verification, and Validation (ICST) in 2017 shows that at Google 4.2 million individual tests run continuously. They run before and after code submission, which results in 150 million test executions per day (on average 35 runs per test per day).

One strategy to tackle the problem of “too many test cases to rerun frequently” is scheduling the reexecution of test cases less frequently, e.g., once a day, as part of a nightly integration build, or before each major release [6]. There are at least two pitfalls of this practice: (1) you may still end up with larger-than-your-budget test suites to execute, especially if you opt for regular schedules such as nightly builds, and (2) by postponing the test executions you let the defects accumulate on top of each other, which makes the debugging process more difficult, whereas if you rerun your regression test suite after each change you will end up debugging a very small set of code changes [7]. Therefore, to follow continuous integration principles in large-scale systems, we need to choose a subset of all test cases to be executed in each build; or ideally, prioritize the test cases so that depending on the time constraints of the build, only the most important tests are executed.

The solution to the above problem is called “test case prioritization” (TCP). The concepts of selecting a subset of test cases or prioritizing them are not new in software testing. They are mostly used in the context of regression testing, where researchers have proposed several techniques to effectively minimize, select, and prioritize the existing test cases. The goal of test minimization is to eliminate redundant test cases (e.g., those that verify the same functionality in the same way) from the test suite. Test case selection, on the other hand, selects a subset of test cases for execution, which are more relevant to the current (untested) change(s). Finally, test case prioritization techniques do not throw away any test case and only rank the existing test cases for execution, based on a heuristic [4]. In other words, when test cases are prioritized, one executes the test cases in the given order, until the testing budget is over. Thus, a TCP's target is to optimize the overall fault detection rate of the executed test cases for any given budget [8].

In this chapter, we specifically focus on TCP techniques. The goal of this chapter is to give an introductory read on the advances in test prioritization. Thus we cover both basic techniques and more advanced ones. However, we do not survey all existing techniques in detail (there are already many surveys and literature reviews in this domain for interested readers [4,6,9–11]).

The rest of this chapter is organized as follows: in Section 2, the TCP problem is introduced. Section 3 categorizes TCP studies from two perspectives (their input resources and optimization heuristics). Section 4 discusses different approaches for evaluating a TCP technique, and Section 5 summarizes the most important trends and future directions in this context. Finally, Section 6 concludes this chapter.

2 Test Case Prioritization Problem

Thomas et al. [12] define the problem of TCP as follows: “TCP is the problem of ordering the test cases within the test suite of a system under test (SUT), with the goal of maximizing some criteria, such as the fault detection rate of the test cases [13]. Should the execution of the test suite be interrupted or stopped for any reason, the more important test cases (with respect to the criteria) have been executed first.” A more formal definition of the TCP problem can be found in Ref. [14] as follows:

Definition 1

(Test case prioritization) Given: T, a test suite; PT, the set of permutations of T; and f, a performance function from PT to the real numbers. Find:

$T' \in PT \ \text{s.t.} \ \forall T'' \in PT,\ T'' \neq T' : f(T') \geq f(T'')$

In this definition, PT is the set of all possible prioritizations of T and f is any function that determines the performance of a given prioritization. The definition of performance can vary, as developers will have different goals at different times [14].

In this chapter, we break down the TCP problem into even smaller pieces. Essentially, each test prioritization technique needs to rank a set of test cases based on a performance function, as explained earlier. However, it is also important to know what information and knowledge about the test cases (and possibly the SUT) we have at our disposal. In other words, the type of input plays an important role in designing TCP algorithms. For instance, assume we have access to the code coverage information of each test case from its most recent execution. If we have such knowledge, we can incorporate it into the ranking algorithm. For example, one can simply sort test cases based on their most recent code coverage. Basically, the performance function here is code coverage and the hypothesis is that test cases with higher code coverage are more likely to detect faults. However, if we do not have access to code coverage at all, we cannot use this heuristic for our TCP algorithm and need to design a different TCP technique.
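
As a toy illustration of this simple heuristic, the following Python sketch (with made-up test names and coverage numbers, not taken from any cited study) just sorts test cases by their most recently measured coverage:

    # A minimal sketch: rank test cases by their most recent code coverage.
    # The test names and coverage values below are hypothetical.
    def prioritize_by_coverage(recent_coverage):
        """recent_coverage maps a test case name to its last measured coverage (0.0-1.0)."""
        return sorted(recent_coverage, key=recent_coverage.get, reverse=True)

    recent_coverage = {"T1": 0.60, "T2": 0.80, "T3": 0.20}
    print(prioritize_by_coverage(recent_coverage))  # ['T2', 'T1', 'T3']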

Researchers have proposed and evaluated many TCP techniques, based on a range of input data sources and prioritization algorithms; Yoo and Harman [4] and Kazmi et al. [9] provide two (relatively) up-to-date surveys of TCP techniques, which we recommend to interested readers. In this chapter, however, we will look at the TCP problem from a slightly different perspective and categorize many existing techniques (including most recent ones that are not listed in the above surveys) into different classes. We also provide a summary of trends and future directions in TCP research.

To categorize TCP techniques, we first need to generalize the steps that most TCP solutions take as follows:

  1. Represent each test case by encoding the information you have about the test case that you are going to use for the prioritization. For example, encode a test case by its code coverage or historical failure record.
  2. Define your performance function in the form of an objective to achieve, based on the information in hand (the encoded test cases). For example, your objective can be achieving maximum statement coverage or maximum diversity among test cases.
  3. Apply an optimization algorithm (anything from a simple Greedy-based sorting to a more sophisticated technique such as a Genetic Algorithm) to identify the optimal ordering of test cases, with respect to the above objective.
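
To make these three steps concrete, here is a minimal Python skeleton (an illustration only; all names and data are hypothetical, not taken from the cited literature) that plugs a simple coverage encoding and objective into a Greedy-style optimizer:

    # Hypothetical skeleton of the three generic TCP steps: encode, define objective, optimize.
    from typing import Dict, List, Set

    def encode_by_method_coverage(raw: Dict[str, Set[str]]) -> Dict[str, Set[str]]:
        # Step 1: here a test case is encoded simply as the set of features (methods) it covers.
        return raw

    def objective_total_coverage(prefix: List[str], encoded: Dict[str, Set[str]]) -> int:
        # Step 2: performance of an ordering prefix = number of distinct covered features.
        covered = set().union(*(encoded[t] for t in prefix)) if prefix else set()
        return len(covered)

    def greedy_optimizer(encoded, objective) -> List[str]:
        # Step 3: grow the ranking one test at a time, always taking the best next test.
        remaining, ranked = set(encoded), []
        while remaining:
            best = max(remaining, key=lambda t: objective(ranked + [t], encoded))
            ranked.append(best)
            remaining.remove(best)
        return ranked

    tests = {"TC_a": {"login", "logout"}, "TC_b": {"login"}, "TC_c": {"search"}}
    print(greedy_optimizer(encode_by_method_coverage(tests), objective_total_coverage))
    # ['TC_a', 'TC_c', 'TC_b']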

Based on the above steps, we need to, at least, identify two factors for a TCP solution: (a) what are the input resources at our disposal for encoding test cases and setting up our objectives? and (b) what is the optimization strategy to achieve the best performance (most cost-effective approach)?

3 Categorizing TCP Techniques

There are many TCP techniques available [4], where each technique may have access to different types of information and use different strategies to achieve the TCP's objective. In this section, we will categorize TCP techniques from two perspectives: the “Input Resources for TCP” and the “Heuristics and Optimization Strategies for TCP.”

3.1 Input Resources for TCP

In Ref. [8] the authors have summarized the most common input resources for a TCP technique. We provide an extended version of that categorization here, as follows:

  •  Change information: This source of information is quite common in the context of regression testing, where the overall idea of a TCP technique is prioritizing test cases that cover the changed part of the system over those that cover the same old code. The rationale behind this idea is that the defects that could be detected in the old code by the regression test suite are already detected; thus the limited testing budget should be assigned to the new/modified code.
    The main challenge here is to find out which test cases are affected (directly or indirectly) by each change, which is called change impact analysis [15,16]. Then the test cases can be represented using a simple Boolean variable that shows whether the test case is in the affected subset or not. For instance, one can apply this idea to method/function-level changes, which means that if Method A changes, all test cases that directly cover Method A are called “affected.” In addition, any test case that covers any Method X that depends (directly or indirectly) on Method A will be denoted as “affected,” and all “affected” test cases will be prioritized higher than the other test cases (a minimal sketch of this affected-test marking is given after this list).
    Note that the overall ranking based on these Boolean values (“affected” or not) will not result in a total order. To make the ranking more precise, one can consider this information together with other sources of information for prioritization. It is also worth mentioning that the level of granularity may differ for change impact analysis; examples are the class, method, and statement levels. Finer-grained analysis is more accurate but also more costly.
    One of the caveats of applying this approach to large-scale codebases with frequent changes (commits) is that one may end up with a very large set of affected test cases. For instance, in a recent study [17], Memon reports that at Google, with over 150 million test executions and 800K builds per day, change impact analysis is not feasible after each change (the cost was growing quadratically, with two multiplicative linear factors: the code submission rate and the size of the test pool). Google's approach to this issue is not to repeat the analysis after every change but to postpone it to certain milestones (every 45 min). However, the size of the affected test suite may still be huge even with such milestones (e.g., one change alone affected over 150K test cases). Memon et al.'s improvement is, in addition, to exclude affected tests that are very distant from the original code change (10 levels in the dependency graph).
  •  Historical fault detection information: Another quite common source of information for a TCP technique is the historical performance of the test cases. The overall idea here is that the test cases that detected defects in the past should be prioritized over those that have not detected any defects so far. The rationale behind this idea is that test cases that (frequently) failed in the past most probably cover a challenging and complex part of the code, so it is wise to rerun them in the new build as well. This idea is the cornerstone of many defect prediction research studies, where the assumption is that a part of the code which used to have many defects is a risky part of the code and its chance of being defective in the new version is also high [18,19].
    An easy way of representing test cases using this information is again a Boolean variable that indicates whether the test case has ever failed in the history of its execution [20,21]. A more complex representation of this information would use a weighted approach to combine past failures. In other words, rather than just identifying whether a test has ever failed or not, we would consider how many times it has failed and whether it failed in more recent releases or older ones. In a recent study by Elbaum et al. [22], previous faults have been used as a basis of prioritization in the context of continuous integration at Google. Hemmati et al. [8] have also used a weighted history-based approach for prioritizing test cases. In their approach, the test cases that have failed more recently are given more priority (a small sketch of such a weighted scheme is given after this list).
    Noor and Hemmati [23] suggest that instead of looking for test cases that have failed in the past, we calculate the similarity between past failing test cases and the current set of test cases to be prioritized. In their approach, a probability of failure is assigned to each current test case based on how close it is to past failing test cases and how recent those failures were.
    A variation of this high-level principle of history-based prioritization may assign a higher rank to test cases that are similar to test cases that detected more severe faults [24]. To implement this, each test case needs to be represented by the severity of the defect(s) it has detected so far. It is quite common practice to use this information together with other sources in a TCP [25,26].
  •  Code coverage data: Another common resource that is being used in TCP techniques is the code coverage of each test case [4,14]. Code coverage can be measured in many different variations (e.g., statement, branch, method coverage). The coverage can be extracted by analyzing either the program execution (dynamically) or the test and source code (statically). Dynamic analysis is more accurate, but it requires a real execution of the test cases. However, executing test cases to calculate the coverage is not an option for TCP, due to the nature of the problem (the limited testing budget). Therefore, the dynamic code coverage data can only be used, if test case coverage data is already available from previous executions [8]. Note that this data may not be perfect because (a) the coverage data for the new test cases is not available yet, (b) the code changes between two releases may reduce the accuracy of such coverage data, and (c) in many scenarios, instrumenting the code for dynamic analysis and keeping all coverage data from previous versions are not practical.
    In the absence of execution information, coverage-based TCP techniques rank the test cases solely based on the static analysis of the test cases and/or the source code. For example, one can calculate method coverage of test cases by extracting the sequences of method calls in the source code for the given test cases by static analysis [27]. Of course, the availability of test scripts that execute specific method calls is a prerequisite here, which does not hold in some cases like manual test cases written in natural language. In addition, as explained, static analysis is not as precise as dynamically executing the system, when it comes to coverage calculations.
    There are at least two ways of performing static analysis for test case coverage calculation. The cheaper one, which is the least accurate, only analyzes the test script and extracts the first-level method call sequence. The more accurate and more expensive way, however, follows each first-level method call into the source code and extracts all the nested method call sequences from the code as well. As a simple example, assume our abstract test scripts for two test cases are as follows (note that each test script is represented by a vector of method calls and all other execution details are omitted):

    T1: 〈A, B〉 and T2: 〈C〉


    Using the first way of static analysis (only using the test code for coverage calculation), T1 is ranked higher than T2, given that it covers two source code methods, A( ) and B( ), whereas T2 only covers one, C( ). However, assume a scenario where A( ) and B( ) do not call any other methods in their body, but C( ) calls D( ) and E( ), and E( ) calls F( ). Following the second way of static analysis (using the test and source code for coverage calculation), we should represent T1 and T2 as follows:

    T1: 〈A, B〉 and T2: 〈C, D, E, F〉


    Thus T2 will be ranked higher, according to the represented code coverage data.
  •  Specification or requirements: In specification-based testing (aka model-based testing), each test case can be easily traced back to the specification models. Therefore, a typical model-based TCP technique has access to the specification models of the software under test. A typical scenario for model-based testing is specifying the software by a state machine and test cases by paths in the state machine. In such a scenario, a TCP technique would prioritize test cases by considering which paths in the model are executed by which test cases [28–30], which is called model coverage.
    There are also TCP techniques that prioritize test cases based on the software requirement artifacts [31–33]. Basically, if one knows what features of the software are tested by which test cases, one can prioritize test cases by “feature coverage.” In other words, rather than focusing on the specification coverage we focus on the requirement coverage. The main challenge of all TCP techniques in this category is that they require extra information about the software, which is commonly not available (or up to date).
  •  Test scripts: Finally, there are a few TCP techniques that only look at the test scripts as a source of information. These TCP techniques are usually applicable in a wider range of domains, where the other mentioned sources of data may not be available. Given that the only available resource in this category is a textual document (the test script), the proposed techniques are typically some sort of text analysis technique, which can become quite advanced.
    For instance, Thomas et al. [12] derive a “topic model” from the test scripts, which approximates the features that each test case covers. Topic modeling is a statistical text mining technique that groups related words of several documents into multiple bags called topics. Each document is assigned a membership value for each topic, based on how frequently the document discusses the topic. This idea has been used in this work to estimate how frequently each test case discusses (i.e., covers, in our context) each feature of the code.
    There are also cases where simpler text analysis techniques have been used for TCP. For example, the test scripts are interpreted as strings of words [34,35], without any extracted knowledge attached to them. Given such data per test case, TCP techniques in this category may target different objectives such as diversifying test cases [34,35] or maximizing their coverage [12], which we explain in the next subsection.
    A variation of “test scripts” (which are typically used for automated testing) is “test instructions” (which are typically used in manual testing). These instructions are designed based on test plans and are given to the manual testers to follow, typically on the graphical user interface (GUI) of the software. Test instructions are similar to test scripts in that they are black box and have no access to source code, but they differ from test scripts in that they are written in natural language, which makes it harder for an automated TCP technique to work with them. As with test scripts, one can look at a test instruction as a document and apply text mining techniques to extract important features [8,36]. These ideas will be explained in more detail in the next subsection.
  •  GUI and web events: Event-driven software is quite common nowadays. According to Memon et al. [37] “such systems typically take sequences of events (e.g., messages, mouse-clicks) as input, change their state, and produce an output (e.g., events, system calls, text messages).” Examples include web applications, GUIs, mobile applications, and embedded software. Modeling test cases based on their sequence of events is a common approach in test generation and prioritization in this domain. For instance, Bryce et al. [38,39] used the notion of t-way event interaction [40] coverage as a heuristic for test prioritization, where the goal is covering all t-way event interactions, as soon as possible.
    In another work, Sampath et al. prioritize web applications’ test suites by test lengths, frequency of appearance of request sequences (which are the “events” in this context), and systematic coverage of parameter values and their interactions [41].
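
To illustrate the change impact idea from the “Change information” item above, here is a hypothetical Python sketch (the call graph and coverage data are made up) that marks tests as “affected” when they cover a changed method or any method that transitively depends on it:

    # Hypothetical sketch of Boolean "affected" marking for change-aware TCP.
    # dependencies maps each method to the methods it calls; test_coverage maps each
    # test to the methods it directly covers. Both data sets below are made up.
    from collections import deque

    def transitively_impacted(changed, dependencies):
        """Return the changed methods plus all methods that (directly or indirectly) depend on them."""
        callers = {}  # invert the call graph: who calls whom
        for caller, callees in dependencies.items():
            for callee in callees:
                callers.setdefault(callee, set()).add(caller)
        impacted, queue = set(changed), deque(changed)
        while queue:
            m = queue.popleft()
            for caller in callers.get(m, ()):
                if caller not in impacted:
                    impacted.add(caller)
                    queue.append(caller)
        return impacted

    def affected_tests(changed, dependencies, test_coverage):
        impacted = transitively_impacted(changed, dependencies)
        return {t: bool(covered & impacted) for t, covered in test_coverage.items()}

    deps = {"X": {"A"}, "Y": {"B"}}                # X calls A, Y calls B
    cov = {"T1": {"A"}, "T2": {"X"}, "T3": {"Y"}}  # direct method coverage per test
    print(affected_tests({"A"}, deps, cov))        # {'T1': True, 'T2': True, 'T3': False}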
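
Similarly, for the “Historical fault detection information” item, the following small sketch (the decay factor and verdict histories are illustrative assumptions, not values from Refs. [8,22]) shows a weighted history-based priority in which recent failures count more than old ones:

    # Hypothetical weighted history-based priority: more recent failures weigh more.
    def history_priority(failure_history, decay=0.5):
        """failure_history is a list of 0/1 verdicts, oldest first (1 = failed)."""
        n = len(failure_history)
        # The most recent run has age 0 and full weight; older runs decay geometrically.
        return sum(failed * decay ** (n - 1 - i) for i, failed in enumerate(failure_history))

    histories = {
        "T1": [1, 0, 0, 0],  # failed long ago
        "T2": [0, 0, 0, 1],  # failed in the most recent build
        "T3": [0, 0, 0, 0],  # never failed
    }
    ranked = sorted(histories, key=lambda t: history_priority(histories[t]), reverse=True)
    print(ranked)  # ['T2', 'T1', 'T3']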

Note that the availability of any of the above resources is very context dependent. However, in general, the type of testing heavily influences the available input resources. For instance, in the case of white-box unit testing, TCP techniques typically have access to both the test and source code, but in black-box testing the TCP techniques only have access to the test code and potentially specification or requirements information. In the context of regression testing, history of failures and change data may also be available, and so on. Nevertheless, the take-home message is that TCP techniques can be categorized based on their available input resources, which define how the test cases should be represented. The other perspective for categorizing TCP techniques is, given test cases that are represented in a particular way (i.e., any representation that is applicable based on the above discussion of the available resources), what heuristic and algorithm the TCP uses. In the next section, we will summarize some of these heuristics.

3.2 Heuristics and Optimization Strategies for TCP

The objective of most TCP techniques falls under one of these two heuristics: (a) maximizing some sort of coverage criterion and (b) maximizing diversity between test cases. There are several approaches to achieve these objectives. In the rest of this section, I will first explain the two main heuristics, in general, and then summarize some of the common strategies to realize those objectives.

3.2.1 TCP Heuristics

In this section, we will see the two most common high-level heuristics that are used in TCP techniques.

  •  Maximizing coverage. Generally speaking, one can represent a test case as a set or sequence of some “features” or “aspects” of the SUT that are covered (i.e., being exercised, or verified, or just executed) during test execution. These “features” are defined based on the available “input resources for TCP,” explained in the previous section. For instance, a test case can be represented by a set/sequence of code elements (e.g., statements or methods) it covers. Assume we represent test cases by sets of method names that they cover. Now we can create a total “feature set,” which includes all method names that our test cases can possibly cover. A “maximizing coverage” heuristic (or as it is usually called a “coverage-based” approach) suggests that a TCP technique must order test cases in a way that the test executions result in a high coverage of those features, as soon as possible. The intuition behind this strategy is that by maximizing “feature coverage,” the test cases execute as much of the SUT as possible and thus increase their chances of fault detection.
    Though the idea of “maximizing coverage” is most commonly applied to code coverage, as in the example above, the generic idea is applicable in most situations, with respect to the available input resources and our testing perspective. Basically, given our choice of available input resources, what we consider as a feature defines what aspect of the software we want to test as soon as possible. Thus, all we need to do is to represent the test cases by their covered features and then try to order the test cases so that higher coverage of those features is achieved as soon as possible.
    Feature extraction: As mentioned, representing test cases by their covered features is the key in this heuristic. Features can be defined based on code, specification, requirements, GUI, etc. Sometimes features are explicit and easy to map, e.g., when a feature is a code element (like statements or method calls), representing test cases by their covered code elements is straightforward. Sometimes the features are explicit but not easy to map to test cases. For instance, in the case of requirement features, one might know exactly which requirements of the software are supposed to be tested, but there is no easy way to (automatically) map a test case to requirement features. Thus representing test cases by their covered requirement features is not as easy as representing them by their code-level features.
    Finally, there are cases where the feature is implicit. In other words, the targeted feature is estimated by some other surrogate metric. For example, recall the case of input resources for manual testing from the previous section. In that case, the authors' [42] targeted features are the GUI elements that are covered by each test case. However, all they have access to (the input resources for TCP) is the test instructions. In such a case, they need to estimate the features covered by each test case using the information at their disposal. In that particular study, the authors used an NLP-based text mining approach to estimate the exercised GUI elements per test case and represented each test case using a vector of estimated GUI coverage information.
    Obviously, the most accurate coverage-based TCP techniques are those that have explicit access to the features and an automated, precise way of collecting the feature coverage data from the test cases (e.g., a dynamic code coverage-based approach). But the overall idea of “maximizing coverage” can be applied to any extracted (or estimated) feature set.
    Note that there are several strategies to maximize coverage and we come back to that after explaining the other major heuristic for TCP, which is maximizing diversity.
  •  Maximize diversity. The overall idea behind this heuristic is allocating testing resources to a diverse set of “features.” The intuition for this strategy is that a set of test cases that are similar will likely detect the same faults, and therefore only one test in the set needs to be executed. The rationale behind this idea is that if we are short on testing resources (which is the motivation for TCP) and we follow a “maximizing coverage” approach, we might end up having high coverage of some similar features, but no or very low coverage of some other features.
    As an example, assume that the software under test is a text editor application. The “features” are menu items in the GUI. You have three tests (T1–T3) to prioritize, where each test covers one or more menu items as follows:
    •  T1 covers “open,” “copy,” “paste,” and “save.”
    •  T2 covers “open,” “cut,” “paste,” and “save as.”
    •  T3 covers “new document,” and “close.”

    In this example, we have eight features that can be covered by the test cases: “open,” “copy,” “cut,” “paste,” “save,” “save as,” “new document,” and “close.” Now assume that our coverage-based TCP has already picked T1 as the first test, so our coverage is 4/8 = 50%. To choose the second test, a coverage-based heuristic does not have any preference between T2 and T3, since both will increase coverage to 6/8 = 75%. However, from the “maximize diversity” heuristic point of view, choosing T3 is preferable, since T3 is quite different from T1, whereas T2 is quite similar to T1 (as the sketch after this list illustrates).
    It should be quite clear by now that the definition of “similarity” or “difference” between test cases is key in designing a diversity-based heuristic. To be able to define “similarity” or “difference” between test cases, we start with representing (encoding) test cases as a set or sequence of “features,” exactly as in the coverage-based approach. However, the diversity-based heuristic does not focus on maximizing the coverage of features. The new heuristic tries to rank test cases in a way that for any given budget the set of selected test cases executes/covers the most diverse features.
    To measure diversity among a set of test cases, we need to define a “similarity” (or “distance”) function that works on the features covered by the test cases. A common way to do this is representing the test cases by a vector of features they cover and then apply a standard distance function to calculate the distance between every two test cases. Then aggregate the paired distances (using an aggregation function) and use the final value as a diversity metric for that set of test cases.
    There are several similarity/distance functions that can be used to determine similarity/distance among test cases. These functions will be covered in the next section.
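
The text-editor example above can be made concrete with a short sketch (illustrative only): after T1 is selected, the coverage gain of T2 and T3 is identical, but a simple Jaccard-style distance to T1 tells them apart:

    # Illustrative sketch of coverage gain vs diversity for the text-editor example.
    tests = {
        "T1": {"open", "copy", "paste", "save"},
        "T2": {"open", "cut", "paste", "save as"},
        "T3": {"new document", "close"},
    }

    def coverage_gain(candidate, already_covered):
        # How many not-yet-covered features the candidate would add.
        return len(tests[candidate] - already_covered)

    def jaccard_distance(a, b):
        return 1 - len(tests[a] & tests[b]) / len(tests[a] | tests[b])

    covered = set(tests["T1"])  # T1 has already been selected
    for candidate in ("T2", "T3"):
        print(candidate,
              "gain:", coverage_gain(candidate, covered),
              "distance to T1:", round(jaccard_distance(candidate, "T1"), 2))
    # Both T2 and T3 add two uncovered features, but T3 is farther from T1 (1.0 vs 0.67).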

3.2.2 Optimization Strategies

Optimization strategies can be the same or different for the two heuristics (maximizing coverage and diversity). In the rest of this section, I will first explain some of the common strategies for maximizing coverage and then explain optimization strategies for diversity-based approaches.

3.2.2.1 Maximizing Coverage

So far we have learned that prioritizing test cases starts with extracting features per test case based on the available resources and encoding each test case as a vector of features. Therefore, the goal of “maximizing coverage” as an optimization strategy is to rank test cases so that the highest “feature” coverage can be achieved with any given testing budget. There are several approaches to tackle this problem. Some are more common but suboptimal, and some are less intuitive but more effective. Let us start with the simplest approach, a Greedy algorithm:

  •  Greedy approach for maximizing coverage: The simplest coverage-based TCP approach that uses a Greedy algorithm works as follows:
    •  Step 1: Identify a test case with the largest number of covered “features” in the test suite (in case of ties use a tie breaker, e.g., random selection)
    •  Step 2: Append the test case to your ranked list and remove it from the test suite
    •  Step 3: If the test suite is not empty, go back to step 1

    Though this algorithm is pretty simple and practical, it has obvious shortcomings. The main issue with the simple Greedy is that it does not take into account the currently covered features while selecting the next best test. To understand this, let us look at the following example.
    Assume we have a test suite of three test cases (T1, T2, and T3) and a source code class with five methods (M1 … M5). A feature in this example is a method. Therefore, each test case is represented by the methods it covers, as summarized in Table 1 (a checked cell (X) means the corresponding test case covers the indicated method).

    Table 1

    Example of Feature (Method) Coverage of Test Cases, Used for a Greedy-Based TCP
          T1   T2   T3
    M1    X    X
    M2         X
    M3    X    X
    M4              X
    M5    X    X


    The simple coverage-based Greedy algorithm selects T2 as the first choice, the test with the highest feature coverage (four out of five). Then the algorithm picks T1 as the second highest coverage (three out of five), and the last choice is T3, resulting in the 〈T2, T1, T3〉 ranked list. However, a quick investigation shows that 〈T2, T3, T1〉 is in fact a better choice, since after two test executions one can achieve 100% coverage, whereas the simple Greedy's output requires three test executions to achieve 100% coverage (only 80% after two executions). This is simply due to the fact that the Greedy is not aware of the currently covered features.
    A variation of the Greedy-based approach modifies the algorithm by implementing a different way of coverage calculation per test case. It is usually called an “additional coverage” approach in the literature. The idea is that rather than calculating the raw coverage per test case, we assign an “additional coverage” value per test. The “additional coverage” tells us how much extra feature coverage we achieve if we append a selected test case to the so-far ranked list. For instance, in the example above, T2's “additional coverage” is 80% because the ranked list is empty. But once T2 is added to the ranked list, covering M1, M2, M3, or M5 no longer adds anything to the ranked list's coverage. Thus T1's “additional coverage” is zero and T3's “additional coverage” is 20% (going from four out of five to five out of five covered features). A sketch of both Greedy variants on this example is given at the end of this section.
  •  Local vs global optima: Both variations of the Greedy-based TCPs suffer from the same problem of “local optimum.” According to Wikipedia [43], in the field of optimization, “a local optimum of an optimization problem is a solution that is optimal (either maximal or minimal) within a neighboring set of candidate solutions. This is in contrast to a global optimum, which is the optimal solution among all possible solutions, not just those in a particular neighborhood of values.” Since a Greedy algorithm makes the locally optimal choice at each stage, in many problems, it does not produce an optimal solution [44].
    For instance, in the example below (Table 2), following the “additional coverage” Greedy algorithm results in selecting T1 as the best choice. Then after T1, any order of T2, T3, and T4 is the same from the (extra) coverage point of view. So let's say we end up with the following ranked list: 〈T1, T2, T3, T4〉. In terms of covering the methods, we get 50% coverage right away, by the first test case execution. After that we get ~ 16.7% (one out of six methods) extra coverage with each extra test case we run, until we reach 100% (with four test executions). Now look at the alternative ranked list 〈T2, T3, T4, T1〉. We will have a lower coverage to start with, ~ 33.3% (two out of six methods), but then every extra test case brings in an additional ~ 33.3% (two out of six). This results in 100% coverage by executing the first three test cases (T2, T3, and T4).

    Table 2

    Example of Feature (Method) Coverage of Test Cases, Used to Show Local vs Global Solutions in Greedy-Based TCPs
          T1   T2   T3   T4
    M1    X    X
    M2    X         X
    M3    X              X
    M4         X
    M5              X
    M6                   X


    Depending on how one measures the effectiveness of a TCP with respect to coverage (we will talk about different options in the test prioritization evaluation section), the above example can show the suboptimality of the Greedy-based solutions. Basically, if we ignore the first quick start of the first ranked list (〈T1, T2, T3, T4〉) and look at the overall coverage or how fast we achieve the maximum coverage, the second ranked list (〈T2, T3, T4, T1〉) is a global optimum and the first ranked list is a local one.
  •  Search-based coverage maximization: To escape from the local optima, the alternative TCP optimization strategy is using global search techniques. The use of local and global search algorithms has become quite common in the automated software engineering domain [45]. The overall idea is to reformulate a software engineering task as an optimization problem and then automatically search for the best solution using a metaheuristic search algorithm. Metaheuristic search algorithms are general strategies that need to be adapted to the problem at hand.
    Many software engineering problems can be reformulated as optimization problems, for which search algorithms can be applied to solve them [46]. This has led to the development of a research area often referred to as Search-Based Software Engineering, for which several successful applications can be found in the literature, with a large representation from software testing [11,45].
    In our context, the problem at hand is maximizing the coverage of a list of test cases by reordering them in a way that high coverage is achieved as soon as possible. Note that this problem, in general, is an NP-hard problem [47]. Therefore, using an exhaustive search for the global optimum in most realistic problems is not an option, since the search space size for selecting the best order of n test cases is equal to the number of all permutations = n!.
    The space of all possible permutations of n test cases represents the search space. The order with the “best coverage” (fitness value) is called the global optimum. A fitness function is defined to assign a fitness value to each permutation. A search algorithm can be run for an arbitrary amount of time. The more time is used, the more elements of the search space can be evaluated. Unfortunately, in general, we will not know whether an element of the search space is a global optimum because such knowledge would require an evaluation of the entire search space. Therefore, stopping criteria need to be defined, as, for example, timeouts or a fixed number of fitness evaluations.
    There are several types of global search algorithms that one can choose, e.g., Simulated Annealing (SA), Genetic Algorithms (GA), and Ant Colony Optimization (ACO) [45]. On average, over all possible problems, all search algorithms perform equally, and this is theoretically proven in the famous No Free Lunch theorem [48]. Nevertheless, for specific classes of problems (e.g., software engineering problems), there can be significant differences among the performance of different search algorithms. Therefore, it is important to study and evaluate different search algorithms when there is a specific class of problems we want to solve, as, for example, software testing and its subproblems. This type of comparison in software testing can be found, for example, in Refs. [4,28,49]. For the class of TCP problems, we need a global search algorithm like GA that does not get stuck in local optima. On average, in most TCP scenarios in the literature, the global search-based approaches have been shown to lead to better results than a Greedy-based solution or a local search-based approach such as Hill Climbing [50].
  •  Fitness function and area under curve: A fitness function in a search-based TCP technique can be designed in different ways. A typical approach for defining fitness values in coverage-based TCPs is using the area under a curve (AUC) concept, where the curve represents the cumulative coverage per test case execution. As we saw in Table 2's example (local vs global optima), the “Maximizing Coverage” strategy may be evaluated in different ways. One way to look at it is ordering the test cases so that we achieve 100% coverage (or whatever maximum we can achieve using our test suite) as soon as possible (with the minimum number of test executions). Another way to look at it is having the highest coverage for any given subset of test case executions, following the ordered list. Essentially, since in practice we will end up running only a subset of the test cases, and all test cases in that subset will be executed anyway, we only care about the overall coverage, but for a subset of any size. This latter evaluation method can be measured by AUC.
    To illustrate AUC measurement for TCP evaluation, one can plot the coverage of test cases on the y-axis, while the x-axis shows the number of test executions following the order of test cases given by a TCP technique. For instance, Fig. 1 compares the two orderings O1: 〈T1, T2, T3, T4〉 and O2: 〈T2, T3, T4, T1〉 from the previous example (Table 2).
    Fig. 1 Illustration of AUC measure for TCP evaluation in the example of Table 2.

    The actual AUC values for O1 and O2 in this example are AUC(O1) = 2.25 and AUC(O2) = 2.33. So according to AUC, O2 is a slightly better TCP technique. However, as we also mentioned earlier, the illustrated curves show that O2 achieves 100% coverage with three test executions, whereas O1 requires four test cases for that level of coverage, which can be a more differentiating factor between O1 and O2 than the slight difference in their AUC values. Nevertheless, AUC seems to be a reasonable measure to be used as a fitness value in search-based TCPs.
    So to summarize the search-based coverage maximization approach, we need a fitness function, which typically is the AUC metric, and using a search-based algorithm like GA we explore different permutations to achieve a higher AUC (a small sketch of such an AUC fitness function is given at the end of this section).
    The exact way that the search space is explored is defined by the search algorithm. For instance, “GAs start off by an initial (typically random) population of individuals (each of them represents an ordered set of test cases, i.e., a candidate for an optimal solution) and improve the quality of individuals (in terms of the fitness value, i.e., AUC in this context) by iterating through generations. The transition from one generation to another consists of four main operations: selection, crossover, mutation, and sampling” [51]. There are several methods to implement each of these operations, whose details are beyond the scope of this chapter. We encourage the interested readers to look at some of the related work such as Refs. [11,28,29,50,51].
    Note that the AUC as defined above is just a heuristic to guide our search toward “better” solutions. It does not necessarily mean that a TCP that results in higher coverage-based AUC is “better.” In the test prioritization evaluation section, we will talk more about what is the ideal ordering and how we can measure the “better,” in this context.
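
As promised earlier, here is a minimal sketch (illustrative only, not code from the cited studies) of the two Greedy variants discussed in this section, run on the Table 1 coverage data:

    # Hypothetical sketch of the raw-coverage Greedy and the "additional coverage"
    # Greedy, using the feature (method) coverage of Table 1.
    table1 = {"T1": {"M1", "M3", "M5"}, "T2": {"M1", "M2", "M3", "M5"}, "T3": {"M4"}}

    def greedy(coverage):
        """Raw-coverage Greedy: always pick the test with the most covered features."""
        remaining, ranked = dict(coverage), []
        while remaining:
            best = max(remaining, key=lambda t: len(remaining[t]))
            ranked.append(best)
            del remaining[best]
        return ranked

    def additional_greedy(coverage):
        """Additional Greedy: pick the test adding the most not-yet-covered features."""
        remaining, ranked, covered = dict(coverage), [], set()
        while remaining:
            best = max(remaining, key=lambda t: len(remaining[t] - covered))
            ranked.append(best)
            covered |= remaining.pop(best)
        return ranked

    print(greedy(table1))             # ['T2', 'T1', 'T3']
    print(additional_greedy(table1))  # ['T2', 'T3', 'T1'] -> 100% coverage after two tests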
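
And here is a small sketch of a trapezoidal AUC fitness function (the kind of fitness a GA-based TCP could plug in; again only an illustration), evaluated on the two orderings O1 and O2 of Table 2:

    # Hypothetical AUC fitness sketch for a search-based TCP, using the Table 2 coverage.
    table2 = {"T1": {"M1", "M2", "M3"}, "T2": {"M1", "M4"},
              "T3": {"M2", "M5"}, "T4": {"M3", "M6"}}
    ALL_FEATURES = set().union(*table2.values())

    def auc_fitness(order, coverage=table2, total=len(ALL_FEATURES)):
        """Trapezoidal area under the cumulative-coverage curve of an ordering."""
        covered, curve = set(), []
        for test in order:
            covered |= coverage[test]
            curve.append(len(covered) / total)
        # Trapezoidal rule over the points at x = 1 .. len(order).
        return sum((curve[i] + curve[i + 1]) / 2 for i in range(len(curve) - 1))

    print(round(auc_fitness(["T1", "T2", "T3", "T4"]), 2))  # 2.25 (O1)
    print(round(auc_fitness(["T2", "T3", "T4", "T1"]), 2))  # 2.33 (O2)
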
3.2.2.2 Maximizing Diversity

As I explained in Section 3.2.1, this heuristic tries to rank test cases in a way that for any given budget the set of selected test cases executes/covers the most diverse features. To do so we need to follow three steps:

  •  Encode test cases
  •  Build distance/similarity matrix
  •  Maximize/minimize distances/similarities

The encoding of test cases as a vector of features has already been discussed in earlier sections. I have also explained the overall idea of diversity between test cases and the rationale behind it, in the maximize diversity section. In this section, I will explain the distance functions and the optimization strategies for diversity-based TCPs, which are based on the thesis publication [50].

Once the test cases are encoded into vectors of features, they are given to a similarity/distance function. Note that it does not matter whether we define a distance function and maximize it or a similarity function and minimize it; either one can be transformed into the other. Therefore, we just use the distance function and maximization terminologies here.

A simple way of implementing the diversification concept is using a pairwise distance function that typically takes two sets/sequences of elements (we use “{ }” to represent sets and “〈 〉” for representing sequences) and assigns a distance value to each pair. The results of measuring all these distance values are recorded in a “distance matrix” (in case of large test suites one can also replace this matrix generation phase with an on-the-fly distance calculation).

Note that the distance matrix can be an upper/lower triangular matrix, since the distance measure should be symmetric (the similarity between test case A and test case B is equal to the similarity between test case B and test case A). Therefore, we only need to store half of the matrix.

Given an encoding, one may use different set/sequence-based similarity functions [50,52]. The main difference between them is that set-based similarity measures, as opposed to sequence-based ones, do not take the order of features into account. For example, assume we represent test cases by the method names that are covered when running the test cases. Now assume test case T1 calls A( ) and then B( ), whereas test case T2 calls B( ) and then A( ). For a set-based distance function, T1 and T2 are identical. Next, we will define some of the common set- and sequence-based distance functions.

  •  Set-based distance functions
    Set-based distance measures are widely used in data mining [50] to assess the distance of two objects described as multidimensional feature vectors, where the set is composed of the features’ values [50]. In our case, each test case is also a vector of features. Each feature in the vector is taken from a limited alphabet of possible features. However, the vector size can be different since the length of test cases may vary.
    •  Basic counting: This measure is a very basic function, which does not account for the order of features in the test case [23]. The function requires each test to be represented as a set of unique features. That means that if, for instance, we represent test cases by the method names they cover, and a test covers a method twice, the set only indicates the covered method once. The basic counting function simply looks at the two sets and counts the overlaps; the number of overlapping features indicates similarity. Thus the raw distance is the number of noncommon elements within the union of the two sets. For example, if T1:{a, b, c} and T2:{a, d}, then the raw distance value is 3 (for the uncommon features “b,” “c,” and “d”). This results in a distance metric that is between 0 (identical sets; remember that the order does not matter) and a MAX value (the size of the largest union of two sets in the test suite), representing the distance between two possibly large, completely different test cases. To normalize the metric, one can divide the raw values by MAX to get a metric in the [0,1] range.
    •  Hamming distance: Hamming distance is one of the most used distance functions in the literature and is a basic edit distance [53–55]. The edit distance between two strings is defined as the minimum number of edit operations (insertions, deletions, and substitutions) needed to transform the first string into the second [53–55]. Hamming is only applicable on identical length strings and is equal to the number of substitutions required in one input to become the second one [53–55]. If all inputs are originally of identical length, the function can be used as a sequence-based measure. However, in realistic applications, test vectors usually have different lengths. Therefore, to obtain vectors of identical length, a binary string is produced to indicate which features, from the set of all possible features of the encoding, exist in the vector. This binary string, however, does not preserve the original order of elements in the test and therefore leads to a set-based distance function.
      In our case, to use Hamming distance, each test case is represented as a binary string, where each bit corresponds to a unique feature among all possible features to cover by the test suite. A bit in the binary string is true only if the test case contains/covers the corresponding feature (e.g., the method call).
      As an example, let's calculate the Hamming distance between T1:{M1, M2, M3} and T2:{M1, M4} from Table 2. The binary string representations of T1 and T2 are as follows: T1:〈1,1,1,0,0,0〉 and T2:〈1,0,0,1,0,0〉, where the 6 bits represent M1–M6 in ascending order. Thus Hamming distance (T1, T2) = 3, since we need to flip at least 3 bits (#2, #3, and #4) in T1 or T2 to make them identical.
    •  Other set-based functions: There are many other functions that can be used as set-based functions, e.g., the Jaccard Index, Gower–Legendre (Dice), and Sokal–Sneath (Anti-Dice) measures [50], which are all defined based on a generic similarity formula, as below, but with different weightings (w):

    $similarity(A,B) = \dfrac{|A \cap B|}{|A \cap B| + w\,|(A \cup B) \setminus (A \cap B)|}$


    In the next section, we explain some of the common sequence-based distance functions.
  •  Sequence-based distance functions
    Sequence-based distance functions are more accurate in defining distances between two vectors of elements because they typically do not exclude duplicate elements and they take the order of elements into account. Note that sequence-based functions are not necessarily better for diversity-based TCP. Depending on the context, a TCP may actually want to ignore duplicates and the order of features.
    •  Edit distance function: One of the most common sequence-based distance functions is edit distance. There are several implementations of edit distance. Levenshtein algorithm, which unlike Hamming distance is not limited to identical length sequences, is a well-known implementation of edit distance [55]. In a basic Levenshtein, each mismatch (substitutions) or gap (insertion/deletion) between two sequences of elements increases the distance by one unit.
      For example, assume we show test cases by their sequence of method calls. Assume

      T1: 〈M1, M2, M1, M3〉 and T2: 〈M1, M4, M1〉


      We can align the tests based on their common elements and then calculate the basic Levenshtein score as follows:
      T1: M1  M2  M1  M3
      T2: M1  M4  M1  -

      Note that M2 and M4 are mismatches and M3 is a gap. So the distance is + 1 for one mismatch and + 1 for one missing element. That is + 2. One can normalize it by the length of the longer test as well, which in this case is 4.
      Therefore, normalized_Levenshtein (T1, T2) = 2/4 = 0.5 (a small sketch of this calculation is given after this list).
      Other variations of Levenshtein may use different scores (operation weights) to matches, mismatches, and gaps. In addition, in some variations of the algorithm (mostly used in bioinformatics), there are different match scores (alphabet weight), based on the type of matches [55].
    •  Other sequence-based functions: There are many sequence-based distance functions that are defined and used in bioinformatics for matching and aligning DNA sequences [54]. They can be categorized into global (e.g., Needleman–Wunsch) or local (e.g., Smith–Waterman) alignments. Global alignment's idea is quite similar to edit distance's, but local alignment tries to find only partial matches. In the context of TCP, we may need to compare a short test case with a very long test that (almost) contains the short test case. In such cases, typical edit distance or global alignment measures show a high distance between the two test cases, but local alignment would show a low distance, which may be more appropriate in this example.
  •  Maximizing distances
    In the last step of a diversity-based TCP technique, the distance matrix is given to an algorithm which maximizes the overall distance between all test cases for any given subset of the total test suite. In other words, we again use an aggregation measurement such as AUC, but this time rather than assigning the overall coverage after each of the n test case executions, we assign an aggregated distance value to the selected subset of test cases to be executed. There may be different ways to aggregate paired distance values to come up with one overall distance value per subset. One common way is averaging all the pairwise distances between all pairs of test cases in the selected set.
    Note that the problem of prioritizing test cases to achieve such maximized diversity is again, in general, an NP-hard problem (traditional set cover) [47]. Therefore, using an exhaustive search in most realistic problems is not an option. Some of the alternatives are: (1) Greedy, (2) Adaptive Random, (3) Clustering, and (4) Search-based algorithms.
    •  Greedy-based distance maximization: There are several ways to implement a Greedy algorithm for distance maximization. I'll explain one option here. First, we choose the pair of test cases with the maximum distance among all pairs in the test suite. Then at each step, the test case that has the highest distance to the already selected test cases is appended to the prioritized list. The distance of a test case (Tx) to a set of n test cases (Ti where i:1 … n) can be defined in different ways. For instance, we can take the average of all distance(Tx, Ti) for i:1 … n. Other options are taking the maximum or minimum of all distance(Tx, Ti) for i:1 … n.
    •  Adaptive random testing for distance maximization: Adaptive random testing (ART) has been proposed as an extension to random testing [56]. Its main idea is that diversity among test cases should be rewarded, because failing test cases tend to be clustered in contiguous regions of the input domain. This has been shown to be true in empirical analyses regarding applications whose input data are of numerical type [57]. Therefore, ART is a candidate distance maximization strategy in our context as well. A basic ART algorithm described in Ref. [56] is quite similar to what was explained above for basic Greedy algorithm for distance maximization. There are different variations of ART [58] as well that can be applied on TCP problem.
    •  Clustering-based distance maximization: Clustering algorithms partition data points into groups, using a similarity/distance measure between pairs of data points and pairs of clusters, so that the points that belong to the same group are similar and those belonging to different groups are dissimilar. Though clustering techniques are not optimization techniques, the fact that clusters are formed based on the similarities/distances among data points makes these algorithms a potential solution for our maximization problem.
      One way of applying clustering for prioritization is to start by selecting a test case (T1) at random. Then cluster the test cases into two partitions (C1 and C2). T1 will belong to one of the two clusters; let's say T1 is in C1. Then we select a test case, T2 (either a random one or the centroid), from C2 to append to the ranked list. Then in each step, we split the biggest cluster into two, select a test case from a cluster that does not have a representative yet, and append it to the ranked list, until we rank all test cases.
      There are also many other ways that one can come up with, when applying clustering for prioritization. It is also possible to combine clustering with another maximization algorithm so that we first cluster the test cases and then order the clusters rather than ordering test cases. Finally, the tests within a cluster need to be ordered using other strategies.
    •  Search-based distance maximization: The overall idea here is quite similar to the search-based coverage maximization explained earlier, except that the fitness function should represent the aggregated measure based on the distance values, for instance, the average of all paired distance values for each set size, following the given ordering.
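
As referenced earlier, here is a small sketch (illustrative only) of the basic Levenshtein distance applied to the T1 and T2 method call sequences from the edit distance example above:

    # Hypothetical sketch of the basic Levenshtein (edit) distance between two test
    # cases encoded as sequences of method calls.
    def levenshtein(seq1, seq2):
        """Minimum number of insertions, deletions, and substitutions (unit costs)."""
        rows, cols = len(seq1) + 1, len(seq2) + 1
        dist = [[0] * cols for _ in range(rows)]
        for i in range(rows):
            dist[i][0] = i
        for j in range(cols):
            dist[0][j] = j
        for i in range(1, rows):
            for j in range(1, cols):
                cost = 0 if seq1[i - 1] == seq2[j - 1] else 1      # match or mismatch
                dist[i][j] = min(dist[i - 1][j] + 1,               # deletion (gap)
                                 dist[i][j - 1] + 1,               # insertion (gap)
                                 dist[i - 1][j - 1] + cost)        # substitution
        return dist[-1][-1]

    T1 = ["M1", "M2", "M1", "M3"]
    T2 = ["M1", "M4", "M1"]
    raw = levenshtein(T1, T2)
    print(raw, raw / max(len(T1), len(T2)))  # 2 and 0.5, as in the example above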

As an example, let's look at the Table 1 test cases. The encoded versions of those three test cases look like the following:

T1: 〈1, 0, 1, 0, 1〉, T2: 〈1, 1, 1, 0, 1〉, and T3: 〈0, 0, 0, 1, 0〉

Now using a Hamming distance function (normalized by the vector length, 5), the distance matrix is:

        T1    T2    T3
T1            0.2   0.8
T2                  1.0
T3

Finally, using a Greedy-based distance maximization, the first pair to select is T2 and T3 (with no particular preference between them), and then T1 is ranked as the last test to execute. To make a full ordering, one can choose the first test at random or choose the one among T2 and T3 that has higher coverage, which results in the 〈T2, T3, T1〉 ordering.
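
The whole worked example can be reproduced with the following sketch (illustrative only): normalized Hamming distances over the Table 1 encodings, followed by a Greedy distance maximization that uses the average distance to the already ranked tests:

    # Hypothetical end-to-end sketch of the diversity-based example above.
    T = {"T1": [1, 0, 1, 0, 1], "T2": [1, 1, 1, 0, 1], "T3": [0, 0, 0, 1, 0]}

    def hamming(a, b):
        # Normalized Hamming distance between two equal-length binary vectors.
        return sum(x != y for x, y in zip(a, b)) / len(a)

    def greedy_diversity(tests):
        names = list(tests)
        # Start with the pair of tests that are farthest apart.
        first, second = max(((a, b) for a in names for b in names if a < b),
                            key=lambda p: hamming(tests[p[0]], tests[p[1]]))
        ranked = [first, second]
        remaining = [t for t in names if t not in ranked]
        while remaining:
            # Next, pick the test with the largest average distance to the ranked ones.
            nxt = max(remaining, key=lambda t: sum(hamming(tests[t], tests[r])
                                                   for r in ranked) / len(ranked))
            ranked.append(nxt)
            remaining.remove(nxt)
        return ranked

    print(greedy_diversity(T))  # ['T2', 'T3', 'T1'] (the T2/T3 order is an arbitrary tie-break)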

4 Test Prioritization Evaluation

So far we have only focused on how to rank test cases based on available input resources. Given that all our techniques use heuristics (such as coverage or diversity of code, specification, or history), the question is how good these heuristics are. In other words, assume we have the best optimization strategy and we rank test cases in an optimal way with respect to, let's say, their method coverage. Should we expect an optimal rank? How can we measure the quality of a ranked list? What is the ultimate goal of a TCP?

To answer these questions, we first need to (re)define our ideal ranking. We said earlier that there is no one way to define the best rank. The “best” might actually be different depending on the context, but overall, most people agree that the goal of a TCP is to rank test cases so that we find failing test cases as soon as possible. That means we need to prioritize failing test cases over nonfailing ones.

But this definition is not enough. First, we have to assume that each test case failure is mapped to one and only one defect. Second, we have to assume that all test failures are equally severe; otherwise we need to assign a weight per failure so that the failing tests with severe failure are ranked higher. Finally, we have to decide how to aggregate the results of several failing tests. Let me explain these in the following categories:

  •  A test suite with one failing test case: If a test suite contains only one failing test case, the goal of a TCP is putting that one test as high as possible in the ranked list. Ideally, as number one. If this is the case, the best evaluation measure for a prioritized list is the rank number of the failing test case in the list. For instance, if TCP1 prioritized the only failing test as #4 and TCP2 prioritized it as #20, TCP1 is a better prioritization technique. We can even say a TCP that ranks the failing test as #1 is the optimal solution. Note that this measure is also useful when the test suite has multiple failing tests, but as soon as a fault is detected, the code is fixed and the tests updated and reordered. Therefore, we always care about detecting the first fault.
  •  A test suite with multiple failing test cases: In a slightly more generic setting, there might be more than one failing test case per test suite. Note that we still assume that each test case detects at most one defect and that defects are equally severe. There are two different situations in this context, depending on whether the failures are unique or not.
    •  Unique test case per defect: If the test-to-failure mapping is unique, meaning no two test cases detect the same defect, what we need is to rank all the failing test cases high in the list. A simple evaluation measure here would be the number of test cases required for detecting all defects. However, you may end up with scenarios where, say, TCP1 detects 8 defects out of 10 total detectable defects up front by just 8 test executions and then nothing for a while, until it detects the rest by 40 test executions, whereas TCP2 detects the first 8 by 10 test executions and all 10 defects by 15. Now, depending on the context, one may choose TCP1 or TCP2. If I only care about the first 80% of the defects or I have a limited budget of, say, 10 test executions, then TCP1 is my choice. But if I care about all bugs and I have more testing budget, TCP2 may be my better choice.
    •  Multiple tests per defect: If the test case is not unique per failure, then the raw number of failing test cases is not enough and we need to see how many unique defects have been detected so far. For instance, assume the first four test cases, ordered by TCP1, all fail but they all detect the same defect, whereas among the first four test cases ordered by TCP2 only two tests fail, but they detect separate defects. In this example, TCP2 is a better choice since it detects two defects rather than one (as with TCP1) within four test case executions.
  •  Multiple faults per test case: Finally, a single test case can detect multiple faults. The best way to represent such cases is a matrix, called the fault matrix, which shows the test cases along one dimension and the unique faults along the other (see Table 3).

    Table 3

    Sample Fault Matrix
              T1    T2    T3    T4
    Fault1     0     0     1     0
    Fault2     1     0     1     0
    Fault3     0     0     0     1


    A very common effectiveness metric used in most TCP literature is from this category and is called the “Average Percentage of Faults Detected” (APFD) [14]. APFD measures the average percentage of faults detected over the execution of a prioritized test suite and is given by

    $$\mathrm{APFD} = 100 \times \left(1 - \frac{TF_1 + TF_2 + \cdots + TF_m}{n\,m} + \frac{1}{2n}\right)$$


    where n denotes the number of test cases, m is the number of faults, and TFi is the position, in the given ordering, of the first test case that detects fault i. The overall idea behind APFD is calculating the area under the curve whose x-axis is the number of test cases (or portion of the test suite) executed in the given order and whose y-axis is the number of unique faults detected by those executions (quite similar to the concept of AUC explained for coverage-based optimization). APFD is maximal when all defects are detected by the very first test cases and minimal when no defect is detected until the last test case (a small computational sketch of APFD is given after this list).
    •  APFD's weaknesses: Though APFD is very common, it has its own limitations. The most significant one arises when test suites are large, with many test cases that do not detect any fault. This scenario is actually very common, given that test prioritization is most useful for large-scale test suites and these (regression) test suites typically have only a few failing test cases. The problem is that APFD cannot differentiate two TCPs well when the vast majority of test cases do not detect any fault and the detecting tests are mostly ordered first. In other words, the gap between the two curves (two different orderings) lies in a small initial portion of the ordered test suite; therefore, the gap between the overall AUCs of the two curves becomes insignificant.
      To overcome this issue, a simple approach is having a cutoff for the number of test cases in the suite that are being ranked. This way the very long tail of the curve will be excluded and the APFD results will be more normalized. Another option is transforming the raw APFDs by a logarithmic function to make the differences more visible.
  •  Measures based on prediction effectiveness: As we will discuss in the next section, the TCP problem can also be seen as a prediction problem, where the goal is predicting the failure likelihood of each test case in a test suite, in the new release, based on historical data. The prioritization can then be done by sorting the test cases based on such failure likelihood estimates. Most TCP techniques in this domain use historical data on some metrics (e.g., code coverage, historical failure, size) as a learning set and predict the failure of test cases based on the values of those metrics in the current version of the software. Therefore, another category of evaluation metrics is standard prediction evaluation metrics such as precision, recall, F-measure, and accuracy [59].
  •  The cost and severity factor: The two main factors that are ignored in all the above metrics are the cost of each test case execution and the severity of the faults detected per test case. These two factors can (arguably) be even more important than the raw number of detected faults, which is the basis for all the above metrics. The solution for taking these factors into account is actually quite simple: one can just use a weighted version of each metric. However, the challenge is finding the right weights. As an example, Elbaum et al. [24] proposed and studied one such variation of APFD, which they call the “cost-cognizant” metric APFDc.
    Regarding cost measurement, several metrics have been introduced in the literature, e.g., test case/suite execution time, test case/suite size, defect discovery time, and prioritization time [9], that can be used to quantify the cost measure.
    Severity can also be measured in different ways. In a study by Hettiarachchi et al. [60], the authors used requirements modification status, complexity, security, and size of the software requirements as risk indicators and employed a fuzzy expert system to estimate the requirements risks.
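
To make the APFD computation concrete, here is a minimal Python sketch. It assumes the fault matrix is represented as a mapping from each fault to the set of test cases that detect it; the representation and names are illustrative, not a standard API:

```python
def apfd(ordering, fault_matrix):
    """ordering: list of test case names in prioritized order.
    fault_matrix: dict mapping each fault to the set of tests that detect it."""
    n, m = len(ordering), len(fault_matrix)
    position = {t: i + 1 for i, t in enumerate(ordering)}  # 1-based ranks
    # TF_i: rank of the first test in the ordering that detects fault i.
    tf = [min(position[t] for t in detectors) for detectors in fault_matrix.values()]
    return 100 * (1 - sum(tf) / (n * m) + 1 / (2 * n))

# Table 3 read column-wise: T3 detects Fault1 and Fault2, T1 detects Fault2,
# T4 detects Fault3, and T2 detects nothing.
faults = {"Fault1": {"T3"}, "Fault2": {"T1", "T3"}, "Fault3": {"T4"}}
print(apfd(["T3", "T4", "T1", "T2"], faults))  # ~79.2: fault-detecting tests first
print(apfd(["T2", "T1", "T4", "T3"], faults))  # 37.5: a much poorer ordering
```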

5 Trends and Future Directions in TCP Techniques

The advances in test prioritization in recent years are in three categories: (a) Scalable solutions, (b) Intelligent algorithms, and (c) New application domains.

  1. (a) Scalable solutions
    Looking at many earlier works in the domain of TCP, we realize that the algorithms and heuristics proposed in those studies had only been evaluated on toy applications (very small or lab-made programs). Therefore, scalability of TCP techniques to real-world complex systems is definitely one of the target research directions. At least two trends are clear from recent publications: prioritizing system-level test suites and prioritizing test cases in the presence of frequent code changes.
    •  System TCP: In recent years, when researchers tried to apply the same old TCP techniques to real-world large-scale software systems, they faced several scalability issues with respect to test execution cost. For example, Hemmati et al. [28] report that in the context of system-level testing of embedded systems, each test execution is very costly because there are hardware and network in the loop. This is a very different situation from a typical unit-level testing scenario, where external access (e.g., network and database) is mocked and each test execution is very fast. They report that even running hundreds of such test cases in an overnight regression run is not feasible for many teams. The specific trend that this work started is maximizing a test suite's diversity rather than only focusing on its coverage. This is particularly important for large-scale systems, where achieving very high coverage is very expensive; therefore, test cases prioritized by coverage-based approaches may fail to verify a wide-enough range of features of the software under test.
    •  Continuous integration: The other important trend in test case prioritization advances we observe today is dealing with the frequent changes a software system undergoes. With the increasing interest in continuous integration, large software companies such as Google face scalability issues even for running all unit test cases after each commit [17]. It is even expensive to run just the test cases affected by the code change. The new trends and advances in TCPs [17] suggest bounded change impact analysis based on effective evidence collected from earlier versions of the SUT.

    Future directions (1): Based on the earlier discussion, one of the future directions of TCP research is conducting large-scale empirical studies with industrial software systems, evaluating the existing techniques in terms of their cost-effectiveness at scale, and proposing modifications and relaxations that make the algorithms scale to the size and complexity of real-world systems.
  2. (b) Intelligent algorithms
    There is no shortage of new intelligent ideas proposed for the TCP problem. Most techniques try to come up with a new metric for test prioritization that works better than the existing ones. Here we list some of the new and trendy heuristics for TCPs:
    •  Search-based optimization: As we saw earlier, the TCP problem can be formulated as an optimization problem, and thus evolutionary search algorithms have been used in the literature to solve it. Genetic algorithms are among the most common techniques in this domain. However, other search algorithms have shown good results as well, for instance, Ant Colony Optimization [61], Particle Swarm Optimization, etc. [11].
    •  Diversity-based (or similarity-based) TCP: TCP techniques in this category [28,29,51,52,62] have gained attention in academia in recent years. The main idea, as explained before, is focusing on the diversity of test cases rather than only coverage of some sort. This line of thought has also been very active in the automated test generation literature. For example, Feldt et al. [63] studied diversity metrics and proposed new metrics that work on the diversity of the entire set rather than an aggregation of pairwise distances. In another work, Shi et al. [64] introduce an entropy-based metric for measuring diversity among test cases.
    •  IR-based TCP: Information retrieval techniques have been used for extracting features from different resources to represent test cases. For instance, Kwon et al. [65] used an IR-based (TF-IDF) metric to extract features from code elements, Thomas et al. [12] used a topic modeling algorithm (LDA) to extract features from string literals and code comments, and Hemmati et al. [8,36] used topic modeling to extract features from textual test case instructions.
    •  Historical failure: Another category of TCP techniques that has gained attention in recent years is making the old-school “historical failure” metric more intelligent. As we saw earlier, the original algorithm simply ranks test cases that failed in the past higher. Elbaum et al. [22] proposed a smarter and more scalable version of this idea by introducing a windowing mechanism to limit the considered history. They also look at how recent a failure was when ranking the fault-revealing test cases. Noor and Hemmati also modified the basic algorithm by introducing the concept of “similarity to the failing test cases in the past” [23]. Basically, the idea is that many test cases are quite similar to older test cases (they may even be slightly modified versions of older test cases), but not exactly the same. In these scenarios, the test cases that are very similar (in terms of what behavior they test) to the old failing test cases should also be ranked higher. This makes the historical failure metric more accurate and smooth (not a binary metric anymore).

    There are also attempts to use multiple metrics together to get a better solution, overall. The main two strategies here are multiobjective search and machine learning-based approaches.
    •  Multiobjective TCPs: The use of multiobjective search algorithms on the TCP problem has been exercised in several studies [4,66]. One very common trend is adding the cost factor as a secondary criterion to an existing effectiveness criterion such as coverage or diversity [51,67]. Due to the scalability issues of multiobjective search [68], this strategy is typically used for optimizations with only two or three objectives.
    •  Machine learning-based TCP: The most straightforward way of using machine learning to combine heuristics for test prioritization is building prediction models (such as regression or Bayesian models) that accept a set of metrics (each based on one existing heuristic, e.g., coverage, historical failure detection, cost, diversity, change coverage) and predict the likelihood of “failure” for the test cases in the new release, based on historical data from previous releases (see the sketch after this list). This idea falls under the category of history-based TCP approaches, since the learning dataset (typically) comes from the history. For example, Mirarab and Tahvildari [69] utilize Bayesian Networks (BNs) to incorporate source code changes, software fault-proneness, and test coverage data into a unified model. Noor and Hemmati [23] use a logistic regression model to aggregate metrics such as historical fault detection, code coverage, change coverage, and test case size for test failure prediction in the context of test prioritization.
      In another study [60], Hettiarachchi et al. used requirements “modification status,” complexity, security, and size of the software requirements as risk indicators and employed a fuzzy expert system to estimate the requirements risks. The tests would then be prioritized based on the estimated risk of the corresponding covered requirements.
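
As a concrete illustration of this machine learning-based formulation, the following is a minimal sketch using a logistic regression model from scikit-learn. The metric names, the toy historical data, and the feature values are purely illustrative assumptions, not data from any of the cited studies:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# One row per test case from past releases; columns are per-test metrics
# (method coverage, past failure rate, change coverage, normalized size).
X_history = np.array([[0.8, 0.3, 0.6, 0.9],
                      [0.2, 0.0, 0.1, 0.2],
                      [0.6, 0.5, 0.7, 0.8],
                      [0.1, 0.0, 0.0, 0.1]])
y_history = np.array([1, 0, 1, 0])  # 1 = the test failed in that release

model = LogisticRegression().fit(X_history, y_history)

# The same metrics recomputed for the current release's test suite.
X_new = np.array([[0.7, 0.4, 0.5, 0.8],
                  [0.3, 0.0, 0.2, 0.3]])
failure_likelihood = model.predict_proba(X_new)[:, 1]

# Prioritize by descending predicted failure likelihood.
order = np.argsort(-failure_likelihood)
print(order, failure_likelihood[order])
```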

    Future directions (2): Based on the earlier discussions, proposing more intelligent and effective TCP heuristics is definitely one of the future directions in TCP. Domains such as search-based TCP and machine learning-based TCP can be seen as the dominant players. However, the use of information theory and test case diversity, as well as NLP techniques, is also becoming attractive to researchers. One area that is not explored yet but has potential is the use of Deep Learning for TCP. Deep Learning combined with NLP may help extract features from test cases that are not easily identifiable but are very useful for prediction.
  3. (c) New application domains
    The last category of trends and future directions is about the application domain. Most earlier TCP studies focused on test suites of simple C/C++/Java programs. Some of the new domains that have recently received special attention for test prioritization are product line software [66] and mobile applications [70].
    •  Product line software: In this context, typically a variability model (e.g., a feature model) is used to prioritize test cases to cover software features. A feature model is a tree structure that captures the information of all the possible products of a software product line in terms of features and the relationships among them [71]. The idea is to maximize the coverage of features across the products being tested. Combinatorial testing [40] is also a useful approach in this domain. Note that most of the above-discussed TCP techniques and approaches can be directly or indirectly applied in this domain. For example, Wang et al. [66] used a multiobjective TCP to maximize effectiveness (feature coverage) while minimizing the cost of test cases in a product line setting.
    •  Mobile: Another application domain that can specifically benefit from TCP is mobile application testing. One of the particular features of mobile applications that makes them a perfect target for TCP is their limited battery. Jabbarvand et al. proposed an energy-aware TCP approach that significantly reduces the number of tests needed to effectively test the energy properties of an Android app [70]. Basically, it is a coverage-based approach, but it relies on an energy-aware coverage criterion that indicates the degree to which energy-greedy segments of a program are tested. In another work, Zhai et al. [72] proposed a location-aware TCP. In their work, a test case for a location-aware mobile app is a sequence of locations. They propose a suite of metrics that focus on the location of the device and the points of interest (POIs) of the application. For example, “sequence variance” measures the variance of a test case's location sequence, and “centroid distance” measures the distance from the centroid of a test case's location sequence to the centroid of a set of POIs (a small illustrative sketch of these two metrics is given below). The TCP tries to maximize the diversity of the locations while minimizing the relevance to the POIs.
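
A small Python sketch of these two location-based metrics follows; the exact formulas here are reasonable interpretations for illustration, not taken verbatim from Ref. [72]:

```python
import numpy as np

def sequence_variance(locations):
    """Variance of a test case's location sequence around its own centroid."""
    pts = np.asarray(locations, dtype=float)
    return float(np.mean(np.sum((pts - pts.mean(axis=0)) ** 2, axis=1)))

def centroid_distance(locations, pois):
    """Distance from the centroid of a test case's location sequence to the
    centroid of a set of points of interest (POIs)."""
    c_test = np.asarray(locations, dtype=float).mean(axis=0)
    c_poi = np.asarray(pois, dtype=float).mean(axis=0)
    return float(np.linalg.norm(c_test - c_poi))

# Hypothetical latitude/longitude sequences for one test case and two POIs.
test_case = [(51.04, -114.07), (51.05, -114.06), (51.08, -114.13)]
pois = [(51.045, -114.063), (51.05, -114.09)]
print(sequence_variance(test_case), centroid_distance(test_case, pois))
```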

    Future directions (3): As new application domains emerge, the TCP problem becomes relevant for them. Typically, as we saw for instance in the mobile app domain, the new TCPs will focus on features that are specific to such domains. Given the amount of interest in big data applications and cloud computing, we can predict that soon we will need TCP techniques tailored to applications in these domains.

6 Summary

TCP is one of the classic testing problems that have been studied in the literature for many years. There are several solutions for this problem depending on the available input resources for the prioritization approach and the heuristics employed. This chapter introduces some of these techniques and summarizes trends and future directions in the context of TCP.

References

[1] Myers G.J., Sandler C., Badgett T. The Art of Software Testing. John Wiley & Sons; 2011.

[2] Whittaker J.A. What is software testing? And why is it so hard?. IEEE Softw. 2000;17(1):70–79.

[3] Humble J., Farley D. Continuous Delivery: Reliable Software Releases Through Build, Test, and Deployment Automation. Pearson Education; 2010.

[4] Yoo S., Harman M. Regression testing minimization, selection and prioritization: a survey. Softw. Test. Verification Reliab. 2010;22(2):67–120.

[5] https://developers.google.com/google-test-automation-conference/2014/presentations.

[6] Engström E., Runeson P. In: A qualitative survey of regression testing practices. International Conference on Product Focused Software Process Improvement; Springer Berlin Heidelberg; 2010.

[7] Beck K. Test-Driven Development: By Example. Addison-Wesley Professional; 2003.

[8] Hemmati H., Fang Z., Mäntylä M.V., Adams B. Prioritizing manual test cases in rapid release environments. Softw. Test. Verif. Reliab. 2017;27(6):e1609. doi:10.1002/stvr.1609.

[9] Kazmi R., Jawawi D.N.A., Mohamad R., Ghani I. Effective regression test case selection: a systematic literature review. ACM Comput. Surv. 2017;50(2):29:1–29:32.

[10] Catal C., Mishra D. Test case prioritization: a systematic mapping study. Softw. Q. J. 2013;21(3):445–478.

[11] Li Z., Harman M., Hierons R.M. Search algorithms for regression test case prioritization. IEEE Trans. Softw. Eng. 2007;33(4):225–237.

[12] Thomas S.W., Hemmati H., Hassan A.E., Blostein D. Static test case prioritization using topic models. Empir. Softw. Eng. 2014;19(1):182–212.

[13] Wong W., Horgan J., London S., Agrawal H. In: A study of effective regression testing in practice. International Symposium on Software Reliability Engineering; 1997:264–274.

[14] Rothermel G., Untch R., Chu C., Harrold M.J. Prioritizing test cases for regression testing. IEEE Trans. Softw. Eng. 2001;27(10):929–948.

[15] Orso A., Apiwattanapong T., Harrold M.J. Leveraging field data for impact analysis and regression testing. ACM SIGSOFT Softw. Eng. Notes ACM. 2003;28:128–137.

[16] Briand L.C., Labiche Y., Soccar G. In: Automating impact analysis and regression test selection based on UML designs. IEEE International Conference on Software Maintenance; 2002:252–261.

[17] Memon A. In: Taming Google-Scale continuous testing. IEEE International Conference on Software Engineering: Software Engineering in Practice Track; 2017.

[18] D’Ambros M., Lanza M., Robbes R. Evaluating defect prediction approaches: a benchmark and an extensive comparison. Empir. Softw. Eng. 2012;17(4–5):531–577.

[19] Gao K., Khoshgoftaar T.M., Wang H., Seliya N. Choosing software metrics for defect prediction: an investigation on feature selection techniques. Softw. Pract. Exp. 2011;41(5):579–606.

[20] Kim J.M., Porter A. In: A history-based test prioritization technique for regression testing in resource constrained environments. IEEE International Conference on Software Engineering; 2002:119–129.

[21] Onoma A.K., Tsai W.T., Poonawala M., Suganuma H. Regression testing in an industrial environment. Commun. ACM. 1998;41(5):81–86.

[22] Elbaum S., Rothermel G., Penix J. In: Techniques for improving regression testing in continuous integration development environments. ACM SIGSOFT International Symposium on Foundations of Software Engineering; 2014:235–245.

[23] Noor T.B., Hemmati H. In: A similarity-based approach for test case prioritization using historical failure data. IEEE International Symposium on Software Reliability Engineering; 2015:58–68.

[24] Elbaum S., Malishevsky A., Rothermel G. In: Incorporating varying test costs and fault severities into test case prioritization. IEEE International Conference on Software Engineering; 2001:329–338.

[25] Yoo S., Harman M. In: Pareto efficient multi-objective test case selection. ACM International Symposium on Software Testing and Analysis; 2007:140–150.

[26] Epitropakis M.G., Yoo S., Harman M., Burke E.K. In: Empirical evaluation of pareto efficient multi-objective regression test case prioritization. ACM International Symposium on Software Testing and Analysis; 2015:234–245.

[27] Mei H., Hao D., Zhang L., Zhou J., Rothermel G.A. Static approach to prioritizing JUnit test cases. IEEE Trans. Softw. Eng. 2012;38(6):1258–1275.

[28] Hemmati H., Arcuri A., Briand L. Achieving scalable model-based testing through test case diversity. ACM Trans. Softw. Eng. Methodol. 2013;22(1):42 Article 6.

[29] Hemmati H., Briand L., Arcuri A., Ali S. In: An enhanced test case selection approach for model-based testing: an industrial case study. ACM SIGSOFT International Symposium on Foundations of Software Engineering; 2010:267–276.

[30] Chen Y., Probert R.L., Sims D.P. In: Specification-based regression test selection with risk analysis. Conference of the Center for Advanced Studies on Collaborative Research (CASCON); 2002:1–14.

[31] Arafeen M.J., Do H. In: Test case prioritization using requirements-based clustering. IEEE International Conference on Software Testing, Verification, and Validation; 2013:312–321.

[32] Krishnamoorthi R., Sahaaya Arul Mary S.A. Factor oriented requirement coverage based system test case prioritization of new and regression test cases. Inf. Softw. Technol. 2009;51(4):799–808.

[33] Srikanth H., Williams L., Osborne J. In: System test case prioritization of new and regression test cases. International Symposium on Empirical Software Engineering; 2005:64–73.

[34] Ledru Y., Petrenko A., Boroday S. In: Using string distances for test case prioritization. International Conference on Automated Software Engineering; 2009:510–514.

[35] Ledru Y., Petrenko A., Boroday S., Mandran N. Prioritizing test cases with string distances. Autom. Softw. Eng. 2011;19(1):65–95.

[36] Hemmati H., Fang Z., Mäntylä M.V., Adams B. In: Prioritizing manual test cases in traditional and rapid release environments. IEEE International Conference on Software Testing, Verification, and Validation; 2015:1–10.

[37] Bryce R.C., Sampath S., Memon A.M. Developing a single model and test prioritization strategies for event-driven software. IEEE Trans. Softw. Eng. 2011;37(1):48–64.

[38] Bryce R.C., Memon A.M. In: Test suite prioritization by interaction coverage. Workshop on Domain Specific Approaches to Software Test Automation: In Conjunction With ACM ESEC/FSE Joint Meeting; 2007.

[39] Bryce R.C., Colbourn C.J. Prioritized interaction testing for pair-wise coverage with seeding and constraints. Inf. Softw. Technol. 2006;48(10):960–970.

[40] Kuhn D.R., Kacker R.N., Lei Y. Introduction to Combinatorial Testing. CRC Press; 2013.

[41] Sampath S., Bryce R.C., Viswanath G., Kandimalla V., Koru A.G. In: Prioritizing user-session-based test cases for web applications testing. IEEE International Conference on Software Testing, Verification, and Validation; 2008.

[42] Sharifi F., Hemmati H. In: Investigating NLP-based approaches for predicting manual test case failure. 11th IEEE Conference on Software Testing, Validation and Verification (ICST); 2018 to appear.

[43] https://en.wikipedia.org/wiki/Local_optimum.

[44] https://en.wikipedia.org/wiki/Greedy_algorithm.

[45] Ali S., Briand L., Hemmati H., Panesar-Walawege R.K. A systematic review of the application and empirical investigation of search-based test case generation. IEEE Trans. Softw. Eng. 2010;36(6):742–762.

[46] Harman M., Jones B.F. Search-based software engineering. Inf. Softw. Technol. 2001;43:833–839.

[47] Mathur A.P. Foundations of Software Testing. first ed. Addison-Wesley Professional; 2008.

[48] Wolpert D., Macready W.G. No free lunch theorems for optimization. IEEE Trans. Evol. Comput. 1997;1:67–82.

[49] Chen Y., Probert R.L., Ural H. Regression test suite reduction based on SDL models of system requirements. J. Softw. Maint. Evol. Res. Pract. 2009;21:379–405.

[50] Hemmati H. Similarity-Based Test Case Selection: Toward Scalable and Practical Model-Based Testing. PhD thesis Simula Research Laboratory and Informatic Department, University of Oslo; 2011.

[51] Mondal D., Hemmati H., Durocher S. In: Exploring test suite diversification and code coverage in multi-objective test case selection. IEEE International Conference on Software Testing, Verification and Validation; 2015.

[52] Hemmati H., Briand L. In: An industrial investigation of similarity measures for model-based test case selection. IEEE International Symposium on Software Reliability Engineering; 2010:141–150.

[53] Dong G., Pei J. Sequence Data Mining. Springer; 2007.

[54] Durbin R., Eddy S.R., Krogh A., Mitchison G. Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids. Cambridge University Press; 1999.

[55] Gusfield D. Algorithms on Strings, Trees and Sequences, Computer Science and Computational Biology. Cambridge University Press; 1997.

[56] Chen T.Y., Kuo F.-C., Merkel R.G., Tse T.H. Adaptive random testing: the ART of test case diversity. J. Syst. Softw. 2010;83:60–66.

[57] White L.J., Cohen E.I. A domain strategy for computer program testing. IEEE Trans. Softw. Eng. 1980;6:247–257.

[58] Arcuri A., Iqbal M.Z., Briand L. Random testing: theoretical results and practical implications. IEEE Trans. Softw. Eng. 2012;38(2):258–277.

[59] Witten I.H., Frank E., Hall M.A., Pal C.J. Data Mining: Practical Machine Learning Tools and Techniques. Morgan Kaufmann; 2016.

[60] Hettiarachchi C., Do H., Choi B. Risk-based test case prioritization using a fuzzy expert system. Inf. Softw. Technol. 2016;69:1–15.

[61] Singh Y., Kaur A., Suri B. Test case prioritization using ant colony optimization. ACM SIGSOFT Softw. Eng. Notes. 2010;35(4):1–7.

[62] Hemmati H., Arcuri A., Briand L. In: Empirical investigation of the effects of test suite properties on similarity-based test case selection. IEEE International Conference on Software Testing, Verification and Validation; 2011.

[63] Feldt R., Poulding S., Clark D., Yoo S. In: Test set diameter: quantifying the diversity of sets of test cases. IEEE International Conference on Software Testing, Verification and Validation; 2016.

[64] Shi Q., Chen Z., Fang C., Feng Y., Xu B. Measuring the diversity of a test set with distance entropy. IEEE Trans. Reliab. 2016;65(1):19–27.

[65] Kwon J.H., Ko I.Y., Rothermel G., Staats M. In: Test case prioritization based on information retrieval concepts. IEEE Asia-Pacific Software Engineering Conference; 2014.

[66] Wang S., Buchmann D., Ali S., Gotlieb A., Pradhan D., Liaaen M. In: Multi-objective test prioritization in software product line testing: an industrial case study. International Software Product Line Conference; 2014.

[67] Yoo S., Harman M., Tonella P., Susi A. In: Clustering test cases to achieve effective and scalable prioritization incorporating expert knowledge. ACM International Symposium on Software Testing and Analysis; 2009.

[68] Yoo S., Harman M., Ur S. In: Highly scalable multi objective test suite minimization using graphics cards. International Symposium on Search Based Software Engineering; Springer Berlin Heidelberg; 2011.

[69] Mirarab S., Tahvildari L. In: A prioritization approach for software test cases based on Bayesian networks. International Conference on Fundamental Approaches to Software Engineering; Springer Berlin Heidelberg; 2007.

[70] Jabbarvand R., Sadeghi A., Bagheri H., Malek S. In: Energy-aware test-suite minimization for Android apps. ACM International Symposium on Software Testing and Analysis; 2016.

[71] Sánchez A.B., Segura S., Ruiz-Cortés A. In: A comparison of test case prioritization criteria for software product lines. IEEE International conference on Software Testing, Verification and Validation; 2014.

[72] Zhai K., Jiang B., Chan W.K. Prioritizing test cases for regression testing of location-based services: metrics, techniques, and case study. IEEE Trans. Serv. Comput. 2014;7(1):54–67.


Dr. Hadi Hemmati is an assistant professor at the Department of Electrical and Computer Engineering, University of Calgary. Before joining University of Calgary, Dr. Hemmati was an assistant professor at the Department of Computer Science, University of Manitoba, Canada, and a postdoctoral fellow at University of Waterloo, and Queen's University. He received his PhD from Simula Research Laboratory, Norway. His main research interest is Automated Software Engineering with a focus on software testing and quality assurance. His research has a strong focus on empirically investigating software engineering practices in large-scale systems, using model-driven engineering and data science. He has/had industry research collaborations with several companies around the world.
