Don't forget the developers! (and be careful with your assumptions)

A. Orso, Georgia Institute of Technology, Atlanta, GA, United States

Abstract

This chapter is not about data science for software engineering specifically; it is rather a general reflection about the broad set of techniques whose goal is to help developers perform software engineering tasks more efficiently and/or effectively. As a representative of these techniques, we studied spectra-based fault localization (SBFL), an approach that has received a great deal of attention in the last decade. We believe that studying SBFL, how it evolved over the years, and the way in which it has been evaluated, can teach us some general lessons that apply to other research areas that aim to help developers, including software analytics.

Keywords

Spectra-based fault localization (SBFL); User studies; Debugging; Helping developers

Acknowledgments

This chapter is based mainly on work [3] performed jointly with Chris Parnin (now a faculty member at North Carolina State University).

Background

Spectra-based (or statistical) fault localization is a family of debugging techniques whose goal is to identify potentially faulty code by mining both passing and failing executions of a faulty program, inferring their statistical properties, and presenting developers with a ranked list of potentially faulty statements to inspect. Fig. 1 shows the typical usage scenario for these techniques: given a faulty program and a set of test cases for it, an automated debugging tool based on SBFL would produce a ranked list of potentially faulty statements. The developers would then inspect these statements, one at a time, until they find the bug. It is worth noting that, in this context, a technique that allows developers to find the fault by examining only about 5–10% of the code in a majority of cases is typically considered a good result.

Fig. 1 An abstract view of the typical SBFL usage scenario.
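To make the scenario in Fig. 1 concrete, the following minimal sketch shows how an SBFL tool can turn per-test coverage and outcomes into a ranked list of statements, using the suspiciousness formula introduced by Jones and colleagues [1]. The input format and helper names are illustrative assumptions, not the interface of any particular tool.

```python
# Minimal SBFL sketch (illustrative only): rank statements by the
# suspiciousness formula from Jones et al. [1], computed from which
# passing and failing tests cover each statement.

def suspiciousness(failed_cov, passed_cov, total_failed, total_passed):
    """Suspiciousness of one statement (formula from Jones et al. [1])."""
    fail_ratio = failed_cov / total_failed if total_failed else 0.0
    pass_ratio = passed_cov / total_passed if total_passed else 0.0
    denom = fail_ratio + pass_ratio
    return fail_ratio / denom if denom else 0.0

def rank_statements(coverage, outcomes):
    """coverage: {test_id: set of statement ids}; outcomes: {test_id: 'pass' | 'fail'}."""
    total_failed = sum(1 for o in outcomes.values() if o == "fail")
    total_passed = len(outcomes) - total_failed
    statements = set().union(*coverage.values()) if coverage else set()
    scores = {}
    for s in statements:
        failed_cov = sum(1 for t, cov in coverage.items()
                         if s in cov and outcomes[t] == "fail")
        passed_cov = sum(1 for t, cov in coverage.items()
                         if s in cov and outcomes[t] == "pass")
        scores[s] = suspiciousness(failed_cov, passed_cov,
                                   total_failed, total_passed)
    # The developer would then inspect statements from most to least suspicious.
    return sorted(statements, key=lambda s: scores[s], reverse=True)
```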

The first instances of these techniques (eg, [1,2]) were clever, disruptive, and promising. Case in point: the work by Jones and colleagues [1] was awarded an ACM SIGSOFT Outstanding Research Award, and Liblit's dissertation, which was based on [2], received an ACM Doctoral Dissertation Award. Unfortunately, however, SBFL somehow became too popular for its own good; a few years after these and a handful of other innovative techniques were presented, researchers started to propose all sorts of variations on the initial idea. In most cases, these variations simply consisted of using a different formula for the ranking and showing slight improvements in the effectiveness of the approach. This trend is, to some extent, still in effect today.
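To give a sense of what many of these variations amount to, consider the Ochiai coefficient, one of the most widely studied alternative ranking formulas. The sketch below, which reuses the illustrative helper names from the previous sketch, shows that switching formulas amounts to replacing a few lines while leaving the rest of the ranking pipeline untouched.

```python
import math

def ochiai_suspiciousness(failed_cov, passed_cov, total_failed, total_passed):
    """Ochiai coefficient: a drop-in replacement for suspiciousness()
    in the previous sketch. total_passed and passed_cov enter the formula
    differently (total_passed is in fact unused), but the signature is kept
    for interface compatibility."""
    denom = math.sqrt(total_failed * (failed_cov + passed_cov))
    return failed_cov / denom if denom else 0.0
```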

Are We Actually Helping Developers?

One of the main reasons for this flood of alternative, fairly incremental SBFL techniques, we believe, is that researchers got too excited about (1) the presence of readily available data (test execution and outcome data), (2) the possibility of easily analyzing/mining this data, and (3) the presence of a clear baseline of comparison (the ranking results of existing techniques). These factors, and the excitement they generated, made researchers lose sight not only of the actual goal of the techniques they were developing, which was to help developers, but also of the many potential issues with their data analysis, such as bias, wrong assumptions, confounding factors, and spurious correlations.

To support our point, we performed a user study in which we assigned two debugging tasks to a pool of developers with different levels of expertise and studied their performance as they completed these tasks with, and without, the help of a representative SBFL technique. The results of our study clearly showed that SBFL techniques, at least as currently formulated, could not actually help developers: the developers who used SBFL did not locate bugs faster than the developers who performed debugging in a traditional way. This result was further confirmed by the fact that the performance of developers who used SBFL was not affected by the position of the faulty statements in the ranked lists produced by the technique (ie, having the fault ranked considerably higher or lower in the list did not make a difference in the developers' performance).

A more in-depth look at our results also revealed some additional issues with the (implicit) assumptions made by SBFL techniques. A first assumption is that locating a bug within 5–10% of the code is a good result. Although restricting the amount of code in which to locate a bug to a tenth of the software may sound much better than inspecting the whole program, in practice this may still mean going through a list of thousands of statements even for a relatively small program (10% of a 20,000-line program is still 2000 lines), which is clearly unrealistic. A second assumption is that programmers exhibit perfect bug understanding (ie, that they can look at a line of code and immediately assess whether it is faulty). Unfortunately, developers cannot typically spot a fault in a line of code simply by looking at that line without any context. On the contrary, we observed that the amount of context necessary to decide whether a statement is faulty could be considerable, and developers could take a long time to decide whether a statement reported as potentially faulty was actually faulty. A final assumption is that programmers would inspect a list of potentially faulty statements linearly and exhaustively, going through the entries one at a time until they find the actual fault; this assumption is also unrealistic. In fact, we observed that developers stopped following the provided list and started “jumping” from one statement to another (or gave up on the tool completely) after only a few entries. (Developers in general have a very low tolerance for false positives.)

Some Observations and Recommendations

The results we just discussed provide evidence that SBFL, a technique that appears to work well on paper, may not actually work that well or be helpful to developers in practice. As we mentioned at the beginning of this chapter, our goal is to use the example of SBFL to point out some common pitfalls for research that aims to help developers. We therefore conclude the chapter with a set of observations and recommendations, derived from our results with SBFL, that we believe can be of general value for at least some of the research performed in the area of data science for software engineering.

• The first, obvious observation that we can make based on our results is that techniques that are supposed to help developers should, at some point, be put into the hands of developers to be evaluated. We have to be especially careful when assuming that surrogate, easy-to-compute measures of success (eg, the percentage of statements in the code to be inspected for SBFL) are sensible and reasonable indicators of actual usefulness in the absence of any evidence to that effect.

• An additional, less obvious observation is that it is easy to make several (often implicit) assumptions about developers' behavior when defining a technique. If these assumptions, or even just some of them, are unrealistic, the practical usefulness of the technique will necessarily suffer. To make the situation worse, it is sometimes difficult to even realize that these assumptions are there, so subsequent work may simply take them for granted and perpetuate the issue (as has happened for years in the context of SBFL). In other words, we should be careful with our assumptions. Strong assumptions may be fine in the initial stages of research and in new areas. Such assumptions should, however, be made explicit and tested as the research matures.

• Based on these observations, we strongly believe that we should perform user studies whenever possible. The best, and maybe only, way to understand whether an approach is effective, whether its assumptions are reasonable, and whether its results are actually useful is to perform user studies in settings that are as realistic as possible. Again, this is particularly true for mature approaches, for which analytical evaluations tend to fall short and provide limited information.

• Researchers should also avoid going for low-hanging fruit that can only result in incremental work and add little to the state of the art. There are several egregious examples of this in the SBFL area, with dozens of papers whose main contribution is the improvement of success indicators (eg, the rank of the faulty statements) that have been shown to be misleading. We see a similar risk for the area of software analytics, where the abundance of data, and of well-known data analysis techniques, makes it far too easy to produce results that may be publishable, but are not particularly useful or actionable.

In summary, when trying to help developers, we have to make sure that the approaches we develop are not just a nice technical exercise, but are actually useful. Doing so entails questioning our assumptions, looking for novel, rather than incremental, results, and, ultimately, evaluating our techniques with real developers and in realistic settings.

References

[1] Jones J.A., Harrold M.J., Stasko J.T. Visualization of test information to assist fault localization. In: Proceedings of the international conference on software engineering (ICSE 2002); 2002:467–477.

[2] Liblit B., Aiken A., Zheng A.X., Jordan M.I. Bug isolation via remote program sampling. In: Proceedings of the conference on programming language design and implementation (PLDI 2003); 2003:141–154.

[3] Parnin C., Orso A. Are automated debugging techniques actually helping programmers? In: Proceedings of the international symposium on software testing and analysis (ISSTA 2011); 2011:199–209.
