What can go wrong in software engineering experiments?

S. Vegas; N. Juristo, Universidad Politécnica de Madrid, Madrid, Spain

Abstract

An astronomer wants to buy a telescope to observe a distant galaxy. He must be careful when choosing it, since viewed through the wrong telescope the galaxy will be an indecipherable blur. An experiment is an instrument we use in Software Engineering (SE) to analyze software development. The reliability of the findings is critically dependent on the alignment between the instrument and the phenomenon that we are studying. If our instrument is not properly aligned everything will be a blur, but we will mistakenly take it to be right. The aim of this chapter is to help readers to avoid common pitfalls when running SE experiments.

Keywords

Experimentation; Operationalization; Between-subjects designs; Within-subjects designs; Statistical significance; Power analysis

An astronomer wants to buy a telescope to observe a distant galaxy. While he is unsure about precisely what features the telescope he needs should have, he does know exactly how much he has to spend on equipment. So, without any further analysis of the required specifications, he goes ahead and orders the best telescope that he can afford and hopes for the best. But, viewed through the wrong telescope, the galaxy will be an indecipherable blur, and the equipment will be useless. Several factors regarding telescope design should be taken into account (whether the objective should be a lens or a mirror, the diameter of the objective, the quality of the lens/mirrors, etc.). Apart from affecting its price, they determine its magnification, or power. In other words, what we will be able to observe through an instrument depends on the instrument’s characteristics.

An experiment is an instrument that we use in software engineering (SE) to analyze software development. If our instrument is not properly aligned, everything will be a blur, but we will mistakenly take it to be right. The reliability of the findings is critically dependent on the quality of the instrument, that is, on the alignment between the instrument and the phenomenon that we are studying.

As in any other discipline, conducting experiments in SE is a challenging, error-prone activity. Other fields are tackling the issue of how much trust they can place in experiment results. Pashler and Wagenmakers [4] report “a crisis of confidence in psychological science reflecting an unprecedented level of doubt among practitioners about the reliability of research findings in the field.” The reliability of the results is highly dependent on design and protocol quality. Not everything goes right.

Experimentation is a fairly recent practice in SE compared with other much more mature experimental disciplines. Experimentalism is a paradigm that needs to be instantiated, translated, and adapted to the idiosyncrasy of each experimental discipline. Copy and paste, that is, copying from physics what an experiment is, copying from medicine the threats to the validity of experiments, or copying from psychology how to deal with experimental subjects, will not do at all. We can borrow from all these experiences, but our discipline needs to adopt its own form of experimentalism. We all need to learn more, and much more effort and research is needed to adapt the experimental paradigm to SE.

Based on our experiences of running experiments and reading the reports on other experiments, we have spotted some common mistakes in SE experiments. As a result, we have identified some good practices that may be helpful for avoiding common pitfalls.

Operationalize Constructs

Operationalization describes the act of translating a construct into its manifestation. In an SE experiment, we have cause and effect constructs, which are operationalized as treatments (methods, tools, and techniques to be studied) and dependent variables (software attributes under examination), respectively. For cause constructs, it is necessary to specify how the treatment will be applied, as SE’s immaturity (which very often shows up as informality) might lead different people to interpret the treatment differently. Effect constructs should take into account not only the metrics used to measure the dependent variable, but also the measurement procedure. The measurement procedure in SE is context dependent and needs to be specified.
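To make this concrete, here is a minimal, hypothetical sketch (not taken from this chapter) of operationalizing an effect construct such as “testing effectiveness” as the proportion of seeded defects a subject detects, with the measurement procedure written down explicitly; all names and numbers are illustrative assumptions.

```python
from dataclasses import dataclass

@dataclass
class EffectivenessMeasurement:
    """Hypothetical operationalization of the effect construct 'testing effectiveness'.

    Metric: proportion of seeded defects detected.
    Measurement procedure: defects are counted from the subject's test report by one
    researcher and cross-checked by a second, against a shared list of seeded defects.
    """
    seeded_defects: int     # defects injected into the experimental object
    detected_defects: int   # defects the subject reported and a checker confirmed

    def effectiveness(self) -> float:
        # Dependent variable: detected / seeded, a value in [0, 1]
        return self.detected_defects / self.seeded_defects

# Example: a subject who found 7 of the 10 seeded defects
print(EffectivenessMeasurement(seeded_defects=10, detected_defects=7).effectiveness())  # 0.7
```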

Evaluate Different Design Alternatives

The simpler the design is, the better it will be. Experiments that study only one source of variability (one-factor designs) and, if run with subjects, where each subject applies only one treatment (between-subjects designs), are the most manageable. Because of SE’s intrinsic properties, however, they are very often not the right choice in this field.

• Small sample sizes and a large variability in subject behavior make it almost impossible to use between-subjects designs in experiments run with subjects. The alternative is that each subject applies all treatments (within-subjects designs), as sketched at the end of this section.

• The influence of the intrinsic properties of the experimental objects (programs/software systems/tasks) and subjects (if any) very often obliges researchers to randomize according to them (stratified randomization) or use blocking variables.

• Software development process complexity makes it all but impossible to rule out the influence of other sources of variability (typically experimental objects and subjects), and more than one factor needs to be added to the design.

Consequently, experimental design has to be approached iteratively, trying out different designs and then analyzing trade-offs.
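As a concrete illustration of the within-subjects alternative, the following sketch (hypothetical, with illustrative subject and treatment names) builds an AB/BA crossover assignment: every subject applies both treatments, and the two possible orders are balanced across subjects so that treatment is not confounded with learning from one period to the next.

```python
import random

def crossover_assignment(subjects, treatments=("A", "B"), seed=1):
    """Assign each subject to one sequence (AB or BA) of a 2x2 crossover
    (within-subjects) design, balancing the two sequences."""
    random.seed(seed)
    shuffled = subjects[:]
    random.shuffle(shuffled)
    half = len(shuffled) // 2
    plan = {s: treatments for s in shuffled[:half]}               # sequence AB
    plan.update({s: treatments[::-1] for s in shuffled[half:]})   # sequence BA
    return plan

plan = crossover_assignment([f"S{i}" for i in range(1, 9)])
for subject, order in sorted(plan.items()):
    print(subject, "period 1:", order[0], "| period 2:", order[1])
```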

Match Data Analysis and Experimental Design

Data analysis is driven by the experimental design. Issues such as the metric used to measure the treatments and dependent variables, the number of factors, whether the experiment has a between- or within-subjects design, and the use of blocking variables will all determine the particular data analysis technique to be applied. This data analysis technique maps the design to the statistical model in terms of the factors and interactions to be analyzed. However, the choice of data analysis technique and/or statistical model is not always as straightforward as all that:

• Parametric tests are the preferred option, as they are more powerful than non-parametric tests and are capable of analyzing several factors and their interactions. But the data do not always meet the data analysis technique requirements (normality of the data or residuals and/or homogeneity of variances, depending on the technique in question). An alternative to switching to non-parametric tests is to transform the dependent variable. Additionally, some tests are robust to deviations from normality.

• Complex designs may require the addition of some extra factors to the statistical model. Take, for example, crossover designs, where each experimental subject applies all treatments, but different subjects apply treatments in a different order. The order in which subjects apply treatments (sequences) and the times at which each treatment is applied (periods) have to be added to the factor analysis, and a decision has to be made about how to deal with carryover, as sketched below.
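To show what adding sequences and periods to the statistical model can look like in practice, the sketch below analyzes simulated AB/BA crossover data with a mixed-effects model (treatment, period, and sequence as fixed factors, subject as the grouping factor) and checks the normality of the residuals before trusting the parametric result. The data are synthetic and the variable names are assumptions, not an example from this chapter; note that in a 2x2 crossover, any carryover effect is confounded with the sequence term.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from scipy import stats

rng = np.random.default_rng(42)

# Simulated AB/BA crossover: 20 subjects, each applies both treatments.
rows = []
for i in range(20):
    sequence = "AB" if i < 10 else "BA"
    subject_effect = rng.normal(0, 2)              # between-subjects variability
    for period, treatment in enumerate(sequence, start=1):
        effect = 5 if treatment == "B" else 0      # true treatment effect
        rows.append({"subject": f"S{i}", "sequence": sequence, "period": period,
                     "treatment": treatment,
                     "score": 50 + effect + subject_effect + rng.normal(0, 3)})
df = pd.DataFrame(rows)

# Mixed model: treatment, period, and sequence as fixed effects,
# subject as the grouping (random) factor.
model = smf.mixedlm("score ~ C(treatment) + C(period) + C(sequence)",
                    data=df, groups="subject").fit()
print(model.summary())

# Check the normality assumption on the residuals, not on the raw data;
# if it fails badly, transform the dependent variable or go non-parametric.
print("Shapiro-Wilk on residuals:", stats.shapiro(model.resid))
```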

Do Not Rely on Statistical Significance Alone

All experiments report statistical significance. However, statistical significance (the p-value) is the probability of observing an effect at least as large as the one found, given that the null hypothesis is true. In other words, it tells us how likely the observed effect would be if it were merely the result of sampling error rather than of real population characteristics. But it gives no indication of how big the difference between treatments is. For relatively large sample sizes, even very small differences may be statistically significant. For this reason, we also need a measure of practical significance: are the differences between treatments large enough to be really meaningful? This is generally assessed using a measure of effect size. There is a wide range of over 70 effect size measures, capable of reporting different types of effects.
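As a hedged, synthetic illustration of the gap between statistical and practical significance, the sketch below reports both a p-value (Welch’s t-test) and Cohen’s d, one of the many effect size measures; the data and the small true difference are made up for the example.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Two large synthetic samples whose true means differ by a tiny amount.
a = rng.normal(100.0, 15.0, size=5000)
b = rng.normal(101.0, 15.0, size=5000)

t, p = stats.ttest_ind(a, b, equal_var=False)   # Welch's t-test

# Cohen's d using a pooled standard deviation.
pooled_sd = np.sqrt((a.var(ddof=1) + b.var(ddof=1)) / 2)
d = (b.mean() - a.mean()) / pooled_sd

print(f"p-value   = {p:.4f}")    # statistically significant with n = 5000 per group
print(f"Cohen's d = {d:.2f}")    # yet the effect is small (around 0.07)
```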

Do a Power Analysis

Power analysis can be done before (a priori) or after (post-hoc) running the experiment. A priori power analysis tells experimenters what minimum sample size they need to have a reasonable chance of detecting an effect of a given size before they run the experiment. Of course, experimenters will require a bigger sample size if they are looking for small effects than if they are looking for medium or large effects. Post-hoc power analysis determines the power of the study assuming that the sample effect size is equal to the population effect size. While the utility of a priori power analysis is universally accepted, the usefulness of post-hoc power analysis is controversial, as it is a one-to-one function of the statistical significance.
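A minimal sketch of an a priori power analysis follows, using the power module in statsmodels for an independent-samples t-test; the target effect sizes are Cohen’s conventional small/medium/large values, an assumption chosen for illustration rather than anything prescribed here.

```python
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()

# Minimum number of subjects per group needed to detect a given effect size
# with alpha = 0.05 and 80% power in an independent-samples (between-subjects) t-test.
for label, effect_size in [("small", 0.2), ("medium", 0.5), ("large", 0.8)]:
    n = analysis.solve_power(effect_size=effect_size, alpha=0.05, power=0.8)
    print(f"{label} effect (d = {effect_size}): about {n:.0f} subjects per group")
```

As expected, the smaller the effect of interest, the larger the required sample: hundreds of subjects per group for a small effect versus a few dozen per group for a large one.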

Find Explanations for Results

The goal of the experiment is to answer the research questions. Experimenters should not, therefore, just stop when the null hypothesis is (not) rejected; they should question why they got the results that they got. They should hypothesize why one treatment is (or is not) better than the other and what might be causing differences in behavior.

Follow Guidelines for Reporting Experiments

The way in which an experiment and its results are reported is just as important as the actual results.

Improving the Reliability of Experimental Results

A key question after an experiment has been conducted is to what extent the conclusions are valid. Most of these good practices will help us to address the different types of validity threats that experiments suffer from:

• Construct operationalization is related to construct validity.

• The evaluation of different design alternatives is related to internal and external validity.

• Matching data analysis and experimental design, not relying on statistical significance alone, and doing a power analysis are related to statistical conclusion validity.

Additionally, experimentation is part of a learning cycle. The outcomes of an experiment need to be properly interpreted to generate knowledge. The good practices of finding explanations for results and following reporting guidelines will help us to generate this knowledge.

Running SE experiments is a multifaceted process. Many different issues must be taken into account so that the results of the experiment are valid and can be used for knowledge generation. Moreover, SE has some special features, which mean that some experimentation issues have to be conceived of differently than in other disciplines.

Further Reading

[1] Ellis P.D. The essential guide to effect sizes: statistical power, meta-analysis and the interpretation of research results. Cambridge: Cambridge University Press; 2010.

[2] Jedlitschka A., Ciolkowski M., Pfahl D. Reporting controlled experiments in software engineering. In: Shull F., Singer J., Sjoberg D.I., eds. Guide to advanced empirical software engineering. London: Springer; 2008.

[3] Juristo N., Moreno A.M. Basics of software engineering experimentation. Boston, MA: Kluwer; 2001.

[4] Pashler H., Wagenmakers E.-J. Editors' introduction to the special section on replicability in psychological science: a crisis of confidence? Perspect Psychol Sci. 2012;7(6):528–530.

[5] Shadish W.R., Cook T.D., Campbell D.T. Experimental and quasi-experimental designs for generalized causal inference. Boston, MA: Houghton Mifflin Company; 2002.

[6] Wohlin C., Runeson P., Höst M., Ohlsson M.C., Regnell B., Wesslén A. Experimentation in software engineering. Berlin: Springer; 2012.
