Systematic Reviews in Software Engineering

In this section I present the results of some SRs that have challenged “common knowledge” and, in some cases, forced me to revise some of my ideas about software engineering.

Cost Estimation Studies

Cost estimation studies often report empirical studies of industry data sets, so research questions related to cost estimation would appear to be obvious candidates for SRs. Indeed, a review of SRs published between January 2004 and June 2007 found that cost estimation topics were the most common subject for SRs [Kitchenham et al. 2009b]. Two of these reviews are of particular interest because they overturn some of our preconceptions about software cost estimation.

The accuracy of cost estimation models

In two SRs, Magne Jørgensen addressed the issue of whether estimates from cost models (mathematical formulas usually generated from data collected on past projects) are more accurate than expert judgment estimates (estimates based on the subjective opinion of software developers or managers) [Jørgensen 2004], [Jørgensen 2007]. Since the publication of the books by Boehm [Boehm 1981] and DeMarco [DeMarco 1982] in the early 1980s, it has been an article of faith among cost estimation researchers that cost estimation models must be better than expert judgment, but Jørgensen’s reviews were the first attempt to determine whether this belief was supported by empirical evidence.

In [Jørgensen 2004], Jørgensen found 15 primary studies that compared expert judgment estimates with cost estimation models. He categorized 5 as favoring expert judgment, 5 as finding no difference, and 5 as favoring model-based estimation. In [Jørgensen 2007], he identified 16 primary studies that compared expert judgment estimates with formal cost models and found that the average accuracy of the expert judgment estimates was better in 12.
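Comparisons of this kind depend on how estimation accuracy is measured. The following is a minimal sketch of MMRE (mean magnitude of relative error), one of the most common accuracy measures in the cost estimation literature; the project figures are hypothetical and not taken from either review.

```python
# Sketch: MMRE, a common accuracy measure for effort estimates.
# All figures below are hypothetical illustrations.

def mmre(actuals, estimates):
    """Mean of |actual - estimate| / actual over all projects."""
    return sum(abs(a - e) / a for a, e in zip(actuals, estimates)) / len(actuals)

# Hypothetical effort data (person-months) for three projects.
actual = [100, 250, 80]
expert = [90, 270, 85]   # expert judgment estimates
model  = [130, 200, 60]  # cost model estimates

print(round(mmre(actual, expert), 3))  # → 0.081
print(round(mmre(actual, model), 3))   # → 0.25
```

On this (invented) data the expert estimates would be judged more accurate; a lower MMRE means smaller relative errors.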

The differences between the two SRs reflect differences in the included studies and in the focus of each review. In the more recent SR, he omitted three of the primary studies included in the first SR and added four new ones. The studies differed because the initial SR aimed to justify the need to improve the procedures by which experts make their subjective estimates, whereas the second SR was concerned with identifying the contexts in which an expert judgment estimate is more likely to be accurate than the estimate from a formal model, and vice versa. In spite of these differences, the results of both reviews are clearly inconsistent with the view that cost model estimates are necessarily better than expert judgment estimates.

With respect to possible bias in “related research” sections of papers, I was a coauthor of one of the papers included in both of Jørgensen’s reviews [Kitchenham et al. 2002]. In our paper, my colleagues and I identified only 2 of the 12 relevant papers published before 2002 that Jørgensen later found, and both of them, like our own study, found expert judgment better than model-based estimation.

The accuracy of cost estimates in industry

The 1994 CHAOS Report from the Standish Group [The Standish Group 2003] stated that project overruns in the software industry averaged 189% and that only 16% of projects were successful (i.e., within budget and schedule). Subsequent reports from the Standish Group have found lower levels of overruns, with 34% of projects being classified as successful, a change observers have hailed as demonstrating how much software engineering techniques are improving. However, when Moløkken-Østvold and Jørgensen undertook a literature review of software effort estimation surveys ([Moløkken-Østvold et al. 2004], [Moløkken-Østvold and Jørgensen 2003]), they found three industrial surveys undertaken before the CHAOS report that reported average cost overruns of between 30% and 50%. These figures were so different from the CHAOS results that they examined the report carefully to understand why its overrun rates were so high and, as a result of their investigation, omitted it from their survey.

The details of their investigation of the CHAOS report are given in [Jørgensen and Moløkken-Østvold 2006]. They found that the methodology adopted by the Standish Group left much to be desired:

  • The method of calculating the overruns was not specified. When Moløkken-Østvold and Jørgensen ran their own calculations, they found that the overruns should have been close to 89%, not 189%.

  • The Standish Group appeared to have deliberately solicited failure stories.

  • There was no category for under-budget projects.

  • Cost overruns were not well-defined and could have included costs on canceled projects.

They concluded that although the CHAOS report is one of the most frequently cited papers on estimate overruns, its results cannot be trusted.
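Because the Standish Group never specified its calculation method, it helps to see what the conventional definition of cost overrun yields. This is a minimal sketch using that conventional definition; the input figures are hypothetical and chosen only to illustrate how an actual cost of 189 against an estimate of 100 is an 89% overrun, not a 189% one.

```python
# Sketch: the conventional definition of cost overrun.
# The Standish Group's actual method was unspecified; the
# figures here are hypothetical illustrations.

def overrun_pct(actual_cost, estimated_cost):
    """Overrun expressed as a percentage of the original estimate."""
    return 100.0 * (actual_cost - estimated_cost) / estimated_cost

# A project that costs 189 against an estimate of 100 has run over
# by 89 units, i.e., an 89% overrun.
print(overrun_pct(189, 100))  # → 89.0
```

Reporting the ratio of actual to estimated cost (189%) instead of the excess over the estimate (89%) roughly doubles the apparent overrun, which is consistent with Moløkken-Østvold and Jørgensen's recalculation.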

Agile Methods

Currently there is a great deal of interest in Agile methods, both in industry and academia. Agile methods aim to deliver applications that match user requirements as quickly as possible while ensuring high quality. Many different approaches fall under the Agile rubric, but they all emphasize the practices of minimizing unnecessary overheads (i.e., excessive planning and documentation) while concentrating on incremental delivery of client-specified functions. Some of the best-known approaches are:

Extreme programming (XP, XP2)

The original XP method comprises 12 practices: the planning game, small releases, metaphor, simple design, test first, refactoring, pair programming, collective ownership, continuous integration, 40-hour week, on-site customers, and coding standards [Beck 2000]. The revised XP2 method consists of the following “primary practices”: sit together, whole team, informative workspace, energized work, pair programming, stories, weekly cycle, quarterly cycle, slack, 10-minute build, continuous integration, test-first programming, and incremental design [Beck 2004].

Scrum [Schwaber and Beedle 2001]

This method focuses on project management in situations where it is difficult to plan ahead by adopting mechanisms for “empirical process control” that focus on feedback loops. Software is developed by a self-organizing team in increments (called “sprints”), starting with planning and ending with a review. Features to be implemented in the system are registered in a backlog. Then, the product owner decides which backlog items should be developed in the following sprint. Team members coordinate their work in a daily stand-up meeting. One team member, the Scrum master, is in charge of solving problems that stop the team from working effectively.

Dynamic software development method (DSDM) [Stapleton 2003]

This method divides projects into three phases: pre-project, project life cycle, and post-project. It is based on nine principles: user involvement, empowering the project team, frequent delivery, addressing current business needs, iterative and incremental development, allowing changes to be reversed, fixing high-level scope before the project starts, testing throughout the life cycle, and efficient and effective communication.

Lean software development [Poppendieck and Poppendieck 2003]

This adapts principles from lean production, the Toyota production system in particular, to software development. It consists of seven principles: eliminate waste, amplify learning, decide as late as possible, deliver as fast as possible, empower the team, build integrity, and see the whole.

One of the possible limitations of SRs is that they may progress too slowly to be of value in a fast-moving domain such as software engineering. However, there have already been two recent SRs addressing Agile methods: [Dybå and Dingsøyr 2008a] and [Hannay et al. 2009].

Dybå and Dingsøyr

These researchers [Dybå and Dingsøyr 2008a] had three goals:

  • To assess current knowledge of the benefits and limitations of Agile methods

  • To assess the strength of evidence behind that knowledge

  • To apply their results to industry and the research community

They concentrated on Agile development as a whole, and therefore excluded papers that investigated specific methods, such as pair-programming in isolation. They identified 33 relevant primary studies. Most (24) of the studies they included examined professional software engineers. Nine studies took place in a university setting.

With respect to the differences between SRs and informal reviews, Dybå and Dingsøyr report that they found five papers published in or before 2003 that were not reported by two informal reviews published in 2004. With respect to the need for good-quality evidence, they rejected all the papers reported in the two informal reviews because they were either lessons-learned papers or single-practice studies that did not compare the technique they focused on with any alternative.

Dybå and Dingsøyr found that although some studies reported problems with XP (in the context of large, complex projects), most papers discussing XP found that it was easy to introduce and that the approach worked well in a variety of different environments. With respect to the limitations of XP, they found primary studies that reported that the role of the on-site customer was unsustainable in the long term.

However, they found many limitations with the empirical studies of Agile methods. Most of the studies concerned XP, with Scrum and Lean software development discussed in only one paper each. Furthermore, only one research team (which carried out four of the primary studies included) had looked at Agile methods being used by mature teams.

In addition, Dybå and Dingsøyr assessed the quality of the existing evidence in terms of the rigor of the study design, the quality of the individual primary studies (within the constraint of the basic design), the extent to which different studies gave consistent results, and the extent to which the studies were representative of real software development. They judged the overall quality of the evidence to be very low, and concluded that more research is needed on Agile methods other than XP, particularly studies of mature teams, in which researchers adopt more rigorous methods.

Hannay, Dybå, Arisholm, and Sjøberg

These researchers [Hannay et al. 2009] investigated pair-programming and undertook a meta-analysis to aggregate their results. If you are interested in how to do a meta-analysis, this paper provides a sound introduction. Their SR identified 18 primary studies, all of which were experiments. Of the 18 experiments, 4 involved only professional subjects, 1 involved a mix of professionals and students, and the remaining 13 used students. Hannay et al. investigated three different outcomes: quality, duration, and effort (although not every study addressed every outcome). Their initial analysis showed that the use of pair-programming had:

  • A small positive impact on quality

  • A medium positive effect on duration

  • A medium negative effect on effort

These results would seem to support the standard view of the impact of pair-programming. However, the results also indicated that there was significant heterogeneity among the studies. Heterogeneity means that the individual studies arise from different populations, so study results cannot be properly understood unless the different populations can be identified.
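The mechanics behind such an analysis can be made concrete. The following is a minimal sketch of fixed-effect pooling of per-study effect sizes, together with Cochran's Q and the I² statistic as a heterogeneity check; the effect sizes and variances are hypothetical, not Hannay et al.'s data.

```python
# Sketch: fixed-effect meta-analysis with a heterogeneity check.
# The per-study effect sizes and variances below are hypothetical
# illustrations, not data from Hannay et al.

def fixed_effect(effects, variances):
    """Pool effect sizes by inverse-variance weighting; report Q and I^2."""
    weights = [1.0 / v for v in variances]
    pooled = sum(w * e for w, e in zip(weights, effects)) / sum(weights)
    # Cochran's Q: weighted squared deviations from the pooled effect.
    q = sum(w * (e - pooled) ** 2 for w, e in zip(weights, effects))
    df = len(effects) - 1
    # I^2: share of variability due to heterogeneity rather than chance.
    i2 = max(0.0, (q - df) / q) * 100 if q > 0 else 0.0
    return pooled, q, i2

effects = [0.2, 0.5, -0.1, 0.6]      # hypothetical per-study effect sizes
variances = [0.04, 0.05, 0.03, 0.06]  # hypothetical sampling variances
pooled, q, i2 = fixed_effect(effects, variances)
print(round(pooled, 3), round(q, 2), round(i2, 1))  # → 0.228 7.39 59.4
```

Here the pooled effect is small and positive, but I² near 60% signals substantial heterogeneity: the studies disagree more than sampling error alone would explain, which is exactly the situation that forces an analyst to look for distinct underlying populations or moderating variables.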

Hannay et al. found more problems when they investigated the possibility of publication bias. This occurs when papers that fail to demonstrate a significant effect are less likely to be accepted for publication than papers that show a significant effect. Their analysis suggested that publication bias was likely, and that adjusting for it eliminated the quality effect completely, reduced the duration effect from moderate to small, and slightly increased the effort effect (although this last occurred for only one particular analysis model).

They pointed out that the heterogeneity and possible publication bias could have been caused by the presence of moderating variables (i.e., variables that account for differences among study results). To investigate the impact of possible moderating variables, they looked at one study in detail. It was by far the largest study, involving 295 subjects, and used professional software engineers with three levels of experience (senior, intermediate, and junior). They concluded that there is likely to be an interaction between task complexity and outcome such that very complex tasks may achieve high quality using pair-programming but at the cost of considerably greater effort, whereas low-complexity tasks can be performed quickly but at a cost of poor quality. They recommend that researchers focus on moderating variables in future primary studies.

Dybå and Dingsøyr’s study suggests there are some benefits to be obtained from Agile methods [Dybå and Dingsøyr 2008a]. However, Hannay et al.’s meta-analysis suggests that the effects of pair-programming are not as strong as might have been hoped [Hannay et al. 2009]. Overall, the results of both studies suggest not only that the impact of Agile methods needs more research, but more generally, that we also need to improve the research methods we use for primary studies.

Inspection Methods

Inspection techniques are methods of reading code and documents with the aim of identifying defects. They are arguably one of the most researched topics in empirical software engineering. There were 103 human-centered experiments performed between 1993 and 2002 [Sjøberg et al. 2005]; 37 (36%) of them were inspection-related studies. The next largest category was object-oriented (OO) design methods, which were investigated in 8 (7.8%) of these 103 papers. It would seem, therefore, that inspections are a strong candidate for a SR.

Recently, Ciolkowski performed a SR including a meta-analysis to test the claims made concerning perspective-based reading (PBR) [Ciolkowski 2009]. For example, many researchers had suggested that PBR was better than either “ad hoc” reading or checklist-based reading (CBR), and the experts involved in an electronic workshop rated its impact at around 35% [Boehm and Basili 2001], [Shull et al. 2002]. Other researchers were less enthusiastic. Wohlin et al. undertook a study to investigate the feasibility of quantitative aggregation of inspection studies and concluded that CBR was the most effective technique [Wohlin et al. 2003].

Ciolkowski’s initial meta-analysis found no significant impact of PBR. This result showed small to moderate heterogeneity. A subset analysis suggested that PBR was better than ad hoc reading and that CBR was better than PBR for requirements documents but worse for design documents. He also noted that the results were affected by moderating variables such as the origin of the inspection materials and whether the studies were performed by independent research groups. Overall, studies that used the same materials and the same group of researchers were more favorable to PBR than independent studies, although the subgroup analysis was not very robust. Nonetheless, he concluded that claims that PBR increased performance by 35% were not confirmed.
