Organizational Complexity Within Microsoft

We’ve explored the relationship of Conway’s Corollary with developer productivity by examining how “congruence,” Cataldo’s term for conforming to Conway’s Corollary, yields faster task completion. The other important outcome in any software development project is software quality. It is well known that a defect identified later in the development cycle has a much higher cost to correct. Defects that are not identified until after release are the most costly and therefore the most important to minimize. The costs associated with such defects are more than just monetary: perceived quality and company reputation drop when users encounter bugs in software, which can have lasting effects in the marketplace.

In light of Conway’s claims and the cost associated with defects, Nachi Nagappan, Brendan Murphy, and Vic Basili studied organizational structure and post-release defects in Windows Vista [Nagappan et al. 2008] and found a surprisingly strong relationship between the two. In fact, they found that measures of organizational structure were better indicators of software quality than any attributes of the software itself. Put more simply, in the case of Vista, if you want to know whether a piece of software has bugs, you’ll know better by looking at how the people who wrote it were organized than by looking at the code!

At the heart of their study is the premise that software development is a group effort and organizational complexity makes coordinating that effort difficult. As with any empirical study, they had to map their phenomenon down to something that is actually measurable. Measuring software quality is fairly easy. Windows Vista shipped with thousands of “binaries,” a binary being any piece of compiled code that exists as a single file. This includes all executables (.exe), shared libraries (.dll), and drivers (.sys). Whenever a component fails, the failure can be isolated to the binary that caused it, and reports of failures that are sent to Microsoft (remember that fun dialog you see whenever something crashes?) are recorded and used in decision making about how to prioritize bug-fixing efforts. This information, along with other bug reports and their associated fixes, indicates how prone to failure each individual binary is. To evaluate Conway’s Corollary, Nagappan, Murphy, and Basili classified each binary in Vista as either failure-prone or not failure-prone based on how many distinct failures were associated with the binary. Note that if thousands of users send reports of crashes due to one defect, this is still only counted as one defect. Thus, failure-proneness is a measure of the number of defects known to be in a binary.

Measuring organizational complexity is a bit more difficult. Nagappan, Murphy, and Basili proposed eight metrics for organizational complexity, examined the individual correlation of each metric with failure-proneness, and built a regression model to examine the effect of each metric when controlling for the effects of each of the others. They used data from the version control system to determine which software engineers made changes to source files. Build information indicated which source files are used to build each binary. For simplicity, we say that an engineer contributed to, or touched, a binary if he or she made changes to any of the source files that were compiled into that binary.

Nagappan, Murphy, and Basili also gathered data from the company’s organizational chart to see how engineers were related organizationally. Each of their metrics uses information from at least one of these sources (and no other additional information was used). The eight organizational structure metrics are presented here:

Number of engineers

The number of engineers simply indicates how many individual software engineers contributed to a particular binary and are still employed by the company. We expect that higher values will result in more failure-prone binaries. The intuition behind this expectation is based on Brooks’ assertion [Brooks 1974] that if there are n engineers working on one component, there are n(n‒1)/2 possible communication paths. The communication overhead grows more than linearly when a software team grows. The more communication paths that can break down, the more likely there will be a coordination problem, leading to design mismatches, one engineer breaking another’s code, misunderstandings about design rationale, etc. Therefore, we expect to see more failures when more engineers work on a binary.
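Brooks’ formula is easy to see in a few lines of code. This sketch (purely illustrative; the function name is an assumption of ours, not anything from the study) shows how the number of pairwise communication paths grows quadratically, not linearly, with team size:

```python
def communication_paths(n_engineers: int) -> int:
    """Possible pairwise communication paths among n engineers: n(n-1)/2."""
    return n_engineers * (n_engineers - 1) // 2

# Growing the team from 5 to 10 engineers more than quadruples the
# number of paths that can break down:
for n in (2, 5, 10, 50):
    print(n, communication_paths(n))
# 2 -> 1, 5 -> 10, 10 -> 45, 50 -> 1225
```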

Number of ex-engineers

The number of ex-engineers is a measure of the engineers who worked on a binary and left the company prior to the release date. We expect that a higher value will lead to more defects. This is a measure of required knowledge transfer. When an engineer working on a binary leaves, another engineer who is less experienced with the code may step in to perform required work. This new engineer likely is not as familiar with the component’s design, the reasoning behind bug fixes, or who all of the stakeholders in the code are. This therefore increases the probability of mistakes being made and defects being introduced into the binary.

Edit frequency

The edit frequency is the total number of times that a binary was changed, independent of the number of lines in each change. A change is simply a unique commit in the version control system. Higher values of this metric are considered bad for code quality. If a binary has too many changes, this could indicate a lack of stability or control in the code, even if a small group of engineers are responsible for the changes. When taken together with the number of engineers and number of ex-engineers, this metric provides a comprehensive view of the distribution of changes. For instance, did a single engineer make the majority of the changes, or were they widely distributed across a large group? Correlating engineers with edits avoids a situation in which a few engineers making all of the changes can inflate this metric and lead to incorrect conclusions.

Depth of master ownership (DMO)

This measures how widely the work on a binary is distributed within an organization. For a particular binary, each person in the organization is labeled with the number of changes made to the binary by that person and all the people below him or her in the organizational hierarchy. In other words, the changes by individual contributors are summed and assigned to their manager along with his or her personal changes. The depth of master ownership is simply the lowest organizational level of the person labeled with at least 75% of the changes (the master owner). The higher this metric, the lower the master owner in the organization. We expect that a higher metric will be associated with fewer failures. If the majority of the changes to a binary come from a group of organizationally close engineers with a common manager, this low-level manager has ownership and the depth of master ownership is high. This indicates that the engineers working on a binary are in close contact and that coordination doesn’t require communication at high levels of the hierarchy. If we need to move up to a high-level manager to find someone responsible for 75% of the commits, there is no clear owner and changes are coming from many different parts of the organization. This may lead to problems regarding decision making due to many interested parties with different goals. Because a high-level manager has more people beneath him or her and is further removed from most of the actual work, communication is more likely to be delayed, lost, or misunderstood.
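To make the rollup concrete, here is a minimal sketch of the DMO computation over a hypothetical org chart. The names, change counts, and traversal are illustrative assumptions, not the study’s actual code or data:

```python
# Hypothetical org chart: person -> his or her manager (None = top level)
ORG = {
    "vp": None,
    "mgr_a": "vp", "mgr_b": "vp",
    "eng1": "mgr_a", "eng2": "mgr_a", "eng3": "mgr_b",
}
# Hypothetical commits to one binary, per engineer
CHANGES = {"eng1": 40, "eng2": 45, "eng3": 15}

def rolled_up(person):
    """Changes made by `person` plus everyone below him or her."""
    total = CHANGES.get(person, 0)
    for child, mgr in ORG.items():
        if mgr == person:
            total += rolled_up(child)
    return total

def depth(person):
    """Organizational level: 0 at the top, increasing downward."""
    return 0 if ORG[person] is None else 1 + depth(ORG[person])

def depth_of_master_ownership(threshold=0.75):
    total = sum(CHANGES.values())
    owners = [p for p in ORG if rolled_up(p) >= threshold * total]
    # The master owner is the *lowest* such person, i.e., greatest depth
    return max(depth(p) for p in owners)

print(depth_of_master_ownership())
```

Here mgr_a accumulates 85 of the 100 changes, so ownership resolves one level below the top; had the changes been split across both managers, only the VP would cross the 75% threshold and the DMO would be 0 (considered bad).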

Percentage of organization contributing to development

This measures the proportion of the people in an organization that report up to the master owner identified by the DMO, the previous metric. If the organization that owns a binary has 100 people in it and 25 of those people report to the master owner, for instance, this metric has a value of 0.25. We expect that lower values will indicate less failure-prone binaries. This metric aids the DMO metric when considering organizations that are unbalanced, for instance, if an organization has two managers but one manager deals with 50 people and the other only 10. A lower value indicates that ownership is more concentrated and that there is less coordination overhead across the organization.

Level of organizational code ownership

This is the percent of changes from the owning organization, or, when there is no defined owning organization, the largest contributing organization. Higher values are better. Different organizations within the same company will often have differing goals, work practices, and culture. Coordination between parties across organizational boundaries will need to deal with these differences, increasing the likelihood of broken builds, synchronization problems, and defects in the code. If there is an owner for a binary, then it is better for the majority of the changes to come from the same organization as the owner. If there is no clear owner, we still expect better outcomes if one organization makes the vast majority of the changes.

Overall organizational ownership

This measure is the ratio of the number of engineers who made changes to a binary and report to the master owner to the total number of engineers who made changes to the binary. We expect higher values to be associated with fewer failures. If the majority of the people making changes to a binary are managed by the owning manager, there is less required coordination outside the scope of that manager and therefore a lower chance of a coordination breakdown. This metric complements the DMO metric in that a low depth of master ownership (considered bad) may be offset by a high overall organizational ownership level (considered good).

Organization intersection factor

This measures the number of organizations that make at least 10% of the changes to a binary. Higher values indicate more organizationally distributed binaries, which we expect to lead to more failures. When more organizations make changes to a binary, this indicates no clear owner. There also may be competing goals within organizations touching the binary. The interest that multiple organizations show in the binary may also indicate that it represents an important “join-point” within the system architecture.
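As a sketch, the intersection factor is just a thresholded count of contributing organizations. The organization names and change counts below are hypothetical:

```python
# Hypothetical per-organization change counts for one binary
changes_by_org = {"org_a": 60, "org_b": 25, "org_c": 10, "org_d": 5}

def intersection_factor(changes, cutoff=0.10):
    """Number of organizations making at least `cutoff` of the changes."""
    total = sum(changes.values())
    return sum(1 for c in changes.values() if c >= cutoff * total)

print(intersection_factor(changes_by_org))  # org_a, org_b, org_c qualify -> 3
```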

Each of these metrics measures some form of organizational complexity per binary. More organizational complexity indicates that the structure of the organization is poorly aligned with the structure of the software system itself. If our corollary to Conway’s Law is correct, we will observe better outcomes when there is less organizational complexity and more “fit” between the social and technical structures.

So how did these organizational measures hold up when compared to post-release defects in Windows Vista? To answer this question, Nagappan, Murphy, and Basili used a form of quantitative analysis known as logistic regression. Logistic regression takes as input a series of independent variables and produces a classification as output. In this case, each of the organizational metrics is measured for each binary and is used as the independent variables. Nagappan, Murphy, and Basili used post-release defect data to classify the binaries into failure-prone and not failure-prone. They then used logistic regression to determine whether there was a relationship between an independent variable (an organizational metric) and a classification (failure-prone or not). What’s more, this form of analysis can determine whether one organizational metric is related to failure-proneness when controlling for the effects of the other organizational metrics.

This is a little like studying the effect of a person’s age and height on his weight. If we examine the relationship of age with weight, a strong relationship will emerge, and likewise with height and weight. However, if you control for the height of a person, there is little to no relationship between age and weight (on average, someone in the U.S. who is six feet tall weighs about 210 lbs., regardless of whether he is 30 or 60). Logistic regression can also be used to make predictions. One can “train” a logistic regression model using a sample of people whose age, height, and weight are known, and then predict the weight of a new person by examining only his age and height.
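The mechanics of such a classifier can be sketched in a few lines. This is a toy example, not the study’s model or data: it fits a one-variable logistic regression by stochastic gradient descent to synthetic “binaries,” where a single made-up organizational metric determines failure-proneness:

```python
import math

# Synthetic training data: (metric value, failure-prone label)
data = [(x, 1 if x > 5 else 0) for x in range(11)]

def sigmoid(z):
    # Numerically safe logistic function
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

w = b = 0.0
for _ in range(500):            # training epochs
    for x, y in data:
        p = sigmoid(w * x + b)  # predicted probability of failure-proneness
        w -= 0.1 * (p - y) * x  # gradient step on the log-loss
        b -= 0.1 * (p - y)

def predict(x):
    return sigmoid(w * x + b) > 0.5

# A high metric value is classified failure-prone; a low one is not
print(predict(9), predict(1))
```

The study used several organizational metrics as independent variables rather than one, but the principle is the same: the fitted coefficients tell you how each metric is related to the classification when the others are held fixed.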

The results of the study by Nagappan, Murphy, and Basili showed that each of the eight organizational metrics studied had a statistically significant effect on failure, even when controlling for the effects of the other seven metrics. That is, the higher the number of ex-engineers that worked on a binary, the more failures in that binary, even when removing the effects of the total number of engineers, depth of master ownership, etc. This indicates not only that organizational complexity is related to post-release defects, but also that each of the organizational metrics is measuring something at least slightly different from the others.

Next, Nagappan, Murphy, and Basili attempted to predict the failure-prone binaries based on the organizational metrics. Prediction is more difficult than simply finding a relationship because there are many factors that can affect an outcome. One might determine that age is strongly related to weight, but that doesn’t mean that one can accurately predict the weight of someone given only her age. The researchers found, however, that these organizational complexity metrics actually could predict failure-proneness with surprising accuracy. In fact, the organizational metrics were far better predictors than characteristics of the source code itself that have been shown to be highly indicative of failures (e.g., lines of code, cyclomatic complexity, dependencies, code coverage by testing).

The standard way to compare different prediction techniques is to examine precision and recall, two complementary measures of a predictor’s accuracy:

  • Precision, in this case, indicates how many of the binaries that a predictor classifies as failure-prone actually are failure-prone. A low value indicates many false positives, i.e., binaries that were predicted to be failure-prone but actually are not.

  • Recall indicates how many of the binaries that actually are failure-prone binaries are predicted to be failure-prone. A low value indicates that there are many failure-prone binaries that the predictor is incorrectly classifying as not failure-prone.
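Both measures can be computed directly from the sets of predicted and actual failure-prone binaries. A small sketch, with hypothetical binary names:

```python
def precision_recall(predicted, actual):
    """predicted/actual: sets of binaries classified/known as failure-prone."""
    true_positives = len(predicted & actual)
    precision = true_positives / len(predicted)
    recall = true_positives / len(actual)
    return precision, recall

# 3 of the 4 predicted binaries really are failure-prone (precision 0.75),
# and they cover 3 of the 5 actual failure-prone binaries (recall 0.6)
pred = {"a.dll", "b.exe", "c.sys", "d.dll"}
act = {"a.dll", "b.exe", "c.sys", "e.dll", "f.exe"}
print(precision_recall(pred, act))  # -> (0.75, 0.6)
```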

If possible, we would like to maximize both precision and recall. An ideal predictor would predict all the actually failure-prone binaries as such and not predict any others as failure-prone. Table 11-1 shows the recall and precision for the organizational structure regression model compared to other methods of predicting failure-prone binaries in the past that have used aspects of the source code.

Table 11-1. Prediction accuracy

Model                      Precision    Recall
Organizational structure   86.2%        84.0%
Code churn                 78.6%        79.9%
Code complexity            79.3%        66.0%
Dependencies               74.4%        69.9%
Code coverage              83.8%        54.4%
Pre-release bugs           73.8%        62.9%

Implications

What do these results actually mean? First, within the context of Windows Vista development, it is clear that organizational complexity has a strong relationship with post-release defects. These results confirm our corollary that (at least within the context of Windows Vista) when communication and coordination mirrors the technical artifact itself, the software is “better.” When developers from different organizations are working on a common binary, that binary is much more likely to be failure-prone. There are many possible explanations for this. In the case of engineers leaving the company, binary-specific domain knowledge is lost, and low depth of master ownership means that communication between engineers working on the same binary must flow through upper levels of management and deal with the accompanying overhead, delay, and information loss.

Nagappan, Murphy, and Basili didn’t identify exactly which factors were causing the post-release defects. Such an endeavor would be quite difficult and most likely economically infeasible given the scale of Windows Vista. It would require something akin to a controlled “clinical trial” in which one group of engineers writes an operating system in a setting where organizational structure is minimized while another “placebo” group does the same thing in a more traditional structure as it exists at present.

Despite the lack of a detailed causality analysis, we can still benefit a great deal from these results. If practitioners believe that there is a causal relationship between organizational complexity and post-release defects, they may use such knowledge to minimize this complexity. Instead of allowing engineers throughout the organization to make changes to a binary, a clear owning team may be appointed to act as a gatekeeper or change reviewer. Due to interdependencies within any non-trivial software project, it is unlikely that organizationally complex components can be avoided entirely, but they may be identified up front. Stabilizing the interfaces of such components early in the development cycle may mitigate the negative effects of organizational complexity later on. An analysis of which binaries suffer from the most organizational complexity can be used late in the development cycle to direct testing resources to the components most likely to exhibit problems after release.

How much can we trust these results? To answer this question we have to understand the context of the study. It compared software components (binaries) that had different characteristics but were part of the same project. This approach avoids a number of confounding factors because it is more of an apples-to-apples comparison than comparing components from two different projects. Attempting to isolate the effects of a particular factor by examining different projects is prone to problems because each project is inherently different in many regards. A difference in outcomes may be due to the factor under study or to some external, unobserved factor, such as team experience, software domain, or tools used. Within the Windows organization inside of Microsoft, there is a consistent process, the same tools are used throughout, and decisions are made in similar ways. There is even a strong effort to make the process consistent and integrated across development sites in different countries. The consistency of the Windows development effort mitigates threats to the internal validity of this study and provides confidence that the relationships Nagappan, Murphy, and Basili observed between organizational complexity and post-release defects are not the result of some other form of bias due to differences in the binaries.

This study was also large-scale, comprising thousands of developers and binaries and tens of millions of lines of code. Given Brooks’ observations about the super-linear growth of coordination and communication requirements, it would seem that the effects of coordination mismatches would be magnified in larger projects. Windows Vista is clearly one of the largest pieces of software in use today, both in terms of developer teams and code size. It is therefore reasonable to expect that smaller projects (at least in terms of number of engineers) would not suffer as much from organizational complexity. The authors replicated their study on a reduced data set to determine the level at which the organizational metrics were good indicators of failure-proneness and found that the effects could be observed on a team size of 30 engineers and three levels of depth in the management hierarchy.

A reading of the original study shows careful attention to detail, a clear knowledge of the issues involved in a quantitative study (e.g., the authors used principal component analysis to mitigate the threat of multicollinearity in organizational metrics), and a fairly comprehensive examination of the results. The hypotheses are supported by both prior theory (Conway’s Law, Brooks’ Law, etc.) and empirical evidence. Further, discussions with those who have management experience in software projects indicate that these results match their intuition and experiences. As scientists, our confidence in theories increases as experiments and studies fail to refute them. Replications of this study on additional projects and in other contexts are beneficial because they can indicate whether the findings are fairly universal, specific to Vista, or more generally true under certain constraints. Ultimately, practitioners must examine both the context and results of empirical studies to make informed judgments about how well the results may generalize to their own particular scenarios.
