Data science revolution in process improvement and assessment?

M. Oivo, University of Oulu, Oulu, Finland

Abstract

“In God we trust, all others bring data.” This legendary statement from quality guru W. Edwards Deming has been used and quoted countless times in motivating and justifying the role of data in software process improvement (SPI). It is now more relevant than ever, but data science has the potential to revolutionize the collection and analysis of data in SPI.

Keywords

Software process improvement (SPI); Plan-do-check-act (PDCA) cycle; Assessments; Process improvement; Predictive analytics; Cross-project and open source repositories

Deming was promoting statistical process control. His thinking in the famous plan-do-check-act (PDCA) cycle is based on planning improvement actions, implementing them, measuring the effect, and then acting accordingly. In PDCA, we predefine actions, and consequently we predefine what we want to measure. This predefined data is used to guide improvement actions. The fundamental change that data science can bring about is that we are now able to analyze large amounts of data and draw conclusions based on that data without having preplanned the data collection specifically for predefined actions or items.

There have been SPI methods and solutions since the 1980s. Assessment-based SPI approaches collect and analyze data with predefined methods. They have been popular since the introduction of the SEI’s capability maturity model (CMM) in the late 1980s, and its successor, the capability maturity model integration (CMMI), since 2000. An international standard for process assessment and improvement (ISO/IEC 15504) is another key approach. The automobile industry has been an active user of ISO/IEC 15504 and has its own version called Automotive SPICE.

The assessment process is based on a predefined data collection, which occurs mainly in the form of document analysis and interviews. The basic idea behind assessment methods is using a reference model to which the processes of an organization can be compared. Process capability or organizational maturity is determined based on an assessment of the processes of the organization. Improvement recommendations are made based on the assessment results, aiming at higher capability and maturity levels as defined in the model. The data collection is predefined by the process model and practices of the assessment method.

There were research efforts as early as the 1990s that attempted to integrate assessments with measurements, in order to automate at least part of the data collection and to enable continuous assessment that could replace or complement regular assessments, which are done only at relatively long intervals. One of the key problems of automatic data gathering is that there is a lot of data, and it is unstructured, incomplete, and often hard to get. Assessments require precise and complete data. Does this sound familiar to data scientists, who are used to working with large amounts of data that is unstructured, imperfect, and imprecise? They are, however, still able to analyze big data and draw useful conclusions.

So why not use data science in assessments? Would it be sufficient for process improvement to have the big picture in an assessment that is based on the analysis of unstructured big data to complement the traditional assessment that digs into all the nitty-gritty details of every project that happens to be in the focus of the assessment? Would it be better if all the projects were in the focus of an assessment using data science rather than thoroughly analyzing a sample of only a few projects? Could we replace the traditional aim for accuracy that uses a (small) sample in the assessment with company-wide coverage of all projects with big data approaches? Would the results actually be more accurate this way?

The traditional problem with SPI is the lack of evidence of the effect of improvement actions. It is very difficult, if not impossible, to show with hard evidence what the benefits of certain SPI actions are. Real life is complicated, and the effects of SPI are mixed with a myriad of other events in companies. Proving a causal link between SPI actions and quantitative improvements is extremely tough. Finding correlations is often the only thing we can do. But wait a minute! Isn’t this exactly what data science is doing? Could SPI learn something from data science? Would it sometimes be okay to find good enough correlations rather than attempting to find elusive but accurate causes behind the effects of SPI actions? Would it be enough to know “what” is happening, or to predict what will happen after improvement actions, rather than trying to know accurately “why” it is happening?
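As a minimal sketch of what “finding a good enough correlation” can look like in practice, the following Python snippet computes the Pearson correlation coefficient between an improvement indicator and a quality outcome. The two series (`review_coverage` and `defect_density`) are synthetic, illustrative values and not measurements from any real organization. A strong coefficient here would only say that the two move together. It would not show that the improvement action caused the change.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


# Synthetic data, one value per release (assumption, for illustration only):
# share of changes that went through code review, and
# post-release defects per thousand lines of code.
review_coverage = [0.10, 0.25, 0.40, 0.55, 0.70, 0.85]
defect_density = [9.1, 8.4, 6.9, 6.2, 4.8, 4.1]

r = pearson(review_coverage, defect_density)
print(f"correlation between review coverage and defect density: r = {r:.2f}")
```

A coefficient close to -1 or 1 would justify a closer look at the improvement action, while other events in the company during the same period remain possible explanations.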

There is already a wealth of experience in using data science in business intelligence. Big data may even be considered as a “hype term.” Data science has also been used in business process improvement. SPI has many similarities with business process improvement and can benefit from the experiences in business processes.

Software development produces a lot of data through issue tracking systems, version control systems, test data, code, documents, and so on. The amount of available data explodes further when keystrokes, voice, and video are recorded. This kind of data may not have any structure and can include just about anything, but it is potentially useful for big data analytics. We will surely face privacy issues similar to those in most big data applications that use personal data. Despite these challenges, the payback may be considerable.

Another area in which data science has not yet reached its potential is the use of customer data to guide and improve the products and software development processes. The game industry and Internet companies are pioneering these approaches. By launching games or services in limited customer segments, they quickly collect usage data and user experiences to improve their products and software processes.

Traditional SPI tries to define processes in order to optimize them. However, it is often challenging to distinguish between official defined processes and actual processes used in practice. An alternative is to deduce the software processes from the data gathered from actual software development, resulting in a description of the actual process, not just the official or desired process. This data can also be used to analyze the workflows and identify bottlenecks.
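A minimal sketch of deducing the actual process from development data: the Python snippet below builds a directly-follows relation from an issue tracker event log and reports the mean waiting time per transition. The log format, the status names, and all timestamps in `EVENT_LOG` are hypothetical and synthetic. A real tracker export would need its own parsing step.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical event log: (issue id, new status, ISO timestamp).
# All values are synthetic and only illustrate the idea.
EVENT_LOG = [
    ("A-1", "open", "2024-01-01T09:00"),
    ("A-1", "in_progress", "2024-01-02T09:00"),
    ("A-1", "review", "2024-01-03T09:00"),
    ("A-1", "done", "2024-01-08T09:00"),
    ("A-2", "open", "2024-01-01T10:00"),
    ("A-2", "in_progress", "2024-01-01T16:00"),
    ("A-2", "review", "2024-01-02T10:00"),
    ("A-2", "in_progress", "2024-01-05T10:00"),  # rework loop after review
    ("A-2", "review", "2024-01-06T10:00"),
    ("A-2", "done", "2024-01-10T10:00"),
]


def directly_follows(log):
    """Map each observed (from_status, to_status) pair to its waiting times in hours."""
    traces = defaultdict(list)
    for case_id, status, stamp in log:
        traces[case_id].append((datetime.fromisoformat(stamp), status))
    transitions = defaultdict(list)
    for events in traces.values():
        events.sort()  # order each issue's events by time
        for (t0, s0), (t1, s1) in zip(events, events[1:]):
            transitions[(s0, s1)].append((t1 - t0).total_seconds() / 3600)
    return dict(transitions)


def mean_waits(transitions):
    """Mean waiting time in hours for each transition."""
    return {edge: sum(hours) / len(hours) for edge, hours in transitions.items()}


transitions = directly_follows(EVENT_LOG)
waits = mean_waits(transitions)
bottleneck = max(waits, key=waits.get)

for (src, dst), hours in sorted(waits.items(), key=lambda kv: -kv[1]):
    count = len(transitions[(src, dst)])
    print(f"{src} -> {dst}: observed {count}x, mean wait {hours:.1f} h")
print("slowest transition:", " -> ".join(bottleneck))
```

Transitions that appear in the data but not in the official process description, such as the step from review back to in_progress in this toy log, are the kind of difference between the defined and the actual process that the paragraph above describes.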

Many of the SPI methods are based mainly on analyzing the past and making an assessment and improvement recommendations based on that analysis. However, what we really need is to know the future. We would like to predict what will happen in our future endeavors or estimate the effort required for software development. This is where predictive analytics and estimation methods come into play. We already have a growing community of researchers who work on analyzing large amounts of software engineering data to learn from it and to predict what will happen in software development if we take certain actions. These analysis methods go far beyond the simple postmortem analysis.

An interesting trend is the use of large cross-project and open source repositories for analyzing software engineering data and drawing conclusions from that data. Several such repositories have emerged. One promising example is the PROMISE repository (http://openscience.us/repo/), which includes data from hundreds of projects. It aims to serve as a long-term repository for software engineering data that researchers worldwide can use. Good examples of the use of such data include defect and cost modeling, which are used for prediction and estimation.
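To make the idea of estimation from cross-project data concrete, here is a minimal sketch of effort estimation by analogy (k-nearest neighbors), evaluated with leave-one-out cross-validation and the mean magnitude of relative error (MMRE). The `PROJECTS` table is a synthetic stand-in for a cross-project repository. It is not taken from PROMISE or any other real dataset, and the choice of features and of k = 2 is an assumption made for illustration.

```python
from math import dist  # available from Python 3.8

# Synthetic stand-in for a cross-project repository (assumption):
# (size in KLOC, team size, actual effort in person-months).
PROJECTS = [
    (10, 3, 24), (12, 4, 30), (25, 5, 62), (30, 6, 80),
    (45, 8, 120), (55, 9, 150), (70, 12, 210), (80, 12, 235),
]


def scale(rows):
    """Min-max scale each feature column to the range [0, 1]."""
    columns = list(zip(*rows))
    lows = [min(c) for c in columns]
    highs = [max(c) for c in columns]
    return [
        tuple((v - lo) / (hi - lo) if hi > lo else 0.0
              for v, lo, hi in zip(row, lows, highs))
        for row in rows
    ]


def knn_estimate(train_x, train_y, query, k=2):
    """Estimate effort as the mean effort of the k most similar projects."""
    nearest = sorted(zip(train_x, train_y), key=lambda p: dist(p[0], query))[:k]
    return sum(y for _, y in nearest) / len(nearest)


def leave_one_out_mmre(projects, k=2):
    """Hold out each project in turn, estimate it from the rest, and average the relative errors."""
    # Feature scaling uses all projects; efforts of held-out projects are never used.
    features = scale([p[:2] for p in projects])
    efforts = [p[2] for p in projects]
    errors = []
    for i, (x, actual) in enumerate(zip(features, efforts)):
        train_x = features[:i] + features[i + 1:]
        train_y = efforts[:i] + efforts[i + 1:]
        predicted = knn_estimate(train_x, train_y, x, k)
        errors.append(abs(predicted - actual) / actual)
    return sum(errors) / len(errors)


mmre = leave_one_out_mmre(PROJECTS)
print(f"leave-one-out MMRE with k=2: {mmre:.2f}")
```

Holding out each project in turn imitates the situation of estimating a new project from the history of others, which is the setting in which cross-project repositories are meant to help.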
