2.4. Making Detective Work Easier through Dynamic Visualization

To solve a mystery, a detective has to spot clues and patterns of behavior and then generate working hypotheses that are consistent with the evidence. This is usually done in an iterative way, by gathering more evidence and by enlarging or shifting the scope of the investigation as knowledge is developed. So it is with generating hypotheses through EDA.

We have seen that the first, and sometimes only, step in managing uncertainty is to identify and quantify sources of variation. Building on the old adage that "a picture is worth a thousand words," it is clear that graphical displays should play a key role here, especially when the software allows you to interact freely with the graphical views. Thanks to advances in technology, most Six Sigma practitioners now have capabilities on their desktops that 10 years ago were the sole province of researchers, and 30 years ago were not even foreseen. It is fortunate, though not entirely coincidental, that this capability has become widely available just as data volumes continue to escalate.

Incidentally, many of the statistical methods that fall under CDA, and that today are in routine use by the Six Sigma community, were originally developed to squeeze the most out of a small volume of data, often using nothing more than a calculator or pen and paper. Increasingly, the Six Sigma practitioner faces a quite different challenge, in which the sheer volume of data (rows and columns) can make the application of statistical testing, should it be needed, difficult and questionable.
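To make this concrete, consider a minimal sketch (using SciPy on synthetic data of our own invention) of why routine significance testing becomes questionable at scale: with very large samples, even a practically negligible difference in means produces a vanishingly small p-value.

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)

    # Two hypothetical process streams whose means differ by a
    # practically negligible 0.02 standard deviations.
    a = rng.normal(loc=100.00, scale=1.0, size=200_000)
    b = rng.normal(loc=100.02, scale=1.0, size=200_000)

    t, p = stats.ttest_ind(a, b)
    print(f"t = {t:.2f}, p = {p:.2g}")
    # With 200,000 observations per group, the test will typically
    # declare this trivial difference "significant"; statistical
    # significance no longer implies practical importance.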

At this point, let us consider the appropriate role of visualization and, tangentially, data mining within Six Sigma. Visualization, which has a long and interesting history of its own, is conventionally considered valuable in three ways:

  1. Checking raw data for anomalies (EDA).

  2. Exploring data to discover plausible models (EDA).

  3. Checking model assumptions (CDA).

Given the crucial role of communication in Six Sigma, we can add two additional ways in which visualization has value:

  4. Investigation of model outcomes (EDA and CDA).

  5. Communication of results to others (EDA and CDA).

There are many ways to display data visually. Several of these, such as histograms, scatterplots, Pareto plots, and box plots, are already in widespread use. However, the simple idea of providing multiple, linked views of the data with which you can interact via software takes current Six Sigma analysis to another level of efficiency and effectiveness. For example, imagine clicking on a bar in a Pareto chart and seeing the corresponding points in a scatterplot become highlighted. Imagine what can be learned! Unfortunately, much software is relatively static, offering little more than a computerized version of what is possible on the printed page. In contrast, we see the dynamic aspect of good visualization software as critical to the detective work of EDA, which relies on an unfolding, rather than pre-planned, set of steps.
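As one minimal sketch of such linking, the following example uses the Altair plotting library in Python; the data set and its column names (defect, pressure, strength) are invented purely for illustration. Clicking a bar in the Pareto-style chart highlights the corresponding points in the scatterplot.

    import altair as alt
    import numpy as np
    import pandas as pd

    rng = np.random.default_rng(1)
    n = 300
    df = pd.DataFrame({
        "defect": rng.choice(["scratch", "dent", "stain", "crack"], n,
                             p=[0.4, 0.3, 0.2, 0.1]),
        "pressure": rng.normal(50, 5, n),
        "strength": rng.normal(200, 20, n),
    })

    # Clicking a bar selects one defect type.
    select = alt.selection_point(fields=["defect"])

    # Pareto-style bar chart of defect counts, sorted descending.
    pareto = (
        alt.Chart(df)
        .mark_bar()
        .encode(
            x=alt.X("defect:N", sort="-y"),
            y="count()",
            color=alt.condition(select, alt.value("steelblue"),
                                alt.value("lightgray")),
        )
        .add_params(select)
    )

    # Scatterplot in which points matching the clicked bar stay highlighted.
    scatter = (
        alt.Chart(df)
        .mark_circle()
        .encode(
            x="pressure:Q",
            y="strength:Q",
            color=alt.condition(select, alt.value("steelblue"),
                                alt.value("lightgray")),
        )
    )

    # Save the two linked views side by side; open in a browser to interact.
    (pareto | scatter).save("linked_views.html")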

Visualization remains an active area of research, particularly when data volumes are high. But there are already many new, useful graphical displays. For example, the parallel coordinates plots used for visualizing data with many columns are well known within the visualization community but have not yet spread widely into the Six Sigma world.
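For readers who have not met them, a parallel coordinates plot draws each row of the data as a polyline across a set of vertical axes, one per column, making patterns across many columns visible at once. A minimal static sketch, using pandas with invented column names, looks like this:

    import matplotlib.pyplot as plt
    import numpy as np
    import pandas as pd
    from pandas.plotting import parallel_coordinates

    rng = np.random.default_rng(2)
    n = 60

    # Hypothetical multi-column process data; 'line' labels each row.
    df = pd.DataFrame({
        "temp": rng.normal(0, 1, n),
        "pressure": rng.normal(0, 1, n),
        "speed": rng.normal(0, 1, n),
        "yield": rng.normal(0, 1, n),
        "line": rng.choice(["A", "B"], n),
    })

    # Each row becomes one polyline across the four measurement axes,
    # colored by production line.
    parallel_coordinates(df, class_column="line",
                         color=["steelblue", "orange"])
    plt.show()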

Additionally, although there are established principles about the correct ways to represent data graphically, the fact that two individuals may perceive patterns differently means that good software should offer a wide repertoire of representations, ideally all dynamically linked with one another. We hope to demonstrate through the case studies that this comprehensive dynamic linking is a powerful capability for hypothesis generation. To emphasize this desirable aspect, from now on we will refer to dynamic visualization, rather than simply visualization.

Not only does dynamic visualization support EDA when data volumes are large; in our experience it is also very powerful when data volumes are modest. For instance, if the distributions of two or more variables are linked together, you can quickly and easily see the balance of the data, that is, which values or levels of one variable occur with those of another. If the data are perfectly balanced, then tabulation may provide the same insight, but if the data are only nearly balanced, or unbalanced, as is more often the case, the linked distributions will usually be much easier to interpret. With dynamic visualization, we can assess many views of the data quickly and efficiently.
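As a simple illustration of such a balance check, a two-way tabulation (here with pandas, on invented factors) shows at a glance which level combinations are well or poorly represented; linked distribution views convey the same information interactively.

    import pandas as pd

    # Hypothetical run data: which machine ran on which shift.
    runs = pd.DataFrame({
        "machine": ["M1", "M1", "M1", "M2", "M2", "M3", "M3", "M3", "M3"],
        "shift":   ["day", "day", "night", "day", "night",
                    "day", "night", "night", "night"],
    })

    # A balanced design would show roughly equal counts in every cell;
    # here the tabulation immediately reveals the imbalance.
    print(pd.crosstab(runs["machine"], runs["shift"]))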

The mention of large data volumes inevitably raises the topic of data mining. This is a rapidly moving field, so a precise definition is difficult. Essentially, data mining is the process of sorting through large amounts of data and picking out relevant information using techniques from machine learning and statistics. In many cases, the data are split into at least two sets: a model is built using one set and then tested or validated on the second. Once the model is built, it is used to score new data as they arrive, thereby making (hopefully) useful predictions.
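The split-build-validate-score sequence just described might look like the following sketch with scikit-learn; the synthetic data and choice of model are ours, purely for illustration.

    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import train_test_split

    # Synthetic stand-in for a large historical data set.
    X, y = make_classification(n_samples=5_000, n_features=10,
                               random_state=0)

    # Split: build the model on one set, validate on the held-out set.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=0)

    model = RandomForestClassifier(random_state=0).fit(X_train, y_train)
    print(f"validation accuracy: {model.score(X_test, y_test):.3f}")

    # Once validated, the model scores new data as they arrive.
    X_new, _ = make_classification(n_samples=5, n_features=10,
                                   random_state=1)
    predictions = model.predict(X_new)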

As with traditional statistical analysis, there are several processes that you can use in data mining. In most data-mining applications, the software automates each step of the process, usually applying some prescribed stopping rule to determine when there is no further structure in the data to model. As such, many data-mining efforts have a strong flavor of CDA. However, EDA can bring high value to data-mining applications, especially in Six Sigma settings. We will see two such applications in our case studies.
