Chapter 5. Statistical Data Analysis Techniques

The intent of this chapter is not to make the reader an expert in statistical techniques. Rather, it is to familiarize the reader with the basic statistical techniques in use and demonstrate how Java can support statistical analysis. While there are quite a variety of data analysis techniques, in this chapter, we will focus on the more common tasks.

These techniques range from the relatively simple mean calculation to sophisticated regression analysis models. Statistical analysis can be a very complicated process and requires significant study to be conducted properly. We will start with an introduction to basic statistical analysis techniques, including calculating the mean, median, mode, and standard deviation of a dataset. There are numerous approaches used to calculate these values, which we will demonstrate using standard Java and third-party APIs. We will also briefly discuss sample size and hypothesis testing.

Regression analysis is an important technique for analyzing data. The technique creates a line that tries to match the dataset. The equation representing the line can be used to predict future behavior. There are several types of regression analysis. In this chapter, we will focus on simple linear regression and multiple regression. With simple linear regression, a single factor such as age is used to predict some behavior such as the likelihood of eating out. With multiple regression, multiple factors such as age, income level, and marital status may be used to predict how often a person eats out.

Predictive analytics, or analysis, is concerned with predicting future events. Many of the techniques used in this book are concerned with making predictions. Specifically, the regression analysis part of this chapter predicts future behavior.

Before we see how Java supports regression analysis, we need to discuss basic statistical techniques. We begin with mean, mode, and median.

In this chapter, we will cover the following topics:

  • Working with mean, mode, and median
  • Standard deviation and sample size determination
  • Hypothesis testing
  • Regression analysis

Working with mean, mode, and median

The mean, median, and mode are basic ways to describe characteristics or summarize information from a dataset. When a new, large dataset is first encountered, it can be helpful to know basic information about it to direct further analysis. These values are often used in later analysis to generate more complex measurements and conclusions. For example, the mean of a dataset is used to calculate the standard deviation, which we will demonstrate in the Standard deviation section of this chapter.

Calculating the mean

The mean, also called the average, is computed by adding the values in a list and then dividing the sum by the number of values. This technique is useful for determining the general trend for a set of numbers. It can also be used to fill in missing data elements. We are going to examine several ways to calculate the mean for a given set of data using standard Java libraries as well as third-party APIs.

Using simple Java techniques to find mean

In our first example, we will demonstrate a basic way to calculate mean using standard Java capabilities. We will use an array of double values called testData:

double[] testData = {12.5, 18.7, 11.2, 19.0, 22.1, 14.3, 16.9, 12.5,
   17.8, 16.9}; 

We create a double variable to hold the sum of all of the values and a double variable to hold the mean. A loop is used to iterate through the data and add values together. Next, the sum is divided by the length of our array (the total number of elements) to calculate the mean:

double total = 0; 
for (double element : testData) { 
   total += element; 
} 
double mean = total / testData.length; 
out.println("The mean is " + mean); 

Our output is as follows:

The mean is 16.19
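
The same summing loop can support the missing-data use mentioned earlier. The following is a minimal sketch, not a definitive approach. It assumes that missing readings are encoded as Double.NaN, which is our assumption rather than something the chapter specifies, and the class and method names are hypothetical:

```java
import java.util.Arrays;

class MeanImputationSketch {

    // Returns a copy of data in which every NaN (assumed to mark a missing
    // value) is replaced by the mean of the values that are present.
    static double[] fillMissingWithMean(double[] data) {
        double total = 0;
        int count = 0;
        for (double value : data) {
            if (!Double.isNaN(value)) {
                total += value;
                count++;
            }
        }
        double[] filled = data.clone();
        if (count == 0) {
            return filled; // nothing to average, so leave the copy unchanged
        }
        double mean = total / count;
        for (int i = 0; i < filled.length; i++) {
            if (Double.isNaN(filled[i])) {
                filled[i] = mean;
            }
        }
        return filled;
    }

    public static void main(String[] args) {
        double[] withGaps = {10.0, Double.NaN, 20.0, 30.0};
        // Prints [10.0, 20.0, 20.0, 30.0]
        System.out.println(Arrays.toString(fillMissingWithMean(withGaps)));
    }
}
```

The method works on a copy, so the caller's array keeps its original gaps and can still be inspected afterwards.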

Using Java 8 techniques to find mean

Java 8 provided additional capabilities with the introduction of optional classes. We are going to use the OptionalDouble class in conjunction with the Arrays class's stream method in this example. We will use the same array of doubles we used in the previous example to create an OptionalDouble object. If any of the numbers in the array, or the sum of the numbers, is not a number (NaN), the value held by the OptionalDouble object will also be NaN. If the array is empty, the OptionalDouble object will hold no value at all:

OptionalDouble mean = Arrays.stream(testData).average(); 
 

We use the isPresent method to determine whether a mean was calculated. The average method returns an empty OptionalDouble only when the stream contains no elements. In that case, the isPresent method returns false and we can handle the situation ourselves:

 
if (mean.isPresent()) { 
    out.println("The mean is " + mean.getAsDouble()); 
} else { 
    out.println("The stream was empty"); 
} 
 

Our output is the following:

The mean is 16.19

Another, more succinct, technique using the OptionalDouble class involves lambda expressions and the ifPresent method. This method executes its argument only if mean holds a value:

OptionalDouble mean = Arrays.stream(testData).average(); 
mean.ifPresent(x-> out.println("The mean is " + x)); 

Our output is as follows:

The mean is 16.19

Finally, we can use the orElse method to print either the mean or an alternate value if mean is empty:

OptionalDouble mean = Arrays.stream(testData).average(); 
out.println("The mean is " + mean.orElse(0)); 

Our output is the same:

The mean is 16.19
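
To see when these methods actually take the alternate path, it helps to try an empty array. The sketch below is illustrative only: the class name and the meanOrDefault helper are hypothetical, while stream, average, isPresent, and orElse are the standard Java 8 calls used above:

```java
import java.util.Arrays;
import java.util.OptionalDouble;

class OptionalMeanSketch {

    // Returns the mean of data, or the supplied fallback when data is empty.
    static double meanOrDefault(double[] data, double fallback) {
        OptionalDouble mean = Arrays.stream(data).average();
        return mean.orElse(fallback);
    }

    public static void main(String[] args) {
        double[] values = {1.0, 2.0, 3.0};
        double[] empty = {};

        // A non-empty stream yields a value; an empty stream does not.
        System.out.println(Arrays.stream(values).average().isPresent()); // true
        System.out.println(Arrays.stream(empty).average().isPresent());  // false

        // orElse only supplies the fallback in the empty case.
        System.out.println(meanOrDefault(values, 0.0)); // 2.0
        System.out.println(meanOrDefault(empty, 0.0));  // 0.0
    }
}
```

Note that a NaN element does not make the OptionalDouble empty. The result is still present, but its value is NaN, so a NaN check is a separate concern from isPresent.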

For our next two mean examples, we will use third-party libraries and continue using the array of doubles, testData.

Using Google Guava to find mean

In this example, we will use the Google Guava library, introduced in Chapter 3, Data Cleaning. The Stats class provides functionality for handling numeric data, including finding the mean and the standard deviation, which we will demonstrate later. To calculate the mean, we first create a Stats object using our testData array and then execute the mean method:

Stats testStat = Stats.of(testData); 
double mean = testStat.mean(); 
out.println("The mean is " + mean); 

Notice how the default format of the output in this example differs from that of the previous examples.

Using Apache Commons to find mean

In our final mean examples, we use the Apache Commons libraries, also introduced in Chapter 3, Data Cleaning. We first create a Mean object and then execute the evaluate method using our testData. This method returns a double representing the mean of the values in the array:

Mean mean = new Mean(); 
double average = mean.evaluate(testData); 
out.println("The mean is " + average); 

Our output is the following:

The mean is 16.19

Apache Commons also provides a helpful DescriptiveStatistics class. We will use this later to demonstrate median and standard deviation, but we begin by calculating the mean. Here we instantiate its subclass, SynchronizedDescriptiveStatistics, which is advantageous as it is synchronized and therefore thread safe.

We start by creating our DescriptiveStatistics object, statTest. We then loop through our double array and add each item to statTest. We can then invoke the getMean method to calculate the mean:

DescriptiveStatistics statTest =  
    new SynchronizedDescriptiveStatistics(); 
for(double num : testData){ 
   statTest.addValue(num); 
} 
out.println("The mean is " + statTest.getMean()); 
 

Our output is as follows:

The mean is 16.19

Next, we will cover the related topic: median.

Calculating the median

The mean can be misleading if the dataset contains a large number of outlying values or is otherwise skewed. When this happens, the mode and median can be useful. The median is the value in the middle of a sorted range of values. For an odd number of values, this is easy to compute. For an even number of values, the median is calculated as the average of the middle two values.

Using simple Java techniques to find median

In our first example, we will use a basic Java approach to calculate the median. For these examples, we have modified our testData array slightly:

double[] testData = {12.5, 18.3, 11.2, 19.0, 22.1, 14.3, 16.2, 12.5,
   17.8, 16.5}; 

First, we use the Arrays class to sort our data because finding the median is simplified when the data is in numeric order:

Arrays.sort(testData); 

We then handle three possibilities:

  • Our list is empty
  • Our list has an even number of values
  • Our list has an odd number of values

The following code could be shortened, but we have been explicit to help clarify the process. If our list has an even number of values, we divide the length of the list by 2. The first variable, mid1, holds the first of the two middle values, found at the index just below that result. The second variable, mid2, holds the second middle value, found at the result itself. The average of these two numbers is our median value. Finding the median of a list with an odd number of values is simpler. Because array indexes start at 0, integer division of the length by 2 gives the index of the middle element directly:

if(testData.length==0){    // Empty list 
   out.println("No median. Length is 0"); 
}else if(testData.length%2==0){    // Even number of elements 
   double mid1 = testData[(testData.length/2)-1]; 
   double mid2 = testData[testData.length/2]; 
   double med = (mid1 + mid2)/2; 
   out.println("The median is " + med); 
}else{   // Odd number of elements 
   // Integer division yields the zero-based middle index 
   double mid = testData[testData.length/2]; 
   out.println("The median is " + mid); 
} 

Using the preceding array, which contains an even number of values, our output is:

The median is 16.35

To test our code for an odd number of elements, we will add the double 12.5 to the end of the array. Our new output is as follows:

The median is 16.2

Using Apache Commons to find the median

We can also calculate the median using the Apache Commons DescriptiveStatistics class demonstrated in the Calculating the mean section. We will continue using the testData array with the following values:

double[] testData = {12.5, 18.3, 11.2, 19.0, 22.1, 14.3, 16.2, 12.5,
   17.8, 16.5, 12.5}; 

Our code is very similar to what we used to calculate the mean. We simply create our DescriptiveStatistics object and call the getPercentile method, which returns an estimate of the value stored at the percentile specified in its argument. To find the median, we use the value of 50:

DescriptiveStatistics statTest =  
    new SynchronizedDescriptiveStatistics(); 
for(double num : testData){ 
   statTest.addValue(num); 
} 
out.println("The median is " + statTest.getPercentile(50)); 
 

Our output is as follows:

The median is 16.2
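
For comparison with the Apache Commons result, the median logic can also be packaged as a small reusable method in plain Java. This is a sketch under our own assumptions: the class and method names are hypothetical, and we chose to sort a copy so that the caller's array is left in its original order:

```java
import java.util.Arrays;

class MedianSketch {

    // Returns the median of data without modifying the caller's array.
    static double median(double[] data) {
        if (data.length == 0) {
            throw new IllegalArgumentException("No median for an empty array");
        }
        double[] sorted = data.clone();
        Arrays.sort(sorted);
        int mid = sorted.length / 2; // zero-based middle index
        if (sorted.length % 2 == 0) {
            // Even length: average the two middle values
            return (sorted[mid - 1] + sorted[mid]) / 2;
        }
        // Odd length: the middle element is the median
        return sorted[mid];
    }

    public static void main(String[] args) {
        double[] testData = {12.5, 18.3, 11.2, 19.0, 22.1, 14.3, 16.2, 12.5,
            17.8, 16.5, 12.5};
        // Prints "The median is 16.2"
        System.out.println("The median is " + median(testData));
    }
}
```

For this 11-element array, the sorted middle element sits at index 5, which is 16.2 and matches the value returned by getPercentile(50).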

Calculating the mode

The term mode is used for the most frequently occurring value in a dataset. This can be thought of as the most popular result, or the highest bar in a histogram. It can be a useful piece of information when conducting statistical analysis but it can be more complicated to calculate than it first appears. To begin, we will demonstrate a simple Java technique using the following testData array:

double[] testData = {12.5, 18.3, 11.2, 19.0, 22.1, 14.3, 16.2, 12.5, 
   17.8, 16.5, 12.5}; 

We start off by initializing three variables. The mode and modeCount variables hold the mode value and the number of times this value occurs in the list, respectively. The tempCnt variable is used to count the number of times the element currently being tested occurs in the list:

int modeCount = 0;    
double mode = 0;            
int tempCnt = 0; 

We then use nested for loops to compare each value of the array to the other values within the array. When we find matching values, we increment our tempCnt. After comparing each value, we test to see whether tempCnt is greater than modeCount, and if so, we change our modeCount and mode to reflect the new values:

for (double testValue : testData){ 
   tempCnt = 0; 
   for (double value : testData){ 
         if (testValue == value){ 
               tempCnt++; 
         } 
   } 
 
   if (tempCnt > modeCount){ 
         modeCount = tempCnt; 
         mode = testValue; 
   } 
} 
out.println("The mode is " + mode + " and appears " + modeCount + " times."); 

Using this example, our output is as follows:

The mode is 12.5 and appears 3 times.

While our preceding example seems straightforward, it poses potential problems. Modify the testData array as shown here, where the last entry is changed to 11.2:

double[] testData = {12.5, 18.3, 11.2, 19.0, 22.1, 14.3, 16.2, 12.5,
   17.8, 16.5, 11.2}; 
 

When we execute our code this time, our output is as follows:

The mode is 12.5 and appears 2 times.

The problem is that our testData array now contains two values that appear two times each, 12.5 and 11.2. This is known as a multimodal set of data. We can address this through basic Java code and through third-party libraries, as we will show in a moment.

However, first we will show two approaches using simple Java. The first approach will use two ArrayList instances and the second will use an ArrayList and a HashMap instance.

Using ArrayLists to find multiple modes

In the first approach, we modify the code used in the last example to use the ArrayList class. We will create two ArrayList instances, one to hold the unique numbers within the dataset and one to hold the count of each number. We also need a tempMode variable, which will hold the highest count found:

ArrayList<Integer> modeCount = new ArrayList<Integer>();  
ArrayList<Double> mode = new ArrayList<Double>();   
int tempMode = 0; 

Next, we will loop through the array and test for each value in our mode list. If the value is not found in the list, we add it to mode and set the same position in modeCount to 1. If the value is found, we increment the same position in modeCount by 1:

for (double testValue : testData){ 
   int loc = mode.indexOf(testValue); 
   if(loc == -1){ 
         mode.add(testValue); 
         modeCount.add(1); 
   }else{ 
         modeCount.set(loc, modeCount.get(loc)+1); 
   } 
} 

Next, we loop through our modeCount list to find the largest value. This represents the frequency of the most common value in the dataset, not the mode itself. Tracking the frequency rather than a single value allows us to select multiple modes:

for(int cnt = 0; cnt < modeCount.size(); cnt++){ 
   if (tempMode < modeCount.get(cnt)){ 
         tempMode = modeCount.get(cnt); 
   } 
} 

Finally, we loop through our modeCount list again and print out any elements in mode whose corresponding count in modeCount equals this largest value:

for(int cnt = 0; cnt < modeCount.size(); cnt++){ 
   if (tempMode == modeCount.get(cnt)){ 
         out.println(mode.get(cnt) + " is a mode and appears " +  
             modeCount.get(cnt) + " times."); 
   } 
} 

When our code is executed, our output reflects our multimodal dataset:

12.5 is a mode and appears 2 times.
11.2 is a mode and appears 2 times.

Using a HashMap to find multiple modes

The second approach uses a HashMap. First, we create an ArrayList to hold possible modes, as in the previous example. We also create our HashMap and a variable, maxMode, to hold the highest count found:

ArrayList<Double> modes = new ArrayList<Double>(); 
HashMap<Double, Integer> modeMap = new HashMap<Double, Integer>(); 
int maxMode = 0; 

Next, we loop through our testData array and count the number of occurrences of each value in the array. We then add the count of each value and the value itself to the HashMap. If the count for the value is larger than our maxMode variable, we set maxMode to our new largest number:

for (double value : testData) { 
   int modeCnt = 0; 
   if (modeMap.containsKey(value)) { 
         modeCnt = modeMap.get(value) + 1; 
   } else { 
         modeCnt = 1; 
   } 
   modeMap.put(value, modeCnt); 
   if (modeCnt > maxMode) { 
         maxMode = modeCnt; 
   } 
} 

Finally, we loop through our HashMap and retrieve our modes, or all values with a count equal to our maxMode:

for (Map.Entry<Double, Integer> multiModes : modeMap.entrySet()) { 
   if (multiModes.getValue() == maxMode) { 
         modes.add(multiModes.getKey()); 
   } 
} 
for(double mode : modes){ 
   out.println(mode + " is a mode and appears " + maxMode + " times.");
} 

When we execute our code, we get the same output as in the previous example:

12.5 is a mode and appears 2 times.
11.2 is a mode and appears 2 times.
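
The counting step can also be expressed with the Java 8 stream API introduced earlier in the chapter. The following is a hedged sketch rather than a replacement for the approaches above. The class and method names are hypothetical, and we chose a TreeMap so that the modes come back in ascending order:

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.Function;
import java.util.stream.Collectors;

class StreamModeSketch {

    // Counts how many times each distinct value occurs, keyed in ascending order.
    static Map<Double, Long> frequencies(double[] data) {
        return Arrays.stream(data).boxed()
            .collect(Collectors.groupingBy(
                Function.identity(), TreeMap::new, Collectors.counting()));
    }

    // Returns every value whose count equals the highest count.
    static List<Double> modes(double[] data) {
        Map<Double, Long> counts = frequencies(data);
        if (counts.isEmpty()) {
            return Collections.emptyList();
        }
        long max = Collections.max(counts.values());
        return counts.entrySet().stream()
            .filter(entry -> entry.getValue() == max)
            .map(Map.Entry::getKey)
            .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        double[] testData = {12.5, 18.3, 11.2, 19.0, 22.1, 14.3, 16.2, 12.5,
            17.8, 16.5, 11.2};
        Map<Double, Long> counts = frequencies(testData);
        for (double mode : modes(testData)) {
            System.out.println(mode + " is a mode and appears "
                + counts.get(mode) + " times.");
        }
    }
}
```

Because the keys are sorted, this version reports 11.2 before 12.5, whereas the loops above report the modes in the order the underlying collection yields them.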

Using Apache Commons to find multiple modes

Another option uses the Apache Commons StatUtils class. This class contains several methods for statistical analysis, including multiple methods for the mean, but we will only examine the mode here. The method is named mode and takes an array of doubles as its parameter. It returns an array of doubles containing all modes of the dataset:

double[] modes = StatUtils.mode(testData); 
for(double mode : modes){ 
   out.println(mode + " is a mode."); 
} 

One disadvantage is that we are not able to count the number of times our mode appears within this method. We simply know what the mode is, not how many times it appears. When we execute our code, we get a similar output to our previous example:

12.5 is a mode.
11.2 is a mode.