Regression analysis

Regression analysis is useful for determining trends in data. It indicates the relationship between dependent and independent variables. The independent variables determine the value of a dependent variable. Each independent variable can have either a strong or a weak effect on the value of the dependent variable. Linear regression uses a line in a scatterplot to show the trend. Non-linear regression uses some sort of curve to depict the relationships.

For example, there is a relationship between blood pressure and various factors such as age, salt intake, and Body Mass Index (BMI). The blood pressure can be treated as the dependent variable and the other factors as independent variables. Given a dataset containing these factors for a group of individuals, we can perform regression analysis to identify trends.

There are several types of regression analysis supported by Java libraries. We will be examining simple linear regression and multiple linear regression. Both approaches take a dataset and derive a linear equation that best fits the data. Simple linear regression uses a single dependent and a single independent variable. Multiple linear regression uses a single dependent variable and multiple independent variables.

There are several APIs that support simple linear regression, including the Apache Commons Math library, which we use in this section. Java support for nonlinear regression is also available in several libraries.

There are several statistics that evaluate the effectiveness of an analysis. We will focus on basic statistics.

Residuals are the difference between the actual data values and the predicted values. The Residual Sum of Squares (RSS) is the sum of the squares of residuals. Essentially it measures the discrepancy between the data and a regression model. A small RSS indicates the model closely matches the data. RSS is also known as the Sum of Squared Residuals (SSR) or the Sum of Squared Errors (SSE) of prediction.

The Mean Square Error (MSE) is the sum of squared residuals divided by the degrees of freedom. The number of degrees of freedom is the number of independent observations (N) minus the number of population parameters estimated. For simple linear regression, this is N - 2 because two parameters (the slope and the intercept) are estimated. For multiple linear regression, it depends on the number of independent variables used.

A small MSE also indicates that the model fits the dataset well. You will see both of these statistics used when discussing linear regression models.
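These two statistics are straightforward to compute directly. The following sketch, in plain Java with made-up actual and predicted values (not taken from any dataset in this chapter), computes the RSS and the MSE for a simple linear regression, using N - 2 degrees of freedom:

```java
public class FitStats {
    // RSS: sum of the squared residuals (actual - predicted)
    public static double rss(double[] actual, double[] predicted) {
        double sum = 0;
        for (int i = 0; i < actual.length; i++) {
            double residual = actual[i] - predicted[i];
            sum += residual * residual;
        }
        return sum;
    }

    // MSE for simple linear regression: RSS divided by N - 2,
    // since two parameters (slope and intercept) are estimated
    public static double mse(double[] actual, double[] predicted) {
        return rss(actual, predicted) / (actual.length - 2);
    }

    public static void main(String[] args) {
        double[] actual    = {3.0, 5.0, 7.0, 9.0};
        double[] predicted = {2.5, 5.5, 7.0, 9.0};
        System.out.println("RSS: " + rss(actual, predicted)); // 0.25 + 0.25 = 0.5
        System.out.println("MSE: " + mse(actual, predicted)); // 0.5 / (4 - 2) = 0.25
    }
}
```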

The correlation coefficient measures the association between two variables of a regression model. The correlation coefficient ranges from -1 to +1. A value of +1 means that the two variables are perfectly related: when one increases, so does the other. A value of -1 means that the two variables are negatively related: when one increases, the other decreases. A value of 0 means there is no correlation between the variables. The coefficient is frequently designated as R. It is often squared, thus ignoring the sign of the relation. Pearson's product-moment correlation coefficient is normally used.
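Pearson's R can be computed directly as the covariance of the two variables divided by the product of their standard deviations. Here is a minimal plain-Java sketch, using hypothetical sample arrays chosen so the variables are perfectly positively related:

```java
public class CorrelationExample {
    // Pearson's product-moment correlation coefficient:
    // R = cov(x, y) / (stddev(x) * stddev(y))
    public static double pearsonR(double[] x, double[] y) {
        int n = x.length;
        double meanX = 0, meanY = 0;
        for (int i = 0; i < n; i++) {
            meanX += x[i];
            meanY += y[i];
        }
        meanX /= n;
        meanY /= n;
        double cov = 0, varX = 0, varY = 0;
        for (int i = 0; i < n; i++) {
            double dx = x[i] - meanX;
            double dy = y[i] - meanY;
            cov  += dx * dy;
            varX += dx * dx;
            varY += dy * dy;
        }
        return cov / Math.sqrt(varX * varY);
    }

    public static void main(String[] args) {
        double[] x = {1, 2, 3, 4};
        double[] y = {2, 4, 6, 8}; // y = 2x: perfectly positively related
        System.out.println("R: " + pearsonR(x, y)); // 1.0
    }
}
```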

Using simple linear regression

Simple linear regression uses a least squares approach, where a line is computed that minimizes the sum of the squared distances between the points and the line. Sometimes the line is calculated without using the Y intercept term. The regression line is an estimate. We can use the line's equation to predict other data points. This is useful when we want to predict future events based on past performance.
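The least squares line has a well-known closed form: the slope is the covariance of X and Y divided by the variance of X, and the intercept follows from the means. The sketch below implements these formulas directly, using hypothetical points chosen to lie exactly on the line y = 2x + 1 (the Apache Commons class used later performs this computation for us):

```java
public class LeastSquaresFit {
    // Closed-form least squares estimates:
    // slope = sum((x - meanX)(y - meanY)) / sum((x - meanX)^2)
    // intercept = meanY - slope * meanX
    public static double[] fit(double[] x, double[] y) {
        int n = x.length;
        double meanX = 0, meanY = 0;
        for (int i = 0; i < n; i++) {
            meanX += x[i];
            meanY += y[i];
        }
        meanX /= n;
        meanY /= n;
        double num = 0, den = 0;
        for (int i = 0; i < n; i++) {
            num += (x[i] - meanX) * (y[i] - meanY);
            den += (x[i] - meanX) * (x[i] - meanX);
        }
        double slope = num / den;
        double intercept = meanY - slope * meanX;
        return new double[]{slope, intercept};
    }

    public static void main(String[] args) {
        // Points lying exactly on y = 2x + 1
        double[] x = {1, 2, 3, 4};
        double[] y = {3, 5, 7, 9};
        double[] line = fit(x, y);
        System.out.println("slope: " + line[0] + ", intercept: " + line[1]);
    }
}
```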

In the following example we use the Apache Commons SimpleRegression class with the Belgium population dataset used in Chapter 4, Data Visualization. The data is duplicated here for your convenience:

Decade    Population
1950      8639369
1960      9118700
1970      9637800
1980      9846800
1990      9969310
2000      10263618

While the application we will demonstrate is a JavaFX application, we will focus on its linear regression aspects; the JavaFX code serves only to generate a chart showing the regression results.

The body of the start method follows. The input data is stored in a two-dimensional array as shown here:

double[][] input = {{1950, 8639369}, {1960, 9118700},  
    {1970, 9637800}, {1980, 9846800}, {1990, 9969310},  
    {2000, 10263618}}; 

An instance of the SimpleRegression class is created and the data is added using the addData method:

SimpleRegression regression = new SimpleRegression(); 
regression.addData(input); 

We will use the model to predict behavior for several years as declared in the array that follows:

double[] predictionYears = {1950, 1960, 1970, 1980, 1990, 2000,  
    2010, 2020, 2030, 2040}; 

We will also format our output using the following NumberFormat instances. One is used for the year; calling its setGroupingUsed method with a parameter of false suppresses commas.

NumberFormat yearFormat = NumberFormat.getNumberInstance(); 
yearFormat.setMaximumFractionDigits(0); 
yearFormat.setGroupingUsed(false); 
NumberFormat populationFormat = NumberFormat.getNumberInstance(); 
populationFormat.setMaximumFractionDigits(0); 

The SimpleRegression class possesses a predict method that is passed a value, a year in this case, and returns the estimated population. We use this method in a loop and call the method for each year:

for (int i = 0; i < predictionYears.length; i++) { 
    out.println(yearFormat.format(predictionYears[i]) + "-" 
            + populationFormat.format(regression.predict(predictionYears[i]))); 
} 

When the program is executed, we get the following output:

1950-8,801,975
1960-9,112,892
1970-9,423,808
1980-9,734,724
1990-10,045,641
2000-10,356,557
2010-10,667,474
2020-10,978,390
2030-11,289,307
2040-11,600,223

To see the results graphically, we generated the following index chart. The line matches the actual population values fairly well and shows the projected populations in the future.


Simple Linear Regression

The SimpleRegression class supports a number of methods that provide additional information about the regression. These methods are summarized next:

Method                Meaning
getR                  Returns Pearson's product-moment correlation coefficient
getRSquare            Returns the coefficient of determination (R-square)
getMeanSquareError    Returns the MSE
getSlope              Returns the slope of the line
getIntercept          Returns the intercept

We used the helper method, displayAttribute, to display various attribute values as shown here:

private void displayAttribute(String attribute, double value) {         
    NumberFormat numberFormat = NumberFormat.getNumberInstance(); 
    numberFormat.setMaximumFractionDigits(2); 
    out.println(attribute + ": " + numberFormat.format(value)); 
} 

We called the previous methods for our model as shown next:

displayAttribute("Slope", regression.getSlope()); 
displayAttribute("Intercept", regression.getIntercept()); 
displayAttribute("MeanSquareError", 
    regression.getMeanSquareError()); 
displayAttribute("R", regression.getR()); 
displayAttribute("RSquare", regression.getRSquare()); 

The output follows:

Slope: 31,091.64
Intercept: -51,826,728.48
MeanSquareError: 24,823,028,973.4
R: 0.97
RSquare: 0.94

As you can see, the model fits the data nicely.

Using multiple regression

Our intent is not to provide a detailed explanation of multiple linear regression, as that would be beyond the scope of this section. A more thorough treatment can be found at http://www.biddle.com/documents/bcg_comp_chapter4.pdf. Instead, we will explain the basics of the approach and show how we can use Java to perform multiple regression.

Multiple regression works with data where multiple independent variables exist. This happens quite often. Consider that the fuel efficiency of a car can be dependent on the octane level of the gas being used, the size of the engine, the average cruising speed, and the ambient temperature. All of these factors can influence the fuel efficiency, some to a larger degree than others.

The dependent variable is normally represented as Y, and the multiple independent variables are represented using different Xs. A simplified equation for a regression using three independent variables follows, where each variable has a coefficient. The first term is the intercept. These coefficients are not intended to represent real values but are only used for illustrative purposes.

Y = 11 + 0.75 X1 + 0.25 X2 − 2 X3

The intercept and coefficients are generated using a multiple regression model based on sample data. Once we have these values, we can create an equation to predict other values.
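To make this concrete, the following sketch evaluates the illustrative equation above for one hypothetical set of X values. The coefficients are the made-up ones from the text, not fitted values:

```java
public class EquationExample {
    // Illustrative equation from the text: Y = 11 + 0.75 X1 + 0.25 X2 - 2 X3
    public static double predict(double x1, double x2, double x3) {
        return 11 + 0.75 * x1 + 0.25 * x2 - 2 * x3;
    }

    public static void main(String[] args) {
        // Hypothetical independent variable values
        double y = predict(4, 8, 2);
        System.out.println("Y = " + y); // 11 + 3 + 2 - 4 = 12
    }
}
```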

We will use the Apache Commons OLSMultipleLinearRegression class to perform multiple regression using cigarette data. The data has been adapted from http://www.amstat.org/publications/jse/v2n1/datasets.mcintyre.html. The data consists of 25 entries for different brands of cigarettes with the following information:

  • Brand name
  • Tar content (mg)
  • Nicotine content (mg)
  • Weight (g)
  • Carbon monoxide content (mg)

The data has been stored in a file called data.csv, as shown in the following partial listing of its contents, where the column values match the order of the previous list:

Alpine,14.1,.86,.9853,13.6
Benson&Hedges,16.0,1.06,1.0938,16.6
BullDurham,29.8,2.03,1.1650,23.5
CamelLights,8.0,.67,.9280,10.2
...

The following is a scatter plot chart showing the relationship of the data:


Multiple Regression Scatter plot

We will use a JavaFX program to create the scatter plot and to perform the analysis. We start with the MainApp class as shown next. In this example we will focus on the multiple regression code and we do not include the JavaFX code used to create the scatter plot. The complete program can be downloaded from http://www.packtpub.com/support.

The data is held in a one-dimensional array and a NumberFormat instance will be used to format the values. The array size reflects the 25 entries and the 4 values per entry. We will not be using the brand name in this example.

public class MainApp extends Application {
    private final double[] data = new double[100];
    private final NumberFormat numberFormat =
       NumberFormat.getNumberInstance();
    ...
    public static void main(String[] args) {
        launch(args);
    }
  }

The data is read into the array using a CSVReader instance as shown next:

int i = 0;
try (CSVReader dataReader = new CSVReader(
        new FileReader("data.csv"), ',')) {
    String[] nextLine;
    while ((nextLine = dataReader.readNext()) != null) {
        String brandName = nextLine[0];
        double tarContent = Double.parseDouble(nextLine[1]);
        double nicotineContent = Double.parseDouble(nextLine[2]);
        double weight = Double.parseDouble(nextLine[3]);
        double carbonMonoxideContent = 
            Double.parseDouble(nextLine[4]);
        data[i++] = carbonMonoxideContent;
        data[i++] = tarContent;
        data[i++] = nicotineContent;
        data[i++] = weight;
        ...
    }
}

Apache Commons possesses two classes that perform multiple regression:

  • OLSMultipleLinearRegression - Ordinary Least Squares (OLS) regression
  • GLSMultipleLinearRegression - Generalized Least Squares (GLS) regression

The latter technique is used when correlations exist within elements of the model that would otherwise adversely impact the results. We will use the OLSMultipleLinearRegression class and start with its instantiation:

OLSMultipleLinearRegression ols = 
    new OLSMultipleLinearRegression();

We will use the newSampleData method to initialize the model. This method needs the number of observations in the dataset and the number of independent variables. It may throw an IllegalArgumentException exception which needs to be handled.

int numberOfObservations = 25;
int numberOfIndependentVariables = 3;
try {
    ols.newSampleData(data, numberOfObservations, 
        numberOfIndependentVariables);
} catch (IllegalArgumentException e) {
     // Handle exceptions
}

Next, we set the number of digits that will follow the decimal point to two and invoke the estimateRegressionParameters method. This returns an array of values for our equation, which are then displayed:

numberFormat.setMaximumFractionDigits(2);
double[] parameters = ols.estimateRegressionParameters();
for (int i = 0; i < parameters.length; i++) {
    out.println("Parameter " + i +": " + 
        numberFormat.format(parameters[i]));
}

When executed we will get the following output which gives us the parameters needed for our regression equation:

Parameter 0: 3.2
Parameter 1: 0.96
Parameter 2: -2.63
Parameter 3: -0.13

To predict a new dependent value based on a set of independent variables, the getY method is declared, as shown next. The parameters parameter contains the generated equation coefficients. The arguments parameter contains the values for the independent variables. These are used to calculate the new dependent value, which is returned:

public double getY(double[] parameters, double[] arguments) {
    double result = 0;
    for(int i=0; i<parameters.length; i++) {
        result += parameters[i] * arguments[i];
    }
    return result;
}

We can test this method by creating a series of independent values. Here we used the same values as used for the SalemUltra entry in the data file:

double arguments1[] = {1, 4.5, 0.42, 0.9106};
out.println("X: " + 4.9 + "  y: " + 
        numberFormat.format(getY(parameters,arguments1)));

This will give us the following values:

X: 4.9 y: 6.31

The return value of 6.31 is different from the actual value of 4.9. However, using the values for VirginiaSlims:

double arguments2[] = {1, 15.2, 1.02, 0.9496};
out.println("X: " + 13.9 + "  y: " + 
        numberFormat.format(getY(parameters,arguments2)));

We get the following result:

X: 13.9 y: 15.03

This is close to the actual value of 13.9. Next, we use a different set of values than found in the dataset:

double arguments3[] = {1, 12.2, 1.65, 0.86};
out.println("X: " + 9.9 + "  y: " + 
        numberFormat.format(getY(parameters,arguments3)));

The result follows:

X: 9.9 y: 10.49

The values differ but are still close. The following figure shows the predicted data in relation to the original data:


Multiple regression projected

The OLSMultipleLinearRegression class also possesses several methods to evaluate how well the model fits the data. However, due to the complex nature of multiple regression, we have not discussed them here.
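If you need a quick fit statistic without those methods, the R-square value discussed earlier can be computed by hand from the residuals as 1 - RSS/TSS, where TSS is the total sum of squares of the actual values around their mean. The sketch below uses hypothetical actual and predicted values, not results from the cigarette model:

```java
public class RSquared {
    // R-square = 1 - RSS/TSS, where TSS measures the total variation
    // of the actual values around their mean
    public static double rSquared(double[] actual, double[] predicted) {
        double mean = 0;
        for (double v : actual) {
            mean += v;
        }
        mean /= actual.length;
        double rss = 0, tss = 0;
        for (int i = 0; i < actual.length; i++) {
            rss += (actual[i] - predicted[i]) * (actual[i] - predicted[i]);
            tss += (actual[i] - mean) * (actual[i] - mean);
        }
        return 1 - rss / tss;
    }

    public static void main(String[] args) {
        // Hypothetical actual and predicted values
        double[] actual    = {10.2, 16.6, 23.5, 13.6};
        double[] predicted = {10.5, 16.0, 23.0, 14.0};
        System.out.println("R-square: " + rSquared(actual, predicted));
    }
}
```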
