Regression

We will explore basic regression algorithms through an analysis of the energy efficiency dataset (Tsanas and Xifara, 2012). We will investigate the heating and cooling load requirements of buildings based on their construction characteristics, such as surface, wall, and roof area, height, glazing area, and compactness. The researchers used a simulator to design 12 different house configurations while varying 18 building characteristics. In total, 768 different buildings were simulated.

Our first goal is to systematically analyze the impact each building characteristic has on the target variable, that is, heating or cooling load. The second goal is to compare the performance of a classical linear regression model against other methods, such as SVM regression, random forests, and neural networks. For this task, we will use the Weka library.

Loading the data

Download the energy efficiency dataset from https://archive.ics.uci.edu/ml/datasets/Energy+efficiency.

The dataset is in Excel's XLSX format, which cannot be read by Weka. We can convert it to the Comma Separated Values (CSV) format by clicking File | Save As… and picking CSV in the save dialog, as shown in the following screenshot. Confirm to save only the active sheet (since all the others are empty) and confirm to continue losing some formatting features. Now, the file is ready to be loaded by Weka:

[Screenshot: the Save As… dialog with the CSV format selected]

Open the file in a text editor and check whether it was indeed correctly converted. Minor issues can creep in during the export and may cause problems later. For instance, in my export, each line ended with a double semicolon, as follows:

X1;X2;X3;X4;X5;X6;X7;X8;Y1;Y2;;
0,98;514,50;294,00;110,25;7,00;2;0,00;0;15,55;21,33;;
0,98;514,50;294,00;110,25;7,00;3;0,00;0;15,55;21,33;;

To remove the doubled semicolon, we can use the Find and Replace function: find ";;" and replace it with ";". Note also that, depending on your locale settings, Excel may export decimal values with commas, as in the preceding sample; for Weka to parse these columns as numeric, the decimal commas have to be replaced with dots.

The second problem was that my file had a long list of empty lines at the end of the document, which can simply be deleted:

0,62;808,50;367,50;220,50;3,50;5;0,40;5;16,64;16,03;;
;;;;;;;;;;;
;;;;;;;;;;;
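
Both fixes, along with the decimal commas, can also be scripted rather than applied by hand in an editor. The following is a minimal sketch in Java; the file names are illustrative:

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;
import java.util.stream.Collectors;

public class CleanCsv {

  public static void main(String[] args) throws IOException {
    List<String> cleaned = Files.readAllLines(Paths.get("energy_raw.csv"))
        .stream()
        .map(line -> line.replaceAll(";+$", "")) // strip the trailing semicolons
        .map(line -> line.replace(',', '.'))     // decimal commas to dots
        .filter(line -> !line.isEmpty())         // drop the empty lines at the end
        .collect(Collectors.toList());
    Files.write(Paths.get("energy.csv"), cleaned);
  }
}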

Now, we are ready to load the data. Let's open a new file and write a simple data import function using Weka's converter for reading files in CSV format:

import weka.core.Instances;
import weka.core.converters.CSVLoader;
import java.io.File;
import java.io.IOException;

public class EnergyLoad {

  public static void main(String[] args) throws IOException {

    // load the CSV file supplied as the first command-line argument
    CSVLoader loader = new CSVLoader();
    // the exported file is semicolon-separated, so override the default comma
    loader.setFieldSeparator(";");
    loader.setSource(new File(args[0]));
    Instances data = loader.getDataSet();

    System.out.println(data);
  }
}
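
Assuming that the class is compiled with weka.jar on the classpath and that the cleaned CSV file is saved as energy.csv (both names are illustrative), the program can be run as follows:

java -cp .:weka.jar EnergyLoad energy.csv

This prints the entire dataset in its ARFF representation, which is a quick way to verify that all ten columns were recognized as numeric attributes.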

The data is loaded. Let's move on.

Analyzing attributes

Before we analyze the attributes, let's first try to understand what we are dealing with. In total, there are eight attributes describing building characteristics and two target variables, heating load and cooling load, as shown in the following table:

Attribute    Attribute name
X1           Relative compactness
X2           Surface area
X3           Wall area
X4           Roof area
X5           Overall height
X6           Orientation
X7           Glazing area
X8           Glazing area distribution
Y1           Heating load
Y2           Cooling load

Building and evaluating a regression model

We will start by learning a model for heating load, setting the class attribute to Y1, the second-to-last attribute:

data.setClassIndex(data.numAttributes() - 2);

The second target variable, cooling load, can now be removed:

import weka.filters.Filter;
import weka.filters.unsupervised.attribute.Remove;
...

// remove the last attribute, Y2 (cooling load); the -R option is 1-based
Remove remove = new Remove();
remove.setOptions(new String[]{"-R", String.valueOf(data.numAttributes())});
remove.setInputFormat(data);
data = Filter.useFilter(data, remove);

Linear regression

We will start with a basic linear regression model implemented in the LinearRegression class. As in the classification example, we will initialize a new model instance, pass the parameters and data, and invoke the buildClassifier(Instances) method, as follows:

import weka.classifiers.functions.LinearRegression;
...
// Y1 is now the last attribute, since Y2 was removed
data.setClassIndex(data.numAttributes() - 1);
LinearRegression model = new LinearRegression();
model.buildClassifier(data);
System.out.println(model);

The learned model, which is stored in the model object, can be printed by calling the toString() method, as follows:

Y1 =

    -64.774  * X1 +
     -0.0428 * X2 +
      0.0163 * X3 +
     -0.089  * X4 +
      4.1699 * X5 +
     19.9327 * X7 +
      0.2038 * X8 +
     83.9329

The linear regression model constructed a function that linearly combines the input variables to estimate the heating load. The coefficient in front of each feature explains that feature's impact on the target variable: the sign corresponds to a positive or negative impact, while the magnitude corresponds to its significance. For instance, feature X1, relative compactness, is negatively correlated with heating load, while glazing area (X7) is positively correlated. These two features also significantly impact the final heating load estimate. Note that X6, orientation, does not appear in the equation at all; it was eliminated by the attribute selection that Weka's LinearRegression performs by default.
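
Once trained, the model can also predict the heating load of a new, unseen building through the classifyInstance() method, which returns the predicted numeric value for regression models. The following is a minimal sketch; the attribute values are illustrative and not taken from the dataset:

import weka.core.DenseInstance;
import weka.core.Instance;
...
// build an instance with the same attribute layout as the training data
Instance house = new DenseInstance(data.numAttributes());
house.setDataset(data); // attach the header so the attributes resolve
house.setValue(0, 0.90);  // X1: relative compactness
house.setValue(1, 563.5); // X2: surface area
house.setValue(2, 318.5); // X3: wall area
house.setValue(3, 122.5); // X4: roof area
house.setValue(4, 7.0);   // X5: overall height
house.setValue(5, 3);     // X6: orientation
house.setValue(6, 0.25);  // X7: glazing area
house.setValue(7, 2);     // X8: glazing area distribution

double predictedLoad = model.classifyInstance(house);
System.out.println("Predicted heating load: " + predictedLoad);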

The model's performance can similarly be evaluated with the cross-validation technique.

The 10-fold cross-validation is as follows:

import weka.classifiers.Evaluation;
import java.util.Random;
...
Evaluation eval = new Evaluation(data);
eval.crossValidateModel(model, data, 10, new Random(1), new String[]{});
System.out.println(eval.toSummaryString());

We can output the common evaluation metrics, including the correlation coefficient, mean absolute error, relative absolute error, and so on, as follows:

Correlation coefficient                  0.956 
Mean absolute error                      2.0923
Root mean squared error                  2.9569
Relative absolute error                 22.8555 %
Root relative squared error             29.282  %
Total Number of Instances              768     
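
If the metrics are needed programmatically rather than as a printed summary, the Evaluation object also exposes them through individual getter methods; a short sketch:

// query the individual metrics from the Evaluation object
double cc   = eval.correlationCoefficient();
double mae  = eval.meanAbsoluteError();
double rmse = eval.rootMeanSquaredError();
double rae  = eval.relativeAbsoluteError();
System.out.printf("CC=%.4f, MAE=%.4f, RMSE=%.4f, RAE=%.2f%%%n",
    cc, mae, rmse, rae);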

Regression trees

Another approach is to construct a set of regression models, each on its own part of the data. The following diagram shows the main difference between a regression model and a regression tree. A regression model constructs a single model that best fits all of the data. A regression tree, on the other hand, constructs a set of regression models, each modeling a part of the data, as shown on the right-hand side. Compared to the regression model, the regression tree can fit the data better, but the resulting function is piecewise linear, with jumps between the modeled regions:

[Diagram: a single regression model fitted to all the data (left) versus a regression tree fitting separate models to parts of the data (right)]

The regression tree in Weka is implemented in the M5P class, which builds M5 model trees. Model construction follows the same paradigm: initialize a model, pass the parameters and data, and invoke the buildClassifier(Instances) method:

import weka.classifiers.trees.M5P;
...
M5P m5p = new M5P();
m5p.setOptions(new String[]{}); // use the default parameters
m5p.buildClassifier(data);
System.out.println(m5p);

The induced model is a tree with equations in the leaf nodes, as follows:

M5 pruned model tree:
(using smoothed linear models)

X1 <= 0.75 : 
|   X7 <= 0.175 : 
|   |   X1 <= 0.65 : LM1 (48/12.841%)
|   |   X1 >  0.65 : LM2 (96/3.201%)
|   X7 >  0.175 : 
|   |   X1 <= 0.65 : LM3 (80/3.652%)
|   |   X1 >  0.65 : LM4 (160/3.502%)
X1 >  0.75 : 
|   X1 <= 0.805 : LM5 (128/13.302%)
|   X1 >  0.805 : 
|   |   X7 <= 0.175 : 
|   |   |   X8 <= 1.5 : LM6 (32/20.992%)
|   |   |   X8 >  1.5 : 
|   |   |   |   X1 <= 0.94 : LM7 (48/5.693%)
|   |   |   |   X1 >  0.94 : LM8 (16/1.119%)
|   |   X7 >  0.175 : 
|   |   |   X1 <= 0.84 : 
|   |   |   |   X7 <= 0.325 : LM9 (20/5.451%)
|   |   |   |   X7 >  0.325 : LM10 (20/5.632%)
|   |   |   X1 >  0.84 : 
|   |   |   |   X7 <= 0.325 : LM11 (60/4.548%)
|   |   |   |   X7 >  0.325 : 
|   |   |   |   |   X3 <= 306.25 : LM12 (40/4.504%)
|   |   |   |   |   X3 >  306.25 : LM13 (20/6.934%)

LM num: 1
Y1 = 
  72.2602 * X1 
  + 0.0053 * X3 
  + 11.1924 * X7 
  + 0.429 * X8 
  - 36.2224

...

LM num: 13
Y1 = 
  5.8829 * X1 
  + 0.0761 * X3 
  + 9.5464 * X7 
  - 0.0805 * X8 
  + 2.1492

Number of Rules : 13

The tree has 13 leaves, each corresponding to a linear equation. The preceding output is visualized in the following diagram:

[Diagram: visualization of the induced model tree, with splits on X1, X7, X3, and X8 leading to the linear models LM1 through LM13]

The tree can be read similarly to a classification tree: the most important features are split on near the top of the tree, and each terminal node, or leaf, contains a linear regression model explaining the data that reaches that part of the tree.
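
The tree model can be evaluated with the same 10-fold cross-validation procedure that we used for linear regression, reusing the earlier Evaluation setup:

Evaluation treeEval = new Evaluation(data);
treeEval.crossValidateModel(m5p, data, 10, new Random(1), new String[]{});
System.out.println(treeEval.toSummaryString());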

Evaluation outputs the following results:

Correlation coefficient                  0.9943
Mean absolute error                      0.7446
Root mean squared error                  1.0804
Relative absolute error                  8.1342 %
Root relative squared error             10.6995 %
Total Number of Instances              768     
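
Note that the regression tree clearly outperforms the plain linear regression on this dataset: the correlation coefficient rises from 0.956 to 0.9943, and the mean absolute error drops from 2.09 to 0.74. The heating load evidently depends on the building characteristics in a way that a single linear function cannot fully capture.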

Tips to avoid common regression problems

First, use prior studies and domain knowledge to figure out which features to include in the regression. Check the literature, reports, and previous studies for the kinds of features that work and for reasonable variables for modeling your problem. Be careful with large feature sets: if you have a large set of features with random data, it is highly likely that several of the features will appear correlated with the target variable purely by chance.
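
To back such choices up empirically, Weka's attribute selection classes can rank or filter the candidate features before modeling. The following is a minimal sketch using correlation-based subset evaluation with a best-first search; other evaluator and search combinations are equally valid:

import weka.attributeSelection.AttributeSelection;
import weka.attributeSelection.BestFirst;
import weka.attributeSelection.CfsSubsetEval;
...
// select a subset of features that correlate well with the class
// but have little redundancy among themselves
AttributeSelection selector = new AttributeSelection();
selector.setEvaluator(new CfsSubsetEval());
selector.setSearch(new BestFirst());
selector.SelectAttributes(data);
System.out.println(java.util.Arrays.toString(selector.selectedAttributes()));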

Keep the model simple to avoid overfitting. Occam's razor states that you should select the model that best explains your data with the fewest assumptions. In practice, the model can be as simple as two to four predictor features.
