Regression is a technique for predicting the value of a numeric class, in contrast to classification, which predicts the value of a nominal class. Given a set of attributes, a regression algorithm builds a model, usually an equation, that is used to compute the predicted class value.
Let's look at an example of a house price regression model, using some real data. These are actual numbers from houses for sale, and we will try to estimate the value of a house we are supposed to sell (the last row, whose price is unknown):
Size (m2) | Land (m2) | Rooms | Granite | Extra bathroom | Price |
---|---|---|---|---|---|
1076 | 2801 | 6 | 0 | 0 | €324.500,00 |
990 | 3067 | 5 | 1 | 1 | €466.000,00 |
1229 | 3094 | 5 | 0 | 1 | €425.900,00 |
731 | 4315 | 4 | 1 | 0 | €387.120,00 |
671 | 2926 | 4 | 0 | 1 | €312.100,00 |
1078 | 6094 | 6 | 1 | 1 | €603.000,00 |
909 | 2854 | 5 | 0 | 1 | €383.400,00 |
975 | 2947 | 5 | 1 | 1 | ?? |
To load files in Weka, we have to put the table in the ARFF file format and save it as house.arff. Make sure the attributes are numeric, as shown here:
    @RELATION house

    @ATTRIBUTE size NUMERIC
    @ATTRIBUTE land NUMERIC
    @ATTRIBUTE rooms NUMERIC
    @ATTRIBUTE granite NUMERIC
    @ATTRIBUTE extra_bathroom NUMERIC
    @ATTRIBUTE price NUMERIC

    @DATA
    1076,2801,6,0,0,324500
    990,3067,5,1,1,466000
    1229,3094,5,0,1,425900
    731,4315,4,1,0,387120
    671,2926,4,0,1,312100
    1078,6094,6,1,1,603000
    909,2854,5,0,1,383400
    975,2947,5,1,1,?
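If the table lives in code rather than in a file, the same ARFF layout can be generated programmatically. The following is a minimal sketch in plain Java, assuming all attributes are numeric; the class `ArffSketch` and its methods are hypothetical helpers, not part of Weka. A `null` cell is written as `?`, the ARFF marker for a missing value.

```java
import java.util.List;

// Hypothetical helper (not part of Weka): renders numeric rows as ARFF text.
class ArffSketch {

    // Builds an ARFF document; a null cell is written as '?', ARFF's missing-value marker.
    static String toArff(String relation, List<String> attributes, List<Double[]> rows) {
        StringBuilder sb = new StringBuilder();
        sb.append("@RELATION ").append(relation).append("\n\n");
        for (String name : attributes) {
            sb.append("@ATTRIBUTE ").append(name).append(" NUMERIC\n");
        }
        sb.append("\n@DATA\n");
        for (Double[] row : rows) {
            for (int i = 0; i < row.length; i++) {
                if (i > 0) sb.append(',');
                sb.append(row[i] == null ? "?" : formatNumber(row[i]));
            }
            sb.append('\n');
        }
        return sb.toString();
    }

    // Writes whole numbers without a trailing ".0", so 1076.0 becomes 1076.
    static String formatNumber(double v) {
        return v == Math.rint(v) ? String.valueOf((long) v) : String.valueOf(v);
    }

    public static void main(String[] args) {
        List<String> attrs = List.of("size", "land", "rooms", "granite", "extra_bathroom", "price");
        // Two rows from the house table; the second has an unknown price.
        List<Double[]> rows = List.of(
            new Double[] {1076.0, 2801.0, 6.0, 0.0, 0.0, 324500.0},
            new Double[] {975.0, 2947.0, 5.0, 1.0, 1.0, null});
        System.out.print(toArff("house", attrs, rows));
    }
}
```

The returned string can then be written to house.arff with any file API.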
Use the following snippet:
    import java.io.BufferedReader;
    import java.io.FileReader;

    import weka.core.Instance;
    import weka.core.Instances;
    import weka.classifiers.functions.LinearRegression;

    public class Regression {
        public static void main(String[] args) throws Exception {
            // load data; the last attribute (price) is the class to predict
            Instances data = new Instances(new BufferedReader(new FileReader("dataset/house.arff")));
            data.setClassIndex(data.numAttributes() - 1);

            // build model; the last instance, whose class is missing, is not used
            LinearRegression model = new LinearRegression();
            model.buildClassifier(data);
            System.out.println(model);

            // classify the last instance
            Instance myHouse = data.lastInstance();
            double price = model.classifyInstance(myHouse);
            System.out.println("My house (" + myHouse + "): " + price);
        }
    }
Here is the output:
    Linear Regression Model
    price =
        195.2035 * size +
         38.9694 * land +
      76218.4642 * granite +
      73947.2118 * extra_bathroom +
       2681.136
    My house (975,2947,5,1,1,?): 458013.16703945777
The model estimated the value of our house to be €458,013.17.
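To see where this number comes from, the printed equation can be evaluated by hand for our house. The sketch below is plain Java with a hypothetical class name, `PriceByHand`; the coefficients are copied from the printed model, which contains no term for rooms. Because the printed coefficients are rounded, the result (about 458,013.05) differs from the model's own output by roughly 0.12.

```java
// Hypothetical class: re-evaluates the equation printed by the model by hand.
class PriceByHand {

    // Coefficients copied from the printed model (rounded as shown in the output).
    static double predict(double size, double land, double granite, double extraBathroom) {
        return 195.2035 * size
             + 38.9694 * land
             + 76218.4642 * granite
             + 73947.2118 * extraBathroom
             + 2681.136;
    }

    public static void main(String[] args) {
        // Last row of the dataset: size=975, land=2947, granite=1, extra_bathroom=1.
        double price = predict(975, 2947, 1, 1);
        System.out.printf("Price from printed coefficients: %.2f%n", price);
    }
}
```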
Step by step, the snippet works as follows. First, import the basic regression model, weka.classifiers.functions.LinearRegression, along with the classes needed to read the data:
    import java.io.BufferedReader;
    import java.io.FileReader;

    import weka.core.Instance;
    import weka.core.Instances;
    import weka.classifiers.functions.LinearRegression;
Load the house dataset:
    Instances data = new Instances(new BufferedReader(new FileReader("dataset/house.arff")));
    data.setClassIndex(data.numAttributes() - 1);
Initialize and build a regression model. Note that the last instance is not used for building the model, since its class value is missing:
    LinearRegression model = new LinearRegression();
    model.buildClassifier(data);
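To give a feel for what building a linear model involves, here is a minimal sketch of ordinary least squares with a single attribute. It is an illustration only, not Weka's implementation, which fits all attributes at once; the class `SimpleLeastSquares` and its `fit` method are hypothetical. Rows with a missing target, represented here as `NaN`, are skipped, mirroring how the instance with the missing class is left out of model building.

```java
// Hypothetical illustration, not Weka's implementation: one-attribute ordinary least squares.
class SimpleLeastSquares {

    // Returns {slope, intercept} minimizing squared error; rows whose target is NaN are skipped.
    static double[] fit(double[] x, double[] y) {
        int n = 0;
        double sumX = 0, sumY = 0;
        for (int i = 0; i < x.length; i++) {
            if (Double.isNaN(y[i])) continue;
            sumX += x[i];
            sumY += y[i];
            n++;
        }
        double meanX = sumX / n, meanY = sumY / n;
        double sxy = 0, sxx = 0;
        for (int i = 0; i < x.length; i++) {
            if (Double.isNaN(y[i])) continue;
            sxy += (x[i] - meanX) * (y[i] - meanY);
            sxx += (x[i] - meanX) * (x[i] - meanX);
        }
        double slope = sxy / sxx;
        return new double[] {slope, meanY - slope * meanX};
    }

    public static void main(String[] args) {
        // Land area and price columns from the house table; NaN marks the unknown price.
        double[] land = {2801, 3067, 3094, 4315, 2926, 6094, 2854, 2947};
        double[] price = {324500, 466000, 425900, 387120, 312100, 603000, 383400, Double.NaN};
        double[] line = fit(land, price);
        System.out.printf("price = %.4f * land + %.4f%n", line[0], line[1]);
    }
}
```

A one-attribute fit like this will not reproduce the coefficients of the full model, since those are estimated jointly over several attributes.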
Output the model:
    System.out.println(model);
Use the model to predict the price of the last instance in the dataset:
    Instance myHouse = data.lastInstance();
    double price = model.classifyInstance(myHouse);
    System.out.println("My house (" + myHouse + "): " + price);
This section lists some additional algorithms.
There is a wide variety of implemented regression algorithms one can use in Weka:
weka.classifiers.rules.ZeroR: The class for building and using a 0-R classifier. It predicts the mean (for a numeric class) or the mode (for a nominal class) and serves as a baseline; that is, if your model performs worse than this mean-value predictor, it is not worth considering.
weka.classifiers.trees.REPTree: A fast decision tree learner. It builds a decision/regression tree using information gain/variance and prunes it using reduced-error pruning (with backfitting). It only sorts values for numeric attributes once. Missing values are dealt with by splitting the corresponding instances into pieces (that is, as in C4.5).
weka.classifiers.functions.SMOreg: Implements the support vector machine for regression. The parameters can be learned using various algorithms; the algorithm is selected by setting the RegOptimizer. The most popular algorithm (RegSMOImproved) is due to Shevade, Keerthi, and others, and it is the default RegOptimizer.
weka.classifiers.functions.MultilayerPerceptron: A classifier that uses backpropagation to classify instances. The network can be built by hand, created by an algorithm, or both. It can also be monitored and modified during training. The nodes in the network are all sigmoid, except when the class is numeric, in which case the output nodes become unthresholded linear units.
weka.classifiers.functions.GaussianProcesses: Implements Gaussian processes for regression without hyperparameter tuning.
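The baseline idea behind ZeroR for a numeric class can be sketched in a few lines of plain Java: always predict the mean of the known class values. The class `MeanBaseline` and its methods are hypothetical and only illustrate the idea; they are not Weka's ZeroR class. For the seven houses with known prices, the mean is about €414,574.29, and any regression model worth keeping should have a lower error than this constant prediction.

```java
// Hypothetical illustration of the ZeroR idea for a numeric class: always predict the mean.
class MeanBaseline {

    // Mean of the known target values; this is the single value a mean predictor returns.
    static double mean(double[] values) {
        double sum = 0;
        for (double v : values) sum += v;
        return sum / values.length;
    }

    // Mean absolute error of predicting the same value for every instance.
    static double meanAbsoluteError(double[] values, double prediction) {
        double total = 0;
        for (double v : values) total += Math.abs(v - prediction);
        return total / values.length;
    }

    public static void main(String[] args) {
        // Known prices from the house table (the house with the unknown price is excluded).
        double[] prices = {324500, 466000, 425900, 387120, 312100, 603000, 383400};
        double baseline = mean(prices);
        System.out.printf("Baseline prediction: %.2f%n", baseline);
        System.out.printf("Baseline MAE on the same data: %.2f%n", meanAbsoluteError(prices, baseline));
    }
}
```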