Advanced modeling with ensembles

In the previous section, we implemented a baseline to get our bearings; now let's bring in some heavier machinery. We will follow the approach taken by the KDD Cup 2009 winning solution, developed by the IBM Research team (Niculescu-Mizil et al.).

To address this challenge, they used the ensemble selection algorithm (Caruana and Niculescu-Mizil, 2004). This is an ensemble method, which means it builds a library of models and combines their outputs in a specific way in order to produce the final classification. It has several desirable properties that make it a good fit for this challenge, as follows (the greedy core of the algorithm is sketched right after the list):

  • It has been shown to be robust, yielding excellent performance.
  • It can be optimized for a specific performance metric, including AUC.
  • It allows for different classifiers to be added to the library.
  • It is an anytime method, meaning that if we run out of time, we have a solution available.
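
To make this concrete, the following is a minimal, self-contained Java sketch of the greedy core of ensemble selection, assuming we already have each library model's positive-class probabilities on a held-out hillclimb set. All names here (EnsembleSelectionSketch, auc, selectEnsemble) are illustrative rather than part of any library, and refinements from the original paper, such as bagging the model library and tie handling in the AUC computation, are omitted:

```java
import java.util.Arrays;

public class EnsembleSelectionSketch {

    // AUC via the rank-sum (Mann-Whitney) statistic; labels are 0/1,
    // ties between scores are not handled specially.
    static double auc(double[] scores, int[] labels) {
        Integer[] order = new Integer[scores.length];
        for (int i = 0; i < order.length; i++) order[i] = i;
        Arrays.sort(order, (a, b) -> Double.compare(scores[a], scores[b]));
        double rankSumPos = 0;
        int nPos = 0, nNeg = 0;
        for (int rank = 0; rank < order.length; rank++) {
            if (labels[order[rank]] == 1) { rankSumPos += rank + 1; nPos++; }
            else nNeg++;
        }
        return (rankSumPos - nPos * (nPos + 1) / 2.0) / (nPos * (double) nNeg);
    }

    // preds[m][i] = positive-class probability assigned by library model m to
    // hillclimb instance i. Returns how many times each model was selected;
    // the ensemble prediction is the average weighted by these counts.
    static int[] selectEnsemble(double[][] preds, int[] labels, int iterations) {
        int nModels = preds.length, nInstances = labels.length;
        double[] ensembleSum = new double[nInstances]; // running sum of selected predictions
        int[] counts = new int[nModels];
        for (int it = 0; it < iterations; it++) {
            int best = -1;
            double bestAuc = -1.0;
            // Try adding each library model (with replacement) and keep the one
            // that gives the best AUC of the averaged ensemble on the hillclimb set.
            for (int m = 0; m < nModels; m++) {
                double[] candidate = new double[nInstances];
                for (int i = 0; i < nInstances; i++) {
                    candidate[i] = (ensembleSum[i] + preds[m][i]) / (it + 1);
                }
                double a = auc(candidate, labels);
                if (a > bestAuc) { bestAuc = a; best = m; }
            }
            for (int i = 0; i < nInstances; i++) ensembleSum[i] += preds[best][i];
            counts[best]++;
            // Anytime property: the counts so far already define a usable ensemble.
            System.out.printf("iteration %d: added model %d, hillclimb AUC = %.4f%n",
                    it + 1, best, bestAuc);
        }
        return counts;
    }

    public static void main(String[] args) {
        // Toy library: three models scoring five hillclimb instances with the labels below.
        double[][] libraryPredictions = {
                {0.9, 0.8, 0.3, 0.4, 0.2},
                {0.6, 0.9, 0.5, 0.1, 0.3},
                {0.2, 0.4, 0.6, 0.8, 0.9}
        };
        int[] labels = {1, 1, 0, 0, 0};
        int[] weights = selectEnsemble(libraryPredictions, labels, 5);
        System.out.println("Model weights: " + Arrays.toString(weights));
    }
}
```

Because models are added with replacement, frequently selected models receive larger weights in the final average, and since every iteration yields a usable ensemble, the loop can be stopped at any time.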

In this section, we will loosely follow the steps described in their report. Note that this is not an exact implementation of their approach, but rather a solution overview that includes the steps needed to dive deeper.

A general overview of the steps is as follows:

  1. First, we will preprocess the data by removing attributes that clearly do not add any value (for example, attributes whose values are all missing or constant), imputing missing values so that machine learning algorithms that cannot handle them can still be applied, and converting categorical attributes to numerical ones.
  2. Next, we will run an attribute selection algorithm to keep only the subset of attributes that actually helps with the prediction task (steps 1 and 2 are sketched together in the first listing below).
  3. In the third step, we will instantiate the ensemble selection algorithm with a wide variety of models and, finally, evaluate its performance (see the second listing below).
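
The first listing is a hedged sketch of steps 1 and 2 using Weka, assuming the Weka library is on the classpath and the training data is available as an ARFF file with the class attribute in the last position; the file path and the number of retained attributes are placeholder choices, not values from the winning solution:

```java
import weka.attributeSelection.InfoGainAttributeEval;
import weka.attributeSelection.Ranker;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.supervised.attribute.AttributeSelection;
import weka.filters.unsupervised.attribute.NominalToBinary;
import weka.filters.unsupervised.attribute.RemoveUseless;
import weka.filters.unsupervised.attribute.ReplaceMissingValues;

public class PreprocessingSketch {
    public static void main(String[] args) throws Exception {
        // Placeholder path; substitute the actual training file.
        Instances data = DataSource.read("data/train.arff");
        data.setClassIndex(data.numAttributes() - 1);

        // Step 1a: drop attributes that bring no value, such as all-missing
        // or (near-)constant columns.
        RemoveUseless removeUseless = new RemoveUseless();
        removeUseless.setInputFormat(data);
        data = Filter.useFilter(data, removeUseless);

        // Step 1b: impute missing values with the mean (numeric) or mode (nominal),
        // so that algorithms that cannot handle missing data can still be used.
        ReplaceMissingValues fixMissing = new ReplaceMissingValues();
        fixMissing.setInputFormat(data);
        data = Filter.useFilter(data, fixMissing);

        // Step 1c: convert categorical (nominal) attributes into binary numeric indicators.
        NominalToBinary toNumeric = new NominalToBinary();
        toNumeric.setInputFormat(data);
        data = Filter.useFilter(data, toNumeric);

        // Step 2: rank attributes by information gain and keep only the top ones
        // (100 is an arbitrary placeholder).
        InfoGainAttributeEval evaluator = new InfoGainAttributeEval();
        Ranker ranker = new Ranker();
        ranker.setNumToSelect(100);
        AttributeSelection selectAttributes = new AttributeSelection();
        selectAttributes.setEvaluator(evaluator);
        selectAttributes.setSearch(ranker);
        selectAttributes.setInputFormat(data);
        data = Filter.useFilter(data, selectAttributes);

        System.out.println("Attributes after preprocessing: " + data.numAttributes());
    }
}
```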
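
The second listing covers step 3. The full ensemble selection classifier is distributed as Weka's ensembleLibrary package; to keep this sketch self-contained, it instead scores a small library of diverse classifiers with 10-fold cross-validated AUC and combines them with a probability-averaging Vote meta-classifier as a simple stand-in for the greedy selection sketched earlier. The library composition, the file path, and the random seed are illustrative assumptions:

```java
import java.util.Random;

import weka.classifiers.Classifier;
import weka.classifiers.Evaluation;
import weka.classifiers.bayes.NaiveBayes;
import weka.classifiers.functions.Logistic;
import weka.classifiers.lazy.IBk;
import weka.classifiers.meta.Vote;
import weka.classifiers.trees.J48;
import weka.classifiers.trees.RandomForest;
import weka.core.Instances;
import weka.core.SelectedTag;
import weka.core.converters.ConverterUtils.DataSource;

public class EnsembleEvaluationSketch {

    // 10-fold cross-validated AUC for the second class value,
    // assumed here to be the positive class.
    static double cvAuc(Classifier model, Instances data) throws Exception {
        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(model, data, 10, new Random(1));
        return eval.areaUnderROC(1);
    }

    public static void main(String[] args) throws Exception {
        // Placeholder path; this should be the preprocessed data from the previous listing.
        Instances data = DataSource.read("data/train_preprocessed.arff");
        data.setClassIndex(data.numAttributes() - 1);

        // A small, diverse model library; the winning solution used a far wider one.
        Classifier[] library = {
            new NaiveBayes(), new Logistic(), new J48(), new RandomForest(), new IBk(10)
        };
        for (Classifier model : library) {
            System.out.printf("%s: AUC = %.4f%n",
                    model.getClass().getSimpleName(), cvAuc(model, data));
        }

        // Simple stand-in for ensemble selection: average the library's class probabilities.
        Vote ensemble = new Vote();
        ensemble.setClassifiers(library);
        ensemble.setCombinationRule(new SelectedTag(Vote.AVERAGE_RULE, Vote.TAGS_RULES));
        System.out.printf("Averaged ensemble: AUC = %.4f%n", cvAuc(ensemble, data));
    }
}
```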