Creating your own transformer

As the complexity and type of dataset changes, you might find that you can't find an existing feature extraction transformer that fits your needs. We will see an example of this in Chapter 7, Discovering Accounts to Follow Using Graph Mining, where we create new features from graphs.

A transformer is akin to a converting function. It takes data of one form as input and returns data of another form as output. Transformers can be trained using some training dataset, and these trained parameters can be used to convert testing data.

The transformer API is quite simple. It takes data of a specific format as input and returns data of another format (either the same as the input or different) as output. Not much else is required of the programmer.

The transformer API

Transformers have two key functions:

  • fit(): This takes a training set of data as input and sets internal parameters
  • transform(): This performs the transformation itself. This can take either the training dataset, or a new dataset of the same format

Both the fit() and transform() functions should take the same data type as input, but transform() can return data of a different type.
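To see this "different output type" behavior in a transformer that ships with scikit-learn, consider CountVectorizer: it is fit on a list of strings but transform() returns a sparse matrix of word counts. The two short documents below are made up for illustration:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Input is a list of strings...
docs = ["the cat sat", "the cat ran"]
vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(docs)

# ...but the output is a (sparse) matrix of word counts,
# with one column per word in the learned vocabulary
print(counts.shape)       # (2, 4) -- two documents, four distinct words
print(counts.toarray())
```

The vocabulary (here cat, ran, sat, the) is the "internal parameter" learned during fit, just as our MeanDiscrete transformer will learn the mean.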

We are going to create a trivial transformer to show the API in action. The transformer will take a NumPy array as input, and discretize it based on the mean. Any value higher than the mean (of the training data) will be given the value 1 and any value lower or equal to the mean will be given the value 0.

We did a similar transformation with the Adult dataset using pandas: we took the Hours-per-week feature and created a LongHours feature if the value was more than 40 hours per week. This transformer is different for two reasons. First, the code will conform to the scikit-learn API, allowing us to use it in a pipeline. Second, the code will learn the mean, rather than taking it as a fixed value (such as 40 in the LongHours example).
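For comparison, the fixed-threshold pandas version looks roughly like the sketch below. The column names follow the Adult dataset convention used earlier in the chapter, but the tiny DataFrame here is made-up sample data, not the real dataset:

```python
import pandas as pd

# Stand-in for the Adult dataset -- the real notebook loads it from file
adult = pd.DataFrame({"Hours-per-week": [20, 40, 45, 60]})

# The threshold of 40 is hard-coded, not learned from the data
adult["LongHours"] = adult["Hours-per-week"] > 40
print(adult["LongHours"].tolist())  # [False, False, True, True]
```

Our new transformer replaces the hard-coded 40 with a mean learned in fit().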

Implementation details

To start, open up the IPython Notebook that we used for the Adult dataset. Then, click on the Cell menu item and choose Run All. This will rerun all of the cells and ensure that the notebook is up to date.

First, we import the TransformerMixin, which sets the API for us. While Python doesn't have strict interfaces (unlike languages such as Java), using a mixin like this allows scikit-learn to determine that the class is actually a transformer. We also need to import a function that checks that the input is of a valid type. We will use that soon.

Let's look at the code:

from sklearn.base import TransformerMixin
from sklearn.utils import as_float_array

Now, create a new class that subclasses from our mixin:

class MeanDiscrete(TransformerMixin):

We need to define both a fit and a transform function to conform to the API. In our fit function, we find the mean of the dataset and set an internal variable to remember that value. We also accept an optional y parameter, which we ignore; scikit-learn passes the class labels to fit when the transformer is used in a pipeline. Let's look at the code:

    def fit(self, X, y=None):

First, we ensure that X is a dataset that we can work with, using the as_float_array function (which will also convert X if it can, for example, if X is a list of floats):

        X = as_float_array(X)

Next, we compute the mean of the array and set an internal parameter to remember this value. When X is a multivariate array, self.mean will be an array that contains the mean of each feature:

        self.mean = X.mean(axis=0)

The fit function also needs to return the class itself. This requirement ensures that we can perform chaining of functionality in transformers (such as calling transformer.fit(X).transform(X)). Let's look at the code:

        return self

Next, we define the transform function. This takes a dataset of the same type as the fit function, so we need to check that we got the right input:

    def transform(self, X):
        X = as_float_array(X)

We should perform another check here too. While we need the input to be a NumPy array (or an equivalent data structure), the shape needs to be consistent too. The number of features in this array needs to be the same as the number of features the class was trained on.

        assert X.shape[1] == self.mean.shape[0]

Now, we perform the actual transformation by simply testing if the values in X are higher than the stored mean.

        return X > self.mean
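Assembled in one place (and with the optional y parameter on fit, which scikit-learn passes when the transformer is used in a pipeline), the complete class looks like this:

```python
import numpy as np
from sklearn.base import TransformerMixin
from sklearn.utils import as_float_array


class MeanDiscrete(TransformerMixin):
    def fit(self, X, y=None):
        # Ensure X is a float array, converting it if possible
        X = as_float_array(X)
        # Learn and store the per-feature means
        self.mean = X.mean(axis=0)
        # Return self to allow chaining, e.g. fit(X).transform(X)
        return self

    def transform(self, X):
        X = as_float_array(X)
        # The new data must have the same number of features we trained on
        assert X.shape[1] == self.mean.shape[0]
        # Discretize: True (1) above the mean, False (0) otherwise
        return X > self.mean
```

Note that we did not write a fit_transform method: TransformerMixin provides one for free by calling fit and then transform.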

We can then create an instance of this class and use it to transform our X array:

mean_discrete = MeanDiscrete()
X_mean = mean_discrete.fit_transform(X)

Unit testing

When creating your own functions and classes, it is always a good idea to write unit tests. A unit test aims to test a single unit of your code. In this case, we want to verify that our transformer does what it is supposed to do.

Good tests should be independently verifiable. A good way to confirm the legitimacy of your tests is by using another computer language or method to perform the calculations. In this case, I used Excel to create a dataset, and then computed the mean for each cell. Those values were then transferred here.

Unit tests should also be small and quick to run. Therefore, any data used should be of a small size. The dataset I used for creating the tests is the Xt variable from earlier, which we recreate inside the test itself. The means of its two features are 13.5 and 15.5, respectively.

To create our unit test, we import the assert_array_equal function from NumPy's testing, which checks whether two arrays are equal:

from numpy.testing import assert_array_equal

Next, we create our function. It is important that the test's name starts with test_, as this nomenclature is used by tools that automatically find and run tests. We also set up our testing data (this assumes NumPy has been imported as np, as earlier in the notebook):

import numpy as np

def test_meandiscrete():
    X_test = np.array([[ 0,  2],
                       [ 3,  5],
                       [ 6,  8],
                       [ 9, 11],
                       [12, 14],
                       [15, 17],
                       [18, 20],
                       [21, 23],
                       [24, 26],
                       [27, 29]])

We then create our transformer instance and fit it using this test data:

    mean_discrete = MeanDiscrete()
    mean_discrete.fit(X_test)

Next, we check whether the internal mean parameter was correctly set by comparing it with our independently verified result:

    assert_array_equal(mean_discrete.mean, np.array([13.5, 15.5]))

We then run the transform to create the transformed dataset. We also create an (independently computed) array with the expected values for the output:

    X_transformed = mean_discrete.transform(X_test)
    X_expected = np.array([[0, 0],
                           [0, 0],
                           [0, 0],
                           [0, 0],
                           [0, 0],
                           [1, 1],
                           [1, 1],
                           [1, 1],
                           [1, 1],
                           [1, 1]])

Finally, we test that our returned result is indeed what we expected:

    assert_array_equal(X_transformed, X_expected)

We can run the test by simply running the function itself:

test_meandiscrete()

If there was no error, then the test ran without an issue! You can verify this by changing some of the tests to deliberately incorrect values, and seeing that the test fails. Remember to change them back so that the test passes.
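The failure mode you would see is an AssertionError raised by assert_array_equal. The following standalone snippet (with a deliberately wrong expected value, made up for illustration) shows what happens when a comparison fails:

```python
import numpy as np
from numpy.testing import assert_array_equal

try:
    # Deliberately wrong: the first mean is 13.5, not 14.0
    assert_array_equal(np.array([13.5, 15.5]), np.array([14.0, 15.5]))
    print("test passed")
except AssertionError:
    print("test failed, as expected with a wrong value")
```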

If we had multiple tests, it would be worth using a testing framework called nose to run our tests.

Putting it all together

Now that we have a tested transformer, it is time to put it into action. Using what we have learned so far, we create a Pipeline, set the first step to the MeanDiscrete transformer, and the second step to a DecisionTreeClassifier. We then run cross-validation and print out the result. Let's look at the code:

from sklearn.pipeline import Pipeline
pipeline = Pipeline([('mean_discrete', MeanDiscrete()),
                     ('classifier', DecisionTreeClassifier(random_state=14))])
scores_mean_discrete = cross_val_score(pipeline, X, y, scoring='accuracy')
print("Mean Discrete performance: {0:.3f}".format(scores_mean_discrete.mean()))

The result is 0.803, which is not as good as before, but not bad for simple binary features.
