Extracting validation curves

We used random forests to build a classifier in the previous recipe, but we don't exactly know how to define the parameters. In our case, we dealt with two parameters: n_estimators and max_depth. They are called hyperparameters, and the performance of the classifier depends on them. It would be nice to see how the performance gets affected as we change the hyperparameters. This is where validation curves come into picture. These curves help us understand how each hyperparameter influences the training score. Basically, all other parameters are kept constant and we vary the hyperparameter of interest according to our range. We will then be able to visualize how this affects the score.

How to do it…

  1. Add the following code to the same Python file, as in the previous recipe:
    # Validation curves
    
    from sklearn.learning_curve import validation_curve
    
    classifier = RandomForestClassifier(max_depth=4, random_state=7)
    parameter_grid = np.linspace(25, 200, 8).astype(int)
    train_scores, validation_scores = validation_curve(classifier, X, y, 
            "n_estimators", parameter_grid, cv=5)
    print "
    ##### VALIDATION CURVES #####"
    print "
    Param: n_estimators
    Training scores:
    ", train_scores
    print "
    Param: n_estimators
    Validation scores:
    ", validation_scores

    In this case, we defined the classifier by fixing the max_depth parameter. We want to estimate the optimal number of estimators to use, and so have defined our search space using parameter_grid. It is going to extract training and validation scores by iterating from 25 to 200 in eight steps.

  2. If you run it, you will see the following on your Terminal:
    How to do it…
  3. Let's plot it:
    # Plot the curve
    plt.figure()
    plt.plot(parameter_grid, 100*np.average(train_scores, axis=1), color='black')
    plt.title('Training curve')
    plt.xlabel('Number of estimators')
    plt.ylabel('Accuracy')
    plt.show()
  4. Here is the figure that you'll get:
    How to do it…
  5. Let's do the same for the max_depth parameter:
    classifier = RandomForestClassifier(n_estimators=20, random_state=7)
    parameter_grid = np.linspace(2, 10, 5).astype(int)
    train_scores, valid_scores = validation_curve(classifier, X, y, 
            "max_depth", parameter_grid, cv=5)
    print "
    Param: max_depth
    Training scores:
    ", train_scores
    print "
    Param: max_depth
    Validation scores:
    ", validation_scores

    We fixed the n_estimators parameter at 20 to see how the performance varies with max_depth. Here is the output on the Terminal:

    How to do it…
  6. Let's plot it:
    # Plot the curve
    plt.figure()
    plt.plot(parameter_grid, 100*np.average(train_scores, axis=1), color='black')
    plt.title('Validation curve')
    plt.xlabel('Maximum depth of the tree')
    plt.ylabel('Accuracy')
    plt.show()
  7. If you run this code, you will get the following figure:
    How to do it…
..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset
18.216.143.65