Evaluating cars based on their characteristics

Let's see how we can apply classification techniques to a real-world problem. We will use a dataset that contains some details about cars, such as the number of doors, the boot space, the maintenance cost, and so on. Our goal is to determine the quality of the car. For the purposes of classification, the quality can take four values: unacceptable, acceptable, good, and very good.

Getting ready

You can download the dataset at https://archive.ics.uci.edu/ml/datasets/Car+Evaluation.

You need to treat each value in the dataset as a string. We consider six attributes; here they are, along with the possible values each can take:

  • buying: These will be vhigh, high, med, and low
  • maint: These will be vhigh, high, med, and low
  • doors: These will be 2, 3, 4, 5, and more
  • persons: These will be 2, 4, and more
  • lug_boot: These will be small, med, and big
  • safety: These will be low, med, and high

Given that every value in each line is a string, we treat all six features as categorical strings and design the classifier accordingly. In the previous chapter, we used random forests to build a regressor. In this recipe, we will use random forests as a classifier.
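For orientation, each record in the car.data file is a single comma-separated line containing the six attributes followed by the class label. A typical line looks like this:

    vhigh,vhigh,2,2,small,low,unacc

Here, unacc is the class label (unacceptable); the other class labels appearing in the file are acc, good, and vgood.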

How to do it…

  1. We will use the car.py file that we already provided to you as a reference. Let's go ahead and import the required packages:
    import numpy as np
    from sklearn import preprocessing
    from sklearn.ensemble import RandomForestClassifier
  2. Let's load the dataset:
    input_file = 'path/to/dataset/car.data.txt'
    
    # Reading the data
    X = []
    with open(input_file, 'r') as f:
        for line in f.readlines():
            data = line[:-1].split(',')
            X.append(data)
    
    X = np.array(X)

    Each line contains a comma-separated list of words. Therefore, we parse the input file, split each line on the commas, and append the resulting list to the main data. We drop the last character of each line because it's a newline character. scikit-learn's estimators only work with numerical data, so we need to transform these string attributes into a form they can understand.
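    As a side note, if you prefer not to parse the file by hand, NumPy can load the same data in a single call. A minimal equivalent, assuming the same file path:

    # Alternative: let NumPy do the parsing (produces the same string array)
    X = np.genfromtxt(input_file, delimiter=',', dtype=str)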

  3. In the previous chapter, we discussed label encoding. That is what we will use here to convert strings to numbers:
    # Convert string data to numerical data
    label_encoder = []
    X_encoded = np.empty(X.shape)
    for i, item in enumerate(X[0]):
        label_encoder.append(preprocessing.LabelEncoder())
        X_encoded[:, i] = label_encoder[-1].fit_transform(X[:, i])
    
    X = X_encoded[:, :-1].astype(int)
    y = X_encoded[:, -1].astype(int)

    As each attribute can take a limited number of values, we can use the label encoder to transform them into numbers. We need to use different label encoders for each attribute. For example, the lug_boot attribute can take three distinct values, and we need a label encoder that knows how to encode this attribute. The last value on each line is the class, so we assign it to the y variable.
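    To make the mapping concrete, here is a small standalone illustration of what one of these encoders does for the lug_boot attribute. Note that LabelEncoder assigns integers based on the alphabetical order of the distinct values:

    # Illustration: how a label encoder maps the lug_boot values
    encoder = preprocessing.LabelEncoder()
    encoder.fit(['small', 'med', 'big'])
    print(encoder.classes_)                            # ['big' 'med' 'small']
    print(encoder.transform(['small', 'med', 'big']))  # [2 1 0]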

  4. Let's train the classifier:
    # Build a Random Forest classifier
    params = {'n_estimators': 200, 'max_depth': 8, 'random_state': 7}
    classifier = RandomForestClassifier(**params)
    classifier.fit(X, y)

    You can play around with the n_estimators and max_depth parameters to see how they affect the classification accuracy. We will actually do this soon in a standardized way.
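    If you want a quick, informal look right away, a simple loop over a few parameter values works; the following is a rough sketch (the particular n_estimators values here are arbitrary choices for illustration):

    # Rough sketch: see how accuracy changes with the number of trees
    from sklearn.model_selection import cross_val_score

    for n in [10, 50, 100, 200, 400]:
        clf = RandomForestClassifier(n_estimators=n, max_depth=8, random_state=7)
        scores = cross_val_score(clf, X, y, scoring='accuracy', cv=3)
        print(n, '->', round(100 * scores.mean(), 2), '%')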

  5. Let's perform cross-validation:
    # Cross validation
    from sklearn.model_selection import cross_val_score

    accuracy = cross_val_score(classifier,
            X, y, scoring='accuracy', cv=3)
    print("Accuracy of the classifier: " +
            str(round(100 * accuracy.mean(), 2)) + "%")

    Once we train the classifier, we need to see how it performs. We use three-fold cross-validation to calculate the accuracy here.
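    Accuracy is only one possible metric; cross_val_score accepts other scoring strings as well. For example, because the four classes in this dataset are quite imbalanced, a weighted F1 score can be a more informative measure:

    # Same setup, different metric: weighted F1 score
    f1 = cross_val_score(classifier, X, y, scoring='f1_weighted', cv=3)
    print("F1 score: " + str(round(100 * f1.mean(), 2)) + "%")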

  6. One of the main goals of building a classifier is to use it on isolated and unknown data instances. Let's use a single datapoint and see how we can use this classifier to categorize it:
    # Testing encoding on single data instance
    input_data = ['vhigh', 'vhigh', '2', '2', 'small', 'low']
    input_data_encoded = [-1] * len(input_data)
    for i, item in enumerate(input_data):
        # transform expects an array-like, so wrap the single value in a list
        input_data_encoded[i] = int(label_encoder[i].transform([item])[0])

    input_data_encoded = np.array(input_data_encoded)

    The first step was to convert that data into numerical data. We need to use the label encoders that we used during training because we want it to be consistent. If there are unknown values in the input datapoint, the label encoder will complain because it doesn't know how to handle that data. For example, if you change the first value in the list from vhigh to abcd, then the label encoder won't work because it doesn't know how to interpret this string. This acts like an error check to see if the input datapoint is valid.
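    If you want to see this check in action, you can trigger and catch the error explicitly. A minimal sketch, using the deliberately invalid value abcd mentioned above:

    # Demonstration: an unseen category makes the encoder raise an error
    try:
        label_encoder[0].transform(['abcd'])
    except ValueError as err:
        print("Invalid input:", err)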

  7. We are now ready to predict the output class for this datapoint:

    # Predict and print output for a particular datapoint
    # predict expects a 2D array: one row per datapoint
    output_class = classifier.predict(input_data_encoded.reshape(1, -1))
    print("Output class:", label_encoder[-1].inverse_transform(output_class)[0])

    We use the predict method to estimate the output class. If we output the encoded output label, it wouldn't mean anything to us. Therefore, we use the inverse_transform method to convert this label back to its original form and print out the output class.
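    Random forests can also report how confident they are in each class. If you want to go a step further, predict_proba returns the per-class probabilities, which you can print alongside the original class names:

    # Optional: inspect the predicted class probabilities
    probabilities = classifier.predict_proba(input_data_encoded.reshape(1, -1))[0]
    for class_name, prob in zip(label_encoder[-1].classes_, probabilities):
        print(class_name, '->', round(prob, 3))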
