We can perform the same analysis in Python. Start by importing the libraries we will use:
import pandas as pd
import numpy as np
from os import system
import graphviz #pip install graphviz
# train_test_split now lives in sklearn.model_selection;
# the old sklearn.cross_validation module has been removed from scikit-learn
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
from sklearn import tree
Read in the mpg data file:
carmpg = pd.read_csv("car-mpg.csv")
carmpg.head(5)
Separate the data into factors (X) and results (Y):
columns = carmpg.columns
mask = np.ones(columns.shape, dtype=bool)
i = 0 #The specified column that you don't want to show
mask[i] = 0
mask[7] = 0 #maker is a string
X = carmpg[columns[mask]]
Y = carmpg["mpg"]
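The boolean-mask selection above can be tried on a small, made-up frame. This is only a sketch: the column names and values below are hypothetical stand-ins, not the contents of car-mpg.csv. It also shows that, under these assumptions, `DataFrame.drop` gives the same result by naming the columns instead of indexing them by position.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the car data (illustrative columns and values only)
df = pd.DataFrame({
    "mpg": [18.0, 26.0, 31.0],
    "cyl": [8, 4, 4],
    "hp": [130, 90, 65],
    "maker": ["ford", "toyota", "honda"],
})

columns = df.columns
mask = np.ones(columns.shape, dtype=bool)   # start by keeping every column
mask[0] = False                             # exclude the target column (mpg)
mask[columns.get_loc("maker")] = False      # exclude the string column

X = df[columns[mask]]   # predictors only
Y = df["mpg"]           # target

# Equivalent selection by column name rather than by position
X_alt = df.drop(columns=["mpg", "maker"])

print(list(X.columns))
print(X.equals(X_alt))
```

Selecting by name is less fragile than a positional index such as `mask[7]`, which silently picks a different column if the file layout changes.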
Split up the data between training and testing sets:
X_train, X_test, y_train, y_test = train_test_split(
    X, Y, test_size = 0.3, random_state = 100)
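To see what `test_size` and `random_state` do, here is a minimal sketch on made-up data (the frame below is hypothetical, not the car data). It splits the same ten rows twice with the same seed and checks that both splits pick the same rows.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical ten-row data set used only to illustrate the split
X_demo = pd.DataFrame({"feature": range(10)})
y_demo = pd.Series([0, 1] * 5)

# First split: 30% of the rows are held out for testing
a_train, a_test, ya_train, ya_test = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=100)

# Second split with the same seed, to check reproducibility
b_train, b_test, yb_train, yb_test = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=100)

print(len(a_train), len(a_test))
print(list(a_test.index) == list(b_test.index))
```

Fixing `random_state` makes the split repeatable, which matters when comparing models trained in separate runs.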
Create a decision tree model:
clf_gini = tree.DecisionTreeClassifier(criterion = "gini", random_state = 100, max_depth=3, min_samples_leaf=5)
Fit the model to the training data:
clf_gini.fit(X_train, y_train)

DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=3,
            max_features=None, max_leaf_nodes=None,
            min_impurity_split=1e-07, min_samples_leaf=5,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            presort=False, random_state=100, splitter='best')
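The imports include `accuracy_score`, which suggests scoring the fitted tree on the held-out rows. The following is a rough sketch of that step, not the author's code. It assumes scikit-learn's bundled iris data as a stand-in, because car-mpg.csv is not reproduced here, and the names `clf_demo`, `y_pred`, and `acc` are illustrative.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Stand-in data set (assumption): iris ships with scikit-learn
iris = load_iris()
Xi_train, Xi_test, yi_train, yi_test = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=100)

# Same hyperparameters as the tree in the text
clf_demo = DecisionTreeClassifier(criterion="gini", random_state=100,
                                  max_depth=3, min_samples_leaf=5)
clf_demo.fit(Xi_train, yi_train)

# Predict on the held-out rows and compare with the true labels
y_pred = clf_demo.predict(Xi_test)
acc = accuracy_score(yi_test, y_pred)
print("Test accuracy:", acc)
```

With the car data, the same two calls would be `clf_gini.predict(X_test)` followed by `accuracy_score(y_test, y_pred)`.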
Graph out the tree:
#I could not get this to work on a Windows machine
#dot_data = tree.export_graphviz(clf_gini, out_file=None,
#                                filled=True, rounded=True,
#                                special_characters=True)
#graph = graphviz.Source(dot_data)
#graph
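If Graphviz is not available, one possible workaround is scikit-learn's `export_text`, which prints the tree as indented text and needs no external program. This is a sketch under assumptions: it requires a scikit-learn release that includes `export_text` (0.21 or later), and it uses the bundled iris data as a stand-in for the car data.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

# Stand-in data set (assumption): iris ships with scikit-learn
iris = load_iris()

# Same hyperparameters as the tree in the text
clf_text = DecisionTreeClassifier(criterion="gini", random_state=100,
                                  max_depth=3, min_samples_leaf=5)
clf_text.fit(iris.data, iris.target)

# Render the fitted tree as plain text, one line per split or leaf
report = export_text(clf_text, feature_names=list(iris.feature_names))
print(report)
```

For the model in the text, the equivalent call would be `export_text(clf_gini, feature_names=list(X.columns))`.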