Another example – fetal state classification on cardiotocograms

After successfully applying an SVM with a linear kernel, we will look at one more example, this time using an SVM with an RBF kernel.

We are going to build a classifier that helps obstetricians categorize cardiotocograms (CTGs) into one of three fetal states (normal, suspect, and pathologic). The cardiotocography dataset we use is from the UCI Machine Learning Repository at https://archive.ics.uci.edu/ml/datasets/Cardiotocography, and it can be downloaded directly from https://archive.ics.uci.edu/ml/machine-learning-databases/00193/CTG.xls as an .xls Excel file. The dataset consists of measurements of fetal heart rate and uterine contraction as features, and the fetal state class code (1=normal, 2=suspect, 3=pathologic) as the label. There are in total 2,126 samples with 23 features. Since the number of instances is not far greater than the number of features (2,126 versus 23), the RBF kernel is a good first choice.

We work with the Excel file using pandas, which is well suited to tabular data. Running the following lines of code might require an additional installation of the xlrd package, since pandas' Excel module is built on top of xlrd. If so, just run pip install xlrd in the terminal to install xlrd.

We first read the data located in the sheet named Raw Data:

>>> import pandas as pd
>>> df = pd.read_excel('CTG.xls', "Raw Data")

Then, we take the 2,126 data samples and assign the feature set (columns D to AL in the spreadsheet) and the label set (column AN), respectively:

>>> X = df.iloc[1:2127, 3:-2].values
>>> Y = df.iloc[1:2127, -1].values
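As a quick sanity check on how this positional slicing behaves (using a small synthetic frame here, not the CTG data), iloc selects rows and columns by integer position and excludes the stop index:

```python
import numpy as np
import pandas as pd

# Toy frame with 6 columns standing in for spreadsheet columns A..F
df_demo = pd.DataFrame(np.arange(30).reshape(5, 6),
                       columns=list('ABCDEF'))

# Rows 1..3 (stop index 4 is excluded), columns from index 3 ('D')
# up to, but excluding, the last two columns
X_demo = df_demo.iloc[1:4, 3:-2].values
# The last column of the same rows
y_demo = df_demo.iloc[1:4, -1].values

print(X_demo.shape)   # (3, 1)
print(list(y_demo))   # [11, 17, 23]
```

This is why the CTG slice uses 1:2127 with iloc: row 0 is the header-like first row, and the stop index is exclusive, so rows 1 through 2126 give exactly 2,126 samples.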

Don't forget to check the class proportions:

>>> from collections import Counter
>>> Counter(Y)
Counter({1.0: 1655, 2.0: 295, 3.0: 176})
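Converting these counts to proportions makes the imbalance explicit. A small standalone sketch, reusing the counts printed above as literals:

```python
from collections import Counter

# Class counts reported above for the three fetal states
counts = Counter({1.0: 1655, 2.0: 295, 3.0: 176})
total = sum(counts.values())

# Print each class's share of the dataset
for label, n in sorted(counts.items()):
    print(f'class {label}: {n / total:.1%}')
# class 1.0: 77.8%
# class 2.0: 13.9%
# class 3.0: 8.3%
```

Almost four out of five samples are normal, so overall accuracy alone would be a misleading metric here.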

We set aside 20% of the original data for final testing:

>>> from sklearn.model_selection import train_test_split
>>> X_train, X_test, Y_train, Y_test = train_test_split(X, Y,
...     test_size=0.2, random_state=42)
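With imbalanced classes like these, it can also be worth passing stratify to train_test_split so the hold-out set keeps the same class proportions as the full data. A minimal sketch on synthetic labels (the skew here is illustrative, not the actual CTG counts):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic imbalanced labels mimicking the CTG class skew
y = np.array([1] * 80 + [2] * 15 + [3] * 5)
X = np.arange(100).reshape(-1, 1)

# stratify=y draws the 20% hold-out proportionally from each class
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)

print(np.bincount(y_te)[1:])  # [16  3  1] -- same 80/15/5 ratio
```

Without stratification, a rare class could be under-represented, or even absent, in the test split by chance.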

Now, we tune the RBF-based SVM model in terms of the penalty C and the kernel coefficient γ:

>>> from sklearn.svm import SVC
>>> from sklearn.model_selection import GridSearchCV
>>> import timeit
>>> svc = SVC(kernel='rbf')
>>> parameters = {'C': (100, 1e3, 1e4, 1e5),
... 'gamma': (1e-08, 1e-7, 1e-6, 1e-5)}
>>> grid_search = GridSearchCV(svc, parameters, n_jobs=-1, cv=5)
>>> start_time = timeit.default_timer()
>>> grid_search.fit(X_train, Y_train)
>>> print("--- %0.3fs ---" %
...     (timeit.default_timer() - start_time))
--- 11.751s ---
>>> grid_search.best_params_
{'C': 100000.0, 'gamma': 1e-07}

>>> grid_search.best_score_
0.9547058823529412

>>> svc_best = grid_search.best_estimator_
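Note that the selected C of 100000.0 sits at the upper edge of the grid we searched; when the best value lands on a boundary, it is often worth widening the grid in that direction to check whether an even larger value helps. A hedged sketch of such a follow-up search, run here on a small synthetic stand-in (the parameter values and dataset are illustrative assumptions, not the CTG data):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Small synthetic 3-class stand-in for the CTG training set
X_demo, y_demo = make_classification(n_samples=200, n_features=23,
                                     n_informative=10, n_classes=3,
                                     random_state=42)

# Extend the C range upward since the previous optimum hit the grid edge
parameters = {'C': (1e4, 1e5, 1e6, 1e7),
              'gamma': (1e-8, 1e-7, 1e-6)}
grid_search = GridSearchCV(SVC(kernel='rbf'), parameters, cv=3)
grid_search.fit(X_demo, y_demo)

# The new optimum may or may not move beyond 1e5 on the real data
print(grid_search.best_params_)
```

If the re-run still picks the largest C, keep extending; if it settles inside the range, the grid has bracketed the optimum.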

Finally, we apply the optimal model to the testing set:

>>> accuracy = svc_best.score(X_test, Y_test)
>>> print('The accuracy on testing set is:
...     {0:.1f}%'.format(accuracy*100))
The accuracy on testing set is: 96.5%

Also, we have to check the performance on the individual classes, since the classes are imbalanced:

>>> from sklearn.metrics import classification_report
>>> prediction = svc_best.predict(X_test)
>>> report = classification_report(Y_test, prediction)
>>> print(report)
              precision    recall  f1-score   support

         1.0       0.98      0.98      0.98       333
         2.0       0.89      0.91      0.90        64
         3.0       0.96      0.93      0.95        29

   micro avg       0.96      0.96      0.96       426
   macro avg       0.95      0.94      0.94       426
weighted avg       0.96      0.96      0.96       426
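Beyond the per-class report, a confusion matrix shows exactly which classes get mistaken for which, which matters clinically (mislabeling pathologic as normal is far worse than the reverse). A minimal sketch on made-up labels and predictions, not the actual CTG results:

```python
from sklearn.metrics import confusion_matrix

# Illustrative true labels and predictions for the three fetal states
y_true = [1, 1, 1, 2, 2, 3, 3, 3]
y_pred = [1, 1, 2, 2, 2, 3, 3, 1]

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_true, y_pred, labels=[1, 2, 3])
print(cm)
# [[2 1 0]
#  [0 2 0]
#  [1 0 2]]
```

Reading the last row: of the three true class-3 samples, two were predicted correctly and one was predicted as class 1. Applied to svc_best's predictions on X_test, the same call reveals where the remaining 3.5% of errors concentrate.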