More examples - fetal state classification on cardiotocography with SVM

After the successful application of SVM with a linear kernel, we will look at one more example, where an SVM with the RBF kernel is a good fit.

We are going to build a classifier that helps obstetricians categorize cardiotocograms (CTGs) into one of three fetal states (normal, suspect, and pathologic). The cardiotocography dataset we use is from the UCI Machine Learning Repository (https://archive.ics.uci.edu/ml/datasets/Cardiotocography) and can be downloaded directly as an .xls Excel file via https://archive.ics.uci.edu/ml/machine-learning-databases/00193/CTG.xls. The dataset consists of measurements of fetal heart rate and uterine contraction as features, and the fetal state class code (1=normal, 2=suspect, 3=pathologic) as the label. There are 2126 samples with 23 features in total. Based on the numbers of instances and features (2126 is not far larger than 23), the RBF kernel is the first choice.

We will work with the .xls Excel file using pandas (http://pandas.pydata.org/), a powerful data analysis library. It can be easily installed via pip install pandas in the Terminal. pandas may request an additional installation of the xlrd package, which its Excel module depends on.

We first read the data located in the sheet named Raw Data:

>>> import pandas as pd
>>> df = pd.read_excel('CTG.xls', "Raw Data")

We then take these 2126 data samples and assign the feature matrix (columns D to AL in the spreadsheet) and the label vector (column AN), respectively:

>>> # .ix was removed in pandas 1.0; .iloc (end-exclusive) is its replacement
>>> X = df.iloc[1:2127, 3:-2].values
>>> Y = df.iloc[1:2127, -1].values

Don't forget to check class proportions:

>>> from collections import Counter
>>> Counter(Y)
Counter({1.0: 1655, 2.0: 295, 3.0: 176})

We set aside 20% of the original data for final testing:

>>> from sklearn.model_selection import train_test_split
>>> X_train, X_test, Y_train, Y_test = train_test_split(X, Y,
test_size=0.2, random_state=42)
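Since the classes are imbalanced, it can also be worth passing `stratify` to `train_test_split` so that the split preserves the class proportions. A minimal sketch on synthetic data (the `X_demo`/`y_demo` arrays here are illustrative, not the CTG dataset):

```python
import numpy as np
from collections import Counter
from sklearn.model_selection import train_test_split

# Synthetic imbalanced labels roughly mimicking the CTG class skew
rng = np.random.RandomState(42)
X_demo = rng.rand(1000, 5)
y_demo = np.repeat([1, 2, 3], [780, 140, 80])

# stratify=y_demo keeps the class proportions in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo)

print(Counter(y_te))   # each class keeps roughly its 78/14/8% share
```

Without `stratify`, a random 20% split of a rare class can deviate noticeably from its overall proportion.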

Now we tune the RBF-based SVM model in terms of the penalty parameter C and the kernel coefficient γ:

>>> import timeit
>>> from sklearn.svm import SVC
>>> from sklearn.model_selection import GridSearchCV
>>> svc = SVC(kernel='rbf')
>>> parameters = {'C': (100, 1e3, 1e4, 1e5),
...               'gamma': (1e-8, 1e-7, 1e-6, 1e-5)}
>>> grid_search = GridSearchCV(svc, parameters, n_jobs=-1, cv=3)
>>> start_time = timeit.default_timer()
>>> grid_search.fit(X_train, Y_train)
>>> print("--- %0.3f seconds ---" %
...       (timeit.default_timer() - start_time))
--- 6.044 seconds ---
>>> grid_search.best_params_
{'C': 100000.0, 'gamma': 1e-07}
>>> grid_search.best_score_
0.942352941176
>>> svc_best = grid_search.best_estimator_
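If you want to see how every (C, γ) combination fared rather than just the winner, `GridSearchCV` also exposes `cv_results_`. A small self-contained sketch (synthetic data and a smaller grid than above, for illustration only):

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV

# Synthetic binary classification data, not the CTG dataset
rng = np.random.RandomState(0)
X_demo = rng.rand(200, 5)
y_demo = (X_demo[:, 0] + X_demo[:, 1] > 1).astype(int)

grid = GridSearchCV(SVC(kernel='rbf'),
                    {'C': (1, 10), 'gamma': (0.1, 1)}, cv=3)
grid.fit(X_demo, y_demo)

# Mean cross-validated accuracy for each parameter combination
for params, score in zip(grid.cv_results_['params'],
                         grid.cv_results_['mean_test_score']):
    print(params, round(score, 3))
```

This can reveal, for instance, whether the best score sits on the edge of the grid, in which case the search range should be widened.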

And finally apply the optimal model to the testing set:

>>> accuracy = svc_best.score(X_test, Y_test)
>>> print('The accuracy on testing set is: {0:.1f}%'.format(accuracy * 100))
The accuracy on testing set is: 96.5%

Also check the performance on individual classes, since the data is not quite balanced:

>>> from sklearn.metrics import classification_report
>>> prediction = svc_best.predict(X_test)
>>> report = classification_report(Y_test, prediction)
>>> print(report)
             precision    recall  f1-score   support

        1.0       0.98      0.98      0.98       333
        2.0       0.89      0.91      0.90        64
        3.0       0.96      0.93      0.95        29

avg / total       0.96      0.96      0.96       426
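If the minority classes (suspect and pathologic) needed more attention, one option not used in this example is SVC's `class_weight='balanced'`, which scales the penalty for each class inversely to its frequency. A minimal sketch on synthetic imbalanced data:

```python
import numpy as np
from sklearn.svm import SVC

# Synthetic two-class data: 90% in class 0, 10% in class 1
rng = np.random.RandomState(1)
X_demo = np.vstack([rng.randn(180, 2), rng.randn(20, 2) + 2])
y_demo = np.array([0] * 180 + [1] * 20)

# 'balanced' sets each class weight to n_samples / (n_classes * n_class_samples),
# so misclassifying a minority sample costs more during training
clf = SVC(kernel='rbf', class_weight='balanced')
clf.fit(X_demo, y_demo)
print(clf.score(X_demo, y_demo))
```

Whether reweighting actually improves the minority-class recall should be verified with a per-class report, as above.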