The input data is split into K parts: one part is reserved for testing and the other K-1 are used for training. The process is repeated K times, so that each part serves as the test set exactly once, and the evaluation metrics are averaged. This helps determine how well a model would generalize to unseen data.
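The splitting procedure can be sketched in plain Python. This is a minimal illustration, not the implementation used in this chapter: the helper names `k_fold_indices` and `average_metric` are hypothetical, and the `evaluate` callback is a stand-in for whatever metric is being computed on each fold.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, test_idx) pairs for k contiguous folds."""
    # Spread the remainder so that fold sizes differ by at most one.
    base, extra = divmod(n_samples, k)
    start = 0
    for fold in range(k):
        size = base + (1 if fold < extra else 0)
        test_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, test_idx
        start += size


def average_metric(n_samples, k, evaluate):
    """Call evaluate(train_idx, test_idx) on every fold and average the scores."""
    scores = [evaluate(train, test) for train, test in k_fold_indices(n_samples, k)]
    return sum(scores) / len(scores)


if __name__ == "__main__":
    # 80 observations in 10 folds: each fold holds out 8 and trains on 72.
    for train, test in k_fold_indices(80, 10):
        print(len(train), len(test))
```

In practice the observations are usually shuffled (and often stratified by class) before being cut into folds. The contiguous split here only keeps the mechanics easy to follow.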
In our example, we labeled 96 observations in three classes (positive, negative, and neutral). We used 80 observations (83%) as a training set and 16 observations (17%) as a test set. Many tweets are ambiguous for sentiment classification, even for human beings. Therefore, we expect a precision of around 80%.
We have split our tests into three parts:
- Training set (83%) versus test set (17%)
- Cross validation
- Qualitative verbatim evaluation
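    from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
    from sklearn.model_selection import cross_val_predict

    # Hold-out evaluation on the test set
    print("Naive Bayes")
    print(classification_report(test_labels, nb.predict(test_vectors)))
    print(confusion_matrix(test_labels, nb.predict(test_vectors)))

    # 10-fold cross-validation on the training set
    predicted = cross_val_predict(nb, train_vectors, train_labels, cv=10)
    print("Cross validation %s" % accuracy_score(train_labels, predicted))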
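The cross-validation step in this snippet scores pooled out-of-fold predictions rather than averaging a metric per fold. Each training observation is predicted by a model fitted on the other folds, and the accuracy is computed once over all of those predictions. The sketch below mimics that idea in plain Python with a deliberately trivial majority-class model. The functions `majority_label`, `out_of_fold_predictions`, and `accuracy`, as well as the toy labels, are hypothetical stand-ins, not scikit-learn code or data from this chapter.

```python
from collections import Counter


def majority_label(labels):
    """'Fit' a trivial model: remember the most frequent training label."""
    return Counter(labels).most_common(1)[0][0]


def out_of_fold_predictions(labels, k):
    """Predict every observation with a model that never saw it."""
    n = len(labels)
    predicted = [None] * n
    for fold in range(k):
        # Observation i belongs to the test part of fold i % k.
        test_idx = [i for i in range(n) if i % k == fold]
        train_labels = [labels[i] for i in range(n) if i % k != fold]
        model = majority_label(train_labels)
        for i in test_idx:
            predicted[i] = model
    return predicted


def accuracy(true_labels, predicted_labels):
    """Share of observations whose predicted label matches the true one."""
    hits = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return hits / len(true_labels)


# Toy labels for illustration only.
labels = ["negative"] * 6 + ["neutral"] * 2 + ["positive"] * 2
predicted = out_of_fold_predictions(labels, k=5)
print("Cross validation %s" % accuracy(labels, predicted))
```

With a real classifier and an integer `cv`, scikit-learn builds stratified folds instead of the simple interleaving used here, but the pooling of predictions works the same way.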
The first test showed a weighted-average precision of 75%, which is acceptable for a dataset with so few labeled observations:
| Naive Bayes | precision | recall | f1-score | support |
|-------------|-----------|--------|----------|---------|
| negative    | 0.80      | 0.50   | 0.62     | 8       |
| neutral     | 1.00      | 0.20   | 0.33     | 5       |
| positive    | 0.20      | 0.67   | 0.31     | 3       |
| avg / total | 0.75      | 0.44   | 0.47     | 16      |
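The `avg / total` row can be checked by hand: it matches the support-weighted mean of the per-class scores. The short sketch below recomputes it from the rounded values in the report above, so small rounding differences are possible. The names `report` and `weighted_average` are illustrative only.

```python
# Per-class scores copied from the report: (precision, recall, f1-score, support)
report = {
    "negative": (0.80, 0.50, 0.62, 8),
    "neutral": (1.00, 0.20, 0.33, 5),
    "positive": (0.20, 0.67, 0.31, 3),
}


def weighted_average(column):
    """Average one score column, weighting each class by its support."""
    total_support = sum(row[3] for row in report.values())
    weighted_sum = sum(row[column] * row[3] for row in report.values())
    return weighted_sum / total_support


for name, column in (("precision", 0), ("recall", 1), ("f1-score", 2)):
    print("%-9s %.2f" % (name, weighted_average(column)))
```

With these inputs the three weighted means come out at 0.75, 0.44, and 0.47, matching the last row of the report.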
With 10-fold cross-validation on the training set, we obtained an accuracy of around 74%:
- Cross validation = 0.7375

The human check of the tweets' sentiment also looks very promising. We have extracted some random verbatims to illustrate the results:
- Positive: The success of the Premier League, with its record-breaking takings is impacting in Europe https://t.co/JDulKICszb
- Neutral: Arsenal and Manchester United home fixtures moved https://t.co/kyg7H1H6BN, #saintsfc
- Negative: Wenger's future at Arsenal plunged into further uncertainty as Palace profit https://t.co/gvbbdLH9gi
As you can see, certain verbatims are too ambiguous for even humans to interpret the sentiment correctly, so a perfect sentiment analysis algorithm is unrealistic. When we analyze content on a specific topic, such as football in this chapter, creating a custom sentiment analysis algorithm is a good idea. For mixed content, where the topic is not evident, one can use a readily available open source module. However, when building a custom algorithm, it is critical to use validation techniques to verify that it reaches a minimum level of accuracy.