The interesting thing about this dataset is that each comment can have multiple labels. For instance, a comment could be both insulting and toxic, or it could be obscene and contain identity_hate elements.
Hence, we are leveling up here by trying to predict not one label (such as positive or negative), but multiple labels in one go. For each label, we'll predict a value between 0 and 1 to indicate how likely it is to belong to that category.
Strictly speaking, this value is not a probability in the Bayesian sense, but it serves the same purpose: the closer it is to 1, the more confident we are that the label applies.
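To make this concrete, here is a minimal sketch of what multi-label scoring looks like. The logit values below are made up for illustration; the point is that a sigmoid squashes each score into (0, 1) independently, so a single comment can score high on several labels at once:

```python
import numpy as np

# Hypothetical raw model scores (logits) for one comment, one per label.
# These numbers are invented for illustration only.
labels = ["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]
logits = np.array([2.1, -1.3, 1.7, -3.0, 0.4, -2.2])

# The sigmoid maps each logit independently into (0, 1), unlike a softmax,
# which would force the scores to compete and sum to 1.
scores = 1 / (1 + np.exp(-logits))

for label, score in zip(labels, scores):
    print(f"{label}: {score:.3f}")
```

Note that both `toxic` and `obscene` end up above 0.5 here, which is exactly the multi-label behavior we want.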
Let's preview the test dataset as well using the same idea:
test_df = pd.read_csv("data/test.csv")
test_df.head()
We get the following output:
 | id | comment_text
---|---|---
0 | 00001cee341fdb12 | Yo bitch Ja Rule is more succesful then you'll... |
1 | 0000247867823ef7 | == From RfC == The title is fine as i... |
2 | 00013b17ad220c46 | == Sources == * Zawe Ashto... |
3 | 00017563c3f7919a | If you have a look back at the source, the in... |
4 | 00017695ad8997eb | I don't anonymously edit articles at all. |
This preview confirms that we are dealing with a text challenge focused on the semantic categorization of comments. Note that the test dataset does not contain the target columns at all; we can infer which columns those are from the train dataframe.
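One way to infer the label columns is to take whatever the train dataframe has beyond the columns it shares with the test dataframe. A minimal sketch, using tiny hand-built frames (with only two of the labels) in place of the real CSVs:

```python
import pandas as pd

# Stand-ins for the real train/test CSVs, reduced to two labels
# for illustration.
train_df = pd.DataFrame({
    "id": ["a1"],
    "comment_text": ["some comment"],
    "toxic": [0],
    "insult": [1],
})
test_df = pd.DataFrame({
    "id": ["b2"],
    "comment_text": ["another comment"],
})

# The target columns are those present in train but absent from test.
label_cols = [c for c in train_df.columns if c not in test_df.columns]
print(label_cols)  # -> ['toxic', 'insult']
```

On the real data the same expression yields all six toxicity labels, which we can then use as the prediction targets.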