The interesting thing about this dataset is that each comment can have multiple labels. For instance, a comment could be both insulting and toxic, or it could be obscene and contain identity_hate elements.
Hence, we are leveling up here by trying to predict not one label (such as positive or negative), but multiple labels in one go. For each label, we'll predict a value between 0 and 1 to indicate how likely it is to belong to that category.
Strictly speaking, this value is not a probability in the Bayesian sense, but it serves the same purpose: the closer it is to 1, the more confident we are that the label applies.
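To make this concrete, here is a minimal sketch of what multi-label scoring looks like. The logit values below are made up for illustration; the point is that a sigmoid squashes each score into (0, 1) independently, so a single comment can score high on several labels at once:

```python
import numpy as np

# Hypothetical raw model scores (logits) for one comment, one per label.
# These numbers are invented for illustration only.
labels = ["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]
logits = np.array([2.1, -1.3, 1.7, -3.0, 0.4, -2.2])

# The sigmoid maps each logit independently into (0, 1), unlike a softmax,
# which would force the scores to compete and sum to 1.
scores = 1 / (1 + np.exp(-logits))

for label, score in zip(labels, scores):
    print(f"{label}: {score:.3f}")
```

Note that both `toxic` and `obscene` end up above 0.5 here, which is exactly the multi-label behavior we want.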
Let's preview the test dataset as well using the same idea:
test_df = pd.read_csv("data/test.csv")
test_df.head()
We get the following output:
 | id | comment_text
---|---|---
0 | 00001cee341fdb12 | Yo bitch Ja Rule is more succesful then you'll... |
1 | 0000247867823ef7 | == From RfC == The title is fine as i... |
2 | 00013b17ad220c46 | == Sources == * Zawe Ashto... |
3 | 00017563c3f7919a | If you have a look back at the source, the in... |
4 | 00017695ad8997eb | I don't anonymously edit articles at all. |
This preview confirms that we are dealing with a text challenge focused on the semantic categorization of comments. Note that the test dataset does not contain the target columns at all; we can infer which columns those are from the train dataframe.
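One way to infer the label columns is to take whatever the train dataframe has beyond the columns it shares with the test dataframe. A minimal sketch, using tiny hand-built frames (with only two of the labels) in place of the real CSVs:

```python
import pandas as pd

# Stand-ins for the real train/test CSVs, reduced to two labels
# for illustration.
train_df = pd.DataFrame({
    "id": ["a1"],
    "comment_text": ["some comment"],
    "toxic": [0],
    "insult": [1],
})
test_df = pd.DataFrame({
    "id": ["b2"],
    "comment_text": ["another comment"],
})

# The target columns are those present in train but absent from test.
label_cols = [c for c in train_df.columns if c not in test_df.columns]
print(label_cols)  # -> ['toxic', 'insult']
```

On the real data the same expression yields all six toxicity labels, which we can then use as the prediction targets.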