BBC dataset

In 2006, Greene and Cunningham collected the BBC dataset to study a particular document clustering challenge using support vector machines. The dataset consists of 2,225 documents published on the BBC News website between 2004 and 2005, covering five topical areas: business, entertainment, politics, sport, and technology. The dataset is available at http://mlg.ucd.ie/datasets/bbc.html.

We can download the raw text files under the Dataset: BBC section. The website also offers an already processed version of the dataset, but for this example we want to process the raw files ourselves. The ZIP contains five folders, one per topic, and each document is placed in the folder of its topic, as shown in the following screenshot:

Now, let's build a topic classifier.
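As a starting point, the one-folder-per-topic layout described above can be read into a list of texts and a parallel list of labels. The following is only a minimal sketch in Python using the standard library. The function name load_topic_folders is hypothetical, and the sketch assumes that the documents are .txt files and that undecodable bytes can safely be replaced; neither detail is stated in the text above.

```python
from pathlib import Path


def load_topic_folders(root):
    """Read a directory that holds one subfolder per topic.

    Returns two parallel lists: document texts and topic labels.
    The label of a document is the name of the folder it sits in.
    """
    texts, labels = [], []
    # Visit only subfolders; loose files in the root are skipped.
    for topic_dir in sorted(p for p in Path(root).iterdir() if p.is_dir()):
        # Assumption: each document is stored as a .txt file.
        for doc in sorted(topic_dir.glob("*.txt")):
            # Assumption: replace bytes that are not valid UTF-8
            # instead of failing on them.
            texts.append(doc.read_text(encoding="utf-8", errors="replace"))
            labels.append(topic_dir.name)
    return texts, labels


if __name__ == "__main__":
    import sys
    from collections import Counter

    # Usage: python load_bbc.py <path to the extracted ZIP folder>
    if len(sys.argv) > 1:
        texts, labels = load_topic_folders(sys.argv[1])
        print(len(texts), "documents")
        print(Counter(labels))
```

Taking the label from the folder name means no separate label file is needed, which matches how the raw dataset is organized.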
