Getting the newsgroups data

The first project in this book uses the 20 newsgroups dataset. As its name implies, it is composed of text taken from newsgroup articles. It was originally collected by Ken Lang and has since been widely used for experiments in text applications of machine learning, specifically NLP techniques.

The data contains approximately 20,000 documents across 20 online newsgroups. A newsgroup is a place on the internet where people can ask and answer questions about a certain topic. The data is already cleaned to a certain degree and already split into training and testing sets. The split is based on whether a message was posted before or after a specific date.

The original data comes from http://qwone.com/~jason/20Newsgroups/, with 20 different topics listed, as follows:

  • comp.graphics
  • comp.os.ms-windows.misc
  • comp.sys.ibm.pc.hardware
  • comp.sys.mac.hardware
  • comp.windows.x
  • rec.autos
  • rec.motorcycles
  • rec.sport.baseball
  • rec.sport.hockey
  • sci.crypt
  • sci.electronics
  • sci.med
  • sci.space
  • misc.forsale
  • talk.politics.misc
  • talk.politics.guns
  • talk.politics.mideast
  • talk.religion.misc
  • alt.atheism
  • soc.religion.christian

All of the documents in the dataset are in English, and we can easily deduce the topics from the newsgroup names.

The dataset is labeled: each document is composed of text data and a group label. This makes it a good fit for supervised learning, such as text classification, which we will explore in detail in Chapter 5, Classifying Newsgroup Topic with Support Vector Machine.

Some of the newsgroups are closely related or even overlapping, for instance, the five computer groups (comp.graphics, comp.os.ms-windows.misc, comp.sys.ibm.pc.hardware, comp.sys.mac.hardware, and comp.windows.x), while others are not closely related to each other, such as Christian (soc.religion.christian) and baseball (rec.sport.baseball). Hence, the dataset is also a good use case for unsupervised learning such as clustering, with which we can see whether similar topics are grouped together and unrelated ones are far apart. Moreover, we can even discover abstract topics beyond the original 20 labels using topic modeling techniques. We will explore clustering and topic modeling in detail in Chapter 3, Mining the 20 Newsgroups Dataset with Clustering and Topic Modeling Algorithms.

For now, let's focus on exploring and analyzing the text data. We shall get started with acquiring the data.

It is possible to download the dataset manually from the original website or many other online repositories. However, there are also many versions of the dataset—some are cleaned in a certain way and some are in raw form. To avoid confusion, it is best to use a consistent acquisition method. The scikit-learn library provides a utility function that loads the dataset. Once the dataset is downloaded, it's automatically cached. We don't need to download the same dataset twice.

In most cases, caching the dataset, especially for a relatively small one, is considered a good practice. Other Python libraries also provide data download utilities, but not all of them implement automatic caching. This is another reason why we love scikit-learn.
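To see where the cached copy ends up, the following minimal sketch uses scikit-learn's get_data_home helper. The temporary folder is only an illustrative stand-in for a project-specific cache location, and the exact default path will depend on your machine and environment.

```python
import tempfile
from pathlib import Path

from sklearn.datasets import get_data_home

# With no argument, get_data_home() returns the default cache folder
# used by the dataset loaders (and creates it if it is missing).
default_cache = Path(get_data_home())
print("default cache folder:", default_cache)

# Passing data_home selects a different folder instead. The temporary
# directory here is purely illustrative, not a recommended location.
custom_cache = Path(get_data_home(data_home=tempfile.mkdtemp()))
print("custom cache folder:", custom_cache)
```

The loader introduced next accepts the same data_home argument, so a folder chosen this way can be reused across scripts to avoid repeated downloads.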

As always, we first import the loader function for the 20 newsgroups data, as follows:

>>> from sklearn.datasets import fetch_20newsgroups

Then, we download the dataset with all default parameters as follows:

>>> groups = fetch_20newsgroups()
Downloading 20news dataset. This may take a few minutes.
Downloading dataset from https://ndownloader.figshare.com/files/5975967 (14 MB)

We can also specify one or more topic groups and a particular section (training, testing, or both) and load only that subset of the data into the program. The full list of parameters and options for the loader function is summarized in the following table:

You might find random_state interesting. Why do we need it, and why do we need to fix it? It is used for the purpose of reproducibility: with a fixed value, you get the same dataset, in the same order, every time you run the script. Otherwise, working on datasets shuffled in different orders might bring in unnecessary variations.
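The loader's parameters and their defaults can also be listed directly from its signature, which requires no download. The sketch below does that and then defines load_subset, a hypothetical helper (not part of scikit-learn) that loads only selected groups with a fixed seed; the default seed of 42 in the helper is an arbitrary choice for illustration.

```python
import inspect

from sklearn.datasets import fetch_20newsgroups

# Collect every parameter of the loader together with its default value.
# Inspecting the signature does not trigger any download.
loader_params = {
    name: param.default
    for name, param in inspect.signature(fetch_20newsgroups).parameters.items()
}
for name, default in loader_params.items():
    print(f"{name} = {default!r}")


def load_subset(categories, subset="train", seed=42):
    """Hypothetical helper: load only the given groups, shuffled with a fixed seed."""
    return fetch_20newsgroups(
        subset=subset,          # 'train', 'test', or 'all'
        categories=categories,  # list of newsgroup names to keep
        shuffle=True,
        random_state=seed,      # fixed seed so the order is reproducible
    )
```

Calling load_subset(['sci.space', 'rec.autos']) twice with the same seed should return the documents in the same order, while changing the seed should change the order but not the set of documents.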
