Exploring the newsgroups data

After we download the 20 newsgroups dataset by whatever means we prefer, the groups data object is cached in memory. It takes the form of a key-value dictionary, with the following keys:

>>> groups.keys()
dict_keys(['data', 'filenames', 'target_names', 'target', 'DESCR'])
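The commands in this section read the object both by key (groups['target_names']) and by attribute (groups.target). As a rough illustration of how a dictionary can support both styles, here is a minimal sketch. AttrDict and the sample values are hypothetical stand-ins, not the class or data that the dataset loader actually returns:

```python
class AttrDict(dict):
    """Hypothetical stand-in: a dict whose keys can also be read as attributes."""

    def __getattr__(self, name):
        # Only called when normal attribute lookup fails,
        # so regular dict methods such as keys() keep working.
        try:
            return self[name]
        except KeyError:
            raise AttributeError(name) from None


# Tiny made-up sample, not the real dataset.
sample = AttrDict(
    target_names=["alt.atheism", "comp.graphics"],
    target=[1, 0, 1],
)

# Both access styles return the very same object.
assert sample["target"] is sample.target
print(list(sample.keys()))
print(sample.target_names)
```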

The target_names key gives the newsgroups names:

>>> groups['target_names']
   ['alt.atheism', 'comp.graphics', 'comp.os.ms-windows.misc', 'comp.sys.ibm.pc.hardware', 'comp.sys.mac.hardware', 'comp.windows.x', 'misc.forsale', 'rec.autos', 'rec.motorcycles', 'rec.sport.baseball', 'rec.sport.hockey', 'sci.crypt', 'sci.electronics', 'sci.med', 'sci.space', 'soc.religion.christian', 'talk.politics.guns', 'talk.politics.mideast', 'talk.politics.misc', 'talk.religion.misc']

The target key holds the newsgroup of each document, encoded as an integer:

>>> groups.target
array([7, 4, 4, ..., 3, 1, 8])

What are the distinct values of these integers? We can use the unique function from NumPy to find out:

>>> import numpy as np
>>> np.unique(groups.target)
array([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19])

They range from 0 to 19, representing the 1st, 2nd, 3rd, …, 20th newsgroup topics in groups['target_names'].
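Since each integer is simply a position in the list of names, decoding a label is a list lookup. The following sketch uses only the first nine names from the output above and the six label values visible in the truncated array. The variable names are ours, chosen for illustration:

```python
# First nine entries of groups['target_names'], copied from the output above.
target_names = [
    "alt.atheism", "comp.graphics", "comp.os.ms-windows.misc",
    "comp.sys.ibm.pc.hardware", "comp.sys.mac.hardware", "comp.windows.x",
    "misc.forsale", "rec.autos", "rec.motorcycles",
]

# The six labels visible in the truncated array shown earlier.
sample_targets = [7, 4, 4, 3, 1, 8]

# Each integer label indexes directly into the list of names.
decoded = [target_names[label] for label in sample_targets]
for label, name in zip(sample_targets, decoded):
    print(label, name)
```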

In the context of multiple topics or categories, it is important to know what the distribution of topics is. A uniform class distribution is the easiest to deal with, because there are no under-represented or over-represented categories. However, we frequently have a skewed distribution, with one or more categories dominating. Here we use the seaborn package (https://seaborn.pydata.org/) to compute the histogram of categories and plot it with the matplotlib package (https://matplotlib.org/). We can install both packages via pip as follows:

python -m pip install -U matplotlib
pip install seaborn

In the case of conda, you can execute the following command line:

conda install -c conda-forge matplotlib
conda install seaborn

matplotlib is one of the dependencies of the seaborn package, so make sure it is installed; running the commands in the order shown takes care of this.

Now let's display the distribution of the classes as follows (note that distplot is deprecated as of seaborn 0.11, where histplot is its replacement):

>>> import seaborn as sns
>>> sns.distplot(groups.target)
<matplotlib.axes._subplots.AxesSubplot object at 0x108ada6a0>
>>> import matplotlib.pyplot as plt
>>> plt.show()

Refer to the following screenshot for the end result:

As we can see, the distribution is approximately uniform, so that's one less thing to worry about.
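If we want a number to go with the picture, we can count the documents per class and compare the largest class with the smallest. This is a minimal sketch using only the standard library. class_balance is a hypothetical helper and the labels are made up rather than taken from the real dataset:

```python
from collections import Counter


def class_balance(labels):
    """Return per-class counts and the ratio of the largest to the smallest count."""
    counts = Counter(labels)
    largest = max(counts.values())
    smallest = min(counts.values())
    return counts, largest / smallest


# Made-up labels for three classes, not the real newsgroups targets.
toy_labels = [0] * 10 + [1] * 9 + [2] * 11

counts, ratio = class_balance(toy_labels)
print(sorted(counts.items()))
print(round(ratio, 2))
```

A ratio close to 1 suggests a roughly uniform distribution, while a large ratio points to a skewed one. On the real data, we could call class_balance(groups.target.tolist()).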

It's good to visualize to get a general idea of how the data is structured, what possible issues may arise, and whether there are any irregularities that we have to take care of.

The other keys are quite self-explanatory: data contains all the newsgroup documents, and filenames stores the path where each document is located in your filesystem.

Now, let's have a look at the first document, along with its topic number and name, by executing the following commands:

>>> groups.data[0]
"From: [email protected] (where's my thing)
Subject: WHAT car is this!?
Nntp-Posting-Host: rac3.wam.umd.edu
Organization: University of Maryland, College Park
Lines: 15

 I was wondering if anyone out there could enlighten me on this car I saw
the other day. It was a 2-door sports car, looked to be from the late 60s/
early 70s. It was called a Bricklin. The doors were really small. In addition,
the front bumper was separate from the rest of the body. This is 
all I know. If anyone can tellme a model name, engine specs, years
of production, where this car is made, history, or whatever info you
have on this funky looking car, please e-mail.

Thanks,
- IL
 ---- brought to you by your neighborhood Lerxst ----




"
>>> groups.target[0]
7
>>> groups.target_names[groups.target[0]]
'rec.autos'

If random_state is changed from its default value of 42 when loading the data, you may get different results from running the preceding commands.

As we can see, the first document is from the rec.autos newsgroup, which was assigned the number 7. Reading this post, we can easily figure out that it's about cars. The word car actually occurs a number of times in the document. Words such as bumper also seem very car-oriented. However, words such as doors may not necessarily be car-related, as they may also be associated with home improvement or another topic. As a side note, it makes sense not to distinguish between doors and door, or between the same word with different capitalization, such as Doors. There are some rare cases where capitalization does matter, for instance, if we're trying to find out whether a document is about the band called The Doors or about doors in the ordinary sense (the wooden kind).
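To make the side note concrete, here is a rough sketch of counting words after lowercasing them and folding simple plurals together. The trailing-s rule is a deliberately crude assumption for illustration, not a real stemmer, and the snippet is lightly adapted from the post above, with a capitalized Doors added:

```python
import re
from collections import Counter


def normalized_word_counts(text):
    """Lowercase the text, keep runs of letters, and strip a trailing 's'."""
    words = re.findall(r"[a-z]+", text.lower())
    # Crude plural handling (assumption): 'doors' -> 'door'.
    # It would also mangle words such as 'glass', so a real
    # pipeline needs a proper stemmer or lemmatizer.
    stems = [w[:-1] if len(w) > 4 and w.endswith("s") else w for w in words]
    return Counter(stems)


snippet = (
    "WHAT car is this!? It was a 2-door sports car. "
    "The doors were really small. Doors aside, the front bumper was separate."
)

counts = normalized_word_counts(snippet)
print(counts["car"], counts["door"])
```

With this normalization, door, doors, and Doors are all counted under the single key door.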
