Chapter 8

Shaping Data

IN THIS CHAPTER

  • Manipulating HTML data

  • Manipulating raw text

  • Discovering the bag of words model and other techniques

  • Manipulating graph data

Chapter 7 demonstrates techniques for working with data as an entity — as something you work with in Python. However, data doesn’t exist in a vacuum. It doesn’t just suddenly appear within Python for absolutely no reason at all. As demonstrated in Chapter 6, you load the data. However, loading may not be enough — you may have to shape the data as part of loading it. That’s the purpose of this chapter. You discover how to work with a variety of container formats so that you can load data from complex sources, such as HTML pages and raw text. You even work with graph data, where the connections between data points matter as much as the data points themselves.

Remember As you progress through the book, you discover that data takes all kinds of forms and shapes. As far as the computer is concerned, data consists of 0s and 1s. Humans give the data meaning by formatting, storing, and interpreting it in a certain way. The same group of 0s and 1s could be a number, date, or text, depending on the interpretation. The data container provides clues as to how to interpret the data, so that’s why this chapter is so important to you as a data scientist using Python to discover data patterns. You find that you can discover patterns in places where you might have thought patterns couldn’t exist.

You don’t have to type the source code for this chapter manually. In fact, it’s a lot easier if you use the downloadable source (see the Introduction for download instructions). The source code for this chapter appears in the P4DS4D2_08_Shaping_Data.ipynb source code file.

Working with HTML Pages

HTML pages contain data in a hierarchical format. You often find HTML content in a strict HTML form or as XML. The HTML form can present problems because it doesn’t necessarily follow strict formatting rules. XML does follow strict formatting rules because of the standards used to define it, which makes it easier to parse. However, in both cases, you use similar techniques to parse a page. The first section that follows describes how to parse HTML pages in general.

Sometimes you don’t need all the data on a page. Instead you need specific data, which is where XPath comes into play. You can use XPath to locate specific data on the HTML page and extract it for your particular needs.

Parsing XML and HTML

Simply extracting data from an XML file as you do in Chapter 6 may not be enough. The data may not be in the correct format. Using the approach in Chapter 6, you end up with a DataFrame containing three columns of type str. Obviously, you can’t perform much data manipulation with strings. The following example shapes the XML data from Chapter 6 to create a new DataFrame containing just the <Number> and <Boolean> elements in the correct format.

from lxml import objectify
import pandas as pd
from distutils import util

xml = objectify.parse(open('XMLData.xml'))
root = xml.getroot()
df = pd.DataFrame(columns=('Number', 'Boolean'))

for i in range(0, 4):
    obj = root.getchildren()[i].getchildren()
    row = dict(zip(['Number', 'Boolean'],
                   [obj[0].pyval,
                    bool(util.strtobool(obj[2].text))]))
    row_s = pd.Series(row)
    row_s.name = obj[1].text
    df = df.append(row_s)

print(type(df.loc['First']['Number']))
print(type(df.loc['First']['Boolean']))

The DataFrame df is initially instantiated as empty, but as the code loops through the root node’s children, it extracts a list containing the following

  • A <Number> element (expressed as an int)
  • An ordinal element (a string)
  • A <Boolean> element (expressed as a string)

that the code uses to increment df. In fact, the code relies on the ordinal number element as the index label and constructs a new individual row to append to the existing DataFrame. This operation programmatically converts the information contained in the XML tree into the right data type to place into the existing variables in df. The number elements are already available as int type; the conversion of the <Boolean> element is a little harder. You must convert the string to a numeric value using the strtobool() function in distutils.util. The output is a 0 for False values and a 1 for True values. However, that’s still not a Boolean value. To create a Boolean value, you must convert the 0 or 1 using bool().
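If you want to see this two-step conversion in isolation, here’s a minimal sketch (separate from the downloadable example) that shows what strtobool() and bool() do with typical string values:

from distutils import util

# strtobool() returns 1 for strings such as 'True', 'yes', or '1'
# and 0 for strings such as 'False', 'no', or '0'.
print(util.strtobool('True'))         # 1
print(util.strtobool('False'))        # 0

# bool() then turns the 0 or 1 into a true Boolean value.
print(bool(util.strtobool('True')))   # True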

Tip This example also shows how to access individual values in the DataFrame. Notice that the name property now uses the <String> element value for easy access. You provide an index value using loc and then access the individual feature using a second index. The output from this example is

<class 'int'>

<class 'bool'>

Using XPath for data extraction

Using XPath to extract data from your dataset can greatly reduce the complexity of your code and potentially make it faster as well. The following example shows an XPath version of the example in the previous section. Notice that this version is shorter and doesn’t require the use of a for loop.

from lxml import objectify
import pandas as pd
from distutils import util

xml = objectify.parse(open('XMLData.xml'))
root = xml.getroot()

map_number = map(int, root.xpath('Record/Number'))
map_bool = map(str, root.xpath('Record/Boolean'))
map_bool = map(util.strtobool, map_bool)
map_bool = map(bool, map_bool)
map_string = map(str, root.xpath('Record/String'))

data = list(zip(map_number, map_bool))

df = pd.DataFrame(data,
                  columns=('Number', 'Boolean'),
                  index=list(map_string))

print(df)
print(type(df.loc['First']['Number']))
print(type(df.loc['First']['Boolean']))

The example begins just like the previous example, with the importing of data and obtaining of the root node. At this point, the example creates a data object that contains record number and Boolean value pairs. Because the XML file entries are all strings, you must use the map() function to convert the strings to the appropriate values. Working with the record number is straightforward — all you do is map it to an int. The xpath() function accepts a path from the root node to the data you need, which is 'Record/Number' in this case.

Mapping the Boolean value is a little more difficult. As in the previous section, you must use the util.strtobool() function to convert the string Boolean values to a number that bool() can convert to a Boolean equivalent. However, if you try to perform just a double mapping, you’ll encounter an error message saying that lists don’t include a required function, tolower(). To overcome this obstacle, you perform a triple mapping and convert the data to a string using the str() function first.

Creating the DataFrame is different, too. Instead of adding individual rows, you add all the rows at one time by using data. Setting up the column names is the same as before. However, now you need some way of adding the row names, as in the previous example. This task is accomplished by setting the index parameter to a mapped version of the xpath() output for the 'Record/String' path. Here’s the output you can expect:

        Number  Boolean
First        1     True
Second       2    False
Third        3     True
Fourth       4    False

<class 'numpy.int64'>
<class 'numpy.bool_'>

Working with Raw Text

Even though it might seem as if raw text wouldn’t present a problem in parsing because it doesn’t contain any special formatting, you do have to consider how the text is stored and whether it contains special words within it. The multiple forms of Unicode can present interpretation problems that you need to consider as you work through the text. Using regular expressions can help you locate specific information within a raw-text file. You can use regular expressions for both data cleaning and pattern matching. The following sections help you understand the techniques used to shape raw-text files.

Dealing with Unicode

Text files are pure text — this much is certain. The way the text is encoded can differ. For example, a character can use seven, eight, or more bits for encoding purposes. The use of special characters can differ as well. In short, the interpretation of bits used to create characters differs from encoding to encoding. You can see a host of encodings at http://www.i18nguy.com/unicode/codepages.html.

Remember Sometimes you need to work with encodings other than the default encoding set within the Python environment. When working with Python 3.x, you must rely on Unicode Transformation Format 8-bit (UTF-8) as the encoding used to read and write files. This environment is always set for UTF-8, and trying to change it causes an error message. The article at https://docs.python.org/3/howto/unicode.html provides insights on how to get around the Unicode problems in Python.
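As an illustration only (the file name and encoding here are hypothetical), the following sketch shows how you can state an encoding explicitly when reading a file and then write the content back out as UTF-8:

# Hypothetical file saved by another application using Latin-1 encoding.
with open('legacy.txt', 'r', encoding='latin-1') as source:
    text = source.read()

# Writing the text back out as UTF-8 normalizes the encoding for later use.
with open('legacy_utf8.txt', 'w', encoding='utf-8') as target:
    target.write(text)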

Warning Dealing with encoding in the incorrect way can prevent you from performing tasks such as importing modules or processing text. Make sure to test your code carefully and completely to ensure that any problem with encoding won’t affect your ability to run the application. Good additional articles to read on this topic appear at http://blog.notdot.net/2010/07/Getting-unicode-right-in-Python and http://web.archive.org/web/20120722170929/http://boodebr.org/main/python/all-about-python-and-unicode.

Stemming and removing stop words

Stemming is the process of reducing words to their stem (or root) word. This task isn’t the same as understanding that some words come from Latin or other roots, but instead makes like words equal to each other for the purpose of comparison or sharing. For example, the words cats, catty, and catlike all have the stem cat. The act of stemming helps you analyze sentences by tokenizing them in a more efficient way because the machine learning algorithm has to learn about the stem cat and not about all its variants.

Removing suffixes to create stem words and generally tokenizing sentences are only two parts, however, of the larger process of creating something like a natural language interface. Languages include a great number of glue words that don’t mean much to a computer but have significant meaning to humans, such as a, as, the, that, and so on in English. These short, less useful words are stop words. To humans, sentences don’t make sense without them, but for your computer, they can act as a means of stopping sentence analysis.

The act of stemming and removing stop words simplifies the text and reduces the number of textual elements so that just the essential elements remain. In addition, you keep just the terms that are nearest to the true sense of the phrase. By reducing phrases in such a fashion, a computational algorithm can work faster and process the text more effectively.

Warning This example requires the use of the Natural Language Toolkit (NLTK), which Anaconda (see Chapter 3 for details on Anaconda) doesn’t install by default. To use this example, you must download and install NLTK using the instructions found at http://www.nltk.org/install.html for your platform. Make certain that you install the NLTK for whatever version of Python you’re using for this book when you have multiple versions of Python installed on your system. After you install NLTK, you must also install the packages associated with it. The instructions at http://www.nltk.org/data.html tell you how to perform this task (install all the packages to ensure you have everything).

The following example demonstrates how to perform stemming and remove stop words from a sentence. It begins by training an algorithm to perform the required analysis using a test sentence. Afterward, the example checks a second sentence for words that appear in the first.

from sklearn.feature_extraction.text import *
from nltk import word_tokenize
from nltk.stem.porter import PorterStemmer

stemmer = PorterStemmer()

def stem_tokens(tokens, stemmer):
    stemmed = []
    for item in tokens:
        stemmed.append(stemmer.stem(item))
    return stemmed

def tokenize(text):
    tokens = word_tokenize(text)
    stems = stem_tokens(tokens, stemmer)
    return stems

vocab = ['Sam loves swimming so he swims all the time']

vect = CountVectorizer(tokenizer=tokenize,
                       stop_words='english')
vec = vect.fit(vocab)

sentence1 = vec.transform(['George loves swimming too!'])

print(vec.get_feature_names())
print(sentence1.toarray())

At the outset, the example creates a vocabulary using a test sentence and places it in vocab. It then creates a CountVectorizer, vect, to hold a list of stemmed words, but it excludes the stop words. The tokenizer parameter defines the function used to stem the words. Setting the stop_words parameter to 'english' tells the vectorizer to filter out the built-in list of English stop words. (You can see other parameters for the CountVectorizer() at https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html.) Calling fit() on the vocabulary produces the fitted vectorizer, vec, which then performs the actual transformation on a test sentence using the transform() function. Here’s the output from this example.

['love', 'sam', 'swim', 'time']

[[1 0 1 0]]

The first output shows the stemmed words. Notice that the list contains only swim, not swimming and swims. All the stop words are missing as well. For example, you don’t see the words so, he, all, or the.

The second output shows how many times each stemmed word appears in the test sentence. In this case, a love variant appears once and a swim variant appears once as well. The words sam and time don’t appear in the second sentence, so those values are set to 0.
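As a quick check that both stemming and stop-word removal are at work, you can transform another sentence that contains plural forms and stop words. This short sketch reuses the vec object from the example above; the expected counts shown in the comment are an assumption based on the vocabulary printed earlier:

# 'The', 'are', 'all', and 'the' are stop words; 'swimming' stems to 'swim'.
# 'cats' stems to 'cat', which isn't in the vocabulary, so it's ignored.
sentence2 = vec.transform(['The cats are swimming all the time!'])
print(sentence2.toarray())   # expected: [[0 0 1 1]]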

Introducing regular expressions

Regular expressions present the data scientist with an interesting array of tools for parsing raw text. At first, it may seem daunting to figure out precisely how regular expressions work. However, sites such as https://regexr.com/ let you play with regular expressions so that you can see how the use of various expressions performs specific types of pattern matching. Of course, the first requirement is to discover pattern matching, which is the use of special characters to tell a parsing engine what to find in the raw text file. Table 8-1 provides a list of pattern-matching characters and tells you how to use them.

TABLE 8-1 Pattern-Matching Characters Used in Python

Character     Interpretation

(re)          Groups regular expressions and remembers the matched text
(?: re)       Groups regular expressions without remembering matched text
(?#…)         Indicates a comment, which isn't processed
re?           Matches 0 or 1 occurrence of the preceding expression (but no more than 0 or 1 occurrence)
re*           Matches 0 or more occurrences of the preceding expression
re+           Matches 1 or more occurrences of the preceding expression
(?> re)       Matches an independent pattern without backtracking
.             Matches any single character except the newline (\n) character (adding the s option, re.DOTALL, allows it to match the newline character as well)
[^…]          Matches any single character or range of characters not found within the brackets
[…]           Matches any single character or range of characters that appears within the brackets
re{ n, m}     Matches at least n and at most m occurrences of the preceding expression
\n, \t, etc.  Matches control characters such as newlines (\n), carriage returns (\r), and tabs (\t)
\d            Matches digits (which is equivalent to using [0-9])
a|b           Matches either a or b
re{ n}        Matches exactly the number of occurrences of the preceding expression specified by n
re{ n,}       Matches n or more occurrences of the preceding expression
\D            Matches nondigits
\S            Matches nonwhitespace
\B            Matches nonword boundaries
\W            Matches nonword characters
\1…\9         Matches the nth grouped subexpression
\10           Matches the nth grouped subexpression if it matched already (otherwise, the pattern refers to the octal representation of a character code)
\A            Matches the beginning of a string
^             Matches the beginning of the line
\z            Matches the end of a string
\Z            Matches the end of a string (when a newline exists, it matches just before the newline)
$             Matches the end of the line
\G            Matches the point where the last match finished
\s            Matches whitespace (which is equivalent to using [ \t\n\r\f])
\b            Matches word boundaries when outside the brackets; matches the backspace (0x08) when inside the brackets
\w            Matches word characters
(?= re)       Specifies a position using a pattern (This pattern doesn't have a range.)
(?! re)       Specifies a position using pattern negation (This pattern doesn't have a range.)
(?-imx)       Toggles the i, m, or x options temporarily off within a regular expression (when this pattern appears in parentheses, only the area within the parentheses is affected)
(?imx)        Toggles the i, m, or x options temporarily on within a regular expression (when this pattern appears in parentheses, only the area within the parentheses is affected)
(?-imx: re)   Toggles the i, m, or x options within parentheses temporarily off
(?imx: re)    Toggles the i, m, or x options within parentheses temporarily on

Using regular expressions helps you manipulate complex text before using other techniques described in this chapter. In the following example, you see how to extract a telephone number from a sentence no matter where the telephone number appears. This sort of manipulation is helpful when you have to work with text of various origins and in irregular format. You can see some additional telephone number manipulation routines at http://www.diveintopython.net/regular_expressions/phone_numbers.html. Most important, this example helps you understand how to extract the text you need from text you don’t.

import re

data1 = 'My phone number is: 800-555-1212.'
data2 = '800-555-1234 is my phone number.'

pattern = re.compile(r'(\d{3})-(\d{3})-(\d{4})')

dmatch1 = pattern.search(data1).groups()
dmatch2 = pattern.search(data2).groups()

print(dmatch1)
print(dmatch2)

The example begins with two telephone numbers placed in sentences in various locations. Before you can do much, you need to create a pattern. Always read a pattern from left to right. In this case, the pattern is looking for three digits, followed by a dash, three more digits, followed by another dash, and finally four digits.

To make the process faster and easier, the code calls the compile() function to create a compiled version of the pattern so that Python doesn’t have to recreate the pattern every time you need it. The compiled pattern appears in pattern.

The search() function looks for the pattern in each of the test sentences. When it finds a match, the groups() call returns the matched digits as a tuple, which the code places into one of two variables. Here’s the output from this example.

('800', '555', '1212')

('800', '555', '1234')
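When a single string contains more than one telephone number, the same pattern also works with findall(), which returns every match instead of just the first one. This short sketch isn’t part of the downloadable example:

import re

pattern = re.compile(r'(\d{3})-(\d{3})-(\d{4})')
data3 = 'Call 800-555-1212 or 800-555-1234 for help.'

# findall() returns one tuple of captured groups per match.
print(pattern.findall(data3))
# [('800', '555', '1212'), ('800', '555', '1234')]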

Using the Bag of Words Model and Beyond

The goal of most data imports is to perform some type of analysis. Before you can perform analysis on textual data, you must tokenize every word within the dataset. The act of tokenizing the words creates a bag of words. You can then use the bag of words to train classifiers, a special kind of algorithm used to place text into categories. The following section provides additional insights into the bag of words model and shows how to work with it.

Understanding the bag of words model

As mentioned in the introduction, in order to perform textual analysis of various sorts, you need to first tokenize the words and create a bag of words from them. The bag of words uses numbers to represent words, word frequencies, and word locations that you can manipulate mathematically to see patterns in the way that the words are structured and used. The bag of words model ignores grammar and even word order — the focus is on simplifying the text so that you can easily analyze it.

The creation of a bag of words revolves around Natural Language Processing (NLP) and Information Retrieval (IR). Before you perform this sort of processing, you normally remove any special characters (such as HTML formatting from a web source), remove the stop words, and possibly perform stemming as well (as described in the “Stemming and removing stop words” section, earlier in this chapter). For the purpose of this example, you use the 20 Newsgroups dataset directly. Here’s an example of how you can obtain textual input and create a bag of words from it:

from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import *

categories = ['comp.graphics', 'misc.forsale',
              'rec.autos', 'sci.space']

twenty_train = fetch_20newsgroups(subset='train',
                                  categories=categories,
                                  shuffle=True,
                                  random_state=42)

count_vect = CountVectorizer()
X_train_counts = count_vect.fit_transform(
    twenty_train.data)
print("BOW shape:", X_train_counts.shape)

caltech_idx = count_vect.vocabulary_['caltech']
print('"Caltech": %i' % X_train_counts[0, caltech_idx])

Remember A number of the examples you see online are unclear as to where the list of categories they use comes from. The host site at http://qwone.com/~jason/20Newsgroups/ provides you with a listing of the categories you can use. The category list doesn’t come from a magic hat somewhere, but many examples online simply don’t bother to document some information sources. Always refer to the host site when you have questions about issues such as dataset categories.

The call to fetch_20newsgroups() loads the dataset into memory. You see the resulting training object, twenty_train, described as a bunch. At this point, you have an object that contains a listing of categories and associated data, but the application hasn’t tokenized the data, and the algorithm used to work with the data isn’t trained.

Now that you have a bunch of data to use, you can begin creating a bag of words with it. The bag of words process begins by assigning an integer value (an index of a sort) to each unique word in the training set. In addition, each document receives an integer value. The next step is to count every occurrence of these words in each document and create a list of document and count pairs so that you know which words appear how often in each document.

Naturally, some words from the master list aren’t used in some documents, thereby creating a high-dimensional sparse dataset. The scipy.sparse matrix is a data structure that lets you store only the nonzero elements of the list in order to save memory. When the code makes the call to count_vect.fit_transform(), it places the resulting bag of words into X_train_counts. You can see the resulting number of entries by accessing the shape property and the counts for the word "Caltech" in the first document:

BOW shape: (2356, 34750)

"Caltech": 3

Working with n-grams

An n-gram is a contiguous sequence of items in the text you want to analyze. The items can be phonemes, syllables, letters, words, or base pairs. The n in n-gram refers to the size of the sequence. An n-gram that has a size of one, for example, is a unigram. The example in this section uses a size of three, making a trigram. You use n-grams in a probabilistic manner to perform tasks such as predicting the next item in a sequence, which wouldn’t seem very useful until you start thinking about applications such as search engines that try to predict the word you want to type based on the previous letters you’ve supplied. However, the technique has all sorts of applications, such as in DNA sequencing and data compression. The following example shows how to create n-grams from the 20 Newsgroups dataset.

from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import *

categories = ['sci.space']

twenty_train = fetch_20newsgroups(subset='train',
                                  categories=categories,
                                  remove=('headers',
                                          'footers',
                                          'quotes'),
                                  shuffle=True,
                                  random_state=42)

count_chars = CountVectorizer(analyzer='char_wb',
                              ngram_range=(3,3),
                              max_features=10)
count_chars.fit(twenty_train['data'])

count_words = CountVectorizer(analyzer='word',
                              ngram_range=(2,2),
                              max_features=10,
                              stop_words='english')
count_words.fit(twenty_train['data'])

X = count_chars.transform(twenty_train.data)

print(count_chars.get_feature_names())
print(X[1].todense())
print(count_words.get_feature_names())

The beginning code is the same as in the previous section. You still begin by fetching the dataset and placing it into a bunch. However, in this case, the vectorization process takes on new meaning. The arguments process the data in a special way.

In this case, the analyzer parameter determines how the application creates the n-grams. You can choose words (word), characters (char), or characters within word boundaries (char_wb). The ngram_range parameter requires two inputs in the form of a tuple: The first determines the minimum n-gram size and the second determines the maximum n-gram size. The third argument, max_features, determines how many features the vectorizer returns. In the second vectorizer call, the stop_words argument removes the terms contained in the built-in English stop word list (see the “Stemming and removing stop words” section, earlier in the chapter, for details). At this point, the application fits the data to the transformation algorithm.

The example provides three outputs. The first shows the top ten character trigrams from the documents. The second shows how often each of those trigrams appears in the document at index 1. The third shows the top ten word bigrams. Here’s the output from this example:

[' an', ' in', ' of', ' th', ' to', 'he ', 'ing', 'ion',
 'nd ', 'the']
[[0 0 2 5 1 4 2 2 0 5]]
['anonymous ftp', 'commercial space', 'gamma ray',
 'nasa gov', 'national space', 'remote sensing',
 'sci space', 'space shuttle', 'space station',
 'washington dc']
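To see exactly what ngram_range controls without the bulk of the newsgroup data, here’s a small sketch (using a made-up sentence) that extracts word bigrams:

from sklearn.feature_extraction.text import CountVectorizer

sentence = ['the space shuttle reached the space station']

# ngram_range=(2, 2) asks for sequences of exactly two words.
bigrams = CountVectorizer(analyzer='word', ngram_range=(2, 2))
bigrams.fit(sentence)

print(bigrams.get_feature_names())
# ['reached the', 'shuttle reached', 'space shuttle',
#  'space station', 'the space']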

Implementing TF-IDF transformations

The Term Frequency times Inverse Document Frequency (TF-IDF) transformation is a technique used to help compensate for words found relatively often in different documents, which makes it hard to distinguish between the documents because they are too common (stop words are a good example). What this transformation is really telling you is the importance of a particular word to the uniqueness of a document. The greater the frequency of a word in a document, the more important it is to that document. However, the measurement is offset by the document size — the total number of words the document contains — and by how often the word appears in other documents.

Even if a word appears many times inside a document, that doesn’t imply that the word is important for understanding the document itself; in many documents, you find stop words with the same frequency as the words that relate to the document’s general topics. For example, if you analyze documents with scifi-related discussions (such as in the 20 Newsgroups dataset), you may find that many of them deal with UFOs; therefore, the acronym UFO can’t represent a distinction between different documents. Moreover, longer documents contain more words than shorter ones, and repeated words are easily found when the text is abundant.

Remember In fact, a word found a few times in a single document (or possibly a few others) could prove quite distinctive and helpful in determining the document type. If you are working with documents discussing scifi and automobile sales, the acronym UFO can be distinctive because it easily separates the two topic types in your documents.

Search engines often need to weight words in a document in a way that helps determine when the word is important in the text. You use words with the higher weights to index the document so that when you search for those words, the search engine will retrieve that document. This is the reason that the TF-IDF transformation is used quite often in search engine applications.

Getting into more details, the TF part of the TF-IDF equation determines how frequently the term appears in the document, while the IDF part of the equation determines the term’s importance because it represents the inverse of the frequency of that word among all the documents. A large IDF implies a seldom-found word and that the TF-IDF weight will also be larger. A small IDF means that the word is common, and that will result in a small TF-IDF weight. You can see some actual calculations of this particular measure at http://www.tfidf.com/. Here’s an example of how to calculate TF-IDF using Python:

from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import *

categories = ['comp.graphics', 'misc.forsale',
              'rec.autos', 'sci.space']

twenty_train = fetch_20newsgroups(subset='train',
                                  categories=categories,
                                  shuffle=True,
                                  random_state=42)

count_vect = CountVectorizer()
X_train_counts = count_vect.fit_transform(
    twenty_train.data)

tfidf = TfidfTransformer().fit(X_train_counts)
X_train_tfidf = tfidf.transform(X_train_counts)

caltech_idx = count_vect.vocabulary_['caltech']
print('"Caltech" scored in a BOW:')
print('count: %0.3f' % X_train_counts[0, caltech_idx])
print('TF-IDF: %0.3f' % X_train_tfidf[0, caltech_idx])

This example begins much like the other examples in this section have, by fetching the 20 Newsgroups dataset. It then creates a word bag, much like the example in the “Understanding the bag of words model” section, earlier in this chapter. However, now you see something you can do with the word bag.

In this case, the code calls upon TfidfTransformer() to convert the matrix of raw word counts into a matrix of TF-IDF features. The use_idf parameter controls the use of inverse-document-frequency reweighting; it defaults to True, so reweighting is turned on in this case. The vectorized data is fitted to the transformation algorithm. The next step, calling tfidf.transform(), performs the actual transformation process. Here’s the result you get from this example:

"Caltech" scored in a BOW:

count: 3.000

TF-IDF: 0.123

Notice how the word Caltech now has a lower value in the first document compared to the example in the previous paragraph, where the counting of occurrences for the same word in the same document scored a value of 3. To understand how counting occurrences relates to TF-IDF, compute the average word count and average TF-IDF:

import numpy as np

count = np.mean(X_train_counts[X_train_counts>0])
tfif = np.mean(X_train_tfidf[X_train_tfidf>0])

print('mean count: %0.3f' % np.mean(count))
print('mean TF-IDF: %0.3f' % np.mean(tfif))

The results demonstrate that whether you look at the raw count of Caltech in the first document or at its TF-IDF score, the value is roughly double the average for a word, revealing that Caltech is a keyword for modeling the text:

mean count: 1.698

mean TF-IDF: 0.064
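To watch the reweighting happen on a corpus small enough to inspect by hand, here’s a hedged sketch using two made-up sentences. Terms that appear in both documents (such as sat and on) end up with lower weights than terms that appear in only one of them (such as cat or dog):

from sklearn.feature_extraction.text import TfidfVectorizer

docs = ['the cat sat on the mat',
        'the dog sat on the log']

# TfidfVectorizer combines CountVectorizer and TfidfTransformer in one step.
tfidf_vect = TfidfVectorizer()
weights = tfidf_vect.fit_transform(docs)

print(tfidf_vect.get_feature_names())
print(weights.toarray().round(3))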

Remember TF-IDF helps you to locate the most important word or n-grams and exclude the least important ones. It is also very helpful as an input for linear models, because they work better with TF-IDF scores than word counts. At this point, you normally train a classifier and perform various sorts of analysis. Don’t worry about this next part of the process just yet. Starting with Chapters 12 and 15, you get introduced to classifiers. In Chapter 17, you begin working with classifiers in earnest.

Working with Graph Data

Imagine data points that are connected to other data points, such as how one web page is connected to another web page through hyperlinks. Each of these data points is a node. The nodes connect to each other using links. Not every node links to every other node, so the node connections become important. By analyzing the nodes and their links, you can perform all sorts of interesting tasks in data science, such as defining the best way to get from work to your home using streets and highways. The following sections describe how graphs work and how to perform basic tasks with them.

Understanding the adjacency matrix

An adjacency matrix represents the connections between nodes of a graph. When a connection exists between one node and another, the matrix indicates it as a value greater than 0. The precise representation of connections in the matrix depends on whether the graph is directed (where the direction of the connection matters) or undirected.

A problem with many online examples is that the authors keep them simple for explanation purposes. However, real-world graphs are often immense and defy easy analysis simply through visualization. Just think about the number of nodes that even a small city would have when considering street intersections (with the links being the streets themselves). Many other graphs are far larger, and simply looking at them will never reveal any interesting patterns. Data scientists call the problem of presenting any complex graph using an adjacency matrix a hairball.

One key to analyzing adjacency matrices is to sort them in specific ways. For example, you might choose to sort the data according to properties other than the actual connections. A graph of street connections might include the date each street was last paved, enabling you to look for patterns that direct someone based on the streets that are in the best repair. In short, making the graph data useful becomes a matter of manipulating the organization of that data in specific ways.
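As a small illustration of this idea (the node ordering here is hypothetical; in practice it would come from a node property such as the paving date), the following sketch reorders the rows and columns of an adjacency matrix stored as a NumPy array:

import numpy as np

# Adjacency matrix for a small undirected graph; a 1 marks a connection.
A = np.array([[0, 1, 0, 1],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [1, 0, 1, 0]])

# Hypothetical ordering derived from some node property.
order = [2, 0, 3, 1]

# Reordering rows and columns together keeps the same graph,
# just presented with the nodes in a different sequence.
A_sorted = A[np.ix_(order, order)]
print(A_sorted)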

Using NetworkX basics

Working with graphs could become difficult if you had to write all the code from scratch. Fortunately, the NetworkX package for Python makes it easy to create, manipulate, and study the structure, dynamics, and functions of complex networks (or graphs). Even though this book covers only graphs, you can use the package to work with digraphs and multigraphs as well.

The main emphasis of NetworkX is to avoid the whole issue of hairballs. The use of simple calls hides much of the complexity of working with graphs and adjacency matrices from view. The following example shows how to create a basic adjacency matrix from one of the NetworkX-supplied graphs:

import networkx as nx

G = nx.cycle_graph(10)
A = nx.adjacency_matrix(G)

print(A.todense())

The example begins by importing the required package. It then creates a graph using the cycle_graph() template. The graph contains ten nodes. Calling adjacency_matrix() creates the adjacency matrix from the graph. The final step is to print the output as a matrix, as shown here:

[[0 1 0 0 0 0 0 0 0 1]
 [1 0 1 0 0 0 0 0 0 0]
 [0 1 0 1 0 0 0 0 0 0]
 [0 0 1 0 1 0 0 0 0 0]
 [0 0 0 1 0 1 0 0 0 0]
 [0 0 0 0 1 0 1 0 0 0]
 [0 0 0 0 0 1 0 1 0 0]
 [0 0 0 0 0 0 1 0 1 0]
 [0 0 0 0 0 0 0 1 0 1]
 [1 0 0 0 0 0 0 0 1 0]]

Tip You don’t have to build your own graph from scratch for testing purposes. The NetworkX site documents a number of standard graph types that you can use, all of which are available within IPython. The list appears at https://networkx.github.io/documentation/latest/reference/generators.html.

It’s interesting to see how the graph looks after you generate it. The following code displays the graph for you. Figure 8-1 shows the result of the plot.

import matplotlib.pyplot as plt
%matplotlib inline

nx.draw_networkx(G)
plt.show()


FIGURE 8-1: Plotting the original graph.

Looking at the plot, you can see that nodes 1 and 5 aren’t directly connected, so the example adds an edge between them. Here’s the code needed to perform this task using the add_edge() function. Figure 8-2 shows the result.

G.add_edge(1,5)

nx.draw_networkx(G)
plt.show()


FIGURE 8-2: Plotting the graph addition.
