14. Feature Engineering for Domains: Domain-Specific Learning

In [1]:

# setup
from mlwpy import *
%matplotlib inline

import cv2

In a perfect world, our standard classification algorithms would be provided with data that is relevant, plentiful, tabular (formatted in a table), and naturally discriminative when we look at it as points in space.

In reality, we may be stuck with data that is

1. Only an approximation of our target task
2. Limited in quantity or covers only some of many possibilities
3. Misaligned with the prediction method(s) we are trying to apply to it
4. Written as text or images which are decidedly not in an example-feature table

Issues 1 and 2 are specific to a learning problem you are focused on. We discussed issue 3 in Chapters 10 and 13; we can address it by manually or automatically engineering our feature space. Now, we’ll approach issue 4: what happens when we have data that isn’t in a nice tabular form? If you were following closely, we did talk about this a bit in the previous chapter. Through kernel methods, we can make direct use of objects as they are presented to us (perhaps as a string). However, we are then restricted to kernel-compatible learning methods and techniques; what’s worse, kernels come with their own limitations and complexities.

So, how can we convert awkward data into data that plays nicely with good old-fashioned learning algorithms? Though the terms are somewhat hazy, I’ll use the phrase feature extraction to capture the idea of converting an arbitrary data source—something like a book, a song, an image, or a movie—to tabular data we can use with the non-kernelized learning methods we’ve seen in this book. In this chapter, we will deal with two specific cases: text and images. Why? Because they are plentiful and have well-developed methods of feature extraction. Some folks might mention that processing them is also highly lucrative. But, we’re above such concerns, right?

14.1 Working with Text

When we apply machine learning methods to text documents, we encounter some interesting challenges. In contrast to a fixed set of measurements grouped in a table, documents are (1) variable-length, (2) order-dependent, and (3) unaligned. Two different documents can obviously differ in length. While learning algorithms don’t care about the ordering of features—as long as they are in the same order for each example—documents most decidedly do care about order: “the cat in the hat” is very different from “the hat in the cat” (ouch). Likewise, we expect the order of information for two examples to be presented in the same way for each example: feature one, feature two, and feature three. Two different sentences can communicate the same information in many different arrangements of—possibly different—words: “Sue went to see Chris” and “Chris had a visitor named Sue.”

We might try to force text into a table anyway, using two very simple documents as our examples: “the cat in the hat” and “the quick brown fox jumps over the lazy dog”. After removing the super common, low-meaning stop words (like “the”), we end up with

	sentence
example 1	cat in hat
example 2	quick brown fox jumps over lazy dog

Or,

	word 1	word 2	word 3	word 4
example 1	cat	in	hat	*
example 2	quick	brown	fox	jumps

which we would have to extend out to the longest of all our examples. Neither of these feels right. The first really makes no attempt to tease apart any relationships in the words. The second option seems to go both too far and not far enough: everything is broken into pieces, but word 1 in example 1 may have no relationship to word 1 in example 2.

There’s a fundamental disconnect between representing examples as written text and representing them as rows in a table. I see a hand raised in the classroom. “Yes, you in the back?” “Well, how can we represent text in a table?” I’m glad you asked.

Here’s one method:

1. Gather all of the words in all of the documents and make that list of words the features of interest.
2. For each document, create a row for the learning example. Row entries indicate if a word occurs in that example. Here, we use – to indicate no and to let the yeses stand out.

	in	over	quick	brown	lazy	cat	hat	fox	dog	jumps
example 1	yes	–	–	–	–	yes	yes	–	–	–
example 2	–	yes	yes	yes	yes	–	–	yes	yes	yes
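The two steps above can be sketched in a few lines of plain Python. This is only a minimal illustration, not the method we will use in practice: the one-word stop list is an assumption chosen so the result lines up with the table, and the columns come out in alphabetical order rather than the order shown above.

```python
# Two toy documents from the text. The stop-word list is an assumption
# chosen so that the result matches the yes/no table.
docs = ["the cat in the hat",
        "the quick brown fox jumps over the lazy dog"]
stop_words = {"the"}

# Step 1: gather every non-stop word across all of the documents.
tokenized = [[w for w in doc.split() if w not in stop_words] for doc in docs]
vocabulary = sorted(set(w for words in tokenized for w in words))

# Step 2: one row per document, one yes/no entry per vocabulary word.
bag_of_words = [[word in words for word in vocabulary] for words in tokenized]

print(vocabulary)
for row in bag_of_words:
    print(["yes" if present else "-" for present in row])
```

Each row has exactly one entry per vocabulary word, so every document ends up the same width no matter how long it was.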

This technique is called bag of words. To encode a document, we take all the words in it, write them down on slips of paper, and throw them in a bag. The benefit here is ease and quickness. The difficulty is that we completely lose the sense of ordering! For example, “Barb went to the store and Mark went to the garage” would be represented in the exact same way as “Mark went to the store and Barb went to the garage”. With that caveat in mind, in life—and in machine learning—we often use simplified versions of complicated things either because (1) they serve as a starting point or (2) they turn out to work well enough. In this case, both are valid reasons why we often use bag-of-words representations for text learning.

We can extend this idea from working with single words—called unigrams—to pairs of adjacent words, called bigrams. Pairs of words give us a bit more context. In the Barb and Mark example, if we had trigrams—three-word phrases—we would capture the distinction of Mark-store and Barb-garage (after the stop words are removed). The same idea extends to n-grams. Of course, as you can imagine, adding longer and longer phrases takes more and more time and memory to process. We will stick with single-word unigrams for our examples here.
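To see the Barb and Mark distinction concretely, here is a small sketch that compares the two sentences as unigrams and as trigrams. The helper `ngrams` and the stop-word list are hypothetical names introduced for this illustration only.

```python
from collections import Counter

def ngrams(text, n, stop_words=frozenset()):
    """Return the n-grams (as tuples of words) left after dropping stop words."""
    words = [w for w in text.lower().split() if w not in stop_words]
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]

stop_words = {"to", "the", "and"}  # assumed stop-word list for this example
doc_1 = "Barb went to the store and Mark went to the garage"
doc_2 = "Mark went to the store and Barb went to the garage"

# A Counter is a bag: it keeps how often each item occurs, but not its position.
same_unigrams = (Counter(ngrams(doc_1, 1, stop_words)) ==
                 Counter(ngrams(doc_2, 1, stop_words)))
same_trigrams = (Counter(ngrams(doc_1, 3, stop_words)) ==
                 Counter(ngrams(doc_2, 3, stop_words)))

print(same_unigrams)                    # True: the unigram bags are identical
print(same_trigrams)                    # False: the trigram bags differ
print(ngrams(doc_1, 3, stop_words)[0])  # ('barb', 'went', 'store')
```

The cost shows up quickly: each document of n words yields about n trigrams, but the number of distinct trigrams across a corpus, and therefore the number of columns, grows much faster than the number of distinct single words.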

If we use a bag-of-words (BOW) representation, we have several different options for recording the presence of a word in a document. Above, we simply recorded yes values. We could have equivalently used zeros and ones or trues and falses. The large number of dashes points out an important practical issue. When we use BOW, the data become very sparse. Clever storage, behind the scenes, can compress that table by only recording the interesting entries and avoiding all of the blank no entries.
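As a rough picture of what "only recording the interesting entries" means, here is a sketch that stores the yes/no table from above as a dictionary keyed by (row, column). This dictionary-of-keys layout is just one simple choice made for illustration. The names `dense`, `sparse`, and `lookup` are made up for this example, and real libraries such as SciPy use more carefully engineered sparse formats.

```python
# Dense form of the earlier yes/no table (1 = yes, 0 = no).
dense = [
    [1, 0, 0, 0, 0, 1, 1, 0, 0, 0],  # example 1
    [0, 1, 1, 1, 1, 0, 0, 1, 1, 1],  # example 2
]

# Sparse form: remember only the positions that hold a nonzero value.
sparse = {(r, c): value
          for r, row in enumerate(dense)
          for c, value in enumerate(row)
          if value != 0}

def lookup(sparse_table, row, col):
    """Any entry we did not store is an implicit zero."""
    return sparse_table.get((row, col), 0)

n_cells = len(dense) * len(dense[0])
print(len(sparse), "stored entries instead of", n_cells)
print(lookup(sparse, 0, 0), lookup(sparse, 0, 1))
```

With only two short documents the savings are modest. With thousands of documents and a vocabulary of tens of thousands of words, almost every cell is a no, and storing only the yeses is what makes the table fit in memory.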

If we move beyond a simple yes/no recording scheme, our first idea might be to record counts of occurrences of the words. Beyond that, we might care about normalizing those counts based on some other factors. All of these are brilliant ideas and you are commended for thinking of them. Better yet, they are well known and implemented in sklearn, so let’s make these ideas concrete.

14.1.1 Encoding Text

Here are a few sample documents that we can use to investigate different ways of encoding text:

In [2]:

14.1.1.1 Binary Bags of Words

	cat	cow	hat	jumped	meowed	mooed	moon	not	over	said	you
0	True	False	True	False	False	False	False	False	False	False	False
1	False	True	False	True	False	False	True	False	True	False	False
2	True	True	False	False	True	True	False	False	False	False	False
3	True	True	False	False	False	False	False	True	False	True	True

14.1.1.2 Bag-of-Word Counts

	cat	cow	hat	jumped	meowed	mooed	moon	not	over	said	you
0	1	0	1	0	0	0	0	0	0	0	0
1	0	1	0	1	0	0	1	0	1	0	0
2	1	1	0	0	1	1	0	0	0	0	0
3	2	2	0	0	0	0	0	1	0	1	1

14.1.1.3 Normalized Bag-of-Word Counts: TF-IDF

	cat	cow	hat	jumped	meowed	mooed	moon	not	over	said	you
0	0.29	0.00	1.39	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00
1	0.00	0.29	0.00	1.39	0.00	0.00	1.39	0.00	1.39	0.00	0.00
2	0.29	0.29	0.00	0.00	1.39	1.39	0.00	0.00	0.00	0.00	0.00
3	0.58	0.58	0.00	0.00	0.00	0.00	0.00	1.39	0.00	1.39	1.39
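The three tables above (yes/no, counts, and weighted counts) can be reproduced with a short sketch. Two things here are assumptions: the sample documents and the stop-word list are stand-ins chosen only because they are consistent with the tables, and the weighting uses raw counts multiplied by ln(N / document frequency), which is the formula that matches the numbers shown. sklearn's own `TfidfVectorizer` applies smoothing and length normalization by default, so it would produce different values.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical documents and stop words, chosen to agree with the tables.
docs = ["the cat in the hat",
        "the cow jumped over the moon",
        "the cat mooed and the cow meowed",
        "the cat said to the cow cow you are not a cat"]
stop_words = ["the", "in", "and", "to", "are"]
# The default token pattern already ignores one-letter tokens such as "a".

# Binary bag of words: does the word occur at all?
binary_cv = CountVectorizer(binary=True, stop_words=stop_words)
binary_bow = binary_cv.fit_transform(docs).toarray().astype(bool)

# Bag-of-word counts: how many times does the word occur?
count_cv = CountVectorizer(stop_words=stop_words)
counts = count_cv.fit_transform(docs).toarray()

# Column labels, ordered by column index.
words = sorted(count_cv.vocabulary_, key=count_cv.vocabulary_.get)

# Weighted counts: down-weight words that show up in many documents.
doc_freq = (counts > 0).sum(axis=0)     # number of documents containing each word
idf = np.log(len(docs) / doc_freq)      # natural log of N / document frequency
tfidf = counts * idf

print(words)
print(binary_bow)
print(counts)
print(np.round(tfidf, 2))
```

Notice that `fit_transform` hands back a sparse matrix. We call `.toarray()` only because these toy tables are small enough to print in full.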
