Working with text data

One of the main challenges in text mining is transforming unstructured written natural language into structured, attribute-based instances. The process typically involves the following steps:


First, we extract some text from the Internet, existing documents, or databases. At this point, the text may still be in XML or some other proprietary format. The next step is, therefore, to extract only the actual text and segment it into parts of the document, for example, title, headline, abstract, body, and so on. The third step normalizes the text encoding to ensure the characters are represented in the same way; for example, documents encoded in ASCII, ISO 8859-1, or Windows-1250 are transformed into Unicode. Next, tokenization splits the document into individual words, while the following step removes frequent words that usually have low predictive power, for example, the, a, I, we, and so on.

A part-of-speech (POS) tagging and lemmatization step can be included to transform each token (that is, word) into its basic form, known as a lemma, by removing word endings and modifiers. For example, running becomes run, better becomes good, and so on. A simplified approach is stemming, which operates on a single word without any context of how that word is used and, therefore, cannot distinguish between words that have different meanings depending on the part of speech; for example, axes is the plural of both axe and axis.

The last step transforms tokens into a feature space. Most often, the feature space is a bag-of-words (BoW) representation. In this representation, a set of all words appearing in the dataset is created, that is, a bag of words. Each document is then represented as a vector that counts how many times a particular word appears in the document.

Consider the following example with two sentences:

  • Jacob likes table tennis. Emma likes table tennis too.
  • Jacob also likes basketball.

The bag of words in this case consists of {Jacob, likes, table, tennis, Emma, too, also, basketball}, which has eight distinct words. The two sentences can now be represented as vectors using the indexes of this list, where each element indicates how many times the word at that index appears in the document, as follows:

  • [1, 2, 2, 2, 1, 1, 0, 0]
  • [1, 1, 0, 0, 0, 0, 1, 1]

Such vectors finally become instances for further learning.
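As a simple illustration of this counting step, the following standalone Java sketch builds the vocabulary and the two count vectors for the sentences above (the class name and the tokenizing regular expression are illustrative choices, independent of Mallet):

import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class BagOfWordsExample {
  public static void main(String[] args) {
    String[] documents = {
        "Jacob likes table tennis. Emma likes table tennis too.",
        "Jacob also likes basketball."
    };

    // Build the vocabulary: every distinct word, in order of first appearance
    Map<String, Integer> vocabulary = new LinkedHashMap<>();
    List<String[]> tokenized = new ArrayList<>();
    for (String document : documents) {
      String[] tokens = document.split("[^\\p{L}\\p{N}_]+");
      tokenized.add(tokens);
      for (String token : tokens) {
        vocabulary.putIfAbsent(token, vocabulary.size());
      }
    }
    System.out.println("Bag of words: " + vocabulary.keySet());

    // Represent each document as a vector of word counts over the vocabulary
    for (String[] tokens : tokenized) {
      int[] vector = new int[vocabulary.size()];
      for (String token : tokens) {
        vector[vocabulary.get(token)]++;
      }
      System.out.println(Arrays.toString(vector));
    }
  }
}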

Note

Another very powerful representation based on the BoW model is word2vec. Word2vec was introduced in 2013 by a team of researchers led by Tomas Mikolov at Google. Word2vec is a neural network that learns distributed representations of words. An interesting property of this representation is that words appear in clusters, such that some word relationships, such as analogies, can be reproduced using vector arithmetic. A famous example shows that king - man + woman yields a vector closest to queen.

Further details and implementation are available at the following link:

https://code.google.com/archive/p/word2vec/
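To illustrate the vector arithmetic behind such analogies, here is a toy Java sketch with tiny, hand-picked two-dimensional vectors (these are made-up values for illustration only, not trained word2vec embeddings):

import java.util.LinkedHashMap;
import java.util.Map;

public class AnalogyToyExample {
  public static void main(String[] args) {
    // Hand-picked toy vectors: dimension 0 ~ royalty, dimension 1 ~ maleness
    Map<String, double[]> vectors = new LinkedHashMap<>();
    vectors.put("king",  new double[]{1.0, 1.0});
    vectors.put("queen", new double[]{1.0, 0.0});
    vectors.put("man",   new double[]{0.0, 1.0});
    vectors.put("woman", new double[]{0.0, 0.0});

    // Compute king - man + woman
    double[] k = vectors.get("king"), m = vectors.get("man"), w = vectors.get("woman");
    double[] result = {k[0] - m[0] + w[0], k[1] - m[1] + w[1]};

    // Find the word whose vector is closest to the result
    String closest = null;
    double bestDistance = Double.MAX_VALUE;
    for (Map.Entry<String, double[]> entry : vectors.entrySet()) {
      double[] v = entry.getValue();
      double distance = Math.hypot(result[0] - v[0], result[1] - v[1]);
      if (distance < bestDistance) {
        bestDistance = distance;
        closest = entry.getKey();
      }
    }
    System.out.println("king - man + woman ~ " + closest); // prints: queen
  }
}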

Importing data

In this chapter, we will not look into how to scrape a set of documents from a website or extract them from a database. Instead, we will assume that we have already collected them as a set of documents and stored them in the .txt file format. Now, let's look at two options for loading them. The first option addresses the situation where each document is stored in its own .txt file. The second option addresses the situation where all the documents are stored in a single file, one per line.

Importing from directory

Mallet supports reading from a directory with the cc.mallet.pipe.iterator.FileIterator class. The file iterator is constructed with the following three parameters:

  • A list of File[] directories with text files
  • A file filter that specifies which files to select within a directory
  • A pattern that is applied to a filename to produce a class label

Consider data structured into folders by topic. We have documents organized into five topics (tech, entertainment, politics, sport, and business), where each folder contains documents on a particular topic.


In this case, we initialize the iterator as follows:

FileIterator iterator =
  new FileIterator(new File[]{new File("path-to-my-dataset")},
  new TxtFilter(),
  FileIterator.LAST_DIRECTORY);

The first parameter specifies the path to our root folder, the second parameter limits the iterator to .txt files only, while the last parameter tells the iterator to use the last directory name in the path as the class label.
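The TxtFilter class is not shown in the snippet above; a minimal sketch, assuming it is a standard java.io.FileFilter that simply accepts files with the .txt extension, could look like this:

import java.io.File;
import java.io.FileFilter;

public class TxtFilter implements FileFilter {
  // Accept only regular files ending in .txt
  public boolean accept(File file) {
    return file.isFile() && file.getName().toLowerCase().endsWith(".txt");
  }
}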

Importing from file

Another option for loading the documents is through cc.mallet.pipe.iterator.CsvIterator(Reader, Pattern, int, int, int), which assumes all the documents are in a single file and returns one instance per line, extracted by a regular expression. The class is initialized with the following components:

  • Reader: This is the object that specifies how to read from a file
  • Pattern: This is a regular expression, extracting three groups: data, target label, and document name
  • int, int, int: These are the indexes of data, target, and name groups as they appear in a regular expression

Consider a text document in the following format, specifying the document name, category, and content:

AP881218 local-news A 16-year-old student at a private Baptist... 
AP880224 business The Bechtel Group Inc. offered in 1985 to... 
AP881017 local-news A gunman took a 74-year-old woman hostage... 
AP900117 entertainment Cupid has a new message for lovers this... 
AP880405 politics The Reagan administration is weighing w... 

To parse a line into three groups, we can use the following regular expression:

^(\S*)[\s,]*(\S*)[\s,]*(.*)$

There are three groups that appear in parentheses, (), where the third group contains the data, the second group contains the target class, and the first group contains the document ID. The iterator is initialized as follows:

CsvIterator iterator = new CsvIterator(
  fileReader,
  Pattern.compile("^(\\S*)[\\s,]*(\\S*)[\\s,]*(.*)$"),
  3, 2, 1);

Here, the regular expression extracts three groups separated by whitespace; the last three arguments specify that the data, target, and name groups appear at positions 3, 2, and 1, respectively.
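Note that the fileReader object has to be constructed beforehand; a minimal sketch, assuming the documents are stored in a single UTF-8 encoded file (the path below is just a placeholder), is the following:

import java.io.FileInputStream;
import java.io.InputStreamReader;
import java.io.Reader;

// Open the single input file as a UTF-8 character stream (placeholder path)
Reader fileReader = new InputStreamReader(
    new FileInputStream("path-to-my-dataset/documents.txt"), "UTF-8");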

Now, let's move on to the data pre-processing pipeline.

Pre-processing text data

Once we have initialized an iterator that will go through the data, we need to pass the data through a sequence of transformations, as described at the beginning of this section. Mallet supports this process through pipelines; the wide variety of steps that can be included in a pipeline is collected in the cc.mallet.pipe package. Some examples are as follows:

  • Input2CharSequence: This is a pipe that can read from various kinds of text sources (either URI, File, or Reader) into CharSequence
  • CharSequenceRemoveHTML: This pipe removes HTML from CharSequence
  • MakeAmpersandXMLFriendly: This converts & to &amp; in tokens of a token sequence
  • TokenSequenceLowercase: This converts the text in each token in the token sequence in the data field to lower case
  • TokenSequence2FeatureSequence: This converts the token sequence in the data field of each instance to a feature sequence
  • TokenSequenceNGrams: This converts the token sequence in the data field to a token sequence of n-grams, that is, combinations of two or more words

Note

The full list of processing steps is available in the following Mallet documentation:

http://mallet.cs.umass.edu/api/index.html?cc/mallet/pipe/iterator/package-tree.html

Now we are ready to build a class that will import our data.

First, let's build a pipeline, where each processing step is denoted as a pipe in Mallet. Pipes can be wired together serially and collected in an ArrayList<Pipe> object:

ArrayList<Pipe> pipeList = new ArrayList<Pipe>();

Begin by reading data from a file object and converting all the characters into lower case:

pipeList.add(new Input2CharSequence("UTF-8"));
pipeList.add(new CharSequenceLowercase());

Next, tokenize raw strings with a regular expression. The following pattern includes Unicode letters, numbers, and the underscore character:

Pattern tokenPattern =
  Pattern.compile("[\\p{L}\\p{N}_]+");

pipeList.add(new CharSequence2TokenSequence(tokenPattern));

Remove stop words, that is, frequent words with no predictive power, using a standard English stop list. Two additional parameters indicate whether stop-word removal should be case-sensitive and whether to mark deletions instead of simply deleting the words. We'll set both of them to false:

pipeList.add(new TokenSequenceRemoveStopwords(false, false));

Instead of storing the actual words, we can convert them into integers indicating a word's index in the bag of words:

pipeList.add(new TokenSequence2FeatureSequence());

We'll do the same for the class label; instead of a label string, we'll use an integer indicating the position of the label in the set of class labels:

pipeList.add(new Target2Label());

We could also print the features and the labels by invoking the PrintInputAndTarget pipe:

pipeList.add(new PrintInputAndTarget());

Finally, we store the list of pipes in a SerialPipes class that will convert an instance through the sequence of pipes:

SerialPipes pipeline = new SerialPipes(pipeList);
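With the pipeline in place, we can connect it to one of the iterators initialized earlier. A minimal sketch using Mallet's InstanceList (assuming iterator refers to the FileIterator or CsvIterator created in the previous section) is the following:

import cc.mallet.types.InstanceList;

// Create an instance list backed by our pipeline and pass all raw documents through it
InstanceList instances = new InstanceList(pipeline);
instances.addThruPipe(iterator);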

Now, let's take a look at how to apply this in a text mining application!
