One of the main challenges in text mining is transforming unstructured written natural language into structured attribute-based instances. The process involves several steps:
First, we extract some text from the internet, existing documents, or databases. At the end of this step, the text may still be in XML or some other proprietary format. The next step extracts the actual text and segments it into document parts, for example, title, headline, abstract, and body. The third step normalizes the text encoding to ensure the characters are represented in the same way; for example, documents encoded in formats such as ASCII, ISO 8859-1, and Windows-1250 are converted to Unicode. Next, tokenization splits the document into individual words, while the following step removes frequent stop words that usually have low predictive power, for example, the, a, I, and we.
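The tokenization and stop-word removal steps can be sketched with the standard library alone. The stop-word list and the regular expression below are simplified assumptions for illustration; production systems use much larger lists and more careful tokenizers:

```python
import re

# A small, hypothetical stop-word list; real systems use far larger ones.
STOP_WORDS = {"the", "a", "i", "we", "an", "and"}

def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def remove_stop_words(tokens):
    """Drop frequent words that usually have low predictive power."""
    return [t for t in tokens if t not in STOP_WORDS]

tokens = tokenize("We extracted the text and split it into words.")
print(remove_stop_words(tokens))
# ['extracted', 'text', 'split', 'it', 'into', 'words']
```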
A Part-Of-Speech (POS) tagging and lemmatization step can be included to transform each token to its basic form, known as a lemma, by removing word endings and modifiers. For example, running becomes run, and better becomes good. A simplified alternative is stemming, which operates on a single word without any context of how that word is used, and therefore cannot distinguish between words whose meaning depends on the part of speech; for example, axes is the plural of both axe and axis.
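A toy suffix-stripping stemmer makes the context-free nature of stemming concrete. The suffix rules here are invented for illustration; real systems use established algorithms such as the Porter stemmer:

```python
# Hypothetical suffix rules, checked in order; not a real stemming algorithm.
SUFFIX_RULES = [("ies", "y"), ("ing", ""), ("es", "e"), ("s", "")]

def stem(word):
    """Strip the first matching suffix; no part-of-speech context is used."""
    for suffix, replacement in SUFFIX_RULES:
        if word.endswith(suffix) and len(word) > len(suffix) + 1:
            word = word[: -len(suffix)] + replacement
            break
    # Collapse a trailing double consonant left by stripping (runn -> run).
    if len(word) > 2 and word[-1] == word[-2] and word[-1] not in "aeiou":
        word = word[:-1]
    return word

print(stem("running"))  # run
print(stem("axes"))     # axe -- whether the text meant axe or axis
```

Because the stemmer sees only the word itself, "axes" always maps to the same stem, regardless of which meaning was intended.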
The last step transforms tokens into a feature space. Most often, the feature space is a Bag-Of-Words (BoW) representation. In this representation, a set of all words appearing in the dataset is created. Each document is then represented as a vector that counts how many times a particular word appears in the document.
Consider the following example with two sentences:
- Jacob likes table tennis. Emma likes table tennis too.
- Jacob also likes basketball
The BoW in this case consists of {Jacob, likes, table, tennis, Emma, too, also, basketball}, which has eight distinct words. The two sentences can now be represented as vectors using the indexes of the list, indicating how many times a word at a particular index appears in the document, as follows:
- [1, 2, 2, 2, 1, 1, 0, 0]
- [1, 1, 0, 0, 0, 0, 1, 1]
Such vectors finally become instances for further learning.
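The BoW construction above can be sketched with the standard library. The tokenizer here is a deliberate simplification that only strips periods, so that the vocabulary matches the example:

```python
from collections import Counter

documents = [
    "Jacob likes table tennis. Emma likes table tennis too",
    "Jacob also likes basketball",
]

def tokenize(text):
    # Strip periods and split on whitespace; a simplification for this example.
    return text.replace(".", "").split()

# Build the vocabulary in order of first appearance.
vocabulary = []
for doc in documents:
    for token in tokenize(doc):
        if token not in vocabulary:
            vocabulary.append(token)

print(vocabulary)
# ['Jacob', 'likes', 'table', 'tennis', 'Emma', 'too', 'also', 'basketball']

# Encode each document as a word-count vector over the vocabulary.
vectors = [
    [Counter(tokenize(doc))[word] for word in vocabulary]
    for doc in documents
]

print(vectors)
# [[1, 2, 2, 2, 1, 1, 0, 0], [1, 1, 0, 0, 0, 0, 1, 1]]
```

In practice, libraries such as scikit-learn's CountVectorizer perform this encoding, but the underlying idea is exactly this word-to-index counting.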