One of the main challenges in text mining is transforming unstructured written natural language into structured attribute-based instances. The process involves several steps:
First, we extract some text from the internet, existing documents, or databases. At the end of this step, the text may still be in XML or some other proprietary format. The next step extracts the actual text and segments it into document parts, for example, title, headline, abstract, and body. The third step normalizes the text encoding to ensure the characters are represented in the same way; for example, documents encoded in formats such as ASCII, ISO 8859-1, and Windows-1250 are converted to Unicode. Next, tokenization splits the document into individual words, while the following step removes frequent stop words that usually have low predictive power, for example, the, a, I, and we.
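The tokenization and stop-word removal steps can be sketched with the standard library alone. The stop-word list and the regular expression below are simplified assumptions for illustration; production systems use much larger lists and more careful tokenizers:

```python
import re

# A small, hypothetical stop-word list; real systems use far larger ones.
STOP_WORDS = {"the", "a", "i", "we", "an", "and"}

def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def remove_stop_words(tokens):
    """Drop frequent words that usually have low predictive power."""
    return [t for t in tokens if t not in STOP_WORDS]

tokens = tokenize("We extracted the text and split it into words.")
print(remove_stop_words(tokens))
# ['extracted', 'text', 'split', 'it', 'into', 'words']
```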
A Part-Of-Speech (POS) tagging and lemmatization step can be included to transform each token to its basic form, known as a lemma, by removing word endings and modifiers. For example, running becomes run, and better becomes good. A simplified alternative is stemming, which operates on a single word without any context of how that word is used, and therefore cannot distinguish between words whose meaning depends on the part of speech; for example, axes is the plural of both axe and axis.
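A toy suffix-stripping stemmer makes the context-free nature of stemming concrete. The suffix rules here are invented for illustration; real systems use established algorithms such as the Porter stemmer:

```python
# Hypothetical suffix rules, checked in order; not a real stemming algorithm.
SUFFIX_RULES = [("ies", "y"), ("ing", ""), ("es", "e"), ("s", "")]

def stem(word):
    """Strip the first matching suffix; no part-of-speech context is used."""
    for suffix, replacement in SUFFIX_RULES:
        if word.endswith(suffix) and len(word) > len(suffix) + 1:
            word = word[: -len(suffix)] + replacement
            break
    # Collapse a trailing double consonant left by stripping (runn -> run).
    if len(word) > 2 and word[-1] == word[-2] and word[-1] not in "aeiou":
        word = word[:-1]
    return word

print(stem("running"))  # run
print(stem("axes"))     # axe -- whether the text meant axe or axis
```

Because the stemmer sees only the word itself, "axes" always maps to the same stem, regardless of which meaning was intended.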
The last step transforms tokens into a feature space. Most often, the feature space is a Bag-Of-Words (BoW) representation. In this representation, a set of all words appearing in the dataset is created. Each document is then represented as a vector that counts how many times a particular word appears in the document.
Consider the following example with two sentences:
- Jacob likes table tennis. Emma likes table tennis too.
- Jacob also likes basketball
The BoW in this case consists of {Jacob, likes, table, tennis, Emma, too, also, basketball}, which has eight distinct words. The two sentences can now be represented as vectors using the indexes of the list, indicating how many times a word at a particular index appears in the document, as follows:
- [1, 2, 2, 2, 1, 1, 0, 0]
- [1, 1, 0, 0, 0, 0, 1, 1]
Such vectors finally become instances for further learning.
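The BoW construction above can be sketched with the standard library. The tokenizer here is a deliberate simplification that only strips periods, so that the vocabulary matches the example:

```python
from collections import Counter

documents = [
    "Jacob likes table tennis. Emma likes table tennis too",
    "Jacob also likes basketball",
]

def tokenize(text):
    # Strip periods and split on whitespace; a simplification for this example.
    return text.replace(".", "").split()

# Build the vocabulary in order of first appearance.
vocabulary = []
for doc in documents:
    for token in tokenize(doc):
        if token not in vocabulary:
            vocabulary.append(token)

print(vocabulary)
# ['Jacob', 'likes', 'table', 'tennis', 'Emma', 'too', 'also', 'basketball']

# Encode each document as a word-count vector over the vocabulary.
vectors = [
    [Counter(tokenize(doc))[word] for word in vocabulary]
    for doc in documents
]

print(vectors)
# [[1, 2, 2, 2, 1, 1, 0, 0], [1, 1, 0, 0, 0, 0, 1, 1]]
```

In practice, libraries such as scikit-learn's CountVectorizer perform this encoding, but the underlying idea is exactly this word-to-index counting.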