Topics in text mining

As we saw in the first section, the area of text mining and performing Machine Learning on text spans a wide range of topics. Each topic discussed has some customizations to the mainstream algorithms, or there are specific algorithms that have been developed to perform the task called for in that area. We have chosen four broad topics, namely, text categorization, topic modeling, text clustering, and named entity recognition, and will discuss each in some detail.

Text categorization/classification

The text classification problem manifests itself in different applications, such as document filtering and organization, information retrieval, opinion and sentiment mining, e-mail spam filtering, and so on. Similar to the classification problem discussed in Chapter 2, Practical Approach to Real-World Supervised Learning, the general idea is to train on the training data with labels and to predict the labels of unseen documents.

As discussed in the previous section, the preprocessing steps help to transform the unstructured document collection into well-known numeric or categorical/binary structured data arranged in terms of a document-term matrix. The choice of performing some preprocessing steps, such as stemming or customizing stop words, depends on the data and applications. The feature choice is generally basic lexical features, n-grams of words as terms, and only in certain cases do we use the entire text as a string without breaking it into terms or tokens. It is common to use binary feature representation or frequency-based representation for document-term structured data. Once this transformation is complete, we do feature selection using univariate analysis, such as information gain or chi-square, to choose discriminating features above certain thresholds of scores. One may also perform feature transformation and dimensionality reduction such as PCA or LSA in many applications.

There is a wide range in the choice of classifiers once we get structured data from the preceding process. In the research as well as in commercial applications, we see the use of most of the common modeling techniques, including linear (linear regression, logistic regression, and so on), non-linear (SVM, neural networks, KNN), generative (naïve Bayes, Bayesian networks), interpretable (decision trees, rules), and ensemble-based (bagging, boosting, random forest) classifiers. Many algorithms use similarity or distance metrics, of which cosine distance is the most popular choice. In certain classifiers, such as SVM, the string representation of the document can be used as is, with the right choice of string kernels and similarity-based metrics on strings to compute the dot products.

Validation and evaluation methods are similar to supervised classification methodologies—splitting the data into train/validation/test, training on training data, tuning parameters of algorithm(s) on validation data, and estimating the performance of the models on hold-out or test data.

Since most text classification involves a large number of documents and the target classes are often rare, the metrics used for evaluation, tuning, or choosing algorithms are in most cases precision, recall, and the F-score, as follows:

Precision = TP / (TP + FP)

Recall = TP / (TP + FN)

F-score = (2 × Precision × Recall) / (Precision + Recall)

Here, TP, FP, and FN are the number of true positives, false positives, and false negatives, respectively, for the class of interest.
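To make the pipeline concrete, here is a minimal scikit-learn sketch covering the steps above: TF-IDF features, chi-square feature selection, a linear classifier, and precision/recall/F-score evaluation. The toy corpus, the number of selected features, and the choice of logistic regression are illustrative assumptions, not prescriptions from the text.

```python
# A minimal text classification sketch: TF-IDF features, chi-square feature
# selection, a linear classifier, and precision/recall/F-score evaluation.
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

docs = [
    "the quarterly earnings report beat analyst estimates",
    "board meeting moved to friday afternoon",
    "please review the attached project proposal",
    "schedule a call to discuss the budget",
    "win a free prize claim it now",
    "claim your free reward today limited offer",
    "congratulations you won a cash lottery",
    "exclusive deal click here to win money",
]
labels = ["ham", "ham", "ham", "ham", "spam", "spam", "spam", "spam"]

X_train, X_test, y_train, y_test = train_test_split(
    docs, labels, test_size=0.5, random_state=42, stratify=labels)

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer(stop_words="english", ngram_range=(1, 2))),
    ("select", SelectKBest(chi2, k=10)),          # keep the top-scoring terms
    ("clf", LogisticRegression(max_iter=1000)),   # any linear/non-linear model fits here
])

pipeline.fit(X_train, y_train)
print(classification_report(y_test, pipeline.predict(X_test)))
```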

Topic modeling

A topic is a distribution over a fixed vocabulary. Topic modeling can be defined as an ability to capture different core ideas or themes in various documents. This has a wide range of applications, such as the summarization of documents, understanding reasons for sentiments, trends, the news, and many others. The following figure shows how topic modeling can discern a user-specified number k of topics from a corpus and then, for every document, assign proportions representing how much of each topic is found in the document:


Figure 13: Probabilistic topic weights assignment for documents

There are quite a few techniques for performing topic modeling using supervised and unsupervised learning in the literature (References [13]). We will discuss the most common technique, known as probabilistic latent semantic analysis (PLSA).

Probabilistic latent semantic analysis (PLSA)

The idea of PLSA, as in the LSA for feature reduction, is to find latent concepts hidden in the corpus by discovering the association between co-occurring terms and treating the documents as mixtures of these concepts. This is an unsupervised technique, similar to dimensionality reduction, but the idea is to use it to model the mixture of topics or latent concepts in the document (References [12]).

As shown in the following figure, the model may associate terms occurring together often in the corpus with a latent concept, and each document can then be said to exhibit that topic to a smaller or larger extent:


Figure 14: Latent concept of Baseball capturing the association between documents and related terms

Input and output

The inputs are:

  • A collection of documents following a certain format and structure, given by the notation D = {d1, d2, … dn}.
  • The number of topics that need to be modeled or discovered as k.

The output is:

  • k topics identified T = {T1, T2, … Tk}.
  • For each document di, the coverage of the topics in that document, written as {pi1, pi2, … pik}, where pij is the probability of the document di covering the topic Tj.

How does it work?

Implementations of PLSA generally follow the steps described here:

  1. Basic preprocessing steps as discussed previously, such as tokenization, stop words removal, stemming, dictionary of words formation, feature extraction (unigrams or n-grams, and so on), and feature selection (unsupervised techniques) are carried out, if necessary.
  2. The problem can be reduced to estimating the distribution of terms in a document, and, given the distribution, choosing the topic based on the maximum terms corresponding to the topic.
  3. Introducing a "latent variable" z helps us to select whether the term belongs to a topic. Note that z is not "observed", but we assume that it is related to picking the term from the topic. Thus, the probability of the term t given the document d can be expressed in terms of this latent variable as:

    p(t|d) = Σz p(t|z) p(z|d)
  4. By using two sets of parameters (θ, π), the equation can be written as:

    p(t|d; θ, π) = Σz p(t|z; θ) p(z|d; π)

    Here, p(t|z; θ) is the probability of the term given the latent concept, and p(z|d; π) is the probability of the latent concept in the document-specific mixture.

  5. Using the log-likelihood to estimate the parameters to maximize:

    L(θ, π) = Σd Σt n(d, t) log p(t|d; θ, π)

    Here, n(d, t) is the number of times the term t occurs in the document d.
  6. Since this equation involves nonconvex optimization, the EM algorithm is often used to find the parameters iteratively until convergence is reached or the total number of iterations are completed (References [6]):
    1. The E-step of the EM algorithm determines the posterior probability of the latent concepts. The probability that the occurrence of the term t in the document d is explained by the latent concept z is:

      p(z|d, t) = p(t|z; θ) p(z|d; π) / Σz' p(t|z'; θ) p(z'|d; π)

    2. The M-step of the EM algorithm uses the values p(z|d, t) obtained from the E-step and does the parameter estimation as:

      p(t|z; θ) ∝ Σd n(d, t) p(z|d, t), that is, how often the term t is associated with the concept z

      p(z|d; π) ∝ Σt n(d, t) p(z|d, t), that is, how often the document d is associated with the concept z
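A minimal sketch of this EM loop in Python/NumPy is shown below, assuming the document-term count matrix n(d, t) has already been produced by the preprocessing steps; the toy counts, the number of topics, and the number of iterations are illustrative assumptions.

```python
# A minimal PLSA EM sketch on a toy document-term count matrix.
# Rows are documents, columns are terms; k is the number of latent concepts.
import numpy as np

rng = np.random.default_rng(0)
n_dt = np.array([[3, 2, 0, 0],      # toy counts: 3 documents x 4 terms
                 [2, 3, 1, 0],
                 [0, 0, 4, 3]], dtype=float)
n_docs, n_terms = n_dt.shape
k = 2

# Randomly initialized parameters p(t|z) and p(z|d), each row-normalized.
p_t_given_z = rng.random((k, n_terms))
p_t_given_z /= p_t_given_z.sum(axis=1, keepdims=True)
p_z_given_d = rng.random((n_docs, k))
p_z_given_d /= p_z_given_d.sum(axis=1, keepdims=True)

for _ in range(50):
    # E-step: posterior p(z|d,t) for every (d, t) pair, shape (n_docs, n_terms, k).
    joint = p_z_given_d[:, None, :] * p_t_given_z.T[None, :, :]
    p_z_given_dt = joint / joint.sum(axis=2, keepdims=True)

    # M-step: re-estimate p(t|z) and p(z|d) from the expected counts n(d,t)*p(z|d,t).
    expected = n_dt[:, :, None] * p_z_given_dt
    p_t_given_z = expected.sum(axis=0).T                  # shape (k, n_terms)
    p_t_given_z /= p_t_given_z.sum(axis=1, keepdims=True)
    p_z_given_d = expected.sum(axis=1)                    # shape (n_docs, k)
    p_z_given_d /= p_z_given_d.sum(axis=1, keepdims=True)

print(np.round(p_z_given_d, 2))   # topic coverage per document
```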

Advantages and limitations

The advantages and limitations are as follows:

  • Though widely used, PLSA has some drawbacks that have been overcome by more recent techniques.
  • The unsupervised nature of the algorithm and its general applicability allows it to be used in a wide variety of similar text mining applications, such as clustering documents, associating topics related to authors/time, and so on.
  • The EM algorithm used in PLSA, as discussed in previous chapters, faces the problem of getting "stuck in local optima", unlike global optimization techniques such as evolutionary algorithms.
  • PLSA can only identify topics in the documents it was trained on and cannot do predictive modeling on unseen documents. Latent Dirichlet allocation (LDA) generalizes PLSA to overcome this (References [14]).

Text clustering

The goal of clustering, as seen in Chapter 3, Unsupervised Machine Learning Techniques, is to find groups of data, text, or documents that are similar to one another within the group. The granularity of unstructured data can vary from small phrases or sentences, paragraphs, and passages of text to a collection of documents. Text clustering finds its application in many domains, such as information retrieval, summarization, topic modeling, and document classification in unsupervised situations, to name a few. Traditional techniques in clustering can be employed once the unstructured text data is transformed into structured data via preprocessing. The difficulty with traditional clustering techniques is the high-dimensional and sparse nature of the dataset obtained using the transformed document-term matrix representation. Many traditional clustering algorithms work only on numeric values of features. Because of this constraint, categorical or binary representation of terms cannot be used and often TF or TF-IDF are used for representation of the document-term matrix.

In this section, we will discuss some of the basic processes and techniques in clustering. We will start with pre-processing and transformations and then discuss some techniques that are widely used and the modifications made to them.

Feature transformation, selection, and reduction

Most of the pre-processing steps discussed in this section are normally used to get either unigram or n-gram representation of terms in documents. Dimensionality reduction techniques, such as LSA, are often employed to transform the features into smaller latent space.

Clustering techniques

The techniques for clustering in text include probabilistic models, as well as those that use distance-based methods, which are familiar to us from when we learned about structured data. We will also discuss Non-negative Matrix Factorization (NNMF) as an effective technique with good performance and interpretability.

Generative probabilistic models

There is commonality between topic modeling and text clustering in generative methods. As shown in the following figure, clustering associates a document with a single cluster (generally), compared to topic modeling where each document can have a probability of coverage in multiple topics. Every word in topic modeling can be generated by multiple topics in an independent manner, whereas in clustering all the words are generated from the same cluster:


Figure 15: Exclusive mapping of documents to K-Clusters

Mathematically, this can be explained using two topics T = {T1, T2} and two clusters c = {c1, c2}.

In clustering, the likelihood of the document can be given as:

p(d) = Σc p(c) p(d|c) = p(c1) p(d|c1) + p(c2) p(d|c2)

If the document has, say, L terms, this can be further expanded as:

p(d) = Σc p(c) Πi=1..L p(ti|c)

Thus, once you "assume" a cluster, all the words come from that cluster. The product of all terms is computed, followed by the summation across all clusters.

In topic modeling, the likelihood of the document can be given as:

p(d) = Πi=1..L ΣT p(T) p(ti|T)

Thus, each term ti can be picked independently from topics and hence summation is done inside and the product is done outside.
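A tiny numerical sketch of this difference is shown below; the component priors and term probabilities are made-up illustrative values for a three-term document.

```python
# Contrast of the two likelihoods for a toy 3-term document d = (t1, t2, t3).
# Clustering: sum over clusters OUTSIDE the product over terms.
# Topic modeling: sum over topics INSIDE, once per term, product outside.
import numpy as np

p_component = np.array([0.6, 0.4])             # p(c) or p(T) for two components
p_term_given = np.array([[0.5, 0.3, 0.2],      # p(t_i | component 1)
                         [0.1, 0.2, 0.7]])     # p(t_i | component 2)

# Clustering (mixture of unigrams): p(d) = sum_c p(c) * prod_i p(t_i|c)
p_d_clustering = np.sum(p_component * np.prod(p_term_given, axis=1))

# Topic modeling: p(d) = prod_i sum_T p(T) * p(t_i|T)
p_d_topic_model = np.prod(np.sum(p_component[:, None] * p_term_given, axis=0))

print(p_d_clustering, p_d_topic_model)
```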

Input and output

The inputs are:

  • A collection of documents following a certain format and structure expressed with the following notation:

    D = {d1, d2, … dn}

  • The number of clusters that need to be modeled or discovered as k.

The output is:

  • k clusters identified c = {c1, c2, … ck}.
  • Each document di is mapped to one of the k clusters.

How does it work?

Here are the steps:

  1. Basic preprocessing steps as discussed previously, such as tokenization, stop words removal, stemming, dictionary of words formation, feature extraction (unigrams or n-grams, and so on) of terms, feature transformations (LSA), and even feature selection, are carried out, if necessary. Let t denote the terms in the final feature set; they correspond to the dictionary or vocabulary V.
  2. Similar to PLSA, we introduce a "latent variable" z ∈ {1, 2, … k} that indicates which of the k clusters a document belongs to. Let θ be the parameter we estimate for each latent variable, such that p(θi) corresponds to the probability of the cluster z = i.
  3. The probability of a document belonging to a cluster is given by p(θi), and every term in the document is generated from that cluster with probability p(t|θi). The likelihood equation can be written as:

    L(θ) = Πd Σi=1..k p(θi) Πt∈V p(t|θi)^n(d, t)

    Note that instead of iterating over the positions in each document, the inner product is written over the terms t in the vocabulary V, with each factor raised to n(d, t), the number of times that term appears in the document.

  4. Perform the EM algorithm in a similar way to the method we used previously to estimate the parameters as follows:
    1. The E-step of the EM algorithm is used to infer the cluster from which the document was generated:

      p(z = i|d) = p(θi) Πt∈V p(t|θi)^n(d, t) / Σj=1..k p(θj) Πt∈V p(t|θj)^n(d, t)

    2. The M-step of the EM algorithm is used to re-estimate the parameters using the result of the E-step, as shown here:

      p(θi) = Σd p(z = i|d) / n

      p(t|θi) = Σd n(d, t) p(z = i|d) / Σt'∈V Σd n(d, t') p(z = i|d)

  5. The final cluster assignment for each document can be made either by maximum likelihood or by a Bayesian (maximum a posteriori) rule that uses the cluster priors, as shown here:

    cluster(d) = argmaxi Πt∈V p(t|θi)^n(d, t) or cluster(d) = argmaxi p(θi) Πt∈V p(t|θi)^n(d, t)
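Below is a compact Python/NumPy sketch of this document-level EM, again on an assumed toy document-term count matrix; note that, unlike PLSA, the responsibilities here are computed per document rather than per document-term pair, and the computation is done in log space for numerical stability.

```python
# Mixture-of-unigrams clustering via EM on a toy document-term count matrix.
import numpy as np

rng = np.random.default_rng(1)
n_dt = np.array([[4, 1, 0, 0],
                 [3, 2, 0, 1],
                 [0, 0, 3, 4],
                 [0, 1, 2, 5]], dtype=float)   # documents x terms
n_docs, n_terms = n_dt.shape
k = 2

p_cluster = np.full(k, 1.0 / k)                         # p(theta_i)
p_t_given_c = rng.random((k, n_terms))
p_t_given_c /= p_t_given_c.sum(axis=1, keepdims=True)   # p(t|theta_i)

for _ in range(50):
    # E-step: log p(z=i|d) up to a constant, computed in log space.
    log_resp = np.log(p_cluster) + n_dt @ np.log(p_t_given_c).T   # (n_docs, k)
    log_resp -= log_resp.max(axis=1, keepdims=True)
    resp = np.exp(log_resp)
    resp /= resp.sum(axis=1, keepdims=True)

    # M-step: re-estimate cluster priors and per-cluster term distributions.
    p_cluster = resp.sum(axis=0) / n_docs
    p_t_given_c = resp.T @ n_dt                          # expected term counts per cluster
    p_t_given_c /= p_t_given_c.sum(axis=1, keepdims=True)

print(resp.argmax(axis=1))   # hard cluster assignment per document
```
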
Advantages and limitations
  • Generative-based models have similar advantages to LSA and PLSA, where we get a probabilistic score for documents in clusters. By applying domain knowledge or priors using cluster size, the assignments can be further fine-tuned.
  • The disadvantages of the EM algorithm having to do with getting stuck in local optima and being sensitive to the starting point are still true here.

Distance-based text clustering

Most distance-based clustering algorithms rely on the similarity or the distance measure used to determine how far apart instances are from each other in feature space. Normally in datasets with numeric values, Euclidean distance or its variations work very well. In text mining, even after converting unstructured text to structured features of terms with numeric values, it has been found that the cosine and Jaccard similarity functions perform better.

Often, Agglomerative or Hierarchical clustering, discussed in Chapter 3, Unsupervised Machine Learning Techniques, are used, which can merge documents based on similarity, as discussed previously. Merging the documents or groups is often done using single linkage, group average linkage, and complete linkage techniques. Agglomerative clustering also results in a structure that can be used for information retrieval and the searching of documents.

The partition-based clustering techniques k-means and k-medoids, accompanied by a suitable similarity or distance measure, are also employed. The issue with k-means, as indicated in the discussion on clustering techniques, is its sensitivity to starting conditions, along with computation space and time. k-medoids is sensitive to sparse data structures and also has computation space and time constraints.
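A brief scikit-learn sketch of distance-based text clustering is shown below. Since TfidfVectorizer L2-normalizes document vectors by default, Euclidean k-means on them approximates cosine-based (spherical) k-means; the agglomerative variant uses average linkage with a cosine metric. The toy corpus and parameter values are illustrative assumptions, and in older scikit-learn versions the metric argument of AgglomerativeClustering is named affinity.

```python
# Distance-based clustering of documents: k-means on TF-IDF vectors and
# agglomerative clustering with a cosine metric and average linkage.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans, AgglomerativeClustering

docs = [
    "the pitcher threw a fastball in the ninth inning",
    "the batter hit a home run over the fence",
    "the court ruled on the appeal yesterday",
    "the judge issued a written opinion on the case",
]

X = TfidfVectorizer(stop_words="english").fit_transform(docs)

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print("k-means labels:", km.labels_)

# Agglomerative clustering requires a dense array when using a cosine metric.
agg = AgglomerativeClustering(n_clusters=2, metric="cosine", linkage="average")
print("agglomerative labels:", agg.fit_predict(X.toarray()))
```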

Non-negative matrix factorization (NMF)

Non-negative matrix factorization is another technique used to factorize a large data-feature matrix into two non-negative matrices, which not only perform the dimensionality reduction, but are also easier to inspect. NMF has gained popularity for document clustering, and many variants of NMF with different optimization functions have now been shown to be very effective in clustering text (References [15]).

Input and output

The inputs are:

  • A collection of documents following a certain format and structure given by the notation:

    D = {d1, d2, … dn}

  • Number of clusters that need to be modeled or discovered as k.

The output is:

  • k clusters identified c = {c1, c2, … ck} with documents assigned to the clusters.
How does it work?

The mathematical details and interpretation of NMF are given in the following:

  1. The basic idea behind NMF is to factorize the input matrix using low-rank approximation, as follows:
    Am×n ≈ Wm×k Hk×n, where all the elements of W and H are non-negative
  2. A non-linear optimization function is used as:
    minW≥0, H≥0 ||A – WH||²F, where ||·||F is the Frobenius norm

    This is convex in W or in H, but not in both, so there is no guarantee of a global minimum. Various iterative algorithms, such as multiplicative updates on the mean-squared error, alternating constrained least squares, and gradient descent, are used to solve the optimization function.

  3. The interpretation of NMF, especially in understanding the latent topics through their terms, makes it very useful. The input Am×n of terms and documents can be represented by the low-rank approximation Wm×k Hk×n, where Wm×k is the term-topic representation whose columns are the NMF basis vectors. The non-zero elements of column i of W, given by Wi, correspond to particular terms, so the entry wji can be interpreted as the weight of the term j in the basis (topic) vector Wi. Hi1 can be interpreted as how much document 1 has affinity towards the direction of the topic vector Wi (see the sketch following this list).
  4. The paper (References [18]) clearly showed that the basis vectors obtained for the medical abstracts in the Medlars dataset are highly interpretable. The highest-weighted terms in these basis vectors directly correspond to a concept; for example, W1 corresponds to the topic related to "heart" and W5 is related to "developmental disability".

    Figure 16: From Langville et al (2006) showing some basis vectors for medical datasets for interpretability.
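Here is a minimal scikit-learn sketch of NMF-based document clustering. Note that scikit-learn factorizes the document-term matrix, that is, the transpose of the A described above, so W holds document-topic weights and H holds the topic-term basis vectors; the toy corpus and parameters are illustrative assumptions.

```python
# NMF on a TF-IDF document-term matrix: X (docs x terms) ~= W (docs x k) H (k x terms).
# Cluster assignment: the topic with the largest weight for each document.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import NMF

docs = [
    "heart attack risk and blood pressure study",
    "blood pressure medication reduces heart disease",
    "children with developmental disability and language delay",
    "early intervention for developmental delay in children",
]

tfidf = TfidfVectorizer(stop_words="english")
X = tfidf.fit_transform(docs)

nmf = NMF(n_components=2, init="nndsvd", random_state=0, max_iter=500)
W = nmf.fit_transform(X)      # document-topic weights
H = nmf.components_           # topic-term weights (the basis vectors)

terms = np.array(tfidf.get_feature_names_out())
for i, topic in enumerate(H):
    print(f"topic {i}:", terms[topic.argsort()[::-1][:4]])   # top terms per basis vector
print("cluster assignments:", W.argmax(axis=1))
```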

Advantages and limitations

NMF has been shown to be nearly equal in performance to top algorithms, such as LSI, for information retrieval and queries:

  • Scalability, computation, and storage is better in NMF than in LSA or LSI using SVD.
  • NMF has a problem with optimization not being global and getting stuck in local minima.
  • NMF generation of factors depends on the algorithms for optimization and the parameters chosen, and is not unique.

Evaluation of text clustering

In the case of labeled datasets, all the external measures discussed in Chapter 3, Unsupervised Machine Learning Techniques, such as F-measure and Rand Index are useful in evaluating the clustering techniques. When the dataset doesn't have labels, some of the techniques described as the internal measures, such as the Davies–Bouldin Index, R-Squared, and Silhouette Index, can be used.

As a general good practice, the document similarity measures discussed in this section should be used for measuring the closeness, remoteness, and spread of clusters when these measures are applied to text mining data. How similarity is used also depends on the algorithm and its relevance to the problem. In distance-based partition algorithms, the similarity of a document can be computed with the mean vector or centroid. In hierarchical algorithms, similarity can be computed with the most similar or dissimilar documents in the group.
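For instance, with scikit-learn an external measure (here the adjusted variant of the Rand Index) and an internal measure (the Silhouette Index with a cosine metric, consistent with document similarity) can be computed as in the following sketch; the label arrays and the feature matrix are illustrative placeholders.

```python
# External evaluation (needs ground-truth labels) and internal evaluation
# (needs only the feature matrix) for a text clustering result.
import numpy as np
from sklearn.metrics import adjusted_rand_score, silhouette_score

true_labels = np.array([0, 0, 1, 1, 1, 0])     # gold classes, if available
pred_clusters = np.array([1, 1, 0, 0, 1, 1])   # cluster ids from any algorithm

# External measure: agreement between clusters and gold labels.
print("Adjusted Rand Index:", adjusted_rand_score(true_labels, pred_clusters))

# Internal measure: cohesion/separation computed on the feature matrix only.
# X stands in for the TF-IDF document-term matrix from the clustering step.
X = np.random.default_rng(0).random((6, 8))
print("Silhouette (cosine):", silhouette_score(X, pred_clusters, metric="cosine"))
```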

Named entity recognition

Named entity recognition (NER) is one of the most important topics in information retrieval for text mining. Many complex mining tasks, such as the identification of relations, annotations of events, and correlation between entities, use NER as the initial component or basic preprocessing step.

Historically, manual rules-based and regular expression-based techniques were used for entity recognition. These manual rules relied on basic preprocessing, using POS tags as features, along with hand-engineered features, such as the presence of capitalized words, the use of punctuation before words, and so on.

Statistical learning-based techniques are now used more for NER and its variants. NER can be mapped to sequence labeling and prediction problems in Machine Learning. In BIO notation, each entity type T has two labels, B-T and I-T, corresponding to the beginning and the intermediate (inside) part of the entity, respectively, while O represents a token outside any entity in the sequence of text; learning involves finding these labeling patterns and predicting them on unseen data. In its most basic form, the entity type T is classified into Person, Organization, Date, and Location.
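For instance, a training sentence labeled in BIO notation might look like the following sketch (the sentence, the labels, and the tag set are illustrative):

```python
# A sentence labeled in BIO notation as (token, label) pairs.
# B-T starts an entity of type T, I-T continues it, O marks tokens outside any entity.
labeled_sentence = [
    ("John", "B-PER"), ("Smith", "I-PER"),
    ("visited", "O"),
    ("New", "B-LOC"), ("York", "I-LOC"),
    ("in", "O"),
    ("January", "B-DATE"), (".", "O"),
]
```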

In this section, we will discuss the two most common algorithms used: generative-based hidden Markov models and discriminative-based maximum entropy models.

Though we are discussing these algorithms in the context of Named Entity Recognition, the same algorithms and processes can be used for other NLP tasks such as POS Tagging, where tags are associated with a sequence rather than associating the NER classes.

Hidden Markov models for NER

Hidden Markov models, as explained in Chapter 6, Probabilistic Graph Modeling, are the sequence-based generative models that assume an underlying distribution that generates the sequences. The training data obtained by labeling sequences with the right NER classes can be used to learn the distribution and parameters, so that for unseen future sequences, effective predictions can be performed.

Input and output

The training data consists of text sequences x ={x1, x2, ... xn} where each xi is a word in the text sequence and labels for each word are available as y ={y1, y2, ... yn}. The algorithm generates a model so that on testing on unseen data, the labels for new sequences can be generated.

How does it work?

  1. In the simplest form, a Markov assumption is made, which is that the hidden states and labels of the sequences are only dependent on the previous state. An adaptation to the sequence of words with labels is shown in the following figure:

    Figure 17: Text sequence and labels corresponding to NER in Hidden Markov Chain

  2. The HMM formulation of the sequence classification helps in estimating the joint probability maximized on training data:
    p(x, y) = Πi=1..n p(yi|yi–1) p(xi|yi)
  3. Each yi is assumed to be generated based on yi–1 and xi. The first word of an entity is generated conditioned on the current and previous labels, that is, yi and yi–1. If the word is already inside a named entity, then the conditioning is only on the previous word, that is, xi–1. Outside words such as "visited" and "in" are considered "not a name class".
  4. The HMM formulation with the forward-backward algorithm can be used to determine the likelihood of a sequence of observations with parameters learned from the training data.
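A compact sketch of such a supervised HMM tagger is shown below: the transition and emission probabilities are estimated from labeled sequences with add-one smoothing, and the Viterbi algorithm recovers the most likely label sequence. The tiny training set, the smoothing scheme, and the START symbol are illustrative assumptions rather than the exact formulation referenced in the text.

```python
# A tiny supervised HMM for sequence labeling: count-based estimation of
# p(y_i | y_{i-1}) and p(x_i | y_i) with add-one smoothing, plus Viterbi decoding.
import math
from collections import defaultdict

train = [
    [("john", "B-PER"), ("smith", "I-PER"), ("visited", "O"), ("paris", "B-LOC")],
    [("mary", "B-PER"), ("lives", "O"), ("in", "O"), ("london", "B-LOC")],
]

START = "<s>"
trans, emit = defaultdict(float), defaultdict(float)
label_count, vocab, labels = defaultdict(float), set(), set()

for sent in train:
    prev = START
    label_count[START] += 1
    for word, label in sent:
        trans[(prev, label)] += 1
        emit[(label, word)] += 1
        label_count[label] += 1
        vocab.add(word)
        labels.add(label)
        prev = label

def log_trans(prev, label):          # add-one smoothed log p(label | prev)
    return math.log((trans[(prev, label)] + 1) / (label_count[prev] + len(labels)))

def log_emit(label, word):           # add-one smoothed log p(word | label)
    return math.log((emit[(label, word)] + 1) / (label_count[label] + len(vocab) + 1))

def viterbi(words):
    states = sorted(labels)
    best = {s: (log_trans(START, s) + log_emit(s, words[0]), [s]) for s in states}
    for word in words[1:]:
        new_best = {}
        for s in states:
            score, path = max(
                (best[p][0] + log_trans(p, s) + log_emit(s, word), best[p][1])
                for p in states)
            new_best[s] = (score, path + [s])
        best = new_best
    return max(best.values())[1]

print(viterbi(["john", "visited", "london"]))
```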

Advantages and limitations

The advantages and limitations are as follows:

  • HMMs work well for short entity spans, as shown, with single words or terms and the independence assumption. For entities that span longer sequences, these assumptions are violated and the results suffer.
  • The HMM needs a large set of training data to estimate the parameters.

Maximum entropy Markov models for NER

Maximum entropy Markov model (MEMM) is a popular NER technique that uses the concept of Markov chains and maximum entropy models to learn and predict the named entities (References [16] and [17]).

Input and output

The training data consists of text sequences x ={x1, x2, ... xn} where each xi is a word in the text sequence and labels for each word are available as y ={y1, y2, ... yn}. The algorithm generates models so that, on testing on unseen data, the labels for new sequences can be generated.

How does it work?

The following illustrates how the MEMM method is used for learning named entities.

  1. The features in MEMM can be word features or other types of features, such as "isWordCapitalized", and so on, which gives it a bit more context and improves performance compared to HMM, where it is only word-based.
  2. Next, let us look at a maximum entropy model known as a MaxEnt model, which is an exponential probabilistic model, but which can also be seen as a multinomial logistic regression model. In basic MaxEnt models, given the features {f1, f2 … fN} and classes c1, c2 … cC, weights for these features are learned {wc1, wc2 … wcN} per class using optimization methods from the training data, and the probability of a particular class can be estimated as:
    p(c|x) = exp(Σi=1..N wci fi) / Σc'∈C exp(Σi=1..N wc'i fi)
  3. The feature fi is formally written as fi(c, x), which means the feature fi for class c and observation x. The fi(c, x) is generally binary with values 1/0 in most NER models. Thus, it can be written as:
    p(c|x) = exp(Σi=1..N wci fi(c, x)) / Σc'∈C exp(Σi=1..N wc'i fi(c', x))

    Maximum likelihood based on the probability of prediction across class can be used to select a single class:

    ĉ = argmaxc∈C p(c|x)
  4. For every word, we use the current word, the features from "nearby" words, and the predictions on the nearby words to create a joint probability model. This is also called local learning, as chunks of text and their distributions are learned from the local features around each word.

    Mathematically, we see how a discriminative model is created from current word and last prediction as:

    p(Y|X) = Πi=1..n p(yi|yi–1, xi)

    p(yi|yi–1, xi) = exp(w f(xi, yi, yi–1)) / Σy' exp(w f(xi, y', yi–1))

    Generalizing for the k features:

    p(yi|yi–1, xi) = exp(Σj=1..k wj fj(xi, yi, yi–1)) / Σy' exp(Σj=1..k wj fj(xi, y', yi–1))
  5. Thus, in MEMM we compute the probability of the state, which is the class in NER, and even though we condition on prediction of nearby words given by yi–1, in general we can use more features, and that is the advantage over the HMM model discussed previously:
    p(yi|yi–1, x1 … xn) ∝ exp(Σj=1..k wj fj(x1 … xn, i, yi, yi–1))

    Figure 18 : Text Sequences and observation probabilities with labels

  6. The Viterbi algorithm is used to estimate the class for each word, that is, to perform decoding/inference as in HMMs and to get the estimates for p(yi|yi–1, xi).
  7. Finally, the MaxEnt model is used to estimate the weights, as before, using optimization methods over the state transitions in general:

    ŵ = argmaxw Σi log p(yi|yi–1, xi; w), summed over the labeled training sequences
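The following is a minimal sketch of the MEMM idea using scikit-learn: a MaxEnt (multinomial logistic regression) model over per-token features that include the previous label, with greedy left-to-right decoding standing in for the Viterbi search. The tiny training set and the feature templates are illustrative assumptions.

```python
# MEMM-style NER sketch: a MaxEnt (multinomial logistic regression) model over
# per-token features, including the previous predicted label; greedy decoding.
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression

train = [
    [("John", "B-PER"), ("Smith", "I-PER"), ("visited", "O"), ("Paris", "B-LOC")],
    [("Mary", "B-PER"), ("works", "O"), ("in", "O"), ("London", "B-LOC")],
]

def features(word, prev_label):
    return {"word": word.lower(), "isCapitalized": word[0].isupper(),
            "suffix2": word[-2:].lower(), "prev_label": prev_label}

X_dicts, y = [], []
for sent in train:
    prev = "<s>"
    for word, label in sent:
        X_dicts.append(features(word, prev))
        y.append(label)
        prev = label

vec = DictVectorizer()
clf = LogisticRegression(max_iter=1000)
clf.fit(vec.fit_transform(X_dicts), y)

def greedy_decode(words):
    prev, pred = "<s>", []
    for word in words:
        label = clf.predict(vec.transform([features(word, prev)]))[0]
        pred.append(label)
        prev = label
    return pred

print(greedy_decode(["Anna", "visited", "Paris"]))
```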

Advantages and limitations

  • MEMMs have more flexibility in using features that are not just word-based, including human-engineered features, which gives the models more richness and predictive power.
  • MEMMs can use context beyond the immediately adjacent words, giving them an advantage over HMMs in detecting entities with larger spans.

Deep learning and NLP

In the last few years, Deep Learning and its application to various areas of NLP has shown huge success and is considered the cutting edge of technology these days. The main advantage of using Deep Learning lies in a small subset of tools and methods, which are useful in a wide variety of NLP problems. It solves the basic issue of feature engineering and carefully created manual representations by automatically learning them, and thus solves the issue of having a large number of language experts dealing with a wide range of problems, such as text classification, sentiment analysis, POS tagging, and machine translation, to name a few. In this section, we will try to cover important concepts and research in the area of Deep Learning and NLP.

In his seminal paper, Bengio introduced one of the most important building blocks for deep learning in NLP, known as the word embedding or word vector (References [20]). A word embedding can be defined as a parameterized function that maps words to a high-dimensional vector (usually 25 to 500 dimensions, based on the application).

Formally, this can be written as W: words → ℝn.

For example, W("cat") = (0.2, –0.4, 0.7, …) and W("mat") = (0.0, 0.6, –0.1, …), and so on.

A neural network (R) whose inputs are the words from sentences or n-grams of sentences with binary classification, such as whether the sequence of words in n-grams are valid or not, is used to train and learn the W and R:


Figure 19: A modular neural network learning 5-gram words for valid–invalid classification

For example:

  • R(W(cat), W(sat), W(on), W(the), W(mat)) = 1 (valid)
  • R(W(cat), W(sat), W(song), W(the), W(mat)) = 0 (invalid)

The idea of training on these sentences or n-grams is not only to learn the correct structure of the phrases, but also the right parameters for W and R. The word embeddings can also be projected onto a lower-dimensional space, such as a 2D space, using the various linear and non-linear feature reduction/visualization techniques introduced in Chapter 3, Unsupervised Machine Learning Techniques, which humans can easily visualize. Visualizing word embeddings in two dimensions using techniques such as t-SNE reveals important information about the closeness of words based on semantic meaning, and even the clustering of related words, as shown in the following figure:


Figure 20: t-SNE representation of a small section of the entire word mappings. Numbers in Roman numerals and words are shown clustered together on the left, while semantically close words are clustered together on the right.

Further extending the concepts, both Collobert and Mikolov showed that the side effects of learning the word embeddings can be very useful in a variety of NLP tasks, such as similar phrase learning (for example, W("the color is red") ≈ W("the color is yellow")), finding synonyms (for example, W("nailed") ≈ W("smashed")), analogy mapping (for example, W("man") → W("woman") implies W("king") → W("queen")), and even complex relationship mapping (for example, W("Paris") → W("France") implies W("Tokyo") → W("Japan")) (References [21] and [22]).
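A small sketch of exploring such relationships with gensim's Word2Vec is shown below. The toy corpus is far too small to produce meaningful embeddings and is purely illustrative; the parameter names assume gensim 4.x, and in practice a large corpus or pretrained vectors would be used.

```python
# Training word embeddings on a (toy) tokenized corpus and querying
# nearest neighbours and analogies in the learned vector space.
from gensim.models import Word2Vec

sentences = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "sat", "on", "the", "rug"],
    ["the", "king", "spoke", "to", "the", "queen"],
    ["the", "man", "spoke", "to", "the", "woman"],
]

model = Word2Vec(sentences, vector_size=50, window=3, min_count=1, epochs=200, seed=1)

# Words with nearby embeddings (cosine similarity in the embedding space).
print(model.wv.most_similar("cat", topn=3))

# Analogy-style query: king - man + woman ~ queen (only meaningful with real corpora).
print(model.wv.most_similar(positive=["king", "woman"], negative=["man"], topn=1))
```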

The extension of word embedding concepts towards a generalized representation that helps us reuse the representation across various NLP problems (with minor extensions) has been the main reason for many recent successes of Deep Learning in NLP. Socher, in his research, extended the word embeddings concept to produce bilingual word embeddings, that is, embedding words from two different languages, such as Chinese (Mandarin) and English, into a shared space (References [23]). By learning the word embeddings of the two languages independently of each other and then projecting them into the same space, his work gives us interesting insights into word similarities across languages that can be extended to Machine Translation. Socher also did interesting work on projecting images learned from CNNs and word embeddings into the same space, for associating words with images as a basic classification problem (References [24]). Google, around the same time, was also working on similar concepts, but at a larger scale, for word-image matching and learning (References [26]).

Extending the word embedding concept to have combiners or association modules that can help combine the words, words-phrases, phrases-phrases in all combinations to learn complex sentences has been the idea of Recursive Neural Networks. The following figure shows how complex association ((the cat)(sat(on (the mat)))) can be learned using Recursive Neural Networks. It also removes the constraint of a "fixed" number of inputs in neural networks because of the ability to recursively combine:


Figure 21: Recursive Neural Network showing how complex phrases can be learned.

Recursive Neural Networks have been showing great promise in NLP tasks, such as sentiment analysis, where association of one negative word at the start of many positive words has an overall negative impact on sentences, as shown in the following figure:


Figure 22: A complex sentence showing words with negative (as red circle), positive (green circle), and neutral (empty with 0) connected through RNN with overall negative sentiment.

The concept of recursive neural networks is now extended using the building blocks of encoders and decoders to learn reversible sentence representation—that is, reconstructing the original sentence with roughly the same meaning from the input sentence (References [27]). This has become the central core theme behind Neural Machine Translation. Modeling conversations using the encoder-decoder framework of RNNs has also made huge breakthroughs (References [28]).
