Chapter 2. Text Classification

Organizing is what you do before you do something, so that when you do it, it is not all mixed up.

- A.A. Milne

Text classification is the task of assigning one or more categories to a given piece of text from a larger set of possible categories. It has a wide range of applications across diverse domains such as social media, e-commerce, healthcare, law, and marketing, to name a few. A common example of text classification in our daily lives is classifying emails as spam or non-spam. Even though the purpose and application of text classification may vary from domain to domain, the underlying abstract problem remains the same. Due to this invariance of the core problem and its applications in a myriad of domains, text classification has become a widely studied topic among both researchers and industry practitioners. In this chapter, we will discuss the usefulness of text classification and how to build text classifiers for your use cases, along with some practical tips for real-world scenarios.

In machine learning, classification is the problem of categorizing a data instance into one or more known classes. The data instance can originally be in different formats, such as text, speech, image, or numeric data. Text classification is a special instance of the classification problem, where the input is text and the goal is to categorize the piece of text into one or more buckets (called classes) from a set of predefined buckets (classes). A "text" can be of arbitrary length: a character, a word, a sentence, a paragraph, or a full document. Consider a scenario where we want to classify all customer reviews of a product into three categories: positive, negative, and neutral. The challenge of text classification is to "learn" this categorization from a collection of examples for each of these categories and predict the categories for new products and new customer reviews. Thus, it is a supervised learning problem.

A taxonomy of text classification

Any supervised classification approach, including text classification, can be further divided into three types based on the number of categories involved: binary, multiclass, and multilabel classification. If the number of classes is two, it is called binary classification. If the number of classes is more than two, it is referred to as multiclass classification. Classifying an email as spam or not spam is an example of a binary classification setting. Classifying the sentiment of a customer review as negative, neutral, or positive is an example of multiclass classification. In both the binary and multiclass settings, each document belongs to exactly one of the C possible classes. In multilabel classification, a document can have one or more labels/classes attached to it. For example, a news article on a soccer match may simultaneously belong to more than one category, such as "sports" and "soccer," while another news article, say on US elections, may carry the labels "politics," "USA," and "elections." Thus, each document's labels are a subset of C: an article can be in no class, exactly one class, or multiple classes. There may be a very large number of labels in C, a setting known as "extreme classification." In this chapter we will focus only on binary and multiclass classification, as those are the most common use cases of text classification in the industry.
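To make the multilabel case concrete, the labels are often represented as a binary indicator matrix with one column per class. The following is a minimal sketch using scikit-learn; the class names and documents are illustrative, not from any particular dataset:

          from sklearn.preprocessing import MultiLabelBinarizer

          #Each document carries a subset of the label set C (possibly empty)
          docs_labels = [
              {"sports", "soccer"},               #article on a soccer match
              {"politics", "USA", "elections"},   #article on US elections
              set(),                              #article with no applicable label
          ]
          mlb = MultiLabelBinarizer()
          y = mlb.fit_transform(docs_labels)
          print(mlb.classes_)  #the label set C
          print(y)             #one row per document, one 0/1 column per class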

Text classification is sometimes also called topic classification, text categorization, or document categorization. We will use the term text classification for the rest of this book. Note that this is different from topic detection, which refers to the problem of uncovering or extracting "topics" from texts, which we will study in Chapter 10. In this chapter we will take a closer look at text classification and build text classifiers using different approaches. Our aim is to give you an overview of some of the most commonly applied techniques, along with practical advice on the different decisions one has to make while training text classification systems in practice. To achieve this, we start with an overview of text classification and its applications. After introducing an NLP pipeline for text classification, we illustrate how to use this pipeline to train and test text classifiers following different approaches, ranging from traditional methods to the state of the art. We then discuss the problem of training data collection and sparsity and different methods to handle it. We end the chapter by summarizing what we learned in these sections, along with some practical advice. Note that we will deal only with training and evaluating text classifiers in this chapter. Issues related to deploying NLP systems in general and performing quality assurance will be discussed in the last part of the book (Chapters 12 and 13).

Applications

Text classification has been of interest in a number of research and application scenarios, ranging from identifying the authors of unknown texts in the 1800s to USPS's efforts in the 1960s to perform optical character recognition on addresses and zip codes1. In the 1990s, researchers began to successfully apply machine learning algorithms for text classification on large datasets. Email filtering, popularly known as spam classification, is one of the earliest examples of automatic text classification, and it impacts our daily lives to this day. From manual analysis of text documents, to purely statistical computer-based approaches, to state-of-the-art deep neural networks, we have come a long way with text classification. Let us briefly discuss some popular applications below, before diving into the different approaches to performing text classification. These examples will also be useful in identifying problems in your organization that can be solved using text classification methods.

  • Content classification and organization: This refers to the problem of classifying/tagging large amounts of textual data for content organization, search, and recommendation, to name a few use cases. Examples of such data include news websites, blogs, and online bookshelves. Tagging product descriptions on an e-commerce website, routing customer service requests in a company to the appropriate support team, and organizing emails into personal/social/promotions on Gmail are all examples of text classification for content classification and organization.

  • Customer Support: Customers of a product or service commonly use social media to express their opinions. Text classification can be useful in understanding customer feedback, for example by identifying actionable items in a stream of tweets about a product or flagging tweets that can potentially damage the brand's reputation. To illustrate this, consider the three tweets about the brand Macy's shown in Figure 2-1:

Figure 2-1. Tweets mentioning a brand. First one is actionable, other two are noise.

All three tweets explicitly mention the brand Macy's, although only the first one requires action from customer support.

  • E-Commerce: Customers leave reviews for a range of products on e-commerce websites such as Amazon and eBay. An example use of text classification in such scenarios is to understand and analyze customers' perception of a product or service based on their comments. This is commonly known as sentiment analysis. It is used extensively by brands across the globe to understand whether they are moving closer to or further away from their customers. Over time, sentiment analysis has evolved from categorizing customer feedback as just positive/negative/neutral into a more sophisticated paradigm: "aspect"-based sentiment analysis. To understand this, consider the following customer review of a restaurant:

Figure 2-2. A review that praises some aspects while criticizing others.

Would you call the above review negative, positive, or neutral? It is difficult to answer: the food was great, but the service was bad. Researchers and brands working with sentiment have realized that any product or service has multiple facets, and customers may feel differently about each of them. Clearly, text classification plays a major role in performing such fine-grained analysis of customer feedback.

  • Other Applications: Apart from the areas mentioned above, text classification is also used in several other application scenarios across various domains. Some of them are listed below:

    Text classification is used in language identification, for example, to identify the language of new tweets/posts. Google Translate also has an automatic language identification feature.

    Authorship attribution, i.e., identifying the unknown author of a text from a pool of candidate authors, is another popular use case of text classification, used in fields ranging from forensic analysis to literary studies.

    Text classification has been used for triaging posts in online support forums for mental health services2. In the NLP research community, annual competitions (e.g., clpsych.org) are conducted to solve such text classification problems originating from clinical research.

    More recently, text classification has also been used to separate fake news from real news.

This section only serves as an illustration of the wide range of applications text classification has, and the list is not exhaustive. Let us now move to building text classification models.

A Pipeline for Building Text Classification Systems

We discussed some common NLP pipelines in Chapter 3. Text classification shares much of its pipeline with what we learned in that chapter. Figure 2-3 shows the typical steps in building a text classification system. The different steps marked in the figure are described below.

Figure 2-3. Flowchart of a text classification pipeline.

We typically follow the following steps in building a text classification system:

  1. Collect or create a labeled dataset suitable for the task

  2. Split the dataset into two parts (train and test) or three parts (train, validation (a.k.a. development), and test), and decide on an evaluation metric (a short splitting example appears below).

  3. Transform raw text into feature vectors

  4. Train a classifier using the feature vectors and the labels of the training set

  5. Benchmark the model performance using the test set

  6. Deploy the model in the context of a real-world product and monitor its performance.

Steps 3--5 are iterated over to explore different variants of features and classifiers, and to tune their parameters, before deploying the most optimal solution in production.
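As a concrete illustration of step 2, a three-way split can be produced with two successive calls to scikit-learn's train_test_split. This is only a sketch; texts and labels are hypothetical names for parallel lists of documents and their classes:

          from sklearn.model_selection import train_test_split

          #First hold out a 20% test set, then split the remainder into train and validation
          X_trainval, X_test, y_trainval, y_test = train_test_split(
              texts, labels, test_size=0.2, random_state=1)
          X_train, X_val, y_train, y_val = train_test_split(
              X_trainval, y_trainval, test_size=0.25, random_state=1)  #0.25 x 0.8 = 20% of the data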

Some of the individual steps related to data collection and pre-processing were discussed in earlier chapters. For example, Steps 1 and 2 were discussed in detail in Chapter 2, and Chapter 3 focused entirely on Step 3. Our focus in this chapter is on Steps 4 and 5, considering several approaches and use cases, with a special focus on Step 1 later in the chapter. Step 6 is dealt with in Chapters 13--14. To carry out Steps 4 and 5, i.e., to compare multiple classifiers and benchmark their performance, we need some measures of evaluation. Chapter 2 discussed the general measures used in evaluating NLP systems. For evaluating classifiers specifically, the following measures from among those introduced in Chapter 2 are most commonly used: classification accuracy, precision, recall, and F-score. We will also look at confusion matrices to understand model performance in detail.

Apart from these, when such classification systems are deployed in real-world applications, Key Performance Indicators (KPIs) specific to a given use case are also used to evaluate their impact. For example, if we are using text classification to route customer service requests automatically, one KPI could be the reduction in the response time to provide support over a given timeframe. In this chapter, we will focus on the NLP evaluation measures. In Section 3 of the book, where we discuss specific industry verticals, we will introduce some KPIs for those verticals.

Before we start looking at how to build text classifiers using this pipeline, let us first take a look at two scenarios where we won't need it.

A simple classifier without this pipeline

When we talk about the above pipeline, we are referring to a supervised machine learning scenario. However, it is possible to build a simple classifier without machine learning and without this pipeline. Consider the following problem statement: we are given a corpus of tweets where each tweet is labeled with its corresponding sentiment, negative or positive. For example, a tweet saying "The new James Bond movie is great!" clearly expresses a positive sentiment, whereas a tweet saying "I would never visit this restaurant again, horrible place!!" has a negative sentiment. We wish to build a classification system that will predict the sentiment of an unseen tweet using only the text of the tweet. A simple solution could be to create lists of positive and negative words in English, compare the usage of positive versus negative words in the tweet, and make a prediction based on this information. Further enhancements could involve creating more sophisticated dictionaries with degrees of positive/negative/neutral sentiment for words, or formulating specific heuristics (e.g., usage of certain smileys indicates positive sentiment) and using them to make predictions.
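A minimal sketch of such a lexicon-based classifier is shown below. The word lists here are tiny illustrative placeholders, not real sentiment dictionaries:

          positive_words = {"great", "good", "awesome", "love", "excellent"}
          negative_words = {"horrible", "bad", "terrible", "never", "awful"}

          def rule_based_sentiment(tweet):
              #Count positive vs. negative words and predict the majority polarity
              tokens = [token.strip("!?.,") for token in tweet.lower().split()]
              pos = sum(token in positive_words for token in tokens)
              neg = sum(token in negative_words for token in tokens)
              return "positive" if pos >= neg else "negative"

          print(rule_based_sentiment("The new James Bond movie is great!"))  #positive
          print(rule_based_sentiment("I would never visit this restaurant again, horrible place!!"))  #negative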

Clearly, this approach does not involve any "learning" of text classification; it is based on a set of heuristics or rules and custom-built resources such as dictionaries. While this approach may seem too simple to handle many real-world scenarios, it may enable you to quickly deploy a minimum viable product (MVP): an initial working solution that can be tested with customers to obtain feedback for further development. Most importantly, such a simple model leads to a better understanding of the problem and gives you a baseline for both your evaluation metric and speed. From our experience, it is always good to start with such simpler models when tackling a new NLP problem with data. Eventually, however, we will need learning methods that can infer more insights from large collections of text data.

Using existing text classification APIs

Another scenario where you may not have to "learn" a classifier or follow this pipeline is when your task is more generic in nature, such as identifying the general category of a text (e.g., whether it is about technology or music). In such cases, existing APIs such as Google Cloud Natural Language3 provide off-the-shelf content classification models that can identify close to 700 different categories of text. Another popular classification task served this way is sentiment analysis. All major service providers (e.g., Google, Microsoft) offer sentiment analysis APIs, with varying payment structures. If you are tasked with building a sentiment classifier, you may not have to build your own system if an existing API addresses your business need. However, many classification tasks are specific to your organization's business needs. For the rest of this chapter, we will address the task of building our own classifier, following the pipeline described earlier in this section.

One Pipeline, Many Classifiers

Let us now look at building text classifiers by altering Steps 3--5 in the pipeline and keeping the remaining steps constant. The first thing we need to start using this pipeline is a good dataset. Throughout this chapter, we will use some of the publicly available datasets for text classification. A wide range of NLP-related datasets, including ones for text classification, are listed online4. Additionally, Figure Eight (Figure-Eight.com, 2019)5 contains a collection of crowdsourced datasets, some of which are relevant for text classification. The UCI Machine Learning Repository6 also contains a few text classification datasets. We will use multiple datasets throughout this chapter instead of sticking to one, to illustrate dataset-specific issues one may come across.

An important point to note in this context is that our goal in this chapter is only to give you an overview of different approaches. No single approach is known to work universally well on all kinds of data and all classification problems. In the real world, we experiment with multiple approaches, evaluate them, and choose one final approach to deploy in practice. Figure 2-4 shows a timeline of the different algorithms and feature representation approaches we will use in the rest of this chapter.

Figure 2-4. Evolution of text classification approaches.

For the rest of this section, we will use the "Economy news article tone and relevance" dataset from Figure Eight (see footnote 5) to demonstrate text classification. It consists of 8,000 news articles annotated with whether or not they are relevant to the US economy (i.e., a Yes/No binary classification). The dataset is also imbalanced, with ~1,500 relevant and ~6,500 non-relevant articles, which poses the challenge of guarding against a bias towards the majority category, i.e., non-relevant articles. Clearly, learning what makes a news article relevant is more challenging with this dataset than learning what makes one irrelevant. After all, just guessing that everything is irrelevant already gives us 80% accuracy!

Let us explore how a bag-of-words feature representation can be used with this dataset, following the pipeline described in the previous section. We will build classifiers using three well-known algorithms: Naive Bayes, logistic regression, and support vector machines. The notebook for this section (Notebook_1.ipynb in this chapter's folder) shows the step-by-step process of following our pipeline with these three algorithms. We will discuss some of the important aspects here.

Naive Bayes Classifier

Naive Bayes is a probabilistic classifier that uses Bayes' theorem to classify texts based on the evidence seen in the training data. It estimates the conditional probability of each feature of a given text for each class based on the occurrence of that feature in that class, and multiplies the probabilities of all the features of a given text to compute the final probability of classification for each class. Finally, it chooses the class with the maximum probability. A detailed step-by-step explanation of the classifier is beyond the scope of this book; for the interested reader, Naive Bayes, specifically in the context of text classification, is explained in detail in Chapter 4 of Jurafsky & Martin (2018)7. Although simple, Naive Bayes is commonly used as a baseline algorithm in classification experiments.
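For concreteness, the decision rule of the multinomial Naive Bayes model can be written as follows (a standard textbook formulation, where w_1, ..., w_n are the words of the document, c ranges over the classes in C, and the probabilities are estimated from counts in the training data):

          \hat{c} = \underset{c \in C}{\arg\max} \; P(c) \prod_{i=1}^{n} P(w_i \mid c)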

Let us walk through the pipeline described earlier for our dataset, using the Naive Bayes implementation in sklearn. Since we already have access to the dataset, our next step is to split the data into train and test sets, as shown in the code snippet below:

          #Step 1: train-test split
          from sklearn.model_selection import train_test_split

          X = our_data.text #the column 'text' contains the textual data to extract features from
          y = our_data.relevance #this is the column we are learning to predict
          #split X and y into training and testing sets. By default, 75% goes to training and 25% to test
          #random_state=1 for reproducibility
          X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)
          

The next step is to pre-process the texts and convert them into feature vectors. While there are many different ways to do the pre-processing, let us say we want to do the following: lowercasing and removal of punctuation, digits, custom strings, and stopwords. The code snippet below shows this pre-processing and the conversion of the train/test data into feature vectors using CountVectorizer in sklearn, which is the implementation of the bag-of-words approach we discussed in Chapter 3.

          #Step 2-3: Preprocess and vectorize train and test data
          from sklearn.feature_extraction.text import CountVectorizer

          vect = CountVectorizer(preprocessor=clean)
          #clean is a function we defined for pre-processing, shown in the notebook.
          X_train_dtm = vect.fit_transform(X_train) #learn the vocabulary on train data and vectorize it
          X_test_dtm = vect.transform(X_test) #vectorize test data using the same vocabulary
          print(X_train_dtm.shape, X_test_dtm.shape)
          

Once you run this in the notebook, you will see that we end up with feature vectors of over 45,000 features! We now have the data in the format we want, i.e., feature vectors. The next step is to train and evaluate a classifier. The code snippet below shows the training and evaluation of a Naive Bayes classifier with the features we extracted above.

          from sklearn.naive_bayes import MultinomialNB

          nb = MultinomialNB() #instantiate a Multinomial Naive Bayes classifier
          nb.fit(X_train_dtm, y_train) #train the model
          y_pred_class = nb.predict(X_test_dtm) #make class predictions for test data
          
Figure 2-5. Confusion Matrix for Naive Bayes classifier
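A confusion matrix like the one in Figure 2-5 can be computed directly from these predictions; a minimal sketch using sklearn's metrics module, with y_test and y_pred_class from the snippet above:

          from sklearn import metrics

          print("Accuracy: ", metrics.accuracy_score(y_test, y_pred_class))
          #Rows are true classes, columns are predicted classes
          print(metrics.confusion_matrix(y_test, y_pred_class))
          print(metrics.classification_report(y_test, y_pred_class))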

As we can see, the classifier does fairly well at identifying the non-relevant articles, making errors only 14% of the time. However, it performs much worse on the second category, the relevant articles, which are identified correctly only 42% of the time. Since the dataset is given and we cannot change it or collect additional data, let us think through some possible reasons for this performance and ways to improve the classifier. These are summarized in Table 2-1.

Table 2-1. Reasons for poor classifier performance
Reason 1 Since we extracted all possible features, we ended up with large, sparse feature vectors, where most features are too rare and end up being noise.
Reason 2 There are very few examples of relevant articles (~20%) compared to non-relevant articles (~80%) in the dataset. This skews the learning process towards the non-relevant category, as there are very few examples of "relevant" articles to learn from.
Reason 3 Perhaps we need a better learning algorithm.
Reason 4 Perhaps we need a better pre-processing and feature extraction mechanism (addressed later in this chapter).
Reason 5 Perhaps we should look at tuning the classifier's parameters and hyperparameters (partially addressed later in this chapter).

Let us see how to improve the classification performance by addressing some of these possible reasons. One way to approach Reason 1 is to reduce noise in the feature vectors. The approach so far produced a very large number of features, over 45,000 (refer to the Jupyter notebook for details). Let us see what happens if we restrict this to 5,000 and re-run the training and evaluation process. This requires changing the CountVectorizer instantiation, as shown in the code snippet below, and repeating all the steps.

          vect = CountVectorizer(preprocessor=clean, max_features=5000)
          
Figure 2-6. Improved classification performance with Naive Bayes and feature selection

Clearly, while the average performance seems lower than before, the correct identification of relevant articles increased by over 20%. At this point, one may wonder whether that is what we want. The answer depends on the problem we are trying to solve. If we care about doing reasonably well on non-relevant articles and as well as possible on relevant articles, or about doing equally well on both, then reducing the feature vector size was useful for this dataset with the Naive Bayes classifier.

Reason 2 in our list was the skew towards the majority class. There are several ways to address this; two typical approaches are oversampling the instances of the minority class or undersampling the majority class to create a balanced dataset. Imbalanced-Learn8 is a Python library that implements some of these sampling methods (a small illustration follows below). While we will not delve into the details of this library here, many classifiers also have a built-in mechanism to handle imbalanced datasets. We will see how to use it in the next subsection with another classifier: logistic regression.
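As an illustration of the resampling route, random oversampling of the minority class with Imbalanced-Learn might look like the sketch below (recent releases use fit_resample; older versions called it fit_sample):

          from collections import Counter
          from imblearn.over_sampling import RandomOverSampler

          ros = RandomOverSampler(random_state=1)
          #Duplicate minority-class examples until both classes are equally represented
          X_train_balanced, y_train_balanced = ros.fit_resample(X_train_dtm, y_train)
          print(sorted(Counter(y_train_balanced).items()))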

Logistic Regression

When we described the Naive Bayes classifier, we mentioned that it learns the probability of a text for each class and chooses the one with the maximum probability. Such a classifier is called a "generative" classifier. In contrast, a "discriminative" classifier aims to directly learn the probability distribution over all classes for a given input. Logistic regression is an example of a discriminative classifier, and it is commonly used in text classification, both as a baseline in research and as an MVP in real-world industry scenarios.

Unlike Naive Bayes, which estimates probabilities based on feature occurrence in each class, logistic regression "learns" weights for individual features based on how important they are for the classification decision. The goal of logistic regression is to learn a linear separator between the classes in the training data, with the aim of maximizing the probability of the data. This "learning" of feature weights and of the probability distribution over the classes is done through a function called the logistic function, hence the name logistic regression. For a detailed mathematical description of logistic regression, refer to Chapter 5 of Jurafsky & Martin (2018).
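In the binary case, the probability of the positive class is obtained by passing the weighted sum of the features through the logistic (sigmoid) function; this is the standard formulation:

          P(y = 1 \mid \mathbf{x}) = \frac{1}{1 + e^{-(\mathbf{w} \cdot \mathbf{x} + b)}}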

Let us take the 5,000-dimensional feature vectors from the last step of the Naive Bayes example and train a logistic regression classifier instead. The code snippet below shows how to use logistic regression for this task.

          from sklearn.linear_model import LogisticRegression
          from sklearn import metrics

          logreg = LogisticRegression(class_weight="balanced") #weight classes inversely to their frequency
          logreg.fit(X_train_dtm, y_train) #train the model
          y_pred_class = logreg.predict(X_test_dtm) #predict on test data
          print("Accuracy: ", metrics.accuracy_score(y_test, y_pred_class))
          

This results in a classifier with an accuracy of 73.7%. Figure 2-7 shows the confusion matrix for this approach.

Figure 2-7. Classification performance with Logistic Regression

Our logistic regression instantiation has an argument, class_weight, which is given the value "balanced". This tells the classifier to boost the weights of classes in inverse proportion to the number of samples for each class, so we expect better performance on the less-represented class. You can experiment by removing that argument and re-training the classifier, to witness a fall (by approximately 5%) in the bottom-right cell of the confusion matrix. Still, logistic regression clearly seems to perform worse than Naive Bayes for this dataset.

Reason 3 in our list was: "Perhaps we need a better learning algorithm." This raises the question: what is a better learning algorithm? A general rule of thumb when working with machine learning approaches is that no single algorithm learns well on all datasets. The common approach is to experiment with multiple algorithms and compare them. Let us see if this idea helps, by replacing logistic regression with a well-known classification algorithm that has proven useful for many text classification tasks: the support vector machine.

Support Vector Machine (SVM)

We described logistic regression as a discriminative classifier that learns weights for individual features and predicts a probability distribution over the classes. A support vector machine, first invented in the early 1960s, is a discriminative classifier like logistic regression. However, unlike logistic regression, it does not aim to maximize the probability of the data. Instead, it looks for an optimal hyperplane, possibly in a higher-dimensional space, that separates the classes in the data by the maximum possible margin. Further, SVMs are capable of learning even non-linear separations between classes, unlike logistic regression. However, they may also take longer to train.

SVMs come in different variants in sklearn. Let us use one of them, keeping everything else the same but reducing the maximum number of features to 1,000 instead of the 5,000 used in the previous example. The code snippet below shows how to do this, and Figure 2-8 shows the resulting confusion matrix.

          from sklearn.svm import LinearSVC
          vect = CountVectorizer(preprocessor=clean, max_features=1000) #Step-1
          X_train_dtm = vect.fit_transform(X_train)#combined step 2 and 3
          X_test_dtm = vect.transform(X_test)
          classifier = LinearSVC(class_weight='balanced') #notice the “balanced” option
          classifier.fit(X_train_dtm, y_train) #fit the model with training data
          y_pred_class = classifier.predict(X_test_dtm)
          print("Accuracy: ", metrics.accuracy_score(y_test, y_pred_class))
          
Figure 2-8. Classification with SVM

Compared to logistic regression, the SVM seems to have done better on the "relevant" articles category, although, among this small set of experiments, Naive Bayes with the smaller feature set seems to be the best classifier for this dataset.

All the examples in this section showed how changes in different steps affect classification performance and how to interpret the results. We obviously excluded many other possibilities, such as exploring other classifiers, changing the parameters of the classifiers we used, and coming up with better pre-processing methods. We leave these as exercises for the reader, using the notebook as a playground. A real-world text classification project involves exploring multiple options, starting with the simplest approach in terms of modeling, deployment, and scale, and gradually increasing complexity. Our eventual goal is to build the classifier that best meets our business need given all the other constraints.
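As one example of such exploration, tuning classifier parameters (Reason 5 in Table 2-1) can be done with a simple grid search. The sketch below reuses the vectorized training data from above; the parameter grid is illustrative:

          from sklearn.model_selection import GridSearchCV
          from sklearn.svm import LinearSVC

          param_grid = {"C": [0.01, 0.1, 1, 10]}  #regularization strengths to try
          grid = GridSearchCV(LinearSVC(class_weight="balanced"), param_grid, cv=5, scoring="f1_macro")
          grid.fit(X_train_dtm, y_train)
          print(grid.best_params_, grid.best_score_)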

Let us now consider part of Reason 4 in Table 2-1: better feature representation. So far in this chapter, we have used bag-of-words features. Let us see how we can use the other feature representations we saw in Chapter 3 for text classification.

Using Neural Embeddings in Text Classification

In the latter half of Chapter 3, we discussed feature engineering techniques based on neural networks, such as word embeddings, character embeddings, and document embeddings. The advantage of embedding-based features is that they create a dense, low-dimensional feature representation instead of the sparse, high-dimensional representation of bag-of-words, TF-IDF, and similar features. There are different ways of designing and using features based on such neural embeddings. In this section, we look at some ways of using these embedding representations for text classification.

Word Embeddings

Words and n-grams have long been used as features in text classification. Different ways of vectorizing words have been proposed, and we used one such representation in the previous section with CountVectorizer. In the past few years, neural-network-based architectures have become popular for "learning" word representations, known as "word embeddings." We briefly surveyed the intuitions behind them in Chapter 3. Let us now take a look at how to use word embeddings as features for text classification. We will use the sentiment-labelled sentences dataset from the UCI repository, which consists of 1,500 positive and 1,500 negative sentiment sentences from Amazon, Yelp, and IMDB. All the steps are detailed in the notebook (Notebook_2.ipynb). Let us walk through the important steps and where this approach differs from the previous section's procedure.

Loading and pre-processing the text data remains a common step. However, instead of vectorizing the texts using bag-of-words features, we will now rely on neural embedding models. As mentioned earlier, we will use a pre-trained embedding model. Word2Vec, which we briefly discussed in Chapter 3, is a popular algorithm for training word embedding models. Several pre-trained Word2Vec models trained on large corpora are available for download; let us use the one from Google. The following code snippet shows how to load this model into Python using gensim.

          import os
          from gensim.models import KeyedVectors

          data_path = "/your/folder/path"
          path_to_model = os.path.join(data_path, 'GoogleNews-vectors-negative300.bin')
          training_data_path = os.path.join(data_path, "sentiment_sentences.txt")
          #Load the W2V model. This will take some time.
          w2v_model = KeyedVectors.load_word2vec_format(path_to_model, binary=True)
          print('done loading Word2Vec')
          

This is a large model that can be seen as a dictionary where the keys are the words in the vocabulary and the values are their learned embedding representations. How do we use this model to extract features? As we discussed in Chapter 3, there are multiple ways of doing this. A simple approach is to average the embeddings of the individual words in a text. The code snippet below shows a simple function to do this.

          import numpy as np

          # Create a feature vector for each text by averaging the embeddings of its words
          def embedding_feats(list_of_lists):
              DIMENSION = 300
              zero_vector = np.zeros(DIMENSION)
              feats = []
              for tokens in list_of_lists:
                  feat_for_this = np.zeros(DIMENSION)
                  count_for_this = 0
                  for token in tokens:
                      if token in w2v_model: #ignore words missing from the pre-trained vocabulary
                          feat_for_this += w2v_model[token]
                          count_for_this += 1
                  if count_for_this > 0:
                      feats.append(feat_for_this/count_for_this)
                  else:
                      feats.append(zero_vector) #no known words in this text
              return feats

          train_vectors = embedding_feats(texts_processed)
          print(len(train_vectors))
          

Once feature extraction is done, the final step is similar to what we did in the previous section: use this feature set to train a classifier. We leave that as an exercise (you can refer to the notebook for the code). When trained with a logistic regression classifier, these features gave a classification accuracy of 81% on our dataset (see the notebook for more details). Considering that we just used an existing word embedding model and followed only basic pre-processing steps, this is a great baseline model to have! We saw in Chapter 3 that there are other pre-trained embeddings, which can all be experimented with in the same way. Gensim, which we used in this example, also supports training our own word embeddings if necessary. If we are working in a custom domain whose vocabulary is remarkably different from that of the pre-trained news embeddings we used here, it would make sense to train our own embeddings for feature extraction. An important factor to consider when deploying models with embedding-based features is that the learned or pre-trained embedding model has to be stored and loaded into memory. If the model itself is bulky (e.g., the pre-trained model we used is 3.6 GB), we need to factor this into our deployment needs.
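Returning to the point about training our own embeddings: gensim makes this straightforward. The sketch below assumes tokenized_corpus is a hypothetical list of token lists from our domain; the parameter names follow gensim 4.x (older versions used size instead of vector_size):

          from gensim.models import Word2Vec

          #Train a domain-specific word2vec model on our own corpus
          custom_w2v = Word2Vec(sentences=tokenized_corpus, vector_size=300,
                                window=5, min_count=2, workers=4, epochs=10)
          custom_w2v.save("domain_w2v.model")
          #The learned vectors (custom_w2v.wv) can then replace w2v_model in embedding_feats() above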

Subword Embeddings and fastText

Word embeddings, as the name indicates, are word-level representations, and even off-the-shelf embeddings seem to work well on a classification task, as we saw earlier. However, if a word in our dataset was not present in the pre-trained model's vocabulary, how do we get a representation for it? In the previous example, we simply ignored such words during feature extraction. Is there a better way?

We discussed fastText embeddings in Chapter 3. They are based on the idea of enriching word embeddings with subword-level information. Thus, the embedding for each word is the sum of the embeddings of its constituent character n-grams. While this may seem like a longer process than estimating word-level embeddings directly, it has two advantages (illustrated in the short sketch after this list):

  • This approach can handle words that did not appear in training data.

  • The implementation facilitates extremely fast learning on even very large corpora.
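The practical effect of subword information can be seen with gensim's FastText implementation (used here purely as an illustration; the classification example later in this subsection uses the fasttext library itself). Even a word absent from the training corpus receives a vector assembled from its character n-grams. A sketch, assuming tokenized_texts is a hypothetical list of token lists and gensim 4.x parameter names:

          from gensim.models import FastText

          ft_model = FastText(sentences=tokenized_texts, vector_size=100,
                              window=5, min_count=2, epochs=10)
          #A misspelled or unseen word still gets a vector, built from its character n-grams
          print(ft_model.wv["amaaazing"][:5])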

While fastText is a general-purpose library for learning embeddings, it also supports off-the-shelf text classification by providing end-to-end classifier training and testing; that is, we don't have to handle feature extraction separately.

The rest of this subsection discusses using fastText for text classification. We will work with the DBPedia dataset9. It is a balanced dataset consisting of 14 classes, with 40,000 training and 5,000 testing examples per class. Thus, the total size of the dataset is 560,000 training and 70,000 testing data points. The step-by-step process is detailed in the associated Jupyter notebook (Notebook_3.ipynb). Let us get started!

The training and test sets are provided as CSV files. So, the first step involves reading these files into your Python environment and cleaning the text to remove extraneous characters, similar to the pre-processing we did for the other classifier examples. Once this is done, using fastText is quite simple. The code snippet below shows a simple fastText model.

          ## Using fastText for feature extraction and training
          from fasttext import supervised
          """fastText expects a training file (csv) and a model name as input arguments.
          label_prefix refers to the prefix before the label string in the dataset.
          The default is __label__. In our dataset, it is __class__.
          There are several other parameters, which can be seen at:
          https://pypi.org/project/fasttext/
          """
          model = supervised(train_file, 'temp', label_prefix="__class__")
          results = model.test(test_file)
          print(results.nexamples, results.precision, results.recall)
          

If you run this code in the notebook, you will notice that, despite the fact that this is a huge dataset and we gave the classifier raw text rather than feature vectors, training took only a few seconds, and we got close to 98% precision and recall! As an exercise, try building a classifier on the same dataset with BOW or word embedding features and, say, logistic regression, and notice how long the individual steps of feature extraction and classifier training take!

When you have a large dataset, and when learning seems infeasible with the approaches described so far, fastText is a good option for setting up a strong working baseline. However, there is one concern to keep in mind, as was the case with Word2Vec embeddings: because the model relies on character n-gram embeddings, saving the trained model means saving the entire character n-gram embedding dictionary with it. This results in a bulky model; for example, the model stored under the name "temp" in the code above has a size of close to 450 MB. However, the fastText implementation also comes with options to reduce the memory footprint of its classification models with minimal reduction in classification performance, by pruning the vocabulary and using compression algorithms. Exploring these options is a good idea when large model sizes are a constraint.

We hope this discussion gives you a good overview of the usefulness of fastText for text classification. What we showed here is a default classification model without any hyperparameter tuning; fastText's documentation contains more information on the different options for tuning the classifier and for training custom embedding representations for your own dataset. Both of the embedding approaches we have seen so far learn representations of words or characters and combine them to form a text representation. Let us now see how to directly learn a representation for a whole document, using the doc2vec approach we discussed in Chapter 3.

Document Embeddings

In the doc2vec embedding scheme, we learn a direct representation for the entire document rather than for each word. Just as we used word and character embeddings as features for text classification, we can also use doc2vec as a feature representation mechanism. Since there are no existing pre-trained models that work with the latest version of doc2vec10, let us see how to build our own doc2vec model and use it for text classification.

We will use a dataset called "sentiment analysis: emotion in text" from figure-eight.com, which contains 40,000 tweets labeled with 13 labels signifying different emotions. Let us take the three most frequent labels in this dataset (neutral, worry, happiness) and build a text classifier for classifying new tweets into one of these three classes. The notebook for this subsection (Notebook_4.ipynb) walks you through the steps involved in using doc2vec for text classification, and the dataset is provided along with the notebook.

After loading the dataset and taking the subset with the three most frequent labels, an important step to consider is pre-processing of the data. What is different here compared to the previous examples? Why can't we just follow the same procedure as before? Tweets differ from news articles and other such texts in a few ways, as we briefly discussed in Chapter 2 when we talked about text pre-processing. First, they are very short. Second, our traditional tokenizers may not work well with tweets, splitting smileys, hashtags, Twitter handles, etc. into multiple tokens. Such specialized needs prompted a lot of research into NLP for Twitter, which resulted in several pre-processing options for tweets. One such solution is the TweetTokenizer implemented in the NLTK11 library in Python. Let us use that instead of what we did in the previous examples. The code snippet below shows how.

          from nltk.tokenize import TweetTokenizer
          from nltk.corpus import stopwords

          tweeter = TweetTokenizer(strip_handles=True, preserve_case=False)
          mystopwords = set(stopwords.words("english"))

          #Function to pre-process and tokenize tweets
          def preprocess_corpus(texts):
              def remove_stops_digits(tokens):
                  #Nested function that removes stopwords and digits from a list of tokens
                  return [token for token in tokens if token not in mystopwords and not token.isdigit()]
              return [remove_stops_digits(tweeter.tokenize(content)) for content in texts]

          mydata = preprocess_corpus(df_subset['content'])
          mycats = df_subset['sentiment']
          

The next step is to train a doc2vec model to learn tweet representations. Ideally, any large dataset of tweets would work for this step. However, since we don't have such a ready-made corpus, we will split our dataset into train and test sets and use the training data to learn the doc2vec representations. The first part of this process involves converting the data into a format readable by the doc2vec implementation, which can be done using the TaggedDocument class. It represents a document as a list of tokens followed by a "tag," which, in its simplest form, can just be the file name or ID of the document. (Doc2vec itself can also be used as a nearest-neighbor classifier for both multiclass and multilabel classification problems using TaggedDocument; we leave this as an exploratory exercise for the reader.) The code snippet below shows how to use this class with the tweets in our training data to train a doc2vec model.

          from sklearn.model_selection import train_test_split
          from gensim.models.doc2vec import Doc2Vec, TaggedDocument

          train_data, test_data, train_cats, test_cats = train_test_split(mydata, mycats, random_state=1234)
          #prepare training data in doc2vec format:
          train_doc2vec = [TaggedDocument((d), tags=[str(i)]) for i, d in enumerate(train_data)]
          #Train a doc2vec model to learn tweet representations. Use only training data!!
          model = Doc2Vec(vector_size=50, alpha=0.025, min_count=10, dm=1, epochs=100)
          model.build_vocab(train_doc2vec)
          model.train(train_doc2vec, total_examples=model.corpus_count, epochs=model.epochs)
          model.save("d2v.model")
          print("Model Saved")
          

Training doc2vec involves making several choices regarding parameters, as seen in the model definition in the code snippet above: vector_size is the dimensionality of the learned embeddings; alpha is the learning rate; min_count is the minimum frequency for a word to remain in the vocabulary; dm, which stands for "distributed memory," is one of the representation learners implemented in doc2vec (the other is dbow, distributed bag of words); and epochs is the number of training iterations. There are a few other parameters that can be customized. While there are some guidelines on choosing optimal parameters for training doc2vec models12, they have not been exhaustively validated, and we don't know whether they work for tweets.

The best way to address this is to explore a range of values for the parameters that matter (e.g., dm versus dbow, vector sizes, learning rate) and compare multiple models. But how do we compare models that only learn a text representation? One way is to use the learned representations in a downstream task, i.e., text classification in this case. Doc2vec's infer_vector function can be used to infer the vector representation for a given text using a trained model. Since there is some randomness due to negative sampling and other factors, the inferred vectors differ each time we extract them. For this reason, to get a stable representation, we run inference for multiple iterations (controlled by the steps argument) and aggregate the vectors. Let us use the learned model to infer features for our data and train a logistic regression classifier.

          from sklearn.linear_model import LogisticRegression
          from sklearn.metrics import classification_report

          #Infer the feature representation for training and test data using the trained model
          model = Doc2Vec.load("d2v.model")
          #infer in multiple steps to get a stable representation
          train_vectors = [model.infer_vector(list_of_tokens, steps=50) for list_of_tokens in train_data]
          test_vectors = [model.infer_vector(list_of_tokens, steps=50) for list_of_tokens in test_data]
          myclass = LogisticRegression(class_weight="balanced") #because the classes are not balanced
          myclass.fit(train_vectors, train_cats)
          preds = myclass.predict(test_vectors)
          print(classification_report(test_cats, preds))
          

Now, the performance of this model seems rather poor, achieving an F1 score of 0.51 on a reasonably large corpus with only three classes. There are a couple of interpretations for this. First, unlike full news articles or even well-formed sentences, tweets contain very little data per instance. Further, people write with a wide variety of spelling and syntax when they tweet, and there are a lot of emoticons in different forms. Our feature representation should be able to capture such aspects. While tuning the algorithm's parameters by searching a large parameter space for the best model may help, an alternative in such situations could be to explore problem-specific feature representations, as we discussed in Chapter 3.

An important concern to keep in mind when using doc2vec is the same as for fastText: if we use doc2vec for feature representation, we have to store the model that learned the representation. While it is not typically as bulky as fastText, it is also not as fast to train. Such trade-offs need to be considered and compared before making a deployment decision.

So far, we saw a range of feature representations and how they can be useful for text classification using machine learning algorithms. Let us now turn to a family of algorithms that became popular in the past few years, known as “deep learning.”

Deep Learning for Text Classification

Deep learning is a family of machine learning algorithms where the learning happens through multi-layered neural network architectures of various kinds. Over the past few years, it has shown remarkable improvements on standard machine learning tasks such as image recognition, speech recognition, and machine translation. This has resulted in widespread interest in using deep learning for various tasks, including text classification. So far, we have seen how to train different machine learning classifiers using bag-of-words and different kinds of embedding representations. Let us now look at how to use deep learning architectures for text classification.

Two commonly used neural network architectures for text classification are convolutional neural networks (CNNs) and recurrent neural networks (RNNs). Long short-term memory (LSTM) networks are a popular form of RNNs. In this section, we will learn how to train CNNs and LSTMs for text classification, using the IMDB sentiment classification dataset13. A detailed discussion of how these neural network architectures work is beyond the scope of this book. Interested readers are recommended to read the textbook by Goodfellow et al.14 for a general theoretical discussion and Goldberg's book15 for NLP-specific uses of neural network architectures. Jurafsky and Martin's book (see footnote 7) also provides a quick overview of different neural network methods for NLP.

The first step towards training any machine learning or deep learning model is to define a feature representation. This step was relatively straightforward in the approaches we have seen so far, with bag-of-words or embedding vectors. However, for neural networks, we need to process the input vectors further, as we saw in Chapter 2. Let us quickly recap the steps for converting training and test data into a format suitable for the neural network input layers:

  1. Tokenize the texts and convert them into word index vectors

  2. Pad the text sequences so that all text vectors are of the same length

  3. Map every word index to an embedding vector. We do so by multiplying word index vectors with the embedding matrix. The embedding matrix can either be populated using pre-trained embeddings or be trained for embeddings on this corpus.

  4. Use the output from Step 3 as the input to a neural network architecture.

Once these steps are done, we can proceed with specifying neural network architectures and training classifiers with them. The Jupyter notebook associated with this section (Notebook_5.ipynb) walks you through the entire process, from text pre-processing to neural network training and evaluation. We will use Keras, a Python-based deep learning library. The code snippet below illustrates Steps 1 and 2:

        #Vectorize these text samples into a 2D integer tensor using Keras Tokenizer
        #Tokenizer is fit on training data only, and is then used to tokenize both train and test data.
        from keras.preprocessing.text import Tokenizer
        from keras.preprocessing.sequence import pad_sequences
        from keras.utils import to_categorical
        import numpy as np

        tokenizer = Tokenizer(num_words=MAX_NUM_WORDS)
        tokenizer.fit_on_texts(train_texts)
        train_sequences = tokenizer.texts_to_sequences(train_texts)
        test_sequences = tokenizer.texts_to_sequences(test_texts)
        word_index = tokenizer.word_index
        print('Found %s unique tokens.' % len(word_index))
        #Convert to padded sequences to be fed into the neural network. Max sequence length is 1000, as set earlier;
        #pad with 0s at the start until each vector is of length MAX_SEQUENCE_LENGTH.
        trainvalid_data = pad_sequences(train_sequences, maxlen=MAX_SEQUENCE_LENGTH)
        test_data = pad_sequences(test_sequences, maxlen=MAX_SEQUENCE_LENGTH)
        trainvalid_labels = to_categorical(np.asarray(train_labels))
        test_labels = to_categorical(np.asarray(test_labels))
        

Step 3: If we want to use pre-trained embeddings to convert the train/test data into an embedding matrix, as we did in the earlier examples with Word2Vec and fastText, we have to download them and use them to convert our data into the input format for the neural network. The following code snippet shows how to do this using GloVe embeddings, which were introduced in Chapter 3. GloVe embeddings come in multiple dimensionalities, and we chose 100 as our dimension here. The dimensionality is a hyperparameter, and one can experiment with other values as well16.

        #Build an index mapping words to their GloVe vectors
        embeddings_index = {}
        with open(os.path.join(GLOVE_DIR, 'glove.6B.100d.txt')) as f:
            for line in f:
                values = line.split()
                word = values[0]
                coefs = np.asarray(values[1:], dtype='float32')
                embeddings_index[word] = coefs

        #Prepare the embedding matrix: row i holds the GloVe vector for the word with index i
        num_words = min(MAX_NUM_WORDS, len(word_index)) + 1
        embedding_matrix = np.zeros((num_words, EMBEDDING_DIM))
        for word, i in word_index.items():
            if i > MAX_NUM_WORDS:
                continue
            embedding_vector = embeddings_index.get(word)
            if embedding_vector is not None:
                embedding_matrix[i] = embedding_vector
        

Step 4: Now we are ready to train deep learning models for text classification! Deep learning architectures consist of an input layer, an output layer, and several hidden layers in between. Depending on the architecture, different hidden layers are used. The input layer for textual input is typically an embedding layer, and the output layer, especially in the context of text classification, is a softmax layer with categorical output. If we want to train the input embedding layer instead of using pre-trained embeddings, the easiest way is to call the Embedding layer class in Keras, specifying the input and output dimensions. However, since we want to use pre-trained embeddings, we should create a custom embedding layer that uses the embedding matrix we just built. The following code snippet shows how to do that.

        from keras.layers import Embedding
        from keras.initializers import Constant

        #Embedding layer initialized with the GloVe matrix and frozen (not trained further)
        embedding_layer = Embedding(num_words,
                                    EMBEDDING_DIM,
                                    embeddings_initializer=Constant(embedding_matrix),
                                    input_length=MAX_SEQUENCE_LENGTH,
                                    trainable=False)
        print("Preparing of embedding matrix is done")
        

This will serve as the input layer for any neural network we want to use (CNN or LSTM). Now that we know how to pre-process the input and define an input layer, let us move on to specifying the rest of the neural network architecture, using CNNs and LSTMs.

CNN for Text Classification

Let us now look at how to define, train, and evaluate a CNN model for text classification. CNNs typically consist of a series of convolution and pooling layers as the hidden layers. In the context of text classification, CNNs can be thought of as learning the most useful bag-of-words/n-gram features, instead of taking the entire collection of words/n-grams as features as we did earlier in this chapter. Since our dataset has only two classes, positive and negative, the output layer has two outputs with a softmax activation function. We will define a CNN with three convolution-pooling layers using the Sequential model class in Keras, which allows us to specify a deep learning model as a sequential stack of layers, one after another. Once the layers and their activation functions are specified, the next task is to define other important parameters such as the optimizer, the loss function, and the evaluation metric used to tune the model's hyperparameters. Once all this is done, the next step is to train and evaluate the model. The following code snippet shows one way of specifying such a CNN architecture in Keras and evaluating it on the IMDB dataset.

          from keras.models import Sequential
          from keras.layers import Conv1D, MaxPooling1D, GlobalMaxPooling1D, Dense

          print('Define a 1D CNN model.')
          cnnmodel = Sequential()
          cnnmodel.add(embedding_layer)
          cnnmodel.add(Conv1D(128, 5, activation='relu'))
          cnnmodel.add(MaxPooling1D(5))
          cnnmodel.add(Conv1D(128, 5, activation='relu'))
          cnnmodel.add(MaxPooling1D(5))
          cnnmodel.add(Conv1D(128, 5, activation='relu'))
          cnnmodel.add(GlobalMaxPooling1D())
          cnnmodel.add(Dense(128, activation='relu'))
          cnnmodel.add(Dense(len(labels_index), activation='softmax'))
          cnnmodel.compile(loss='categorical_crossentropy',
                           optimizer='rmsprop',
                           metrics=['acc'])
          cnnmodel.fit(x_train, y_train,
                       batch_size=128,
                       epochs=1, validation_data=(x_val, y_val))
          score, acc = cnnmodel.evaluate(test_data, test_labels)
          print('Test accuracy with CNN:', acc)
          

As you can see in this code snippet, we made a lot of choices in specifying the model: activation functions, hidden layers, layer sizes, loss function, optimizer, metrics, epochs, and batch size. While there are some commonly recommended options for these, there is no consensus on one combination that works best for all datasets and problems. A good approach while building your models is to experiment with different settings of these hyperparameters. Keep in mind that all these decisions come with an associated cost. For example, in practice we would set the number of epochs to 10 or above, but that also increases the time it takes to train the model. Another thing to note: if you want to train an embedding layer instead of using pre-trained embeddings in this model, the only thing that changes is the line cnnmodel.add(embedding_layer). Instead of that, we can specify a new embedding layer, for example, as cnnmodel.add(Embedding(Param1, Param2)). The code snippet below shows this variant and its model performance.

          print("Defining and training a CNN model, training embedding layer on the fly instead of using pre-trained embeddings")
          cnnmodel = Sequential()
          cnnmodel.add(Embedding(MAX_NUM_WORDS, 128))
          
          ...
          cnnmodel.fit(x_train, y_train,
           batch_size=128,
           epochs=1, validation_data=(x_val, y_val))
          score, acc = cnnmodel.evaluate(test_data, test_labels)
          print('Test accuracy with CNN:', acc)
          

If you run this code in the notebook, you will notice that, in this case, training the embedding layer on our own dataset seems to result in better classification on test data. However, if the training dataset were substantially smaller, sticking to the pre-trained embeddings, or using the domain adaptation techniques we will discuss later in this chapter, would be a better choice.

LSTMs for text classification

LSTMs, and other variants of RNNs in general, have become the go-to way of doing neural language modeling in the past few years. This is primarily because language is sequential in nature and RNNs are specialized in working with sequential data: the current word in a sentence depends on its context - the words before and after it. However, when we model text using CNNs, this crucial fact is not taken into account. RNNs work on the principle of using this context while learning the language representation or a model of language. Hence, they are known to work well for NLP tasks. (There are also CNN variants that can take such context into account, and CNNs versus RNNs is still an open area of debate.) In this section, we will see an example of using RNNs for text classification. Now that we have already seen one neural network in action, it is relatively easy to train another: just replace the convolutional/pooling parts with an LSTM in the two code examples above. The following code snippet shows how to train an LSTM model on the same IMDB dataset for text classification.

          print("Defining and training an LSTM model, training embedding layer on the fly")
          rnnmodel = Sequential()
          rnnmodel.add(Embedding(MAX_NUM_WORDS, 128))
          rnnmodel.add(LSTM(128, dropout=0.2, recurrent_dropout=0.2))
          rnnmodel.add(Dense(2, activation='sigmoid'))
          rnnmodel.compile(loss='binary_crossentropy',
           optimizer='adam',
           metrics=['accuracy'])
          print('Training the RNN')
          rnnmodel.fit(x_train, y_train,
           batch_size=32,
           epochs=1,
           validation_data=(x_val, y_val))
          score, acc = rnnmodel.evaluate(test_data, test_labels,
           batch_size=32)
          print('Test accuracy with RNN:', acc)
          

You will notice that this code took much longer to run than the CNN example. While LSTMs are more powerful in utilizing the sequential nature of text, they are much more data hungry compared to CNNs. Thus, the relatively lower performance of an LSTM on a dataset need not necessarily be interpreted as a shortcoming of the model itself. It is possible that the amount of data we have is not sufficient to utilize the full potential of an LSTM. As with the CNN, several parameters and hyperparameters play a very important role in model performance, and it is always a good practice to explore multiple options and compare different models before finalizing one, as illustrated in the sketch below.
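
For instance, the following is a minimal sketch (not part of the notebook accompanying this chapter) of comparing a few LSTM configurations before settling on one. The candidate settings are arbitrary illustrative choices; it reuses x_train, y_train, x_val, y_val, and MAX_NUM_WORDS from the earlier snippets, and keeps epochs at 1 only to keep the comparison quick.

          # A minimal sketch of comparing a few LSTM hyperparameter settings
          # on the validation set. Assumes x_train, y_train, x_val, y_val and
          # MAX_NUM_WORDS from the earlier pre-processing steps are in scope.
          from keras.models import Sequential
          from keras.layers import Embedding, LSTM, Dense

          # Candidate (units, dropout) settings to compare; purely illustrative.
          configs = [(64, 0.2), (128, 0.2), (128, 0.5)]
          results = {}
          for units, dropout in configs:
              model = Sequential()
              model.add(Embedding(MAX_NUM_WORDS, 128))
              model.add(LSTM(units, dropout=dropout, recurrent_dropout=dropout))
              model.add(Dense(2, activation='sigmoid'))
              model.compile(loss='binary_crossentropy', optimizer='adam',
                            metrics=['accuracy'])
              model.fit(x_train, y_train, batch_size=32, epochs=1,
                        validation_data=(x_val, y_val), verbose=0)
              _, val_acc = model.evaluate(x_val, y_val, verbose=0)
              results[(units, dropout)] = val_acc

          # Rank the configurations by validation accuracy before picking one.
          print(sorted(results.items(), key=lambda item: -item[1]))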

In this section, we introduced the idea of using deep learning for text classification, using two neural network architectures - CNN and LSTM. There are several variants of these architectures, and new models are being proposed every day by NLP researchers. However, in our experience as industry practitioners, several NLP tasks, especially text classification, still widely use the non-deep learning approaches we described earlier in the chapter. Two primary reasons for this are: the lack of large amounts of task-specific training data that neural networks demand, and issues related to computing and deployment costs. We will end this section by reiterating what we mentioned earlier when we discussed the text classification pipeline - in a practical scenario, it always makes sense to start with a simpler, easy-to-deploy approach as your MVP and incrementally go from there, taking customer needs and feasibility into account. Let us now look at how to address the first reason - training data.

Learning with Less (or No) Data, and Adapting to New Domains

So far, we have seen examples of training different text classifiers with different text representations. In all these examples, we had a relatively large training dataset available for the task. However, in most real-world scenarios, such datasets are not readily available. In other cases, you may indeed have an annotated dataset available, but it might not be large enough to train a good classifier. There can also be cases where you have a large dataset of, say, customer complaints for one product suite, but you are asked to customize your classifier to another product suite for which you have very little data - i.e., you need to adapt an existing model to a new domain. In this section, let us discuss how to build good classification systems for these scenarios of no, little, or new-domain training data.

No Training Data

Let us say you were asked to design a customer complaint classifier for your ecommerce company. The classifier is expected to automatically route customer complaint emails into a set of categories, say, billing, delivery, and others. If you are fortunate, you may discover a source of large amounts of annotated data for this task within the organization, in the form of a historical database of customer requests and their categories. If such a database does not exist, where should we start to build our classifier?

The first step in such cases is to create an annotated dataset, where customer complaints are mapped to the set of categories mentioned above. One way to approach this is to ask customer service agents to manually label some of the requests and use those as the training data for your machine learning model. Another approach is called “bootstrapping” or “weak supervision”. There can be certain patterns of information in different categories of customer requests. Perhaps billing-related requests mention variants of the word “bill”, amounts in a currency, etc., while delivery-related requests talk about shipping, delays, and so on. One can get started by compiling some such patterns and using their presence in a customer request to label it, thereby creating a small (perhaps noisy) annotated dataset for this classification task. From there, one can build a classifier to annotate a larger collection of data. Snorkel17, a software tool developed at Stanford University, is useful for deploying weak supervision for various learning tasks, including classification. Snorkel was used to deploy weak supervision based text classification models at industrial scale at Google18. The authors showed that weak supervision could create classifiers comparable in quality to those trained on tens of thousands of hand-labeled examples!
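
To make the bootstrapping idea concrete, here is a minimal sketch in plain Python; the category names and keyword lists are hypothetical examples, and Snorkel generalizes the same idea with labeling functions and a model that learns to combine and de-noise them.

          # A minimal sketch of keyword-based weak supervision (bootstrapping).
          # The categories and keyword lists below are hypothetical examples.
          KEYWORD_RULES = {
              "billing": ["bill", "invoice", "charged", "refund"],
              "delivery": ["shipping", "shipment", "delayed", "courier", "tracking"],
          }

          def weak_label(text):
              """Return a noisy label if a keyword rule fires, else 'other'."""
              text = text.lower()
              for label, keywords in KEYWORD_RULES.items():
                  if any(keyword in text for keyword in keywords):
                      return label
              return "other"

          complaints = [
              "I was charged twice on my last invoice",
              "My shipment has been delayed for a week",
              "The app crashes when I open my profile",
          ]
          # These noisy (text, label) pairs can seed a first classifier, whose
          # predictions can then be reviewed and corrected by annotators.
          print([(c, weak_label(c)) for c in complaints])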

In some other scenarios, where large-scale collection of data is necessary and feasible, crowdsourcing can be an option. Websites such as Amazon Mechanical Turk (https://www.mturk.com/) and figure-eight.com provide platforms to make use of human intelligence to create high-quality training data for machine learning tasks. A popular example of using crowd wisdom to create a classification dataset is the CAPTCHA test Google uses to ask whether a set of images contains a given object (e.g., “Select all images that contain a street sign”).

Less Training Data: Active Learning and Domain Adaptation

In the scenario described earlier, where you collected small amounts of data using human annotations or bootstrapping, it may turn out that the amount of data is too small to build a good classification model. It is also possible that most of the requests collected belong to billing and very few belong to the other categories, resulting in a highly imbalanced dataset. Asking the agents to spend many hours doing manual annotation is not realistic either. What should we do in such scenarios?

One approach to address such problems is “active learning”, which is primarily about choosing which examples to use as training data. The first step is to train a classifier with the available amount of data and start using it to make predictions on new data. Wherever the classifier is very unsure of its predictions, those examples are sent to human annotators for correct classification. The model is then re-trained using the previous training data plus the newly compiled human annotations, and this process is repeated until a satisfactory model performance is reached (a minimal sketch of one such round follows below). Tools such as Prodi.gy19 have active learning solutions implemented for text classification and make it efficient to create annotated data and text classification models quickly.
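
To make one round of this loop concrete, here is a minimal sketch of uncertainty sampling with scikit-learn; the toy texts, the TF-IDF plus logistic regression setup, and the “pick the two least confident examples” rule are all illustrative assumptions rather than the only possible choices.

          # A minimal sketch of one round of uncertainty-based active learning.
          import numpy as np
          from sklearn.feature_extraction.text import TfidfVectorizer
          from sklearn.linear_model import LogisticRegression

          # A small labeled seed set and a larger unlabeled pool (toy examples).
          labeled_texts = ["refund not processed yet", "package arrived broken",
                           "invoice amount is wrong", "courier left no delivery note"]
          labels = ["billing", "delivery", "billing", "delivery"]
          unlabeled_texts = ["charged twice this month", "where is my parcel",
                             "app keeps logging me out"]

          vectorizer = TfidfVectorizer()
          X_labeled = vectorizer.fit_transform(labeled_texts)
          X_unlabeled = vectorizer.transform(unlabeled_texts)

          clf = LogisticRegression(max_iter=1000).fit(X_labeled, labels)

          # Uncertainty = 1 - probability of the most likely class.
          uncertainty = 1 - clf.predict_proba(X_unlabeled).max(axis=1)

          # Send the most uncertain examples for human annotation, then re-train
          # the classifier on the enlarged labeled set and repeat.
          to_annotate = np.argsort(-uncertainty)[:2]
          print([unlabeled_texts[i] for i in to_annotate])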

Imagine a scenario for your customer complaint classifier where you have a lot of historical data for a range of products, but you are now asked to tune it to work on a newer product. What is potentially challenging in this situation? Typical text classification approaches rely on the vocabulary of the training data. Hence, they are inherently biased towards the kind of language seen in the training data. So, if the new products are of a very different nature (e.g., the model is trained on a suite of electronic products, and we are using it on complaints about cosmetic products), pre-trained classifiers trained on some other source data may not do well. However, it is also not realistic to train a new model for each product or product suite, as we will again run into the problem of insufficient training data. Domain adaptation is a method to address such scenarios: we “transfer” what we learned from one domain (source) with large amounts of data to another domain (target) that has less labeled data but large amounts of unlabeled data.

A typical pipeline for domain adaptation in text classification works as follows:

  1. Start with a large, pre-trained language model of the source domain (e.g., Wikipedia data).

  2. Fine-tune this model using the target domain’s unlabeled data.

  3. Train a classifier on the labeled target domain data, by extracting feature representations from the fine-tuned language model from Step 2.

ULMFit20 is a popular domain adaptation approach for text classification. In research experiments, it was shown that, with only 100 labeled examples, this approach matches the performance of training from scratch with 10-20 times more training examples on text classification tasks. When unlabeled data was additionally used to fine-tune the pre-trained language model, it matched the performance of training from scratch with 50-100 times more labeled examples on the same tasks. Domain adaptation and transfer learning methods are currently an active area of research in NLP. While their use for text classification has not yet shown dramatic improvements on standard datasets, and they are not yet commonly used in industry setups, we can expect this to get better in the near future.
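
ULMFit itself is available through the fast.ai library referenced in the footnote. As an illustration of the same three-step recipe, the following minimal sketch instead uses a generic pre-trained transformer from the Hugging Face transformers library as the language model; the model name, the tiny toy dataset, and the logistic regression trained on extracted features are all assumptions made for brevity, not the ULMFit implementation itself.

          # A minimal sketch of the three-step domain adaptation recipe, using a
          # pre-trained transformer as the language model (an illustrative choice).
          import torch
          from transformers import AutoTokenizer, AutoModel
          from sklearn.linear_model import LogisticRegression

          # Step 1: start from a language model pre-trained on general text.
          model_name = "distilbert-base-uncased"
          tokenizer = AutoTokenizer.from_pretrained(model_name)
          lm = AutoModel.from_pretrained(model_name)

          # Step 2 (not shown): continue pre-training `lm` on unlabeled
          # target-domain text, e.g., with the masked language modeling objective.

          # Step 3: use the (fine-tuned) model as a feature extractor and train a
          # lightweight classifier on the small labeled target-domain dataset.
          def embed(texts):
              inputs = tokenizer(texts, padding=True, truncation=True,
                                 return_tensors="pt")
              with torch.no_grad():
                  outputs = lm(**inputs)
              return outputs.last_hidden_state[:, 0, :].numpy()  # first-token vectors

          train_texts = ["the package never arrived", "wrong amount was charged"]
          train_labels = ["delivery", "billing"]
          clf = LogisticRegression().fit(embed(train_texts), train_labels)
          print(clf.predict(embed(["my refund was not processed"])))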

A Case Study

Let us consider a real-world scenario and how we can use some of the concepts we discussed in this section for it. Imagine you are asked to build a corporate ticketing system for your organization. This system will track all the tickets or issues people face in the organization and route them internally or to external parties. [Link to Come].9 shows a representative screenshot of such a system:

Figure 2-9. A corporate ticketing system

Now let us say your company has recently hired a medical counsel and partnered with a hospital. Your system should therefore also be able to identify any medical-related issue and route it to the relevant people and teams. But while you have some past tickets, none of them are labeled. How will you go about building such a system?

Let us explore a couple of options:

  1. Existing APIs or Libraries: One option is to start with a public API or a library and map its classes to what is relevant to you. For instance, the Google APIs we mentioned earlier in the chapter can classify content into over 700 categories, 82 of which are associated with medical or health issues. These include categories like /Health/Health Conditions/Pain Management, /Health/Medical Facilities & Services/Doctors’ Offices, and /Finance/Insurance/Health Insurance. While not all categories are relevant to your organization, some could be, and you can map them accordingly. For instance, if your company does not consider substance abuse and obesity issues relevant for the medical counsel, you can ignore /Health/Substance Abuse and /Health/Health Conditions/Obesity in this API. Similarly, whether insurance should be a part of HR or referred outside can be handled with these categories.

  2. Public Datasets: You can also adapt public datasets to your needs. For example, 20 Newsgroups is a popular text classification dataset that is also part of the sklearn library, and it covers a range of topics, including sci.med. We can use it to train a basic classifier that puts sci.med in one category and all other topics in another (a minimal sketch of this follows after this list).

  3. Weak Supervision: We have a history of past tickets, but they are not labeled. So, we can consider bootstrapping a dataset out of them using the approaches described earlier in this section. For example, consider a rule: “if the past ticket contains words like fever, diarrhea, headache, or nausea, put it in the medical counsel category”. This rule can create a small amount of (noisy) labeled data, which we can use as a starting point for our classifier.

  4. Active Learning: We can use tools like Prodigy to conduct data collection experiments, where we ask someone working at the customer service desk to look at ticket descriptions and tag them with a preset list of categories. [Link to Come] shows an example of using Prodigy for this purpose.

     [Link to Come] Active Learning with Prodigy

  5. Learning from Implicit and Explicit Feedback: Throughout the process of building, iterating on, and deploying this solution, you receive feedback that you can use to improve your system. Explicit feedback could be the medical counsel or the hospital explicitly saying that a ticket was not relevant. Implicit feedback can be extracted from dependent variables such as ticket response times and ticket response rates. All of these can be factored in to improve your model, using active learning techniques.
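
As an example of the public dataset option above, here is a minimal sketch of a binary “medical vs. everything else” classifier built from the 20 Newsgroups data bundled with sklearn; the TF-IDF features and logistic regression are illustrative choices.

          # A minimal sketch of a "sci.med vs. the rest" classifier using the
          # 20 Newsgroups dataset that ships with scikit-learn.
          from sklearn.datasets import fetch_20newsgroups
          from sklearn.feature_extraction.text import TfidfVectorizer
          from sklearn.linear_model import LogisticRegression

          train = fetch_20newsgroups(subset="train",
                                     remove=("headers", "footers", "quotes"))
          test = fetch_20newsgroups(subset="test",
                                    remove=("headers", "footers", "quotes"))

          # 1 = medical (sci.med), 0 = everything else.
          med_id = train.target_names.index("sci.med")
          y_train = (train.target == med_id).astype(int)
          y_test = (test.target == med_id).astype(int)

          vectorizer = TfidfVectorizer(max_features=50000)
          X_train = vectorizer.fit_transform(train.data)
          X_test = vectorizer.transform(test.data)

          clf = LogisticRegression(max_iter=1000, class_weight="balanced")
          clf.fit(X_train, y_train)
          print("Test accuracy:", clf.score(X_test, y_test))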

A sample pipeline incorporating these ideas is shown in [Link to Come].11:

Figure 2-10. A pipeline for building a classifier when there is no training data

In this section, we looked at the practical scenario of not having enough training data to build a text classifier for our custom problem, and discussed several possible solutions to address this issue. We hope this prepares you to anticipate some of the scenarios related to data collection and creation in your future text classification projects.

Practical Advice

So far, we have shown a range of methods for building text classifiers and the potential issues you may run into. We would like to end this chapter with some practical advice that summarizes our observations so far and our experience building text classification systems in industry settings. Some of this advice is generic enough to apply to other topics in the book as well.

  1. Establish strong baselines: A common pitfall is to start directly with a state-of-the-art algorithm. This is especially tempting in the present era of deep learning, where new approaches and algorithms keep coming up every day. However, it is always good to start with simpler approaches and establish strong baselines first. This is useful for two main reasons: a) building a quick MVP helps us get initial feedback from end users and stakeholders, and b) a state-of-the-art research model may give us only a minor improvement compared to the baseline, while coming with a huge amount of technical debt.

  2. Measure Return On Investment (ROI): While it is great to establish various machine learning metrics to benchmark your model/solution, it is equally important to establish the business impact of your machine learning solution. This is captured by ROI. Senior leaders should establish the ROI metrics and the process to measure them before the data scientists start working on the problem. To get a better sense of this, we strongly encourage you to read the last paragraph of the priority inbox paper from Google21. Most machine learning practitioners are embedded in product teams. In such a setting, building a state-of-the-art model that is impressive but does not have a positive impact on ROI is treated as a red flag.

  3. Fail early, fail fast: As of today, machine learning is still more of a science than software engineering. It requires a lot of experimentation with features, learners, and their respective parameters. Often, many (if not most) experiments end up giving negative results. Arriving at a suitable approach often goes through many failures. Hence, it is important to have many cycles of experiments, but each cycle should be short. The objective of each experiment should be small and well defined, and must be set clearly at its start.

  4. Balanced Training Data: While working with classification, it is very important to have a balanced dataset, where all categories are represented similarly. An imbalanced dataset can adversely impact the learning of the algorithm and result in a biased classifier. While we cannot always control this aspect of the training data, there are various techniques to fix class imbalance, such as collecting more data, resampling (undersampling the majority classes or oversampling the minority classes), and weight balancing (a brief sketch of the latter two follows after this list).

  5. Combining decisions and humans in the loop: In practical scenarios, it makes sense to combine the outputs of multiple classification models with hand-crafted rules from domain experts to achieve the best performance for the business. In other cases, it is practical to defer the decision to a human evaluator if the machine is not sure of its classification decision. Finally, there could also be scenarios where the learned model has to change with time and newer data. We will discuss solutions for such scenarios in the last part of the book, which focuses on end-to-end systems.
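
For the class imbalance point above, here is a minimal sketch of two common remedies: weight balancing with scikit-learn and random oversampling with the imbalanced-learn package; the tiny feature matrix and labels are placeholders for your real data.

          # A minimal sketch of two ways to handle class imbalance.
          # The toy feature matrix and labels stand in for your real data.
          from sklearn.linear_model import LogisticRegression
          from imblearn.over_sampling import RandomOverSampler

          X = [[0.1, 1.0], [0.2, 0.9], [0.15, 0.8], [0.9, 0.1]]
          y = ["billing", "billing", "billing", "delivery"]  # 3:1 imbalance

          # Option 1: weight balancing - errors on the rare class cost more.
          clf = LogisticRegression(class_weight="balanced").fit(X, y)

          # Option 2: oversample the minority class before training.
          X_res, y_res = RandomOverSampler(random_state=42).fit_resample(X, y)
          clf = LogisticRegression().fit(X_res, y_res)
          print(sorted(y_res))  # classes are now equally represented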

Summary

In this chapter, we saw how to address the problem of text classification from multiple viewpoints. We discussed the various stages in a text classification pipeline: how to identify a classification problem, how to collect and/or create relevant datasets, how to use different feature representations, and how to train several classification algorithms. With this, we hope you now have a broad picture of the relevance of text classification for various industry scenarios, how to use existing solutions, how to build your own classifiers using various methods, and how to tackle the roadblocks you may face in this process. We only focused on one aspect of deploying text classification systems in industry applications, i.e., building the actual system. Issues related to the end-to-end deployment of NLP systems will be dealt with in Chapters 12-13. In the next chapter, we will use some of the ideas we learned here to tackle a related but different NLP problem - Information Extraction.

1 https://about.usps.com/publications/pub100/pub100_042.htm

2 http://clpsych.org/shared-task-2019-2/

3 https://cloud.google.com/natural-language/

4 https://github.com/niderhoff/nlp-datasets

5 https://www.figure-eight.com/data-for-everyone/

6 https://archive.ics.uci.edu/ml/index.php

7 Jurafsky, Dan and James H. Martin. Speech and language processing. 3rd Edition (Draft). 2018.
Url: web.stanford.edu/~jurafsky/slp3/. Last accessed on: 22 March 2019

8 https://imbalanced-learn.org/

9 https://github.com/srhrshr/torchDatasets/blob/master/dbpedia_csv%20.tar.gz. Last accessed on 22 March 2019.

10 For older doc2vec versions, there are some pre-trained models e.g., https://github.com/jhlau/doc2vec

11 http://www.nltk.org/

12 Lau, Jey Han, and Timothy Baldwin. “An empirical evaluation of doc2vec with practical insights into document embedding generation.” arXiv preprint arXiv:1607.05368 (2016).

13 http://ai.stanford.edu/~amaas/data/sentiment/

14 Goodfellow, Ian, Yoshua Bengio, and Aaron Courville. Deep learning. MIT press, 2016.

15

16 There are other such pre-trained embeddings available. Our choice in this case is arbitrary.

17 https://hazyresearch.github.io/snorkel/

18 Bach, Stephen H., Daniel Rodriguez, Yintao Liu, Chong Luo, Haidong Shao, Cassandra Xia, Souvik Sen et al. “Snorkel DryBell: A Case Study in Deploying Weak Supervision at Industrial Scale.” arXiv preprint arXiv:1812.00417 (2018).

19 https://prodi.gy/

20 http://nlp.fast.ai/classification/2018/05/15/introducting-ulmfit.html

21 Aberdeen, Douglas, Ondrey Pacovsky, and Andrew Slater. “The learning behind gmail priority inbox.” (2010).
