Chapter 10. Text Mining with Mallet – Topic Modeling and Spam Detection

In this chapter, we will first discuss what text mining is, what kind of analysis it is able to offer, and why you might want to use it in your application. We will then discuss how to work with Mallet, a Java library for natural language processing, covering data import and text pre-processing. Afterwards, we will look into two text mining applications: topic modeling, where we will discuss how text mining can be used to identify topics found in text documents without reading them individually; and spam detection, where we will discuss how to automatically classify text documents into categories.

This chapter will cover the following topics:

  • Introducing text mining
  • Installing and working with Mallet
  • Topic modeling
  • Spam detection

Introducing text mining

Text mining, or text analytics, refers to the process of automatically extracting high-quality information from text documents, most often written in natural language, where high-quality information is considered to be relevant, novel, and interesting.

While a typical text-analytics application is to scan a set of documents to generate a search index, text mining can be used in many other applications, including text categorization into specific domains; text clustering to automatically organize a set of documents; sentiment analysis to identify and extract subjective information in documents; concept/entity extraction that is capable of identifying people, places, organizations, and other entities from documents; document summarization to automatically provide the most important points in the original document; and learning relations between named entities.

A text mining process based on statistical pattern mining usually involves the following steps:

  1. Information retrieval and extraction.
  2. Transforming unstructured text data into structured data; for example, parsing, removing noisy words, lexical analysis, calculating word frequencies, deriving linguistic features, and so on.
  3. Discovery of patterns from structured data and tagging/annotation.
  4. Evaluation and interpretation of the results.
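The second step can be illustrated with a minimal sketch that turns raw text into word frequencies. It uses only the Java standard library, not Mallet; the class name `WordFrequencies` and the tiny stop-word list are assumptions made for this example, and a real application would use a far longer list of noisy words.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

class WordFrequencies {
    // Tiny illustrative stop-word list (an assumption; real lists are much longer).
    static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("the", "a", "an", "and", "of", "to", "in", "is"));

    // Lowercases the text, splits it on runs of non-letter characters,
    // drops stop words, and counts how often each remaining token occurs.
    static Map<String, Integer> count(String text) {
        Map<String, Integer> counts = new TreeMap<>(); // sorted by word for stable output
        for (String token : text.toLowerCase(Locale.ROOT).split("[^\\p{L}]+")) {
            if (token.isEmpty() || STOP_WORDS.contains(token)) {
                continue;
            }
            counts.merge(token, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // Prints {bank=2, credit=1, limit=1, pleased=1, raised=1}
        System.out.println(count("The bank raised the credit limit, and the bank is pleased."));
    }
}
```

The resulting map is a simple structured representation of the text, which is the kind of input that the pattern discovery step works on.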

Later in this chapter, we will look at two application areas: topic modeling and text categorization. Let's examine what they bring to the table.

Topic modeling

Topic modeling is an unsupervised technique and might be useful if you need to analyze a large archive of text documents and wish to understand what the archive contains, without necessarily reading every single document yourself. A text document can be a blog post, e-mail, tweet, document, book chapter, diary entry, and so on. Topic modeling looks for patterns in a corpus of text; more precisely, it identifies topics as lists of words that appear in a statistically meaningful way. The best-known algorithm is Latent Dirichlet Allocation (Blei et al., 2003), which assumes that the author composed a piece of text by selecting words from possible baskets of words, where each basket corresponds to a topic. Using this assumption, it becomes possible to mathematically decompose text into the most likely baskets from which the words first came. The algorithm then iterates over this process until it converges to the most likely distribution of words into baskets, which we call topics.

For example, if we use topic modeling on a series of news articles, the algorithm would return a list of topics and the keywords that most likely make up these topics. Using the example of news articles, the list might look similar to the following:

  • Winner, goal, football, score, first place
  • Company, stocks, bank, credit, business
  • Election, opponent, president, debate, upcoming

By looking at the keywords, we can recognize that the news articles were concerned with sports, business, an upcoming election, and so on. Later in this chapter, we will learn how to implement topic modeling using the news article example.
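The basket assumption can be made concrete with a small sketch of the generative story: for every word position, pick a basket according to the document's topic weights, then pick a word from that basket. This is not Latent Dirichlet Allocation itself, which works in the opposite direction and infers the baskets from the text. It is also a simplification, since words are drawn uniformly here, whereas in the actual model each topic has its own word probabilities. The class `BasketGenerator`, its baskets (taken from the keyword lists above), and the weights are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

class BasketGenerator {
    // Hypothetical baskets (topics), reusing the news keywords listed above.
    static final String[][] BASKETS = {
        {"winner", "goal", "football", "score"},
        {"company", "stocks", "bank", "credit", "business"},
        {"election", "opponent", "president", "debate"}
    };

    // Generates a document of `length` words. For each word, a basket is chosen
    // according to `topicWeights` (expected to sum to 1), and a word is then
    // drawn uniformly from that basket.
    static List<String> generate(double[] topicWeights, int length, Random random) {
        List<String> words = new ArrayList<>();
        for (int i = 0; i < length; i++) {
            int basket = pickBasket(topicWeights, random.nextDouble());
            String[] vocabulary = BASKETS[basket];
            words.add(vocabulary[random.nextInt(vocabulary.length)]);
        }
        return words;
    }

    // Maps a uniform draw in [0, 1) to a basket index by walking the cumulative weights.
    static int pickBasket(double[] weights, double draw) {
        double cumulative = 0.0;
        for (int i = 0; i < weights.length; i++) {
            cumulative += weights[i];
            if (draw < cumulative) {
                return i;
            }
        }
        return weights.length - 1; // guards against floating-point rounding
    }

    public static void main(String[] args) {
        // A document that is mostly about business, with some politics.
        double[] weights = {0.1, 0.7, 0.2};
        System.out.println(generate(weights, 12, new Random(42)));
    }
}
```

A topic modeling algorithm sees only documents like the generated one and tries to recover both the baskets and the per-document weights.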

Text classification

In text classification, or text categorization, the goal is to assign a text document, according to its content, to one or more classes or categories, which tend to be more general subject areas such as vehicles or pets. Such general classes are referred to as topics, and the classification task is then called text classification, text categorization, topic classification, or topic spotting. While documents can be categorized according to other attributes such as document type, author, printing year, and so on, the focus in this chapter will be on the document content only. Examples of text classification include the following:

  • Spam detection in e-mail messages, user comments, webpages, and so on
  • Detection of sexually-explicit content
  • Sentiment detection, which automatically classifies a product/service review as positive or negative
  • E-mail sorting according to e-mail content
  • Topic-specific search, where search engines restrict searches to a particular topic or genre, thus providing more accurate results

These examples show how important text classification is in information retrieval systems; hence, most modern information retrieval systems use some kind of text classifier. The classification task that we will use as an example in this book is text classification for detecting e-mail spam.
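To make the spam detection task concrete, here is a minimal sketch of a text classifier. It uses multinomial naive Bayes with Laplace smoothing, which is chosen here purely for illustration and is not necessarily the classifier used later in the chapter. The class name `TinySpamFilter`, the labels, and the training messages are all made up, and the code relies only on the Java standard library.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

class TinySpamFilter {
    private final Map<String, Map<String, Integer>> wordCounts = new HashMap<>();
    private final Map<String, Integer> totalWords = new HashMap<>();
    private final Map<String, Integer> documentCounts = new HashMap<>();
    private final Set<String> vocabulary = new HashSet<>();
    private int totalDocuments = 0;

    // Records the words of one labeled training document.
    void train(String label, String text) {
        documentCounts.merge(label, 1, Integer::sum);
        totalDocuments++;
        Map<String, Integer> counts = wordCounts.computeIfAbsent(label, k -> new HashMap<>());
        for (String token : tokenize(text)) {
            counts.merge(token, 1, Integer::sum);
            totalWords.merge(label, 1, Integer::sum);
            vocabulary.add(token);
        }
    }

    // Returns the label with the highest log-probability, or null if nothing was trained.
    String classify(String text) {
        List<String> tokens = tokenize(text);
        String best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (String label : documentCounts.keySet()) {
            // Log prior: the share of training documents carrying this label.
            double score = Math.log(documentCounts.get(label) / (double) totalDocuments);
            double denominator = totalWords.getOrDefault(label, 0) + vocabulary.size();
            for (String token : tokens) {
                int count = wordCounts.get(label).getOrDefault(token, 0);
                // Laplace (add-one) smoothing keeps unseen words from zeroing the score.
                score += Math.log((count + 1.0) / denominator);
            }
            if (score > bestScore) {
                bestScore = score;
                best = label;
            }
        }
        return best;
    }

    // Lowercases the text and splits it on runs of non-letter characters.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("[^\\p{L}]+")) {
            if (!token.isEmpty()) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        TinySpamFilter filter = new TinySpamFilter();
        filter.train("spam", "Win free money now");
        filter.train("spam", "Free prize, claim now");
        filter.train("ham", "Meeting agenda for Monday");
        filter.train("ham", "Lunch on Monday with the team");
        System.out.println(filter.classify("Claim your free money"));
    }
}
```

With only four training messages, the sketch is far too small to be useful in practice; its purpose is to show that a classifier learns word statistics per category and then assigns a new document to the category under which its words are most probable.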

We continue this chapter with an introduction to Mallet, a Java-based package for statistical natural language processing, document classification, clustering, topic modeling, information extraction, and other machine learning applications to text. We will then cover two text-analytics applications, namely, topic modeling and spam detection as text classification.
