Chapter 8. Text Mining and Natural Language Processing

Natural language processing (NLP) is ubiquitous today in applications such as mobile apps, e-commerce websites, e-mail, and news websites. Detecting spam in e-mails, characterizing e-mails, synthesizing speech, categorizing news, searching for and recommending products, and performing sentiment analysis of brands on social media are all different aspects of NLP and of mining text for information.

There has been an exponential increase in digital information that is textual in content. Web pages, e-books, SMS messages, documents of various formats, e-mails, and social media messages such as tweets and Facebook posts now amount to exabytes of data (an exabyte is 10^18 bytes). Historically, the earliest foundational work relying on automata and probabilistic modeling began in the 1950s. The 1970s saw changes such as stochastic modeling, Markov modeling, and syntactic parsing, but progress was limited during the 'AI Winter' years. The 1990s saw the emergence of text mining and a statistical revolution that included ideas of corpus statistics, supervised Machine Learning, and human annotation of text data. From the year 2000 onwards, with great progress in computing and Big Data, as well as the introduction of sophisticated Machine Learning algorithms in supervised and unsupervised learning, the area has received rekindled interest and is now among the hottest topics in research, both in academia and in the R&D departments of commercial enterprises. In this chapter, we will discuss some aspects of NLP and text mining that are essential in Machine Learning.

The chapter begins with an introduction to the key areas within NLP, and it then explains the important processing and transformation steps that make documents more suitable for Machine Learning, whether supervised or unsupervised. The concepts of topic modeling, clustering, and named entity recognition follow, along with brief descriptions of two Java toolkits that offer powerful text processing capabilities. The case study for this chapter uses another widely known dataset to demonstrate several of the techniques described here through experiments using the tools KNIME and Mallet.

The chapter is organized as follows:

  • NLP, subfields, and tasks:
    • Text categorization
    • POS tagging
    • Text clustering
    • Information extraction and named entity recognition
    • Sentiment analysis
    • Coreference resolution
    • Word-sense disambiguation
    • Machine translation
    • Semantic reasoning and inferencing
    • Summarization
    • Questions and answers
    • Issues with mining and unstructured data
  • Text processing components and transformations:
    • Document collection and standardization
    • Tokenization
    • Stop words removal
    • Stemming/Lemmatization
    • Local/Global dictionary
    • Feature extraction/generation
    • Feature representation and similarity
    • Feature selection and dimensionality reduction
  • Topics in text mining:
    • Topic modeling
    • Text clustering
    • Named entity recognition
    • Deep learning and NLP
  • Tools and usage:
    • Mallet
    • KNIME
  • Case study

NLP, subfields, and tasks

Information about the real world exists in the form of structured data, typically generated by automated processes, or unstructured data, which, in the case of text, is created by direct human agency in the form of the written or spoken word. The underlying process is very similar in both cases: real-world situations are observed, and either automated processes or humans convert what is perceived into understandable data. The transformation of the observed world into unstructured data involves complexities such as the language of the text, the format in which it exists, variances among different observers in interpreting the same data, and so on. Furthermore, the ambiguity caused by the syntax and semantics of the chosen language, subtlety in expression, the context in the data, and so on, makes the task of mining text data very difficult.

Next, we will discuss some high-level subfields and tasks that involve NLP and text mining. The subject of NLP is quite vast, and the following list of topics is in no way comprehensive.

Text categorization

This field is one of the best established, and in its basic form it classifies documents with unstructured text data into predefined categories. This can be viewed as a direct extension of supervised Machine Learning to the unstructured text world: a model learns from historic documents in order to predict the categories of unseen documents in the future. Basic methods for spam detection in e-mails and for news categorization are among the most prominent applications of this task.

Figure 1: Text Categorization showing classification into different categories
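
As a rough illustration of learning categories from historic documents, the following sketch trains a tiny multinomial Naive Bayes classifier in plain Python. The choice of Naive Bayes, the whitespace tokenizer, the function names, and the four training documents are all assumptions made for this example; they are not taken from the chapter's case study.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase and split on whitespace; a deliberately crude tokenizer."""
    return text.lower().split()


def train_naive_bayes(labeled_docs):
    """Count words per category and documents per category."""
    word_counts = defaultdict(Counter)
    doc_counts = Counter()
    for text, label in labeled_docs:
        doc_counts[label] += 1
        word_counts[label].update(tokenize(text))
    vocabulary = {w for counts in word_counts.values() for w in counts}
    return word_counts, doc_counts, vocabulary


def classify(text, model):
    """Return the category with the highest log-probability score."""
    word_counts, doc_counts, vocabulary = model
    total_docs = sum(doc_counts.values())
    best_label, best_score = None, float("-inf")
    for label in sorted(doc_counts):
        # Log prior plus log likelihoods with add-one (Laplace) smoothing.
        score = math.log(doc_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for word in tokenize(text):
            count = word_counts[label][word]
            score += math.log((count + 1) / (total_words + len(vocabulary)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Made-up training data, for illustration only.
training = [
    ("win a free prize now", "spam"),
    ("free money click now", "spam"),
    ("meeting agenda for monday", "ham"),
    ("please review the project report", "ham"),
]
model = train_naive_bayes(training)
print(classify("claim your free prize", model))
print(classify("monday project meeting", model))
```

With realistic data, the same structure applies, but the tokenizer and the feature representation would normally be replaced by the processing steps described later in the chapter.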

Part-of-speech tagging (POS tagging)

Another subtask in NLP that has seen a lot of success is associating parts of speech, such as nouns, adjectives, and verbs, with the words in a text, based on context and on the relationship to adjacent words. Today, automated and sophisticated POS taggers perform the job that was once done manually.

Figure 2: POS tags associated with a segment of text
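
To make the role of context concrete, here is a toy tagger in plain Python. It is only a sketch: the lexicon, the suffix heuristics, and the single context rule are hypothetical and hand-written for this example, whereas real taggers learn such regularities from annotated corpora. The tag names follow the Universal POS tag set.

```python
# Tiny hand-made lexicon; real taggers learn this from annotated corpora.
LEXICON = {
    "the": "DET", "a": "DET",
    "dog": "NOUN", "park": "NOUN",
    "quick": "ADJ", "runs": "VERB", "in": "ADP",
    "can": "AUX",
}


def tag(tokens):
    """Assign a tag from the lexicon, then adjust it using the previous tag."""
    tags = []
    for token in tokens:
        word = token.lower()
        guess = LEXICON.get(word)
        if guess is None:
            # Fallback suffix heuristics for words missing from the lexicon.
            if word.endswith("ly"):
                guess = "ADV"
            elif word.endswith("ing") or word.endswith("ed"):
                guess = "VERB"
            else:
                guess = "NOUN"
        # Context rule: a verb-like word right after a determiner is
        # treated as a noun ("the can" versus "dogs can").
        if tags and tags[-1] == "DET" and guess in ("AUX", "VERB"):
            guess = "NOUN"
        tags.append(guess)
    return list(zip(tokens, tags))


print(tag("The quick dog runs in the park".split()))
print(tag(["the", "can"]))
```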

Text clustering

Text clustering is the subfield concerned with grouping unstructured documents by similarity for the purposes of organization and retrieval. This field is also well developed, with advances both in clustering algorithms and in text representations suited for learning.

Figure 3: Clustering of OS news documents to various OS specific clusters
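
The idea of grouping by similarity can be sketched with word-count vectors and cosine similarity. The single-pass clustering scheme, the threshold of 0.3, and the four short "news" documents below are assumptions chosen to keep the example small; they are not the algorithms used in the case study.

```python
import math
from collections import Counter


def vectorize(text):
    """Represent a document as a bag of lowercase word counts."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def cluster(docs, threshold=0.3):
    """Single-pass clustering: a document joins the first cluster whose
    seed document is similar enough, otherwise it starts a new cluster."""
    vectors = [vectorize(d) for d in docs]
    clusters = []  # each cluster is a list of document indices
    for i, vec in enumerate(vectors):
        for members in clusters:
            if cosine(vec, vectors[members[0]]) >= threshold:
                members.append(i)
                break
        else:
            clusters.append([i])
    return clusters


# Made-up documents, for illustration only.
docs = [
    "windows update fixes kernel bug",
    "linux kernel release adds drivers",
    "windows update breaks printer drivers",
    "linux kernel patch improves drivers",
]
print(cluster(docs))
```

The result depends on document order and on the threshold, which is one reason iterative methods such as k-means are usually preferred in practice.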

Information extraction and named entity recognition

The task of extracting specific elements, such as time, location, organization, entities, and so on, comes under the topic of information extraction. Named entity recognition is a subfield that has wide applications in different domains, from reviews of historical documents to bioinformatics with gene and drug information.

Figure 4: Named Entity Recognition in a sentence
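
A very simple way to picture entity extraction is a lookup list (gazetteer) combined with a regular expression for dates, as in the sketch below. The gazetteer entries, the label names, and the date pattern are hypothetical and cover only this example sentence; production systems rely on much larger resources or on learned sequence models.

```python
import re

# Hypothetical gazetteer; real systems use far larger lists or learned models.
GAZETTEER = {
    "ORGANIZATION": ["Acme Corp", "United Nations"],
    "LOCATION": ["New York", "Geneva"],
}
MONTHS = ("January|February|March|April|May|June|July|August|"
          "September|October|November|December")
DATE_PATTERN = re.compile(r"\b\d{1,2} (?:" + MONTHS + r") \d{4}\b")


def extract_entities(text):
    """Return (surface form, label) pairs in order of appearance."""
    found = []
    for label, names in GAZETTEER.items():
        for name in names:
            for match in re.finditer(re.escape(name), text):
                found.append((match.start(), match.group(), label))
    for match in DATE_PATTERN.finditer(text):
        found.append((match.start(), match.group(), "DATE"))
    found.sort()  # order by character offset in the text
    return [(surface, label) for _, surface, label in found]


sentence = "Acme Corp opened an office in Geneva on 3 March 2015."
print(extract_entities(sentence))
```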

Sentiment analysis and opinion mining

Another subfield in the area of NLP involves inferring the sentiments of observers in order to categorize them with an understandable metric or to give insights into their opinions. This area is not as advanced as some of the ones mentioned previously, but much research is being done in this direction.

Figure 5: Sentiment Analysis showing positive and negative sentiments for sentences
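
One of the simplest ways to turn sentiment into a metric is to count words from positive and negative word lists. The sketch below does this with a naive negation rule; the word lists and the rule that a negation flips only the next word are assumptions made for illustration, and the approach misses sarcasm, intensifiers, and longer-range negation.

```python
# Hypothetical mini-lexicons; real sentiment lexicons contain thousands of entries.
POSITIVE = {"good", "great", "love", "excellent", "happy"}
NEGATIVE = {"bad", "terrible", "hate", "poor", "awful"}
NEGATIONS = {"not", "never", "no"}


def sentiment_score(text):
    """Sum word polarities; a negation flips only the word right after it."""
    score = 0
    negate = False
    for raw in text.lower().split():
        word = raw.strip(".,!?;:")
        if word in NEGATIONS:
            negate = True
            continue
        # +1 for a positive word, -1 for a negative word, 0 otherwise.
        polarity = (word in POSITIVE) - (word in NEGATIVE)
        score += -polarity if negate else polarity
        negate = False
    return score


def sentiment_label(text):
    """Map the numeric score to a coarse label."""
    score = sentiment_score(text)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


print(sentiment_label("I love this great phone"))
print(sentiment_label("The battery is not good"))
```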

Coreference resolution

Identifying the different expressions in a text that refer to the same entity, and resolving each reference to the entity it denotes, is another popular area of NLP. It is considered a stepping stone toward more complex tasks such as question answering and summarization, which will be discussed later.

Figure 6: Coreference resolution showing how pronouns get disambiguated

Word sense disambiguation

In a language such as English, since the same word can have multiple meanings based on the context, deciphering this automatically is an important part of NLP, and the focus of word sense disambiguation (WSD).

Figure 7: Word sense disambiguation showing how the word "mouse" is associated with the right sense using the context
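
A classic way to use context for this task is the simplified Lesk approach: pick the sense whose dictionary gloss shares the most words with the surrounding sentence. The sketch below applies it to "mouse"; the two glosses, the sense names, and the stop word list are written by hand for this example and are not drawn from any real lexical resource.

```python
# Hypothetical sense inventory; a real system would take glosses from a
# lexical resource such as WordNet.
SENSES = {
    "mouse": {
        "animal": "small rodent with a long tail that eats cheese and grain",
        "device": "hand operated computer device used to move a cursor on a screen",
    }
}
STOP_WORDS = {"a", "an", "the", "to", "on", "with", "and", "that",
              "of", "my", "is", "in"}


def content_words(text):
    """Lowercase, strip punctuation, and drop stop words."""
    return {w.strip(".,!?") for w in text.lower().split()} - STOP_WORDS


def disambiguate(word, sentence):
    """Return the sense whose gloss overlaps most with the sentence."""
    context = content_words(sentence)
    best_sense, best_overlap = None, -1
    for sense, gloss in SENSES[word].items():
        overlap = len(context & content_words(gloss))
        if overlap > best_overlap:
            best_sense, best_overlap = sense, overlap
    return best_sense


print(disambiguate("mouse", "The mouse ate the cheese in the kitchen"))
print(disambiguate("mouse", "I clicked the mouse to move the cursor on my screen"))
```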

Machine translation

Translating text from one language to another, or from speech in one language to text in another, is broadly covered in the area of machine translation (MT). This field has made significant progress in the last few years with the use of Machine Learning algorithms in supervised, unsupervised, and semi-supervised learning. Deep learning techniques such as LSTM networks have proved highly effective in this area and are used by Google for its translation service.

Figure 8: Machine Translation showing English to Chinese conversion

Semantic reasoning and inferencing

Reasoning, deriving logic, and inferencing from unstructured text is the next level of advancement in NLP.

Figure 9: Semantic Inferencing answering complex questions

Text summarization

A subfield of NLP that is growing in popularity, and still a young research area, is the automated summarization of large documents or passages of text into a short representative text that can be easily understood. Applications that benefit from this field include the summaries shown by search engines, multi-document summarization for experts, and so on.
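
One simple family of methods is extractive summarization, which selects the most representative sentences rather than writing new ones. The sketch below scores each sentence by the average frequency of its words in the whole passage; the scoring rule, the stop word list, and the sample passage are assumptions for illustration and are far simpler than current research systems.

```python
import re
from collections import Counter

STOP_WORDS = {"the", "was", "from", "into", "a", "an", "of", "and", "is", "to"}


def content_tokens(text):
    """Lowercase word tokens with stop words removed."""
    return [w for w in re.findall(r"[a-z']+", text.lower())
            if w not in STOP_WORDS]


def summarize(text, max_sentences=1):
    """Keep the sentences whose words are most frequent in the passage."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text)
                 if s.strip()]
    freq = Counter(content_tokens(text))

    def score(sentence):
        tokens = content_tokens(sentence)
        # Average word frequency, so long sentences are not favored unfairly.
        return sum(freq[w] for w in tokens) / len(tokens) if tokens else 0.0

    ranked = sorted(range(len(sentences)),
                    key=lambda i: score(sentences[i]), reverse=True)
    chosen = sorted(ranked[:max_sentences])  # restore original order
    return " ".join(sentences[i] for i in chosen)


# Made-up passage, for illustration only.
passage = ("Text mining extracts information from text. "
           "The weather was pleasant yesterday. "
           "Text mining methods turn text into features.")
print(summarize(passage, max_sentences=1))
```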

Automating question and answers

Answering questions posed by humans in natural language, ranging from questions specific to a certain domain to generic, open-ended questions, is another emerging field in the area of NLP.
