Natural language processing (NLP) is ubiquitous today in applications such as mobile apps, e-commerce websites, e-mail, news websites, and more. Detecting spam in e-mails, characterizing e-mails, synthesizing speech, categorizing news, searching for and recommending products, and performing sentiment analysis on brands in social media are all different aspects of NLP and of mining text for information.
Digital information that is textual in content, in the form of web pages, e-books, SMS messages, documents of various formats, e-mails, and social media messages such as tweets and Facebook posts, has increased exponentially and now ranges in exabytes (an exabyte is 10^18 bytes). Historically, the earliest foundational work relying on automata and probabilistic modeling began in the 1950s. The 1970s saw changes such as stochastic modeling, Markov modeling, and syntactic parsing, but progress was limited during the 'AI Winter' years. The 1990s saw the emergence of text mining and a statistical revolution that included ideas of corpus statistics, supervised Machine Learning, and human annotation of text data. From the year 2000 onwards, with great progress in computing and Big Data, as well as the introduction of sophisticated Machine Learning algorithms in supervised and unsupervised learning, the area has received rekindled interest and is now among the hottest topics in research, both in academia and in the R&D departments of commercial enterprises. In this chapter, we will discuss some aspects of NLP and text mining that are essential in Machine Learning.
The chapter begins with an introduction to the key areas within NLP, and it then explains the important processing and transformation steps that make documents more suitable for Machine Learning, whether supervised or unsupervised. The concepts of topic modeling, clustering, and named entity recognition follow, with brief descriptions of two Java toolkits that offer powerful text processing capabilities. The case study for this chapter uses another widely known dataset to demonstrate several of the techniques described here through experiments using the tools KNIME and Mallet.
The chapter is organized as follows:
Information about the real world exists in the form of structured data, typically generated by automated processes, or unstructured data, which, in the case of text, is created by direct human agency in the form of the written or spoken word. The process of observing real-world situations and using either automated processes or having humans perceive and convert that information into understandable data is very similar in both structured and unstructured data. The transformation of the observed world into unstructured data involves complexities such as the language of the text, the format in which it exists, variances among different observers in interpreting the same data, and so on. Furthermore, the ambiguity caused by the syntax and semantics of the chosen language, subtlety in expression, the context in the data, and so on, make the task of mining text data very difficult.
Next, we will discuss some high-level subfields and tasks that involve NLP and text mining. The subject of NLP is quite vast, and the following list of topics is in no way comprehensive.
Text categorization is one of the most well-established fields, and in its basic form it classifies documents with unstructured text data into predefined categories. This can be viewed as a direct extension of supervised Machine Learning into the unstructured text world: a model learns from historic documents to predict the categories of unseen documents in the future. Basic methods for spam detection in e-mails and for news categorization are among the most prominent applications of this task.
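To make the idea of learning categories from historic documents concrete, here is a minimal sketch in Python of a multinomial Naive Bayes classifier with add-one smoothing. Naive Bayes is only one of many possible learners, and the four-document training set, the whitespace tokenizer, and the function names `train` and `predict` are all assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split on whitespace (a deliberately crude tokenizer)."""
    return text.lower().split()


def train(labeled_docs):
    """Count documents per category and word occurrences per category."""
    doc_counts = Counter()
    word_counts = defaultdict(Counter)
    vocabulary = set()
    for text, label in labeled_docs:
        doc_counts[label] += 1
        for token in tokenize(text):
            word_counts[label][token] += 1
            vocabulary.add(token)
    return doc_counts, word_counts, vocabulary


def predict(model, text):
    """Return the category with the highest log-posterior score."""
    doc_counts, word_counts, vocabulary = model
    total_docs = sum(doc_counts.values())
    best_label, best_score = None, float("-inf")
    for label in doc_counts:
        # Start from the log prior of the category.
        score = math.log(doc_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for token in tokenize(text):
            # Add-one smoothing keeps unseen words from zeroing out the score.
            count = word_counts[label][token] + 1
            score += math.log(count / (total_words + len(vocabulary)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical toy training set; real systems learn from far more documents.
TRAINING = [
    ("win a free prize now", "spam"),
    ("free money click now", "spam"),
    ("meeting agenda for monday", "ham"),
    ("please review the project report", "ham"),
]

model = train(TRAINING)
print(predict(model, "free prize money"))
print(predict(model, "project meeting on monday"))
```

Working in log space avoids numeric underflow when many small probabilities are multiplied, which is why the scores are sums of logarithms rather than products.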
Another subtask in NLP that has seen a lot of success is associating parts of speech (POS) of the language, such as nouns, adjectives, and verbs, with words in a text, based on context and relationship to adjacent words. Today, instead of manual POS tagging, automated and sophisticated POS taggers perform the job.
Clustering unstructured data for organization, retrieval, and grouping based on similarity is the subfield of text clustering. This field is also well-developed, with advancements in different clustering algorithms and text representations suited for learning.
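As a rough illustration of grouping by similarity, the following Python sketch represents each document as a bag of words and greedily assigns it to the first cluster whose seed document is similar enough under cosine similarity. The greedy scheme, the threshold of 0.3, and the sample documents are assumptions chosen for brevity, not a recommended clustering algorithm.

```python
import math
from collections import Counter


def bag_of_words(text):
    """Map each lowercase token to its count in the text."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse term-count vectors."""
    dot = sum(a[token] * b[token] for token in a if token in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def greedy_cluster(documents, threshold=0.3):
    """Put each document in the first cluster whose seed is similar enough,
    or start a new cluster when none qualifies."""
    vectors = [bag_of_words(doc) for doc in documents]
    clusters = []  # lists of document indices; the first index is the seed
    for index, vector in enumerate(vectors):
        for cluster in clusters:
            if cosine_similarity(vector, vectors[cluster[0]]) >= threshold:
                cluster.append(index)
                break
        else:
            clusters.append([index])
    return clusters


# Hypothetical sample documents.
DOCS = [
    "stock markets fell sharply today",
    "the football team won the match",
    "stock markets rallied after the report",
    "the football match ended in a draw",
]
print(greedy_cluster(DOCS))  # [[0, 2], [1, 3]]
```

Note that the common word "the" contributes to the similarity between the two football documents, which hints at why stop word removal and term weighting matter before clustering.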
The task of extracting specific elements, such as time, location, organization, entities, and so on, comes under the topic of information extraction. Named entity recognition is a subfield that has wide applications in different domains, from reviews of historical documents to bioinformatics with gene and drug information.
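A very simple way to picture information extraction is rule-based matching. The Python sketch below pulls out times and ISO-style dates with regular expressions and finds organizations and locations by lookup in small dictionaries (gazetteers). The patterns, the gazetteer entries, and the sample sentence are all hypothetical; practical extractors generally rely on learned models rather than fixed rules.

```python
import re

# 24-hour times such as 09:30 and ISO-style dates such as 2016-03-14.
TIME_PATTERN = re.compile(r"\b(?:[01]?\d|2[0-3]):[0-5]\d\b")
DATE_PATTERN = re.compile(r"\b\d{4}-\d{2}-\d{2}\b")

# Hypothetical gazetteers: fixed lists of known names.
ORGANIZATIONS = {"Acme Corp", "Globex"}
LOCATIONS = {"Boston"}


def extract(text):
    """Return the times, dates, organizations, and locations found in the text."""
    return {
        "time": TIME_PATTERN.findall(text),
        "date": DATE_PATTERN.findall(text),
        "organization": sorted(name for name in ORGANIZATIONS if name in text),
        "location": sorted(name for name in LOCATIONS if name in text),
    }


result = extract("Acme Corp will meet Globex in Boston on 2016-03-14 at 09:30.")
print(result)
```

Fixed rules like these break down quickly on unseen names and formats, which is one reason named entity recognition is usually treated as a learning problem.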
Another subfield in the area of NLP, sentiment analysis, involves inferring the sentiments of observers in order to categorize them with an understandable metric or to give insights into their opinions. This area is not as advanced as some of the ones mentioned previously, but much research is being done in this direction.
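One simple way to turn sentiment into an understandable metric is to count words from positive and negative word lists. The Python sketch below does exactly that; the two tiny lexicons and the function names are assumptions for illustration, and the approach ignores negation, sarcasm, and context, which is part of what makes this area hard.

```python
# Hypothetical miniature sentiment lexicons.
POSITIVE = {"good", "great", "love", "excellent"}
NEGATIVE = {"bad", "poor", "hate", "terrible"}


def sentiment_score(text):
    """Number of positive words minus number of negative words."""
    tokens = [token.strip(".,!?").lower() for token in text.split()]
    positives = sum(token in POSITIVE for token in tokens)
    negatives = sum(token in NEGATIVE for token in tokens)
    return positives - negatives


def sentiment_label(text):
    """Map the numeric score to a coarse category."""
    score = sentiment_score(text)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


print(sentiment_label("I love this phone, the camera is great!"))
print(sentiment_label("Terrible battery and poor support."))
print(sentiment_label("The package arrived on Tuesday."))
```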
Understanding references to multiple entities in the text and disambiguating those references, a task known as coreference resolution, is another popular area of NLP. It is considered a stepping stone toward more complex tasks such as question answering and summarization, which will be discussed later.
In a language such as English, since the same word can have multiple meanings based on the context, deciphering this automatically is an important part of NLP, and the focus of word sense disambiguation (WSD).
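The following Python sketch shows one classic heuristic for this task, a simplified form of the Lesk algorithm: pick the sense whose dictionary definition shares the most words with the surrounding sentence. The two-sense inventory for "bank", its definitions, and the stop word list are hypothetical stand-ins for a real lexical resource.

```python
# Hypothetical sense inventory: each sense of a word maps to a short definition.
SENSES = {
    "bank": {
        "financial institution": "an institution that accepts deposits and lends money",
        "river edge": "the sloping land beside a river or stream of water",
    }
}

STOPWORDS = {"a", "an", "the", "of", "or", "and", "that", "to", "in", "on", "at", "i", "we"}


def content_words(text):
    """Lowercase tokens with punctuation stripped and stop words removed."""
    return {token.strip(".,!?").lower() for token in text.split()} - STOPWORDS


def disambiguate(word, sentence):
    """Choose the sense whose definition overlaps most with the sentence."""
    context = content_words(sentence)
    best_sense, best_overlap = None, -1
    for sense, definition in SENSES[word].items():
        overlap = len(context & content_words(definition))
        if overlap > best_overlap:
            best_sense, best_overlap = sense, overlap
    return best_sense


print(disambiguate("bank", "I went to the bank to deposit money"))
print(disambiguate("bank", "We sat on the bank of the river near the water"))
```

Because the overlap is computed on exact word forms, "deposit" does not match "deposits" here; stemming or lemmatization would normally be applied first.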
Translating text from one language to another, or from speech to text in different languages, is broadly covered in the area of machine translation (MT). This field has made significant progress in the last few years with the use of Machine Learning algorithms in supervised, unsupervised, and semi-supervised learning. Deep learning with techniques such as LSTM has proved to be the most effective approach in this area and is widely used by Google for its translation.
Reasoning, deriving logic, and drawing inferences from unstructured text is the next level of advancement in NLP.
A subfield that is growing in popularity in NLP is the automated summarization of large documents or passages of text into a short representative text that can be easily understood. This is one of the budding research areas in NLP. Summaries shown by search engines and multi-document summaries prepared for experts are some of the applications that benefit from this field.
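A minimal sketch of one approach, extractive summarization, is shown below in Python: each sentence is scored by the average corpus frequency of its non-stop words, and the highest-scoring sentences are returned in their original order. The scoring rule, the stop word list, and the sample text are assumptions for illustration; many other summarization methods exist.

```python
import re
from collections import Counter

STOPWORDS = {"a", "an", "and", "for", "from", "in", "is", "of", "the", "to", "was"}


def content_tokens(text):
    """Lowercase alphabetic tokens with stop words removed."""
    return [w for w in re.findall(r"[a-z']+", text.lower()) if w not in STOPWORDS]


def summarize(text, max_sentences=1):
    """Return the highest-scoring sentences, kept in document order."""
    # Split after sentence-ending punctuation followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    frequencies = Counter(content_tokens(text))

    def score(sentence):
        tokens = content_tokens(sentence)
        if not tokens:
            return 0.0
        # Average frequency, so long sentences are not favored automatically.
        return sum(frequencies[t] for t in tokens) / len(tokens)

    ranked = sorted(range(len(sentences)), key=lambda i: score(sentences[i]), reverse=True)
    chosen = sorted(ranked[:max_sentences])
    return " ".join(sentences[i] for i in chosen)


# Hypothetical sample passage.
TEXT = (
    "Text mining extracts information from text. "
    "The weather was pleasant. "
    "Text mining is used for information retrieval."
)
print(summarize(TEXT, max_sentences=1))
```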