Summary

A large proportion of the information in the digital world is textual. Text mining and NLP are concerned with extracting information from this unstructured form of data. Several important subareas of the field are active topics of research today, and an understanding of them is essential for data scientists.

Text categorization is concerned with classifying documents into predetermined categories. Text may be enriched by annotating words, as with POS tagging, to give it more structure for subsequent processing tasks to act on. Unsupervised techniques such as clustering can be applied to documents as well. Information extraction and named entity recognition help identify information-rich specifics such as locations and the names of people or organizations. Summarization, another important application, produces concise abstracts of larger documents or sets of documents. Ambiguities of language and semantics, such as those arising from context, word sense, and reasoning, make NLP tasks challenging.

Transformations of text content include tokenization, stop word removal, and word stemming, all of which standardize the corpus so that Machine Learning techniques can be applied productively. Next, lexical, semantic, and syntactic features are extracted so that each document can be represented numerically, typically as a vector in a vector space model. Similarity and distance measures can then be applied to compare documents effectively. Dimensionality reduction is key because of the large number of features typically present. Techniques for topic modeling (including PLSA), text clustering, and named entity recognition are described in detail in this chapter. Finally, recent techniques that apply deep learning to various areas of NLP are introduced.
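
The preprocessing and comparison steps summarized above can be sketched in a few lines of Python. This is only an illustrative sketch using the standard library, not the Mallet or KNIME workflow used in the chapter: the stop word list, the suffix-stripping `crude_stem` function (a stand-in for a real stemmer such as Porter's), and all function names are assumptions made for the example. Documents are represented as raw term-frequency vectors and compared with cosine similarity.

```python
import math
import re
from collections import Counter

# Tiny illustrative stop word list (an assumption, not a standard list).
STOP_WORDS = {"a", "an", "and", "the", "is", "are", "of", "in", "on", "to"}


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def crude_stem(token):
    """Strip a few common suffixes; a rough stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "es", "s"):
        # Keep at least three characters so short words are left alone.
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(text):
    """Tokenize, drop stop words, then stem the remaining tokens."""
    return [crude_stem(t) for t in tokenize(text) if t not in STOP_WORDS]


def term_frequency_vector(tokens):
    """Represent a document as a sparse term-frequency vector."""
    return Counter(tokens)


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # An empty document is treated as similar to nothing.
    return dot / (norm_a * norm_b)


if __name__ == "__main__":
    doc1 = term_frequency_vector(preprocess("The cats are chasing mice"))
    doc2 = term_frequency_vector(preprocess("A cat chased the mice"))
    print(cosine_similarity(doc1, doc2))
```

In practice, raw term counts are usually replaced by a weighting such as TF-IDF, and the resulting high-dimensional vectors are the input to the dimensionality reduction step mentioned above.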

Mallet and KNIME are two open-source, Java-based tools that provide powerful NLP and Machine Learning capabilities. The case study examines the performance of different classifiers on the Reuters corpus using KNIME.
