Introducing text mining

Text mining, or text analytics, refers to the process of automatically extracting high-quality information from text documents, most often written in natural language. Here, high-quality information means information that is relevant, novel, and interesting.

A typical text analytics application scans a set of documents to generate a search index, but text mining has many other applications: text categorization into specific domains; text clustering to automatically organize a set of documents; sentiment analysis to identify and extract subjective information in documents; concept or entity extraction to identify people, places, organizations, and other entities in documents; document summarization to automatically provide the most important points of the original document; and learning relations between named entities.

A text mining process based on statistical pattern mining usually involves the following steps:

  1. Information retrieval and extraction
  2. Transforming unstructured text data into structured data, for example, by parsing, removing noisy words, performing lexical analysis, calculating word frequencies, and deriving linguistic features
  3. Discovery of patterns from structured data and tagging or annotation
  4. Evaluation and interpretation of the results
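
To make step 2 concrete, the following is a minimal Python sketch of turning unstructured text into structured data by tokenizing, dropping noisy words, and counting word frequencies. It is an illustration under stated assumptions rather than a definitive implementation: the stop-word list, the sample documents, and the function names `tokenize` and `word_frequencies` are all hypothetical and chosen only for this example.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word list (an assumption for this sketch);
# real applications typically rely on much larger, curated lists.
STOP_WORDS = {"a", "an", "and", "the", "is", "of", "to", "in", "on", "it"}


def tokenize(text):
    """Lowercase the text and split it into runs of ASCII letters."""
    return re.findall(r"[a-z]+", text.lower())


def word_frequencies(text, stop_words=STOP_WORDS):
    """Return a Counter of token frequencies with stop words removed."""
    tokens = [t for t in tokenize(text) if t not in stop_words]
    return Counter(tokens)


# Made-up sample documents standing in for a retrieved collection.
documents = [
    "The cat sat on the mat.",
    "The dog chased the cat in the garden.",
]

# One structured record (a bag of words) per unstructured document.
bags = [word_frequencies(doc) for doc in documents]

for doc, bag in zip(documents, bags):
    print(doc, "->", dict(bag))
```

Each resulting bag of words is a structured representation that a pattern discovery step (step 3) could take as input. The simple regular-expression tokenizer is a design shortcut: it ignores digits, non-ASCII letters, and punctuation-sensitive tokens, so a real pipeline would likely swap in a proper tokenizer.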

Later in this chapter, we will look at two application areas: topic modeling and text categorization. Let's examine what they bring to the table.
