Text mining

Text mining works on textual data. It is concerned with extracting relevant information from large bodies of natural language text and with searching for interesting relationships, syntactic correlations, or semantic associations between the extracted entities or terms. It is also defined as the automatic or semiautomatic processing of text. Related techniques include text clustering, text classification, natural language processing, and web mining.

One characteristic of text mining is that text is mixed with numbers; seen another way, the source dataset contains hybrid data types. The text is usually a collection of unstructured documents, which are preprocessed and transformed into a structured, numerical representation. After the transformation, most data mining algorithms can be applied effectively.

The process of text mining is described as follows:

  • The first step is to prepare the text corpus, which consists of reports, letters, and so forth
  • The second step is to build a semistructured text database based on the text corpus
  • The third step is to build a term-document matrix that records the term frequencies
  • The final step is further analysis, such as text analysis, semantic analysis, information retrieval, and information summarization
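
The third step can be illustrated with a minimal sketch in Python. The tokenizer, the function names, and the two-document corpus below are assumptions made for illustration only; a real pipeline would typically add steps such as stop-word removal and stemming.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def term_document_matrix(corpus):
    """Return (terms, matrix); matrix[i][j] is the count of terms[i] in document j."""
    counts = [Counter(tokenize(doc)) for doc in corpus]
    # The vocabulary is the sorted union of the terms seen in any document.
    terms = sorted(set().union(*counts))
    matrix = [[c[term] for c in counts] for term in terms]
    return terms, matrix


# Hypothetical toy corpus: one "report" and one "letter".
corpus = [
    "The report covers sales. Sales rose.",
    "The letter covers travel plans.",
]

terms, matrix = term_document_matrix(corpus)
for term, row in zip(terms, matrix):
    print(f"{term:8s} {row}")
```

Each row is a term and each column is a document, so the unstructured text becomes a numerical table that ordinary data mining algorithms can consume.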

Information retrieval and text mining

Information retrieval helps users find information and is most commonly associated with online documents. It focuses on the acquisition, organization, storage, retrieval, and distribution of information. The task of Information Retrieval (IR) is to retrieve relevant documents in response to a query. The fundamental technique of IR is measuring similarity. The key steps in IR are as follows:

  • Specify a query. The following are some of the types of queries:
    • Keyword query: This is expressed by a list of keywords to find documents that contain at least one keyword
    • Boolean query: This is constructed with Boolean operators and keywords
    • Phrase query: This is a query that consists of a sequence of words that makes up a phrase
    • Proximity query: This is a relaxed version of the phrase query, in which the terms need only occur close to one another, and it can be a combination of keywords and phrases
    • Full document query: The query is itself a full document, used to find other documents similar to it
    • Natural language questions: This query expresses the user's requirement as a natural language question
  • Search the document collection.
  • Return the subset of relevant documents.
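
The three steps above can be sketched for a simple keyword query. This sketch assumes cosine similarity over raw term counts as the similarity measure, which is one common choice and not necessarily the one intended here; the function names and sample documents are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def cosine_similarity(a, b):
    """Cosine of the angle between two term-frequency vectors (Counters)."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def search(query, documents):
    """Return (index, score) pairs for documents sharing a term with the query, best first."""
    q = Counter(tokenize(query))
    scored = [
        (i, cosine_similarity(q, Counter(tokenize(doc))))
        for i, doc in enumerate(documents)
    ]
    # A nonzero score means the document contains at least one query keyword.
    hits = [(i, score) for i, score in scored if score > 0]
    return sorted(hits, key=lambda pair: pair[1], reverse=True)


# Hypothetical document collection.
documents = [
    "Text mining extracts information from text.",
    "Data mining works on numerical data.",
    "The weather is sunny today.",
]

results = search("text mining", documents)
for index, score in results:
    print(f"{score:.2f}  {documents[index]}")
```

The third document shares no term with the query, so it is left out of the returned subset. Production systems usually weight terms (for example with TF-IDF) and use an inverted index instead of scanning every document.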

Mining text for prediction

Predicting results from text is just as ambitious as prediction in numerical data mining, and it shares the problems associated with numerical classification. It is generally a classification problem.

Prediction from text requires prior experience, in the form of sample documents, from which to learn how to make predictions on new documents. Once text is transformed into numeric data, prediction methods can be applied.
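
As an illustration of learning from labeled samples, the following sketch trains a multinomial naive Bayes classifier with add-one smoothing on term counts. The choice of classifier is an assumption, since no specific method is named above, and the labels and sample texts are invented for the example.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def train(samples):
    """Collect per-class document counts and term counts from (text, label) pairs."""
    class_docs = Counter()
    term_counts = defaultdict(Counter)
    vocabulary = set()
    for text, label in samples:
        tokens = tokenize(text)
        class_docs[label] += 1
        term_counts[label].update(tokens)
        vocabulary.update(tokens)
    return class_docs, term_counts, vocabulary


def predict(model, text):
    """Return the label with the highest log-posterior for the given text."""
    class_docs, term_counts, vocabulary = model
    total_docs = sum(class_docs.values())
    best_label, best_score = None, -math.inf
    for label in class_docs:
        score = math.log(class_docs[label] / total_docs)  # log prior
        total_terms = sum(term_counts[label].values())
        for token in tokenize(text):
            if token in vocabulary:  # terms never seen in training are ignored
                count = term_counts[label][token]
                # Add-one (Laplace) smoothing avoids zero probabilities.
                score += math.log((count + 1) / (total_terms + len(vocabulary)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical labeled sample documents.
samples = [
    ("cheap pills buy now", "spam"),
    ("win money now", "spam"),
    ("meeting agenda for monday", "ham"),
    ("project report attached", "ham"),
]

model = train(samples)
print(predict(model, "buy cheap money"))
print(predict(model, "monday project meeting"))
```

The classifier never sees raw text: both training and prediction operate on term counts, which is the numeric representation described above.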
