In this chapter, you will learn algorithms, written in R, for text mining and web data mining.
For text mining, semi-structured and unstructured documents are the main datasets. The major categories of text mining include clustering, document retrieval and representation, and anomaly detection. Applications of text mining include, but are not limited to, topic tracking, text summarization, and text categorization.
Web content mining, web structure mining, and web usage mining are applications of web mining. Web mining is also used for user behavior modeling, personalized views, content annotation, and so on. Viewed another way, web mining integrates the results of traditional data-mining techniques with information from the World Wide Web.
In this chapter, we will cover the following topics:
Because text and documents have their own characteristics, traditional data-mining algorithms need some minor adjustments or extensions before they can be applied to text mining. The classical text-mining process is as follows:
The popular text-clustering algorithms include the distance-based clustering algorithm, the hierarchical clustering algorithm, the partition-based clustering algorithm, and so on.
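As a minimal sketch of hierarchical text clustering, the following base R code builds a small document-term matrix, computes cosine distances between documents, and clusters them with `hclust`. The toy corpus, the tokenization rule, the choice of cosine distance, and average linkage are all assumptions made for illustration, not a method prescribed by this chapter.

```r
# Toy corpus (hypothetical data, for illustration only)
docs <- c(
  d1 = "data mining finds patterns in data",
  d2 = "text mining finds patterns in text",
  d3 = "the cat sat on the mat",
  d4 = "the cat chased the mouse on the mat"
)

# Tokenize: lowercase, split on anything that is not a letter
tokens <- lapply(docs, function(d) {
  w <- strsplit(tolower(d), "[^a-z]+")[[1]]
  w[nzchar(w)]
})
names(tokens) <- names(docs)

# Document-term matrix of raw term counts (rows = documents)
vocab <- sort(unique(unlist(tokens)))
dtm <- t(vapply(
  tokens,
  function(w) as.numeric(table(factor(w, levels = vocab))),
  numeric(length(vocab))
))
colnames(dtm) <- vocab

# Cosine distance between documents
norms <- sqrt(rowSums(dtm^2))
cos_sim <- (dtm %*% t(dtm)) / (norms %o% norms)
cos_dist <- as.dist(1 - cos_sim)

# Agglomerative hierarchical clustering, cut into two groups
hc <- hclust(cos_dist, method = "average")
clusters <- cutree(hc, k = 2)
print(clusters)
```

Cosine distance is used here because it ignores document length; Euclidean distance on raw counts would tend to group documents by size rather than by vocabulary.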
The popular text-classification algorithms include decision trees, pattern-based classification, SVM classification, Bayesian classification, and so on.
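To make Bayesian text classification concrete, here is a minimal sketch of a multinomial naive Bayes classifier in base R with add-one (Laplace) smoothing. The training sentences, the labels `spam` and `ham`, and the helper names `tokenize` and `classify` are hypothetical and chosen only for this example.

```r
# Toy labeled training set (hypothetical data, for illustration only)
train <- data.frame(
  text = c("cheap pills buy now", "buy cheap watches now",
           "meeting agenda for monday", "project meeting notes attached"),
  label = c("spam", "spam", "ham", "ham"),
  stringsAsFactors = FALSE
)

# Lowercase and split on anything that is not a letter
tokenize <- function(s) {
  w <- strsplit(tolower(s), "[^a-z]+")[[1]]
  w[nzchar(w)]
}

train_tokens <- lapply(train$text, tokenize)
vocab <- sort(unique(unlist(train_tokens)))
classes <- sort(unique(train$label))

# Log-likelihood of each word given each class, with add-one smoothing
log_lik <- vapply(classes, function(cl) {
  w <- unlist(train_tokens[train$label == cl])
  counts <- as.numeric(table(factor(w, levels = vocab)))
  log((counts + 1) / (sum(counts) + length(vocab)))
}, numeric(length(vocab)))
rownames(log_lik) <- vocab

# Log prior probability of each class
log_prior <- log(as.numeric(table(factor(train$label, levels = classes))) /
                   nrow(train))
names(log_prior) <- classes

# Score a new document; words outside the vocabulary are ignored
classify <- function(s) {
  w <- tokenize(s)
  w <- w[w %in% vocab]
  scores <- log_prior + colSums(log_lik[w, , drop = FALSE])
  names(which.max(scores))
}

print(classify("buy cheap pills"))
print(classify("monday meeting notes"))
```

Working in log space avoids numeric underflow when many small word probabilities are multiplied together.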
Word extraction is a popular preprocessing step; here are the details of the word-extraction algorithm.
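As a rough illustration of what word extraction involves, the following base R sketch lowercases a text, splits it into words, drops a few stop words, and counts what remains. It is a simple tokenizer written for this example, not the specific algorithm the chapter refers to; the function name `extract_words` and the short stop-word list are assumptions.

```r
# Extract words from raw text: lowercase, split on non-letters,
# and drop stop words. The default stop-word list is a small,
# hypothetical sample, not a complete list.
extract_words <- function(text,
                          stop_words = c("the", "a", "an", "of", "and",
                                         "in", "is", "to")) {
  words <- unlist(strsplit(tolower(text), "[^a-z]+"))
  words <- words[nzchar(words)]
  words[!(words %in% stop_words)]
}

words <- extract_words("The goal of text mining is to extract words in the text.")

# Term frequencies, most frequent first
freq <- sort(table(words), decreasing = TRUE)
print(freq)
```

The resulting word vector or frequency table can then feed later steps such as building a document-term matrix.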