R: Mining Spatial, Text, Web, and Social Media Data
by Richard Heimann, Nathan Danneman, Pradeepta Mishra, Bater Makhabel
Table of Contents
Credits
Preface
What this learning path covers
What you need for this learning path
Who this learning path is for
Reader feedback
Customer support
Downloading the example code
Errata
Piracy
Questions
1. Module 1
1. Warming Up
Big data
Scalability and efficiency
Data source
Data mining
Feature extraction
Summarization
The data mining process
CRISP-DM
SEMMA
Social network mining
Social network
Text mining
Information retrieval and text mining
Mining text for prediction
Web data mining
Why R?
What are the disadvantages of R?
Statistics
Statistics and data mining
Statistics and machine learning
Statistics and R
The limitations of statistics on data mining
Machine learning
Approaches to machine learning
Machine learning architecture
Data attributes and description
Numeric attributes
Categorical attributes
Data description
Data measuring
Data cleaning
Missing values
Junk, noisy data, or outlier
Data integration
Data dimension reduction
Eigenvalues and Eigenvectors
Principal-Component Analysis
Singular-value decomposition
CUR decomposition
Data transformation and discretization
Data transformation
Normalization data transformation methods
Data discretization
Visualization of results
Visualization with R
Time for action
Summary
2. Mining Frequent Patterns, Associations, and Correlations
An overview of associations and patterns
Patterns and pattern discovery
The frequent itemset
The frequent subsequence
The frequent substructures
Relationship or rules discovery
Association rules
Correlation rules
Market basket analysis
The market basket model
A-Priori algorithms
Input data characteristics and data structure
The A-Priori algorithm
The R implementation
A-Priori algorithm variants
The Eclat algorithm
The R implementation
The FP-growth algorithm
Input data characteristics and data structure
The FP-growth algorithm
The R implementation
The GenMax algorithm with maximal frequent itemsets
The R implementation
The Charm algorithm with closed frequent itemsets
The R implementation
The algorithm to generate association rules
The R implementation
Hybrid association rules mining
Mining multilevel and multidimensional association rules
Constraint-based frequent pattern mining
Mining sequence dataset
Sequence dataset
The GSP algorithm
The R implementation
The SPADE algorithm
The R implementation
Rule generation from sequential patterns
High-performance algorithms
Time for action
Summary
3. Classification
Classification
Generic decision tree induction
Attribute selection measures
Tree pruning
General algorithm for the decision tree generation
The R implementation
High-value credit card customers classification using ID3
The ID3 algorithm
The R implementation
Web attack detection
High-value credit card customers classification
Web spam detection using C4.5
The C4.5 algorithm
The R implementation
A parallel version with MapReduce
Web spam detection
Web key resource page judgment using CART
The CART algorithm
The R implementation
Web key resource page judgment
Trojan traffic identification method and Bayes classification
Estimating
Prior probability estimation
Likelihood estimation
The Bayes classification
The R implementation
Trojan traffic identification method
Identify spam e-mail and Naïve Bayes classification
The Naïve Bayes classification
The R implementation
Identify spam e-mail
Rule-based classification of player types in computer games and rule-based classification
Transformation from decision tree to decision rules
Rule-based classification
Sequential covering algorithm
The RIPPER algorithm
The R implementation
Rule-based classification of player types in computer games
Time for action
Summary
4. Advanced Classification
Ensemble (EM) methods
The bagging algorithm
The boosting and AdaBoost algorithms
The Random forests algorithm
The R implementation
Parallel version with MapReduce
Biological traits and the Bayesian belief network
The Bayesian belief network (BBN) algorithm
The R implementation
Biological traits
Protein classification and the k-Nearest Neighbors algorithm
The kNN algorithm
The R implementation
Document retrieval and Support Vector Machine
The SVM algorithm
The R implementation
Parallel version with MapReduce
Document retrieval
Classification using frequent patterns
The associative classification
CBA
Discriminative frequent pattern-based classification
The R implementation
Text classification using sentential frequent itemsets
Classification using the backpropagation algorithm
The BP algorithm
The R implementation
Parallel version with MapReduce
Time for action
Summary
5. Cluster Analysis
Search engines and the k-means algorithm
The k-means clustering algorithm
The kernel k-means algorithm
The k-modes algorithm
The R implementation
Parallel version with MapReduce
Search engine and web page clustering
Automatic abstraction of document texts and the k-medoids algorithm
The PAM algorithm
The R implementation
Automatic abstraction and summarization of document text
The CLARA algorithm
The CLARA algorithm
The R implementation
CLARANS
The CLARANS algorithm
The R implementation
Unsupervised image categorization and affinity propagation clustering
Affinity propagation clustering
The R implementation
Unsupervised image categorization
The spectral clustering algorithm
The R implementation
News categorization and hierarchical clustering
Agglomerative hierarchical clustering
The BIRCH algorithm
The chameleon algorithm
The Bayesian hierarchical clustering algorithm
The probabilistic hierarchical clustering algorithm
The R implementation
News categorization
Time for action
Summary
6. Advanced Cluster Analysis
Customer categorization analysis of e-commerce and DBSCAN
The DBSCAN algorithm
Customer categorization analysis of e-commerce
Clustering web pages and OPTICS
The OPTICS algorithm
The R implementation
Clustering web pages
Visitor analysis in the browser cache and DENCLUE
The DENCLUE algorithm
The R implementation
Visitor analysis in the browser cache
Recommendation system and STING
The STING algorithm
The R implementation
Recommendation systems
Web sentiment analysis and CLIQUE
The CLIQUE algorithm
The R implementation
Web sentiment analysis
Opinion mining and WAVE clustering
The WAVE cluster algorithm
The R implementation
Opinion mining
User search intent and the EM algorithm
The EM algorithm
The R implementation
The user search intent
Customer purchase data analysis and clustering high-dimensional data
The MAFIA algorithm
The SURFING algorithm
The R implementation
Customer purchase data analysis
SNS and clustering graph and network data
The SCAN algorithm
The R implementation
Social networking service (SNS)
Time for action
Summary
7. Outlier Detection
Credit card fraud detection and statistical methods
The likelihood-based outlier detection algorithm
The R implementation
Credit card fraud detection
Activity monitoring – the detection of fraud involving mobile phones and proximity-based methods
The NL algorithm
The FindAllOutsM algorithm
The FindAllOutsD algorithm
The distance-based algorithm
The Dolphin algorithm
The R implementation
Activity monitoring and the detection of mobile fraud
Intrusion detection and density-based methods
The OPTICS-OF algorithm
The High Contrast Subspace algorithm
The R implementation
Intrusion detection
Intrusion detection and clustering-based methods
Hierarchical clustering to detect outliers
The k-means-based algorithm
The ODIN algorithm
The R implementation
Monitoring the performance of the web server and classification-based methods
The OCSVM algorithm
The one-class nearest neighbor algorithm
The R implementation
Monitoring the performance of the web server
Detecting novelty in text, topic detection, and mining contextual outliers
The conditional anomaly detection (CAD) algorithm
The R implementation
Detecting novelty in text and topic detection
Collective outliers on spatial data
The route outlier detection (ROD) algorithm
The R implementation
Characteristics of collective outliers
Outlier detection in high-dimensional data
The brute-force algorithm
The HilOut algorithm
The R implementation
Time for action
Summary
8. Mining Stream, Time-series, and Sequence Data
The credit card transaction flow and STREAM algorithm
The STREAM algorithm
The single-pass-any-time clustering algorithm
The R implementation
The credit card transaction flow
Predicting future prices and time-series analysis
The ARIMA algorithm
Predicting future prices
Stock market data and time-series clustering and classification
The hError algorithm
Time-series classification with the 1NN classifier
The R implementation
Stock market data
Web click streams and mining symbolic sequences
The TECNO-STREAMS algorithm
The R implementation
Web click streams
Mining sequence patterns in transactional databases
The PrefixSpan algorithm
The R implementation
Time for action
Summary
9. Graph Mining and Network Analysis
Graph mining
Graph
Graph mining algorithms
Mining frequent subgraph patterns
The gPLS algorithm
The GraphSig algorithm
The gSpan algorithm
Rightmost path extensions and their supports
The subgraph isomorphism enumeration algorithm
The canonical checking algorithm
The R implementation
Social network mining
Community detection and the shingling algorithm
The node classification and iterative classification algorithms
The R implementation
Time for action
Summary
10. Mining Text and Web Data
Text mining and TM packages
Text summarization
Topic representation
The multidocument summarization algorithm
The Maximal Marginal Relevance algorithm
The R implementation
The question answering system
Genre categorization of web pages
Categorizing newspaper articles and newswires into topics
The N-gram-based text categorization
The R implementation
Web usage mining with web logs
The FCA-based association rule mining algorithm
The R implementation
Time for action
Summary
A. Algorithms and Data Structures
2. Module 2
1. Data Manipulation Using In-built R Data
What is data mining?
How is it related to data science, analytics, and statistical modeling?
Introduction to the R programming language
Getting started with R
Data types, vectors, arrays, and matrices
List management, factors, and sequences
Import and export of data types
Data type conversion
Sorting and merging dataframes
Indexing or subsetting dataframes
Date and time formatting
Creating new functions
User-defined functions
Built-in functions
Loop concepts - the for loop
Loop concepts - the repeat loop
Loop concepts - while conditions
Apply concepts
String manipulation
NA and missing value management
Missing value imputation techniques
Summary
2. Exploratory Data Analysis with Automobile Data
Univariate data analysis
Bivariate analysis
Multivariate analysis
Understanding distributions and transformation
Normal probability distribution
Binomial probability distribution
Poisson probability distribution
Interpreting distributions
Interpreting continuous data
Variable binning or discretizing continuous data
Contingency tables, bivariate statistics, and checking for data normality
Hypothesis testing
Test of the population mean
One tail test of mean with known variance
One tail and two tail test of proportions
Two sample variance test
Non-parametric methods
Wilcoxon signed-rank test
Mann-Whitney-Wilcoxon test
Kruskal-Wallis test
Summary
3. Visualize Diamond Dataset
Data visualization using ggplot2
Bar chart
Boxplot
Bubble chart
Donut chart
Geo mapping
Histogram
Line chart
Pie chart
Scatterplot
Stacked bar chart
Stem and leaf plot
Word cloud
Coxcomb plot
Using plotly
Bubble plot
Bar charts using plotly
Scatterplot using plotly
Boxplots using plotly
Polar charts using plotly
Polar scatterplot using plotly
Polar area chart
Creating geo mapping
Summary
4. Regression with Automobile Data
Regression introduction
Formulation of regression problem
Case study
Linear regression
Stepwise regression method for variable selection
Logistic regression
Cubic regression
Penalized regression
Summary
5. Market Basket Analysis with Groceries Data
Introduction to Market Basket Analysis
What is MBA?
Where to apply MBA?
Data requirement
Assumptions/prerequisites
Modeling techniques
Limitations
Practical project
Apriori algorithm
Eclat algorithm
Visualizing association rules
Implementation of arules
Summary
6. Clustering with E-commerce Data
Understanding customer segmentation
Why understanding customer segmentation is important
How to perform customer segmentation?
Various clustering methods available
K-means clustering
Hierarchical clustering
Model-based clustering
Other cluster algorithms
Comparing clustering methods
References
Summary
7. Building a Retail Recommendation Engine
What is recommendation?
Types of product recommendation
Techniques to perform recommendation
Assumptions
What method to apply when
Limitations of collaborative filtering
Practical project
Summary
8. Dimensionality Reduction
Why dimensionality reduction?
Techniques available for dimensionality reduction
Which technique to apply where?
Principal component analysis
Practical project around dimensionality reduction
Attribute description
Parametric approach to dimension reduction
References
Summary
9. Applying Neural Network to Healthcare Data
Introduction to neural networks
Understanding the math behind the neural network
Neural network implementation in R
Neural networks for prediction
Neural networks for classification
Neural networks for forecasting
Merits and demerits of neural networks
References
Summary
3. Module 3
1. Going Viral
Social media mining using sentiment analysis
The state of communication
What is Big Data?
Human sensors and honest signals
Quantitative approaches
Summary
2. Getting Started with R
Why R?
Quick start
The basics – assignment and arithmetic
Functions, arguments, and help
Vectors, sequences, and combining vectors
A quick example – creating data frames and importing files
Visualization in R
Style and workflow
Additional resources
Summary
3. Mining Twitter with R
Why Twitter data?
Obtaining Twitter data
Preliminary analyses
Summary
4. Potentials and Pitfalls of Social Media Data
Opinion mining made difficult
Sentiment and its measurement
The nature of social media data
Traditional versus nontraditional social data
Measurement and inferential challenges
Summary
5. Social Media Mining – Fundamentals
Key concepts of social media mining
Good data versus bad data
Understanding sentiments
Scherer's typology of emotions
Sentiment polarity – data and classification
Supervised social media mining – lexicon-based sentiment
Supervised social media mining – Naive Bayes classifiers
Unsupervised social media mining – Item Response Theory for text scaling
Summary
6. Social Media Mining – Case Studies
Introductory considerations
Case study 1 – supervised social media mining – lexicon-based sentiment
Case study 2 – Naive Bayes classifier
Case study 3 – IRT models for unsupervised sentiment scaling
Summary
A. Conclusions and Next Steps
Final thoughts
An expanding field
Further reading
Bibliography
Index