Table of Contents for Part IV. Prediction and Classification Methods
by Peter C. Bruce, Nitin R. Patel, Galit Shmueli
Data Mining For Business Intelligence: Concepts, Techniques, and Applications in Microsoft Office Excel® with XLMiner®, Second Edition
Copyright
Foreword
Preface to the Second Edition
Preface to the First Edition
Acknowledgments
I. Preliminaries
1. Introduction
1.1. What Is Data Mining?
1.2. Where Is Data Mining Used?
1.3. Origins of Data Mining
1.4. Rapid Growth of Data Mining
1.5. Why Are There So Many Different Methods?
1.6. Terminology and Notation
1.7. Road Maps to This Book
1.7.1. Order of Topics
2. Overview of the Data Mining Process
2.1. Introduction
2.2. Core Ideas in Data Mining
2.2.1. Classification
2.2.2. Prediction
2.2.3. Association Rules
2.2.4. Predictive Analytics
2.2.5. Data Reduction
2.2.6. Data Exploration
2.2.7. Data Visualization
2.3. Supervised and Unsupervised Learning
2.4. Steps in Data Mining
2.5. Preliminary Steps
2.5.1. Organization of Datasets
2.5.2. Sampling from a Database
2.5.3. Oversampling Rare Events
2.5.4. Preprocessing and Cleaning the Data
2.5.5. Use and Creation of Partitions
2.6. Building a Model: Example with Linear Regression
2.6.1. Boston Housing Data
2.6.2. Modeling Process
2.7. Using Excel for Data Mining
2.8. PROBLEMS
II. Data Exploration and Dimension Reduction
3. Data Visualization
3.1. Uses of Data Visualization
3.2. Data Examples
3.2.1. Example 1: Boston Housing Data
3.2.2. Example 2: Ridership on Amtrak Trains
3.3. Basic Charts: Bar Charts, Line Graphs, and Scatterplots
3.3.1. Distribution Plots: Boxplots and Histograms
3.3.2. Heatmaps: Visualizing Correlations and Missing Values
3.4. Multidimensional Visualization
3.4.1. Adding Variables: Color, Size, Shape, Multiple Panels, and Animation
3.4.2. Manipulations: Rescaling, Aggregation and Hierarchies, Zooming and Panning, and Filtering
3.4.3. Reference: Trend Lines and Labels
3.4.4. Scaling Up: Large Datasets
3.4.5. Multivariate Plot: Parallel Coordinates Plot
3.4.6. Interactive Visualization
3.5. Specialized Visualizations
3.5.1. Visualizing Networked Data
3.5.2. Visualizing Hierarchical Data: Treemaps
3.5.3. Visualizing Geographical Data: Map Charts
3.6. Summary of Major Visualizations and Operations, According to Data Mining Goal
3.6.1. Prediction
3.6.2. Classification
3.6.3. Time Series Forecasting
3.6.4. Unsupervised Learning
3.7. PROBLEMS
4. Dimension Reduction
4.1. Introduction
4.2. Practical Considerations
4.2.1. Example 1: House Prices in Boston
4.3. Data Summaries
4.3.1. Summary Statistics
4.3.2. Pivot Tables
4.4. Correlation Analysis
4.5. Reducing the Number of Categories in Categorical Variables
4.6. Converting a Categorical Variable to a Numerical Variable
4.7. Principal Components Analysis
4.7.1. Example 2: Breakfast Cereals
4.7.2. Principal Components
4.7.3. Normalizing the Data
4.7.4. Using Principal Components for Classification and Prediction
4.8. Dimension Reduction Using Regression Models
4.9. Dimension Reduction Using Classification and Regression Trees
4.10. PROBLEMS
III. Performance Evaluation
5. Evaluating Classification and Predictive Performance
5.1. Introduction
5.2. Judging Classification Performance
5.2.1. Benchmark: The Naive Rule
5.2.2. Class Separation
5.2.3. Classification Matrix
5.2.4. Using the Validation Data
5.2.5. Accuracy Measures
5.2.6. Cutoff for Classification
5.2.7. Performance in Unequal Importance of Classes
5.2.8. Asymmetric Misclassification Costs
5.2.9. Oversampling and Asymmetric Costs
5.2.10. Classification Using a Triage Strategy
5.3. Evaluating Predictive Performance
5.3.1. Benchmark: The Average
5.3.2. Prediction Accuracy Measures
5.4. PROBLEMS
IV. Prediction and Classification Methods
6. Multiple Linear Regression
6.1. Introduction
6.2. Explanatory versus Predictive Modeling
6.3. Estimating the Regression Equation and Prediction
6.3.1. Example: Predicting the Price of Used Toyota Corolla Automobiles
6.4. Variable Selection in Linear Regression
6.4.1. Reducing the Number of Predictors
6.4.2. How to Reduce the Number of Predictors
6.5. PROBLEMS
7. k-Nearest Neighbors (k-NN)
7.1. k-NN Classifier (categorical outcome)
7.1.1. Determining Neighbors
7.1.2. Classification Rule
7.1.3. Example: Riding Mowers
7.1.4. Choosing k
7.1.5. Setting the Cutoff Value
7.1.6. k-NN with More Than Two Classes
7.2. k-NN for a Numerical Response
7.3. Advantages and Shortcomings of k-NN Algorithms
7.4. PROBLEMS
8. Naive Bayes
8.1. Introduction
8.1.1. Example 1: Predicting Fraudulent Financial Reporting
8.2. Applying the Full (Exact) Bayesian Classifier
8.2.1. Practical Difficulty with the Complete (Exact) Bayes Procedure
8.2.2. Solution: Naive Bayes
8.2.3. Example 2: Predicting Fraudulent Financial Reports, Two Predictors
8.2.4. Example 3: Predicting Delayed Flights
8.3. Advantages and Shortcomings of the Naive Bayes Classifier
8.4. PROBLEMS
9. Classification and Regression Trees
9.1. Introduction
9.2. Classification Trees
9.2.1. Recursive Partitioning
9.2.2. Example 1: Riding Mowers
9.3. Measures of Impurity
9.3.1. Tree Structure
9.3.2. Classifying a New Observation
9.4. Evaluating the Performance of a Classification Tree
9.4.1. Example 2: Acceptance of Personal Loan
9.5. Avoiding Overfitting
9.5.1. Stopping Tree Growth: CHAID
9.5.2. Pruning the Tree
9.6. Classification Rules from Trees
9.7. Classification Trees for More Than Two Classes
9.8. Regression Trees
9.8.1. Prediction
9.8.2. Measuring Impurity
9.8.3. Evaluating Performance
9.9. Advantages, Weaknesses, and Extensions
9.10. PROBLEMS
10. Logistic Regression
10.1. Introduction
10.2. Logistic Regression Model
10.2.1. Example: Acceptance of Personal Loan
10.2.2. Model with a Single Predictor
10.2.3. Estimating the Logistic Model from Data: Computing Parameter Estimates
10.2.4. Interpreting Results in Terms of Odds
10.3. Evaluating Classification Performance
10.3.1. Variable Selection
10.3.2. Impact of Single Predictors
10.4. Example of Complete Analysis: Predicting Delayed Flights
10.4.1. Data Preprocessing
10.4.2. Model Fitting and Estimation
10.4.3. Model Interpretation
10.4.4. Model Performance
10.4.5. Variable Selection
10.5. Appendix: Logistic Regression for Profiling
10.5.1. Appendix A: Why Linear Regression Is Inappropriate for a Categorical Response
10.5.2. Appendix B: Evaluating Goodness of Fit
10.5.3. Appendix C: Logistic Regression for More Than Two Classes
10.6. PROBLEMS
11. Neural Nets
11.1. Introduction
11.2. Concept and Structure of a Neural Network
11.3. Fitting a Network to Data
11.3.1. Example 1: Tiny Dataset
11.3.2. Computing Output of Nodes
11.3.3. Preprocessing the Data
11.3.4. Training the Model
11.3.5. Example 2: Classifying Accident Severity
11.3.6. Avoiding Overfitting
11.3.7. Using the Output for Prediction and Classification
11.4. Required User Input
11.5. Exploring the Relationship Between Predictors and Response
11.6. Advantages and Weaknesses of Neural Networks
11.7. PROBLEMS
12. Discriminant Analysis
12.1. Introduction
12.1.1. Example 1: Riding Mowers
12.1.2. Example 2: Personal Loan Acceptance
12.2. Distance of an Observation from a Class
12.3. Fisher's Linear Classification Functions
12.4. Classification Performance of Discriminant Analysis
12.5. Prior Probabilities
12.6. Unequal Misclassification Costs
12.7. Classifying More Than Two Classes
12.7.1. Example 3: Medical Dispatch to Accident Scenes
12.8. Advantages and Weaknesses
12.9. PROBLEMS
V. Mining Relationships Among Records
13. Association Rules
13.1. Introduction
13.2. Discovering Association Rules in Transaction Databases
13.2.1. Example 1: Synthetic Data on Purchases of Phone Faceplates
13.3. Generating Candidate Rules
13.3.1. The Apriori Algorithm
13.4. Selecting Strong Rules
13.4.1. Support and Confidence
13.4.2. Lift Ratio
13.4.3. Data Format
13.4.4. Process of Rule Selection
13.4.5. Interpreting the Results
13.4.6. Statistical Significance of Rules
13.4.7. Example 2: Rules for Similar Book Purchases
13.5. Summary
13.6. PROBLEMS
14. Cluster Analysis
14.1. Introduction
14.1.1. Example: Public Utilities
14.2. Measuring Distance Between Two Records
14.2.1. Euclidean Distance
14.2.2. Normalizing Numerical Measurements
14.2.3. Other Distance Measures for Numerical Data
14.2.4. Distance Measures for Categorical Data
14.2.5. Distance Measures for Mixed Data
14.3. Measuring Distance Between Two Clusters
14.4. Hierarchical (Agglomerative) Clustering
14.4.1. Minimum Distance (Single Linkage)
14.4.2. Maximum Distance (Complete Linkage)
14.4.3. Average Distance (Average Linkage)
14.4.4. Centroid Distance (Average Group Linkage)
14.4.5. Ward's Method
14.4.6. Dendrograms: Displaying Clustering Process and Results
14.4.7. Validating Clusters
14.4.8. Limitations of Hierarchical Clustering
14.5. Nonhierarchical Clustering: The k-Means Algorithm
14.5.1. Initial Partition into k Clusters
14.6. PROBLEMS
VI. Forecasting Time Series
15. Handling Time Series
15.1. Introduction
15.2. Explanatory versus Predictive Modeling
15.3. Popular Forecasting Methods in Business
15.3.1. Combining Methods
15.4. Time Series Components
15.4.1. Example: Ridership on Amtrak Trains
15.5. Data Partitioning
15.6. PROBLEMS
16. Regression-Based Forecasting
16.1. Model with Trend
16.1.1. Linear Trend
16.1.2. Exponential Trend
16.1.3. Polynomial Trend
16.2. Model with Seasonality
16.3. Model with Trend and Seasonality
16.4. Autocorrelation and ARIMA Models
16.4.1. Computing Autocorrelation
16.4.2. Improving Forecasts by Integrating Autocorrelation Information
16.4.3. Evaluating Predictability
16.5. PROBLEMS
17. Smoothing Methods
17.1. Introduction
17.2. Moving Average
17.2.1. Centered Moving Average for Visualization
17.2.2. Trailing Moving Average for Forecasting
17.2.3. Choosing Window Width (w)
17.3. Simple Exponential Smoothing
17.3.1. Choosing Smoothing Parameter α
17.3.2. Relation between Moving Average and Simple Exponential Smoothing
17.4. Advanced Exponential Smoothing
17.4.1. Series with a Trend
17.4.2. Series with a Trend and Seasonality
17.4.3. Series with Seasonality (No Trend)
17.5. PROBLEMS
VII. Cases
18. Cases
18.1. Charles Book Club
18.1.1. The Book Industry
18.1.2. Database Marketing at Charles
18.1.3. Data Mining Techniques
18.1.4. Assignment
18.2. German Credit
18.2.1. Assignment
18.3. Tayko Software Cataloger
18.3.1. Background
18.3.2. The Mailing Experiment
18.3.3. Data
18.3.4. Assignment
18.4. Segmenting Consumers of Bath Soap
18.4.1. Business Situation
18.4.2. Key Problems
18.4.3. Data
18.4.4. Measuring Brand Loyalty
18.4.5. Assignment
18.4.6. Appendix
18.5. Direct-Mail Fundraising
18.5.1. Background
18.5.2. Data
18.5.3. Assignment
18.6. Catalog Cross Selling
18.6.1. Background
18.6.2. Assignment
18.7. Predicting Bankruptcy
18.7.1. Predicting Corporate Bankruptcy
18.7.2. Assignment
18.8. Time Series Case: Forecasting Public Transportation Demand
18.8.1. Background
18.8.2. Problem Description
18.8.3. Available Data
18.8.4. Assignment Goal
18.8.5. Assignment
18.8.6. Tips and Suggested Steps
References