III. Module 3: Learning Data Mining with R
by Ágnes Vidovics-Dancs, Kata Váradi, Tamás Vadász, Ágnes Tuza, Balázs Árpád Szucs,
R: Data Analysis and Visualization
Table of Contents
Meet Your Course Guide
Course Structure
Course journey
The Course Roadmap and Timeline
I. Module 1: Data Analysis with R
1. RefresheR
Navigating the basics
Arithmetic and assignment
Logicals and characters
Flow of control
Getting help in R
Vectors
Subsetting
Vectorized functions
Advanced subsetting
Recycling
Functions
Matrices
Loading data into R
Working with packages
2. The Shape of Data
Univariate data
Frequency distributions
Central tendency
Spread
Populations, samples, and estimation
Probability distributions
Visualization methods
3. Describing Relationships
Multivariate data
Relationships between a categorical and a continuous variable
Relationships between two categorical variables
The relationship between two continuous variables
Covariance
Correlation coefficients
Comparing multiple correlations
Visualization methods
Categorical and continuous variables
Two categorical variables
Two continuous variables
More than two continuous variables
4. Probability
Basic probability
A tale of two interpretations
Sampling from distributions
Parameters
The binomial distribution
The normal distribution
The three-sigma rule and using z-tables
5. Using Data to Reason About the World
Estimating means
The sampling distribution
Interval estimation
How did we get 1.96?
Smaller samples
6. Testing Hypotheses
Null Hypothesis Significance Testing
One and two-tailed tests
When things go wrong
A warning about significance
A warning about p-values
Testing the mean of one sample
Assumptions of the one sample t-test
Testing two means
Don't be fooled!
Assumptions of the independent samples t-test
Testing more than two means
Assumptions of ANOVA
Testing independence of proportions
What if my assumptions are unfounded?
7. Bayesian Methods
The big idea behind Bayesian analysis
Choosing a prior
Who cares about coin flips
Enter MCMC – stage left
Using JAGS and runjags
Fitting distributions the Bayesian way
The Bayesian independent samples t-test
8. Predicting Continuous Variables
Linear models
Simple linear regression
Simple linear regression with a binary predictor
A word of warning
Multiple regression
Regression with a non-binary predictor
Kitchen sink regression
The bias-variance trade-off
Cross-validation
Striking a balance
Linear regression diagnostics
Second Anscombe relationship
Third Anscombe relationship
Fourth Anscombe relationship
Advanced topics
9. Predicting Categorical Variables
k-Nearest Neighbors
Using k-NN in R
Confusion matrices
Limitations of k-NN
Logistic regression
Using logistic regression in R
Decision trees
Random forests
Choosing a classifier
The vertical decision boundary
The diagonal decision boundary
The crescent decision boundary
The circular decision boundary
10. Sources of Data
Relational Databases
Why didn't we just do that in SQL?
Using JSON
XML
Other data formats
Online repositories
11. Dealing with Messy Data
Analysis with missing data
Visualizing missing data
Types of missing data
So which one is it?
Unsophisticated methods for dealing with missing data
Complete case analysis
Pairwise deletion
Mean substitution
Hot deck imputation
Regression imputation
Stochastic regression imputation
Multiple imputation
So how does mice come up with the imputed values?
Methods of imputation
Multiple imputation in practice
Analysis with unsanitized data
Checking for out-of-bounds data
Checking the data type of a column
Checking for unexpected categories
Checking for outliers, entry errors, or unlikely data points
Chaining assertions
Other messiness
OpenRefine
Regular expressions
tidyr
12. Dealing with Large Data
Wait to optimize
Using a bigger and faster machine
Be smart about your code
Allocation of memory
Vectorization
Using optimized packages
Using another R implementation
Use parallelization
Getting started with parallel R
An example of (some) substance
Using Rcpp
Be smarter about your code
13. Reproducibility and Best Practices
R Scripting
RStudio
Running R scripts
An example script
Scripting and reproducibility
R projects
Version control
Communicating results
II. Module 2: R Graphs
1. R Graphics
Base graphics using the default package
Trellis graphs using lattice
Graphs inspired by Grammar of Graphics
2. Basic Graph Functions
Introduction
Creating basic scatter plots
Getting ready
How to do it...
How it works...
There's more...
A note on R's built-in datasets
See also
Creating line graphs
Getting ready
How to do it...
How it works...
There's more...
See also
Creating bar charts
Getting ready
How to do it...
How it works...
There's more...
See also
Creating histograms and density plots
How to do it...
How it works...
There's more...
See also
Creating box plots
Getting ready
How to do it...
How it works...
There's more...
See also
Adjusting x and y axes' limits
How to do it...
How it works...
There's more...
See also
Creating heat maps
How to do it...
How it works...
There's more...
See also
Creating pairs plots
How to do it...
How it works...
There's more...
See also
Creating multiple plot matrix layouts
How to do it...
How it works...
There's more...
See also
Adding and formatting legends
Getting ready
How to do it...
How it works...
There's more...
See also
Creating graphs with maps
Getting ready
How to do it...
How it works...
There's more...
See also
Saving and exporting graphs
How to do it...
How it works...
There's more...
See also
3. Beyond the Basics – Adjusting Key Parameters
Introduction
Setting colors of points, lines, and bars
Getting ready
How to do it...
How it works...
There's more...
See also
Setting plot background colors
Getting ready
How to do it...
How it works...
There's more...
Setting colors for text elements – axis annotations, labels, plot titles, and legends
Getting ready
How to do it...
How it works...
There's more...
Choosing color combinations and palettes
Getting ready
How to do it...
How it works...
There's more...
See also
Setting fonts for annotations and titles
Getting ready
How to do it...
How it works...
There's more...
See also
Choosing plotting point symbol styles and sizes
Getting ready
How to do it...
How it works...
There's more...
See also
Choosing line styles and width
Getting ready
How to do it...
How it works...
See also
Choosing box styles
Getting ready
How to do it...
How it works...
There's more...
Adjusting axis annotations and tick marks
Getting ready
How to do it...
How it works...
There's more...
See also
Formatting log axes
Getting ready
How to do it...
How it works...
There's more...
Setting graph margins and dimensions
Getting ready
How to do it...
How it works...
See also
4. Creating Scatter Plots
Introduction
Grouping data points within a scatter plot
Getting ready
How to do it...
How it works...
There's more...
See also
Highlighting grouped data points by size and symbol type
Getting ready
How to do it...
How it works...
Labeling data points
Getting ready
How to do it...
How it works...
There's more...
Correlation matrix using pairs plots
Getting ready
How to do it...
How it works...
Adding error bars
Getting ready
How to do it...
How it works...
There's more...
Using jitter to distinguish closely packed data points
Getting ready
How to do it...
How it works...
Adding linear model lines
Getting ready
How to do it...
How it works...
Adding nonlinear model curves
Getting ready
How to do it...
How it works...
Adding nonparametric model curves with lowess
Getting ready
How to do it...
How it works...
Creating three-dimensional scatter plots
Getting ready
How to do it...
How it works...
There's more...
Creating Quantile-Quantile plots
Getting ready
How to do it...
How it works...
There's more...
Displaying the data density on axes
Getting ready
How to do it...
How it works...
There's more...
Creating scatter plots with a smoothed density representation
Getting ready
How to do it...
How it works...
There's more...
5. Creating Line Graphs and Time Series Charts
Introduction
Adding customized legends for multiple-line graphs
Getting ready
How to do it...
How it works...
There's more...
See also
Using margin labels instead of legends for multiple-line graphs
Getting ready
How to do it...
How it works...
There's more...
Adding horizontal and vertical grid lines
Getting ready
How to do it...
How it works...
There's more...
See also
Adding marker lines at specific x and y values using abline
Getting ready
How to do it...
How it works...
There's more...
Creating sparklines
Getting ready
How to do it...
How it works...
Plotting functions of a variable in a dataset
Getting ready
How to do it...
How it works...
There's more...
Formatting time series data for plotting
Getting ready
How to do it...
How it works...
There's more...
Plotting the date or time variable on the x axis
Getting ready
How to do it...
How it works...
There's more...
Annotating axis labels in different human-readable time formats
Getting ready
How to do it...
How it works...
There's more...
Adding vertical markers to indicate specific time events
Getting ready
How to do it...
How it works...
There's more...
Plotting data with varying time-averaging periods
Getting ready
How to do it...
How it works...
Creating stock charts
Getting ready
How to do it...
How it works...
There's more...
6. Creating Bar, Dot, and Pie Charts
Introduction
Creating bar charts with more than one factor variable
Getting ready
How to do it...
How it works...
See also
Creating stacked bar charts
Getting ready
How to do it...
How it works...
There's more...
Adjusting the orientation of bars – horizontal and vertical
Getting ready
How to do it...
How it works...
There's more...
Adjusting bar widths, spacing, colors, and borders
Getting ready
How to do it...
How it works...
There's more...
Displaying values on top of or next to the bars
Getting ready
How to do it...
How it works...
There's more...
See also
Placing labels inside bars
Getting ready
How to do it...
How it works...
There's more...
Creating bar charts with vertical error bars
Getting ready
How to do it...
How it works...
There's more...
Modifying dot charts by grouping variables
Getting ready
How to do it...
How it works...
Making better, readable pie charts with clockwise-ordered slices
Getting ready
How to do it...
How it works...
See also
Labeling a pie chart with percentage values for each slice
Getting ready
How it works...
There's more...
See also
Adding a legend to a pie chart
Getting ready
How to do it...
How it works...
There's more...
7. Creating Histograms
Introduction
Visualizing distributions as count frequencies or probability densities
Getting ready
How to do it...
How it works...
There's more
Setting the bin size and the number of breaks
Getting ready
How to do it...
How it works...
There's more
Adjusting histogram styles – bar colors, borders, and axes
Getting ready
How to do it...
How it works...
There's more
Overlaying a density line over a histogram
Getting ready
How to do it...
How it works...
Multiple histograms along the diagonal of a pairs plot
Getting ready
How to do it...
How it works...
Histograms in the margins of line and scatter plots
Getting ready
How to do it...
How it works...
8. Box and Whisker Plots
Introduction
Creating box plots with narrow boxes for a small number of variables
Getting ready
How to do it...
How it works...
There's more
See also
Grouping over a variable
Getting ready
How to do it...
How it works...
There's more
See also
Varying box widths by the number of observations
Getting ready
How to do it...
How it works...
Creating box plots with notches
Getting ready
How to do it...
How it works...
There's more
Including or excluding outliers
Getting ready
How to do it...
How it works...
See also
Creating horizontal box plots
Getting ready
How to do it...
How it works...
Changing the box styling
Getting ready
How to do it...
How it works...
There's more
Adjusting the extent of plot whiskers outside the box
Getting ready
How to do it...
How it works...
There's more
Showing the number of observations
Getting ready
How to do it...
How it works...
There's more
Splitting a variable at arbitrary values into subsets
Getting ready
How to do it...
How it works...
There's more
9. Creating Heat Maps and Contour Plots
Introduction
Creating heat maps of a single Z variable with a scale
Getting ready
How to do it...
How it works...
There's more
See also
Creating correlation heat maps
Getting ready
How to do it...
How it works...
There's more
Summarizing multivariate data in a single heat map
Getting ready
How to do it...
How it works...
There's more
Creating contour plots
Getting ready
How to do it...
How it works...
There's more
See also
Creating filled contour plots
Getting ready
How to do it...
How it works...
There's more
See also
Creating three-dimensional surface plots
Getting ready
How to do it...
How it works...
There's more
Visualizing time series as calendar heat maps
Getting ready
How to do it...
How it works...
There's more
10. Creating Maps
Introduction
Plotting global data by countries on a world map
Getting ready
How to do it...
How it works...
There's more
See also
Creating graphs with regional maps
Getting ready
How to do it...
How it works...
There's more
Plotting data on Google maps
Getting ready
How to do it...
How it works...
There's more
See also
Creating and reading KML data
Getting ready
How to do it...
How it works...
See also
Working with ESRI shapefiles
Getting ready
How to do it...
How it works...
There's more
11. Data Visualization Using Lattice
Introduction
Creating bar charts
Getting ready
How to do it…
How it works…
There's more…
See also
Creating stacked bar charts
Getting ready
How to do it…
How it works…
There's more…
See also
Creating bar charts to visualize cross-tabulation
Getting ready
How to do it…
How it works…
There's more…
Creating a conditional histogram
Getting ready
How to do it…
How it works…
There's more…
See also
Visualizing distributions through a kernel-density plot
Getting ready
How to do it…
How it works…
There's more…
Creating a normal Q-Q plot
Getting ready
How to do it…
How it works…
There's more…
Visualizing an empirical Cumulative Distribution Function
Getting ready
How to do it…
How it works…
There's more…
Creating a boxplot
Getting ready
How to do it…
How it works…
There's more…
Creating a conditional scatter plot
Getting ready
How to do it…
How it works…
There's more…
12. Data Visualization Using ggplot2
Introduction
Creating bar charts
Getting ready
How to do it…
How it works…
There's more…
See also
Creating multiple bar charts
Getting ready
How to do it…
How it works…
There's more…
See also
Creating a bar chart with error bars
Getting ready
How to do it…
How it works…
There's more…
Visualizing the density of a numeric variable
Getting ready
How to do it...
How it works…
There's more...
Creating a box plot
Getting ready
How to do it...
How it works…
Creating a layered plot with a scatter plot and fitted line
Getting ready
How to do it...
How it works…
There's more...
Creating a line chart
Getting ready
How to do it...
How it works…
There's more...
Graph annotation with ggplot
Getting ready
How to do it...
How it works...
13. Inspecting Large Datasets
Introduction
Multivariate continuous data visualization
Getting ready
How to do it…
How it works…
There's more…
See also
Multivariate categorical data visualization
Getting ready
How to do it…
How it works…
There's more…
Visualizing mixed data
Getting ready
How to do it…
Zooming and filtering
Getting ready
How to do it...
How it works…
There's more...
14. Three-dimensional Visualizations
Introduction
Three-dimensional scatter plots
Getting ready
How to do it…
How it works…
There's more…
See also...
Three-dimensional scatter plots with a regression plane
Getting ready
How to do it…
How it works…
There's more…
Three-dimensional bar charts
Getting ready
How to do it…
How it works…
Three-dimensional density plots
Getting ready
How to do it...
How it works…
15. Finalizing Graphs for Publications and Presentations
Introduction
Exporting graphs in high-resolution image formats – PNG, JPEG, BMP, and TIFF
Getting ready
How to do it...
How it works...
There's more
See also
Exporting graphs in vector formats – SVG, PDF, and PS
Getting ready
How to do it...
How it works...
There's more
Adding mathematical and scientific notations (typesetting)
Getting ready
How to do it...
How it works...
There's more
Adding text descriptions to graphs
Getting ready
How to do it...
How it works...
There's more
Using graph templates
Getting ready
How to do it...
How it works...
There's more
Choosing font families and styles under Windows, Mac OS X, and Linux
Getting ready
How to do it...
How it works...
There's more
See also
Choosing fonts for PostScripts and PDFs
Getting ready
How to do it...
How it works...
There's more
III. Module 3: Learning Data Mining with R
1. Warming Up
Big data
Scalability and efficiency
Data source
Data mining
Feature extraction
Summarization
The data mining process
CRISP-DM
SEMMA
Social network mining
Social network
Text mining
Information retrieval and text mining
Mining text for prediction
Web data mining
Why R?
What are the disadvantages of R?
Statistics
Statistics and data mining
Statistics and machine learning
Statistics and R
The limitations of statistics on data mining
Machine learning
Approaches to machine learning
Machine learning architecture
Data attributes and description
Numeric attributes
Categorical attributes
Data description
Data measuring
Data cleaning
Missing values
Junk, noisy data, or outlier
Data integration
Data dimension reduction
Eigenvalues and Eigenvectors
Principal-Component Analysis
Singular-value decomposition
CUR decomposition
Data transformation and discretization
Data transformation
Normalization data transformation methods
Data discretization
Visualization of results
Visualization with R
2. Mining Frequent Patterns, Associations, and Correlations
An overview of associations and patterns
Patterns and pattern discovery
The frequent itemset
The frequent subsequence
The frequent substructures
Relationship or rules discovery
Association rules
Correlation rules
Market basket analysis
The market basket model
A-Priori algorithms
Input data characteristics and data structure
The A-Priori algorithm
The R implementation
A-Priori algorithm variants
The Eclat algorithm
The R implementation
The FP-growth algorithm
Input data characteristics and data structure
The FP-growth algorithm
The R implementation
The GenMax algorithm with maximal frequent itemsets
The R implementation
The Charm algorithm with closed frequent itemsets
The R implementation
The algorithm to generate association rules
The R implementation
Hybrid association rules mining
Mining multilevel and multidimensional association rules
Constraint-based frequent pattern mining
Mining sequence dataset
Sequence dataset
The GSP algorithm
The R implementation
The SPADE algorithm
The R implementation
Rule generation from sequential patterns
High-performance algorithms
3. Classification
Classification
Generic decision tree induction
Attribute selection measures
Tree pruning
General algorithm for the decision tree generation
The R implementation
High-value credit card customers classification using ID3
The ID3 algorithm
The R implementation
Web attack detection
High-value credit card customers classification
Web spam detection using C4.5
The C4.5 algorithm
The R implementation
A parallel version with MapReduce
Web spam detection
Web key resource page judgment using CART
The CART algorithm
The R implementation
Web key resource page judgment
Trojan traffic identification method and Bayes classification
Estimating
Prior probability estimation
Likelihood estimation
The Bayes classification
The R implementation
Trojan traffic identification method
Identify spam e-mail and Naïve Bayes classification
The Naïve Bayes classification
The R implementation
Identify spam e-mail
Rule-based classification of player types in computer games and rule-based classification
Transformation from decision tree to decision rules
Rule-based classification
Sequential covering algorithm
The RIPPER algorithm
The R implementation
Rule-based classification of player types in computer games
4. Advanced Classification
Ensemble (EM) methods
The bagging algorithm
The boosting and AdaBoost algorithms
The Random forests algorithm
The R implementation
Parallel version with MapReduce
Biological traits and the Bayesian belief network
The Bayesian belief network (BBN) algorithm
The R implementation
Biological traits
Protein classification and the k-Nearest Neighbors algorithm
The kNN algorithm
The R implementation
Document retrieval and Support Vector Machine
The SVM algorithm
The R implementation
Parallel version with MapReduce
Document retrieval
Classification using frequent patterns
The associative classification
CBA
Discriminative frequent pattern-based classification
The R implementation
Text classification using sentential frequent itemsets
Classification using the backpropagation algorithm
The BP algorithm
The R implementation
Parallel version with MapReduce
5. Cluster Analysis
Search engines and the k-means algorithm
The k-means clustering algorithm
The kernel k-means algorithm
The k-modes algorithm
The R implementation
Parallel version with MapReduce
Search engine and web page clustering
Automatic abstraction of document texts and the k-medoids algorithm
The PAM algorithm
The R implementation
Automatic abstraction and summarization of document text
The CLARA algorithm
The CLARA algorithm
The R implementation
CLARANS
The CLARANS algorithm
The R implementation
Unsupervised image categorization and affinity propagation clustering
Affinity propagation clustering
The R implementation
Unsupervised image categorization
The spectral clustering algorithm
The R implementation
News categorization and hierarchical clustering
Agglomerative hierarchical clustering
The BIRCH algorithm
The chameleon algorithm
The Bayesian hierarchical clustering algorithm
The probabilistic hierarchical clustering algorithm
The R implementation
News categorization
6. Advanced Cluster Analysis
Customer categorization analysis of e-commerce and DBSCAN
The DBSCAN algorithm
Customer categorization analysis of e-commerce
Clustering web pages and OPTICS
The OPTICS algorithm
The R implementation
Clustering web pages
Visitor analysis in the browser cache and DENCLUE
The DENCLUE algorithm
The R implementation
Visitor analysis in the browser cache
Recommendation system and STING
The STING algorithm
The R implementation
Recommendation systems
Web sentiment analysis and CLIQUE
The CLIQUE algorithm
The R implementation
Web sentiment analysis
Opinion mining and WAVE clustering
The WAVE cluster algorithm
The R implementation
Opinion mining
User search intent and the EM algorithm
The EM algorithm
The R implementation
The user search intent
Customer purchase data analysis and clustering high-dimensional data
The MAFIA algorithm
The SURFING algorithm
The R implementation
Customer purchase data analysis
SNS and clustering graph and network data
The SCAN algorithm
The R implementation
Social networking service (SNS)
7. Outlier Detection
Credit card fraud detection and statistical methods
The likelihood-based outlier detection algorithm
The R implementation
Credit card fraud detection
Activity monitoring – the detection of fraud involving mobile phones and proximity-based methods
The NL algorithm
The FindAllOutsM algorithm
The FindAllOutsD algorithm
The distance-based algorithm
The Dolphin algorithm
The R implementation
Activity monitoring and the detection of mobile fraud
Intrusion detection and density-based methods
The OPTICS-OF algorithm
The High Contrast Subspace algorithm
The R implementation
Intrusion detection
Intrusion detection and clustering-based methods
Hierarchical clustering to detect outliers
The k-means-based algorithm
The ODIN algorithm
The R implementation
Monitoring the performance of the web server and classification-based methods
The OCSVM algorithm
The one-class nearest neighbor algorithm
The R implementation
Monitoring the performance of the web server
Detecting novelty in text, topic detection, and mining contextual outliers
The conditional anomaly detection (CAD) algorithm
The R implementation
Detecting novelty in text and topic detection
Collective outliers on spatial data
The route outlier detection (ROD) algorithm
The R implementation
Characteristics of collective outliers
Outlier detection in high-dimensional data
The brute-force algorithm
The HilOut algorithm
The R implementation
8. Mining Stream, Time-series, and Sequence Data
The credit card transaction flow and STREAM algorithm
The STREAM algorithm
The single-pass-any-time clustering algorithm
The R implementation
The credit card transaction flow
Predicting future prices and time-series analysis
The ARIMA algorithm
Predicting future prices
Stock market data and time-series clustering and classification
The hError algorithm
Time-series classification with the 1NN classifier
The R implementation
Stock market data
Web click streams and mining symbolic sequences
The TECNO-STREAMS algorithm
The R implementation
Web click streams
Mining sequence patterns in transactional databases
The PrefixSpan algorithm
The R implementation
9. Graph Mining and Network Analysis
Graph mining
Graph
Graph mining algorithms
Mining frequent subgraph patterns
The gPLS algorithm
The GraphSig algorithm
The gSpan algorithm
Rightmost path extensions and their supports
The subgraph isomorphism enumeration algorithm
The canonical checking algorithm
The R implementation
Social network mining
Community detection and the shingling algorithm
The node classification and iterative classification algorithms
The R implementation
10. Mining Text and Web Data
Text mining and TM packages
Text summarization
Topic representation
The multidocument summarization algorithm
The Maximal Marginal Relevance algorithm
The R implementation
The question answering system
Genre categorization of web pages
Categorizing newspaper articles and newswires into topics
The N-gram-based text categorization
The R implementation
Web usage mining with web logs
The FCA-based association rule mining algorithm
The R implementation
IV. Module 4: Mastering R for Quantitative Finance
1. Time Series Analysis
Multivariate time series analysis
Cointegration
Vector autoregressive models
VAR implementation example
Cointegrated VAR and VECM
Volatility modeling
GARCH modeling with the rugarch package
The standard GARCH model
The Exponential GARCH model (EGARCH)
The Threshold GARCH model (TGARCH)
Simulation and forecasting
References and reading list
2. Factor Models
Arbitrage pricing theory
Implementation of APT
Fama-French three-factor model
Modeling in R
Data selection
Estimation of APT with principal component analysis
Estimation of the Fama-French model
References
3. Forecasting Volume
Motivation
The intensity of trading
The volume forecasting model
Implementation in R
The data
Loading the data
The seasonal component
AR(1) estimation and forecasting
SETAR estimation and forecasting
Interpreting the results
References
4. Big Data – Advanced Analytics
Getting data from open sources
Introduction to big data analysis in R
K-means clustering on big data
Loading big matrices
Big data K-means clustering analysis
Big data linear regression analysis
Loading big data
Fitting a linear regression model on large datasets
References
5. FX Derivatives
Terminology and notations
Currency options
Exchange options
Two-dimensional Wiener processes
The Margrabe formula
Application in R
Quanto options
Pricing formula for a call quanto
Pricing a call quanto in R
References
6. Interest Rate Derivatives and Models
The Black model
Pricing a cap with Black's model
The Vasicek model
The Cox-Ingersoll-Ross model
Parameter estimation of interest rate models
Using the SMFI5 package
References
7. Exotic Options
A general pricing approach
The role of dynamic hedging
How R can help a lot
A glance beyond vanillas
Greeks – the link back to the vanilla world
Pricing the Double-no-touch option
Another way to price the Double-no-touch option
The life of a Double-no-touch option – a simulation
Exotic options embedded in structured products
References
8. Optimal Hedging
Hedging of derivatives
Market risk of derivatives
Static delta hedge
Dynamic delta hedge
Comparing the performance of delta hedging
Hedging in the presence of transaction costs
Optimization of the hedge
Optimal hedging in the case of absolute transaction costs
Optimal hedging in the case of relative transaction costs
Further extensions
References
9. Fundamental Analysis
The basics of fundamental analysis
Collecting data
Revealing connections
Including multiple variables
Separating investment targets
Setting classification rules
Backtesting
Industry-specific investment
References
10. Technical Analysis, Neural Networks, and Logoptimal Portfolios
Market efficiency
Technical analysis
The TA toolkit
Markets
Plotting charts - bitcoin
Built-in indicators
SMA and EMA
RSI
MACD
Candle patterns: key reversal
Evaluating the signals and managing the position
A word on money management
Wrapping up
Neural networks
Forecasting bitcoin prices
Evaluation of the strategy
Logoptimal portfolios
A universally consistent, non-parametric investment strategy
Evaluation of the strategy
References
11. Asset and Liability Management
Data preparation
Data source at first glance
Cash-flow generator functions
Preparing the cash-flow
Interest rate risk measurement
Liquidity risk measurement
Modeling non-maturity deposits
A Model of deposit interest rate development
Static replication of non-maturity deposits
References
12. Capital Adequacy
Principles of the Basel Accords
Basel I
Basel II
Minimum capital requirements
Supervisory review
Transparency
Basel III
Risk measures
Analytical VaR
Historical VaR
Monte-Carlo simulation
Risk categories
Market risk
Credit risk
Operational risk
References
13. Systemic Risks
Systemic risk in a nutshell
The dataset used in our examples
Core-periphery decomposition
Implementation in R
Results
The simulation method
The simulation
Implementation in R
Results
Possible interpretations and suggestions
References
V. Module 5: Machine Learning with R
1. Introducing Machine Learning
The origins of machine learning
Uses and abuses of machine learning
Machine learning successes
The limits of machine learning
Machine learning ethics
How machines learn
Data storage
Abstraction
Generalization
Evaluation
Machine learning in practice
Types of input data
Types of machine learning algorithms
Matching input data to algorithms
Machine learning with R
Installing R packages
Loading and unloading R packages
2. Managing and Understanding Data
R data structures
Vectors
Factors
Lists
Data frames
Matrixes and arrays
Managing data with R
Saving, loading, and removing R data structures
Importing and saving data from CSV files
Exploring and understanding data
Exploring the structure of data
Exploring numeric variables
Measuring the central tendency – mean and median
Measuring spread – quartiles and the five-number summary
Visualizing numeric variables – boxplots
Visualizing numeric variables – histograms
Understanding numeric data – uniform and normal distributions
Measuring spread – variance and standard deviation
Exploring categorical variables
Measuring the central tendency – the mode
Exploring relationships between variables
Visualizing relationships – scatterplots
Examining relationships – two-way cross-tabulations
3. Lazy Learning – Classification Using Nearest Neighbors
Understanding nearest neighbor classification
The k-NN algorithm
Measuring similarity with distance
Choosing an appropriate k
Preparing data for use with k-NN
Why is the k-NN algorithm lazy?
Example – diagnosing breast cancer with the k-NN algorithm
Step 1 – collecting data
Step 2 – exploring and preparing the data
Transformation – normalizing numeric data
Data preparation – creating training and test datasets
Step 3 – training a model on the data
Step 4 – evaluating model performance
Step 5 – improving model performance
Transformation – z-score standardization
Testing alternative values of k
4. Probabilistic Learning – Classification Using Naive Bayes
Understanding Naive Bayes
Basic concepts of Bayesian methods
Understanding probability
Understanding joint probability
Computing conditional probability with Bayes' theorem
The Naive Bayes algorithm
Classification with Naive Bayes
The Laplace estimator
Using numeric features with Naive Bayes
Example – filtering mobile phone spam with the Naive Bayes algorithm
Step 1 – collecting data
Step 2 – exploring and preparing the data
Data preparation – cleaning and standardizing text data
Data preparation – splitting text documents into words
Data preparation – creating training and test datasets
Visualizing text data – word clouds
Data preparation – creating indicator features for frequent words
Step 3 – training a model on the data
Step 4 – evaluating model performance
Step 5 – improving model performance
5. Divide and Conquer – Classification Using Decision Trees and Rules
Understanding decision trees
Divide and conquer
The C5.0 decision tree algorithm
Choosing the best split
Pruning the decision tree
Example – identifying risky bank loans using C5.0 decision trees
Step 1 – collecting data
Step 2 – exploring and preparing the data
Data preparation – creating random training and test datasets
Step 3 – training a model on the data
Step 4 – evaluating model performance
Step 5 – improving model performance
Boosting the accuracy of decision trees
Making some mistakes more costly than others
Understanding classification rules
Separate and conquer
The 1R algorithm
The RIPPER algorithm
Rules from decision trees
What makes trees and rules greedy?
Example – identifying poisonous mushrooms with rule learners
Step 1 – collecting data
Step 2 – exploring and preparing the data
Step 3 – training a model on the data
Step 4 – evaluating model performance
Step 5 – improving model performance
6. Forecasting Numeric Data – Regression Methods
Understanding regression
Simple linear regression
Ordinary least squares estimation
Correlations
Multiple linear regression
Example – predicting medical expenses using linear regression
Step 1 – collecting data
Step 2 – exploring and preparing the data
Exploring relationships among features – the correlation matrix
Visualizing relationships among features – the scatterplot matrix
Step 3 – training a model on the data
Step 4 – evaluating model performance
Step 5 – improving model performance
Model specification – adding non-linear relationships
Transformation – converting a numeric variable to a binary indicator
Model specification – adding interaction effects
Putting it all together – an improved regression model
Understanding regression trees and model trees
Adding regression to trees
Example – estimating the quality of wines with regression trees and model trees
Step 1 – collecting data
Step 2 – exploring and preparing the data
Step 3 – training a model on the data
Visualizing decision trees
Step 4 – evaluating model performance
Measuring performance with the mean absolute error
Step 5 – improving model performance
7. Black Box Methods – Neural Networks and Support Vector Machines
Understanding neural networks
From biological to artificial neurons
Activation functions
Network topology
The number of layers
The direction of information travel
The number of nodes in each layer
Training neural networks with backpropagation
Example – modeling the strength of concrete with ANNs
Step 1 – collecting data
Step 2 – exploring and preparing the data
Step 3 – training a model on the data
Step 4 – evaluating model performance
Step 5 – improving model performance
Understanding Support Vector Machines
Classification with hyperplanes
The case of linearly separable data
The case of nonlinearly separable data
Using kernels for non-linear spaces
Example – performing OCR with SVMs
Step 1 – collecting data
Step 2 – exploring and preparing the data
Step 3 – training a model on the data
Step 4 – evaluating model performance
Step 5 – improving model performance
8. Finding Patterns – Market Basket Analysis Using Association Rules
Understanding association rules
The Apriori algorithm for association rule learning
Measuring rule interest – support and confidence
Building a set of rules with the Apriori principle
Example – identifying frequently purchased groceries with association rules
Step 1 – collecting data
Step 2 – exploring and preparing the data
Data preparation – creating a sparse matrix for transaction data
Visualizing item support – item frequency plots
Visualizing the transaction data – plotting the sparse matrix
Step 3 – training a model on the data
Step 4 – evaluating model performance
Step 5 – improving model performance
Sorting the set of association rules
Taking subsets of association rules
Saving association rules to a file or data frame
9. Finding Groups of Data – Clustering with k-means
Understanding clustering
Clustering as a machine learning task
The k-means clustering algorithm
Using distance to assign and update clusters
Choosing the appropriate number of clusters
Example – finding teen market segments using k-means clustering
Step 1 – collecting data
Step 2 – exploring and preparing the data
Data preparation – dummy coding missing values
Data preparation – imputing the missing values
Step 3 – training a model on the data
Step 4 – evaluating model performance
Step 5 – improving model performance
10. Evaluating Model Performance
Measuring performance for classification
Working with classification prediction data in R
A closer look at confusion matrices
Using confusion matrices to measure performance
Beyond accuracy – other measures of performance
The kappa statistic
Sensitivity and specificity
Precision and recall
The F-measure
Visualizing performance trade-offs
ROC curves
Estimating future performance
The holdout method
Cross-validation
Bootstrap sampling
11. Improving Model Performance
Tuning stock models for better performance
Using caret for automated parameter tuning
Creating a simple tuned model
Customizing the tuning process
Improving model performance with meta-learning
Understanding ensembles
Bagging
Boosting
Random forests
Training random forests
Evaluating random forest performance
12. Specialized Machine Learning Topics
Working with proprietary files and databases
Reading from and writing to Microsoft Excel, SAS, SPSS, and Stata files
Querying data in SQL databases
Working with online data and services
Downloading the complete text of web pages
Scraping data from web pages
Parsing XML documents
Parsing JSON from web APIs
Working with domain-specific data
Analyzing bioinformatics data
Analyzing and visualizing network data
Improving the performance of R
Managing very large datasets
Generalizing tabular data structures with dplyr
Making data frames faster with data.table
Creating disk-based data frames with ff
Using massive matrices with bigmemory
Learning faster with parallel computing
Measuring execution time
Working in parallel with multicore and snow
Taking advantage of parallel with foreach and doParallel
Parallel cloud computing with MapReduce and Hadoop
GPU computing
Deploying optimized learning algorithms
Building bigger regression models with biglm
Growing bigger and faster random forests with bigrf
Training and evaluating models in parallel with caret
A. Reflect and Test Yourself Answers
Module 1: Data Analysis with R
Chapter 1: RefresheR
Chapter 2: The Shape of Data
Chapter 3: Describing Relationships
Chapter 4: Probability
Chapter 5: Using Data to Reason About the World
Chapter 6: Testing Hypotheses
Chapter 7: Bayesian Methods
Chapter 8: Predicting Continuous Variables
Chapter 9: Predicting Categorical Variables
Chapter 10: Sources of Data
Chapter 11: Dealing with Messy Data
Chapter 12: Dealing with Large Data
Module 2: R Graphs
Chapter 1: R Graphics
Chapter 2: Basic Graph Functions
Chapter 3: Beyond the Basics – Adjusting Key Parameters
Chapter 4: Creating Scatter Plots
Chapter 5: Creating Line Graphs and Time Series Charts
Chapter 6: Creating Bar, Dot, and Pie Charts
Chapter 7: Creating Histograms
Chapter 8: Box and Whisker Plots
Chapter 9: Creating Heat Maps and Contour Plots
Module 4: Mastering R for Quantitative Finance
Chapter 1: Time Series Analysis
Chapter 3: Forecasting Volume
Chapter 4: Big Data – Advanced Analytics
Chapter 5: FX Derivatives
Chapter 6: Interest Rate Derivatives and Models
Chapter 7: Exotic Options
Chapter 8: Optimal Hedging
Chapter 9: Fundamental Analysis
Module 5: Machine Learning with R
Chapter 1: Introducing Machine Learning
Chapter 2: Managing and Understanding Data
Chapter 3: Lazy Learning – Classification Using Nearest Neighbors
Chapter 4: Probabilistic Learning – Classification Using Naive Bayes
Chapter 5: Divide and Conquer – Classification Using Decision Trees and Rules
Chapter 6: Forecasting Numeric Data – Regression Methods
Chapter 7: Black Box Methods – Neural Networks and Support Vector Machines
Chapter 8: Finding Patterns – Market Basket Analysis Using Association Rules
B. Bibliography
Index
Part III. Module 3: Learning Data Mining with R