Appendix A. Exercises and Solutions

Exercises

Here, we provide the exercises for most chapters and recommend that you practice your newly acquired skills after reading each chapter.

Chapter 1 – Setting GNU R for Predictive Modeling

Chapter 1 already contains the exercises and solutions.

Chapter 2 – Visualizing and Manipulating Data Using R

Have a look at the following exercises and try to perform the required tasks.

Let's have a little fun now! For this exercise, imagine a player betting on red for 1,000 consecutive trials. Plot the variations in the player's money throughout the game, using the isRed attribute of the data frame Data that we built at the beginning of the chapter. The player starts with $1,000 and bets $1 on every game. The worst possible outcome is leaving with nothing, but also without debts. The line graph has to be wide rather than tall, that is, 10 x 4 inches (see the documentation of the par() function to find out how to configure this). Does the player end up winning or losing money?
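A minimal sketch of the bankroll computation follows. The Data frame from the chapter is not reproduced here, so the spins are simulated as a stand-in: the vector is_red, the seed, and the red probability of 18/37 (European wheel) are assumptions, and with the chapter's data Data$isRed would take the place of is_red. Requesting the wide layout with par(pin = ...) is one possible reading of the hint about par(), not necessarily the book's solution.

```r
# Stand-in for Data$isRed: 1,000 simulated European roulette spins
# (18 red pockets out of 37). This vector is an assumption.
set.seed(1)
is_red <- runif(1000) < 18 / 37

# Win $1 on red, lose $1 otherwise, starting from $1,000
money <- 1000 + cumsum(ifelse(is_red, 1, -1))

# The player leaves without debts: once the money reaches 0, it stays at 0
broke <- which(money <= 0)
if (length(broke) > 0) {
  money[broke[1]:length(money)] <- 0
}

# Draw only in an interactive session
if (interactive()) {
  dev.new(width = 12, height = 6)  # device large enough for the plot region
  par(pin = c(10, 4))              # plot region: 10 inches wide, 4 inches tall
  plot(money, type = "l", xlab = "Trial", ylab = "Money ($)")
}

final_money <- money[length(money)]
```

Comparing final_money with 1,000 answers the question for this particular run.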

Chapter 3 – Data Visualization with Lattice

Here, simply plot the relationship between Petal.Length and Petal.Width in the iris dataset and include the regression line.
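One way to approach this is sketched below, assuming the lattice function xyplot() with a type argument that combines points and a regression line. Which attribute goes on which axis is an assumption, as are the axis labels and the object name petal_plot.

```r
library(lattice)

# type = c("p", "r") draws the points and a least-squares regression line
petal_plot <- xyplot(Petal.Width ~ Petal.Length, data = iris,
                     type = c("p", "r"),
                     xlab = "Petal length", ylab = "Petal width")

# A trellis object is drawn when it is printed
if (interactive()) print(petal_plot)
```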

Chapter 4 – Cluster Analysis

Here, simply determine the best number of clusters in the iris dataset (omit the Species attribute), using several distance measures (use distance = "euclidean", distance = "maximum", and distance = "manhattan"). Always use method = "kmeans". What is the best number of clusters for each distance (use a majority rule)? Do the results surprise you?

Chapter 5 – Agglomerative Clustering Using hclust()

Use hclust() to perform clustering on the iris dataset (omit the Species attribute). Use different methods for distance calculation (configurable using the method argument of the dist() function) and different linkage options (configurable using the method argument of the hclust() function).
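The combinations can be explored with a loop such as the sketch below. The particular distances and linkages chosen, the cut at three clusters, and the cross-tabulation against Species are assumptions made for illustration; the exercise itself leaves these choices open.

```r
# Numeric attributes only (Species omitted)
iris_num <- iris[, 1:4]

distances <- c("euclidean", "manhattan", "maximum")  # method argument of dist()
linkages  <- c("complete", "average", "single")      # method argument of hclust()

results <- list()
for (d in distances) {
  for (l in linkages) {
    fit <- hclust(dist(iris_num, method = d), method = l)
    clusters <- cutree(fit, k = 3)  # assumption: cut the tree at 3 clusters
    # Cross-tabulate cluster membership with the omitted Species attribute
    results[[paste(d, l, sep = "/")]] <- table(clusters, iris$Species)
  }
}

names(results)
```

Printing, for example, results[["euclidean/complete"]] shows how the clusters of one combination line up with the species; calling plot() on a fitted hclust object draws its dendrogram.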

Chapter 6 – Dimensionality Reduction with Principal Component Analysis

The bfi dataset (in the psych package) contains the responses of 2,800 participants to the Big Five Inventory (http://www.ocf.berkeley.edu/~johnlab/bfi.htm), which measures the five dimensions of personality. The dataset contains 25 items of the inventory, five per dimension of personality: Neuroticism (N1-N5), Extraversion (E1-E5), Conscientiousness (C1-C5), Agreeableness (A1-A5), and Openness (O1-O5), as well as the variables gender, education, and age at the end of the data frame.

Perform the following using the 25 items:

Examine the missing values.
Perform the diagnostics (omit cases with missing values). What do you find?
Run PCA using princomp().
Plot the eigenvalues to determine the number of components to be retained.
Rerun the analysis with that number of components using principal() with the varimax rotation and save the PCA scores.
What is the proportion of cumulative variance explained by all the components?
Name the components by looking at the loadings.
What is the relationship (correlation) between each component and the attribute age?

Chapter 7 – Exploring Association Rules with Apriori

Using the ICU dataset, without the attribute race, obtain the association rules with support = 0.1, confidence = 0.8, minlen = 2 containing pco=<=45 as an antecedent. You should obtain 13 rules. Convert the rules object to a data frame. Create an object containing the significance values of Fisher's exact test for these rules (rounded to two decimal places), and append it as a column to the data frame you just created. Visualize the relationship between lift and the significance of Fisher's exact test (the p value) using the plot() function.

Chapter 8 – Probability Distributions, Covariance, and Correlation

Try performing the following exercises:

Adapt the code we used when discussing the binomial distribution to compute the probability of getting a red number in European roulette spins between:
- 40 and 49 times
- 51 and 60 times
Are these numbers different? Why or why not?
Compute the correlation between petal length and petal width in the iris dataset using the cor.test() function. Is the correlation positive or negative? Is it significant?
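Both tasks can be sketched in base R as follows. The exercise does not restate the number of spins, so n_spins = 100 is an assumption; p_red = 18/37 is the probability of red on a European wheel. The object names are illustrative.

```r
n_spins <- 100    # assumption: the exercise does not state the number of spins
p_red <- 18 / 37  # 18 red pockets out of 37 on a European wheel

# Probability of the number of reds falling in each range (inclusive)
p_40_49 <- sum(dbinom(40:49, size = n_spins, prob = p_red))
p_51_60 <- sum(dbinom(51:60, size = n_spins, prob = p_red))

# Pearson correlation test between petal length and petal width
petal_cor <- cor.test(iris$Petal.Length, iris$Petal.Width)
petal_cor$estimate  # sign and size of the correlation
petal_cor$p.value   # significance
```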

Chapter 9 – Linear Regression

Try performing the following exercises:

Using the nurses dataset, examine the effect of work-family conflict (attribute WFC) on work satisfaction (WorkSat) in a first model called model01.
Create a second model called model02, in which you include WFC and exhaustion (Exhaus) as predictors of WorkSat.

What happens to the relationship between WFC and WorkSat?

Test the relationship between WFC (predictor) and Exhaus (criterion).
If it is significant, perform a Sobel test for the mediation of the relationship between WFC and WorkSat by Exhaus.
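For reference, the Sobel statistic can be computed by hand from the coefficients of two regressions, as in the sketch below. The function sobel_z() is a hypothetical helper written for this illustration (the chapter may rely on a packaged function instead), and the coefficients in the example call are made up rather than taken from the nurses dataset.

```r
# Sobel z statistic for an indirect effect a * b
#   a, se_a: coefficient and standard error of predictor -> mediator
#   b, se_b: coefficient and standard error of mediator -> outcome
#            (from the model that also includes the predictor)
sobel_z <- function(a, se_a, b, se_b) {
  z <- (a * b) / sqrt(b^2 * se_a^2 + a^2 * se_b^2)
  p <- 2 * pnorm(-abs(z))  # two-sided p value from the normal distribution
  c(z = z, p.value = p)
}

# Made-up coefficients, for illustration only
sobel_z(a = 0.5, se_a = 0.1, b = 0.4, se_b = 0.1)
```

In this exercise, a and se_a would come from the model predicting Exhaus from WFC, and b and se_b from the Exhaus coefficient in model02.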

Chapter 10 – Classification with k-Nearest Neighbors and Naïve Bayes

In this exercise, you will try to classify the observations in the Ozone dataset using knn(). The class is the season and is computed (approximately) as follows:

library(mlbench)
data(Ozone)
Oz = na.omit(Ozone)
Oz$season = rep("winter", length(Oz[,1]))
Oz$season[as.numeric(Oz[[1]])>=3 & as.numeric(Oz[[1]])<=5] = "spring"
Oz$season[as.numeric(Oz[[1]])>=6 & as.numeric(Oz[[1]])<=8] = "summer"
Oz$season[as.numeric(Oz[[1]])>=9 & as.numeric(Oz[[1]])<=11] = "autumn"

You will determine the best number of neighbors on the basis of the kappa value in the training set (higher is better). Finally, based on the kappa value in the testing set with the best number of neighbors, would you trust the classification?

The training and testing datasets are obtained as follows:

set.seed(5)
Oz$samples = sample(0:1, nrow(Oz), replace = TRUE)
TRAIN = subset(Oz, samples == 0)
TEST = subset(Oz, samples == 1)

The class (season, the target attribute) is in column 14. Do not include columns 1 and 15 in the analyses. If you select the class by subsetting, take care to unlist it, for instance with the unlist() function; otherwise, use the df$attribute notation for the class.
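Kappa compares the observed agreement between predicted and actual classes with the agreement expected by chance. The base R sketch below shows that computation; cohen_kappa() is a hypothetical helper written for this illustration (the chapter may obtain kappa from a package instead), and the toy labels are made up.

```r
# Cohen's kappa from predicted and actual class labels
cohen_kappa <- function(predicted, actual) {
  lv <- union(as.character(actual), as.character(predicted))
  tab <- table(factor(predicted, levels = lv), factor(actual, levels = lv))
  n <- sum(tab)
  p_observed <- sum(diag(tab)) / n                      # observed agreement
  p_expected <- sum(rowSums(tab) * colSums(tab)) / n^2  # chance agreement
  (p_observed - p_expected) / (1 - p_expected)
}

# Toy labels: 35 of 50 predictions agree with the actual class
actual    <- c(rep("a", 25), rep("b", 25))
predicted <- c(rep("a", 20), rep("b", 5), rep("a", 10), rep("b", 15))
toy_kappa <- cohen_kappa(predicted, actual)
```

In the exercise, predicted would hold the classes returned by knn() for a given number of neighbors, and actual the season attribute of the corresponding dataset.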

Chapter 11 – Classification Trees

Classify the observations in the iris dataset (class is Species) using C4.5 (pruned tree) and CART (using the default arguments). Which produces the best classification in terms of accuracy in the testing set? Create a function that assesses accuracy.

The training and testing sets are generated as follows:

# Odd-numbered rows form the training set
IRIStrain = iris[as.numeric(row.names(iris)) %% 2 == 1,]
# Even-numbered rows form the testing set
IRIStest = iris[as.numeric(row.names(iris)) %% 2 == 0,]
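An accuracy function can be as small as the sketch below, which returns the proportion of predictions that match the actual classes. The name accuracy() and its two arguments are assumptions made for this illustration; the labels in the toy call are made up.

```r
# Proportion of predicted labels that equal the actual labels
accuracy <- function(predicted, actual) {
  # Compare as character so factors with different level sets do not fail
  mean(as.character(predicted) == as.character(actual))
}

# Toy check: 3 of the 4 labels match
toy_accuracy <- accuracy(c("setosa", "setosa", "virginica", "versicolor"),
                         c("setosa", "setosa", "virginica", "virginica"))
```

For the exercise, predicted would be the classes a fitted tree assigns to IRIStest, and actual would be IRIStest$Species.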

Chapter 12 – Multilevel Analyses

Try performing the following exercises:

Using the NursesML dataset, visualize whether the relationship between exhaustion (attribute Exhaust) and work satisfaction (WorkSat) varies between hospitals. Include the regression line. Perform the same step for the relationship of depersonalization (Depers) and work satisfaction.
Using the modelPred model, determine what difference in the observed work satisfaction corresponds to an increase of 1 in the predicted values.
What is the intercept of the model (that is, the average value of work satisfaction for the average predicted value)?

Chapter 13 – Text Analytics with R

The tm package contains a corpus of 50 news articles that we access as follows:

data(acq)
acq

What are the terms that occur more than 100 times in this corpus before and after preprocessing with the preprocess() function?

Using the preprocessed data, plot the sorted term frequencies above 10 with terms (row names) on the x axis. Use the barplot() function.
