Here, we provide the exercises for most chapters and recommend that you practice your newly acquired skills after reading each chapter.
Chapter 1 already contains the exercises and solutions.
Have a look at the following exercises and try to perform the required tasks.
Let's have a little fun now! For this exercise, imagine a player betting on red for 1,000 consecutive trials. Plot the variations in the player's money throughout the game. Use the isRed attribute of the data frame Data that we built at the beginning of the chapter. The player starts with $1,000 and bets $1 in every game. The worst possible outcome is leaving with nothing, but also without debts. The line graph has to be wide rather than tall, that is, 10 x 4 inches (see the documentation of the par() function to find out how to configure this). Does the player end up winning or losing money?
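As a starting point, the bankroll trajectory can be sketched as follows. This is only a minimal sketch under stated assumptions: the vector is_red is a simulated, hypothetical stand-in for Data$isRed (here drawn with 18 red pockets out of 37, which may differ from the chapter's data), and the plotting and par() configuration are left to you, as the exercise asks.

```r
set.seed(1)

# Hypothetical stand-in for Data$isRed: 1,000 spins, 18 red pockets out of 37.
is_red <- sample(c(TRUE, FALSE), size = 1000, replace = TRUE,
                 prob = c(18 / 37, 19 / 37))

# A $1 bet on red wins $1 when red comes up and loses $1 otherwise.
gains <- ifelse(is_red, 1, -1)

# Money after each trial, starting from $1,000.
money <- 1000 + cumsum(gains)

# The player cannot go into debt: once the money reaches 0, it stays at 0.
ruined <- cumsum(money <= 0) > 0
money[ruined] <- 0

# Net result at the end of the game (positive means a win).
final_result <- money[length(money)] - 1000
```

With the real data, replace is_red with Data$isRed (restricted to 1,000 trials) and pass money to a line plot.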
Plot the relationship between Petal.Length and Petal.Width in the iris dataset and include the regression line.
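One possible way to approach this uses only base R: fit a linear model with lm() and draw it over the scatterplot with abline(). The choice of Petal.Width as the outcome, the axis labels, and the line color are assumptions made for illustration.

```r
# Fit a simple linear regression of petal width on petal length.
fit <- lm(Petal.Width ~ Petal.Length, data = iris)

# Scatterplot of the two attributes.
plot(Petal.Width ~ Petal.Length, data = iris,
     xlab = "Petal length", ylab = "Petal width")

# Add the fitted regression line to the existing plot.
abline(fit, col = "red")
```

Calling summary(fit) afterwards shows the slope and intercept of the line that was drawn.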
Determine the best number of clusters in the iris dataset (omit the Species attribute) using several distance measures (use distance = "euclidean", distance = "maximum", and distance = "manhattan"). Always use method = "kmeans". What is the best number of clusters for each distance (use a majority rule)? Do the results surprise you?
Use hclust() to perform clustering on the iris dataset (omit the Species attribute). Use different methods for distance calculation (configurable using the method argument of the dist() function) and different linkage options (configurable using the method argument of the hclust() function).
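The combinations of distance and linkage can be explored in a loop, as in the sketch below. The particular distances and linkages, the cut at three clusters, and the purity score (the share of observations that belong to the majority species of their cluster) are assumptions chosen for illustration. Species is used only to evaluate the result, never for clustering.

```r
# Numeric attributes only; Species (column 5) is omitted from the clustering.
X <- iris[, -5]

dist_methods <- c("euclidean", "maximum", "manhattan")
linkages <- c("complete", "average", "single", "ward.D2")

# One row per combination of distance and linkage.
results <- expand.grid(distance = dist_methods, linkage = linkages,
                       stringsAsFactors = FALSE)

# For each combination, cluster, cut the tree into 3 groups, and compute purity.
results$purity <- unname(mapply(function(d, l) {
  tree <- hclust(dist(X, method = d), method = l)
  groups <- cutree(tree, k = 3)
  tab <- table(groups, iris$Species)
  sum(apply(tab, 1, max)) / nrow(X)
}, results$distance, results$linkage))

# Combinations sorted from the highest to the lowest purity.
results <- results[order(-results$purity), ]
```

Plotting an individual tree with plot(hclust(dist(X))) helps to see why some linkages separate the species better than others.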
The bfi dataset (in the psych package) contains the responses of 2,800 participants to the Big Five Inventory (http://www.ocf.berkeley.edu/~johnlab/bfi.htm), which measures the five dimensions of personality. It contains 25 items of the inventory, five per dimension of personality: Neuroticism (N1-N5), Extraversion (E1-E5), Conscientiousness (C1-C5), Agreeableness (A1-A5), and Openness (O1-O5), as well as the variables gender, education, and age at the end of the data frame.
Perform the following using the 25 items:
Run a PCA with princomp().
Run a PCA with principal() with the varimax rotation and save the PCA scores.
Using the ICU dataset, without the attribute race, obtain the association rules with support = 0.1, confidence = 0.8, and minlen = 2 containing pco=<=45 as an antecedent. You should obtain 13 rules. Convert the rules object to a data frame. Create an object containing the significance values of Fisher's exact test for these rules (rounded to two decimal places), and append it as a column to the data frame you just created. Visualize the relationship between lift and the significance of Fisher's exact test (the p value) using the plot() function.
Try performing the following exercises:
Are these numbers different? If so, or if not, why?
cor.test() function. Is the correlation positive or negative? Is it significant?
Try performing the following exercises:
Using the nurses dataset, examine the effect of work-family conflict (attribute WFC) on work satisfaction (WorkSat) in a first model called model01.
Build a second model, model02, in which you include WFC and exhaustion (Exhaus) as predictors of WorkSat. What happens to the relationship between WFC and WorkSat?
Examine the relationship between WFC (predictor) and Exhaus (criterion).
Test the mediation of the relationship between WFC and WorkSat by Exhaus.
In this exercise, you will try to classify the observations in the Ozone dataset using knn(). The class is the season and is computed (approximately) as follows:
    library(mlbench)
    data(Ozone)
    Oz = na.omit(Ozone)
    Oz$season = rep("winter", length(Oz[,1]))
    Oz$season[as.numeric(Oz[[1]])>=3 & as.numeric(Oz[[1]])<=5] = "spring"
    Oz$season[as.numeric(Oz[[1]])>=6 & as.numeric(Oz[[1]])<=8] = "summer"
    Oz$season[as.numeric(Oz[[1]])>=9 & as.numeric(Oz[[1]])<=11] = "autumn"
You will determine the best number of neighbors on the basis of the kappa value in the training set (higher is better). Finally, based on the kappa value in the testing set with the best number of neighbors, would you trust the classification?
The training and testing datasets are obtained as follows:
    set.seed(5)
    Oz$samples = sample(0:1, nrow(Oz), replace = TRUE)
    TRAIN = subset(Oz, samples == 0)
    TEST = subset(Oz, samples == 1)
The class (season, the target attribute) is in column 14. Do not include columns 1 and 15 in the analyses. If you select the class by subsetting, take care to unlist it, for instance with the unlist() function; otherwise, use the df$attribute notation for the class.
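The kappa value compares the observed agreement between predicted and actual classes with the agreement expected by chance. The helper below is a minimal base R sketch of Cohen's kappa. The function name cohen_kappa and the toy season vectors are hypothetical and are not taken from the Ozone data.

```r
# Cohen's kappa for two vectors of class labels of equal length.
cohen_kappa <- function(predicted, actual) {
  # Use a common set of levels so that the confusion table is square.
  lev <- union(as.character(actual), as.character(predicted))
  tab <- table(factor(predicted, levels = lev), factor(actual, levels = lev))
  n <- sum(tab)
  p_observed <- sum(diag(tab)) / n                       # observed agreement
  p_expected <- sum(rowSums(tab) * colSums(tab)) / n^2   # chance agreement
  (p_observed - p_expected) / (1 - p_expected)
}

# Toy labels for illustration only.
actual    <- c("winter", "winter", "spring", "summer", "summer", "autumn")
predicted <- c("winter", "spring", "spring", "summer", "autumn", "autumn")
toy_kappa <- cohen_kappa(predicted, actual)
```

In the exercise, predicted would be the output of knn() for a given number of neighbors and actual would be the season of the corresponding observations.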
Classify the observations in the iris dataset (class is Species) using C4.5 (pruned tree) and CART (using the default arguments). Which produces the best classification in terms of accuracy in the testing set? Create a function that assesses accuracy.
The training and testing sets are generated as follows:
    IRIStrain = iris[as.numeric(row.names(iris)) %% 2 == 1,]
    IRIStest = iris[as.numeric(row.names(iris)) %% 2 == 0,]
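An accuracy function can be as simple as the proportion of matching labels, as in the sketch below. The function toy_rule is a hand-written, hypothetical stand-in for a fitted classifier, used here only so that the example runs without extra packages. It is neither C4.5 nor CART.

```r
# Odd rows form the training set, even rows the testing set.
IRIStrain <- iris[as.numeric(row.names(iris)) %% 2 == 1, ]
IRIStest  <- iris[as.numeric(row.names(iris)) %% 2 == 0, ]

# Proportion of observations whose predicted class equals the actual class.
accuracy <- function(predicted, actual) {
  mean(as.character(predicted) == as.character(actual))
}

# Hypothetical stand-in classifier: two fixed thresholds on the petal attributes.
toy_rule <- function(df) {
  ifelse(df$Petal.Length < 2.5, "setosa",
         ifelse(df$Petal.Width < 1.75, "versicolor", "virginica"))
}

acc_test <- accuracy(toy_rule(IRIStest), IRIStest$Species)
```

For the exercise itself, replace toy_rule(IRIStest) with the predictions of each tree fitted on IRIStrain and compare the two accuracy values.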
Try performing the following exercises:
Using the NursesML dataset, visualize whether the relationship between exhaustion (attribute Exhaust) and work satisfaction (WorkSat) varies between hospitals. Include the regression line. Perform the same step for the relationship between depersonalization (Depers) and work satisfaction.
Using the modelPred model, determine which difference in the observed work satisfaction is obtained from an increase of 1 in the predicted values.
The tm package contains a corpus of 50 news articles that we access as follows:
    data(acq)
    acq
Which terms occur more than 100 times in this corpus before and after preprocessing with the preprocess() function?
Using the preprocessed data, plot the sorted term frequencies above 10 with terms (row names) on the x axis. Use the barplot() function.
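The filtering, sorting, and plotting steps can be sketched with a plain named vector. The terms and counts in term_freq below are made up for illustration and do not come from the acq corpus. With the real data, the vector would hold the term frequencies computed from the preprocessed corpus.

```r
# Hypothetical term frequencies (names are the terms).
term_freq <- c(alpha = 35, beta = 8, gamma = 22, delta = 12, epsilon = 3,
               zeta = 17)

# Keep the terms that occur more than 10 times, most frequent first.
frequent <- sort(term_freq[term_freq > 10], decreasing = TRUE)

# Bar plot with the terms on the x axis; las = 2 rotates the labels.
barplot(frequent, las = 2, ylab = "Frequency")
```

The names of the vector become the bar labels, which is why the terms need to be kept as names (or row names) through the filtering step.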