You will find the solutions to the exercises below.
One way to solve this is to first create a vector with a size of 1001
(line 1). The first value is the money that the player comes with (assigned on line 2). On line 3, we assign a copy of the isRed
attribute to the wins
vector. On lines 5 to 7, we compute the amount of money at each trial by adding 1 to the money on the next turn if the drawn number is red, and remove 1 if the number is not red. On line 8, we set the graphic parameters to contain a single plot (using argument mfrow
) and to be of the required dimensions (10 by 4 inches) using the pin
argument. Finally, we generate the plot on line 9:
1 money = rep(0,1001) 2 money[1] = 1000 3 wins = Data$isRed 4 for (i in 1:nrow(Data)) { 5 if (wins[i] == 1) money[i+1] = money[i] + 1 6 else money[i+1] = money[i] - 1 7 } 8 par (mfrow = c(1,1), pin=c(5,2)) 9 plot(money, type = "l", xlab = "Trial number")
Run the code and have a look at your screen to know more about the outcomes of the player's game!
We plot these elements as follows:
library(lattice) xyplot(Petal.Length ~ Petal.Width | Species, data = iris, panel = function(x, y, ...) { panel.xyplot(x,y) panel.abline(lm(y~x)) } )
The following code performs all the operations:
library(NbClust) NbClust(iris[1:4],distance="euclidean", method= "kmeans") NbClust(iris[1:4],distance="manhattan", method= "kmeans") NbClust(iris[1:4],distance="maximum", method= "kmeans")
The outputs all show that a two-cluster solution is the best according to a majority rule. This is surprising because we know there are three classes in the dataset, and kmeans()
is very accurate at classifying the cases. But apparently, this is not the best way to classify the data. Without prior knowledge, we could have concluded that there are observations for only two species of flowers in the dataset.
We first create a vector with all the possible distances allowed by the dist()
function; we then create a list, called d
, of one element—the first distance matrix (with the Euclidean distance). We then add the other distance matrices to the list. Please note that the binary distance was included here for demonstration purposes, but the binary distance is of no practical use with quantitative data:
1 dist.types = c("euclidean", "maximum", "manhattan", "canberra", 2 "binary", "minkowski") 3 d = list(dist(iris[1:4], method = dist.types[1])) 4 for (i in 2:length(dist.types)) d = c(d,list(dist(iris[1:4], 5 method = dist.types[i])))
Next, we create a vector with all the possible agglomerating methods allowed by hclust()
. We then perform the analysis six times for each of these methods (once per distance matrix), saving the result of each iteration in a matrix of lists:
1 agglo.types = c("ward.D", "ward.D2", "single", "complete", 2 "average", "mcquitty", "median", "centroid") 3 M = array(list(), 8) 4 for (i in 1:length(agglo.types)) { 5 OneDList = list(hclust(d[[1]], method = agglo.types[i])) 6 for (j in 2:length(dist.types)) OneDList = 7 c(OneDList, list(hclust(d[[j]], method = agglo.types[i]))) 8 M[[i]] = OneDList 9 }
The 48 hclust
models stored in the M
matrix could be used for selecting the best models, for instance, by examining the plotted models. We can plot results with complete linkage and Euclidean distance (graph not included) like this:
plot(M[[4]][[1]])
The reader can notice that the results using binary distance (for example, with complete linkage as follows) are meaningless:
plot(M[[4]][[5]])
Here are the solutions. Before anything, let's make sure the psych
package is loaded:
library(psych)
summary(bfi[,1:25])
We can see that there are between 9 and 36 missing values for each of the items.
We assign the cases without missing values to a new object called Dat
, retaining only the bfi
measures:
Dat = na.omit(bfi[,1:25])
We can know how many cases of the original dataset contain missing values using the following line of code. The answer is 364
:
dim(bfi)[1] - dim(Dat)[1]
cortest.normal(Dat)
The results show that the data set is significantly different from an identity matrix. We can examine the KMO index using the following code:
KMO(Dat)
The results show that the overall MSA value is large (0.85), as are most individual item values (lowest at 0.76). We deduce from these tests that performing PCA on this data is warranted.
pcas = princomp(Dat)
plot(pcas$sd^2)
We find that a five-factor solution seems optimal, which is great as we have five theoretical factors of personality (the questionnaire is designed that way)
pcas2 = principal(Dat,nfactors=5, rotate="varimax", scores=T)
We add the factorial scores to the dataset as follows:
Dat = cbind(Dat, pcas2$scores)
Cumulative var
, last column).pcas2
You will find the solution to this exercise below.
library(vcdExtra) ICU_norace = ICU[-4]
library(arules)
ICU_norace$age = cut(ICU_norace$age, breaks = 4) ICU_norace$systolic = cut(ICU_norace$systolic, breaks = 4) ICU_norace$hrtrate = cut(ICU_norace$hrtrate, breaks = 4) ICU_norace_tr = as(ICU_norace, "transactions")
rulesLowpco = apriori(ICU_norace_tr, parameter = list(confidence = 0.8, support=.1, minlen = 2), appearance = list(lhs = c("pco=<=45"), default="rhs"))
rulesLowpco.df = as(rulesLowpco,"data.frame")
IM = interestMeasure(rulesLowpco, "fishersExactTest", ICU_norace_tr)
rulesLowpco.df$IM= round(IM, digits = 2)
plot (rulesLowpco.df$lift, rulesLowpco.df$IM, xlab = "lift", ylab = "Significance")
The following code will compute the probabilities of each outcome of interest:
1 p = 18/37 2 N = 100 3 n = 1 4 v40_49 = rep(1,10) 5 v51_60 = v40_49 6 i=1 7 for (n in 40:49) { 8 v40_49[i] = choose(N, n) * (p^n) * (1 - p)^(N-n) 9 i = i + 1 10 } 11 i=1 12 for (n in 51:60) { 13 v51_60[i] = choose(N, n) * (p^n) * (1 - p)^(N-n) 14 i = i + 1 15 }
sum(v40_49)
sum(v51_60)
So it would be better to bet on black, right ? Well, this wouldn't make a difference. Because of the presence of the 0, which isn't black nor red, the probabilities of winning a several bets is strongly lower than the probabilities of losing, as the risk of losing each trial is higher than the chances of winning (19/37 versus 18/37).
cor.test(iris$Petal.Width,iris$Petal.Length)
The correlation between these attributes is around 0.963, and is significant at p < .001
.
WFC
) on work satisfaction (WorkSat
) in the first model called model01
:model01 = lm(WorkSat ~ WFC, data = nurses)
model02
, in which you include WFC
and exhaustion (Exhaus
) as predictors of WorkSat
:model02 = lm(WorkSat ~ WFC + Exhaus, data = nurses)
WFC
and Worksat
in both models using the following code. We notice that the effect of WFC
on work satisfaction is significant in model01
, but not anymore in model02
, that is when exhaustion is included in the model.summary(model01) summary(model02)
model03
with WFC
included as a predictor of exhaustion:model03 = lm(Exhaus ~ WFC, data = nurses)
The summary of model03
shows that WFC
is a significant predictor of Exhaus
:
summary(model03)
WFC
and WorkSat
by Exhaus
as follows:library(bda) mediation.test(nurses$Exhaus, nurses$WFC, nurses$WorkSat)
We can see that the value for the Sobel test is significant, which attests to the presence of the mediation
The following code will generate the kappas
values for numbers of neighbors ranging from 2 to 10 (the first value is for two neighbors, the second for three):
1 library(psych); library(class) 2 kappas = rep(0,9) 3 for (i in 1:9) { 4 set.seed(2222) 5 pred.train = knn(train = TRAIN[,2:13], test = TRAIN[,2:13], 6 cl = TRAIN$season, k = i+1) 7 tab = table(pred.train,TRAIN[,14]) 8 kappas[i] = cohen.kappa(tab)[1] 9 }
Using the following code, we notice that the two clusters' solution has the highest kappa:
which.max(kappas)
We therefore use this solution to predict our data in the testing set and compute the kappa value, which is only 0.19 (that's pretty bad!):
pred.test = knn(train = TRAIN[,2:13], test = TEST[,2:13], cl = TRAIN$season, k = 2) cohen.kappa(table(pred.test,TEST[,14]))[1]
The C4.5 model is generated as follows:
library(RWeka) IRIS.C45 = J48(Species~ . , data= IRIStrain, control= Weka_control(U=FALSE))
We compute the predictions as follows:
C45.preds = predict(IRIS.C45, IRIStest)
The CART model is generated as follows:
library(rpart) IRIS.CART = rpart(Species ~. , data= IRIStrain)
The following line of code computes a data frame with the predicted probabilities:
CART.probs = as.data.frame(predict(IRIS.CART, IRIStest))
We now obtain the column number in which the predicted probabilities are the highest for each case:
CART.preds = apply(CART.probs, 1, which.max)
Finally, we assign the name of that column to each observation:
for (i in 1:length(CART.preds)) { col= as.numeric(CART.preds[i]) CART.preds[i] = names(CART.probs[col]) }
Let's obtain the confusion matrix for both classifications:
CONF.C45 = table(IRIStest$Species, C45.preds) CONF.CART = table(IRIStest$Species, CART.preds)
We create a function that computes the accuracy of a confusion matrix for us:
acc = function(table) { sum(diag(dim(table)[1])*table) / sum(CONF.C45) }
We can see that, for this dataset, CART has a little advantage in accuracy (0.95 versus 0.92):
acc(CONF.C45) acc(CONF.CART)
We obtain the graph for the relationship between Exhaust
and WorkSat
as follows:
library(lattice) attach(NursesML) xyplot(WorkSat~Exhaust | as.factor(hosp), panel = function(x,y) { panel.xyplot(x,y) panel.lmline(x,y) })
We generate the second graph as follows:
xyplot(WorkSat~Depers | as.factor(hosp), panel = function(x,y) { panel.xyplot(x,y) panel.lmline(x,y) })
We obtain the answer to questions 2 and 3 using the following line of code. The first value is the intercept, the second value is the coefficient for the predicted value (the increase on the actual work satisfaction for an increase of 1 in the predicted value):
coeffs(modelPred)
This is easily done as follows. For the original corpus type the following code:
DTM = as.matrix(DocumentTermMatrix(acq)) FR = colSums(DTM) FR[FR>100]
For the preprocessed corpus type the following code:
acq2 = preprocess(acq) DTM2 = as.matrix(DocumentTermMatrix(acq2)) FR2 = colSums(DTM2) FR2[FR2>100]
We obtain the graph as follows:
barplot(sort(FR2[FR2>30]), names.arg = rownames(sort(FR2[FR2>30])))
18.191.181.252