Chapter 2. Statistical Methods with R

This chapter presents an overview of how to summarize your data and extract useful statistical information for downstream analysis. We will show you how to plot probability distributions, obtain statistical information from them, and test how well your sample distribution fits well-defined probability distributions. We will also go over some of the functions used to perform hypothesis testing, including the Student's t-test, Wilcoxon rank-sum test, z-test, chi-squared test, Fisher's exact test, and F-test.

Before we begin, we will load the gene expression profiling data from the E-GEOD-19577 study entitled "MLL partner genes confer distinct biological and clinical signatures of pediatric AML, an AIEOP study" from the ArrayExpress website to use as a sample dataset for some of our examples. For simplicity, we will not go into the details of how the data was generated, except to mention that the study evaluates the expression level of 54,675 probes in 42 leukemia patients' samples. If you would like to learn more about the study, please consult the experiment web page at http://www.ebi.ac.uk/arrayexpress/experiments/E-GEOD-19577. Here are the steps we will follow to load the data into R:

  1. Download the R ExpressionSet (E-GEOD-19577.eSet.r).
  2. Load the dataset with the load() function. This command will create the study object, which contains the raw experimental data.
  3. Rename the study object as MLLpartner.ds.
  4. Load the Biobase and affy Bioconductor packages.
  5. Normalize the data with the rma() function.
  6. Inspect the data.

These steps can be implemented in R, as follows:

> load(url("http://www.ebi.ac.uk/arrayexpress/files/E-GEOD-19577/E-GEOD-19577.eSet.r"))
> MLLpartner.ds <- study
> library("affy")
> library("Biobase")
> AEsetnorm <- rma(MLLpartner.ds)
Background correcting
Normalizing
Calculating Expression
> head(exprs(AEsetnorm)) #output shown truncated
          GSM487973 GSM487972 GSM487971 GSM487970 GSM487969
1007_s_at  4.372242  4.293080  4.210850  4.707231  4.345101
1053_at    8.176033  8.541016  8.338475  7.935021  7.868985
117_at     5.730343  8.723568  5.172717  5.404062  5.731468
121_at     7.744487  6.951332  7.202343  7.158402  6.959318
1255_g_at  2.707115  2.717625  2.699625  2.698669  2.701679
1294_at    9.077232  7.611238  9.649630  7.911132  9.732346

Now, let's get the expression values for two probes to be used in our examples, as follows:

> probeA <- as.numeric(exprs(AEsetnorm)[1,])
> probeA <- setNames(probeA, colnames(exprs(AEsetnorm)))
> probeB <- as.numeric(exprs(AEsetnorm)[2,])
> probeB <- setNames(probeB, colnames(exprs(AEsetnorm)))
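
As a side note, indexing a single row of a matrix with dimnames already returns a named numeric vector, so the two-step extraction above can be shortened. The following is a minimal sketch on a small made-up matrix (expr.mx and its values are hypothetical stand-ins for exprs(AEsetnorm), not the study data):

```r
# Hypothetical 3 x 4 expression matrix standing in for exprs(AEsetnorm)
set.seed(1)
expr.mx <- matrix(rnorm(12, mean = 6, sd = 1), nrow = 3,
                  dimnames = list(c("1007_s_at", "1053_at", "117_at"),
                                  paste0("sample", 1:4)))

# Indexing one row by probe name returns a numeric vector
# whose names are the column (sample) names of the matrix
probeA.demo <- expr.mx["1007_s_at", ]
names(probeA.demo)

# The two-step as.numeric() + setNames() approach gives the same result
two.step <- setNames(as.numeric(expr.mx[1, ]), colnames(expr.mx))
identical(probeA.demo, two.step)
```

Selecting by probe name rather than by row position also makes it clearer which probe is being analyzed.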

Now, let's create a matrix with all the expression values for each probe evaluated in the 42 patient samples, as follows:

> MLLpartner.mx <- as.matrix(exprs(AEsetnorm))
> # Let's save the object to a file (MLLpartner.R) in the working directory
> dump("MLLpartner.mx", "MLLpartner.R")
> class(MLLpartner.mx)
[1] "matrix"
> dim(MLLpartner.mx)
[1] 54675    42
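
The dump() function writes R code that recreates the object, so the saved file can be read back later with source(). The following is a minimal sketch of that round trip, assuming a small made-up matrix (demo.mx) and a temporary file rather than the real MLLpartner.mx and MLLpartner.R:

```r
# Hypothetical small matrix standing in for MLLpartner.mx
demo.mx <- matrix(c(4.37, 8.18, 4.29, 8.54), nrow = 2,
                  dimnames = list(c("1007_s_at", "1053_at"),
                                  c("GSM487973", "GSM487972")))

# dump() writes an R-code representation of the object to a file;
# a temporary file is used here so nothing is left in the working directory
dump.file <- tempfile(fileext = ".R")
dump("demo.mx", file = dump.file)

# source() evaluates that code; here it is evaluated in a separate
# environment so the restored copy can be compared with the original
restored <- new.env()
source(dump.file, local = restored)
all.equal(restored$demo.mx, demo.mx)
```

In an interactive session, calling source() on the file without the local argument recreates the object directly in the workspace.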

Descriptive statistics

A useful tool to evaluate your data before you begin your analysis is the summary() function, which reports the minimum, first quartile, median, mean, third quartile, and maximum of a sample, as follows:

> summary(probeA)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  4.211   4.645   4.774   4.774   4.892   5.231

You can also get a summary for each column of the matrix using the summary() function, as follows:

> summary(MLLpartner.mx) #output truncated
   GSM487973        GSM487972        GSM487971     
 Min.   : 2.112   Min.   : 1.805   Min.   : 1.994  
 1st Qu.: 3.412   1st Qu.: 3.410   1st Qu.: 3.411  
 Median : 4.736   Median : 4.745   Median : 4.731  
 Mean   : 5.342   Mean   : 5.346   Mean   : 5.355  
 3rd Qu.: 6.870   3rd Qu.: 6.851   3rd Qu.: 6.933  
 Max.   :14.449   Max.   :14.453   Max.   :14.406  
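
Calling summary() on a matrix summarizes each column (here, each patient sample). If a per-probe summary is wanted instead, one option is to apply summary() over the rows. The following is a minimal sketch on a small simulated matrix (demo.mx is a hypothetical stand-in for MLLpartner.mx):

```r
# Hypothetical 4 probes x 5 samples matrix of simulated expression values
set.seed(42)
demo.mx <- matrix(rnorm(20, mean = 6, sd = 1.5), nrow = 4,
                  dimnames = list(paste0("probe", 1:4),
                                  paste0("sample", 1:5)))

# MARGIN = 1 applies summary() to each row (probe); the result is a
# matrix with one column per probe and one row per summary statistic
probe.summary <- apply(demo.mx, 1, summary)
round(probe.summary, 2)

# For a single statistic, the dedicated row-wise helpers are faster
rowMeans(demo.mx)
```

On a matrix with 54,675 rows, rowMeans() and similar vectorized helpers will generally be much quicker than apply().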

We can also get this information by calling the individual functions used to determine these parameters, namely the mean, median, min, max, and quantile functions, as follows:

> min(probeA)
[1] 4.21085
> max(probeA)
[1] 5.231199
> mean(probeA)
[1] 4.773866
> median(probeA)
[1] 4.774236
> quantile(probeA)
      0%      25%      50%      75%     100%
4.210850 4.644994 4.774236 4.892259 5.231199

You can also specify which probabilities to use with the probs argument, as follows:

> quantile(probeA, probs = c(0.1, 0.2, 0.6, 0.9))
     10%      20%      60%      90%
4.375377 4.576501 4.821101 5.118735

To avoid long strings of digits in your output, you can round the result to a given number of decimal places using the round() function, as follows:

> round(mean(probeA), 2)
[1] 4.77
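
Note that round() changes the value itself, not just how it is printed. The following minimal sketch contrasts rounding the number with formatting it for display only; the value of x is the mean of probeA reported above, hard-coded here so the example is self-contained:

```r
x <- 4.773866              # mean(probeA) as reported above, hard-coded

# round() returns a new number with 2 decimal places
x.rounded <- round(x, 2)
x.rounded

# signif() rounds to a number of significant digits instead
x.signif <- signif(x, 3)
x.signif

# sprintf() leaves x untouched and returns a formatted character string
x.label <- sprintf("%.2f", x)
x.label
```

Formatting with sprintf() is usually preferable when the full-precision value is still needed for later calculations.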

Suppose we also had information on the response to the drugA treatment for those 42 patient samples. We could combine it with the probeA and probeB gene expression levels in a data frame, as follows (here the responses simply alternate between success and fail for illustration):

> df <- data.frame(expr_probeA=probeA, expr_probeB=probeB, drugA_response= factor(rep(c("success", "fail"), 21)))
> head(df)
          expr_probeA expr_probeB drugA_response
GSM487973    4.372242    8.176033        success
GSM487972    4.293080    8.541016           fail
GSM487971    4.210850    8.338475        success
GSM487970    4.707231    7.935021           fail
GSM487969    4.345101    7.868985        success
GSM487968    4.586062    7.909702           fail

Now, we can get a summary for each column by response to the drugA treatment using the by() function, as follows:

> by(df, df$drugA_response, summary)
df$drugA_response: fail
  expr_probeA     expr_probeB    drugA_response
 Min.   :4.293   Min.   :6.960   fail   :21    
 1st Qu.:4.687   1st Qu.:7.935   success: 0    
 Median :4.766   Median :8.245                 
 Mean   :4.786   Mean   :8.201                 
 3rd Qu.:4.895   3rd Qu.:8.575                 
 Max.   :5.218   Max.   :8.926                 
-------------------------------------------------
df$drugA_response: success
  expr_probeA     expr_probeB    drugA_response
 Min.   :4.211   Min.   :6.652   fail   : 0    
 1st Qu.:4.571   1st Qu.:7.597   success:21    
 Median :4.776   Median :7.921                 
 Mean   :4.762   Mean   :7.950                 
 3rd Qu.:4.885   3rd Qu.:8.338                 
 Max.   :5.231   Max.   :9.033
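
When only one statistic per group is needed, aggregate() or tapply() give a more compact result than by() with summary(). The following is a minimal sketch on simulated data (demo.df and its values are hypothetical and only mimic the layout of df):

```r
# Hypothetical data frame with the same column layout as df
set.seed(7)
demo.df <- data.frame(
  expr_probeA    = rnorm(10, mean = 4.8, sd = 0.25),
  expr_probeB    = rnorm(10, mean = 8.0, sd = 0.60),
  drugA_response = factor(rep(c("success", "fail"), 5))
)

# Mean of both expression columns within each response group,
# returned as a data frame with one row per group
group.means <- aggregate(cbind(expr_probeA, expr_probeB) ~ drugA_response,
                         data = demo.df, FUN = mean)
group.means

# tapply() does the same for a single column and returns a named vector
probeA.means <- tapply(demo.df$expr_probeA, demo.df$drugA_response, mean)
probeA.means
```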

Data variability

In addition to the general information provided by the mean, median, and quantile functions, we are often interested in how much our data points vary from each other. The simplest measure of variability is the range, which is the difference between the largest and smallest values. We can obtain it by subtracting the minimum value from the maximum value. The range() function returns the minimum and maximum values themselves, as shown in the following code:

> max(probeA) - min(probeA)
[1] 1.02035
> range(probeA)
[1] 4.210850 5.231199
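
Because range() returns the two extremes rather than their difference, it can be combined with diff() to get the width of the range in one step. A minimal sketch, using a short made-up vector in place of probeA:

```r
# Hypothetical expression values standing in for probeA
x <- c(4.21, 4.64, 4.77, 4.89, 5.23)

# range() returns c(minimum, maximum)
x.range <- range(x)
x.range

# diff() of that pair is maximum - minimum, the width of the range
x.width <- diff(range(x))
x.width
```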

Other measures of variability include the variance and standard deviation. The variance is defined as the average squared deviation of the data values from the mean. More formally, the population variance is defined as:

$$\sigma^2 = \frac{1}{N}\sum_{i=1}^{N}(x_i - \mu)^2$$

In the preceding formula, N is the size of the population and $\mu$ is the population mean, computed from the data points $x_i$. The sample variance is defined as:

$$s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2$$

In the preceding formula, n is the number of samples drawn from the population (n is less than N) and $\bar{x}$ is the sample mean.

In other words, the variance is the sum of squared deviations from the mean divided by the number of data values (N) for a population, or by the degrees of freedom (n-1) for a sample. We can find the mean using the mean() function, as follows:

> mean(probeA)
[1] 4.773866

We can calculate the sum of squares in R using the sum() function, as follows:

> probeA.soq <- sum((probeA - mean(probeA))^2)
> probeA.soq
[1] 2.734039

Now, we can easily calculate the unbiased sample variance by dividing the sum of squares by the number of degrees of freedom defined as (n-1), where n is length() of the probeA vector, as shown in the following command:

> d.f <- length(probeA) - 1
> probeA.soq/(d.f)
[1] 0.06668388

A quicker way to get the variance would be to use the var() function, as shown in the following command:

> var(probeA)
[1] 0.06668388
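
Keep in mind that var() always divides by n-1, so it returns the sample variance. The following minimal sketch, on a small made-up vector, contrasts the population and sample formulas and shows how to recover the population variance from var():

```r
# Hypothetical data: 8 values with mean 5
x <- c(2, 4, 4, 4, 5, 5, 7, 9)
n <- length(x)

# Sum of squared deviations from the mean (32 for these values)
ss <- sum((x - mean(x))^2)

# Population variance divides by n; sample variance divides by n - 1
pop.var  <- ss / n          # 4
samp.var <- ss / (n - 1)    # 32/7

# var() matches the sample variance
all.equal(samp.var, var(x))

# Rescaling var() by (n - 1)/n gives the population variance
pop.var.from.var <- var(x) * (n - 1) / n
all.equal(pop.var, pop.var.from.var)
```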

Another measure of variability is the standard deviation, defined as the square root of the variance. To get the standard deviation for our probeA data, we can take the square root of the variance with the sqrt() function or, more simply, use the sd() function, as shown in the following commands:

> sqrt(var(probeA))
[1] 0.2582322
> sd(probeA)
[1] 0.2582322

Often, it is important to determine how reliable our estimate of the mean is by calculating the standard error of the mean, defined as the square root of the sample variance divided by the sample size (equivalently, the standard deviation divided by the square root of the sample size), as shown in the following command:

> sqrt(var(probeA)/length(probeA))
[1] 0.0398461
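
Base R has no dedicated function for the standard error of the mean, so it can be convenient to wrap the calculation in a small helper. The following is a minimal sketch; the function name std.error and the choice to drop missing values are assumptions made for this illustration:

```r
# Hypothetical helper: standard error of the mean, ignoring missing values
std.error <- function(x) {
  x <- x[!is.na(x)]            # drop NA values before counting
  sd(x) / sqrt(length(x))      # standard deviation over sqrt(sample size)
}

# Made-up data standing in for probeA
x <- c(2, 4, 4, 4, 5, 5, 7, 9)
x.se <- std.error(x)
x.se

# Same result as the sqrt(var/n) form used above
all.equal(x.se, sqrt(var(x) / length(x)))
```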

Confidence intervals

Another way to assess the reliability of our measurements is to determine the confidence interval (CI) for the mean of our data points. A 95 percent confidence interval is a range constructed so that, if the experiment were repeated many times, about 95 percent of the intervals computed this way would contain the true mean. We can calculate the confidence interval for the mean of our sample by multiplying the standard error by the t value associated with a two-sided significance level of α = 0.05, that is, the 0.975 quantile of the t-distribution with 41 degrees of freedom (the 0.025 quantile has the same magnitude with the opposite sign). The qt() function returns the quantiles of the t-distribution for a given number of degrees of freedom, which we can use to calculate the half-width of the confidence interval, as shown in the following code:

> std.err.s2A <- sqrt(var(probeA)/length(probeA))
> qt(.975, d.f) * std.err.s2A
[1] 0.08047082

Therefore, the mean of the gene expression values for probeA across the 42 samples is 4.77 ± 0.08, which gives a 95 percent confidence interval of approximately 4.69 to 4.85.
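
The same interval can be cross-checked against the one reported by the built-in t.test() function. The following is a minimal sketch on simulated data; the vector x is a hypothetical stand-in for probeA, so its numbers will not match the values above:

```r
# Hypothetical sample of 42 simulated expression values
set.seed(123)
x <- rnorm(42, mean = 4.77, sd = 0.26)
n <- length(x)

# Manual calculation: mean +/- t quantile times the standard error
std.err <- sd(x) / sqrt(n)
margin  <- qt(0.975, df = n - 1) * std.err
manual.ci <- c(lower = mean(x) - margin, upper = mean(x) + margin)
manual.ci

# One-sample t.test() reports the same 95 percent interval
builtin.ci <- t.test(x, conf.level = 0.95)$conf.int
all.equal(unname(manual.ci), as.numeric(builtin.ci))
```

Changing conf.level in t.test(), and the matching quantile in qt(), gives intervals at other confidence levels.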
