This chapter will present an overview of how to summarize your data and get useful statistical information for downstream analysis. We will also show you how to plot and get statistical information from probability distributions and how to test the fit of your sample distribution to well-defined probability distributions. We will also go over some of the functions used to perform hypothesis testing including the Student's t-test, Wilcoxon rank-sum test, z-test, chi-squared test, Fisher's exact test, and F-test.
Before we begin, we will load the gene expression profiling data from the E-GEOD-19577 study entitled "MLL partner genes confer distinct biological and clinical signatures of pediatric AML, an AIEOP study" from the ArrayExpress website to use as a sample dataset for some of our examples. For simplicity, we will not go into the details of how the data was generated, except to mention that the study evaluates the expression level of 54,675 probes in 42 leukemia patients' samples. If you would like to learn more about the study, please consult the experiment web page at http://www.ebi.ac.uk/arrayexpress/experiments/E-GEOD-19577. Here are the steps we will follow to load the data into R:
E-GEOD-19577.eSet.r
).load()
function. This command will create the study
object, which contains the raw experimental data.study
object as MLLpartner.dataset
.Biobase
and affy
bioconductor packages.rma()
function.These steps can be implemented in R, as follows:
> load(url("http://www.ebi.ac.uk/arrayexpress/files/E-GEOD-19577/E-GEOD-19577.eSet.r")) > MLLpartner.ds <- study > library("affy") > library("Biobase") > AEsetnorm = rma(MLLpartner.ds) Background correcting Normalizing Calculating Expression > head(exprs(AEsetnorm)) #output shown truncated GSM487973 GSM487972 GSM487971 GSM487970 GSM487969 1007_s_at 4.372242 4.293080 4.210850 4.707231 4.345101 1053_at 8.176033 8.541016 8.338475 7.935021 7.868985 117_at 5.730343 8.723568 5.172717 5.404062 5.731468 121_at 7.744487 6.951332 7.202343 7.158402 6.959318 1255_g_at 2.707115 2.717625 2.699625 2.698669 2.701679 1294_at 9.077232 7.611238 9.649630 7.911132 9.732346
Now, let's get the expression values for two probes to be used in our examples, as follows:
> probeA <- as.numeric(exprs(AEsetnorm)[1,]) > probeA <- setNames(probeA, colnames(exprs(AEsetnorm))) > probeB <- as.numeric(exprs(AEsetnorm)[2,]) > probeB <-setNames(probeB, colnames(exprs(AEsetnorm)))
Now, let's create a matrix with all the expression values for each probe evaluated in the 42 patient samples, as follows:
> MLLpartner.mx <- as.matrix(exprs(AEsetnorm)) > #Lets save the object to our session > dump("MLLpartner.mx", "MLLpartner.R") > class(MLLpartner.mx) [1] "matrix" > dim(MLLpartner.mx) [1] 54675 42
A useful tool to evaluate your data before you begin your analysis is the summary()
function, which provides a summary of the non-parametric descriptors of a sample, as follows:
> summary(probeA) Min. 1st Qu. Median Mean 3rd Qu. Max. 4.211 4.645 4.774 4.774 4.892 5.231
You can also get a summary for each column of the matrix using the summary()
function, as follows:
> summary(MLLpartner.mx) #output truncated GSM487973 GSM487972 GSM487971 Min. : 2.112 Min. : 1.805 Min. : 1.994 1st Qu.: 3.412 1st Qu.: 3.410 1st Qu.: 3.411 Median : 4.736 Median : 4.745 Median : 4.731 Mean : 5.342 Mean : 5.346 Mean : 5.355 3rd Qu.: 6.870 3rd Qu.: 6.851 3rd Qu.: 6.933 Max. :14.449 Max. :14.453 Max. :14.406
We can also get this information by calling the individual functions used to determine these parameters, namely the mean
, median
, min
, max
, and quantile
functions, as follows:
> min(probeA) [1] 4.21085 > max(probeA) [1] 5.231199 > mean(probeA) [1] 4.773866 > median(probeA) [1] 4.774236 > quantile(probeA) 0% 25% 50% 75% 100% 4.210850 4.644994 4.774236 4.892259 5.231199
You can also specify which probabilities to use with the probs
argument, as follows:
> quantile(probeA, probs = c(0.1, 0.2, 0.6, 0.9)) 10% 20% 60% 90% 4.375377 4.576501 4.821101 5.118735
To avoid getting a string of numbers in your output, you can specify the number of decimal places to be displayed using the round()
function, as follows:
> round(mean(probeA), 2) [1] 4.77
Suppose we also had information on the response to the drugA
treatment for those 42 patient samples, we could include it to the probeA
and probeB
gene expression levels, as follows:
> df <- data.frame(expr_probeA=probeA, expr_probeB=probeB, drugA_response= factor(rep(c("success", "fail"), 21))) > head(df) expr_probeA expr_probeB drugA_response GSM487973 4.372242 8.176033 success GSM487972 4.293080 8.541016 fail GSM487971 4.210850 8.338475 success GSM487970 4.707231 7.935021 fail GSM487969 4.345101 7.868985 success GSM487968 4.586062 7.909702 fail
Now, we can get a summary for each column by response to the drugA
treatment using the by()
function, as follows:
> by(df, df$drugA_response, summary) df$drugA_response: fail expr_probeA expr_probeB drugA_response Min. :4.293 Min. :6.960 fail :21 1st Qu.:4.687 1st Qu.:7.935 success: 0 Median :4.766 Median :8.245 Mean :4.786 Mean :8.201 3rd Qu.:4.895 3rd Qu.:8.575 Max. :5.218 Max. :8.926 ------------------------------------------------- df$drugA_response: success expr_probeA expr_probeB drugA_response Min. :4.211 Min. :6.652 fail : 0 1st Qu.:4.571 1st Qu.:7.597 success:21 Median :4.776 Median :7.921 Mean :4.762 Mean :7.950 3rd Qu.:4.885 3rd Qu.:8.338 Max. :5.231 Max. :9.033
In addition to general information from the mean
, median
, and quantiles
functions, we are often interested in knowing how variable our data points are from each other. The simplest measure of variability is the range, which is the difference between the largest and smallest value. We can obtain the range by subtracting the maximum value from the minimum value, or simply using the range()
function, as shown in the following code:
> max(probeA) - min(probeA) [1] 1.02035 > range(probeA) [1] 4.210850 5.231199
Other measures of variability include the variance and standard deviation. The variance is defined as the average squared deviation of the data values from the mean. More formally, the population variance is defined as:
In the preceding formula, N is the size and is the mean of the population given by each data point . The sample variance is defined as:
In the preceding formula, n is the number of samples from the population, n is less than N and is the sample mean.
In other words, the sum of squares divided by the number of data values (n) for a population and the degrees of freedom (n-1) for a sample. We can find the mean using the mean()
function, as follows:
> mean(probeA) [1] 4.773866
We can calculate the sum of squares in R using the sum()
function, as follows:
> probeA.soq <- sum((probeA-mean(probeA))^2) [1] 2.734039
Now, we can easily calculate the unbiased sample variance by dividing the sum of squares by the number of degrees of freedom defined as (n-1), where n is length()
of the probeA
vector, as shown in the following command:
> d.f <- length(probeA) - 1 > probeA.soq/(d.f) [1] 0.06668388
A quicker way to get the variance would be to use the var()
function, as shown in the following command:
> var(probeA) [1] 0.06668388
Another measure of variability is the standard deviation defined as the square root of the sample variance. To get the standard deviation for our probeA
data, we can write the formula using the sqrt()
function to get the square root of the variance or, more simply, use the sd()
function, as shown in the following command:
> sqrt(var(probeA)) [1] 0.2582322 > sd(probeA) [1] 0.2582322
Often, it is important to determine how reliable our measures are by calculating the standard error of the mean, defined as the square root of the variance divided by the number of samples, as shown in the following command:
> sqrt(var(probeA)/length(probeA)) [1] 0.0398461
Another way we can assess the reliability of our measurements is by determining the confidence intervals for the mean of our data points. Confidence intervals (CI) estimate the range the mean would fall in, should the experiment or exercise be repeated. We can calculate the confidence intervals for the mean of our sample distribution by multiplying the standard error by the t value associated with a significance level of equal to 0.025 or 0.975 quantile of the t-distribution with 41 degrees of freedom. The qt()
function gives us the quantiles of the t-distribution for n degrees of freedom, which we can apply in the formula to calculate the confidence interval, as shown in the following code:
> std.err.s2A <- sqrt(var(probeA)/length(probeA)) > qt(.975, d.f)* std.err.s2A [1] 0.08047082
Therefore, the mean of the gene expression values for probeA
is 4.77 ± 0.080 for the 42 samples with a 95 percent confidence interval.
18.118.93.64