Communicating results

Unless an analysis is performed solely for the personal edification of the analyst, the results are going to be communicated, whether to teammates, your company, your lab, or the general public. R programmers have some very capable technology at their disposal for communicating results accurately and attractively.

Following the pattern of some of the other sections in this chapter, we will work through a range of approaches, starting with a poor one and explaining why it's inadequate.

The terrible solution to creating a statistical report is to copy R output into a Word document (or PowerPoint presentation) mixed with prose. Why is this terrible, you ask? Because if one little thing about your analysis changes, you will have to re-copy the new R output into the document manually. If you do this enough times, it's not a matter of if but when you will copy the wrong thing, forget to copy the new output, and so on. This method simply opens up too many vectors for mistakes. Additionally, any time you make a slight change to a plot, update a data source, alter priors, or even change the number of multiple imputation iterations, keeping the document up to date requires a herculean effort on your part.

All better solutions involve having R directly output the document that you will use to communicate your results. RStudio (along with the knitr and rmarkdown packages) makes it very easy for you to have your analysis spit out a paper rendered with LaTeX, a slideshow presentation, or a self-contained HTML webpage. It's even possible to have R directly output a Word document, whose contents are dynamically created using R objects.

The least attractive, but easiest of the alternatives, is to use the Compile Notebook function from the RStudio interface (the button labeled f in Figure 13.1). A pop-up should appear asking you if you want the output in HTML, PDF, or a Word document. Choose one and look at the output.


Figure 13.4: An excerpt from the output of Compile Notebook on our example script

Sure, this may not be the prettiest document in the world, but at least it combines our code (including our informative comments) and results (plots) in a single document. Further, any change to our R script followed by recompiling the notebook will result in a completely updated document for sharing. It's a little bit weird to have our narrative told completely via comments, though, right?
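Incidentally, you don't have to go through the RStudio button at all: the same sort of compilation can be triggered from the console with the rmarkdown package (a minimal sketch; the file name is just a placeholder for our example script):

```
library(rmarkdown)

# compile an R script into a notebook-style HTML report;
# this is roughly what the Compile Notebook button does behind the scenes
render("our-example-script.R", output_format = "html_document")
```

Either way, though, the narrative still has to live entirely in code comments.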

Literate programming is a programming paradigm put forth by the genius computer scientist Donald Knuth (whom we mentioned in the previous chapter). This approach involves interspersing computer code and prose in the same document. Whereas the Compile Notebook feature doesn't allow for prose (except in code comments), the RStudio/knitr/rmarkdown stack allows for an approach to report generation in which the prose/narrative plays a more integral part. To begin, click the New Document button (component e), and choose R Markdown… from the dropdown. Choose a title like example1 in the pop-up window, leave the default output format, and press OK. You should see a document with some unfamiliar symbols in the editor. Finally, click the button labeled Knit HTML (it's the button with the cute image of a ball of yarn), and inspect the output.

Go back to the editor and re-read the code that produced the HTML output. This is R Markdown: a lightweight markup language with easy-to-remember formatting syntax and support for embedded R code.

Besides the auto-generated header, the document consists of two kinds of components. The first is stretches of prose written in Markdown. With Markdown, a range of formatting options can be written in plain text and rendered into many different output formats, like HTML and PDF. These formatting options are simple: *This* produces italic text; **this** produces bold text. For a handy cheat sheet of Markdown formatting options, click the question mark icon (which appears when you are editing R Markdown [.Rmd] documents), and choose Markdown Quick Reference from the dropdown.
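For instance, a few lines of Markdown like the following (a small illustrative snippet, not part of our example document) render as a header, some styled text, and a bulleted list:

```
## Data checking

*This* is italic, **this** is bold, and `this` is rendered as code.

- first item of a bulleted list
- second item of a bulleted list
```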

The second kind of component is snippets of R code called chunks. Each chunk is delimited by a pair of lines containing three backticks (```); the line that opens a chunk looks like ```{r}. Between the curly braces, you can optionally name the chunk, and you can specify any number of chunk options. Note that in example1.Rmd, the second chunk uses the option echo=FALSE; this means that the code snippet plot(cars) will not appear in the final rendered document, even though its output (namely, the plot) will.
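For example, a named chunk with a couple of options might look like the following (the chunk name and the fig.height option are just for illustration):

```{r cars-plot, echo=FALSE, fig.height=4}
# echo=FALSE hides this code in the rendered document, and
# fig.height sets the height (in inches) of the resulting plot
plot(cars)
```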

There's an element of R Markdown that I want to call out explicitly: inline R code. Within stretches of prose, any text between `r and ` is evaluated by the R interpreter and substituted with its result in the final rendered document. Without this mechanism, any specific numbers/information related to the data objects (like the number of observations in a dataset) would have to be hardcoded into the prose, and whenever the code changed, the onus of visiting each of these hardcoded values to make sure they were up to date would be on the report author. Using inline R code to offload this updating onto R eliminates an entire class of common mistakes in report generation.
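For example, a sentence of prose in a report might be written like this (a small sketch using the nyc.sats data frame from the analysis below):

```
Our cleaned data set contains `r nrow(nyc.sats)` schools.
```

When the document is knitted, the backticked expression is replaced with whatever nrow(nyc.sats) evaluates to at render time, so the sentence can never fall out of sync with the data.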

What follows is a re-working of our SAT script in R Markdown. This will give us a chance to look at this technology in more detail, and gain an appreciation for how it can help us achieve our goals of easy-to-manage reproducible, literate research.

---
title: "NYC SAT Scores Analysis"
author: "Tony Fischetti"
date: "November 1, 2015"
output: html_document
---

#### Aim:
To use Bayesian analysis to compare NYC's 2010 
combined SAT scores against the average of the
rest of the country, which, according to
FairTest.com, is 1509


```{r, echo=FALSE}
# options
options(echo=TRUE)
options(stringsAsFactors=FALSE)
```

We are going to use the `assertr` and `runjags`
packages for data checking and MCMC, respectively.
```{r}
# libraries
library(assertr)   # for data checking
library(runjags)   # for MCMC
```

Let's make sure everything is all set with JAGS!
```{r}
testjags()
```
Great!

This data was found in the NYC Open Data Portal:
https://nycopendata.socrata.com
```{r}
link.to.data <- "http://data.cityofnewyork.us/api/views/zt9s-n5aj/rows.csv?accessType=DOWNLOAD"
download.file(link.to.data, "./data/SAT_Scores_NYC_2010.csv")

nyc.sats <- read.csv("./data/SAT_Scores_NYC_2010.csv")
```

Let's give the columns easier names
```{r}
better.names <- c("id", "school.name", "n", "read.mean",
                  "math.mean", "write.mean")
names(nyc.sats) <- better.names
```

There are `r nrow(nyc.sats)` rows but almost 700 NYC schools. We will,
therefore, *assume* that this is a random sample of NYC schools.


Let's first check the veracity of this data...
```{r, error=TRUE}
nyc.sats <- assert(nyc.sats, is.numeric,
                   n, read.mean, math.mean, write.mean)
```

It looks like the check failed because there are "s"s in some rows. (??)
A look at the data set description indicates that the "s" is for schools
with 5 or fewer students. For our purposes, let's just exclude them.


This is a function that takes a vector and replaces all "s"s with NAs.
We then convert each of the affected columns into numerics.
```{r}
remove.s <- function(vec){
  ifelse(vec=="s", NA, vec)
}

nyc.sats$n          <- as.numeric(remove.s(nyc.sats$n))
nyc.sats$read.mean  <- as.numeric(remove.s(nyc.sats$read.mean))
nyc.sats$math.mean  <- as.numeric(remove.s(nyc.sats$math.mean))
nyc.sats$write.mean <- as.numeric(remove.s(nyc.sats$write.mean))
```

Now we are going to remove the schools with 5 or fewer test takers
and calculate a combined SAT score
```{r}
nyc.sats <- nyc.sats[complete.cases(nyc.sats), ]

# Calculate a total combined SAT score
nyc.sats$combined.mean <- (nyc.sats$read.mean +
                           nyc.sats$math.mean +
                           nyc.sats$write.mean)
```
Let's now build a posterior distribution of the true mean of NYC high schools' combined SAT scores. We're not going to look at the summary statistics, because we don't want to bias our priors.
We will use a standard Gaussian model.

```{r, cache=TRUE, results="hide", warning=FALSE, message=FALSE}
the.model <- "
model {
  # priors
  mu ~ dunif(0, 2400)
  stddev ~ dunif(0, 500)
  tau <- pow(stddev, -2)

  # likelihood
  for(i in 1:theLength){
     samp[i] ~ dnorm(mu, tau)
  }
}"

the.data <- list(
  samp = nyc.sats$combined.mean,
  theLength = length(nyc.sats$combined.mean)
)

results <- autorun.jags(the.model, data=the.data,
                        n.chains = 3,
                        monitor = c('mu'))
```

Let's view the results of the MCMC.
```{r}
print(results)
```
Now let's plot the MCMC diagnostics
```{r, message=FALSE}
plot(results, plot.type=c("histogram", "trace"), layout=c(2,1))
```

Looks good!


Let's extract the MCMC samples of the mean, and get the
bounds of the middle 95%
```{r}
results.matrix <- as.matrix(results$mcmc)
mu.samples <- results.matrix[,'mu']
bounds <- quantile(mu.samples, c(.025, .975))
```

We are 95% sure that the true mean is between 
`r round(bounds[1], 2)` and `r round(bounds[2], 2)`.

Now let's plot the marginal posterior distribution for the mean
of the NYC high schools' combined SAT scores, and draw the 95%
credible interval.
```{r}
plot(density(mu.samples),
     main=paste("Posterior distribution of mean combined SAT",
                "score in NYC high schools (2010)", sep="
"))
lines(c(bounds[1], bounds[2]), c(0, 0), lwd=3, col="red")
```

Given the results, the SAT scores for NYC high schools in 2010
are **incontrovertibly** not on par with the average SAT scores of
the nation.

------------------------------------

This is some session information for reproducibility:
```{r}
devtools::session_info()
```

This R Markdown document, when rendered by knitting to HTML, looks like this:


Figure 13.5: An excerpt from the output of Knit HTML on our example R Markdown document

Now, that's a handsome document!

A few things to note: First, our contextual narrative is no longer told through code comments; the narrative, code, code output, and plots are all separate and easily distinguished. Second, note that both the number of observations in the data set and the bounds of our credible interval are dynamically woven into the final document. If we change our priors or use a different likelihood function (and we should; see exercise #3), the bounds as they appear in our final report will be automatically updated.

Finally, take a look at the chunk options we've used. We hid the code in our first chunk so that we didn't clutter the final document with option setting. In the sixth chunk, we used the option error=TRUE to let the renderer know that we expected the contained code to fail; the printed error message nicely illustrates why we had to spend the subsequent chunk on data cleaning. In the ninth chunk (the one where we run the MCMC chains), we use quite a few options. cache=TRUE caches the result of the chunk so that, as long as the chunk's code doesn't change, we don't have to wait for the MCMC chains to converge every time we render the document. We use results="hide" to hide the verbose output of autorun.jags. We use warning=FALSE to suppress the warning emitted by autorun.jags informing us that we didn't choose starting values for the chains. Lastly, we use message=FALSE to quiet the message produced by autorun.jags that the rjags namespace is being loaded. autorun.jags sure is chatty!

We may opt to use different chunk options depending on our intended audience. For example, we could hide more of the code—and focus more on the output and interpretation—if we were communicating the results to a party of non-statistical-programmers. On the other hand, we would hide less of the code if we were using the rendered HTML as a pedagogical document to teach budding R programmers how to use R Markdown.
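One convenient way to manage this is a setup chunk at the top of the document that sets document-wide defaults via knitr::opts_chunk$set (a minimal sketch; individual chunks can still override these defaults):

```{r, echo=FALSE}
# hide code, messages, and warnings by default for a
# non-programmer audience; results and plots still appear
knitr::opts_chunk$set(echo = FALSE, message = FALSE, warning = FALSE)
```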

The HTML that is produced can now be uploaded—as a standalone document—to a web server so that the results can be sent to others as a hyperlink. Bear in mind, too, that we are not limited to knitting HTML; we could have just as easily knitted a PDF or Word document. We could have also used R Markdown to produce a slideshow presentation—I use this technology all the time at work.
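Switching between these output formats is largely a matter of changing the output field in the YAML header; for example (illustrative values only):

```
---
title: "NYC SAT Scores Analysis"
author: "Tony Fischetti"
output: pdf_document    # or word_document, or ioslides_presentation
---
```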

You don't have to necessarily use RStudio to produce these handsome, dynamically-generated reports (they can be rendered using only the knitr and rmarkdown packages and a format conversion utility called pandoc), but RStudio makes writing them so easy, you would need a really compelling reason to use any other editor.
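For example, outside of RStudio, the same report can be built from any R session, assuming pandoc is installed and on your path (a minimal sketch):

```
# knit example1.Rmd and convert it (via pandoc) into a
# standalone HTML document alongside the .Rmd file
rmarkdown::render("example1.Rmd", output_format = "html_document")
```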

knitr is a beefy package indeed, and we have only touched the tip of the iceberg in regard to what it is capable of; we didn't cover, for example, customizing the reports with HTML, embedding math equations into the reports, or using LaTeX (instead of R Markdown) for increased flexibility. If you see the power in knitr, and in dynamically-generated literate documents in general, I urge you to learn more about it.
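As a small taste of the math support: LaTeX expressions can be embedded directly in the Markdown prose (a sketch; in HTML output they are rendered with MathJax):

```
The model assumes each score is drawn from a normal distribution,
$y_i \sim \mathcal{N}(\mu, \sigma^2)$, and the posterior for $\mu$ is

$$
p(\mu \mid y) \propto p(y \mid \mu)\, p(\mu)
$$
```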
