The R-help mailing list

R-help is the official, main mailing list for general discussion about problems and solutions when using R, with many active users and several dozen e-mails every day. Fortunately, this public mailing list is archived on several sites, and we can easily download the compressed monthly files from, for example, ETH Zurich's R-help archives:

> library(RCurl)
> url  <- 'https://stat.ethz.ch/pipermail/r-help/'
> page <- getURL(url)

Now let's extract the URL of the monthly compressed archives from this page via an XPath query:

> library(XML)
> R.help.toc <- htmlParse(page)
> R.help.archives <- unlist(xpathApply(R.help.toc,
+      "//table//td[3]/a", xmlAttrs), use.names = FALSE)

And now let's download these files to our computer for future parsing:

> dir.create('r-help')
> for (f in R.help.archives)
+     download.file(url = paste0(url, f),
+          destfile = file.path('r-help', f), method = 'curl')

Note

Depending on your operating system and R version, the curl option that we used to download files via the HTTPS protocol might not be available. In such cases, you can try another method, or update the query to use the RCurl, curl, or httr packages.
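For example, a minimal sketch of such an alternative, based on the curl package (assuming it is installed), could replace the above loop; the two-second pause between downloads is an arbitrary courtesy towards the server:

> library(curl)
> for (f in R.help.archives) {
+     # download one monthly archive into the local r-help folder
+     curl_download(paste0(url, f), file.path('r-help', f))
+     # wait a bit so that we do not hammer the server
+     Sys.sleep(2)
+ }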

Downloading these ~200 files takes some time, and you might also want to add a Sys.sleep call in the loop (as in the previous sketch) so as not to overload the server. Anyway, after some time, you will have a local copy of the R-help mailing list in the r-help folder, ready to be parsed for some interesting data:

> lines <- system(paste0(
+     "zgrep -E '^From: .* at .*' ./help-r/*.txt.gz"),
+                 intern = TRUE)
> length(lines)
[1] 387218
> length(unique(lines))
[1] 110028

Note

Instead of loading all the text files into R and running grep there, I pre-filtered the files via the Linux command-line zgrep utility, which can search gzipped (compressed) text files efficiently. If you do not have zgrep installed (it is available on Windows and macOS as well), you can extract the files first and use the standard grep approach with the very same regular expression.
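Alternatively, as R can read gzip-compressed files directly, a rough pure-R equivalent of the above zgrep call might look like the following sketch; note that it is slower and, unlike zgrep, does not prefix each matching line with the file name:

> files <- list.files('r-help', pattern = '\\.txt\\.gz$', full.names = TRUE)
> lines <- unlist(lapply(files, function(f)
+     grep('^From: .* at .*', readLines(gzfile(f)), value = TRUE)))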

So we filtered for all the header lines starting with the From string, which hold information on the senders: the e-mail address and the name. Out of the ~387,000 e-mails, we found around 110,000 unique e-mail sources. To understand the following regular expressions, let's see how one of these lines looks:

> lines[26]
[1] "./1997-April.txt.gz:From: pcm at ptd.net (Paul C. Murray)"

Now let's process these lines by removing the static prefix and extracting the names found between parentheses after the e-mail address:

> lines    <- sub('.*From: ', '', lines)
> Rhelpers <- sub('.*\\((.*)\\)', '\\1', lines)

And we can see the list of the most active R-help posters:

> tail(sort(table(Rhelpers)), 6)
       jim holtman     Duncan Murdoch         Uwe Ligges 
              4284               6421               6455 
Gabor Grothendieck  Prof Brian Ripley    David Winsemius 
              8461               9287              10135

This list seems to be legitimate, right? Although my first guess was that Professor Brian Ripley, with his brief messages, would be the first one on this list. Based on some earlier experiences, I know that matching names can be tricky and cumbersome, so let's verify that our data is clean enough and there's only one version of the Professor's name:

> grep('Brian( D)? Ripley', names(table(Rhelpers)), value = TRUE)
 [1] "Brian D Ripley"
 [2] "Brian D Ripley [mailto:ripley at stats.ox.ac.uk]"
 [3] "Brian Ripley"
 [4] "Brian Ripley <ripley at stats.ox.ac.uk>"
 [5] "Prof Brian D Ripley"
 [6] "Prof Brian D Ripley [mailto:ripley at stats.ox.ac.uk]"
 [7] "         Prof Brian D Ripley <ripley at stats.ox.ac.uk>"
 [8] ""Prof Brian D Ripley" <ripley at stats.ox.ac.uk>"
 [9] "Prof Brian D Ripley <ripley at stats.ox.ac.uk>"
[10] "Prof Brian Ripley"
[11] "Prof. Brian Ripley"
[12] "Prof Brian Ripley [mailto:ripley at stats.ox.ac.uk]"
[13] "Prof Brian Ripley [mailto:ripley at stats.ox.ac.uk] "
[14] "          	Prof Brian Ripley <ripley at stats.ox.ac.uk>"
[15] "  Prof Brian Ripley <ripley at stats.ox.ac.uk>"
[16] ""Prof Brian Ripley" <ripley at stats.ox.ac.uk>"
[17] "Prof Brian Ripley<ripley at stats.ox.ac.uk>"
[18] "Prof Brian Ripley <ripley at stats.ox.ac.uk>"
[19] "Prof Brian Ripley [ripley at stats.ox.ac.uk]"
[20] "Prof Brian Ripley <ripley at toucan.stats>"
[21] "Professor Brian Ripley"
[22] "r-help-bounces at r-project.org [mailto:r-help-bounces at r-project.org] On Behalf Of Prof Brian Ripley"        
[23] "r-help-bounces at stat.math.ethz.ch [mailto:r-help-bounces at stat.math.ethz.ch] On Behalf Of Prof Brian Ripley"

Well, it seems that the Professor used some alternative From addresses as well, so a more valid estimate of the number of his messages should be something like:

> sum(grepl('Brian( D)? Ripley', Rhelpers))
[1] 10816

So using quick regular expressions to extract the names from the e-mails returned most of the information we were interested in, but it seems that we would have to spend a lot more time to get the complete information set. As usual, the Pareto rule applies: we spend around 80 percent of our time on preparing data, and we can get 80 percent of the data in around 20 percent of the whole project timeline.

Due to page limitations, we will not cover data cleansing on this dataset in greater detail at this point, but I highly suggest checking Mark van der Loo's stringdist package, which can compute string distances and similarities to, for example, merge similar names in cases like this.
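As a quick illustration of the idea (the name variants and the method below are chosen just for this example, not taken from the actual workflow), the Jaro-Winkler string distance can highlight name versions that most likely belong to the same person; pairs with a small distance, relative to a threshold you would have to calibrate, could then be mapped to a single canonical name:

> library(stringdist)
> name.variants <- c('Prof Brian Ripley', 'Prof. Brian Ripley',
+     'Professor Brian Ripley', 'Duncan Murdoch')
> stringdistmatrix(name.variants, name.variants, method = 'jw')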

Volume of the R-help mailing list

Besides the sender, these e-mails include some other really interesting data as well. For example, we can extract the date and time when each e-mail was sent, which lets us model the frequency and temporal patterns of the mailing list.

To this end, let's filter for some other lines in the compressed text files:

> lines <- system(paste0(
+     "zgrep -E '^Date: [A-Za-z]{3}, [0-9]{1,2} [A-Za-z]{3} ",
+     "[0-9]{4} [0-9]{2}:[0-9]{2}:[0-9]{2} [-+]{1}[0-9]{4}' ",
+     "./help-r/*.txt.gz"),
+                 intern = TRUE)

This returns fewer lines when compared to the previously extracted From lines:

> length(lines)
[1] 360817

This is due to the various date and time formats used in the e-mail headers: sometimes the day of the week was not included in the string, or the order of year, month, and day differed from the vast majority of the other mails. Anyway, we will concentrate only on this significant portion of mails with the standard date and time format but, if you are interested in parsing the other time formats as well, you might want to check Hadley Wickham's lubridate package to help your workflow. Please note, though, that there is no general algorithm to guess the order of day, month, and year in an ambiguous date string, so you will end up with some manual data cleansing for sure!
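For instance, lubridate's parse_date_time function can try several candidate formats in turn; the two orders below are only illustrative guesses and by no means cover all the variants found in the archives:

> library(lubridate)
> parse_date_time(
+     c('Tue, 1 Apr 1997 20:35:48 +1200', '1 Apr 1997 20:35:48'),
+     orders = c('a d b Y H M S z', 'd b Y H M S'))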

Let's see how one of these lines looks:

> sub('.*Date: ', '', lines[1])
[1] "Tue, 1 Apr 1997 20:35:48 +1200 (NZST)"

Then we can simply get rid of the Date prefix and parse the time stamps via strptime:

> times <- strptime(sub('.*Date: ', '', lines),
+            format = '%a, %d %b %Y %H:%M:%S %z')
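Note

The %a and %b conversion specifications are locale-dependent, so if your R session is not running in an English locale, strptime might return NA for all of these lines. In that case, switching the time-related locale category to C (or an English locale) before parsing should help:

> Sys.setlocale('LC_TIME', 'C')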

Now that the data is in a parsed format (even the local time-zones were converted to UTC), it's relatively easy to see, for example, the number of e-mails on the mailing list per year:

> plot(table(format(times, '%Y')), type = 'l')
[Figure: the yearly number of e-mails posted on the R-help mailing list]

Note

Although the volume of the R-help mailing list seems to have decreased in the past few years, this is not due to lower R activity: R users, just like others on the Internet, nowadays tend to use other information channels more often than e-mail, for example, StackOverflow and GitHub (or even Facebook and LinkedIn). For related research, please see the paper by Bogdan Vasilescu et al. at http://web.cs.ucdavis.edu/~filkov/papers/r_so.pdf.

Well, we can do a lot better than this, right? Let's massage our data a bit and visualize the frequency of mails based on the day of week and hour of the day via a more elegant graph—inspired by GitHub's punch card plot:

> library(data.table)
> Rhelp <- data.table(time = as.POSIXct(times))
> Rhelp[, H := hour(time)]
> Rhelp[, D := wday(time)]

Visualizing this dataset is relatively straightforward with ggplot:

> library(ggplot2)
> ggplot(na.omit(Rhelp[, .N, by = .(H, D)]),
+      aes(x = factor(H), y = factor(D), size = N)) + geom_point() +
+      ylab('Day of the week') + xlab('Hour of the day') +
+      ggtitle('Number of mails posted on [R-help]') +
+      theme_bw() + theme('legend.position' = 'top')
[Figure: punch card plot of the number of mails posted on R-help by hour of the day and day of the week]

As the times are in UTC, the early morning mails might suggest that most R-help posters live in time zones with a positive GMT offset, if we suppose that most e-mails were written during business hours. Well, at least the lower number of e-mails on the weekends seems to support this assumption.

And it seems that the UTC, UTC+1, and UTC+2 time zones are indeed rather frequent, but the US time zones are also pretty common for the R-help posters:

> tail(sort(table(sub('.*([+-][0-9]{4}).*', '\\1', lines))), 22)
-1000 +0700 +0400 -0200 +0900 -0000 +0300 +1300 +1200 +1100 +0530 
  164   352   449  1713  1769  2585  2612  2917  2990  3156  3938 
-0300 +1000 +0800 -0600 +0000 -0800 +0200 -0500 -0400 +0100 -0700 
 4712  5081  5493 14351 28418 31661 42397 47552 50377 51390 55696

Forecasting the e-mail volume in the future

We can also use this relatively clean dataset to forecast the future volume of the R-help mailing list. To this end, let's aggregate the original dataset to daily counts, as we saw in Chapter 3, Filtering and Summarizing Data:

> Rhelp[, date := as.Date(time)]
> Rdaily <- na.omit(Rhelp[, .N, by = date])

Now let's transform this data.table object into a time-series object by referencing the actual mail counts as values and the dates as the index:

> library(zoo)
> Rdaily <- zoo(Rdaily$N, Rdaily$date)

Well, this daily dataset is a lot spikier than the previously rendered yearly graph:

> plot(Rdaily)
[Figure: the daily number of e-mails posted on the R-help mailing list]

But instead of smoothing or trying to decompose this time-series, like we did in Chapter 12, Analyzing Time-series, let's see how we can provide some quick estimates (based on the historical data) of the forthcoming number of mails on this mailing list with some automatic models. To this end, we will use the forecast package:

> library(forecast)
> fit <- ets(Rdaily)

The ets function implements a fully automatic method that can select the optimal trend, season, and error type for the given time-series. Then we can simply call the predict or forecast function to get estimates for the specified number of periods, only for the next day in this case:

> predict(fit, 1)
     Point Forecast   Lo 80    Hi 80        Lo 95    Hi 95
5823       28.48337 9.85733 47.10942 -0.002702251 56.96945

So it seems that, for the next day, our model estimates around 28 e-mails, with the 80 percent prediction interval ranging roughly from 10 to 47. Visualizing predictions for a slightly longer period of time, along with some historical data, can be done via the standard plot function with some useful new parameters:

> plot(forecast(fit, 30), include = 365)
[Figure: 30-day forecast of the daily e-mail volume, shown with the past 365 days of historical data]