R-help is the official, main mailing list providing general discussion about problems and solutions using R, with many active users and several dozen e-mails every day. Fortunately, this public mailing list is archived on several sites, and we can easily download the compressed monthly files from, for example, ETH Zurich's R-help archives:
> library(RCurl)
> url <- getURL('https://stat.ethz.ch/pipermail/r-help/')
Now let's extract the URL of the monthly compressed archives from this page via an XPath query:
> R.help.toc <- htmlParse(url)
> R.help.archives <- unlist(xpathApply(R.help.toc,
+     "//table//td[3]/a", xmlAttrs), use.names = FALSE)
And now let's download these files to our computer for future parsing:
> dir.create('r-help')
> for (f in R.help.archives)
+     download.file(url = paste0('https://stat.ethz.ch/pipermail/r-help/', f),
+         destfile = file.path('r-help', f), method = 'curl')
Downloading these ~200 files takes some time, and you might also want to add a Sys.sleep call to the loop so as not to overload the server. Anyway, after some time, you will have a local copy of the R-help mailing list in the r-help folder, ready to be parsed for some interesting data:
> lines <- system(
+     "zgrep -E '^From: .* at .*' ./r-help/*.txt.gz",
+     intern = TRUE)
> length(lines)
[1] 387218
> length(unique(lines))
[1] 110028
Instead of loading all the text files into R and using grep there, I pre-filtered the files via the Linux command-line zgrep utility, which can search gzipped (compressed) text files efficiently. If you do not have zgrep installed (it is available on both Windows and Mac), you can extract the files first and use the standard grep approach with the very same regular expression.
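If zgrep is not available, the same pre-filtering can be sketched in plain shell (assuming gzip, grep, and sed are installed, and that the archives sit in the r-help folder as above); the sed call mimics zgrep's file-name prefix in the output:

```shell
# Fallback pre-filter without zgrep: decompress each archive to stdout,
# grep for the sender headers, and prepend the file name like zgrep does.
for f in ./r-help/*.txt.gz; do
    gzip -dc "$f" | grep -E '^From: .* at .*' | sed "s|^|$f:|"
done
```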
So we filtered for all header lines starting with the From string, which hold information on the sender's e-mail address and name. Out of the ~387,000 e-mails, we found around 110,000 unique e-mail sources. To understand the following regular expressions, let's see how one of these lines looks:
> lines[26]
[1] "./1997-April.txt.gz:From: pcm at ptd.net (Paul C. Murray)"
Now let's process these lines by removing the static prefix and extracting the names found between parentheses after the e-mail address:
> lines <- sub('.*From: ', '', lines)
> Rhelpers <- sub('.*\\((.*)\\)', '\\1', lines)
And we can see the list of the most active R-help posters:
> tail(sort(table(Rhelpers)), 6)
       jim holtman     Duncan Murdoch         Uwe Ligges 
              4284               6421               6455 
Gabor Grothendieck  Prof Brian Ripley    David Winsemius 
              8461               9287              10135 
This list seems legitimate, right? Although my first guess was that Professor Brian Ripley, with his brief messages, would be the first on this list. Based on some earlier experience, I know that matching names can be tricky and cumbersome, so let's verify that our data is clean enough and that there's only one version of the Professor's name:
> grep('Brian( D)? Ripley', names(table(Rhelpers)), value = TRUE)
 [1] "Brian D Ripley"
 [2] "Brian D Ripley [mailto:ripley at stats.ox.ac.uk]"
 [3] "Brian Ripley"
 [4] "Brian Ripley <ripley at stats.ox.ac.uk>"
 [5] "Prof Brian D Ripley"
 [6] "Prof Brian D Ripley [mailto:ripley at stats.ox.ac.uk]"
 [7] " Prof Brian D Ripley <ripley at stats.ox.ac.uk>"
 [8] "\"Prof Brian D Ripley\" <ripley at stats.ox.ac.uk>"
 [9] "Prof Brian D Ripley <ripley at stats.ox.ac.uk>"
[10] "Prof Brian Ripley"
[11] "Prof. Brian Ripley"
[12] "Prof Brian Ripley [mailto:ripley at stats.ox.ac.uk]"
[13] "Prof Brian Ripley [mailto:ripley at stats.ox.ac.uk] "
[14] " Prof Brian Ripley <ripley at stats.ox.ac.uk>"
[15] "  Prof Brian Ripley <ripley at stats.ox.ac.uk>"
[16] "\"Prof Brian Ripley\" <ripley at stats.ox.ac.uk>"
[17] "Prof Brian Ripley<ripley at stats.ox.ac.uk>"
[18] "Prof Brian Ripley <ripley at stats.ox.ac.uk>"
[19] "Prof Brian Ripley [ripley at stats.ox.ac.uk]"
[20] "Prof Brian Ripley <ripley at toucan.stats>"
[21] "Professor Brian Ripley"
[22] "r-help-bounces at r-project.org [mailto:r-help-bounces at r-project.org] On Behalf Of Prof Brian Ripley"
[23] "r-help-bounces at stat.math.ethz.ch [mailto:r-help-bounces at stat.math.ethz.ch] On Behalf Of Prof Brian Ripley"
Well, it seems that the Professor used some alternative From addresses as well, so a more accurate estimate of the number of his messages would be something like:
> sum(grepl('Brian( D)? Ripley', Rhelpers))
[1] 10816
So using quick regular expressions to extract the names from the e-mails returned most of the information we were interested in, but it seems that we would have to spend a lot more time to get the complete information set. As usual, the Pareto rule applies: we get around 80 percent of the data in roughly 20 percent of the project timeline, while preparing and cleansing the remaining data takes around 80 percent of our time.
Due to page limitations, we will not cover data cleansing on this dataset in greater detail at this point, but I highly suggest checking Mark van der Loo's stringdist package, which can compute string distances and similarities to, for example, merge similar names in cases like this.
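As a minimal sketch of this idea (assuming the stringdist package is installed; the handful of name variants and the 0.1 cut-off are arbitrary choices for illustration, not the book's cleansing pipeline), we could compute pairwise Jaro-Winkler distances between the variants and merge those falling into the same hierarchical cluster:

```r
# A sketch: group near-identical sender names via string distances.
library(stringdist)
variants <- c('Prof Brian Ripley', 'Prof. Brian Ripley',
              'Professor Brian Ripley', 'Duncan Murdoch')
# pairwise Jaro-Winkler distances between all name variants
d <- stringdistmatrix(variants, variants, method = 'jw')
# cut the hierarchical clustering at an (arbitrary) height of 0.1
groups <- cutree(hclust(as.dist(d)), h = 0.1)
split(variants, groups)
```

The threshold would of course need tuning against the real data.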
But besides the sender, these e-mails include some other really interesting data as well. For example, we can extract the date and time when each e-mail was sent, to model the frequency and temporal patterns of the mailing list.
To this end, let's filter for some other lines in the compressed text files:
> lines <- system(paste0(
+     "zgrep -E '^Date: [A-Za-z]{3}, [0-9]{1,2} [A-Za-z]{3} ",
+     "[0-9]{4} [0-9]{2}:[0-9]{2}:[0-9]{2} [-+]{1}[0-9]{4}' ",
+     "./r-help/*.txt.gz"),
+     intern = TRUE)
This returns fewer lines compared to the previously extracted From lines:
> length(lines)
[1] 360817
This is due to the various date and time formats used in the e-mail headers: sometimes the day of the week was not included in the string, or the order of year, month, and day differed from the vast majority of the other mails. We will concentrate only on this significant portion of mails with the standard date and time format but, if you are interested in transforming the other time formats, you might want to check Hadley Wickham's lubridate package to help your workflow. Please note, though, that there's no general algorithm to guess the order of day, month, and year, so you will certainly end up doing some manual data cleansing!
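As a quick sketch (assuming the lubridate package is installed; the alternative header strings below are made up for illustration), parse_date_time can try several candidate formats in order, which covers much of the heterogeneity in one call:

```r
# A sketch: let lubridate try multiple candidate date formats in order.
library(lubridate)
x <- c('Tue, 1 Apr 1997 20:35:48 +1200',  # standard RFC 2822 style header
       '1 Apr 1997 20:35:48 +1200',       # day of week missing
       '1997/04/01 20:35:48')             # year-first variant, no offset
parsed <- parse_date_time(x,
    orders = c('a d b Y H M S z', 'd b Y H M S z', 'Y m d H M S'))
```

The ambiguous day-month-year orderings mentioned above would still need manual inspection, as no list of candidate formats can resolve them reliably.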
Let's see how this subset of lines looks:
> head(sub('.*Date: ', '', lines[1]))
[1] "Tue, 1 Apr 1997 20:35:48 +1200 (NZST)"
Then we can simply get rid of the Date prefix and parse the time stamps via strptime:
> times <- strptime(sub('.*Date: ', '', lines),
+     format = '%a, %d %b %Y %H:%M:%S %z')
Now that the data is in a parsed format (even the local time zones were converted to UTC), it's relatively easy to see, for example, the number of e-mails on the mailing list per year:
> plot(table(format(times, '%Y')), type = 'l')
Although the volume on the R-help mailing list seems to have decreased in the past few years, this is not due to lower R activity: R users, like others on the Internet, nowadays tend to use other information channels more often than e-mail, for example, StackOverflow and GitHub (or even Facebook and LinkedIn). For related research, please see the paper by Bogdan Vasilescu et al. at http://web.cs.ucdavis.edu/~filkov/papers/r_so.pdf.
Well, we can do a lot better than this, right? Let's massage our data a bit and visualize the frequency of mails based on the day of week and hour of the day via a more elegant graph—inspired by GitHub's punch card plot:
> library(data.table)
> Rhelp <- data.table(time = times)
> Rhelp[, H := hour(time)]
> Rhelp[, D := wday(time)]
Visualizing this dataset is relatively straightforward with ggplot:
> library(ggplot2)
> ggplot(na.omit(Rhelp[, .N, by = .(H, D)]),
+     aes(x = factor(H), y = factor(D), size = N)) + geom_point() +
+     ylab('Day of the week') + xlab('Hour of the day') +
+     ggtitle('Number of mails posted on [R-help]') +
+     theme_bw() + theme('legend.position' = 'top')
As the times are in UTC, the early-morning mails might suggest that most R-help posters live in time zones with a positive UTC offset, if we suppose that most e-mails were written during business hours. At least the lower number of e-mails on the weekends seems to support this assumption.
And it seems that the UTC, UTC+1, and UTC+2 time zones are indeed rather frequent, but the US time zones are also pretty common among R-help posters:
> tail(sort(table(sub('.*([+-][0-9]{4}).*', '\\1', lines))), 22)
-1000 +0700 +0400 -0200 +0900 -0000 +0300 +1300 +1200 +1100 +0530 
  164   352   449  1713  1769  2585  2612  2917  2990  3156  3938 
-0300 +1000 +0800 -0600 +0000 -0800 +0200 -0500 -0400 +0100 -0700 
 4712  5081  5493 14351 28418 31661 42397 47552 50377 51390 55696 
And we can also use this relatively clean dataset to forecast the future volume of the R-help mailing list. To this end, let's aggregate the original dataset to daily counts, as we saw in Chapter 3, Filtering and Summarizing Data:
> Rhelp[, date := as.Date(time)]
> Rdaily <- na.omit(Rhelp[, .N, by = date])
Now let's transform this data.table object into a time-series object by referencing the actual mail counts as values and the dates as the index:
> library(zoo)
> Rdaily <- zoo(Rdaily$N, Rdaily$date)
Well, this daily dataset is a lot spikier than the previously rendered yearly graph:
> plot(Rdaily)
But instead of smoothing or trying to decompose this time-series, as we did in Chapter 12, Analyzing Time-series, let's rather see how we can provide some quick estimates (based on the historical data) of the forthcoming number of mails on this mailing list with some automatic models. To this end, we will use the forecast package:
> library(forecast)
> fit <- ets(Rdaily)
The ets function implements a fully automatic method that can select the optimal trend, season, and error type for the given time-series. Then we can simply call the predict or forecast function to see the specified number of estimates, only for the next day in this case:
> predict(fit, 1)
     Point Forecast    Lo 80    Hi 80        Lo 95    Hi 95
5823       28.48337  9.85733 47.10942 -0.002702251 56.96945
So it seems that, for the next day, our model estimates around 28 e-mails, with the 80 percent confidence interval falling somewhere between 10 and 47. Visualizing predictions for a slightly longer period of time, along with some historical data, can be done via the standard plot function with some useful new parameters:
> plot(forecast(fit, 30), include = 365)