Another similarly straightforward data source might be the list of R package maintainers. We can download the names and e-mail addresses of the package maintainers from a public page of CRAN, where this data is stored in a nicely structured HTML table that is extremely easy to parse:
> packages <- readHTMLTable(paste0('http://cran.r-project.org',
+     '/web/checks/check_summary.html'), which = 2)
Extracting the names from the Maintainer column can be done via some quick data cleansing and transformations, mainly using regular expressions. Please note that the column name starts with a space, which is why we quoted it:
> maintainers <- sub('(.*) <(.*)>', '\\1', packages$' Maintainer')
> maintainers <- gsub('\u00a0', ' ', maintainers)  # non-breaking to plain spaces
> str(maintainers)
 chr [1:6994] "Scott Fortmann-Roe" "Gaurav Sood" "Blum Michael" ...
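To make the two cleansing steps easier to follow, here is a minimal sketch on a couple of made-up entries. The names and addresses in `raw` are hypothetical and are not taken from CRAN; the second one is assumed to contain a non-breaking space (U+00A0), which is the kind of character the second substitution is meant to normalize.

```r
# Hypothetical entries in the "Name <e-mail>" format (not real CRAN data);
# the second name contains a non-breaking space (U+00A0).
raw <- c('Jane Doe <jane@example.org>',
         'John\u00a0Smith <john@example.org>')

# Keep only the first capture group, that is, everything before " <"
cleaned <- sub('(.*) <(.*)>', '\\1', raw)

# Replace non-breaking spaces with ordinary spaces
cleaned <- gsub('\u00a0', ' ', cleaned)
print(cleaned)
```

Note the doubled backslash in `'\\1'`: inside an R string literal, a single backslash followed by a digit would be read as an escape sequence rather than as a back-reference.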
This list of almost 7,000 package maintainers includes some duplicated names (they maintain multiple packages). Let's see the list of the top, most prolific R package developers:
> tail(sort(table(maintainers)), 8)
     Paul Gilbert     Simon Urbanek Scott Chamberlain   Martin Maechler
               22                22                24                25
         ORPHANED       Kurt Hornik    Hadley Wickham Dirk Eddelbuettel
               26                29                31                36
There is one odd name in the preceding list: ORPHANED marks packages that no longer have a maintainer, and having only 26 such packages out of 6,994 is a pretty good ratio. The other names are indeed well known in the R community, and they work on a number of useful packages.
On the other hand, there are a lot more names in the list associated with only one or a few R packages. Instead of visualizing the number of packages per maintainer on a simple bar chart or histogram, let's load the fitdistrplus package, which we will use on the forthcoming pages to fit various theoretical distributions to this dataset:
> N <- as.numeric(table(maintainers))
> library(fitdistrplus)
> plotdist(N)
The preceding plots also show that most people in the list maintain only one package, and few maintain more than two or three. If we are interested in how long- or heavy-tailed this distribution is, we might want to call the descdist function, which returns some important descriptive statistics on the empirical distribution and also plots how different theoretical distributions fit our data on a skewness-kurtosis plot:
> descdist(N, boot = 1e3)
summary statistics
------
min:  1   max:  36
median:  1
mean:  1.74327
estimated sd:  1.963108
estimated skewness:  7.191722
estimated kurtosis:  82.0168
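For readers who want to see what the last two figures measure, the following is a rough sketch of plain moment-based skewness and kurtosis in base R. The helper name `moment_shape` and both sample vectors are made up for illustration, and descdist may apply bias corrections, so its estimates need not match this simplified formula exactly.

```r
# Plain moment-based skewness and kurtosis (a sketch; descdist may apply
# bias corrections, so its figures can differ slightly).
moment_shape <- function(x) {
  d  <- x - mean(x)
  m2 <- mean(d^2)
  c(skewness = mean(d^3) / m2^1.5,
    kurtosis = mean(d^4) / m2^2)
}

symmetric <- c(1, 2, 3, 4, 5)            # hypothetical, balanced sample
long_tail <- c(rep(1, 20), 2, 2, 3, 36)  # hypothetical, one extreme value

print(moment_shape(symmetric))
print(moment_shape(long_tail))
```

A single extreme value among many small ones is enough to push both statistics up sharply, which is the pattern the package counts show.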
Our empirical distribution seems to be rather long-tailed with a very high kurtosis, and it seems that the gamma distribution is the best fit for this dataset. Let's see the estimated parameters of this gamma distribution:
> (gparams <- fitdist(N, 'gamma'))
Fitting of the distribution ' gamma ' by maximum likelihood
Parameters:
      estimate Std. Error
shape 2.394869 0.05019383
rate  1.373693 0.03202067
We can use these parameters to simulate a lot more R package maintainers with the rgamma function. Let's see how many R packages would be available on CRAN with, for example, 100,000 package maintainers:
> gshape <- gparams$estimate[['shape']]
> grate <- gparams$estimate[['rate']]
> sum(rgamma(1e5, shape = gshape, rate = grate))
[1] 173655.3
> hist(rgamma(1e5, shape = gshape, rate = grate))
It's rather clear that this distribution is not as long-tailed as our real dataset: even with 100,000 simulations, the largest number was below 10, as we can see in the preceding plot; in reality, though, the R package maintainers are a lot more productive with up to 20 or 30 packages.
Let's verify this by estimating the proportion of R package maintainers with no more than two packages based on the preceding gamma distribution:
> pgamma(2, shape = gshape, rate = grate)
[1] 0.6672011
But this percentage is a lot higher in the real dataset:
> prop.table(table(N <= 2))
    FALSE      TRUE
0.1458126 0.8541874
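The fitted parameters can also be used to put a number on the tail mismatch at the other end of the distribution. The sketch below hard-codes the shape and rate estimates printed by fitdist earlier and asks how likely a maintainer with more than 20 packages would be under that gamma model; the variable names `p_over_20` and `expected_over_20` are introduced here only for illustration.

```r
# Parameter estimates copied from the fitdist output shown earlier
gshape <- 2.394869
grate  <- 1.373693

# Probability that a gamma-distributed maintainer has more than 20 packages
p_over_20 <- pgamma(20, shape = gshape, rate = grate, lower.tail = FALSE)

# Expected number of such maintainers among 100,000 simulated ones
expected_over_20 <- 1e5 * p_over_20

print(c(p_over_20 = p_over_20, expected_over_20 = expected_over_20))
```

Under this model, a maintainer with more than 20 packages would be expected far less than once in 100,000 draws, while the real list contains several of them.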
This may suggest trying to fit a longer-tailed distribution. Let's see, for example, how the Pareto distribution would fit our data. To this end, let's follow the analytical approach by using the lowest value as the location of the distribution, and the number of values divided by the sum of the logarithmic differences of all these values from the location as the shape parameter:
> ploc <- min(N)
> pshp <- length(N) / sum(log(N) - log(ploc))
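This estimator is the maximum likelihood estimate for the classical (type I) Pareto distribution. As a sanity check, here is a small sketch that applies the same two lines to simulated data where the true parameters are known. The values of `true_shape` and `true_loc`, the seed, and the simulated vector `x` are all assumptions made for this illustration and have nothing to do with the maintainer counts.

```r
set.seed(42)

# Simulate a classical (type I) Pareto sample by inverse transform:
# if U ~ Uniform(0, 1), then location / U^(1 / shape) is Pareto distributed.
true_shape <- 2   # hypothetical value chosen for the illustration
true_loc   <- 1
x <- true_loc / runif(1e5)^(1 / true_shape)

# Same estimator as in the text: smallest value as location,
# n / sum(log(x) - log(location)) as shape
est_loc   <- min(x)
est_shape <- length(x) / sum(log(x) - log(est_loc))

print(c(location = est_loc, shape = est_shape))
```

With a sample of this size, the estimates should land close to the values used for the simulation.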
Unfortunately, there is no ppareto function in the base stats package, so we have to first load the actuar or VGAM package to compute the distribution function:
> library(actuar)
> ppareto(2, pshp, ploc)
[1] 0.9631973
Well, now this is even higher than the real proportion! It seems that none of the preceding theoretical distributions fit our data perfectly, which is pretty normal, by the way. But let's see how these distributions fit our original dataset on a joint plot:
> fg <- fitdist(N, 'gamma')
> fw <- fitdist(N, 'weibull')
> fl <- fitdist(N, 'lnorm')
> fp <- fitdist(N, 'pareto', start = list(shape = 1, scale = 1))
> par(mfrow = c(1, 2))
> denscomp(list(fg, fw, fl, fp), addlegend = FALSE)
> qqcomp(list(fg, fw, fl, fp),
+     legendtext = c('gamma', 'Weibull', 'Lognormal', 'Pareto'))
After all, it seems that the Pareto distribution is the closest fit to our long-tailed data. But more importantly, we know about more than 4,000 R users besides the previously identified 279 R Foundation supporting members:
> length(unique(maintainers))
[1] 4012
What other data sources can we use to find information on the (number of) R users?