R package maintainers

Another similarly straightforward data source might be the list of R package maintainers. We can download the names and e-mail addresses of the package maintainers from a public page of CRAN, where this data is stored in a nicely structured HTML table that is extremely easy to parse:

> packages <- readHTMLTable(paste0('http://cran.r-project.org', 
+   '/web/checks/check_summary.html'), which = 2)

Extracting the names from the Maintainer column can be done via some quick data cleansing and transformations, mainly using regular expressions. Please note that the column name starts with a space, which is why we quoted it:

> maintainers <- sub('(.*) <(.*)>', '\\1', packages$' Maintainer')
> # replace non-breaking spaces (U+00A0) with regular spaces
> maintainers <- gsub('\u00a0', ' ', maintainers)
> str(maintainers)
 chr [1:6994] "Scott Fortmann-Roe" "Gaurav Sood" "Blum Michael" ...
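
To see what the two substitutions do, here is a minimal sketch on two made-up entries (the names and addresses are hypothetical, not taken from CRAN). The first call keeps only the text before the angle brackets; the second is assumed to replace non-breaking spaces, which HTML tables often contain, with regular ones:

```r
# Hypothetical sample entries in the "Name <e-mail>" format
raw <- c("Jane Doe <jane@example.org>",
         "John\u00a0Smith <john@example.org>")  # \u00a0 is a non-breaking space

# Keep the first capture group, that is, everything before " <...>"
cleaned <- sub("(.*) <(.*)>", "\\1", raw)

# Replace non-breaking spaces with regular spaces
cleaned <- gsub("\u00a0", " ", cleaned)

print(cleaned)
```

Note the doubled backslash in the replacement string: inside an R string literal, a single backslash followed by 1 would be read as an escape sequence rather than as a backreference.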

This list of almost 7,000 package maintainers includes some duplicated names (they maintain multiple packages). Let's see the list of the top, most prolific R package developers:

> tail(sort(table(maintainers)), 8)
   Paul Gilbert     Simon Urbanek Scott Chamberlain   Martin Maechler 
             22                22                24                25 
       ORPHANED       Kurt Hornik    Hadley Wickham Dirk Eddelbuettel 
             26                29                31                36 

There is one odd entry in the preceding list: ORPHANED marks packages that no longer have a maintainer. Having only 26 of the 6,994 packages no longer actively maintained is a pretty good ratio. The other names are indeed well known in the R community, and these developers work on a number of useful packages.
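
Because ORPHANED is a placeholder rather than a person, we may want to drop it before ranking. A minimal sketch on a made-up vector (all names are hypothetical):

```r
# Hypothetical maintainer vector, one entry per package
maintainers_demo <- c("Ann", "Bob", "ORPHANED", "Ann", "Cleo",
                      "ORPHANED", "Ann", "Bob")

# Drop the placeholder, then count packages per person
people <- maintainers_demo[maintainers_demo != "ORPHANED"]
counts <- sort(table(people))

# The last elements of the sorted table are the most prolific maintainers
top2 <- tail(counts, 2)
print(top2)
```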

The number of packages per maintainer

On the other hand, there are a lot more names in the list associated with only one or a few R packages. Instead of visualizing the number of packages per maintainer on a simple bar chart or histogram, let's load the fitdistrplus package, which we will use on the forthcoming pages to fit various theoretical distributions to this dataset:

> N <- as.numeric(table(maintainers))
> library(fitdistrplus)
> plotdist(N)

[Figure: The number of packages per maintainer]

The preceding plots also show that most people in the list maintain only one package, and few maintain more than two or three. If we are interested in how long- or heavy-tailed this distribution is, we might want to call the descdist function, which returns some important descriptive statistics on the empirical distribution and also plots how different theoretical distributions fit our data on a skewness-kurtosis plot:

> descdist(N, boot = 1e3)
summary statistics
------
min:  1   max:  36 
median:  1 
mean:  1.74327 
estimated sd:  1.963108 
estimated skewness:  7.191722 
estimated kurtosis:  82.0168 

[Figure: The number of packages per maintainer]
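
To make the last two statistics less opaque, the following sketch computes moment-based skewness and kurtosis in base R on a toy vector. The helper functions are our own illustration; as far as we know, descdist reports bias-corrected estimates by default, so its figures will differ somewhat from these plain moment ratios:

```r
# Plain moment-based estimators (no bias correction)
skewness <- function(x) {
  d <- x - mean(x)
  mean(d^3) / mean(d^2)^1.5
}
kurtosis <- function(x) {
  d <- x - mean(x)
  mean(d^4) / mean(d^2)^2   # equals 3 for a normal distribution
}

# Toy data: mostly small values with a single large one (a long right tail)
toy <- c(1, 1, 1, 1, 2, 2, 3, 12)
skewness(toy)  # positive: the tail is on the right
kurtosis(toy)  # above 3: heavier tail than a normal distribution
```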

Our empirical distribution seems to be rather long-tailed with a very high kurtosis, and it seems that the gamma distribution is the best fit for this dataset. Let's see the estimated parameters of this gamma distribution:

> (gparams <- fitdist(N, 'gamma'))
Fitting of the distribution ' gamma ' by maximum likelihood 
Parameters:
      estimate Std. Error
shape 2.394869 0.05019383
rate  1.373693 0.03202067
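
A quick sanity check on these estimates: a gamma distribution with shape k and rate r has mean k / r and standard deviation sqrt(k) / r. Plugging in the values printed above (copied by hand, so rounded) suggests that the fit reproduces the sample mean but implies a much smaller spread than the empirical one:

```r
# Estimates and summary statistics copied from the output shown in the text
shape <- 2.394869
rate  <- 1.373693
emp_mean <- 1.74327
emp_sd   <- 1.963108

gamma_mean <- shape / rate        # theoretical mean of the fitted gamma
gamma_sd   <- sqrt(shape) / rate  # theoretical standard deviation

round(c(gamma_mean = gamma_mean, emp_mean = emp_mean), 3)
round(c(gamma_sd = gamma_sd, emp_sd = emp_sd), 3)
```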

We can use these parameters to simulate a lot more R package maintainers with the rgamma function. Let's see how many R packages would be available on CRAN with, for example, 100,000 package maintainers:

> gshape <- gparams$estimate[['shape']]
> grate  <- gparams$estimate[['rate']]
> sum(rgamma(1e5, shape = gshape, rate = grate))
[1] 173655.3
> hist(rgamma(1e5, shape = gshape, rate = grate))

[Figure: The number of packages per maintainer]

It's rather clear that this distribution is not as long-tailed as our real dataset: even with 100,000 simulations, the largest number was below 10, as we can see in the preceding plot. In reality, though, the most prolific R package maintainers are a lot more productive, with more than 20 or even 30 packages each.

Let's verify this by estimating the proportion of R package maintainers with no more than two packages based on the preceding gamma distribution:

> pgamma(2, shape = gshape, rate = grate)
[1] 0.6672011

But this percentage is a lot higher in the real dataset:

> prop.table(table(N <= 2))
    FALSE      TRUE 
0.1458126 0.8541874 
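
The mismatch is even starker in the right tail. As a rough sketch, using the gamma estimates printed earlier, we can ask how many maintainers the fitted distribution would expect to have 20 or more packages (the 4,000 maintainers is a round figure assumed for illustration only):

```r
shape <- 2.394869
rate  <- 1.373693
n_maintainers <- 4000  # assumed round number, for illustration only

# Upper-tail probability P(X >= 20) under the fitted gamma
p_tail <- pgamma(20, shape = shape, rate = rate, lower.tail = FALSE)

# Expected number of maintainers with 20 or more packages
expected_heavy <- n_maintainers * p_tail
expected_heavy
```

The expected count is practically zero, whereas the earlier table of top maintainers lists several people with more than 20 packages.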

This may suggest trying to fit a longer-tailed distribution. Let's see, for example, how the Pareto distribution would fit our data. To this end, let's follow the analytical approach by using the lowest value as the location of the distribution, and the number of values divided by the sum of the logarithmic differences of all these values from the location as the shape parameter:

> ploc <- min(N)
> pshp <- length(N) / sum(log(N) - log(ploc))
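
To check that this estimator behaves as expected, the following sketch draws values from a Pareto distribution with a known shape by inverse transform sampling and recovers the parameter with the same formula. The function name and the simulated sample are our own, hypothetical additions:

```r
# Closed-form estimates for a Pareto (type I) distribution
pareto_mle <- function(x) {
  loc <- min(x)
  shape <- length(x) / sum(log(x) - log(loc))
  c(location = loc, shape = shape)
}

set.seed(42)
true_loc <- 1
true_shape <- 3

# Inverse transform sampling: if U ~ Uniform(0, 1),
# then loc * U^(-1 / shape) follows a Pareto(loc, shape) distribution
x <- true_loc * runif(1e5)^(-1 / true_shape)

est <- pareto_mle(x)
est
```

With a sample this large, the estimated shape should land close to the true value of 3.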

Unfortunately, there is no ppareto function in the base stats package, so we have to first load the actuar or VGAM package to compute the distribution function:

> library(actuar)
> ppareto(2, pshp, ploc)
[1] 0.9631973
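
One caveat worth double-checking: to our knowledge, the ppareto function in actuar is parameterized by shape and scale in the Pareto type II (Lomax) form, F(x) = 1 - (scale / (x + scale))^shape, while the analytical estimator used for the shape belongs to the type I form, F(x) = 1 - (location / x)^shape. The two can give rather different answers for the same numbers, as this base R sketch shows. The shape of 3 and the helper names are assumptions for illustration, not results from our dataset:

```r
# Pareto type I: support x >= loc
ppareto_type1 <- function(q, shape, loc) {
  ifelse(q < loc, 0, 1 - (loc / q)^shape)
}

# Pareto type II (Lomax): support x >= 0
ppareto_type2 <- function(q, shape, scale) {
  ifelse(q < 0, 0, 1 - (scale / (q + scale))^shape)
}

shape_demo <- 3  # illustrative value only
p1 <- ppareto_type1(2, shape_demo, 1)  # 1 - (1/2)^3 = 0.875
p2 <- ppareto_type2(2, shape_demo, 1)  # 1 - (1/3)^3, about 0.963
c(type1 = p1, type2 = p2)
```

If the type I form is the intended one, actuar should also provide ppareto1(q, shape, min) for the single-parameter Pareto; it is worth checking the package documentation before relying on either function.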

Well, now this is even higher than the real proportion! It seems that none of the preceding theoretical distributions fit our data perfectly, which is pretty normal, by the way. But let's see how these distributions fit our original dataset on a joint plot:

> fg <- fitdist(N, 'gamma')
> fw <- fitdist(N, 'weibull')
> fl <- fitdist(N, 'lnorm')
> fp <- fitdist(N, 'pareto', start = list(shape = 1, scale = 1))
> par(mfrow = c(1, 2))
> denscomp(list(fg, fw, fl, fp), addlegend = FALSE)
> qqcomp(list(fg, fw, fl, fp),
+   legendtext = c('gamma', 'Weibull', 'Lognormal', 'Pareto')) 

[Figure: The number of packages per maintainer]
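
The visual comparison can be complemented with a numerical one. For fitted objects, fitdistrplus offers the gofstat function for this purpose. The base R sketch below only illustrates the underlying idea on simulated data (not our CRAN dataset): it compares the AIC of a lognormal and a Pareto type I fit, both estimated with closed-form maximum likelihood formulas. All variable names here are hypothetical:

```r
set.seed(1)
# Simulated long-tailed sample: Pareto type I with location 1 and shape 3
x <- runif(5000)^(-1 / 3)
n <- length(x)

# Lognormal fit: the maximum likelihood estimates have closed forms
meanlog <- mean(log(x))
sdlog   <- sqrt(mean((log(x) - meanlog)^2))
ll_lnorm <- sum(dlnorm(x, meanlog, sdlog, log = TRUE))

# Pareto type I fit: location = minimum, shape = n / sum of log differences
loc   <- min(x)
shape <- n / sum(log(x) - log(loc))
ll_pareto <- n * log(shape) + n * shape * log(loc) - (shape + 1) * sum(log(x))

# AIC = 2 * (number of parameters) - 2 * log-likelihood; lower is better
aic <- c(lognormal = 2 * 2 - 2 * ll_lnorm,
         pareto    = 2 * 2 - 2 * ll_pareto)
aic
```

Because the sample was generated from a Pareto distribution, the Pareto fit should come out with the lower AIC here; on real data, the ranking is of course an empirical question.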

After all, it seems that the Pareto distribution is the closest fit to our long-tailed data. But more importantly, we know about more than 4,000 R users besides the previously identified 279 R Foundation supporting members:

> length(unique(maintainers))
[1] 4012

What other data sources can we use to find information on the (number of) R users?
