But our original idea was to predict the number of R users around the world and not to focus on some minor segments, right? Now that we have multiple data sources, we can start building some models combining those to provide estimates on the global number of R users.
The basic idea behind this approach is the capture-recapture method, which is well known in ecology: we first try to identify the probability of capturing a unit from the population, and then we use this probability to estimate the number of units that were not captured.
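To make the idea concrete, here is a minimal sketch of the simplest two-sample case, often called the Lincoln-Petersen estimator. All the numbers are made up for illustration and have nothing to do with the R user data; the variable names are hypothetical as well:

```r
# Hypothetical two-sample example: all numbers below are made up.
n1 <- 200   # units captured (and marked) in the first sample
n2 <- 150   # units captured in the second sample
m  <- 30    # marked units found again in the second sample

# Estimated probability of having been captured in the first sample:
# the share of the second sample that was already marked.
p_hat <- m / n2

# Lincoln-Petersen estimate of the population size
N_hat <- n1 / p_hat

# Estimated number of units never captured in either sample
unseen <- N_hat - (n1 + n2 - m)
```

The estimate rests on the assumption that both samples are drawn from the same closed population and that every unit is equally likely to be captured.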
In our current study, the units will be R users and the samples are the previously captured name lists of R Foundation supporters, CRAN package maintainers, and R-help posters.
Let's merge these lists with a tag referencing the data source:
> library(data.table)
> lists <- rbindlist(list(
+     data.frame(name = unique(supporterlist), list = 'supporter'),
+     data.frame(name = unique(maintainers),   list = 'maintainer'),
+     data.frame(name = unique(Rhelpers),      list = 'R-help')))
Next let's see the number of names we can find in one, two or all three groups:
> t <- table(lists$name, lists$list)
> table(rowSums(t))

    1     2     3 
44312   860    40 
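If it is not obvious how the cross-tabulation turns into overlap counts, the following toy sketch might help. The names and list memberships are hypothetical, and base R's rbind stands in for rbindlist to keep the example self-contained:

```r
# Hypothetical toy data: three short name lists with some overlap.
toy <- rbind(
    data.frame(name = c('Ann', 'Bob', 'Cid'), list = 'supporter'),
    data.frame(name = c('Bob', 'Cid', 'Dee'), list = 'maintainer'),
    data.frame(name = c('Cid', 'Eve'),        list = 'R-help'))

# One row per name, one column per list; a cell is 1 if the name is on that list
m <- table(toy$name, toy$list)

# How many lists each name appears on, and how often each count occurs
overlap <- table(rowSums(m))
```

Here three names appear on a single list, one name on two lists, and one name on all three. This only works as a membership count because each list was deduplicated beforehand, just like the unique calls do in the real data.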
So there are (at least) 40 persons who support the R Foundation, maintain at least one R package on CRAN, and have posted at least one mail to R-help since 1997! I am happy and proud to be one of these guys -- especially with an accent in my name, which often makes matching of strings more complex.
Now, if we suppose these lists refer to the same population, namely R users around the world, then we can use these common occurrences to predict the number of R users who somehow missed supporting the R Foundation, maintaining a package on CRAN, and writing a mail to the R-help mailing list. Although this assumption is obviously off, let's run this quick experiment and get back to these outstanding questions later.
One of the best things in R is that we have a package for almost any problem. Let's load the Rcapture package, which provides some sophisticated, yet easily accessible, methods for capture-recapture models:
> library(Rcapture)
> descriptive(t)

Number of captured units: 45212 

Frequency statistics:
          fi     ui     vi     ni
i = 1  44312    279    157    279
i = 2    860   3958   3194   4012
i = 3     40  40975  41861  41861
fi: number of units captured i times
ui: number of units captured for the first time on occasion i
vi: number of units captured for the last time on occasion i
ni: number of units captured on occasion i
These numbers from the first fi column are familiar from the previous table, and represent the number of R users identified on one, two, or all three lists. It's a lot more interesting to fit some models on this data with a simple call such as:
> closedp(t)

Number of captured units: 45212 

Abundance estimations and model fits:
               abundance     stderr  deviance df       AIC       BIC
M0              750158.4    23800.7 73777.800  5 73835.630 73853.069
Mt              192022.2     5480.0   240.278  3   302.109   336.986
Mh Chao (LB)    806279.2    26954.8 73694.125  4 73753.956 73780.113
Mh Poisson2    2085896.4   214443.8 73694.125  4 73753.956 73780.113
Mh Darroch     5516992.8  1033404.9 73694.125  4 73753.956 73780.113
Mh Gamma3.5   14906552.8  4090049.0 73694.125  4 73753.956 73780.113
Mth Chao (LB)   205343.8     6190.1    30.598  2    94.429   138.025
Mth Poisson2   1086549.0   114592.9    30.598  2    94.429   138.025
Mth Darroch    6817027.3  1342273.7    30.598  2    94.429   138.025
Mth Gamma3.5  45168873.4 13055279.1    30.598  2    94.429   138.025
Mb                 -36.2        6.2   107.728  4   167.559   193.716
Mbh               -144.2       25.9    84.927  3   146.758   181.635
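To get a feeling for what happens behind such a call, here is a minimal base-R sketch of the simplest model in the list, M0, written as a Poisson log-linear regression on the capture history frequencies. This is the general log-linear approach, not a copy of the Rcapture internals, and the frequencies are hypothetical: they are the exact expected counts for a made-up population of 1000 units with a capture probability of 0.2 on each of three occasions.

```r
# Hypothetical capture histories for three occasions (1 = captured).
# The frequencies are made up: they are the exact expected counts for a
# population of 1000 units with a capture probability of 0.2 per occasion.
histories <- data.frame(
    c1   = c(1, 0, 0, 1, 1, 0, 1),
    c2   = c(0, 1, 0, 1, 0, 1, 1),
    c3   = c(0, 0, 1, 0, 1, 1, 1),
    freq = c(128, 128, 128, 32, 32, 32, 8))

# Number of occasions on which each history was captured
histories$k <- with(histories, c1 + c2 + c3)

# M0: the log of the expected frequency is linear in the number of captures
fit <- glm(freq ~ k, family = poisson, data = histories)

# exp(intercept) estimates the frequency of the unobserved history (0, 0, 0)
n      <- sum(histories$freq)
unseen <- unname(exp(coef(fit)[1]))
N_hat  <- n + unseen
```

With these idealized counts the fit recovers the made-up population size of 1000 from the 488 captured units. Real data, such as our three name lists, rarely matches the assumption of one constant capture probability, which is why the more flexible models in the table above give such different estimates.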
Once again, I have to emphasize that these are not actually estimates of the abundance of all R users around the world: as noted above, the assumption that the three lists refer to the same population is obviously off.
Although this playful example did not really help us to find out the number of R users around the world, with some extensions the basic idea is definitely viable. First of all, we might consider analyzing the source data in smaller chunks, for example, looking for the same e-mail addresses or names in different years of the R-help archives. This might help with estimating the number of people who were thinking about submitting a question to R-help, but did not actually send the e-mail after all (for example, because another poster's question had already been answered or they resolved the problem without external help).
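As a rough sketch of this per-year idea, each year of the archive can be treated as a separate capture occasion. The posting log below is entirely hypothetical (made-up names and years), and the column names are assumptions for illustration only:

```r
# Hypothetical R-help posting log: the names and years are made up.
posts <- data.frame(
    name = c('Ann', 'Ann', 'Bob', 'Bob', 'Cid', 'Cid', 'Cid', 'Dee'),
    year = c(2012,  2013,  2012,  2012,  2012,  2013,  2014,  2014))

# Treat each year as a capture occasion: 1 if the name posted that year,
# no matter how many mails were sent in that year
histories <- (table(posts$name, posts$year) > 0) * 1

# Number of years in which each poster was "captured"
captures <- table(rowSums(histories))
```

The resulting 0/1 matrix has one row per poster and one column per year, which is analogous to the name-by-list table used earlier.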
On the other hand, we could also add a number of other data sources to the models, so that we can make more reliable estimates of the R users who do not contribute to the R Foundation, CRAN, or R-help.
I have been working on a similar study over the past 2 years, collecting data on the number of:
You can find the results on an interactive map and the country-level aggregated data in a CSV file at http://rapporter.net/custom/R-activity, and an offline data visualization presented at the past two useR! conferences at http://bit.ly/useRs2015.