Analyzing overlaps between our lists of R users

But our original idea was to estimate the number of R users around the world and not to focus on some minor segments, right? Now that we have multiple data sources, we can start building models that combine them to estimate the global number of R users.

The basic idea behind this approach is the capture-recapture method, which is well known in ecology: we first estimate the probability of capturing a unit from the population, and then use this probability to estimate the number of units that were not captured.
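For two samples, this idea reduces to the classic Lincoln-Petersen estimator. The following is a minimal sketch with invented numbers (n1, n2, and m are hypothetical and have nothing to do with our R user lists):

```r
# Hypothetical two-sample example (numbers invented for illustration)
n1 <- 200   # units captured and marked in the first sample
n2 <- 150   # units captured in the second sample
m  <- 30    # marked units found again in the second sample

# Estimated capture probability: share of marked units that were recaptured
p_hat <- m / n1

# Lincoln-Petersen estimate of the population size (same as n2 / p_hat)
N_hat <- n1 * n2 / m

# Units seen at least once, and the estimated number never captured
seen   <- n1 + n2 - m
unseen <- N_hat - seen
```

The estimate only makes sense if both samples are drawn from the same closed population and every unit has the same chance of being captured, which is exactly the kind of assumption we will have to question later.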

In our current study, the units are R users and the samples are the name lists we collected previously:

  • Supporters of the R Foundation
  • R package maintainers who submitted at least one package to CRAN
  • R-help mailing list e-mail senders

Let's merge these lists with a tag referencing the data source:

> library(data.table)
> lists <- rbindlist(list(
+     data.frame(name = unique(supporterlist), list = 'supporter'),
+     data.frame(name = unique(maintainers),   list = 'maintainer'),
+     data.frame(name = unique(Rhelpers),      list = 'R-help')))

Next let's see the number of names we can find in one, two or all three groups:

> t <- table(lists$name, lists$list)
> table(rowSums(t))
    1     2     3 
44312   860    40
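To see what the two table calls do, here is a toy version with invented names. It is only a sketch: the three vectors are made up, and base R's rbind stands in for rbindlist so that no extra package is needed.

```r
# Toy lists with invented names (not the real data)
supporters  <- c("Ann", "Bob", "Cy")
maintainers <- c("Bob", "Cy", "Dee")
helpers     <- c("Cy", "Eve")

toy <- rbind(
    data.frame(name = supporters,  list = "supporter"),
    data.frame(name = maintainers, list = "maintainer"),
    data.frame(name = helpers,     list = "R-help"))

# One row per name, one column per list, 1 where the name appears
m <- table(toy$name, toy$list)

# How many names appear on exactly one, two or three lists
overlap <- table(rowSums(m))
```

Here Ann, Dee, and Eve appear on one list each, Bob on two, and Cy on all three, so the counts in overlap are 3, 1, and 1.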

So there are (at least) 40 people who support the R Foundation, maintain at least one R package on CRAN, and have posted at least one mail to R-help since 1997! I am happy and proud to be one of these guys, especially with an accent in my name, which often makes string matching more complex.

Now, if we suppose these lists refer to the same population, namely R users around the world, then we can use these common occurrences to predict the number of R users who somehow missed supporting the R Foundation, maintaining a package on CRAN, and writing a mail to the R-help mailing list. Although this assumption is obviously off, let's run this quick experiment and get back to these outstanding questions later.

One of the best things in R is that we have a package for almost any problem. Let's load the Rcapture package, which provides some sophisticated, yet easily accessible, methods for capture-recapture models:

> library(Rcapture)
> descriptive(t)

Number of captured units: 45212 
 
Frequency statistics:
          fi     ui     vi     ni   
i = 1  44312    279    157    279
i = 2    860   3958   3194   4012
i = 3     40  40975  41861  41861
fi: number of units captured i times
ui: number of units captured for the first time on occasion i
vi: number of units captured for the last time on occasion i
ni: number of units captured on occasion i 
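The four statistics are easy to compute by hand, which may help in reading the table. The following sketch uses an invented capture matrix X (rows are units, columns are capture occasions) and is not how Rcapture computes them internally:

```r
# Toy capture histories (invented): rows are units, columns are occasions
X <- rbind(c(1, 0, 0),
           c(1, 1, 0),
           c(0, 1, 1),
           c(0, 0, 1),
           c(1, 1, 1))

occasions <- seq_len(ncol(X))

# fi: number of units captured exactly i times
fi <- sapply(occasions, function(i) sum(rowSums(X) == i))

# ui: number of units captured for the first time on occasion i
first <- apply(X, 1, function(h) min(which(h == 1)))
ui <- sapply(occasions, function(i) sum(first == i))

# vi: number of units captured for the last time on occasion i
last <- apply(X, 1, function(h) max(which(h == 1)))
vi <- sapply(occasions, function(i) sum(last == i))

# ni: number of units captured on occasion i
ni <- colSums(X)
```

Note that fi, ui, and vi each sum to the number of captured units (five in this toy case), just as each of these columns sums to 45212 in the output above.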

The numbers in the first column (fi) are familiar from the previous table: they represent the number of R users identified on one, two, or all three lists. It's a lot more interesting to fit some models on this data with a simple call such as:

> closedp(t)

Number of captured units: 45212 

Abundance estimations and model fits:
               abundance     stderr  deviance df       AIC       BIC
M0              750158.4    23800.7 73777.800  5 73835.630 73853.069
Mt              192022.2     5480.0   240.278  3   302.109   336.986
Mh Chao (LB)    806279.2    26954.8 73694.125  4 73753.956 73780.113
Mh Poisson2    2085896.4   214443.8 73694.125  4 73753.956 73780.113
Mh Darroch     5516992.8  1033404.9 73694.125  4 73753.956 73780.113
Mh Gamma3.5   14906552.8  4090049.0 73694.125  4 73753.956 73780.113
Mth Chao (LB)   205343.8     6190.1    30.598  2    94.429   138.025
Mth Poisson2   1086549.0   114592.9    30.598  2    94.429   138.025
Mth Darroch    6817027.3  1342273.7    30.598  2    94.429   138.025
Mth Gamma3.5  45168873.4 13055279.1    30.598  2    94.429   138.025
Mb                 -36.2        6.2   107.728  4   167.559   193.716
Mbh               -144.2       25.9    84.927  3   146.758   181.635
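As a sanity check on one row, the Mh Chao (LB) abundance can be reproduced by hand from the fi column. The sketch below assumes that the lower bound has the form n + ((t - 1) / t) * f1^2 / (2 * f2), where t is the number of lists. That this form matches the output is an observation from the arithmetic, not a statement about Rcapture's internals.

```r
# Frequencies taken from the fi column of the descriptive() output
f1 <- 44312      # names on exactly one list
f2 <- 860        # names on exactly two lists
n  <- 45212      # names captured at least once
t_occ <- 3       # number of lists (capture occasions)

# Assumed form of Chao's lower bound for t_occ capture occasions
N_chao <- n + ((t_occ - 1) / t_occ) * f1^2 / (2 * f2)
round(N_chao, 1)
```

The result rounds to 806279.2, the value in the Mh Chao (LB) row. It also shows how strongly the estimate depends on the ratio of single to double captures: with so few names on more than one list, any matching error moves the estimate a lot.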

Once again, I have to emphasize that these are not actually estimates of the abundance of all R users around the world, because:

  • Our non-independent lists refer to far more specific groups
  • The model assumptions do not hold
  • The R community is definitely not a closed population and some open-population models would be more reliable
  • We missed some very important data-cleansing steps, as noted

Further ideas on extending the capture-recapture models

Although this playful example did not really help us find out the number of R users around the world, the basic idea is definitely viable with some extensions. First of all, we might consider analyzing the source data in smaller chunks, for example, by looking for the same e-mail addresses or names in different years of the R-help archives. This might help with estimating the number of people who were thinking about submitting a question to R-help but did not send the e-mail after all (for example, because another poster's question had already been answered, or because they resolved the problem without external help).
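A minimal sketch of the yearly idea, assuming a hypothetical data frame called mails with one row per message and invented sender and year columns, could treat each year as a capture occasion:

```r
# Hypothetical R-help records (invented senders and years)
mails <- data.frame(
    sender = c("a@x.org", "a@x.org", "b@y.org",
               "c@z.org", "c@z.org", "a@x.org"),
    year   = c(2012, 2013, 2012, 2013, 2014, 2013))

# Capture matrix: 1 if the sender posted at least once in that year
caps <- (table(mails$sender, mails$year) > 0) * 1

# Number of senders seen in exactly one or two of the years
freq <- table(rowSums(caps))
```

A matrix such as caps has the same shape as the table we passed to descriptive and closedp earlier, with one row per unit and one column per occasion, so the same kinds of models could be tried on it.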

On the other hand, we could also add a number of other data sources to the models, so that we can produce more reliable estimates for R users who do not contribute to the R Foundation, CRAN, or R-help.

I have been working on a similar study over the past 2 years, collecting data on the number of:

  • R Foundation ordinary and supporting members, donors and benefactors
  • Attendees at the annual R conference between 2004 and 2015
  • CRAN downloads per package and country in 2013 and 2014
  • R User Groups and meet-ups with the number of members
  • Visitors of http://www.r-bloggers.com in 2013
  • GitHub users with at least one repository with R source code
  • Google search trends on R-related terms

You can find the results on an interactive map, along with the country-level aggregated data in a CSV file, at http://rapporter.net/custom/R-activity, and an offline data visualization presented at the past two useR! conferences at http://bit.ly/useRs2015.
