Chapter 14. Analyzing the R Community

In this final chapter, I will try to summarize what you have learned in the past 13 chapters. To this end, we will work through an actual case study, independent of the previously used hflights and mtcars datasets, and try to estimate the size of the R community. This is a rather difficult task, as there is no list of R users around the world; thus, we will have to build some predictive models on a number of partial datasets.

To this end, we will do the following in this chapter:

  • Collect live data from different data sources on the Internet
  • Cleanse the data and transform it to a standard format
  • Run some quick descriptive, exploratory analysis methods
  • Visualize the extracted data
  • Build some log-linear models on the number of R users based on an independent list of names

R Foundation members

One of the easiest things we can do is count the members of the R Foundation, the organization coordinating the development of the core R program. As the ordinary members of the Foundation include only the R Development Core Team, we had better check the supporting members. Anyone can become a supporting member of the Foundation by paying a nominal yearly fee (which I highly suggest you do, by the way). The list is available on the http://r-project.org site, and we will use the XML package (for more details, see Chapter 2, Getting Data from the Web) to parse the HTML page:

> library(XML)
> page <- htmlParse('http://r-project.org/foundation/donors.html')
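
Note that the r-project.org site now redirects to HTTPS, which htmlParse cannot download on its own, so the above call may fail with a recent R setup. A minimal workaround, assuming the RCurl package is available, is to fetch the raw HTML first and then parse it as text:

> library(RCurl)
> html <- getURL('https://www.r-project.org/foundation/donors.html')
> page <- htmlParse(html, asText = TRUE)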

Now that we have the HTML page loaded into R, we can use the XML Path Language to extract the list of the supporting members of the Foundation, by reading the list after the Supporting members header:

> list <- unlist(xpathApply(page,
+     "//h3[@id='supporting-members']/following-sibling::ul[1]/li", 
+     xmlValue))
> str(list)
 chr [1:279] "Klaus Abberger (Germany)" "Claudio Agostinelli (Italy)" 

From this character vector of 279 names and countries, let's extract the names of the supporting members and their countries separately:

> supporterlist <- sub(' \\([a-zA-Z ]*\\)$', '', list)
> countrylist   <- substr(list, nchar(supporterlist) + 3,
+                               nchar(list) - 1)

So we first extracted the names by removing everything from the space before the opening parenthesis to the end of the string, and then we matched the countries by the character positions computed from the number of characters in the names and in the original strings.
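
To see how these two calls work together, here is a quick check on a single made-up entry (the name is, of course, hypothetical):

> example <- 'Jane Doe (New Zealand)'
> sub(' \\([a-zA-Z ]*\\)$', '', example)  # drop the trailing ' (country)'
[1] "Jane Doe"
> substr(example, nchar('Jane Doe') + 3, nchar(example) - 1)
[1] "New Zealand"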

Besides this name list of the 279 supporting members of the R Foundation, we now also know the distribution of the members' countries of citizenship or residence:

> tail(sort(prop.table(table(countrylist)) * 100), 5)
     Canada Switzerland          UK     Germany         USA 
   4.659498    5.017921    7.168459   15.770609   37.992832 
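
Before moving on to a map, an optional quick look at the same distribution with base graphics can be useful; this is just a sanity check on the raw counts:

> barplot(tail(sort(table(countrylist)), 10), las = 2,
+   main = 'Top 10 countries of supporting members')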

Visualizing supporting members around the world

It's probably not that surprising that most supporting members are from the USA, and that some European countries are also at the top of this list. Let's save this table so that we can generate a map of this count data after some quick data transformations:

> countries <- as.data.frame(table(countrylist))

As mentioned in Chapter 13, Data Around Us, the rworldmap package can render country-level maps in a very easy way; we just have to associate the values with some polygons. Here, we will use the joinCountryData2Map function, first enabling the verbose option to see which country names failed to match:

> library(rworldmap)
> joinCountryData2Map(countries, joinCode = 'NAME',
+    nameJoinColumn = 'countrylist', verbose = TRUE)
32 codes from your data successfully matched countries in the map
4 codes from your data failed to match with a country code in the map
     failedCodes failedCountries
[1,] NA          "Brasil"       
[2,] NA          "CZ"           
[3,] NA          "Danmark"      
[4,] NA          "NL"           
213 codes from the map weren't represented in your data

So we tried to match the country names stored in the countries data frame, but failed for the previously listed four strings. Although we could manually fix this, in most cases it's better to automate what we can, so let's pass all the failed strings to the Google Maps geocoding API and see what it returns:

> library(ggmap)
> for (fix in c('Brasil', 'CZ', 'Danmark', 'NL')) {
+   countrylist[which(countrylist == fix)] <-
+       geocode(fix, output = 'more')$country
+ }
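
Please note that recent versions of ggmap require a registered Google API key (see the register_google function) before geocode will return anything, so if you cannot or do not want to call the web service, a hand-written lookup table covering the four problematic strings is a simple offline alternative; the fixes object below is of course just a hypothetical helper:

> fixes <- c(Brasil = 'Brazil', CZ = 'Czech Republic',
+            Danmark = 'Denmark', NL = 'Netherlands')
> matched <- countrylist %in% names(fixes)  # manual, offline fix
> countrylist[matched] <- fixes[countrylist[matched]]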

Now that we have fixed the country names with the help of the Google geocoding service, let's regenerate the frequency table and map those values to the polygon names with the rworldmap package:

> countries <- as.data.frame(table(countrylist))
> countries <- joinCountryData2Map(countries, joinCode = 'NAME',
+   nameJoinColumn = 'countrylist')
36 codes from your data successfully matched countries in the map
0 codes from your data failed to match with a country code in the map
211 codes from the map weren't represented in your data

These results are much more satisfying! Now we have the number of supporting members of the R Foundation mapped to the countries, so we can easily plot this data:

> mapCountryData(countries, 'Freq', catMethod = 'logFixedWidth',
+   mapTitle = 'Number of R Foundation supporting members')
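
If you would rather write the map to a file than to the active graphics device, you can wrap the call in a bitmap device; a minimal sketch with a hypothetical file name:

> png('r-foundation-members.png', width = 1000, height = 600)
> mapCountryData(countries, 'Freq', catMethod = 'logFixedWidth',
+   mapTitle = 'Number of R Foundation supporting members')
> dev.off()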

Well, it's clear that most supporting members of the R Foundation are based in the USA, Europe, Australia, and New Zealand (where R was born more than 20 years ago).

But the number of supporters is unfortunately really low, so let's see what other data sources we can find and utilize in order to estimate the number of R users around the world.
