Accessing GitHub data

There are several ways to access data from GitHub, just as we saw for other social media platforms in the previous chapters. If you are using R, you can use the rgithub package, which provides a high-level interface with several functions for retrieving data from GitHub. Alternatively, you can register an application with GitHub to obtain OAuth credentials that give authenticated access to GitHub's own API. We will cover both mechanisms briefly in this section.

Using the rgithub package for data access

As we mentioned earlier, there is a specific package in R known as rgithub which provides high level functions and interfaces to access GitHub data. You can install and load the package in R using the following code snippet:

# devtools is needed to install packages hosted on GitHub
# install.packages("devtools")
library(devtools)
install_github("cscheid/rgithub")
library(github)

Of course, you would then need a GitHub application ID and secret token to gain seamless access to the data without too many restrictions and rate limits. We will cover this in the next section. To get more information about this package you can head over to the official GitHub repository at https://github.com/cscheid/rgithub. The package has well-defined functions with proper names and you can start using them to retrieve data from GitHub. The following snippet shows an example of getting data pertaining to your repositories:

# assuming you get your app id and secret tokens from GitHub
ctx = interactive.login(client.id, client.secret)
repos = get.my.repositories(ctx)

While this is a good library, we found it quite restrictive: it does not expose the full flexibility of GitHub's API, and its functions lack proper documentation, unlike other packages we have used such as Rfacebook. Hence, in this chapter, we will use GitHub's API directly in all our analyses.

Registering an application on GitHub

The main intent of registering an application on GitHub is to access the official API provided by GitHub using authenticated requests. If you are already familiar with the GitHub API, you might know that you can still access and retrieve data without any authentication. The question which might come to mind is: why do we need any authentication then? The reason is to avoid excessive rate limiting and to be able to retrieve more data with more API requests per hour. Without authentication, you can make 60 API requests per hour; with authentication, you can make up to 5,000 requests per hour free of charge. This is a substantial gain, so we will spend some time understanding how to register an application and use its credentials for data retrieval.
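To make the difference concrete, here is a minimal sketch of the arithmetic, assuming a hypothetical workload of 10,000 API requests (the workload size is an illustration, not a figure from this chapter). At 60 requests per hour the job spans 167 hourly windows, while at 5,000 per hour it fits in 2.

```r
# Hypothetical workload: total number of API requests we want to make.
requests_needed <- 10000

# Hourly windows needed at each rate limit quoted in the text.
hours_unauthenticated <- ceiling(requests_needed / 60)    # 60 requests/hour
hours_authenticated   <- ceiling(requests_needed / 5000)  # 5,000 requests/hour

cat("Unauthenticated:", hours_unauthenticated, "hours\n")
cat("Authenticated:  ", hours_authenticated, "hours\n")
```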

You will need a GitHub account for this; we assume you already have one. If not, head over to https://github.com and create a personal account; it's free if you host public repositories. Once you have an account, go to the settings page at https://github.com/settings/developers, which lists all your registered developer OAuth applications. The following snapshot depicts the settings page for all my registered GitHub applications:


The GitHub profile settings page

The boxes in red point out the main areas which you should focus on. You can already see in the bottom left-hand menu that we are in the OAuth applications section under the settings. Existing applications, if any, are shown there, which have been registered to use the GitHub API as mentioned in the text at the bottom. For now, you need to look at the top right box which points to a button saying Register a new application. Click on the button and it should take you to the next page as depicted in the following snapshot:


The GitHub new app registration page

You can give the application a name of your choice. Once you have filled in the application details, click on the Register application button. This should take you to the following screen, which shows you the Client ID and Client Secret tokens:


Getting your GitHub app access tokens

Copy these tokens and store them somewhere so that we can start using them when we access the GitHub API.
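One way to store the tokens without hard-coding them in scripts is to keep them in environment variables, for example in your ~/.Renviron file. The following is a minimal sketch of that idea; the variable names GITHUB_CLIENT_ID and GITHUB_CLIENT_SECRET and the helper read_github_credentials() are our own hypothetical choices, not names required by GitHub.

```r
# Hypothetical helper: read the Client ID and Client Secret from
# environment variables (the variable names are an assumption).
read_github_credentials <- function() {
  id     <- Sys.getenv("GITHUB_CLIENT_ID", unset = NA)
  secret <- Sys.getenv("GITHUB_CLIENT_SECRET", unset = NA)
  if (is.na(id) || is.na(secret)) {
    stop("Set GITHUB_CLIENT_ID and GITHUB_CLIENT_SECRET before calling the API")
  }
  list(client_id = id, client_secret = secret)
}

# For illustration only: set placeholder values in the current session.
# In practice these lines would live in ~/.Renviron, not in your script.
Sys.setenv(GITHUB_CLIENT_ID = "XXXXXX", GITHUB_CLIENT_SECRET = "XXXXXXXXXX")
creds <- read_github_credentials()
```

Keeping the values outside the script means the code can be shared or committed without exposing the secret.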

Accessing data using the GitHub API

We are now ready to start using the GitHub API to access and retrieve data! The base URL for the API is always https://api.github.com, and there are various endpoints which can be requested depending on the type of data we want to retrieve. GitHub's latest API is version 3 at the time of writing this book, and there is a dedicated section on the GitHub website containing detailed documentation of the API. Feel free to visit https://developer.github.com/guides/getting-started if you are interested in knowing more about how the API works.
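As a small illustration of how endpoints hang off the base URL, the sketch below joins path segments onto https://api.github.com. The helper github_endpoint() is a hypothetical name of ours, and octocat is used only as an example user name.

```r
github_base_url <- "https://api.github.com"

# Hypothetical helper: join path segments onto the base URL,
# percent-encoding each segment so unusual characters stay valid.
github_endpoint <- function(...) {
  segments <- vapply(
    list(...),
    function(s) utils::URLencode(as.character(s), reserved = TRUE),
    character(1)
  )
  paste(c(github_base_url, segments), collapse = "/")
}

user_url  <- github_endpoint("users", "octocat")           # a user's profile
repos_url <- github_endpoint("users", "octocat", "repos")  # that user's repositories
```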

Any API endpoint typically returns data in JSON format, which is usually a document of key-value pairs. We will use libraries such as httr and jsonlite to retrieve and parse the data into easy-to-use data formats such as R DataFrames. Let's take a simple example of trying to get relevant statistics for our personal GitHub account. First, let us load the necessary packages we will be using:

library(httr)
library(jsonlite)

Now, let us create the necessary arguments we will be passing to the API, which will be our access token details:

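# replace the placeholders with your application's Client ID and Client Secret
auth.id <- 'XXXXXX'
auth.pwd <- 'XXXXXXXXXX'
# query-string parameters carrying the credentials
api_id_param <- paste0('client_id=', auth.id)
api_pwd_param <- paste0('client_secret=', auth.pwd)
arg_sep <- '&'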

Now, if I want to get the statistics for my account, I will be using the following snippet where the base API URL points to my GitHub account user name:

base_url <- 'https://api.github.com/users/dipanjanS?'
my_profile_url <- paste0(base_url, api_id_param, arg_sep, api_pwd_param)
response <- GET(my_profile_url)

On closer inspection of the response object, we observe the following output:

> response
Response [https://api.github.com/users/dipanjanS?client_id=XXXXXX&client_secret=XXXXXXXXXX]
  Date: 2017-03-23 20:14
  Status: 200
  Content-Type: application/json; charset=utf-8
  Size: 1.45 kB
{
  "login": "dipanjanS",
  "id": 3448263,
  "avatar_url": "https://avatars2.githubusercontent.com/u/3448263?v=3",
  "gravatar_id": "",
  "url": "https://api.github.com/users/dipanjanS",
  "html_url": "https://github.com/dipanjanS",
  "followers_url": "https://api.github.com/users/dipanjanS/followers",
  "following_url": "https://api.github.com/users/dipanjanS/following{/other_user}",
  "gists_url": "https://api.github.com/users/dipanjanS/gists{/gist_id}",
...
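The request shown earlier can also be wrapped in a small function that stops with an error on a failed HTTP status instead of silently returning an error document. The sketch below is one possible shape, not the approach used in the rest of this chapter. The names profile_url() and fetch_profile() are hypothetical. To our knowledge, GitHub has since deprecated sending credentials as query parameters, so the sketch assumes the Client ID and Client Secret are sent through HTTP basic authentication via httr's authenticate() instead.

```r
# Hypothetical helper: build the profile endpoint for a given user name.
profile_url <- function(username) {
  paste0("https://api.github.com/users/",
         utils::URLencode(username, reserved = TRUE))
}

# Hypothetical helper: fetch a profile and fail loudly on HTTP errors.
# Assumes the httr package is installed; credentials are sent via
# HTTP basic authentication rather than in the query string.
fetch_profile <- function(username, client_id, client_secret) {
  response <- httr::GET(
    profile_url(username),
    httr::authenticate(client_id, client_secret)
  )
  httr::stop_for_status(response)         # raise an R error on 4xx/5xx
  httr::content(response, as = "parsed")  # parsed JSON as an R list
}
```

A call such as fetch_profile('dipanjanS', auth.id, auth.pwd) would then return the parsed profile as a list, provided the credentials are valid.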

Remember our discussion about rate limits earlier? Well, we can see how many requests per hour are allotted to us and how many remain using the following snippet:

> as.data.frame(response$headers)[,c('x.ratelimit.limit', 'x.ratelimit.remaining')]
  x.ratelimit.limit x.ratelimit.remaining
1              5000                  4999

Thus, you can clearly see from the headers in our response object that we have 5,000 requests allocated to us, of which we used up one in the previous request. Try executing the same code with no authentication token and see how the rate limits differ.
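The same headers can be summarized with a small helper. The sketch below works on any named list of headers, so it is demonstrated here on mock values rather than a live response. The helper name rate_limit_summary() is hypothetical and the reset timestamp is made up. GitHub also sends an x-ratelimit-reset header giving the time, in seconds since the Unix epoch, at which the quota resets. Note that the header names contain dashes; as.data.frame() in the preceding snippet turns them into dots.

```r
# Hypothetical helper: summarize rate-limit headers from a named list
# such as response$headers. Names are lower-cased before lookup.
rate_limit_summary <- function(headers) {
  names(headers) <- tolower(names(headers))
  limit     <- as.numeric(headers[["x-ratelimit-limit"]])
  remaining <- as.numeric(headers[["x-ratelimit-remaining"]])
  reset     <- as.numeric(headers[["x-ratelimit-reset"]])
  list(
    limit     = limit,
    remaining = remaining,
    used      = limit - remaining,
    resets_at = as.POSIXct(reset, origin = "1970-01-01", tz = "UTC")
  )
}

# Mock headers standing in for response$headers; the reset value is made up.
mock_headers <- list(
  "X-RateLimit-Limit"     = "5000",
  "X-RateLimit-Remaining" = "4999",
  "X-RateLimit-Reset"     = "1490303640"
)
rl <- rate_limit_summary(mock_headers)
```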

We can parse the preceding JSON response object into a more suitable object such as a DataFrame using the following snippet:

me <- content(response)
me <- as.data.frame(t(as.matrix(me)))
View(me[,c('login', 'public_repos', 'public_gists', 'followers',
           'created_at', 'updated_at')])

This gives us the DataFrame as depicted in the following snapshot:


There is a better approach to get the same DataFrame using minimal code. We will use the fromJSON(…) function available in the jsonlite package as depicted in the following code snippet:

me <- fromJSON(my_profile_url)
me <- as.data.frame(t(as.matrix(me)))
View(me[,c('login', 'public_repos', 'public_gists', 'followers',
           'created_at', 'updated_at')])

This gives us the same DataFrame which we wanted for the stats pertaining to my GitHub account as depicted in the following snapshot:

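If you want to experiment with the parsing step without calling the API, fromJSON can also parse a JSON string directly. The following sketch uses a made-up profile document (the field values are illustrative, not real API output) and shows one way to handle null fields, which would otherwise prevent a direct conversion to a one-row DataFrame. This is an alternative to the transpose-based conversion used above, not a replacement for it.

```r
library(jsonlite)

# Made-up profile JSON standing in for an API response; note the null field.
profile_json <- '{"login": "octocat", "public_repos": 8, "followers": 100, "company": null}'

profile <- fromJSON(profile_json)

# Replace NULL entries with NA so that every field has length one.
profile[vapply(profile, is.null, logical(1))] <- NA

# Each field becomes a column of a one-row data frame.
profile_df <- as.data.frame(profile, stringsAsFactors = FALSE)
```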

This gives us a good peek at how to access and retrieve data from GitHub. In the following sections, we will look at various ways to extract and analyze useful data from GitHub, combining these data access mechanisms with the other packages we mentioned in the Environment setup section to produce insightful visualizations. We will start off by analyzing repository activity on GitHub.
