Chapter 5. Analyzing Software Collaboration Trends I – Social Coding with GitHub

Technology has evolved by leaps and bounds over the last couple of decades. With better hardware, software, and data, we are now seeing trends such as open source, big data, predictive analytics, artificial intelligence, and productivity tools. The rise of open source development has created a healthy collaborative culture in which pair programming, open source contributions, and shared problem solving help developers build better software together. In the next couple of chapters, we will analyze trends in software development and collaboration by focusing on two major platforms: GitHub and StackExchange.

To know about GitHub, we need to know about Git! If you are a software developer or tester, or have worked collaboratively with others in building software, you might be familiar with Git. In 2005, Linus Torvalds, popularly known as the father of Linux, created Git to support collaborative development of the Linux kernel with other kernel developers. Git is a distributed version control and source code management system that is widely used in collaborative development projects. GitHub is a web-based platform for hosting Git repositories. Version control with Git remains at its heart, and on top of that it adds features that encourage social collaboration when developing software.

Another popular aspect of software collaboration is Question & Answer (Q&A) platforms such as StackOverflow, WikiAnswers, and StackExchange, where users can post questions about the bugs, issues, and problems they face day to day while building software, technology, and products. Platforms such as these have become indispensable to the community at large.

We will follow a structured two-chapter approach in this book focusing on two different software collaboration platforms. In this chapter, we will be focusing on GitHub, the most popular social coding and collaboration platform for developers. We will learn a bit about the various features of GitHub, understand ways to access GitHub data with the help of APIs, and also cover several major aspects of analyzing GitHub data including:

  • Analyzing repository activity
  • Retrieving trending repositories
  • Analyzing repository trends
  • Analyzing language trends
  • Analyzing user trends

In the next chapter, we will focus on a popular Q&A platform for software and collaboration, StackExchange. We will learn about the StackExchange platform, look at various ways to access data including APIs and data dumps, and cover different aspects of analyzing the data by focusing on two interesting problems.

We will be using a suite of libraries in the following sections to retrieve, parse, analyze, and visualize data in R. While we will mention the libraries we use as necessary, in case you do not have any of them installed, remember to use the install.packages(…) function to install the necessary packages.

Environment setup

As mentioned before, we will be using several R libraries, or packages, in this chapter. The major ones are outlined in the following table along with their main utility. Feel free to use install.packages(…) to install them. For ease of use, I have also created a file called env_setup.R, which you can load into R and execute to install all the packages used in this chapter. You can find more detailed package descriptions on the CRAN website at https://cran.r-project.org/web/packages/available_packages_by_name.html:

| R package  | Utility description |
|------------|---------------------|
| httr       | Tools for working with APIs, URLs, and HTTP |
| jsonlite   | Flexible, robust, high-performance tools for working with JSON in R |
| dplyr      | A fast, consistent tool for working with data frame-like objects, both in memory and out of memory |
| ggplot2    | A system for declaratively creating graphics, plots, and visualizations based on The Grammar of Graphics |
| reshape2   | A tool to flexibly restructure and aggregate data |
| hrbrthemes | A compilation of extra ggplot2 themes, scales, and utilities with an emphasis on typography |
| sqldf      | A tool to manipulate R data frames using SQL |
| lubridate  | Fast and user-friendly parsing and manipulation of date-time data |
| corrplot   | Helps in visualization of correlation matrices |
| devtools   | Collection of package development tools and utilities; useful for installing packages from GitHub |
| data.table | Fast aggregation of large data frames |

We will be using several functions from these packages to retrieve, analyze, and visualize our data in various scenarios in the following sections. An important point to note here is that I had to install the hrbrthemes package from its GitHub repository rather than from CRAN. With devtools already installed, you can use the following code snippet to do the same:

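# Install the packages that hrbrthemes depends on
install.packages("scales")
install.packages("extrafontdb")
install.packages("Rttf2pt1")
# Install hrbrthemes from its GitHub repository (requires devtools)
devtools::install_github("hrbrmstr/hrbrthemes")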
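
For the remaining packages in the table, which are available from CRAN, one way to avoid reinstalling what you already have is to check the local library first. The following is only a minimal sketch under that assumption, not the contents of env_setup.R; the names required_packages, missing_packages, and install_missing are hypothetical helpers introduced here for illustration:

```r
# CRAN packages from the table above (hrbrthemes is installed separately
# from GitHub, as shown in the previous snippet).
required_packages <- c("httr", "jsonlite", "dplyr", "ggplot2", "reshape2",
                       "sqldf", "lubridate", "corrplot", "devtools",
                       "data.table")

# Hypothetical helper: return the packages in `required` that are not in
# `installed`. By default, `installed` is read from the local R library.
missing_packages <- function(required,
                             installed = rownames(installed.packages())) {
  setdiff(required, installed)
}

# Hypothetical helper: install only the packages that are missing and
# invisibly return their names.
install_missing <- function(required) {
  to_install <- missing_packages(required)
  if (length(to_install) > 0) {
    install.packages(to_install)
  }
  invisible(to_install)
}

# Uncomment to run the installation (needs network access and a CRAN mirror):
# install_missing(required_packages)
```

Calling missing_packages(required_packages) on its own lists what is absent without installing anything, which can be a useful check before starting the chapter.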

Before we start discussing data retrieval and analysis from GitHub, let us understand a bit more about GitHub and its major features.
