Centrality measures of networks

So we have identified almost 30,000 relations between our 6,500 packages. Is this a sparse or a dense network? In other words, how many connections do we have out of all the possible package dependencies, that is, compared to the case where every package depends on every other package? We do not really need any feature-rich package to calculate that:

> nrow(edges) / (nrow(pkgs) * (nrow(pkgs) - 1))
[1] 0.0006288816
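
To make the arithmetic concrete, here is a minimal sketch of the same ratio on a toy network. The `toy_pkgs` and `toy_edges` data frames and the `src` column name are made up for illustration; only the `dep` column name comes from the dataset used above.

```r
# Hypothetical toy data: four packages and three dependency relations.
toy_pkgs  <- data.frame(name = c("a", "b", "c", "d"))
toy_edges <- data.frame(src = c("a", "b", "c"),   # the depending package
                        dep = c("b", "c", "d"))   # the package it requires

n <- nrow(toy_pkgs)

# In a directed network without self-loops, every ordered pair of
# distinct packages is a possible edge: n * (n - 1) = 4 * 3 = 12.
possible <- n * (n - 1)

# Density is the share of possible edges that actually exist.
density <- nrow(toy_edges) / possible
print(density)   # 3 / 12 = 0.25
```

A density of 1 would mean that every package depends on every other one, while a value near 0, like the CRAN figure above, indicates a sparse network.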

This is a rather low proportion (around 0.06 percent of all possible connections), which makes the life of R sysadmins rather easy compared to maintaining a dense network of R software. But who are the central players in this game? Which R packages are depended upon the most?

We can also compute a rather trivial metric to answer this question without any serious SNA knowledge, as the question can be rephrased as "Which R package is mentioned the most times in the dep column of the edges dataset?" Or, in plain English: "Which package has the most reverse dependencies?"

> head(sort(table(edges$dep), decreasing = TRUE))
       R  methods     MASS    stats testthat  lattice 
    3702      933      915      601      513      447

It seems that more than half of the packages (3,702 of around 6,500) depend on a minimal version of R. So as not to distort our directed network, let's remove these edges:

> edges <- edges[edges$dep != 'R', ]

And now it's time to transform our list of connections into a real graph object to compute more advanced metrics, and also to visualize the data:

> library(igraph)
> g <- graph.data.frame(edges)
> summary(g)
IGRAPH DN-- 5811 23258 -- 
attr: name (v/c), label (e/c)

After loading the package, the graph.data.frame function transforms the data frame of edges into an igraph object. This is an extremely useful class with a variety of supported methods. The summary simply prints the number of vertices and edges; as only 5,811 of the roughly 6,500 packages show up as vertices, around 700 R packages have no dependencies and are not required by any other package either. Let's compute the previously discussed and manually computed metrics with igraph:

> graph.density(g)
[1] 0.0006888828
> head(sort(degree(g), decreasing = TRUE))
 methods     MASS    stats testthat  ggplot2  lattice 
     933      923      601      516      459      454

It's not that surprising to see the methods package at the top of the list, as it's often required in packages with complex S4 methods and classes. The MASS and stats packages include most of the often used statistical methods, but what about the others? The lattice and ggplot2 packages are extremely smart and feature-rich graphing engines, and testthat is one of the most popular unit-testing extensions of R; packages that use it must list it in their DESCRIPTION file before they are submitted to the central CRAN servers.
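
Note that the degree counts differ slightly from the earlier table (for example, MASS shows 923 instead of 915). This is because degree counts both incoming and outgoing edges by default, while the table only counted reverse dependencies. The following minimal sketch shows the difference on a made-up toy graph. The `toy_edges` data and the `src` column name are assumptions for illustration, and the sketch uses the newer igraph function names `graph_from_data_frame` and `edge_density`, which correspond to `graph.data.frame` and `graph.density` used above.

```r
library(igraph)

# Hypothetical toy edge list: each row reads "src depends on dep".
toy_edges <- data.frame(src = c("a", "b", "c", "c"),
                        dep = c("c", "c", "d", "e"))

# The first two columns are taken as the start and end of each edge.
toy_g <- graph_from_data_frame(toy_edges)

deg_all <- degree(toy_g)                 # default: incoming + outgoing edges
deg_in  <- degree(toy_g, mode = "in")    # reverse dependencies only
deg_out <- degree(toy_g, mode = "out")   # own dependencies only

# Package "c" is required by two packages and requires two itself.
print(c(all = deg_all[["c"]], `in` = deg_in[["c"]], out = deg_out[["c"]]))

# Four edges among five vertices: 4 / (5 * 4) = 0.2
toy_density <- edge_density(toy_g)
```

To reproduce the reverse-dependency table exactly, one would therefore pass mode = "in" to degree.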

But degree is only one of the available centrality metrics for social networks. Unfortunately, computing closeness, which shows the distance of each node from the others, is not really meaningful when it comes to dependencies, but betweenness makes for a really interesting comparison with the preceding results:

> head(sort(betweenness(g), decreasing = TRUE))
   Hmisc     nlme  ggplot2     MASS multcomp      rms 
943085.3 774245.2 769692.2 613696.9 453615.3 323629.8

This metric shows the number of times each package acts as a bridge, that is, lies on the shortest path between two other packages. When several equally short paths connect a pair of packages, each path only counts as a fraction, which is why the values are not whole numbers. So it's not about having a lot of depending packages; rather, it shows the importance of the packages from a more global perspective. Just imagine if a package with a high betweenness were deprecated and removed from CRAN; not only the directly dependent packages, but also all the other packages in the dependency tree would be in a rather awkward situation.
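
The contrast between degree and betweenness is easier to see on a small example. The following is only a sketch on a made-up graph; the package names `hub`, `m`, and `z` and the `src` column name are hypothetical.

```r
library(igraph)

# Hypothetical toy edge list: each row reads "src depends on dep".
# Three packages depend on "hub"; "a" also depends on "m",
# which in turn depends on "z".
toy_edges <- data.frame(src = c("a", "b", "c", "a", "m"),
                        dep = c("hub", "hub", "hub", "m", "z"))
toy_g <- graph_from_data_frame(toy_edges)

in_deg <- degree(toy_g, mode = "in")   # number of reverse dependencies
btw    <- betweenness(toy_g)           # directed shortest paths by default

# "hub" has the most reverse dependencies, but no shortest path runs
# through it, as it depends on nothing itself.
# "m" has a single reverse dependency, yet it sits on the only path
# from "a" to "z".
print(rbind(in_degree = in_deg, betweenness = btw))
```

Here the package with the highest in-degree has a betweenness of zero, while the modest `m` package is the one whose removal would cut `a` off from `z`.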
