"R" is for Re-use
Andrew Elliott
Previously on "R is for ..."
One of R's greatest strengths is the level of activity in the user community and the range of packages that have been developed and contributed to the general good. There are thousands of packages out there and the list grows daily. "How is the young data scientist to stay on top of this flood of material?", I hear you ask. Various helpful lists have been contributed by bloggers and other commentators, such as "10 R packages I wish I knew about earlier". The CRANtastic website provides a list of favourites based on user ratings (http://crantastic.org/popcon), and r-bloggers provides a list by frequency of download in RStudio (http://www.r-bloggers.com/a-list-of-r-packages-by-popularity/).
Dependencies
Another way of looking at this is to ask which packages are most fundamental to the broader R community: which packages do package authors build upon? The CRAN repository provides structured data on each package; among the fields provided are "Depends" and "Imports", which list the packages each one is built upon. It seemed a fun thing to see which packages were most depended upon, and so which were the most fundamental in the R ecosystem.
First-Order
For this exercise I didn't bother distinguishing between "Depends" and "Imports". I wrote a simple routine to take the list of packages from CRAN and then, for each one, to harvest the contents of the "Depends" and "Imports" properties from the relevant page on the CRAN website and stash those package names in a table which I called "antecedants". The table has three columns: "self", the package in question; "ante", the antecedent package; and "order", the depth of the dependency.
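The harvesting routine itself is not shown, so what follows is only a minimal sketch of the parsing step, under stated assumptions. The helper parse_deps(), the toy_cran data frame and the package names pkgA to pkgD are all hypothetical; the sketch assumes each field arrives as a comma-separated string with optional version constraints in parentheses.

```r
# Hypothetical helper: split a "Depends" or "Imports" field into package names.
parse_deps <- function(field) {
  if (is.na(field) || !nzchar(field)) return(character(0))
  parts <- strsplit(field, ",", fixed = TRUE)[[1]]
  parts <- gsub("\\s*\\([^)]*\\)", "", parts)  # drop constraints like "(>= 3.0)"
  parts <- trimws(parts)
  parts[nzchar(parts) & parts != "R"]          # "R" itself is not a package
}

# Invented stand-in for the harvested CRAN data (not real packages).
toy_cran <- data.frame(
  Package = c("pkgA", "pkgB"),
  Depends = c("R (>= 3.0), pkgB", NA),
  Imports = c("pkgC (>= 1.2), pkgD", "pkgC"),
  stringsAsFactors = FALSE
)

# One row per (package, antecedent) pair, treating Depends and Imports alike.
rows <- lapply(seq_len(nrow(toy_cran)), function(i) {
  ante <- unique(c(parse_deps(toy_cran$Depends[i]),
                   parse_deps(toy_cran$Imports[i])))
  if (length(ante) == 0) return(NULL)
  data.frame(self = toy_cran$Package[i], ante = ante, order = 1L,
             stringsAsFactors = FALSE)
})
first_order <- do.call(rbind, rows)
first_order
```

As an alternative to reading the web pages, base R's available.packages() returns a matrix that includes "Depends" and "Imports" columns, which could feed the same helper.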
options(width=100)
source("Rpackages.R")
load("Packages.Rda")
load("Antecedants.Rda")
head(antecedants)
## self ante order
## 1 cleangeo sp 1
## 2 cleangeo rgeos 1
## 3 cleangeo maptools 1
## 4 smerc SpatialTools 1
## 5 smerc fields 1
## 6 smerc maps 1
That gave the first-order dependencies, and here are some interesting glimpses into that table. I used table() to count the order-1 dependencies for each antecedent, to see which are most re-used, and then sorted that table to reveal the top ten.
ante1<-table(antecedants[antecedants["order"]==1,]$ante)
anteSorted1<-ante1[order(ante1, decreasing=TRUE)]
length(anteSorted1)
## [1] 1458
dim(antecedants[antecedants["order"]==1,])
## [1] 10330 3
head(anteSorted1, 10)
##
## MASS Rcpp ggplot2 plyr Matrix lattice stringr reshape2 sp mvtnorm
## 374 370 321 266 183 173 157 151 146 142
So 1,458 packages are re-used in some way, for a total of 10,330 order-1 dependencies, and the most popular include many of the usual suspects, like ggplot2 and plyr.
Going Deeper
But just looking at the first level is not good enough. If your package builds on, say, ggplot2, which has plyr among its antecedents, then of course plyr is an antecedent of your package too, but a second-order one. So we need to get recursive, and we can do this just by analysing the antecedants table: we build the order-2 table from the order-1 table, the order-3 table from the order-2 table, and so on, until we finally bottom out and reach the maximum depth. Along the way we need to make sure we don't double-count: if a package uses ggplot2 and also uses plyr directly, we don't want to count plyr twice.
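The recursion described above can be sketched as a repeated join of the current order's rows against the order-1 table. This is an illustration under assumptions, not the routine actually used: expand_antecedents(), the max_depth guard and the toy package names are hypothetical. The sketch keeps a pair at every depth at which it occurs and removes duplicates only within each order, leaving cross-order de-duplication to the totalling step; whether the original routine worked this way is a guess.

```r
# Toy first-order table; the package names are invented for illustration.
first <- data.frame(
  self  = c("pkgA", "pkgA", "pkgB", "pkgC"),
  ante  = c("pkgB", "pkgC", "pkgC", "pkgD"),
  order = 1L,
  stringsAsFactors = FALSE
)

# Hypothetical helper: extend an order-1 table to all deeper orders.
expand_antecedents <- function(first, max_depth = 50L) {
  edges <- unique(first[, c("self", "ante")])
  # Lookup table: each package ("via") and its direct antecedents.
  lookup <- data.frame(via = edges$self, ante = edges$ante,
                       stringsAsFactors = FALSE)
  result <- data.frame(self = edges$self, ante = edges$ante, order = 1L,
                       stringsAsFactors = FALSE)
  frontier <- edges
  depth <- 1L
  # max_depth guards against endless looping if the input contains a cycle.
  while (nrow(frontier) > 0 && depth < max_depth) {
    depth <- depth + 1L
    reached <- data.frame(self = frontier$self, via = frontier$ante,
                          stringsAsFactors = FALSE)
    # Step one level deeper: self -> via -> ante.
    step <- merge(reached, lookup, by = "via")
    frontier <- unique(step[, c("self", "ante")])  # de-duplicate within this order
    if (nrow(frontier) > 0) {
      result <- rbind(result,
                      data.frame(self = frontier$self, ante = frontier$ante,
                                 order = depth, stringsAsFactors = FALSE))
    }
  }
  rownames(result) <- NULL
  result
}

toy <- expand_antecedents(first)
table(toy$order)
# Count each antecedent once per dependent package, whatever the depth.
table(unique(toy[, -3])$ante)
```

On this toy input pkgD is reached from pkgA at both order 2 and order 3, so dropping the order column and calling unique() is what stops it being counted twice in the totals.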
So, for example, here are the most frequent order-3 dependencies:
ante3<-table(antecedants[antecedants["order"]==3,]$ante)
anteSorted3<-ante3[order(ante3, decreasing=TRUE)]
head(anteSorted3, 10)
##
## lattice Rcpp stringr RColorBrewer plyr magrittr digest
## 674 659 503 469 465 421 394
## dichromat labeling munsell
## 380 380 380
And having chased this down, until there were no more levels, the winners are ...
anteN<-table(unique(antecedants[,-3])$ante)
anteSortedN<-anteN[order(anteN, decreasing=TRUE)]
top10ante<-head(anteSortedN, 10)
top10ante
##
## Rcpp lattice MASS magrittr stringi stringr digest
## 1341 1119 1048 911 876 864 853
## plyr RColorBrewer colorspace
## 799 654 650
So what are these packages that float to the top of the list?
packages[trim(packages$name) %in% names(top10ante),1:2]
## name desc
## 449 Rcpp Seamless R and C++ Integration
## 674 MASS Support Functions and Datasets for Venables and Ripley's MASS
## 1489 lattice Trellis Graphics for R
## 1820 stringi Character String Processing Facilities
## 2013 plyr Tools for Splitting, Applying and Combining Data
## 2460 stringr Simple, Consistent Wrappers for Common String Operations
## 2871 colorspace Color Space Manipulation
## 3479 digest Create Cryptographic Hash Digests of R Objects
## 3661 RColorBrewer ColorBrewer Palettes
## 3796 magrittr A Forward-Pipe Operator for R
Oh, and ...
Just for fun, some other bits and pieces
The deepest dependency:
head(antecedants[antecedants$order==max(antecedants$order),])
## self ante order
## 53579 BIFIEsurvey lattice 10
## 53580 BIFIEsurvey Rcpp 10
## 53581 BIFIEsurvey stringi 10
## 53582 BIFIEsurvey magrittr 10
## 53583 BIFIEsurvey colorspace 10
The number of dependencies per order peaks at order 2 and then tails away:
table(antecedants$order)
##
## 1 2 3 4 5 6 7 8 9 10
## 10330 14504 11647 8289 5052 2582 932 208 34 5
The most dependent packages - the ones which will pull in the greatest number of other packages:
selfN<-table(unique(antecedants[,-3])$self)
selfSortedN<-selfN[order(selfN, decreasing=TRUE)]
top10self<-head(selfSortedN, 10)
top10self
##
## BIFIEsurvey miceadds immer sirt treescape semPlot bootnet
## 120 119 108 106 92 87 84
## IATscores RAM diveRsity
## 83 82 81
And these highly dependent packages, what do they do?
packages[packages$name %in% names(top10self),1:2]
## name desc
## 386 immer Item Response Models for Multiple Ratings
## 437 treescape Statistical Exploration of Landscapes of Phylogenetic Trees
## 484 BIFIEsurvey Tools for Survey Statistics in Educational Assessment
## 1546 miceadds Some Additional Multiple Imputation Functions, Especially for\n'mice'
## 1817 sirt Supplementary Item Response Theory Models
## 2231 RAM R for Amplicon-Sequencing-Based Microbial-Ecology
## 2277 IATscores Implicit Association Test Scores Using Robust Statistics
## 2921 bootnet Bootstrap Methods for Various Network Estimation Routines
## 3526 diveRsity A Comprehensive, General Purpose Population Genetics Analysis\nPackage
## 4335 semPlot Path diagrams and visual analysis of various SEM packages'\noutput