"R" is for Re-use

Andrew Elliott

Previously on "R is for ..."

One of R's greatest strengths is its active user community and the range of packages that community has developed and contributed to the general good. There are thousands of packages out there and the list grows daily. "How is the young data scientist to stay on top of this flood of material?", I hear you ask. Bloggers and other commentators have contributed various helpful lists, such as "10 R packages I wish I knew about earlier". The CRANtastic website provides a list of favourites based on user ratings (http://crantastic.org/popcon), and r-bloggers provides a list by frequency of download in RStudio (http://www.r-bloggers.com/a-list-of-r-packages-by-popularity/).

Dependencies

Another way of looking at this is to ask which packages are most fundamental to the broader R community: which packages do package authors build upon? The CRAN repository provides structured data on each package, including the "Depends" and "Imports" fields, which list the packages it is built upon. It seemed a fun thing to see which packages were most depended upon, and so most fundamental to the R ecosystem.

First-Order

For this exercise I didn't bother distinguishing between "Depends" and "Imports". I wrote a simple routine that takes the list of packages from CRAN and, for each one, harvests the contents of the "Depends" and "Imports" properties from the relevant page on the CRAN website, then stashes those package names in a table which I called "antecedants". The table has three columns: "self", the package in question; "ante", the antecedent package; and "order", the depth of the dependency.

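        options(width=100)
        source("Rpackages.R")
        load("Packages.Rda")       # previously harvested package list (the "packages" table)
        load("Antecedants.Rda")    # previously harvested dependencies (the "antecedants" table)
        head(antecedants)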
##         self         ante order
## 1  cleangeo            sp     1
## 2  cleangeo         rgeos     1
## 3  cleangeo      maptools     1
## 4     smerc  SpatialTools     1
## 5     smerc        fields     1
## 6     smerc          maps     1
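
The harvesting routine itself is not shown here. As a rough sketch of the parsing step only, and assuming the "Depends" and "Imports" values arrive as comma-separated strings with optional version constraints, something like the following would turn them into order-1 rows. The helper names parse_deps and first_order and the package names pkgA to pkgD are made up for illustration; this is not the code behind the table above.

```r
# Hypothetical helper: turn one "Depends" or "Imports" field into package names.
parse_deps <- function(field) {
  if (is.na(field) || !nzchar(field)) return(character(0))
  parts <- strsplit(field, ",", fixed = TRUE)[[1]]
  # Strip version constraints such as "(>= 3.0.0)" and surrounding whitespace.
  pkgs <- trimws(sub("\\(.*\\)", "", parts))
  # "R" is a version requirement on the language, not a package.
  pkgs[nzchar(pkgs) & pkgs != "R"]
}

# Hypothetical helper: build the order-1 rows for a single package,
# treating "Depends" and "Imports" alike.
first_order <- function(self, depends, imports) {
  ante <- unique(c(parse_deps(depends), parse_deps(imports)))
  data.frame(self = rep(self, length(ante)),
             ante = ante,
             order = rep(1L, length(ante)),
             stringsAsFactors = FALSE)
}

# Made-up field values, purely for illustration.
rows <- first_order("pkgA", "R (>= 3.0.0), pkgB", "pkgC (>= 1.2), pkgD")
print(rows)
```

Binding the rows for every package together, for example with do.call(rbind, ...), would give a table of the same shape as the one above.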

That gave the first-order dependencies. To see which packages are most re-used, I used table() to count the order-1 dependencies on each antecedent, and then sorted the counts to reveal the top ten.

        # count the order-1 rows per antecedent, then sort the counts in descending order
        ante1<-table(antecedants$ante[antecedants$order==1])
        anteSorted1<-sort(ante1, decreasing=TRUE)
        length(anteSorted1)
## [1] 1458
        dim(antecedants[antecedants$order==1,])
## [1] 10330     3
        head(anteSorted1, 10)
##
##     MASS     Rcpp  ggplot2     plyr   Matrix  lattice  stringr reshape2       sp  mvtnorm
##      374      370      321      266      183      173      157      151      146      142

So 1,458 packages are re-used in some way, for a total of 10,330 order-1 dependencies, and the most popular include many of the usual suspects, like ggplot2 and plyr.

Going Deeper

But just looking at the first level is not good enough. If your package builds on, say, ggplot2, which has plyr among its antecedents, then plyr is an antecedent of your package too, albeit a second-order one. So we need to get recursive, and we can do this just by analysing the antecedants table: build the order-2 rows from the order-1 rows, the order-3 rows from the order-2 rows, and so on, until we bottom out at the maximum depth. Along the way we need to make sure we don't double-count: if a package uses ggplot2 and also uses plyr directly, plyr should be counted only once for that package.
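
To make the recursion concrete, here is a minimal sketch on a toy table. It is an assumption about one workable approach, not the routine used for this post: the function name expand_antecedents, the intermediate column names, and the packages pkgA to pkgE are all made up. Each pass joins the newest level against the order-1 rows to step one level deeper, and stops when a pass produces nothing new.

```r
# Toy order-1 table (made-up packages, not CRAN data):
# pkgA -> pkgB -> pkgC -> pkgD -> pkgE, and pkgA also uses pkgC directly.
toy <- data.frame(
  self  = c("pkgA", "pkgA", "pkgB", "pkgC", "pkgD"),
  ante  = c("pkgB", "pkgC", "pkgC", "pkgD", "pkgE"),
  order = 1L,
  stringsAsFactors = FALSE
)

# Hypothetical helper: derive order n+1 from order n until nothing new appears.
expand_antecedents <- function(tbl) {
  edges <- unique(tbl[tbl$order == 1L, c("self", "ante")])
  names(edges) <- c("mid", "deeper")   # renamed so merge() output is unambiguous
  result  <- tbl[tbl$order == 1L, c("self", "ante", "order")]
  current <- result
  repeat {
    # self -> ante at the current level, then ante -> its own antecedents.
    step <- merge(current[, c("self", "ante")], edges,
                  by.x = "ante", by.y = "mid")
    nxt <- unique(data.frame(self = step$self, ante = step$deeper,
                             stringsAsFactors = FALSE))
    # Keep only pairs not already recorded at a shallower order.
    is_new <- !(paste(nxt$self, nxt$ante) %in% paste(result$self, result$ante))
    nxt <- nxt[is_new & nxt$self != nxt$ante, , drop = FALSE]
    if (nrow(nxt) == 0L) break
    nxt$order <- max(current$order) + 1L
    result  <- rbind(result, nxt)
    current <- nxt
  }
  rownames(result) <- NULL
  result
}

full <- expand_antecedents(toy)
print(full)
print(table(full$order))
```

Recording each pair only at its shallowest order is a design choice made for this sketch. Keeping a row for every depth at which a pair turns up, and de-duplicating at counting time with unique(), would serve equally well.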

For example, here are the most frequent order-3 dependencies.

        # same count as before, restricted to order-3 rows
        ante3<-table(antecedants$ante[antecedants$order==3])
        anteSorted3<-sort(ante3, decreasing=TRUE)
        head(anteSorted3, 10)
##
##      lattice         Rcpp      stringr RColorBrewer         plyr     magrittr       digest
##          674          659          503          469          465          421          394
##    dichromat     labeling      munsell
##          380          380          380

And having chased this down, until there were no more levels, the winners are ...

        # drop the "order" column so each (self, ante) pair counts once, whatever its depth
        anteN<-table(unique(antecedants[,-3])$ante)
        anteSortedN<-sort(anteN, decreasing=TRUE)
        top10ante<-head(anteSortedN, 10)
        top10ante
##
##         Rcpp      lattice         MASS     magrittr      stringi      stringr       digest
##         1341         1119         1048          911          876          864          853
##         plyr RColorBrewer   colorspace
##          799          654          650

So what are these packages that float to the top of the list?

        packages[trimws(packages$name) %in% names(top10ante),1:2]
##                name                                                            desc
## 449           Rcpp                                  Seamless R and C++ Integration
## 674           MASS   Support Functions and Datasets for Venables and Ripley's MASS
## 1489       lattice                                          Trellis Graphics for R
## 1820       stringi                          Character String Processing Facilities
## 2013          plyr                Tools for Splitting, Applying and Combining Data
## 2460       stringr        Simple, Consistent Wrappers for Common String Operations
## 2871    colorspace                                        Color Space Manipulation
## 3479        digest                  Create Cryptographic Hash Digests of R Objects
## 3661  RColorBrewer                                            ColorBrewer Palettes
## 3796      magrittr                                   A Forward-Pipe Operator for R

Oh, and ...

Just for fun, some other bits and pieces

The deepest dependency:

        head(antecedants[antecedants$order==max(antecedants$order),])
##                self       ante order
## 53579  BIFIEsurvey     lattice    10
## 53580  BIFIEsurvey        Rcpp    10
## 53581  BIFIEsurvey     stringi    10
## 53582  BIFIEsurvey    magrittr    10
## 53583  BIFIEsurvey  colorspace    10

The number of dependencies at each order peaks at the second order and then tails away:

        table(antecedants$order)
##
##     1     2     3     4     5     6     7     8     9    10
## 10330 14504 11647  8289  5052  2582   932   208    34     5

The most dependent packages - the ones which will pull in the greatest number of other packages:

        selfN<-table(unique(antecedants[,-3])$self)
        selfSortedN<-selfN[order(selfN, decreasing=TRUE)]
        top10self<-head(selfSortedN, 10)
        top10self
##
##  BIFIEsurvey      miceadds         immer          sirt     treescape       semPlot       bootnet
##           120           119           108           106            92            87            84
##    IATscores           RAM     diveRsity
##            83            82            81

And these highly dependent packages, what do they do?

        packages[packages$name %in% names(top10self),1:2]
##               name                                                                     desc
## 386         immer                                Item Response Models for Multiple Ratings
## 437     treescape              Statistical Exploration of Landscapes of Phylogenetic Trees
## 484   BIFIEsurvey                    Tools for Survey Statistics in Educational Assessment
## 1546     miceadds    Some Additional Multiple Imputation Functions, Especially for\n'mice'
## 1817         sirt                                Supplementary Item Response Theory Models
## 2231          RAM                        R for Amplicon-Sequencing-Based Microbial-Ecology
## 2277    IATscores                 Implicit Association Test Scores Using Robust Statistics
## 2921      bootnet                Bootstrap Methods for Various Network Estimation Routines
## 3526    diveRsity   A Comprehensive, General Purpose Population Genetics Analysis\nPackage
## 4335      semPlot       Path diagrams and visual analysis of various SEM packages'\noutput