The fastverse is a suite of complementary high-performance packages for statistical computing and data manipulation in R. Developed independently by various people, fastverse packages jointly contribute to the objectives of:
- Speeding up R through heavy use of compiled code (C, C++, Fortran)
- Enabling more complex statistical and data manipulation operations in R
- Reducing the number of dependencies required for advanced computing in R
The fastverse package integrates, and provides utilities for easy installation, loading and management of these packages. It is an extensible framework that allows users to (permanently) add or remove packages to create a 'verse' of packages suiting their general needs. Separate 'verses' can also be created.
fastverse packages are jointly attached with library(fastverse), and several functions starting with fastverse_ help manage dependencies, detect namespace conflicts, add/remove packages from the fastverse and update packages.
The fastverse consists of 6 core packages (7 dependencies in total) which provide broad C/C++ based statistical and data manipulation functionality and have carefully managed APIs. These packages are installed and attached along with the fastverse package.
- data.table: Enhanced data frame class with concise data manipulation framework offering powerful aggregation, extremely flexible split-apply-combine computing, reshaping, joins, rolling statistics, set operations on tables, fast csv read/write, and various utilities such as transposition of data.
- collapse: Fast grouped & weighted statistical computations, time series and panel data transformations, list-processing, data manipulation functions, summary statistics and various utilities such as support for variable labels. Class-agnostic framework designed to work with vectors, matrices, data frames, lists and related classes including xts, data.table, tibble, pdata.frame, sf.
- matrixStats: Efficient row- and column-wise (weighted) statistics on matrices and vectors, including computations on subsets of rows and columns.
- kit: Fast vectorized and nested switches, some parallel (row-wise) statistics, and some utilities such as efficient partial sorting and unique values.
- magrittr: Efficient pipe operators for enhanced programming and code unnesting.
- fst: A compressed data file format that is very fast to read and write. Full random access in both rows and columns allows reading subsets from a '.fst' file.
Additional dependency: Package Rcpp is imported by collapse and fst.
Currently, there are 2 different versions of the fastverse on CRAN and GitHub. The GitHub version is recommended if you want to have matrixStats consistently preserve attributes of your matrices: it modifies functions in the matrixStats namespace making them preserve attributes consistently (and by default) whenever the fastverse is attached. This version was rejected by CRAN because it requires a call to unlockBinding. The CRAN version takes matrixStats as it is, which means most functions do not preserve attributes such as dimension names in computations.
# Install the CRAN version
install.packages("fastverse")
# Install the GitHub version (Requires Rtools)
remotes::install_github("SebKrantz/fastverse")
Note that the GitHub version is not a development version; development takes place in the 'development' branch. matrixStats is slowly evolving towards greater consistency, but because of its large number of reverse dependencies it might take more than half a year until dimension names are handled consistently by default. Until then, the CRAN and GitHub versions of the fastverse are released together.
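What "preserving attributes" means in practice can be checked on a small named matrix. The sketch below compares base R with matrixStats; it assumes matrixStats is installed and skips that part otherwise. Whether the names come back depends on the installed matrixStats version, so no output is shown for that call.

```r
# A small matrix with row and column names
m <- matrix(1:6, nrow = 2, dimnames = list(c("a", "b"), c("x", "y", "z")))

# Base R keeps the row names on the result
base_names <- names(rowSums(m))
print(base_names)

# matrixStats may or may not keep them, depending on the installed version
if (requireNamespace("matrixStats", quietly = TRUE)) {
  ms_names <- names(matrixStats::rowSums2(m))
  print(ms_names)  # NULL means the row names were dropped
}
```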
In addition, users have the option (via the fastverse_extend() function) to freely attach extension packages offering more specific functionality. The fastverse can be extended by any R package, either just for the current session or permanently:
# Loads and attaches the core fastverse packages
library(fastverse)
# -- Attaching packages --------------------------------------- fastverse 0.1.5 --
# v data.table  1.14.0     v collapse    1.6.5
# v magrittr    2.0.1      v matrixStats 0.59.0
# v kit         0.0.7      v fst         0.9.4
# Permanently extends the core fastverse by certain packages
fastverse_extend(xts, roll, dygraphs, permanent = TRUE)
# -- Attaching extension packages ----------------------------- fastverse 0.1.5 --
# v xts      0.12.1      v dygraphs 1.1.1.6
# v roll     1.1.6
# -- Conflicts ------------------------------------------ fastverse_conflicts() --
# x xts::first() masks data.table::first()
# x xts::last()  masks data.table::last()
# If the fastverse is now loaded in a new session, these packages are added
fastverse_detach(session = TRUE)
library(fastverse)
# -- Attaching packages --------------------------------------- fastverse 0.1.5 --
# v data.table  1.14.0     v fst      0.9.4
# v magrittr    2.0.1      v xts      0.12.1
# v kit         0.0.7      v roll     1.1.6
# v collapse    1.6.5      v dygraphs 1.1.1.6
# v matrixStats 0.59.0
# -- Conflicts ------------------------------------------ fastverse_conflicts() --
# x xts::first()           masks data.table::first()
# x collapse::is.regular() masks zoo::is.regular()
# x xts::last()            masks data.table::last()
# We can also extend only the fastverse for the session, here adding Rfast2
# and any installed suggested packages for date-time manipulation (see following README section)
fastverse_extend(Rfast2, topics = "DT")
# -- Attaching extension packages ----------------------------- fastverse 0.1.5 --
# v Rfast2    0.0.9       v clock     0.3.1
# v lubridate 1.7.10      v fasttime  1.0.2
# -- Conflicts ------------------------------------------ fastverse_conflicts() --
# x lubridate::as.difftime() masks base::as.difftime()
# x clock::as_date()         masks lubridate::as_date()
# x lubridate::date()        masks base::date()
# x lubridate::hour()        masks data.table::hour()
# x lubridate::intersect()   masks base::intersect()
# x lubridate::is.Date()     masks collapse::is.Date()
# x lubridate::isoweek()     masks data.table::isoweek()
# x lubridate::mday()        masks data.table::mday()
# x lubridate::minute()      masks data.table::minute()
# x lubridate::month()       masks data.table::month()
# x lubridate::quarter()     masks data.table::quarter()
# x lubridate::second()      masks data.table::second()
# x lubridate::setdiff()     masks base::setdiff()
# x lubridate::union()       masks base::union()
# x lubridate::wday()        masks data.table::wday()
# x lubridate::week()        masks data.table::week()
# x lubridate::yday()        masks data.table::yday()
# x lubridate::year()        masks data.table::year()
# This shows a situation report of the fastverse, including all dependencies
fastverse_sitrep(recursive = TRUE)
# -- fastverse 0.1.5: Situation Report -------------------------------- R 4.1.0 --
# * Global config file: TRUE
# * Project config file: FALSE
# -- Core packages ---------------------------------------------------------------
# * data.table (1.14.0)
# * magrittr (2.0.1)
# * kit (0.0.7)
# * collapse (1.6.5)
# * matrixStats (0.59.0 < 0.60.0)
# * fst (0.9.4)
# * xts (0.12.1)
# * roll (1.1.6)
# * dygraphs (1.1.1.6)
# -- Extension packages ----------------------------------------------------------
# * Rfast2 (0.0.9)
# * lubridate (1.7.10)
# * clock (0.3.1 < 0.4.0)
# * fasttime (1.0.2)
# -- Dependencies ----------------------------------------------------------------
# * base64enc (0.1.3)
# * cpp11 (0.3.1)
# * digest (0.6.27)
# * ellipsis (0.3.1 < 0.3.2)
# * generics (0.1.0)
# * glue (1.4.2)
# * htmltools (0.5.1.1)
# * htmlwidgets (1.5.3)
# * jsonlite (1.7.2)
# * lattice (0.20.44)
# * RANN (2.6.1)
# * Rcpp (1.0.7)
# * RcppArmadillo (0.10.2.1.0 < 0.10.6.0.0)
# * RcppGSL (0.3.8 < 0.3.9)
# * RcppParallel (5.0.2 < 5.1.4)
# * RcppZiggurat (0.1.5 < 0.1.6)
# * Rfast (2.0.1 < 2.0.3)
# * rlang (0.4.11)
# * tzdb (0.1.1 < 0.1.2)
# * vctrs (0.3.7 < 0.3.8)
# * yaml (2.2.1)
# * zoo (1.8.9)
# Resets the fastverse to defaults, removing any permanent modifications
fastverse_reset()
In addition to a global customization, separate fastverses can be created for projects by adding a .fastverse config file in the project directory and listing packages there. Only these packages will then be loaded and managed with library(fastverse) in the project.
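A project-level config file can be created from R itself. The following is a minimal sketch, not the package's documented procedure: it assumes the file is plain text with package names separated by commas, and it writes to a temporary directory standing in for a project root. The package selection is purely illustrative.

```r
# Hypothetical selection of packages for this project
pkgs <- c("data.table", "collapse", "kit")

# A temporary directory stands in for the project root here
project_dir <- tempdir()
cfg <- file.path(project_dir, ".fastverse")

# Assumption: package names are listed as plain text, separated by commas
writeLines(paste(pkgs, collapse = ", "), cfg)

# Inspect what was written
cfg_lines <- readLines(cfg)
print(cfg_lines)
```

Checking the exact separators accepted by the config file against the package documentation is advisable before relying on this layout.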
High-performing packages for different data manipulation and statistical computing topics are suggested below. Each topic has a 2-character topic-id, which can be used to quickly attach all available packages with fastverse_extend(topics = c(..id's..)), and to install missing packages by adding argument install = TRUE. The majority of these packages provide compiled code and have few dependencies. The total (recursive) dependency count is indicated for each package.
- xts and zoo: Fast and reliable matrix-based time series classes providing fully identified ordered observations and various utilities for plotting and computations (1 dependency).
- roll: Very fast rolling and expanding window functions for vectors and matrices (3 dependencies).

Notes: xts/zoo objects are preserved by roll functions and by collapse's time series and data transformation functions^[collapse functions can also handle irregular time series, but this requires passing an integer time variable to the t argument which has consecutive integer steps for regular parts of the time series and non-consecutive integers for the irregular parts.]. As xts/zoo objects are matrices, all matrixStats functions apply to them as well. xts objects can also easily be converted to and from data.table.
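The integer time variable described in the footnote can be built in base R. The sketch below uses hypothetical monthly data with two missing months and derives a month counter that is consecutive over the regular stretch and jumps across the gap. The final call assumes collapse is installed and that its flag() function accepts the counter through the t argument; it is skipped otherwise.

```r
# Hypothetical monthly observations with April and May missing
dates <- as.Date(c("2021-01-01", "2021-02-01", "2021-03-01",
                   "2021-06-01", "2021-07-01"))
x <- c(10, 11, 12, 15, 16)

# Integer month counter: consecutive where regular, with a jump across the gap
lt <- as.POSIXlt(dates)
t_int <- as.integer(lt$year * 12L + lt$mon)
t_int <- t_int - min(t_int) + 1L
print(t_int)

# With collapse installed, the counter is passed to the t argument of a lag
if (requireNamespace("collapse", quietly = TRUE)) {
  print(collapse::flag(x, 1, t = t_int))
}
```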
- lubridate: Facilitates 'POSIX-' and 'Date' based computations (2 dependencies).
- anytime: Anything to 'POSIXct' or 'Date' converter (2 dependencies).
- fasttime: Fast parsing of strings to 'POSIXct' (0 dependencies).
- nanotime: Provides a coherent set of temporal types and functions with nanosecond precision - based on the 'integer64' class (7 dependencies).
- clock: Comprehensive library for date-time manipulations using a new family of orthogonal date-time classes (durations, time points, zoned-times, and calendars) (6 dependencies).
- timechange: Efficient manipulation of date-times accounting for time zones and daylight saving times (1 dependency).

Notes: Date and time variables are preserved in many data.table and collapse operations. data.table additionally offers an efficient integer based date class 'IDate' with some supporting functionality. xts and zoo also provide various functions to transform dates, and zoo provides classes 'yearmon' and 'yearqtr' for convenient computation with monthly and quarterly data. Package mondate also provides a class 'mondate' for monthly data.
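The difference between base R dates and the integer based 'IDate' class can be seen from the storage type. This is a small sketch assuming data.table is installed; the data.table part is skipped otherwise.

```r
# Base R stores Date values as doubles
d <- as.Date("2021-06-01")
base_type <- typeof(d)
print(base_type)

# data.table's IDate represents the same day count as an integer
if (requireNamespace("data.table", quietly = TRUE)) {
  id <- data.table::as.IDate("2021-06-01")
  print(typeof(id))
  print(as.integer(id) == as.integer(d))
}
```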
- stringi: Main R package for fast, correct, consistent, and convenient string/text manipulation (backend to stringr and snakecase) (0 dependencies).
- stringr: Simple, consistent wrappers for common string operations, based on stringi (3 dependencies).
- snakecase: Convert strings into any case, based on stringi and stringr (4 dependencies).
- stringfish: Fast computation of common (base R) string operations using the ALTREP system (2 dependencies).
- stringdist: Fast computation of string distance metrics, matrices, and fuzzy matching (0 dependencies).

- Rfast and Rfast2: Heterogeneous sets of fast functions for statistics, estimation and data manipulation operating on vectors and matrices. Missing values and object attributes are not (consistently) supported (4-5 dependencies).
- parallelDist: Multi-threaded distance matrix computation (3 dependencies).
- coop: Fast implementations of the covariance, correlation, and cosine similarity (0 dependencies).
- rsparse: Implements many algorithms for statistical learning on sparse matrices - matrix factorizations, matrix completion, elastic net regressions, factorization machines (8 dependencies). See also package MatrixExtra.
- rrapply: The rrapply() function extends base rapply() by including a condition or predicate function for the application of functions and diverse options to prune or aggregate the result (0 dependencies).

Notes: Rfast has a number of functions named like those in matrixStats. These are simpler but typically faster and support multi-threading. Some highly efficient statistical functions can also be found scattered across various other packages; notable examples are Hmisc (60 dependencies) and DescTools (17 dependencies). fastDummies (16 dependencies) implements creation of dummy (binary) variables.

- sf: Leading framework for geospatial computing and manipulation in R, offering a simple and flexible spatial data frame and supporting functionality (13 dependencies).
- stars: Spatiotemporal data (raster and vector) in the form of dense arrays, with space and time being array dimensions (17 dependencies).
- terra: Methods for spatial data analysis with raster and vector data. Processing of very large (out of memory) files is supported (4 dependencies).

Notes: collapse can be used for efficient manipulation and computations on sf data frames. sf also offers tight integration with dplyr.

- dygraphs: Interface to 'Dygraphs' interactive time series charting library (11 dependencies).
- lattice: Trellis graphics for R (0 dependencies).
- grid: The grid graphics package (0 dependencies).
- ggplot2: Create elegant data visualizations using the Grammar of Graphics (30 dependencies).
- scales: Scale functions for visualizations (10 dependencies).

Notes: latticeExtra provides extra graphical utilities based on lattice. gridExtra provides miscellaneous functions for grid graphics (and consequently for ggplot2, which is based on grid). gridtext provides improved text rendering support for grid graphics. Many packages offer ggplot2 extensions (typically starting with 'gg'), such as ggExtra, ggalt, ggforce, ggmap, ggtext, ggthemes, ggrepel, ggridges, ggfortify, ggstatsplot, ggeffects, ggsignif, GGally, ggcorrplot, ggdendro, etc.

- tidytable: A tidy interface to data.table that is rlang compatible. Quite comprehensive implementation of dplyr, tidyr and purrr functions. tidyverse function names are appended with a ., e.g. mutate.(). Package uses a class tidytable that inherits from data.table. The dt() function makes data.table syntax pipeable (14 total dependencies).
- tidyfast: Fast tidying of data. Covers tidyr functionality, dt_ prefix, preserves data.table object. Some unnecessary deep copies (2 dependencies).
- tidyfst: Tidy verbs for fast data manipulation. Covers dplyr and some tidyr functionality. Functions have _dt suffix and preserve data.table object. A cheatsheet is provided (7 dependencies).
- tidyft: Tidy verbs for fast data operations by reference. Best for big data manipulation on out of memory data using facilities provided by fst (7 dependencies).
- maditr: Fast data aggregation, modification, and filtering with pipes and data.table. Minimal implementation with functions let() and take() for most common data manipulation tasks. Also provides Excel-like lookup functions (2 dependencies).

Notes: One could also mention RStudio's dtplyr and the table.express package here, but these packages import dplyr and thus have around 20 dependencies.

- qs provides a lightning-fast and complete replacement for the saveRDS and readRDS functions in R. It supports general R objects with attributes and references - at similar speeds to fst - but does not provide on-disk random access to data subsets like fst (4 dependencies).
- arrow provides both a low-level interface to the Apache Arrow C++ library (a multi-language toolbox for accelerated data interchange and in-memory processing) and some higher-level, R-flavored tools for working with it - including fast reading / writing delimited files and sharing data between R and Python (12 dependencies).

Notes: Package vroom offers fast reading and writing of delimited files, but with 24 dependencies is not really a fastverse candidate.
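
To illustrate the comparison between qs and the base serialization functions mentioned above, here is a small round-trip sketch. The object is made up for the example; the qs part assumes the qs package is installed with its qsave() and qread() functions and is skipped otherwise.

```r
# Hypothetical object with attributes to serialize
x <- list(a = 1:3, b = letters,
          c = structure(1:4, dim = c(2L, 2L), label = "demo"))

# Base R round trip
rds_path <- tempfile(fileext = ".rds")
saveRDS(x, rds_path)
rds_ok <- identical(readRDS(rds_path), x)
print(rds_ok)

# The same round trip with qs, if available
if (requireNamespace("qs", quietly = TRUE)) {
  qs_path <- tempfile(fileext = ".qs")
  qs::qsave(x, qs_path)
  print(identical(qs::qread(qs_path), x))
}
```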
Feel free to notify me of any other packages you think should be included here. Such packages should be well designed, top-performing, low-dependency, and, with few exceptions, provide their own compiled code. Please note that the fastverse focuses on general purpose statistical computing and data manipulation, so I won't include fast packages for estimating specific kinds of models here (of which R also has a great many).