Data Science Software.md

Package Managers and Project Organization

Python Software, Libraries, and Packages

  • Package and Environment Management
    • Anaconda - Open data science platform powered by Python
    • ActiveState - 300+ Packages Including Data Science and Machine Learning
  • Platform
    • Python(x,y) - Free scientific and engineering development software for numerical computation, data analysis, and data visualization, based on the Python programming language, Qt graphical user interfaces, and the Spyder interactive scientific development environment
  • Probabilistic Graphical Modelling
    • PyMC - A Python module that implements Bayesian statistical models and fitting algorithms, including Markov chain Monte Carlo
    • pgmpy
    • libpgm
  • Visualization
    • Matplotlib - A Python 2D plotting library which produces publication-quality figures in a variety of hardcopy formats and interactive environments across platforms
    • Seaborn - A Python visualization library based on matplotlib
    • Altair - Declarative statistical visualization library for Python
    • Bokeh - A Python interactive visualization library that targets modern web browsers for presentation
    • ggplot - A package for plotting in Python
    • Basemap - A library for plotting 2D data on maps in Python
    • Facebook's Visdom - A flexible tool for creating, organizing, and sharing visualizations of live, rich data
    • Scikit-plot - An intuitive library to add plotting functionality to scikit-learn objects
  • Munging and Wrangling
    • Pandas - An open source, BSD-licensed library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language
  • Scientific and Numerical
    • NumPy - The fundamental package for scientific computing with Python
    • SciPy - A Python-based ecosystem of open-source software for mathematics, science, and engineering
  • Statistics and Mathematics
    • Statsmodels - A Python module that allows users to explore data, estimate statistical models, and perform statistical tests
    • Statsmodels Stats
  • Notebooks and Reporting
  • Web Mining and Scraping
    • Pattern - Web mining module for Python, with tools for scraping, natural language processing, machine learning, network analysis and visualization
    • Scrapy - An open source and collaborative framework for extracting the data you need from websites
  • Big Data and Performance
    • Blaze - Provides Python users high-level access to efficient computation on inconveniently large data
    • Dask - A flexible parallel computing library for analytic computing
  • Network and Graph Analytics
    • NetworkX - A Python language software package for the creation, manipulation, and study of the structure, dynamics, and functions of complex networks
  • Parsing and Data Extraction
    • Beautiful Soup - A Python library for pulling data out of HTML and XML files
  • Data Pipeline
    • Fuel - A data pipeline framework which provides your machine learning models with the data they need
  • Web and API
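
A quick way to see which of the Python packages listed above are already available in the current environment is to look up their import names. The snippet below is a minimal sketch using only the standard library; the import names in `PACKAGES` are assumptions based on each project's usual convention (for example, Beautiful Soup is normally imported as `bs4`), and `check_packages` is a hypothetical helper, not part of any of the listed tools.

```python
from importlib.util import find_spec

# Display name -> top-level import name.
# These import names are assumptions; adjust them if a project differs.
PACKAGES = {
    "NumPy": "numpy",
    "SciPy": "scipy",
    "Pandas": "pandas",
    "Matplotlib": "matplotlib",
    "Seaborn": "seaborn",
    "Statsmodels": "statsmodels",
    "NetworkX": "networkx",
    "Beautiful Soup": "bs4",
    "Scrapy": "scrapy",
    "Dask": "dask",
}


def check_packages(packages):
    """Return {display name: bool} saying whether each module can be located."""
    return {
        name: find_spec(module) is not None
        for name, module in packages.items()
    }


if __name__ == "__main__":
    # Print one line per package without importing any of them.
    for name, available in check_packages(PACKAGES).items():
        status = "installed" if available else "missing"
        print(f"{name:15} {status}")
```

`find_spec` only locates a module and does not import it, so the check stays fast even for heavy libraries. Anything reported as missing can then be installed through a distribution such as Anaconda or another package manager.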

R Software, Libraries, and Packages

  • Package and Environment Management
    • devtools - Collection of package development tools
    • packrat - Manage the R packages your project depends on in an isolated, portable, and reproducible way
    • R Documentation
  • Platform
    • proto - An object oriented system using object-based, also called prototype-based, rather than class-based object oriented ideas
    • magrittr - Provides a mechanism for chaining commands with a new forward-pipe operator, %>%
    • DT - Data objects in R can be rendered as HTML tables using the JavaScript library 'DataTables' (typically via R Markdown or Shiny)
  • Visualization
    • ggplot2 - A plotting system for R
    • ggvis - An implementation of an interactive grammar of graphics, taking the best parts of 'ggplot2', combining them with the reactive framework of 'shiny' and drawing web graphics using 'vega'
    • htmlwidgets - A framework for creating HTML widgets that render in various contexts including the R console, 'R Markdown' documents, and 'Shiny' web applications
    • leaflet - Create and customize interactive maps using the 'Leaflet' JavaScript library and the 'htmlwidgets' package
    • googleVis - R interface to Google Charts API, allowing users to create interactive charts based on data frames
    • dygraphs - An R interface to the 'dygraphs' JavaScript charting library
    • rgl - Provides medium to high level functions for 3D interactive graphics, including functions modelled on base graphics (plot3d(), etc.) as well as functions for constructing representations of geometric objects (cube3d(), etc.)
    • shiny - Makes it easy to build interactive web applications with R
    • manipulate - Interactive plotting functions for use within RStudio
    • RColorBrewer - Provides color schemes for maps (and other graphics)
    • scales - Graphical scales map data to aesthetics, and provide methods for automatically determining breaks and labels for axes and legends
    • labeling - Provides a range of axis labeling algorithms
    • colorspace - Carries out mapping between assorted color spaces including RGB, HSV, HLS, CIEXYZ, CIELUV, HCL (polar CIELUV), CIELAB and polar CIELAB
  • Munging and Wrangling
    • dplyr - A fast, consistent tool for working with data frame like objects, both in memory and out of memory
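    • plyr - A set of tools for the split-apply-combine pattern: break a large problem into manageable pieces, operate on each piece, and put the pieces back together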
    • stringr - A consistent, simple and easy to use set of wrappers around the fantastic 'stringi' package
    • tidyr - Tools for data tidying (not general reshaping or aggregating) that work well with 'dplyr' data pipelines
    • lubridate - Functions to work with date-times and time-spans
    • digest - Implementation of a function 'digest()' for the creation of hash digests of arbitrary R objects (using the 'md5', 'sha-1', 'sha-256', 'crc32', 'xxhash' and 'murmurhash' algorithms) permitting easy comparison of R language objects, as well as a function 'hmac()' to create hash-based message authentication code
    • reshape2 - Flexibly restructure and aggregate data using just two functions: 'melt' and 'dcast' (or 'acast')
    • mice - Multiple imputation using Fully Conditional Specification (FCS) implemented by the MICE algorithm
    • party - A computational toolbox for recursive partitioning
  • Scientific and Numerical
    • zoo - An S3 class with methods for totally ordered indexed observations. It is particularly aimed at irregular time series of numeric vectors/matrices and factors
    • bitops - Functions for bitwise operations on integer vectors
  • Statistics and Mathematics
  • Notebooks and Reporting
    • R Markdown - Convert R Markdown documents into a variety of formats
    • knitr - A general-purpose tool for dynamic report generation in R using Literate Programming techniques
  • Web Mining and Scraping
  • Big Data and Performance
    • Rcpp - Provides R functions as well as C++ classes which offer a seamless integration of R and C++
  • Network and Graph Analytics
  • Parsing and Data Extraction
    • readr - Read flat/tabular text files from disk (or a connection)
    • mime - Guesses the MIME type from a filename extension using the data derived from /etc/mime.types in UNIX-type systems
    • jsonlite - A fast JSON parser and generator optimized for statistical data and the web
    • haven - Import foreign statistical formats into R via the embedded 'ReadStat' C library
    • RODBC - An ODBC database interface
  • Data Pipeline
  • Web and API
    • RCurl - A wrapper for 'libcurl' (http://curl.haxx.se/libcurl/) that provides functions to compose general HTTP requests, fetch URIs, get and post forms, and process the results returned by the web server

Non-Python/R Specific

Data Mining and Analytics

  • Google Cloud Datalab - An easy to use interactive tool for large-scale data exploration, analysis, and visualization
  • Orange - Open source machine learning and data visualization for novice and expert
  • RapidMiner - Data science platform
  • Statwing

Business Intelligence and Data Visualization

IDEs