A curated list of awesome big data frameworks, resources and other awesomeness. Inspired by awesome-php, awesome-python, awesome-ruby, hadoopecosystemtable & big-data.
Your contributions are always welcome!
- Awesome Big Data
- Frameworks
- Distributed Programming
- Distributed Filesystem
- Column Data Model
- Document Data Model
- Key-value Data Model
- Graph Data Model
- NewSQL Databases
- Time-Series Databases
- SQL-like processing
- Integrated Development Environments
- Data Ingestion
- Service Programming
- Scheduling
- Machine Learning
- Benchmarking
- Security
- System Deployment
- Applications
- Search engine and framework
- MySQL forks and evolutions
- Memcached forks and evolutions
- Embedded Databases
- Business Intelligence
- Data Visualization
- Interesting Readings
- Interesting Papers
- Other Awesome Lists
- Apache Hadoop - framework for distributed processing. Integrates MapReduce, YARN and HDFS.
- AddThis Hydra - distributed data processing and storage system.
- AMPLab SIMR - run Spark on Hadoop MapReduce v1.
- Apache Crunch - a simple Java API for tasks like joining and data aggregation that are tedious to implement on plain MapReduce.
- Apache DataFu - collection of user-defined functions for Hadoop and Pig developed by LinkedIn.
- Apache Gora - framework for in-memory data model and persistence.
- Apache Hama - BSP (Bulk Synchronous Parallel) computing framework.
- Apache MapReduce - programming model for processing large data sets with a parallel, distributed algorithm on a cluster.
- Apache Pig - high level language to express data analysis programs for Hadoop.
- Apache S4 - framework for stream processing, implementation of S4.
- Apache Spark - framework for in-memory cluster computing.
- Apache Spark Streaming - framework for stream processing, part of Spark.
- Apache Storm - framework for stream processing by Twitter, also runs on YARN.
- Apache Tez - application framework for executing a complex DAG (directed acyclic graph) of tasks, built on YARN.
- Apache Twill - abstraction over YARN that reduces the complexity of developing distributed applications.
- Cascalog - data processing and querying library.
- Cheetah - High Performance, Custom Data Warehouse on Top of MapReduce.
- Concurrent Cascading - framework for data management/analytics on Hadoop.
- Damballa Parkour - MapReduce library for Clojure.
- Datasalt Pangool - alternative MapReduce paradigm.
- DataTorrent StrAM - real-time engine designed to enable distributed, asynchronous, real-time in-memory big-data computations in as unblocked a way as possible, with minimal overhead and impact on performance.
- Facebook Corona - Hadoop enhancement which removes single point of failure.
- Facebook Peregrine - MapReduce framework.
- Facebook Scuba - distributed in-memory datastore.
- Google MapReduce - MapReduce framework.
- Google MillWheel - fault tolerant stream processing framework.
- HadoopDB - hybrid of MapReduce and DBMS.
- JAQL - declarative programming language for working with structured, semi-structured and unstructured data.
- Kite - set of libraries, tools, examples, and documentation focused on making it easier to build systems on top of the Hadoop ecosystem.
- Metamarkets Druid - framework for real-time analysis of large datasets.
- Netflix PigPen - MapReduce for Clojure which compiles to Apache Pig.
- Nokia Disco - MapReduce framework developed by Nokia.
- Pydoop - Python MapReduce and HDFS API for Hadoop.
- Stratosphere - general purpose cluster computing framework.
- Twitter Scalding - Scala library for MapReduce jobs, built on Cascading.
- Twitter Summingbird - Streaming MapReduce with Scalding and Storm, by Twitter.
- Apache HDFS - a way to store large files across multiple machines.
- Ceph Filesystem - software storage platform.
- Facebook Haystack - object storage system.
- Google Colossus - distributed filesystem (GFS2).
- Google GFS - distributed filesystem.
- Google Megastore - scalable, highly available storage.
- GridGain - GGFS, Hadoop compliant in-memory file system.
- Lustre file system - high-performance distributed filesystem.
- Quantcast File System QFS - open-source distributed file system.
- Red Hat GlusterFS - scale-out network-attached storage file system.
- Tachyon - reliable file sharing at memory speed across cluster frameworks.
- Actian Vector - column-oriented analytic database.
- Apache Accumulo - distributed key/value store, built on Hadoop.
- Apache Cassandra - column-oriented distributed datastore, inspired by BigTable.
- Apache HBase - column-oriented distributed datastore, inspired by BigTable.
- C-Store - column oriented DBMS.
- Facebook HydraBase - evolution of HBase made by Facebook.
- Google BigTable - column-oriented distributed datastore.
- Google Cloud Datastore - fully managed, schemaless database for storing non-relational data over BigTable.
- Hypertable - column-oriented distributed datastore, inspired by BigTable.
- InfiniDB - accessed through a MySQL interface and uses massive parallel processing to parallelize queries.
- MonetDB - column store database.
- OhmData C5 - improved version of HBase.
- Parquet - columnar storage format for Hadoop.
- Twitter Manhattan - real-time, multi-tenant distributed database for Twitter scale.
- Vertica - designed to manage large, fast-growing volumes of data and provide very fast query performance when used for data warehouses.
- Crate Data - is an open source massively scalable data store. It requires zero administration.
- Facebook Apollo - Facebook’s Paxos-like NoSQL database.
- jumboDB - document oriented datastore over Hadoop.
- LinkedIn Espresso - horizontally scalable document-oriented NoSQL data store.
- MarkLogic - Schema-agnostic Enterprise NoSQL database technology.
- MongoDB - Document-oriented database system.
- RethinkDB - document database that supports queries like table joins and group by.
- Amazon DynamoDB - distributed key/value store, implementation of Dynamo paper.
- Edis - protocol-compatible server replacement for Redis.
- ElephantDB - Distributed database specialized in exporting data from Hadoop.
- EventStore - distributed time series database.
- LinkedIn Krati - simple persistent data store with very low latency and high throughput.
- LinkedIn Voldemort - distributed key/value storage system.
- OpenTSDB - distributed time series database on top of HBase.
- Redis - in-memory key-value datastore.
- Riak - a decentralized datastore.
- Storehaus - library to work with asynchronous key value stores, by Twitter.
- Tarantool - an efficient NoSQL database and a Lua application server.
- Apache Giraph - implementation of Pregel, based on Hadoop.
- Apache Spark Bagel - implementation of Pregel, part of Spark.
- ArangoDB - multi-model distributed database.
- Facebook TAO - distributed data store widely used at Facebook to store and serve the social graph.
- Gremlin - graph traversal Language.
- Google Cayley - open-source graph database.
- Google Pregel - graph processing framework.
- GraphLab PowerGraph - a core C++ GraphLab API and a collection of high-performance machine learning and data mining toolkits built on top of the GraphLab API.
- GraphX - resilient Distributed Graph System on Spark.
- Intel GraphBuilder - tools to construct large-scale graphs on top of Hadoop.
- Neo4j - graph database written entirely in Java.
- OrientDB - document and graph database.
- Phoebus - framework for large scale graph processing.
- Titan - distributed graph database, built over Cassandra.
- Twitter FlockDB - distributed graph database.
- Amazon RedShift - data warehouse service, based on PostgreSQL.
- BayesDB - statistic oriented SQL database.
- FoundationDB - distributed database, inspired by F1.
- Google F1 - distributed SQL database built on Spanner.
- Google Spanner - globally distributed semi-relational database.
- H-Store - experimental main-memory, parallel database management system optimized for on-line transaction processing (OLTP) applications.
- Haeinsa - linearly scalable multi-row, multi-table transaction library for HBase based on Percolator.
- HandlerSocket - NoSQL plugin for MySQL/MariaDB.
- InfiniSQL - infinitely scalable RDBMS.
- MemSQL - in-memory SQL database with optimized columnar storage on flash.
- NuoDB - SQL/ACID compliant distributed database.
- Postgres-XL - Scalable Open Source PostgreSQL-based Database Cluster.
- SAP HANA - SQL based in-memory database.
- SenseiDB - distributed, realtime, semi-structured database.
- Sky - database used for flexible, high performance analysis of behavioral data.
- SymmetricDS - open source software for both file and database synchronization.
- TempoDB - cloud-based time series database.
- InfluxDB - open-source distributed time series database.
- OpenTSDB - time series database that uses HBase.
- KairosDB - similar to OpenTSDB but supports Cassandra.
- Cube - uses MongoDB to store time series data.
- AMPLAB Shark - data warehouse system for Spark.
- Apache Drill - framework for interactive analysis, inspired by Dremel.
- Apache HCatalog - table and storage management layer for Hadoop.
- Apache Hive - SQL-like data warehouse system for Hadoop.
- Apache Phoenix - SQL skin over HBase.
- BlinkDB - massively parallel, approximate query engine.
- Cloudera Impala - framework for interactive analysis, inspired by Dremel.
- Concurrent Lingual - SQL-like query language for Cascading.
- Datasalt Splout SQL - full SQL query engine for big datasets.
- Facebook PrestoDB - distributed SQL query engine.
- Google BigQuery - framework for interactive analysis, implementation of Dremel.
- Pivotal HAWQ - SQL-like data warehouse system for Hadoop.
- Spark Catalyst - query optimization framework for Spark and Shark.
- SparkSQL - Manipulating Structured Data Using Spark.
- Splice Machine - a full-featured SQL-on-Hadoop RDBMS with ACID transactions.
- Stinger - interactive query for Hive.
- Tajo - distributed data warehouse system on Hadoop.
- Amazon Kinesis - real-time processing of streaming data at massive scale.
- Apache Chukwa - data collection system.
- Apache Flume - service to manage large amounts of log data.
- Apache Kafka - distributed publish-subscribe messaging system.
- Apache Samza - stream processing framework, based on Kafka and YARN.
- Apache Sqoop - tool to transfer data between Hadoop and a structured datastore.
- Cloudera Morphlines - framework that helps ETL to Solr, HBase and HDFS.
- Facebook Scribe - streamed log data aggregator.
- Fluentd - tool to collect events and logs.
- HIHO - framework for connecting disparate data sources with Hadoop.
- Kestrel - distributed message queue system.
- LinkedIn Databus - stream of change capture events for a database.
- LinkedIn Kamikaze - utility package for compressing sorted integer arrays.
- LinkedIn White Elephant - log aggregator and dashboard.
- Netflix Suro - log aggregator like Storm and Samza, based on Chukwa.
- Pinterest Secor - service implementing Kafka log persistence.
- R-Studio - IDE for R.
- Akka Toolkit - runtime for distributed and fault-tolerant event-driven applications on the JVM.
- Apache Avro - data serialization system.
- Apache Curator - Java libraries for Apache ZooKeeper.
- Apache Karaf - OSGi runtime that runs on top of any OSGi framework.
- Apache Thrift - framework to build binary protocols.
- Apache ZooKeeper - centralized service for process management.
- Google Chubby - a lock service for loosely-coupled distributed systems.
- LinkedIn Norbert - cluster manager.
- OpenMPI - message passing framework.
- Serf - decentralized solution for service discovery and orchestration.
- Spring XD - distributed and extensible system for data ingestion, real time analytics, batch processing, and data export.
- Twitter Elephant Bird - libraries for working with LZOP-compressed data.
- Twitter Finagle - asynchronous network stack for the JVM.
- Apache Aurora - is a service scheduler that runs on top of Apache Mesos.
- Apache Falcon - data management framework.
- Apache Oozie - workflow job scheduler.
- Chronos - distributed and fault-tolerant scheduler.
- LinkedIn Azkaban - batch workflow job scheduler.
- Sparrow - scheduling platform.
- Apache Mahout - machine learning library for Hadoop.
- brain - Neural networks in JavaScript.
- Cloudera Oryx - real-time large-scale machine learning.
- Concurrent Pattern - machine learning library for Cascading.
- convnetjs - Deep Learning in JavaScript. Train Convolutional Neural Networks (or ordinary ones) in your browser.
- Decider - Flexible and Extensible Machine Learning in Ruby.
- etcML - text classification with machine learning.
- Etsy Conjecture - scalable Machine Learning in Scalding.
- H2O - statistical, machine learning and math runtime for Hadoop.
- MLbase - distributed machine learning libraries for the BDAS stack.
- MLPNeuralNet - Fast multilayer perceptron neural network library for iOS and Mac OS X.
- nupic - Numenta Platform for Intelligent Computing: a brain-inspired machine intelligence platform, and biologically accurate neural network based on cortical learning algorithms.
- PredictionIO - machine learning server built on Hadoop, Mahout and Cascading.
- scikit-learn - machine learning in Python.
- Spark MLlib - a Spark implementation of some common machine learning (ML) functionality.
- Vowpal Wabbit - learning system sponsored by Microsoft and Yahoo!.
- WEKA - suite of machine learning software.
- Apache Hadoop Benchmarking - micro-benchmarks for testing Hadoop performance.
- Berkeley SWIM Benchmark - real-world big data workload benchmark.
- Intel HiBench - a Hadoop benchmark suite.
- PUMA Benchmarking - benchmark suite for MapReduce applications.
- Yahoo Gridmix3 - Hadoop cluster benchmarking from the Yahoo engineering team.
- Apache Knox Gateway - single point of secure access for Hadoop clusters.
- Apache Sentry - security module for data stored in Hadoop.
- Apache Ambari - operational framework for Hadoop management.
- Apache Bigtop - system deployment framework for the Hadoop ecosystem.
- Apache Helix - cluster management framework.
- Apache Mesos - cluster manager.
- Apache Slider - is a YARN application to deploy existing distributed applications on YARN.
- Apache Whirr - set of libraries for running cloud services.
- Apache YARN - Cluster manager.
- Brooklyn - library that simplifies application deployment and management.
- Buildoop - similar to Apache Bigtop, based on the Groovy language.
- Cloudera HUE - web application for interacting with Hadoop.
- Facebook Prism - multi-datacenter replication system.
- Google Borg - job scheduling and monitoring system.
- Google Omega - job scheduling and monitoring system.
- Hortonworks HOYA - application that can deploy an HBase cluster on YARN.
- Marathon - Mesos framework for long-running services.
- Apache Kiji - framework to collect and analyze data in real-time, based on HBase.
- Apache Nutch - open source web crawler.
- Apache OODT - capturing, processing and sharing of data for NASA’s scientific archives.
- Apache Tika - content analysis toolkit.
- Eclipse BIRT - Eclipse-based reporting system.
- Eventhub - open source event analytics platform.
- HIPI Library - API for performing image processing tasks on Hadoop’s MapReduce.
- Hunk - Splunk analytics for Hadoop.
- MADlib - data-processing library of an RDBMS to analyze data.
- PivotalR - R on Pivotal HD / HAWQ and PostgreSQL.
- Qubole - auto-scaling Hadoop cluster, built-in data connectors.
- Snowplow - enterprise-strength web and event analytics, powered by Hadoop, Kinesis, Redshift and Postgres.
- SparkR - R frontend for Spark.
- Splunk - analyzer for machine-generated data.
- Talend - unified open source environment for YARN, Hadoop, HBase, Hive, HCatalog & Pig.
- Apache Lucene - Search engine library.
- Apache Solr - Search platform for Apache Lucene.
- ElasticSearch - Search and analytics engine based on Apache Lucene.
- Facebook Unicorn - social graph search platform.
- Google Caffeine - continuous indexing system.
- Google Percolator - continuous indexing system.
- TeraGoogle - large search index.
- HBase Coprocessor - implementation of Percolator, part of HBase.
- LinkedIn Bobo - faceted search implementation written purely in Java, an extension to Apache Lucene.
- LinkedIn Cleo - flexible software library for enabling rapid development of partial, out-of-order and real-time typeahead search.
- LinkedIn Galene - search architecture at LinkedIn.
- LinkedIn Zoie - realtime search/indexing system written in Java.
- Sphinx Search Server - fulltext search engine.
- Amazon RDS - MySQL databases in Amazon’s cloud.
- Drizzle - evolution of MySQL 6.0.
- Google Cloud SQL - MySQL databases in Google’s cloud.
- MariaDB - enhanced, drop-in replacement for MySQL.
- MySQL Cluster - MySQL implementation using NDB Cluster storage engine.
- Percona Server - enhanced, drop-in replacement for MySQL.
- ProxySQL - High Performance Proxy for MySQL.
- TokuDB - storage engine for MySQL and MariaDB.
- WebScaleSQL - is a collaboration among engineers from several companies that face similar challenges in running MySQL at scale.
- Facebook McDipper - key/value cache for flash storage.
- Facebook Memcached - fork of Memcached.
- Twemproxy - a fast, light-weight proxy for memcached and redis.
- Twitter Fatcache - key/value cache for flash storage.
- Twitter Twemcache - fork of Memcached.
- BerkeleyDB - a software library that provides a high-performance embedded database for key/value data.
- HanoiDB - Erlang LSM BTree Storage.
- LevelDB - a fast key-value storage library written at Google that provides an ordered mapping from string keys to string values.
- LMDB - ultra-fast, ultra-compact key-value embedded data store developed by Symas.
- RocksDB - embeddable persistent key-value store for fast storage based on LevelDB.
- Jaspersoft - powerful business intelligence suite.
- Jedox Palo - customisable business intelligence platform.
- Microsoft - business intelligence software and platform.
- Microstrategy - software platforms for business intelligence, mobile intelligence, and network applications.
- Pentaho - business intelligence platform.
- Qlik - business intelligence and analytics platform.
- Tableau - business intelligence platform.
- Spango BI - open source business intelligence platform.
- Arbor - graph visualization library using web workers and jQuery.
- Chart.js - open source HTML5 Charts visualizations.
- Cubism - JavaScript library for time series visualization.
- D3 - JavaScript library for manipulating documents.
- Envisionjs - dynamic HTML5 visualization.
- Grafana - graphite dashboard frontend, editor and graph composer.
- Graphite - scalable Realtime Graphing.
- Google Charts - simple charting API.
- Highcharts - simple and flexible charting API.
- Matplotlib - plotting with Python.
- NVD3 - chart components for d3.js.
- Peity - Progressive bar, line and pie charts.
- Recline - simple but powerful library for building data applications in pure JavaScript and HTML.
- Sigma.js - JavaScript library dedicated to graph drawing.
- Vega - a visualization grammar.
- Big Data Benchmark - Benchmark of Redshift, Hive, Shark, Impala and Stinger/Tez.
- NoSQL Comparison - Cassandra vs MongoDB vs CouchDB vs Redis vs Riak vs HBase vs Couchbase vs Neo4j vs Hypertable vs ElasticSearch vs Accumulo vs VoltDB vs Scalaris comparison.
- 2013 - AMPLab - Presto: Distributed Machine Learning and Graph Processing with Sparse Matrices.
- 2013 - AMPLab - MLbase: A Distributed Machine-learning System.
- 2013 - AMPLab - Shark: SQL and Rich Analytics at Scale.
- 2013 - AMPLab - GraphX: A Resilient Distributed Graph System on Spark.
- 2013 - Google - HyperLogLog in Practice: Algorithmic Engineering of a State of The Art Cardinality Estimation Algorithm.
- 2013 - Microsoft - Scalable Progressive Analytics on Big Data in the Cloud.
- 2013 - Metamarkets - Druid: A Real-time Analytical Data Store.
- 2013 - Google - Online, Asynchronous Schema Change in F1.
- 2013 - Google - F1: A Distributed SQL Database That Scales.
- 2013 - Google - MillWheel: Fault-Tolerant Stream Processing at Internet Scale.
- 2013 - Facebook - Scuba: Diving into Data at Facebook.
- 2013 - Facebook - Unicorn: A System for Searching the Social Graph.
- 2013 - Facebook - Scaling Memcache at Facebook.
- 2012 - AMPLab - Blink and It’s Done: Interactive Queries on Very Large Data.
- 2012 - AMPLab - Fast and Interactive Analytics over Hadoop Data with Spark.
- 2012 - AMPLab - Shark: Fast Data Analysis Using Coarse-grained Distributed Memory.
- 2012 - Microsoft - Paxos Replicated State Machines as the Basis of a High-Performance Data Store.
- 2012 - Microsoft - Paxos Made Parallel.
- 2012 - AMPLab - BlinkDB: Queries with Bounded Errors and Bounded Response Times on Very Large Data.
- 2012 - Google - Processing a trillion cells per mouse click.
- 2012 - Google - Spanner: Google’s Globally-Distributed Database.
- 2011 - AMPLab - Scarlett: Coping with Skewed Popularity Content in MapReduce Clusters.
- 2011 - AMPLab - Mesos: A Platform for Fine-Grained Resource Sharing in the Data Center.
- 2011 - Google - Megastore: Providing Scalable, Highly Available Storage for Interactive Services.
- 2010 - Facebook - Finding a needle in Haystack: Facebook’s photo storage.
- 2010 - AMPLab - Spark: Cluster Computing with Working Sets.
- 2010 - Google - Storage Architecture and Challenges.
- 2010 - Google - Pregel: A System for Large-Scale Graph Processing.
- 2010 - Google - Large-scale Incremental Processing Using Distributed Transactions and Notifications (basis of Percolator and Caffeine).
- 2010 - Google - Dremel: Interactive Analysis of Web-Scale Datasets.
- 2010 - Yahoo - S4: Distributed Stream Computing Platform.
- 2009 - HadoopDB: An Architectural Hybrid of MapReduce and DBMS Technologies for Analytical Workloads.
- 2008 - AMPLab - Chukwa: A large-scale monitoring system.
- 2007 - Amazon - Dynamo: Amazon’s Highly Available Key-value Store.
- 2006 - Google - The Chubby lock service for loosely-coupled distributed systems.
- 2006 - Google - Bigtable: A Distributed Storage System for Structured Data.
- 2004 - Google - MapReduce: Simplified Data Processing on Large Clusters.
- 2003 - Google - The Google File System.
Other amazingly awesome lists can be found in the awesome-awesomeness list.