Sparkling Water integrates H2O's fast, scalable machine learning engine with Spark. It provides:
- Utilities to publish Spark data structures (RDDs, DataFrames, Datasets) as H2O's frames and vice versa.
- DSL to use Spark data structures as input for H2O's algorithms.
- Basic building blocks to create ML applications utilizing Spark and H2O APIs.
- Python interface enabling use of Sparkling Water directly from PySpark.
Sparkling Water is developed in multiple parallel branches. Each branch corresponds to a Spark major release (e.g., branch rel-2.1 provides the implementation of Sparkling Water for Spark 2.1).
Please switch to the branch that matches your Spark version:
- For Spark 2.2 use branch rel-2.2
- For Spark 2.1 use branch rel-2.1
- For Spark 2.0 use branch rel-2.0
- For Spark 1.6 use branch rel-1.6 (only critical fixes)
Note: The master branch includes the latest changes for the latest Spark version. These changes are back-ported to older Sparkling Water versions.
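The branch naming scheme can be captured in a small shell helper. This is only a minimal sketch: `sw_branch_for_spark` is a hypothetical function name (it is not part of Sparkling Water), and it covers only the Spark versions listed here.

```shell
#!/bin/sh
# Hypothetical helper (not shipped with Sparkling Water): map a Spark version
# such as 2.1.1 to the matching branch name, using the rel-<major>.<minor> scheme.
sw_branch_for_spark() {
  case "$1" in
    2.2|2.2.*) echo "rel-2.2" ;;
    2.1|2.1.*) echo "rel-2.1" ;;
    2.0|2.0.*) echo "rel-2.0" ;;
    1.6|1.6.*) echo "rel-1.6" ;;
    *)
      # Versions outside the listed branches are reported as an error.
      echo "no Sparkling Water branch listed for Spark $1" >&2
      return 1
      ;;
  esac
}

# Example: print the branch for Spark 2.1.1 (prints rel-2.1).
sw_branch_for_spark "2.1.1"
```

In a local clone, the result could be passed to git, for example `git checkout "$(sw_branch_for_spark 2.1.1)"`.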
- Linux/OS X/Windows
- Java 8+
- Python 2.7+ for the Python version of Sparkling Water (PySparkling)
- Spark 2.2, and the SPARK_HOME shell variable must point to your local Spark installation
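A quick way to verify the SPARK_HOME requirement before starting is sketched below. It assumes a Unix-like shell and that a Spark installation contains `bin/spark-submit`; `check_spark_home` is a hypothetical helper name, not part of Sparkling Water.

```shell
#!/bin/sh
# Hypothetical helper: succeed only if SPARK_HOME is set and looks like a
# Spark installation (it contains an executable bin/spark-submit).
check_spark_home() {
  if [ -z "${SPARK_HOME:-}" ]; then
    echo "SPARK_HOME is not set" >&2
    return 1
  fi
  if [ ! -x "$SPARK_HOME/bin/spark-submit" ]; then
    echo "no executable bin/spark-submit under $SPARK_HOME" >&2
    return 1
  fi
  echo "SPARK_HOME looks valid: $SPARK_HOME"
}

# Report the result without aborting the calling shell.
check_spark_home || echo "fix SPARK_HOME before starting Sparkling Water" >&2
```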
Binaries for each Sparkling Water version can be downloaded here:
- Sparkling Water - Latest version
- Sparkling Water - Latest 2.1 version
- Sparkling Water - Latest 2.0 version
- Sparkling Water - Latest 1.6 version
Each Sparkling Water release is published to Maven Central. Published artifacts are built for the following Scala versions:
- Sparkling Water 2.1.x - Scala 2.11
- Sparkling Water 2.0.x - Scala 2.11
- Sparkling Water 1.6.x - Scala 2.10
The artifact coordinates are:
- ai.h2o:sparkling-water-core_{{scala_version}}:{{version}} - includes the core of Sparkling Water.
- ai.h2o:sparkling-water-examples_{{scala_version}}:{{version}} - includes example applications.
Note: {{version}} refers to a release version of Sparkling Water, and {{scala_version}} refers to the Scala base version (2.10 or 2.11). For example: ai.h2o:sparkling-water-examples_2.11:2.1.0
The full list of published packages is available here.
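To illustrate how the coordinate template is filled in, the sketch below assembles a coordinate and prints a Spark Shell command that would pull it from Maven Central through Spark's standard `--packages` option. `sw_coordinate` is a hypothetical helper name, and the version numbers are just the ones used in the example coordinate; substitute the release that matches your Spark version.

```shell
#!/bin/sh
# Hypothetical helper: build a Maven coordinate of the form
#   ai.h2o:sparkling-water-<module>_<scala_version>:<version>
sw_coordinate() {
  module="$1"
  scala_version="$2"
  version="$3"
  echo "ai.h2o:sparkling-water-${module}_${scala_version}:${version}"
}

# Print (rather than run) a spark-shell command that uses the core artifact.
# The single-quoted format string keeps $SPARK_HOME literal in the output.
printf '$SPARK_HOME/bin/spark-shell --packages %s\n' "$(sw_coordinate core 2.11 2.1.0)"
```

The command is printed instead of executed so that the snippet can be tried without a Spark installation.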
Sparkling Water is distributed as a Spark application library that can be used by any Spark application. We also provide a zip distribution that bundles the library and shell scripts.
There are several ways of using Sparkling Water:
- Sparkling Shell
- Sparkling Water driver
- Spark Shell, including the Sparkling Water library via the --jars or --packages option
- Spark Submit, including the Sparkling Water library via the --jars or --packages option
- PySpark with PySparkling
The Sparkling Shell encapsulates a regular Spark shell and appends the Sparkling Water library to the classpath via the --jars option.
The Sparkling Shell supports creation of an H2O cloud and execution of H2O algorithms.
Either download or build Sparkling Water.
Configure the location of the Spark cluster:
    export SPARK_HOME="/path/to/spark/installation"
    export MASTER="local[*]"
In this case, local[*] points to an embedded single-node cluster.
Run Sparkling Shell:
    bin/sparkling-shell
Sparkling Shell accepts common Spark Shell arguments. For example, to increase the memory allocated to each executor, use the spark.executor.memory parameter:
    bin/sparkling-shell --conf "spark.executor.memory=4g"
Initialize H2OContext:
    import org.apache.spark.h2o._
    val hc = H2OContext.getOrCreate(spark)
H2OContext starts H2O services on top of the Spark cluster and provides primitives for transformations between H2O and Spark data structures.
Sparkling Water can also be used directly from PySpark; this integration is called PySparkling.
See the PySparkling README to learn more about PySparkling.
To see how Sparkling Water can be used as a Spark package, please see Use as Spark Package.
See Windows Tutorial to learn how to use Sparkling Water in Windows environments.
To see how to run examples for Sparkling Water, please see Running Examples.
Sparkling Water supports two backend/deployment modes: internal and external. Sparkling Water applications are independent of the selected backend. The backend can be specified before the H2OContext is created.
For more details regarding the internal or external backend, please see Backends.
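As a rough sketch of choosing the backend before the H2OContext exists, the mode could be passed as a Spark configuration property when launching Sparkling Shell. The property name `spark.ext.h2o.backend.cluster.mode` is an assumption here and should be checked against the Backends documentation for your release; `sw_backend_conf` is a hypothetical helper name. The external backend is likely to need additional settings that are not shown.

```shell
#!/bin/sh
# Hypothetical helper: validate a backend mode and print the matching value
# for --conf. The property name is an assumption; verify it in the Backends
# documentation for your Sparkling Water release.
sw_backend_conf() {
  case "$1" in
    internal|external)
      echo "spark.ext.h2o.backend.cluster.mode=$1"
      ;;
    *)
      echo "unknown backend mode: $1 (expected internal or external)" >&2
      return 1
      ;;
  esac
}

# Print the launch command instead of running it.
printf 'bin/sparkling-shell --conf "%s"\n' "$(sw_backend_conf internal)"
```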
A list of all frequently asked questions is available at FAQ.
Complete development documentation is available at Development Documentation.
To see how to build Sparkling Water, please see Build Sparkling Water.
An application using Sparkling Water is a regular Spark application that bundles the Sparkling Water library. See the Sparkling Water Droplet, which provides an example application, here.
Look at our list of JIRA tasks for new contributors or send your idea to [email protected].
To report issues, please use our JIRA page at http://jira.h2o.ai/.
We also respond to questions tagged with the sparkling-water and h2o tags on Stack Overflow.
Change logs are available at Change Logs.