oss-transform-processing-comparison

This project houses the examples used in my upcoming DevNexus presentation, Transformation Processing Smackdown; Spark vs Hive vs Pig.

Video

Ingesting into HDFS

Slides

Transformation Processing Smackdown; Spark vs Hive vs Pig

Setup

The following information should help you get set up to run the examples.

Hadoop Cluster

First, you will need a Hadoop cluster with Pig, Hive and Spark properly installed. You have multiple options, including the major distribution providers as well as a roll-your-own (RYO) approach built directly from http://hadoop.apache.org/.

Since I work for Hortonworks, I'm using the Hortonworks Sandbox, but again, you can use whatever Hadoop cluster you like. I've tested everything using version 2.5 of the Sandbox.

Script Execution

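All of the Pig, Hive and Spark code presented in this project should run on just about any cluster that meets the requirements above. The tools I used to run each type of script are listed below, along with some alternatives.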

Pig

I'm using the Ambari Pig View to run these scripts, but you could use Hue, Pig's Grunt shell, or put the contents in a file and run them from the CLI with the Pig executable.
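
For the CLI option, a minimal sketch is shown here. The file name `example.pig` is a hypothetical placeholder rather than a file from this project, and the guard simply skips the run when the `pig` executable or the script is not available.

```shell
#!/bin/sh
# Hypothetical script name (an assumption); substitute one of the project's Pig scripts.
PIG_SCRIPT="example.pig"

# Run a Pig script file in batch mode using the pig executable.
run_pig_script() {
    pig -f "$1"
}

# Only attempt the run when pig is on the PATH and the script file exists.
if command -v pig >/dev/null 2>&1 && [ -f "$PIG_SCRIPT" ]; then
    run_pig_script "$PIG_SCRIPT"
else
    echo "skipping: pig not on PATH or $PIG_SCRIPT not found"
fi
```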

I'm using piggybank.jar, so I copied it into HDFS to keep the scripts generic from distro to distro. In my environment, I ran the following to do this.

hdfs dfs -put /usr/hdp/current/pig-client/lib/piggybank.jar /tmp
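
Before running the Pig scripts, it may be worth confirming that the jar actually landed in HDFS. The following is a small sketch of one way to check. It assumes the `/tmp` destination used in the command above, and the function name `jar_in_hdfs` is made up for illustration.

```shell
#!/bin/sh
# Assumed HDFS location, based on the -put command above; adjust if you used another directory.
PIGGYBANK_HDFS_PATH="/tmp/piggybank.jar"

# Returns 0 if the given path exists in HDFS, non-zero otherwise
# (including when no hdfs client is available on this machine).
jar_in_hdfs() {
    command -v hdfs >/dev/null 2>&1 || return 2
    hdfs dfs -test -e "$1"
}

if jar_in_hdfs "$PIGGYBANK_HDFS_PATH"; then
    echo "found $PIGGYBANK_HDFS_PATH"
else
    echo "missing $PIGGYBANK_HDFS_PATH (or no hdfs client on PATH)"
fi
```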

Hive

I'm using the Ambari Hive View to run these scripts, but you could use Hue, the old-school Hive CLI, the newer Beeline, or a notebook such as Zeppelin.
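
As a rough sketch of the two command-line options, the snippet below runs a script file with either Beeline or the Hive CLI. The file name `example.hql` and the HiveServer2 URL `jdbc:hive2://localhost:10000` are assumptions for illustration; your cluster may need a different host, port, or credentials.

```shell
#!/bin/sh
# Hypothetical script name and HiveServer2 JDBC URL; both are assumptions.
HIVE_SCRIPT="example.hql"
HS2_URL="jdbc:hive2://localhost:10000"

# Hive CLI: run the statements contained in a file.
run_with_hive_cli() {
    hive -f "$1"
}

# Beeline: connect to HiveServer2 over JDBC, then run the file.
run_with_beeline() {
    beeline -u "$HS2_URL" -f "$1"
}

# Prefer Beeline when present, fall back to the Hive CLI, otherwise skip.
if [ -f "$HIVE_SCRIPT" ] && command -v beeline >/dev/null 2>&1; then
    run_with_beeline "$HIVE_SCRIPT"
elif [ -f "$HIVE_SCRIPT" ] && command -v hive >/dev/null 2>&1; then
    run_with_hive_cli "$HIVE_SCRIPT"
else
    echo "skipping: $HIVE_SCRIPT not found, or neither beeline nor hive is on PATH"
fi
```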

Spark

I'm using Zeppelin (included with the Hortonworks Sandbox), but you could use another web notebook or the Spark shell.

Data Set

Perform the operations described in loading the data set.

The Comparison!!
