This project houses the examples used in my upcoming DevNexus presentation, Transformation Processing Smackdown; Spark vs Hive vs Pig.
The following information should help you get set up to run the examples.
First, you will need a Hadoop cluster with Pig, Hive, and Spark properly installed. You have multiple options, ranging from the major distribution providers to a roll-your-own (RYO) approach directly from http://hadoop.apache.org/.
Since I work for Hortonworks, I'm going to leverage the Hortonworks Sandbox, but again, you can use whatever Hadoop cluster you like; I've tested everything against the 2.5 version of the Sandbox.
All of the Pig, Hive, and Spark code presented in this project should run just about anywhere.
I'm using the Ambari Pig View to run these scripts, but you could use Hue, Pig's Grunt shell, or put the contents in a file and run them from the CLI with the Pig executable.
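If you go the CLI route, the invocation is a one-liner; a minimal sketch, assuming your script is saved as my_transformations.pig (a made-up name), looks like this:

```bash
# run a saved Pig script (default MapReduce execution engine)
pig -f my_transformations.pig

# or run the same script on the Tez execution engine
pig -x tez -f my_transformations.pig
```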
I'm using piggybank.jar, so I moved it into HDFS to keep the scripts generic from distro to distro. In my environment, I ran the following to do this:
    hdfs dfs -put /usr/hdp/current/pig-client/lib/piggybank.jar /tmp
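Once the jar is in HDFS, each script can register it from that shared location instead of a distro-specific local path; the registration line looks like this (the path matches the /tmp destination above):

```pig
-- pull piggybank from HDFS so the script is not tied to a local install path
REGISTER hdfs:///tmp/piggybank.jar;
```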
I'm using the Ambari Hive View to run these scripts, but you could use Hue, the old-school Hive CLI, the newer Beeline, or a notebook like Zeppelin.
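As an example of the Beeline route, something like the following should work (the JDBC URL assumes HiveServer2 is on localhost at its default port of 10000, and the script name is a placeholder):

```bash
# connect to HiveServer2 and run a saved HiveQL script
beeline -u jdbc:hive2://localhost:10000 -f my_transformations.sql
```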
I'm using Zeppelin (included with the Hortonworks Sandbox), but you could use another web notebook or the Spark shell.
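The Spark code itself is the same regardless of front end. As a minimal sketch, a snippet like this runs as-is in either Zeppelin or the Spark shell, since both provide a SparkContext named sc (the input path is made up):

```scala
// count and peek at the first few lines of a file in HDFS (path is hypothetical)
val lines = sc.textFile("/tmp/some_input.csv")
println(lines.count())
lines.take(5).foreach(println)
```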
Perform the operations described in the loading the data set instructions.
- File Formats
  - Delimited Values (see the Pig sketch after this list)
  - XML
  - JSON
  - To be explored (later)
    - Other "normal" big data formats such as Avro, Parquet, ORC
    - Esoteric formats such as EBCDIC and compact RYO solutions
- Source to Target Mapping
- Data Quality
- Data Profiling
- Core Processing
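To make the first of those concrete, here is a minimal Pig sketch of the delimited-values case that uses the piggybank jar staged earlier; the input file and its schema are made up for illustration:

```pig
REGISTER hdfs:///tmp/piggybank.jar;

-- load a comma-delimited file with piggybank's CSVExcelStorage
-- (file name and schema are hypothetical)
emps = LOAD '/tmp/employees.csv'
       USING org.apache.pig.piggybank.storage.CSVExcelStorage(',')
       AS (id:int, name:chararray, dept:chararray);

DUMP emps;
```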