This page presents quick comparisions across Pig, Hive and Spark with respect to the core processing features one would expect in a programming framework being used for data transformation, cleansing and enrichment.
Ensure the operations described in loading the data set are completed.
See the Script Execution section of the oss-transform-processing-comparison main page for options on running these scripts.
Create a list of the airport
records that are in the state of Texas.
Simple activity for Pig as shown in filtering.pig
.
filtering.hql
shows it just doesn't get any simplier that this.
Just as simple with the DataFrame API; see filtering.spark
.
Order the flight
dataset by the longest departure delays first and for ties on this bad value, secondary sort on the carrier names and their flight numbers.
The Pig Latin for this is rather straightforward as seen in sorting.pig
.
Easy to do in SQL as visualized in sorting.hql
.
Relatively straightforward as sorting.spark
expresses.
Augment the flight
dataset to include the year the plane was manufactured.
This gets a bit "chatty" in Pig Latin as visualized in joining.pig
.
This can easily be done with SQL as seen in joining.hql
.
Using the DataFrame API in Spark SQL does look at bit more "involved" than simply SQL as you can tell by viewing joining.spark
.
Determine a distinct list of aircraft_type
values from the plane
dataset. That list will include one call "Rotorcraft" which should be used to determine a distinct list of the manufacturer
s for this type of aircraft.
Run distinct.pig
to generate the following two lists.
Distinct aircraft types.
(Balloon)
(Rotorcraft)
(Fixed Wing Multi-Engine)
(Fixed Wing Single-Engine)
()
Distinct manufacturers of rotorcraft.
(BELL)
(SIKORSKY)
(AGUSTA SPA)
(AEROSPATIALE)
(COBB INTL/DBA ROTORWAY INTL IN)
Submit the two queries in distinct.hql
to get the following results.
Distinct aircraft types.
+---------------------------+--+
| aircraft_type |
+---------------------------+--+
| NULL |
| Balloon |
| Fixed Wing Multi-Engine |
| Fixed Wing Single-Engine |
| Rotorcraft |
+---------------------------+--+
Distinct manufacturers of rotorcraft.
+---------------------------------+--+
| manufacturer |
+---------------------------------+--+
| AEROSPATIALE |
| AGUSTA SPA |
| BELL |
| COBB INTL/DBA ROTORWAY INTL IN |
| SIKORSKY |
+---------------------------------+--+
See the code in distinct.spark
which displays the following results.
Distinct aircraft types.
+--------------------+
| aircraft_type|
+--------------------+
| Rotorcraft|
|Fixed Wing Single...|
| null|
| Balloon|
|Fixed Wing Multi-...|
+--------------------+
Distinct manufacturers of rotorcraft.
+--------------------+
| manufacturer|
+--------------------+
| BELL|
| SIKORSKY|
| AGUSTA SPA|
| AEROSPATIALE|
|COBB INTL/DBA ROT...|
+--------------------+
Determine the average dep_delay
and average taxi_out
values by aggregating the origin
airport for all flight
records.
Execute aggregate.pig
to generate the following results.
(PIR,49.5,14.0)
(ACY,35.916666666666664,11.833333333333334)
(ACK,25.558333333333334,20.625)
(CEC,23.40764331210191,12.773885350318471)
(LMT,23.40268456375839,11.208053691275168)
Run the query at aggregate.hql
to generate the following results.
+---------+---------------------+---------------------+--+
| origin | avg_dep_delay | avg_taxi_out |
+---------+---------------------+---------------------+--+
| PIR | 49.5 | 14.0 |
| ACY | 35.916666666666664 | 11.833333333333334 |
| ACK | 25.558333333333334 | 20.625 |
| CEC | 23.40764331210191 | 12.773885350318471 |
| LMT | 23.40268456375839 | 11.208053691275168 |
+---------+---------------------+---------------------+--+
5 rows selected (10.74 seconds)
Run the code in aggregate.spark
to generate the following results.
+------+------------------+------------------+
|origin| avg_dep_delay| avg_taxi_out|
+------+------------------+------------------+
| PIR| 49.5| 14.0|
| ACY|35.916666666666664|11.833333333333334|
| ACK|25.558333333333334| 20.625|
| CEC| 23.40764331210191|12.773885350318471|
| LMT| 23.40268456375839|11.208053691275168|
+------+------------------+------------------+
Hive is the slight winner in the space as everyone speaks the "language of SQL" and these basic operations presented above are very well known by almost all folks who work with data.
I only gave Hive the "slight" connotation as Pig and Spark are both equally capable in these spaces, but I fear the masses will think they are a bit "chatty" to get the job done. As is often the case, it is a matter of style.