Skip to content

Latest commit

 

History

History
103 lines (89 loc) · 2.15 KB

7-SamplingAndBucketing.md

File metadata and controls

103 lines (89 loc) · 2.15 KB

Sampling and Bucketing

  • On Hive shell, let's pick a sample of the existing flight_data table based on unique_carrier column:

SELECT
   *
FROM
   flight_data TABLESAMPLE(BUCKET 1 OUT OF 4 ON unique_carrier)s LIMIT 100;

  • On Hive shell: create a bucketed table:

CREATE EXTERNAL TABLE flight_data_b(
   year INT,
   month INT,
   day INT,
   day_of_week INT,
   dep_time INT,
   crs_dep_time INT,
   arr_time INT,
   crs_arr_time INT,
   unique_carrier STRING,
   flight_num INT,
   tail_num STRING,
   actual_elapsed_time INT,
   crs_elapsed_time INT,
   air_time INT,
   arr_delay INT,
   dep_delay INT,
   origin STRING,
   dest STRING,
   distance INT,
   taxi_in INT,
   taxi_out INT,
   cancelled INT,
   cancellation_code STRING,
   diverted INT,
   carrier_delay STRING,
   weather_delay STRING,
   nas_delay STRING,
   security_delay STRING,
   late_aircraft_delay STRING
)
CLUSTERED BY (unique_carrier) INTO 4 BUCKETS
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LOCATION '/user/hive/warehouse/flight_data_b';

  • Now, let's populate this table. Before we do that, we need to populate a property that would ensure the number of reducers is the same as number of buckets in the destination table. On hive shell, set the following property:

SET hive.enforce.bucketing = true;

  • On hive shell: run the command to populate the bucketed table (it's no different than running the command on a regular table.

INSERT OVERWRITE TABLE flight_data_b
SELECT
   *
FROM
   flight_data;

  • Disable the enforce bucketing property back to its default value:

SET hive.enforce.bucketing = false;

  • On bash: verify the number of files in the HDFS directory corresponding to flight_data_b table.

hadoop fs -ls /user/hive/warehouse/flight_data_b/

  • On hive shell: run a sampling query on the bucketed table

SELECT
   *
FROM
   flight_data_b TABLESAMPLE(BUCKET 1 OUT OF 4  ON unique_carrier)s LIMIT 100;

  • Compare the time betwen sampling on non-bucketed table with that on bucketed table