
Hadoop Mini Project

Post-Sale Automobile Report

In this project, we will utilize data from an automobile tracking platform that tracks the history of important incidents after the initial sale of a new vehicle. Such incidents include subsequent private sales, repairs, and accident reports. The platform provides a good reference for second-hand buyers to understand the vehicles they are interested in.

The report is stored as CSV files in HDFS with the following schema:

[Screenshot: report schema]

Learning Objectives

  • Writing MapReduce jobs in Python.
  • Leveraging the MapReduce processing model to process large-scale data and break a complex problem down into smaller tasks.
  • Getting familiar with the VirtualBox environment.
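
To make the "smaller tasks" idea concrete, here is a minimal local sketch of the map and reduce steps as plain Python functions. It is not the project's actual mapper or reducer: the function names, the choice of the first CSV field as the key, and the sample rows are all assumptions made for illustration, since the real schema is only shown as an image.

```python
from __future__ import print_function  # keeps print() usable on Python 2 as well

from itertools import groupby


def map_lines(lines):
    """Map step: emit (key, 1) for every non-empty CSV row.

    Using the first comma-separated field as the key is a placeholder
    choice; the real job would pick columns from the report schema.
    """
    for line in lines:
        fields = line.strip().split(",")
        if fields[0]:
            yield fields[0], 1


def reduce_pairs(pairs):
    """Reduce step: sum the counts of all pairs that share a key.

    Hadoop sorts mapper output by key before the reducer runs, so
    sorted() stands in for that shuffle phase in this local sketch.
    """
    for key, group in groupby(sorted(pairs), key=lambda kv: kv[0]):
        yield key, sum(count for _, count in group)


# Made-up rows, only to exercise the two functions locally.
sample_rows = ["a,first", "b,second", "a,third", ""]
for key, total in reduce_pairs(map_lines(sample_rows)):
    print("%s\t%d" % (key, total))
```

Splitting the work this way keeps each step small enough to test on a few lines of input before running it on the cluster.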

Setting up Hadoop using Hortonworks Hadoop Sandbox

Step 1:

From your local terminal, run upload_files.sh to upload the project files to the root directory of the VirtualBox sandbox:

  • You will be prompted for the root account's password before the files are uploaded.

[Screenshot: running upload_files.sh]

Step 2:

From the Sandbox's Web Shell Client (http://localhost:4200), log in as root and put data.csv into the Hadoop file system:

$ hadoop fs -mkdir -p /user/root/test_dir
$ hadoop fs -put data.csv /user/root/test_dir

Double-check the uploaded file in the Ambari Files View:

  • Note: the owner of both the folder and the file must be root.

[Screenshot: test_dir in the Ambari Files View]

Step 3:

From the Sandbox's Web Shell Client, run auto.sh:

$ bash auto.sh

Step 4:

After all the MapReduce jobs have completed successfully, check the output:

  • From the all_accidents folder: [screenshot of the output]

  • From the make_year_count folder: [screenshot of the output]


NOTE:

  • The default Python environment in the VirtualBox sandbox is Python 2, so you should either upgrade the environment to Python 3 or adapt your code to Python 2.

For example, Python 2 does not support f-strings (added in Python 3.6), so a MapReduce script that uses them fails with a syntax error. Use %-formatting instead, where %s is a placeholder for a string and %d is a placeholder for an integer.

  • The easiest way to check whether a script is compatible with Python 2 is to run it in the Sandbox's Web Shell Client (http://localhost:4200), for example python mapper1.py. If no error occurs, the code works under Python 2.
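
The formatting point above can be sketched as follows. The function name and the field names (make, year, count) are hypothetical, chosen only to illustrate %s and %d; they are not taken from the project's scripts.

```python
from __future__ import print_function  # print() as a function on Python 2 and 3


def format_record(make, year, count):
    # %-formatting runs on both Python 2 and Python 3.
    # The f-string equivalent, f"{make}{year}\t{count}", is a
    # SyntaxError on Python 2 and on Python 3 versions before 3.6.
    return "%s%d\t%d" % (make, year, count)


print(format_record("Ford", 2012, 3))
```

Because the Python 2 failure is a syntax error, it is raised as soon as the script is loaded, which is why simply running the script in the sandbox is enough to catch it.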