Post-Sale Automobile Report
In this project, we will utilize data from an automobile tracking platform that tracks the history of important incidents after the initial sale of a new vehicle. Such incidents include subsequent private sales, repairs, and accident reports. The platform provides a good reference for second-hand buyers to understand the vehicles they are interested in.
The report is stored as CSV files in HDFS with the following schema:
Learning objectives:
- Writing MapReduce jobs in Python.
- Using the MapReduce processing model to process large-scale data by breaking a complex problem down into smaller tasks.
- Getting familiar with the VirtualBox environment.
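To make the MapReduce model mentioned above concrete, here is a minimal local sketch of the three phases (map, shuffle, reduce) in plain Python. It does not use Hadoop, and the record layout `(vehicle_id, incident_type)` and all function names are assumptions made for illustration; the real CSV schema and the project's mapper and reducer scripts may differ.

```python
from itertools import groupby
from operator import itemgetter

# Hypothetical sample records: (vehicle_id, incident_type).
# The real CSV schema may contain different or additional columns.
RECORDS = [
    ("V1", "sale"),
    ("V2", "accident"),
    ("V1", "repair"),
    ("V2", "accident"),
    ("V1", "accident"),
]


def map_phase(records):
    """Emit one (incident_type, 1) pair per input record."""
    for _vehicle_id, incident_type in records:
        yield incident_type, 1


def shuffle_phase(pairs):
    """Sort pairs by key so that equal keys end up next to each other."""
    return sorted(pairs, key=itemgetter(0))


def reduce_phase(sorted_pairs):
    """Sum the counts of each run of identical keys."""
    for key, group in groupby(sorted_pairs, key=itemgetter(0)):
        yield key, sum(count for _key, count in group)


def count_incidents(records):
    """Chain the three phases and return {incident_type: count}."""
    return dict(reduce_phase(shuffle_phase(map_phase(records))))


if __name__ == "__main__":
    print(count_incidents(RECORDS))
```

Each phase only needs to see its own small piece of the problem, which is what allows Hadoop to spread the same logic across many machines.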
Please follow the instruction video to set up Hadoop with the Hortonworks Hadoop Sandbox (Cloudera).
- Install VirtualBox
- Install Cloudera HDP
Extra material: how to move files from the Linux environment to HDFS and back in the Cloudera Hadoop virtual machine.
From your local terminal, run upload_files.sh to upload the files to the root directory of the VirtualBox VM:
- You have to enter the password of the root account in order to upload the files.
From the Sandbox's Web Shell Client (http://localhost:4200), log in as the root account and put data.csv into the Hadoop file system:
$ hadoop fs -mkdir test_dir
$ hadoop fs -put data.csv /user/root/test_dir
Double-check the uploaded file in the Ambari Files View:
- Note: the owner of the folder and file must be root!
From the Sandbox's Web Shell Client, run auto.sh:
$ bash auto.sh
After all the MapReduce jobs have executed successfully, let's check the output:
NOTE:
- The default Python environment in the VirtualBox VM is version 2, so you should either upgrade the Python environment to version 3 (or above) or tailor your code to run on Python 2.
For example, Python 2 does not support f-strings (they were added in Python 3.6), so an f-string causes an error when you run the MapReduce Python script. Instead, use %-style formatting, where %s acts as a placeholder for a string and %d acts as a placeholder for a number.
- The easiest way to check whether your Python script is compatible with Python 2 is to run python mapper1.py (or another Python script) in the Sandbox's Web Shell Client (http://localhost:4200). If no error occurs, your code works on Python 2.
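The f-string point in the note above can be illustrated with a short sketch that runs unchanged on both Python 2 and Python 3. The helper name `format_pair` and the tab-separated key/count layout are assumptions made for illustration, not taken from the project's scripts.

```python
from __future__ import print_function  # print() then behaves the same on Python 2 and 3


def format_pair(key, count):
    """Build a tab-separated output line using %-style formatting."""
    # Python 3.6+ only, and a SyntaxError on Python 2:
    #     f"{key}\t{count}"
    # %s is the string placeholder and %d the integer placeholder.
    return "%s\t%d" % (key, count)


if __name__ == "__main__":
    print(format_pair("accident", 3))
```

Because an f-string fails when the file is parsed, Python 2 rejects the whole script before any line of it runs, so even one f-string in a rarely used branch is enough to break a job.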