The goal of this project is to analyze the data the MTA collects from its turnstiles in order to identify ridership trends across the year, broken down by month, day of the week, and weather.
- Shenghua You, shenghuayou@gmail.com
- Victor Fung, victor.fung122@gmail.com
- Simon Ling, simon-ling@hotmail.com
- Languages: Python, JavaScript
- Framework: Spark
- Cluster: Hadoop, HDFS (HUE)
- MTA Turnstile Dataset - http://web.mta.info/developers/turnstile.html
- Wunderground - https://www.wunderground.com/
-
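As a rough illustration of the kind of aggregation this project performs, the sketch below totals turnstile entries by day of the week in plain Python. It is not the project's PySpark code. The column names (`C/A`, `UNIT`, `SCP`, `DATE`, `TIME`, `ENTRIES`), the `MM/DD/YYYY` date format, the treatment of `ENTRIES` as a cumulative counter, and the sample rows are all assumptions made for the example, and `entries_by_weekday` is a hypothetical helper name.

```python
import csv
import io
from collections import defaultdict
from datetime import datetime

DAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
        "Friday", "Saturday", "Sunday"]

# Hypothetical sample rows in the assumed turnstile CSV layout.
SAMPLE = """C/A,UNIT,SCP,STATION,LINENAME,DIVISION,DATE,TIME,DESC,ENTRIES,EXITS
A002,R051,02-00-00,59 ST,NQR456,BMT,01/02/2017,00:00:00,REGULAR,1000,500
A002,R051,02-00-00,59 ST,NQR456,BMT,01/02/2017,04:00:00,REGULAR,1020,510
A002,R051,02-00-00,59 ST,NQR456,BMT,01/03/2017,00:00:00,REGULAR,1100,560
A002,R051,02-00-00,59 ST,NQR456,BMT,01/03/2017,04:00:00,REGULAR,1130,575
"""


def entries_by_weekday(csv_text):
    """Sum per-interval entry counts, grouped by the weekday the interval starts on."""
    # Group the cumulative readings by physical turnstile (assumed key columns).
    readings = defaultdict(list)
    for row in csv.DictReader(io.StringIO(csv_text)):
        key = (row["C/A"], row["UNIT"], row["SCP"])
        ts = datetime.strptime(row["DATE"] + " " + row["TIME"],
                               "%m/%d/%Y %H:%M:%S")
        readings[key].append((ts, int(row["ENTRIES"])))

    totals = defaultdict(int)
    for series in readings.values():
        series.sort()
        # Consecutive readings give the number of entries in each interval.
        for (start, prev_count), (_, count) in zip(series, series[1:]):
            delta = count - prev_count
            if delta >= 0:  # ignore apparent counter resets
                totals[DAYS[start.weekday()]] += delta
    return dict(totals)


if __name__ == "__main__":
    print(entries_by_weekday(SAMPLE))
```

On a cluster, the same grouping would more naturally be expressed as Spark transformations over the full dataset; the plain-Python version is only meant to make the logic easy to follow.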
For the analysis results, see the IPython notebooks:
-
To view the PySpark code used to obtain the results in the analysis, see the IPython notebooks:
-
To run the code on the Hadoop cluster, follow these steps:
- Make the python file executable
Command:
$ hadoop fs -chmod +x <your python file location>
Example:
$ hadoop fs -chmod +x /user/vfung000/project/python-code.py
- Executing on the cluster
Command:
$ spark-submit --name <name of job> \
--num-executors <number> \
<python code location>
Example:
$ spark-submit --name "projWeatherDayweek" \
--num-executors 10 \
hdfs:///user/vfung000/project/HadoopPHD.py \