Analysis of Airline On-Time Performance Data in Hadoop and Spark by wuyichen24

Launched and configured multiple AWS EC2 instances with proper security settings (IAM Role, Security Group, Private Key Access).
Installed and deployed Hadoop and Spark on a cluster of AWS EC2 instances using Ambari.
Designed several solutions and implemented them in Hadoop and Spark to analyze the Airline On-Time Performance Data (all non-cancelled flights between 1988 and 2008) from the US Bureau of Transportation Statistics (BTS).
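
As an illustration of the kind of analysis described above, here is a minimal map/reduce-style sketch in plain Python that computes the average arrival delay per carrier. It is not taken from the project. The query itself, the column names (UniqueCarrier, ArrDelay, Cancelled), the "NA" marker for missing values, and the sample rows are all assumptions made for the example.

```python
import csv
import io
from collections import defaultdict


def mapper(row):
    """Map step: emit (carrier, arrival_delay) for one row, or None to skip it."""
    # Column names are assumed; adjust them to the actual file header.
    if row.get("Cancelled") == "1":
        return None  # keep only non-cancelled flights
    try:
        return row["UniqueCarrier"], float(row["ArrDelay"])
    except (KeyError, ValueError):
        return None  # missing column or non-numeric value such as "NA"


def reducer(pairs):
    """Reduce step: average the delays for each carrier key."""
    totals = defaultdict(lambda: [0.0, 0])  # carrier -> [sum, count]
    for carrier, delay in pairs:
        totals[carrier][0] += delay
        totals[carrier][1] += 1
    return {carrier: s / n for carrier, (s, n) in totals.items()}


def average_arrival_delay(csv_text):
    """Run the map and reduce steps over CSV text that has a header row."""
    rows = csv.DictReader(io.StringIO(csv_text))
    pairs = (p for p in map(mapper, rows) if p is not None)
    return reducer(pairs)


# Made-up sample rows, for illustration only.
SAMPLE = """UniqueCarrier,ArrDelay,Cancelled
AA,10,0
AA,20,0
UA,NA,0
UA,-4,0
DL,5,1
"""

if __name__ == "__main__":
    print(average_arrival_delay(SAMPLE))
```

The same two steps map naturally onto a Hadoop Mapper/Reducer pair or onto a Spark map followed by a per-key aggregation, which is presumably how a query like this would be expressed on the cluster.
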
Installed and deployed a Cassandra database on multiple nodes and stored the computed results in it.
Applied system-level optimizations by creating instances with a higher ratio of vCPUs to memory, and application-level optimizations by adjusting the spark.locality.* properties in SparkConf to increase data locality.
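
The locality tuning mentioned above could look roughly like the following sketch. The property names are standard Spark settings, but the values are hypothetical, since the values actually used in the project are not stated. The helper names apply_locality and as_submit_flags are also invented for this example.

```python
# Hypothetical wait times; the values actually used are not given in the source.
# A longer wait lets the scheduler hold a task for a data-local slot before
# falling back to a less local one.
LOCALITY_SETTINGS = {
    "spark.locality.wait": "10s",       # base wait for each locality level
    "spark.locality.wait.node": "10s",  # wait for a node-local slot
    "spark.locality.wait.rack": "5s",   # wait for a rack-local slot
}


def apply_locality(conf, settings=LOCALITY_SETTINGS):
    """Copy the settings onto any object with a set(key, value) method,
    for example a pyspark.SparkConf instance."""
    for key, value in settings.items():
        conf.set(key, value)
    return conf


def as_submit_flags(settings=LOCALITY_SETTINGS):
    """Render the same settings as spark-submit '--conf key=value' arguments."""
    flags = []
    for key, value in settings.items():
        flags += ["--conf", f"{key}={value}"]
    return flags


if __name__ == "__main__":
    print(" ".join(as_submit_flags()))
```

With PySpark installed, apply_locality(SparkConf()) would set these properties on the configuration before the SparkContext is created. Whether longer waits help depends on the workload, because waiting for a local slot trades scheduling delay against network reads.
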