Spark Lab 2
Overview
In this lab you’ll use a bike trip dataset consisting of two files. The data is in CSV format, so you’ll first parse the two files and then join the trips with the stations to get the details of each trip’s start and end stations.
The objective is to understand the lineage of RDDs, how input is partitioned, and finally the impact of partitioning when joining two RDDs.
import org.apache.spark._
val input1 = sc.textFile("/user/student/data/trips/*")
val header1 = input1.first // capture the header row so it can be filtered out
val trips = input1.
  filter(_ != header1). // drop the header line
  map(_.split(","))     // split each CSV line into an Array[String] of fields
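Before moving on, it’s worth checking how Spark partitioned the input. A minimal sketch of what to look at in the shell (the exact partition count depends on how the files are stored, e.g. on the HDFS block size):

trips.partitions.size // how many partitions the input was split into
trips.toDebugString   // prints the lineage: a MapPartitionsRDD chaining back to a HadoopRDD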
val input2 = sc.textFile("/user/student/data/stations/*")
val header2 = input2.first // capture the header row so it can be filtered out
val stations = input2.
  filter(_ != header2). // drop the header line
  map(_.split(",")).    // split each CSV line into fields
  keyBy(_(0).toInt).    // key each station by its id (field 0)
  partitionBy(new HashPartitioner(trips.partitions.size)) // match the number of trip partitions
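Pre-partitioning stations with the same number of partitions as trips means that, at join time, Spark can reuse stations’ existing partitioner and only shuffle the trips side. A quick check, assuming the shell session above:

stations.partitioner // Some(HashPartitioner) after partitionBy
trips.partitioner    // None: a plain mapped RDD has no partitioner

Since stations feeds two joins below, caching it with stations.cache() is a common optimization to avoid re-parsing the station files for the second job.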
// The station id is field 0; in trips, the start terminal is field 4 and the end terminal is field 7.
val startTrips = stations.join(trips.keyBy(_(4).toInt)) // trips keyed by start terminal
val endTrips = stations.join(trips.keyBy(_(7).toInt))   // trips keyed by end terminal
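The join is where the lineage gets interesting. Printing the joined RDD’s debug string shows its shuffle dependencies; indentation changes in the output mark the stage boundaries (the exact output varies by Spark version):

println(startTrips.toDebugString)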
Now run an action on each of the two joined RDDs and examine the resulting jobs in the Spark web UI. Can you identify the stage boundaries?
startTrips.count()
endTrips.count()
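As an optional experiment beyond the lab steps, you could repeat the join without the partitionBy call and compare the shuffle behavior in the web UI. A sketch, reusing the session above (stationsNoPart is a name introduced here for illustration):

// Same parsing as before, but with no explicit partitioner
val stationsNoPart = input2.
  filter(_ != header2).
  map(_.split(",")).
  keyBy(_(0).toInt)

// With no pre-existing partitioner on either side, the join shuffles both RDDs
stationsNoPart.join(trips.keyBy(_(4).toInt)).count()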