Apache Spark: Caching and Checkpointing Under the Hood
For an Apache Spark application developer, managing memory is one of the most essential tasks, yet the difference between caching and checkpointing is a frequent source of confusion. Both operations save Spark from having to recompute a resilient distributed dataset (RDD), which is otherwise evaluated lazily, every time it is referenced, but there are key differences between the two.
Caching computes and materializes an RDD in memory while keeping track of its lineage (dependencies). Spark supports several storage levels that let you trade off space against compute cost and specify what happens to the RDD's partitions when memory runs out (for example, spilling to disk or dropping them and recomputing later). Because caching preserves an RDD's lineage, Spark can recompute lost partitions in the event of node failures. Lastly, a cached RDD lives within the context of the running application; once the application terminates, its cached RDDs are deleted as well.
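As a minimal sketch of what this looks like in practice (the RDD and its computation here are purely illustrative), cache() is shorthand for persist(StorageLevel.MEMORY_ONLY), and persist() accepts other storage levels:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

object CachingExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("caching-example")
      .master("local[*]")
      .getOrCreate()
    val sc = spark.sparkContext

    // A hypothetical RDD standing in for a non-trivial computation.
    val expensive = sc.parallelize(1 to 1000000).map(x => x.toLong * x)

    // cache() is shorthand for persist(StorageLevel.MEMORY_ONLY):
    // partitions live in memory and are recomputed from lineage if dropped.
    expensive.cache()

    // Other storage levels trade memory for disk, e.g. spilling partitions
    // that don't fit in memory to local disk instead of dropping them:
    // expensive.persist(StorageLevel.MEMORY_AND_DISK)

    expensive.count() // the first action materializes the cache
    expensive.sum()   // subsequent actions read the cached partitions

    spark.stop()
  }
}
```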
Checkpointing saves an RDD to a reliable storage system (e.g. HDFS, S3) while forgetting the RDD's lineage completely. Truncating dependencies becomes especially relevant when the RDD's lineage grows long, as it does in iterative algorithms. Checkpointing an RDD is similar to how Hadoop stores intermediate computation values to disk, trading execution latency for ease of recovery from failures. And since a checkpointed RDD lives in an external storage system, it can be reused by other applications.
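A minimal sketch of the checkpointing API follows, with an illustrative local checkpoint directory (in production this would be an HDFS or S3 path). Note that checkpoint() only marks the RDD; the actual write happens when the next action runs:

```scala
import org.apache.spark.sql.SparkSession

object CheckpointExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("checkpoint-example")
      .master("local[*]")
      .getOrCreate()
    val sc = spark.sparkContext

    // The checkpoint directory should live on reliable storage in production;
    // a local path is used here only for illustration.
    sc.setCheckpointDir("/tmp/spark-checkpoints")

    // Build up a long lineage of transformations.
    var rdd = sc.parallelize(1 to 1000)
    for (_ <- 1 to 50) {
      rdd = rdd.map(_ + 1) // each iteration adds a step to the lineage
    }

    // Mark the RDD for checkpointing; the write happens at the next action,
    // after which the lineage above this point is truncated.
    rdd.checkpoint()
    rdd.count()

    println(rdd.toDebugString) // lineage now bottoms out at the checkpoint

    spark.stop()
  }
}
```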
The bigger question, then, is how caching and checkpointing interact. Let's trace through the compute path of an RDD to find out.
At the core of Spark's engine is the DAGScheduler, which breaks down a job (generated by a Spark action) into a DAG of stages. Each of these shuffle or result stages is further broken down into individual tasks that run on a partition of an RDD. An RDD's iterator method is the entry point for a task to access the underlying data partition. We can see from this method that if the storage level is set, indicating that the RDD may be cached, the task first attempts to getOrCompute the partition from the block manager. If the block manager does not have the partition, it falls back to computeOrReadCheckpoint. As the name suggests, computeOrReadCheckpoint returns the checkpointed values if they exist; only if they do not is the data partition actually computed.
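A simplified sketch of this decision logic, paraphrasing the relevant methods on Spark's RDD class (the exact code differs across Spark versions):

```scala
// Paraphrased from org.apache.spark.rdd.RDD; simplified for illustration,
// not an exact copy of any particular Spark version.
final def iterator(split: Partition, context: TaskContext): Iterator[T] = {
  if (storageLevel != StorageLevel.NONE) {
    // A storage level is set, so the RDD may be cached: ask the block
    // manager for the partition, computing and caching it on a miss.
    getOrCompute(split, context)
  } else {
    computeOrReadCheckpoint(split, context)
  }
}

private[spark] def computeOrReadCheckpoint(
    split: Partition, context: TaskContext): Iterator[T] = {
  if (isCheckpointedAndMaterialized) {
    // The checkpoint data stands in as this RDD's only parent, so reading
    // it replaces recomputing the original lineage.
    firstParent[T].iterator(split, context)
  } else {
    // No cache hit and no checkpoint: compute the partition from scratch.
    compute(split, context)
  }
}
```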
All that being said, it is up to you to decide which of the two matches your use case at different points in your job. Reading and writing a checkpointed RDD takes longer simply because it has to be persisted to an external storage system, but Spark worker failures need not result in a recomputation (assuming the data is intact in the external storage system). Cached RDDs, on the other hand, do not permanently take up storage space, but recomputation is necessary on worker failure. In general, how long an RDD's lineage takes to recompute is a good indicator of which to use: cheap lineages favor caching, while long, expensive ones favor checkpointing.
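The two also combine usefully. Because the checkpoint write runs a second job over the RDD, Spark's API documentation recommends persisting an RDD before checkpointing it, so that the second pass reads cached partitions instead of recomputing the lineage. A minimal sketch of the combined pattern (the transformation and paths here are illustrative):

```scala
import org.apache.spark.sql.SparkSession

object CacheThenCheckpoint {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("cache-then-checkpoint")
      .master("local[*]")
      .getOrCreate()
    val sc = spark.sparkContext
    sc.setCheckpointDir("/tmp/spark-checkpoints") // illustrative local path

    // Hypothetical stand-in for a costly transformation.
    val result = sc.parallelize(1 to 100000).map(x => x.toLong * x)

    // Persist first: checkpointing launches a second job to write the files,
    // and without a cache that job would recompute the whole lineage.
    result.cache()
    result.checkpoint()
    result.count() // materializes the cache and triggers the checkpoint write

    spark.stop()
  }
}
```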