cache() and persist() are two methods Spark provides to improve the performance of computations. They save intermediate results so they can be reused in subsequent stages instead of being recomputed. These interim results (RDDs) are kept in
- memory (the default), or
- solid storage such as disk, optionally replicated.
RDDs can be cached using the cache operation and persisted using the persist operation.
So what is the difference? The difference between cache and persist is purely syntactic:
- cache is a synonym for persist(MEMORY_ONLY), i.e. cache is merely persist with the default storage level MEMORY_ONLY, while persist lets you choose among the following options:
- MEMORY_ONLY (default level): This option stores the RDD in available cluster memory as deserialized Java objects. If there is not enough memory, some partitions are not cached and are recalculated on the fly each time they are needed.
- MEMORY_AND_DISK: This option stores the RDD as deserialized Java objects in memory. Partitions that do not fit in cluster memory are stored on disk and read from there when needed.
- MEMORY_ONLY_SER: This option stores the RDD as serialized Java objects (one byte array per partition). This is more CPU-intensive to read but more space-efficient, so it saves memory. Partitions that do not fit are recalculated on the fly as needed.
- MEMORY_AND_DISK_SER: This option is the same as MEMORY_ONLY_SER, except that partitions that do not fit in memory are spilled to disk instead of being recomputed.
- DISK_ONLY: This option stores the RDD partitions only on disk.
- MEMORY_ONLY_2, MEMORY_AND_DISK_2: Same as the corresponding levels above, but each partition is replicated on two cluster nodes.
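The equivalence of cache and persist, and the explicit choice of a storage level, can be sketched in a short Scala example (a minimal local sketch; the object name and data are illustrative):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

object CacheVsPersist {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("CacheVsPersist")
      .master("local[*]")
      .getOrCreate()

    val rdd = spark.sparkContext.parallelize(1 to 1000000)

    // cache() is shorthand for persist(StorageLevel.MEMORY_ONLY)
    val cached = rdd.map(_ * 2).cache()

    // persist() lets you pick any storage level explicitly,
    // here spilling serialized partitions to disk when memory is short
    val spilled = rdd.map(_ * 2).persist(StorageLevel.MEMORY_AND_DISK_SER)

    // The first action materializes the data at the chosen level;
    // later actions reuse the stored partitions instead of recomputing them.
    cached.count()
    spilled.count()

    // unpersist() frees the stored partitions when they are no longer needed
    cached.unpersist()
    spilled.unpersist()

    spark.stop()
  }
}
```

Note that persisting is lazy: nothing is stored until the first action (here, count()) runs, after which subsequent actions on the same RDD read from the chosen storage level.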