repartition() vs coalesce() in Spark

Spark splits data into partitions and computation is done in parallel for each partition. It is very important to understand how data is partitioned and when you need to manually modify the partitioning to run spark application efficiently.

In Spark RDD API there are 2 methods available to increase or decrease the number of partitions.

  1. repartition() method and
  2. coalesce() method
val data = 1 to 15

val numOfPartitions = 5

val rdd = spark.sparkContext.parallelize(data ,  numOfPartitions)

rdd.getNumPartitions 

5

rdd.saveAsTextFile("C:/npntraining/output_rdd")

In the above program we are creating an rdd with 5 partitions and then we are saving an rdd by invoking saveAsTextFile(str:String) method. If you open the output_rdd for each partition one output file will be created

part-00000 :

1 2 3

part-00001 : 

4 5 6

part-00002 : 

7 8 9

part-00003 : 

10 11 12

part-00004 : 

13 14 15

coalesce() method

  • coalesce() uses existing partitions to minimize the amount of data that’s shuffled
  • coalesce results in partitions with different amounts of data (sometimes partitions that have much different sizes)
val coalesceRDD = rdd.coalesceRDD(3)

coalesceRDD.saveAsTextFile("C:/npntraining/coalesce-output");


part-00000 :

1 2 3

part-00001 : 

4 5 6 7 8 9

part-00002 : 

10 11 12 13 14 15

If you analyze the output the data from partition part-00002 is merged with part-00001 and data from partition-0004 is merged with part-00002 hence minimize the amount of data that’s shuffled but results in partitions with different amounts of data [Important]

One difference I know is that with repartition() the number of partitions can be increased/decreased, but with coalesce() the number of partitions can only be decreased.

repartition() method

  • The repartition method can be used to either increase or decrease the number of partitions in a RDD.
val repartitionRDD = rdd.repartition(3)

repartition.saveAsTextFile("C:/npntraining/repartition-output");

part-00000 :

3 6 8 10 13

part-00001 : 

1 4 9 11 14

part-00002 : 

2 5 7 12 15

If you analyze the entire data is shuffled i.e but data is equally partitioned across the partitions [Important]

Summary Of Difference

coalesce() repartition()
Used to reduce the number of partitions Used to reduce or decrease the number of partitions.
Tries to minimize data movement by avoiding network shuffle. A network shuffle will be triggered which can increase data movement.
Creates unequal sized partitions Creates equal sized partitions

 


About the course

The Big Data Masters Program is designed to empower working professionals to develop relevant competencies and accelerate their career progression in Big Data technologies through complete Hands-on training.

This program comes with a portfolio of industry-relevant POC’s, Use cases and project work.

Learn by Building Real Time Project

big-data-masters-program