In this post, we will look at the following pseudo-set transformations (RDDs are not true sets, since they can contain duplicates):

  1. distinct()
  2. union()
  3. intersection()
  4. subtract()
  5. cartesian()

Distinct

  • distinct(): Returns the distinct elements of the RDD.
  • Warning: Involves shuffling data over the network.

$ val rdd1 = sc.parallelize(List("lion", "tiger", "tiger", "peacock", "horse"))
rdd1: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[33] at parallelize at <console>:21

$ val rdd2 = sc.parallelize(List("lion", "tiger"))
rdd2: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[34] at parallelize at <console>:21

$ rdd1.distinct().collect()
res20: Array[String] = Array(peacock, lion, horse, tiger)
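
To see why distinct() needs a shuffle, it can help to write it out by hand. The following is a minimal sketch, assuming a spark-shell session where `sc` is the predefined SparkContext; the names `animals` and `manualDistinct` are illustrative only. It is roughly what distinct() does internally, though the exact implementation may differ between Spark versions.

```scala
// Assumes a spark-shell session, where `sc` is the predefined SparkContext.
val animals = sc.parallelize(List("lion", "tiger", "tiger", "peacock", "horse"))

// Pair each element with a dummy value, keep one value per key, then drop the dummy.
// reduceByKey is the step that moves equal elements into the same partition (the shuffle).
val manualDistinct = animals
  .map(x => (x, null))
  .reduceByKey((x, _) => x)
  .map(_._1)

// Sorting is only for a stable display order; distinct() itself guarantees no order.
manualDistinct.collect().sorted   // Array(horse, lion, peacock, tiger)
```

distinct() also accepts a partition count, for example animals.distinct(4), which controls how many partitions the shuffled result has.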

Union

  • union(): Returns an RDD containing all the elements from both RDDs.
  • Note: Unlike the mathematical union, duplicates are not removed. Both RDDs must also have the same element type.
 
$ rdd1.union(rdd2).collect() 
res22: Array[String] = Array(lion, tiger, tiger, peacock, horse, lion, tiger) 
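
If a true set union is needed, one option is to chain distinct() after union(). This is a small sketch, assuming a spark-shell session with `sc` available; the RDD definitions are repeated so the snippet stands alone, and `setUnion` is an illustrative name.

```scala
// Assumes a spark-shell session, where `sc` is the predefined SparkContext.
val rdd1 = sc.parallelize(List("lion", "tiger", "tiger", "peacock", "horse"))
val rdd2 = sc.parallelize(List("lion", "tiger"))

// union() keeps every element from both sides; distinct() then removes the duplicates.
val setUnion = rdd1.union(rdd2).distinct()

// Sorting is only for a stable display order.
setUnion.collect().sorted   // Array(horse, lion, peacock, tiger)
```

Note that this reintroduces the shuffle cost of distinct(), which plain union() avoids.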

Intersection

  • intersection(): Returns the elements that are common to both RDDs.
  • Also removes duplicates.
  • Warning: Requires a shuffle to compare elements across partitions, so it performs much worse than union().

$ rdd1.intersection(rdd2).collect()
res24: Array[String] = Array(lion, tiger)
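
Both the duplicate removal and the shuffle become easier to see when intersection() is expressed with cogroup(). The sketch below assumes a spark-shell session with `sc` available; `common` is an illustrative name, and the real implementation may differ in detail between Spark versions.

```scala
// Assumes a spark-shell session, where `sc` is the predefined SparkContext.
val rdd1 = sc.parallelize(List("lion", "tiger", "tiger", "peacock", "horse"))
val rdd2 = sc.parallelize(List("lion", "tiger"))

// cogroup gathers, per key, the values from each side (this is the shuffle).
// A key survives only if it occurs on both sides, and each key appears once,
// which is why duplicates such as the second "tiger" disappear.
val common = rdd1.map(v => (v, null))
  .cogroup(rdd2.map(v => (v, null)))
  .filter { case (_, (leftGroup, rightGroup)) => leftGroup.nonEmpty && rightGroup.nonEmpty }
  .keys

// Sorting is only for a stable display order.
common.collect().sorted   // Array(lion, tiger)
```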

Subtract

  • subtract(): Returns only the elements that are present in the first RDD and not in the second.

$ rdd1.subtract(rdd2).collect()
res26: Array[String] = Array(peacock, horse)
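
Two properties of subtract() are easy to miss: it is not symmetric, and it does not deduplicate the elements it keeps. The following sketch illustrates both, assuming a spark-shell session with `sc` available; `left` and `right` are illustrative RDDs, with "peacock" repeated on purpose.

```scala
// Assumes a spark-shell session, where `sc` is the predefined SparkContext.
val left = sc.parallelize(List("lion", "tiger", "tiger", "peacock", "peacock", "horse"))
val right = sc.parallelize(List("lion", "tiger"))

// Elements of `left` that do not occur in `right`; the repeated "peacock" is kept twice.
val onlyInLeft = left.subtract(right)
onlyInLeft.collect().sorted   // Array(horse, peacock, peacock)

// Swapping the operands gives a different result: everything in `right` also occurs in `left`.
val onlyInRight = right.subtract(left)
onlyInRight.collect()         // Array()
```

Chain distinct() after subtract() if a set-style difference without duplicates is wanted.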

Cartesian

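  • cartesian(): Returns the Cartesian product of two RDDs, that is, every pair (a, b) where a comes from the first RDD and b from the second.
  • Warning: Very expensive for large RDDs, because the result contains (size of first RDD) × (size of second RDD) elements.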

$ rdd1.cartesian(rdd2).collect()
res28: Array[(String, String)] = Array((lion,lion), (lion,tiger), (tiger,lion), (tiger,tiger), (tiger,lion), (tiger,tiger), (peacock,lion), (peacock,tiger), (horse,lion), (horse,tiger))
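
Because the result grows multiplicatively, it can be worth estimating its size before calling cartesian(). This is a minimal sketch, assuming a spark-shell session with `sc` available; `expectedPairs` and `pairs` are illustrative names.

```scala
// Assumes a spark-shell session, where `sc` is the predefined SparkContext.
val rdd1 = sc.parallelize(List("lion", "tiger", "tiger", "peacock", "horse"))
val rdd2 = sc.parallelize(List("lion", "tiger"))

// The product has one pair per combination, so its size is the product of the counts.
val expectedPairs = rdd1.count() * rdd2.count()   // 5 * 2 = 10

val pairs = rdd1.cartesian(rdd2)
pairs.count()   // 10
```

With two RDDs of a million elements each, the same calculation gives 10^12 pairs, which is why the warning above matters.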