Spark RDD : distinct(), union() & more…


In this post, we will look at the following pseudo-set transformations:

  1. distinct()
  2. union()
  3. intersection()
  4. subtract()
  5. cartesian()

Distinct

  • distinct(): Returns a new RDD containing only the distinct elements of the source RDD.
  • Warning: Involves shuffling data over the network, so it can be expensive on large RDDs.

$ val rdd1 = sc.parallelize(List("lion", "tiger", "tiger", "peacock", "horse"))
rdd1: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[33] at parallelize at <console>:21

$ val rdd2 = sc.parallelize(List("lion", "tiger"))
rdd2: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[34] at parallelize at <console>:21

$ rdd1.distinct().collect()
res20: Array[String] = Array(peacock, lion, horse, tiger)
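
To see why distinct() needs a shuffle, it can help to write the same result by hand. The following is a minimal sketch, not necessarily how your Spark version implements it: each element is keyed, equal keys are merged with reduceByKey(), and the dummy value is dropped. The object name DistinctSketch and the local[2] master are assumptions made for illustration.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object DistinctSketch {
  // Returns (result of distinct(), result of the manual pipeline), both sorted.
  def run(): (Array[String], Array[String]) = {
    // Local master is an assumption for illustration; in spark-shell, `sc` already exists.
    val sc = new SparkContext(
      new SparkConf().setAppName("distinct-sketch").setMaster("local[2]"))
    try {
      val rdd1 = sc.parallelize(List("lion", "tiger", "tiger", "peacock", "horse"))
      // Key each element, merge equal keys (this step shuffles), then drop the dummy value.
      val manual = rdd1.map(x => (x, 1)).reduceByKey((a, _) => a).map(_._1)
      // Sort before comparing: element order after a shuffle is not guaranteed.
      (rdd1.distinct().collect().sorted, manual.collect().sorted)
    } finally sc.stop()
  }

  def main(args: Array[String]): Unit = {
    val (viaDistinct, viaManual) = run()
    println(viaDistinct.mkString(", "))
    println(viaDistinct.sameElements(viaManual))
  }
}
```

Both pipelines should yield the same four animals. The order printed by collect() in the session above differs from the sorted order here because a shuffle does not preserve ordering.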

Union

  • union(): Returns an RDD containing all elements from both RDDs.
  • Note: Unlike a mathematical union, duplicates are not removed. Both RDDs must also have the same element type.
 
$ rdd1.union(rdd2).collect() 
res22: Array[String] = Array(lion, tiger, tiger, peacock, horse, lion, tiger) 
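
If a true set union is wanted, one option is to chain distinct() after union(), at the cost of a shuffle. A minimal sketch, assuming a local master and the hypothetical object name SetUnionSketch:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object SetUnionSketch {
  // Returns the de-duplicated union of the two sample RDDs, sorted for stable output.
  def run(): Array[String] = {
    // Local master is an assumption for illustration; in spark-shell, `sc` already exists.
    val sc = new SparkContext(
      new SparkConf().setAppName("set-union-sketch").setMaster("local[2]"))
    try {
      val rdd1 = sc.parallelize(List("lion", "tiger", "tiger", "peacock", "horse"))
      val rdd2 = sc.parallelize(List("lion", "tiger"))
      // union() keeps all 7 elements; distinct() then removes the repeats.
      rdd1.union(rdd2).distinct().collect().sorted
    } finally sc.stop()
  }

  def main(args: Array[String]): Unit =
    println(run().mkString(", ")) // horse, lion, peacock, tiger
}
```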

Intersection

  • intersection(): Returns the elements common to both RDDs.
  • Duplicates are also removed from the result.
  • Warning: Involves a shuffle to match elements across both RDDs, so it performs considerably worse than union().

$ rdd1.intersection(rdd2).collect()
res24: Array[String] = Array(lion, tiger)
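
The de-duplication holds even when an element is repeated in both inputs. A small sketch, assuming a local master; the object name IntersectionSketch and the RDD named repeated are hypothetical:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object IntersectionSketch {
  // Returns the intersection of two RDDs that both contain "tiger" twice.
  def run(): Array[String] = {
    // Local master is an assumption for illustration; in spark-shell, `sc` already exists.
    val sc = new SparkContext(
      new SparkConf().setAppName("intersection-sketch").setMaster("local[2]"))
    try {
      val rdd1 = sc.parallelize(List("lion", "tiger", "tiger", "peacock", "horse"))
      val repeated = sc.parallelize(List("tiger", "tiger", "lion"))
      // "tiger" occurs twice on each side but appears only once in the result.
      rdd1.intersection(repeated).collect().sorted
    } finally sc.stop()
  }

  def main(args: Array[String]): Unit =
    println(run().mkString(", ")) // lion, tiger
}
```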

Subtract

  • subtract(): Returns only the elements that are present in the first RDD but not in the second.

$ rdd1.subtract(rdd2).collect()
res26: Array[String] = Array(peacock, horse)
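
Unlike intersection(), subtract() does not de-duplicate what it keeps from the first RDD. The sketch below removes only "lion", so both copies of "tiger" survive. The object name SubtractSketch, the RDD named onlyLion, and the local master are assumptions for illustration.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object SubtractSketch {
  // Returns rdd1 minus a one-element RDD, sorted for stable output.
  def run(): Array[String] = {
    // Local master is an assumption for illustration; in spark-shell, `sc` already exists.
    val sc = new SparkContext(
      new SparkConf().setAppName("subtract-sketch").setMaster("local[2]"))
    try {
      val rdd1 = sc.parallelize(List("lion", "tiger", "tiger", "peacock", "horse"))
      val onlyLion = sc.parallelize(List("lion"))
      // Every element of rdd1 whose value is not in onlyLion is kept, repeats included.
      rdd1.subtract(onlyLion).collect().sorted
    } finally sc.stop()
  }

  def main(args: Array[String]): Unit =
    println(run().mkString(", ")) // horse, peacock, tiger, tiger
}
```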

Cartesian

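  • cartesian(): Returns the Cartesian product of two RDDs, i.e. every pair (a, b) where a comes from the first RDD and b from the second.
  • Warning: The result has (size of first RDD) × (size of second RDD) elements, so it is very expensive for large RDDs.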

$ rdd1.cartesian(rdd2).collect()
res28: Array[(String, String)] = Array((lion,lion), (lion,tiger), (tiger,lion), (tiger,tiger), (tiger,lion), (tiger,tiger), (peacock,lion), (peacock,tiger), (horse,lion), (horse,tiger))
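
The cost of cartesian() follows directly from the size of its output. A quick sketch that compares the pair count with the product of the input counts, assuming a local master and the hypothetical object name CartesianSketch:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object CartesianSketch {
  // Returns (number of pairs produced, product of the two input sizes).
  def run(): (Long, Long) = {
    // Local master is an assumption for illustration; in spark-shell, `sc` already exists.
    val sc = new SparkContext(
      new SparkConf().setAppName("cartesian-sketch").setMaster("local[2]"))
    try {
      val rdd1 = sc.parallelize(List("lion", "tiger", "tiger", "peacock", "horse"))
      val rdd2 = sc.parallelize(List("lion", "tiger"))
      // 5 elements x 2 elements gives 10 pairs; duplicates in rdd1 are paired too.
      (rdd1.cartesian(rdd2).count(), rdd1.count() * rdd2.count())
    } finally sc.stop()
  }

  def main(args: Array[String]): Unit = {
    val (pairs, expected) = run()
    println(s"pairs = $pairs, expected = $expected") // pairs = 10, expected = 10
  }
}
```

With two RDDs of a million elements each, the same calculation gives 10^12 pairs, which is why this transformation is usually reserved for small inputs.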

Naveen P.N

I have 12+ years of experience in IT, with extensive experience executing complex projects using Java, microservices, Big Data and cloud platforms. I founded NPN Training Pvt Ltd, an India-based startup, to provide high-quality training for IT professionals. I have trained more than 3,000 IT professionals and helped them succeed in their careers across different technologies. I am very passionate about technology and training. I have spent 12 years at Siemens, Yahoo, Amazon and Cisco, developing and managing technology.