Category: Apache Spark

Spark RDD : groupByKey VS reduceByKey

let’s look at two different ways to compute word counts, one using reduceByKey and the other using groupByKey: val words = Array( “a”,”b”,”c”,”a”,”a”,”b”,”c”,”a”,”a”,”b”,”c”,”a”,”a”,”b”,”c”,”a”,”a”,”b”,”c”,”a”,”a”,”b”,”c”,”a”,”a”,”b”,”c”,”a”,”a”,”b”,”c”,”a”,”a”,”b”,”c”,”a”,”a”,”b”,”c”,”a”,”a”,”b”,”c”,”a”,”a”,”b”,”c”,”a”,”a”,”b”,”c”,”a”,”a”,”b”,”c”,”a”,”a”,”b”,”c”,”a”,”a”,”b”,”c”,”a”,”a”,”b”,”c”,”a”,”a”,”b”,”c”,”a”,”a”,”b”,”c”,”a”,”a”,”b”,”c”,”a”,”a”,”b”,”c”,”a”,”a”,”b”,”c”,”a”); val pairs = sc.parallelize(words).map(line => (line,1)); val wordCountsWithGroup = pairs.groupByKey().map(t => (t._1, t._2.sum)).collect() val wordCountsWithReduce = pairs.reduceByKey(_ + _) .collect()   While both of these functions will produce the correct answer, the reduceByKey example works much better...

Spark textFile() VS wholeTextFile()

textFile() def textFile(path: String, minPartitions: Int = defaultMinPartitions): RDD[String] Read a text file from HDFS, a local file system (available on all nodes), or any Hadoop-supported file system URI, and return it as an RDD of Strings For example sc.textFile(“/home/hdadmin/wc-data.txt”) so it will create RDD in which each individual line an element. wholeTextFile() def wholeTextFiles(path: String, minPartitions: Int = defaultMinPartitions):...