Author: Naveen P.N

12+ years of experience in IT, with vast experience executing complex projects using Java, Microservices, Big Data, and Cloud Platforms. I founded NPN Training Pvt Ltd, an India-based startup, to provide high-quality training for IT professionals. I have trained more than 3,000 IT professionals and helped them succeed in their careers across different technologies. I am very passionate about technology and training. I have spent 12 years at Siemens, Yahoo, Amazon, and Cisco, developing and managing technology.

Spark RDD: groupByKey VS reduceByKey

Let's look at two different ways to compute word counts, one using reduceByKey and the other using groupByKey:

val words = Array(
  "a","b","c","a","a","b","c","a","a","b","c","a","a","b","c","a","a","b","c","a","a","b","c","a","a","b","c","a",
  "a","b","c","a","a","b","c","a","a","b","c","a","a","b","c","a","a","b","c","a","a","b","c","a","a","b","c","a",
  "a","b","c","a","a","b","c","a","a","b","c","a","a","b","c","a","a","b","c","a","a","b","c","a","a","b","c","a"
)
val pairs = sc.parallelize(words).map(word => (word, 1))
val wordCountsWithGroup = pairs.groupByKey().map(t => (t._1, t._2.sum)).collect()
val wordCountsWithReduce = pairs.reduceByKey(_ + _).collect()

While both of these functions will produce the correct answer, the reduceByKey example works much better on a large dataset: Spark knows it can combine output with a common key on each partition before shuffling the data, so far fewer records cross the network. With groupByKey, every (word, 1) pair is shuffled to the reducers, only to be summed there.
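
If you want to try the comparison outside spark-shell, the following is a minimal self-contained sketch. The object name, app name, and local[*] master are arbitrary choices for local experimentation, not part of the original example:

import org.apache.spark.sql.SparkSession

object WordCountComparison {
  def main(args: Array[String]): Unit = {
    // Local session for experimentation; in spark-shell, `sc` already exists.
    val spark = SparkSession.builder()
      .appName("groupByKey-vs-reduceByKey")
      .master("local[*]")
      .getOrCreate()
    val sc = spark.sparkContext

    val pairs = sc.parallelize(Array("a", "b", "c", "a", "a", "b", "c", "a"))
      .map(word => (word, 1))

    // reduceByKey: partial sums are computed inside each partition first,
    // so at most one (key, partialSum) record per key leaves a partition.
    val withReduce = pairs.reduceByKey(_ + _).collect()

    // groupByKey: every (word, 1) pair is shuffled, then summed afterwards.
    val withGroup = pairs.groupByKey().map { case (w, ones) => (w, ones.sum) }.collect()

    println(withReduce.mkString(", "))
    println(withGroup.mkString(", "))
    spark.stop()
  }
}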

Spark textFile() VS wholeTextFiles()

textFile()

def textFile(path: String, minPartitions: Int = defaultMinPartitions): RDD[String]

Reads a text file from HDFS, a local file system (available on all nodes), or any Hadoop-supported file system URI, and returns it as an RDD of Strings. For example, sc.textFile("/home/hdadmin/wc-data.txt") creates an RDD in which each individual line of the file is an element.

wholeTextFiles()

def wholeTextFiles(path: String, minPartitions: Int = defaultMinPartitions): RDD[(String, String)]

Reads a directory of text files from HDFS, a local file system, or any Hadoop-supported file system URI, and returns each file as a single (path, content) pair, so the whole file is one record rather than one record per line.
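
A quick side-by-side sketch in spark-shell makes the difference concrete. The wc-data.txt path reuses the example above, while /home/hdadmin/data-dir is a hypothetical directory of text files:

// textFile(): one RDD element per line.
val lines = sc.textFile("/home/hdadmin/wc-data.txt")
lines.take(3).foreach(println)   // prints the first three lines of the file

// wholeTextFiles(): one RDD element per file, as a (path, content) pair.
val files = sc.wholeTextFiles("/home/hdadmin/data-dir")
files.collect().foreach { case (path, content) =>
  println(s"$path -> ${content.length} characters")
}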