Spark textFile() vs wholeTextFiles()

textFile()

def textFile(path: String, minPartitions: Int = defaultMinPartitions): RDD[String]

  • Reads a text file from HDFS, a local file system (available on all nodes), or any Hadoop-supported file system URI, and returns it as an RDD of Strings.
  • For example, sc.textFile("/home/hdadmin/wc-data.txt") creates an RDD in which each individual line of the file is an element.
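The behaviour above can be sketched as follows. This is a minimal, illustrative example assuming a local Spark installation; the application name, master URL, and file path are placeholders, not part of the original post.

```scala
import org.apache.spark.sql.SparkSession

// Minimal sketch: run Spark locally on all cores (assumed setup).
val spark = SparkSession.builder()
  .appName("textFileExample")
  .master("local[*]")
  .getOrCreate()
val sc = spark.sparkContext

// Each element of the resulting RDD[String] is one line of the file.
val lines = sc.textFile("/home/hdadmin/wc-data.txt")

lines.take(3).foreach(println)  // print the first three lines
println(lines.count())          // total number of lines in the file
```

Because each line becomes a separate element, this layout suits line-oriented processing such as word count, where records are split and aggregated independently.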

wholeTextFiles()

def wholeTextFiles(path: String, minPartitions: Int = defaultMinPartitions): RDD[(String, String)]

  • Reads a directory of text files from HDFS, a local file system (available on all nodes), or any Hadoop-supported file system URI.
  • Rather than returning a basic RDD, wholeTextFiles() returns a pair RDD.

For example, if you have a few files in a directory, the wholeTextFiles() method creates a pair RDD in which the key is the file path (including the filename) and the value is the entire contents of that file as a single string.
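A sketch of that usage follows. As before, this assumes a local Spark installation, and the directory path is illustrative; wholeTextFiles() is best suited to directories of many small files, since each whole file must fit in memory as one value.

```scala
import org.apache.spark.sql.SparkSession

// Minimal sketch: local Spark session (assumed setup).
val spark = SparkSession.builder()
  .appName("wholeTextFilesExample")
  .master("local[*]")
  .getOrCreate()
val sc = spark.sparkContext

// RDD[(String, String)]: (file path as key, whole file contents as value).
val files = sc.wholeTextFiles("/home/hdadmin/wc-dir")

files.collect().foreach { case (path, content) =>
  println(s"$path -> ${content.length} characters")
}
```

The key difference from textFile() is the unit of a record: a line there, an entire file here, which is why wholeTextFiles() yields (key, value) pairs instead of plain strings.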

Join us for more real time use cases and project work.

Naveen P.N

12+ years of experience in IT, with vast experience executing complex projects using Java, Microservices, Big Data, and Cloud Platforms. I founded NPN Training Pvt Ltd, an India-based startup, to provide high-quality training for IT professionals. I have trained more than 3000 IT professionals and helped them succeed in their careers across different technologies. I am very passionate about technology and training. I have spent 12 years at Siemens, Yahoo, Amazon, and Cisco, developing and managing technology.