Changing the Output Delimiter in t


Hadoop’s default output delimiter (the character separating the output key from the value) is a tab (“\t”). This post explains how to change that default, either in code or from the command line.


Output Delimiter Configuration Property

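The output delimiter of a Hadoop job can be changed by setting the mapred.textoutputformat.separator configuration property. This property can be set in the job code or from the command line. In Hadoop 2.x this name is deprecated in favor of mapreduce.output.textoutputformat.separator, but the old name is still honored.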

Setting delimiter in job class:

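// get the configuration object (getConf() is available when the driver extends Configured)
Configuration conf = getConf();

// set the output delimiter to a comma; do this before the Job is created from conf
conf.set("mapred.textoutputformat.separator", ",");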
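
To make the effect of that setting concrete, here is a minimal standalone sketch of the lookup a text output format performs: read the separator property, fall back to a tab when it is unset, and write the key, the separator, and the value. It deliberately uses java.util.Properties as a stand-in for Hadoop's Configuration so it runs without Hadoop on the classpath; the class name SeparatorLookupSketch and its methods are hypothetical and are not part of Hadoop.

```java
import java.util.Properties;

// Illustrative stand-in only: mimics the "property with a tab default" lookup,
// not Hadoop's actual TextOutputFormat implementation.
class SeparatorLookupSketch {
    static final String KEY = "mapred.textoutputformat.separator";

    // Return the configured separator, or a tab when the property is unset.
    static String separator(Properties conf) {
        return conf.getProperty(KEY, "\t");
    }

    // Build one output line: key, separator, value.
    static String formatRecord(Properties conf, String key, String value) {
        return key + separator(conf) + value;
    }

    public static void main(String[] args) {
        Properties conf = new Properties();
        // Property unset: the key and value are separated by a tab.
        System.out.println(formatRecord(conf, "with", "56"));

        // Property set to a comma: the same record is written as with,56
        conf.setProperty(KEY, ",");
        System.out.println(formatRecord(conf, "with", "56"));
    }
}
```

In a real job the same idea applies to the Configuration object shown above: the value must be in place before the output format reads it, which is why it is set before the job is submitted.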

Setting delimiter from command line:

# adding the following args to a Hadoop job command will change output delimiter to comma
-D mapred.textoutputformat.separator=","
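
The -D option only takes effect when the job driver parses Hadoop's generic options (for example by running through ToolRunner or GenericOptionsParser), and it has to appear before the job's own arguments. The following is a simplified, standalone sketch of that idea, not Hadoop's parser: it collects "-D key=value" pairs into a map and returns the remaining arguments. The class name DashDOptionSketch is hypothetical. Note that the shell strips the quotes, so "," reaches the program as a bare comma.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Simplified illustration of "-D key=value" handling; Hadoop's real
// GenericOptionsParser supports more options and stricter validation.
class DashDOptionSketch {

    // Stores each "-D key=value" pair in props and returns all other arguments in order.
    static List<String> parse(String[] args, Map<String, String> props) {
        List<String> remaining = new ArrayList<>();
        for (int i = 0; i < args.length; i++) {
            if (args[i].equals("-D") && i + 1 < args.length) {
                String pair = args[++i];          // consume the token after -D
                int eq = pair.indexOf('=');
                if (eq > 0) {
                    props.put(pair.substring(0, eq), pair.substring(eq + 1));
                }
                // In this sketch a token without "key=" is silently dropped.
            } else {
                remaining.add(args[i]);
            }
        }
        return remaining;
    }

    public static void main(String[] args) {
        // What the program receives after the shell has removed the quotes around the comma.
        String[] demo = {"-D", "mapred.textoutputformat.separator=,", "/input-dir", "/output-dir"};
        Map<String, String> props = new LinkedHashMap<>();
        List<String> rest = parse(demo, props);
        System.out.println("properties: " + props);
        System.out.println("job arguments: " + rest);
    }
}
```

The job itself only ever sees the remaining arguments (the input and output directories here), while the separator travels in the configuration.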



We will use the word count example that ships with Hadoop to show how to set a custom output delimiter from the command line.

Running word count with default delimiter:

# hadoop command
hadoop jar hadoop-mapreduce-examples-2.8.1.jar wordcount /input-dir /output-dir

# cat output
hadoop fs -cat /output-dir/* 

with	56
within	4
without	1
work	12


Running word count with custom delimiter:

# hadoop command
hadoop jar hadoop-mapreduce-examples-2.8.1.jar wordcount -D mapred.textoutputformat.separator="," /input-dir /output-dir

# cat output
hadoop fs -cat /output-dir/*


NPN Training

NPNTraining Trainer.