Changing the Output Delimiter in t

Introduction

Hadoop's default output delimiter (the character separating the output key from the value) is a tab ("\t"). This post explains how to change it, either in the job code or from the command line.

 

Output Delimiter Configuration Property

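The output delimiter of a Hadoop job that writes text output can be changed by setting the mapred.textoutputformat.separator configuration property. This property can be set in the job code or on the command line. In Hadoop 2.x and later this name is deprecated in favor of mapreduce.output.textoutputformat.separator, but the old name is still honored.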

Setting delimiter in job class:

//get configuration object (getConf() is available when the job class extends Configured)
Configuration conf = getConf();

//set output delimiter to comma
conf.set("mapred.textoutputformat.separator", ","); 

Setting delimiter from command line:

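# adding the following argument to a Hadoop job command changes the output delimiter to a comma
# generic options such as -D are only parsed if the driver uses ToolRunner or GenericOptionsParser,
# and they must come before the job's own arguments
-D mapred.textoutputformat.separator=","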
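
To make the effect of the property concrete, here is a minimal plain-Java sketch of what a text record writer does with the separator: it looks up the property, falls back to a tab when it is unset, and joins key and value with it. This is an illustration only, not Hadoop's code. The class DelimiterSketch and its formatRecord method are hypothetical names, and java.util.Properties stands in for Hadoop's Configuration so the sketch runs without Hadoop on the classpath.

```java
import java.util.Properties;

// Plain-Java stand-in, NOT Hadoop code: Properties plays the role of
// Hadoop's Configuration object.
class DelimiterSketch {
    // Property name taken from this post.
    static final String SEPARATOR_KEY = "mapred.textoutputformat.separator";

    // Joins key and value with the configured separator, defaulting to tab.
    static String formatRecord(Properties conf, String key, String value) {
        String separator = conf.getProperty(SEPARATOR_KEY, "\t");
        return key + separator + value;
    }

    public static void main(String[] args) {
        Properties conf = new Properties();

        // Property unset: key and value are separated by a tab.
        System.out.println(formatRecord(conf, "with", "56"));

        // Property set to a comma: key and value are separated by a comma.
        conf.setProperty(SEPARATOR_KEY, ",");
        System.out.println(formatRecord(conf, "with", "56"));
    }
}
```

The separator is an arbitrary string rather than a single character, so a multi-character value such as " | " would be joined in the same way in this sketch.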

 

Example

We will use the word count example that comes packaged with Hadoop to show how to set a custom output delimiter from the command line.

Running word count with default delimiter:

# hadoop command
hadoop jar hadoop-mapreduce-examples-2.8.1.jar wordcount /input-dir /output-dir

# cat output
hadoop fs -cat /output-dir/* 

with	56
within	4
without	1
work	12

 

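Running word count with custom delimiter:

# remove the output of the previous run; the job fails if the output directory already exists
hadoop fs -rm -r /output-dir

# hadoop command
hadoop jar hadoop-mapreduce-examples-2.8.1.jar wordcount -D mapred.textoutputformat.separator="," /input-dir /output-dir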
 
# cat output
hadoop fs -cat /output-dir/* 
 
with,56
within,4
without,1
work,12
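
One thing to keep in mind when choosing a delimiter is that it may also occur inside the keys. The word count example splits its input on whitespace only, so a key can itself end in a comma. The sketch below shows one way a downstream reader might cope with that, by splitting each line on the last occurrence of the delimiter. The class OutputLineParser and its parse method are hypothetical names used for illustration and are not part of Hadoop.

```java
import java.util.AbstractMap;
import java.util.Map;

// Hypothetical helper for reading word count output lines back in.
class OutputLineParser {
    // Splits a line on the LAST occurrence of the delimiter, so a key that
    // itself contains the delimiter is kept intact.
    static Map.Entry<String, Long> parse(String line, String delimiter) {
        int idx = line.lastIndexOf(delimiter);
        if (idx < 0) {
            throw new IllegalArgumentException("no delimiter in line: " + line);
        }
        String word = line.substring(0, idx);
        long count = Long.parseLong(line.substring(idx + delimiter.length()).trim());
        return new AbstractMap.SimpleEntry<String, Long>(word, count);
    }

    public static void main(String[] args) {
        Map.Entry<String, Long> plain = parse("with,56", ",");
        System.out.println(plain.getKey() + " -> " + plain.getValue());

        // A key that ends in a comma still parses, because the split
        // happens at the last comma.
        Map.Entry<String, Long> tricky = parse("end,,3", ",");
        System.out.println(tricky.getKey() + " -> " + tricky.getValue());
    }
}
```

Splitting on the last delimiter works here only because the value is a plain number. If both keys and values can contain the delimiter, a character that cannot appear in the data is a safer choice.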
