Compressing Intermediary Results of Map Output in Hadoop

It is generally recommended to compress intermediate map output. Disk I/O and network transfer are among the biggest bottlenecks in Hadoop, and compression reduces the cost of both.

Map output is written to local disk and then transferred (shuffled) across the network to the reducer nodes. At this point in a MapReduce job we are no longer concerned with the data being splittable, so a non-splittable compression type works fine. One thing to consider is that a higher compression ratio generally costs more CPU time, so a fast compressor like Snappy or LZO is usually a good choice for intermediate map output. This way you gain performance simply by reducing the amount of data written to disk and sent over the network. In fact, Amazon EMR enables intermediate compression with the Snappy codec by default.
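The speed versus size trade-off described above can be seen with a small, self-contained sketch. It uses the JDK's built-in Deflater at its fastest and its strongest levels as a stand-in, since Snappy and LZO are not part of the standard library. The class name, the sample records and the record count are made up for illustration, and the actual sizes and timings will vary by machine and data.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

// Illustrative only: Deflater stands in for a Hadoop codec such as Snappy.
public class CompressionTradeoffSketch {

    // Build fake "key<TAB>value" records similar to word-count map output.
    public static byte[] sampleMapOutput(int records) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < records; i++) {
            sb.append("word").append(i % 100).append('\t').append(1).append('\n');
        }
        return sb.toString().getBytes(StandardCharsets.UTF_8);
    }

    // Compress with a Deflater level (1 = fastest, 9 = smallest output).
    public static byte[] compress(byte[] input, int level) {
        Deflater deflater = new Deflater(level);
        deflater.setInput(input);
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[8192];
        while (!deflater.finished()) {
            int n = deflater.deflate(buffer);
            out.write(buffer, 0, n);
        }
        deflater.end();
        return out.toByteArray();
    }

    // Decompress back to the original bytes; originalLength must be known.
    public static byte[] decompress(byte[] compressed, int originalLength) {
        Inflater inflater = new Inflater();
        inflater.setInput(compressed);
        byte[] result = new byte[originalLength];
        int offset = 0;
        try {
            while (!inflater.finished() && offset < originalLength) {
                int n = inflater.inflate(result, offset, originalLength - offset);
                if (n == 0 && (inflater.needsInput() || inflater.needsDictionary())) {
                    break; // truncated or unexpected input, stop instead of spinning
                }
                offset += n;
            }
        } catch (DataFormatException e) {
            throw new IllegalStateException("corrupt compressed data", e);
        } finally {
            inflater.end();
        }
        return result;
    }

    public static void main(String[] args) {
        byte[] input = sampleMapOutput(200_000);
        for (int level : new int[] {Deflater.BEST_SPEED, Deflater.BEST_COMPRESSION}) {
            long start = System.nanoTime();
            byte[] compressed = compress(input, level);
            long micros = (System.nanoTime() - start) / 1_000;
            System.out.printf("level %d: %d -> %d bytes in %d us%n",
                    level, input.length, compressed.length, micros);
        }
    }
}
```

For intermediate data, the faster setting is usually the more attractive end of this trade-off, because the map output is read once by the reducers and then discarded.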

To enable intermediate data compression, adjust the parameters you pass to your MapReduce job. Set the property that turns on map output compression to true, and set the map output codec property to the compression codec class of your choice.

Here is an example of enabling intermediate (Snappy) compression in our Java MapReduce job class:

//turn on intermediate (map output) compression
conf.set("", "true");
conf.set("", "");
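As a hedged sketch of the two settings involved: the property names and the codec class below are assumptions based on the Hadoop 2.x and later naming (mapreduce.map.output.compress and mapreduce.map.output.compress.codec; Hadoop 1.x used mapred.compress.map.output and mapred.map.output.compression.codec), so check them against the documentation for your Hadoop version. The class and method names are hypothetical, and the block deliberately avoids Hadoop imports so it runs on a plain JDK.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper: collects the two intermediate-compression settings.
public class IntermediateCompressionSettings {

    // Assumed Hadoop 2.x+ property names; verify for your Hadoop version.
    public static final String COMPRESS_KEY = "mapreduce.map.output.compress";
    public static final String CODEC_KEY = "mapreduce.map.output.compress.codec";

    // Fully qualified class name of Hadoop's Snappy codec.
    public static final String SNAPPY_CODEC = "org.apache.hadoop.io.compress.SnappyCodec";

    // Returns the settings in insertion order: enable flag first, then codec.
    public static Map<String, String> snappySettings() {
        Map<String, String> settings = new LinkedHashMap<>();
        settings.put(COMPRESS_KEY, "true");
        settings.put(CODEC_KEY, SNAPPY_CODEC);
        return settings;
    }

    public static void main(String[] args) {
        // Print each setting as key=value.
        snappySettings().forEach((key, value) -> System.out.println(key + "=" + value));
    }
}
```

With Hadoop on the classpath, the same map can be applied to a job's Configuration object with snappySettings().forEach(conf::set), which amounts to one conf.set(key, value) call per setting.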

Here is an example of enabling intermediate (Snappy) compression when running the job from the command line:

hadoop jar bigdatums-hadoop-1.0-SNAPSHOT.jar com.bigdatums.hadoop.mapreduce.ToolMapReduceExample \
-DinputLoc=/dataIn/ -DoutputLoc=/dataOut/ \
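On the command line, settings like these are normally passed as -Dkey=value generic options, in the same style as the -DinputLoc and -DoutputLoc options above. The following is a small sketch of how such options could be rendered from a map of properties. The class name is hypothetical, and the property names are the assumed Hadoop 2.x and later names rather than something confirmed by the command above. Generic options are only picked up when the driver is launched through ToolRunner or GenericOptionsParser.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper: renders configuration properties as -Dkey=value options.
public class GenericOptionSketch {

    // Turn each property into one "-Dkey=value" token, preserving map order.
    public static List<String> toDOptions(Map<String, String> properties) {
        List<String> options = new ArrayList<>();
        for (Map.Entry<String, String> entry : properties.entrySet()) {
            options.add("-D" + entry.getKey() + "=" + entry.getValue());
        }
        return options;
    }

    public static void main(String[] args) {
        // Assumed property names (Hadoop 2.x+); verify for your version.
        Map<String, String> properties = new LinkedHashMap<>();
        properties.put("mapreduce.map.output.compress", "true");
        properties.put("mapreduce.map.output.compress.codec",
                "org.apache.hadoop.io.compress.SnappyCodec");

        // Print one option per line, ready to paste after the main class name.
        for (String option : toDOptions(properties)) {
            System.out.println(option);
        }
    }
}
```

Generic options have to appear after the main class name and before any application-specific arguments for Hadoop to parse them.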



NPN Training
