Author: NPN Training

NPNTraining Trainer.

Aggregation on Text Fields in the MapReduce Example

Introduction

The following is a simple MapReduce example that differs slightly from the standard "Word Count" example: it takes tab-delimited text and counts the occurrences of values in a particular field. More details about implementing it are described below.

package com.npntraining.hadoop.mapreduce;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;
import java.io.IOException;
import java.util.Iterator;...
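The excerpt above stops before the mapper and reducer, so the following is only a plain-Java sketch of the counting logic it describes, not the post's Hadoop code. The class name FieldCountSketch and the helpers extractField and countField are hypothetical, and the sketch assumes the field is selected by a zero-based index.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java stand-in for the map and reduce phases; no Hadoop classes are used.
class FieldCountSketch {

    // "Map" step: pull one field out of a tab-delimited line.
    // Returns null when the line has no field at that index.
    static String extractField(String line, int fieldIndex) {
        String[] fields = line.split("\t", -1); // -1 keeps trailing empty fields
        if (fieldIndex < 0 || fieldIndex >= fields.length) {
            return null;
        }
        return fields[fieldIndex];
    }

    // "Reduce" step: add up one occurrence per extracted value.
    static Map<String, Integer> countField(List<String> lines, int fieldIndex) {
        Map<String, Integer> counts = new TreeMap<>(); // sorted keys, like sorted reducer input
        for (String line : lines) {
            String value = extractField(line, fieldIndex);
            if (value != null) {
                counts.merge(value, 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
                "1\tapple\tred",
                "2\tbanana\tyellow",
                "3\tapple\tgreen");
        System.out.println(countField(lines, 1)); // prints {apple=2, banana=1}
    }
}
```

In a Hadoop job the first helper would correspond to the mapper emitting (value, 1) pairs and the second to the reducer summing them per key.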

Changing the Output Delimiter in t

Introduction

Hadoop's default output delimiter (the character separating the output key and value) is a tab ("\t"). This post explains how to change the default Hadoop output delimiter.

Output Delimiter Configuration Property

The output delimiter of a Hadoop job can easily be changed by setting the mapred.textoutputformat.separator configuration property. This property can be set from the code itself or from the...
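To make the effect of the property concrete, here is a small plain-Java model of it. It is not Hadoop's implementation: the class SeparatorSketch and the helper formatRecord are hypothetical, and a java.util.Map stands in for the job configuration. Only the property name and the tab default come from the text above.

```java
import java.util.HashMap;
import java.util.Map;

// Models how a key/value output line changes when the separator property is set.
class SeparatorSketch {

    // Property name as given in the post.
    static final String SEPARATOR_KEY = "mapred.textoutputformat.separator";

    // Hypothetical helper: joins key and value with the configured separator,
    // falling back to a tab when the property is not set.
    static String formatRecord(Map<String, String> conf, String key, String value) {
        String separator = conf.getOrDefault(SEPARATOR_KEY, "\t");
        return key + separator + value;
    }

    public static void main(String[] args) {
        Map<String, String> conf = new HashMap<>();
        System.out.println(formatRecord(conf, "apple", "2")); // key and value separated by a tab

        // In a real job the corresponding step would be setting this property
        // on the Hadoop job configuration, e.g. conf.set(SEPARATOR_KEY, ",").
        conf.put(SEPARATOR_KEY, ",");
        System.out.println(formatRecord(conf, "apple", "2")); // prints apple,2
    }
}
```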

Working with Nested JSON in Spark

JSON is a very common way to store data. However, JSON can get messy, and parsing it can get tricky. Here are a few examples of parsing nested data structures in JSON using Spark DataFrames (the examples here were done with Spark 1.6.0).

Sample JSON File:

{ "user": "gT35Hhhre9m", "dates": ["2016-01-29", "2016-01-28"], "status": "OK", "reason": "some reason", "content": [{ "foo": 123, "bar": "val1"...