Working with Nested JSON in Spark

JSON is a very common way to store data, but it can get messy, and parsing it can get tricky. Here are some examples of parsing nested data structures in JSON using Spark DataFrames (the examples here were run on Spark 1.6.0).


Sample JSON File:

{
	"user": "gT35Hhhre9m",
	"dates": ["2016-01-29", "2016-01-28"],
	"status": "OK",
	"reason": "some reason",
	"content": [{
		"foo": 123,
		"bar": "val1"
	}, {
		"foo": 456,
		"bar": "val2"
	}, {
		"foo": 789,
		"bar": "val3"
	}, {
		"foo": 124,
		"bar": "val4"
	}, {
		"foo": 126,
		"bar": "val5"
	}]
}


Assuming you already have a SQLContext object created, the examples below demonstrate how to access the nested data from the JSON above.
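If you are not in the spark-shell REPL (which provides sqlContext automatically), a minimal sketch of setting one up in a standalone Spark 1.6 application might look like this (the app name and master URL are just placeholders):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

//create a local SparkContext and wrap it in a SQLContext
val conf = new SparkConf().setAppName("nested-json-example").setMaster("local[*]")
val sc = new SparkContext(conf)
val sqlContext = new SQLContext(sc)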

scala> val df = sqlContext.read.json("sample.json")
df: org.apache.spark.sql.DataFrame = [content: array<struct<bar:string,foo:bigint>>, dates: array<string>, reason: string, status: string, user: string]

//output
df.show
+--------------------+--------------------+-----------+------+-----------+
|             content|               dates|     reason|status|       user|
+--------------------+--------------------+-----------+------+-----------+
|[[val1,123], [val...|[2016-01-29, 2016...|some reason|    OK|gT35Hhhre9m|
+--------------------+--------------------+-----------+------+-----------+
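The one-line schema above is easy to misread; printSchema renders the same information as a tree, which for this file looks roughly like:

scala> df.printSchema
root
 |-- content: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- bar: string (nullable = true)
 |    |    |-- foo: long (nullable = true)
 |-- dates: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- reason: string (nullable = true)
 |-- status: string (nullable = true)
 |-- user: string (nullable = true)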


We can see in our output that the “content” field contains an array of structs, while our “dates” field contains an array of strings. The first step to being able to access the data in these structures is to extract and “explode” the column into a new DataFrame using the explode function.

Extracting “dates” into new DataFrame:

//explode dates field (explode must be imported from spark.sql.functions)
scala> import org.apache.spark.sql.functions.explode
scala> val dfDates = df.select(explode(df("dates")))

//output
dfDates.show
+----------+
|       col|
+----------+
|2016-01-29|
|2016-01-28|
+----------+

//rename "col" to "dates"
scala> val dfDates = df.select(explode(df("dates"))).toDF("dates")

//output
dfDates.show
+----------+
|     dates|
+----------+
|2016-01-29|
|2016-01-28|
+----------+
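The exploded values are still plain strings. If actual date semantics are needed (sorting, date arithmetic), they can be cast with the built-in to_date function; a small sketch (dfTypedDates is just an illustrative name):

//cast the exploded date strings to DateType
scala> import org.apache.spark.sql.functions.to_date
scala> val dfTypedDates = dfDates.withColumn("dates", to_date(dfDates("dates")))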


Our “content” field contains an array of structs. To access the data in each of these structs, we must use the dot operator.

Extracting data in array of structs:

//explode content field
scala> val dfContent = df.select(explode(df("content")))
dfContent: org.apache.spark.sql.DataFrame = [col: struct<bar:string,foo:bigint>]

//output
scala> dfContent.show
+----------+
|       col|
+----------+
|[val1,123]|
|[val2,456]|
|[val3,789]|
|[val4,124]|
|[val5,126]|
+----------+

//rename "col" to "content"
scala> val dfContent = df.select(explode(df("content"))).toDF("content")
dfContent: org.apache.spark.sql.DataFrame = [content: struct<bar:string,foo:bigint>]

//output
scala> dfContent.show
+----------+
|   content|
+----------+
|[val1,123]|
|[val2,456]|
|[val3,789]|
|[val4,124]|
|[val5,126]|
+----------+

//extracting fields in struct
scala> val dfFooBar = dfContent.select("content.foo", "content.bar")
dfFooBar: org.apache.spark.sql.DataFrame = [foo: bigint, bar: string]

//output
scala> dfFooBar.show
+---+----+
|foo| bar|
+---+----+
|123|val1|
|456|val2|
|789|val3|
|124|val4|
|126|val5|
+---+----+
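Note that exploding into a separate DataFrame drops the other top-level columns. To keep, say, the “user” field next to each struct, explode can be combined with ordinary columns in the same select; a sketch (the dfUserFooBar name and the alias are illustrative):

//explode content while keeping the user column
scala> val dfExploded = df.select(df("user"), explode(df("content")).as("content"))
scala> val dfUserFooBar = dfExploded.select("user", "content.foo", "content.bar")

Each of the five structs becomes its own row, with the user value repeated on every row.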
