Apache Spark: Loading a CSV File Using a Custom Timestamp Format

In this blog post, we will see how to load a CSV file in which one of the columns contains timestamps.

Creating a DataFrame from a CSV file

The data set below contains two columns, EVENT_ID and EVENT_DATE. The EVENT_DATE column is a timestamp in the format "dd-MM-yyyy HH mm ss".

EVENT_ID,EVENT_DATE
AUTUMN-L001,20-01-2019 15 40 23
AUTUMN-L002,21-01-2019 01 20 12
AUTUMN-L003,22-01-2019 05 50 46

Now let us create a DataFrame by reading the CSV file and then print its schema. We will check whether Spark infers the schema for the EVENT_DATE column.

val eventDataDF = spark.read.
  option("header", "true").
  option("inferSchema", "true").
  csv("d://spark-example/event_data.csv")

In the above code we have used option("inferSchema", "true"), which makes Spark infer the type of each column by parsing the data set.

scala> eventDataDF.printSchema
root
 |-- EVENT_ID: string (nullable = true)
 |-- EVENT_DATE: timestamp (nullable = true)

When you print the schema of the DataFrame after reading the CSV, the output above shows that Spark has inferred a type for every field, including timestamp for EVENT_DATE.

Spark's CSV reader supports the following options, among others:

Option        Default   Description
sep           ,         Sets a single character as the separator for each field and value.
quote         "         Sets a single character used for escaping quoted values where the separator can be part of the value. To turn off quotations, set an empty string rather than null. This behaviour is different from com.databricks.spark.csv.
header        false     Uses the first line as the names of the columns.
inferSchema   false     Infers the input schema automatically from the data. It requires one extra pass over the data.
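
As a rough sketch of how these options combine, the reader below sets each of them explicitly. It assumes a SparkSession named spark is already available (as in spark-shell) and reuses the file path from this post; the values given for sep and quote are simply the defaults, shown only to make them visible.

```scala
// Assumes an existing SparkSession called `spark` (spark-shell provides one).
val eventDataDF = spark.read.
  option("sep", ",").             // field separator (this is the default)
  option("quote", "\"").          // quote character (this is the default)
  option("header", "true").       // first line holds the column names
  option("inferSchema", "true").  // costs one extra pass over the data
  csv("d://spark-example/event_data.csv")
```

Leaving out sep and quote should give the same result here, since the sample file uses commas and contains no quoted values.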

Reference documentation

Problem with a different timestamp format

But what if the timestamp fields in the CSV are in some other timestamp format, for example MM-dd-yyyy HH mm ss?

Now the content of the CSV file is in this format:

EVENT_ID,EVENT_DATE
AUTUMN-L001,01-21-2019 15 40 23
AUTUMN-L002,01-22-2019 01 20 12
AUTUMN-L003,01-23-2019 05 50 46

Read this file with the same options as before and print the schema:

val eventDataDF = spark.read.
  option("header", "true").
  option("inferSchema", "true").
  csv("d://spark-example/event_data1.csv")

scala> eventDataDF.printSchema
root
 |-- EVENT_ID: string (nullable = true)
 |-- EVENT_DATE: string (nullable = true)

The problem is that Spark did not infer a timestamp type for EVENT_DATE: because the values do not match the format it expects, the column falls back to string.
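
To see how the pattern letters decide whether such a value parses at all, here is a small sketch that uses plain java.time instead of Spark. The helper parses is a hypothetical name introduced only for this illustration, and Spark's own CSV parser may behave differently depending on its version, so treat this as a demonstration of the pattern letters rather than of Spark itself.

```scala
import java.time.LocalDateTime
import java.time.format.DateTimeFormatter
import scala.util.Try

// Hypothetical helper: true when `text` parses completely under `pattern`.
def parses(text: String, pattern: String): Boolean =
  Try(LocalDateTime.parse(text, DateTimeFormatter.ofPattern(pattern))).isSuccess

val sample = "01-21-2019 15 40 23" // a value from the second CSV file

val monthFirst = parses(sample, "MM-dd-yyyy HH mm ss") // month first, 24-hour clock
val dayFirst   = parses(sample, "dd-MM-yyyy HH mm ss") // 21 would be read as a month
val twelveHour = parses(sample, "MM-dd-yyyy hh mm ss") // hh is the 12-hour clock

println(s"monthFirst=$monthFirst dayFirst=$dayFirst twelveHour=$twelveHour")
// prints: monthFirst=true dayFirst=false twelveHour=false
```

Only the pattern whose field order and hour style match the data accepts the value, which is why the format handed to a parser has to describe the file exactly.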

Inferring the schema for a different timestamp format

This issue can be resolved with the timestampFormat option, which lets us give a custom timestamp format while reading the CSV. We just have to add one extra option defining the format, such as option("timestampFormat", "MM-dd-yyyy HH mm ss"). Note the upper-case HH, the 24-hour clock, which values such as 15 40 23 require; lower-case hh denotes the 12-hour clock.

val eventDataDF = spark.read.
  option("header", "true").
  option("inferSchema", "true").
  option("timestampFormat", "MM-dd-yyyy HH mm ss").
  csv("d://spark-example/event_data1.csv")

scala> eventDataDF.printSchema
root
 |-- EVENT_ID: string (nullable = true)
 |-- EVENT_DATE: timestamp (nullable = true)
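
If the column types are known in advance, another approach is to skip inference and pass an explicit schema together with the timestamp format. The sketch below assumes a SparkSession named spark (as in spark-shell) and the same file path as above; eventSchema is a name chosen here for illustration, not something defined earlier in this post.

```scala
import org.apache.spark.sql.types.{StringType, StructField, StructType, TimestampType}

// Assumes an existing SparkSession called `spark` (spark-shell provides one).
// Declare the two columns up front instead of asking Spark to infer them.
val eventSchema = StructType(Seq(
  StructField("EVENT_ID", StringType, nullable = true),
  StructField("EVENT_DATE", TimestampType, nullable = true)
))

val eventDataDF = spark.read.
  schema(eventSchema).                               // no inferSchema pass needed
  option("header", "true").
  option("timestampFormat", "MM-dd-yyyy HH mm ss").  // still needed to parse the values
  csv("d://spark-example/event_data1.csv")

eventDataDF.printSchema()
```

This avoids the extra pass over the data that inferSchema requires, at the cost of having to keep the schema in sync with the file by hand.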

Naveen P.N

12+ years of experience in IT, with vast experience in executing complex projects using Java, microservices, Big Data and cloud platforms. I founded NPN Training Pvt Ltd, an India-based startup, to provide high-quality training for IT professionals. I have trained more than 3000 IT professionals and helped them succeed in their careers in different technologies. I am very passionate about technology and training. I have spent 12 years at Siemens, Yahoo, Amazon and Cisco, developing and managing technology.