Various Entry Points for Apache Spark

In Data Engineering, Apache Spark is probably one of the most popular frameworks for processing huge volumes of data. In this blog post I am going to cover the various entry points for Spark applications and how they have evolved across releases.

Every Spark Application needs an entry point that allows it to communicate with data sources and perform operations such as reading and writing data.

In Spark 1.x, three entry points were introduced:

  1. SparkContext,
  2. SQLContext and
  3. HiveContext

Spark 2.x introduced a new entry point called SparkSession that essentially combines the functionality of the three aforementioned contexts. Note that all contexts are still available, even in the newest Spark releases, mostly for backward compatibility purposes.

SparkContext

The SparkContext is used by the driver process of the Spark application to establish communication with the cluster and the resource manager in order to coordinate and execute jobs. SparkContext also enables access to the other two contexts, namely SQLContext and HiveContext (more on these entry points later on).

In order to create a SparkContext, you first need to create a Spark configuration (SparkConf):

Scala

import org.apache.spark.{SparkContext, SparkConf}

val sparkConf = new SparkConf().setAppName("app")
                               .setMaster("yarn")
val sc = new SparkContext(sparkConf)

Python

from pyspark import SparkContext, SparkConf

conf = SparkConf().setAppName('app').setMaster('yarn')
sc = SparkContext(conf=conf)

Note: if you are using the spark-shell, a SparkContext is already available through the variable sc.
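
To see the SparkContext in action, here is a minimal Scala sketch that uses it to create an RDD and run a simple job; the numbers are just placeholder data:

// Distribute a local collection as an RDD and run a job on the cluster
val numbers = sc.parallelize(1 to 10)
val doubled = numbers.map(_ * 2).collect()   // collect() triggers the job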

SQLContext

SQLContext is the entry point to Spark SQL, a Spark module for structured data processing. Once the SQLContext is initialized, the user can use it to perform SQL-like operations over Datasets and DataFrames. In order to create a SQLContext, you first need to instantiate a SparkContext as shown below:

Scala

import org.apache.spark.{SparkContext, SparkConf}
import org.apache.spark.sql.SQLContext

val sparkConf = new SparkConf().setAppName("app")
                               .setMaster("yarn")
val sc = new SparkContext(sparkConf)
val sqlContext = new SQLContext(sc)

Python

from pyspark import SparkContext, SparkConf
from pyspark.sql import SQLContext

conf = SparkConf().setAppName('app').setMaster('yarn')
sc = SparkContext(conf=conf)
sql_context = SQLContext(sc)
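
As a quick illustration of those SQL-like operations, here is a minimal Scala sketch using the sqlContext created above; the Person case class and the sample rows are made up for the example:

// Build a DataFrame from a local collection (Spark 1.x API)
case class Person(name: String, age: Int)
val df = sqlContext.createDataFrame(Seq(Person("Alice", 29), Person("Bob", 35)))

// Register it as a temporary table and query it with SQL
df.registerTempTable("people")
sqlContext.sql("select name from people where age > 30").show()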

HiveContext

If your Spark application needs to communicate with Hive and you are using Spark < 2.0, then you will probably need a HiveContext. For Spark 1.5+, HiveContext also offers support for window functions.

Scala

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

val sparkConf = new SparkConf().setAppName("app")
                               .setMaster("yarn")
val sc = new SparkContext(sparkConf)
val hiveContext = new HiveContext(sc)
hiveContext.sql("select * from tableName limit 0")

Python

from pyspark import SparkContext, SparkConf
from pyspark.sql import HiveContext

conf = SparkConf().setAppName('app').setMaster('yarn')
sc = SparkContext(conf=conf)
hive_context = HiveContext(sc)
hive_context.sql("select * from tableName limit 0")
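
To illustrate the window function support, here is a minimal Scala sketch that ranks rows within each group through the hiveContext created above; the sales table and its dept/amount columns are hypothetical:

// row_number() over a partition requires HiveContext on Spark 1.x
hiveContext.sql("""
  select dept, amount,
         row_number() over (partition by dept order by amount desc) as rn
  from sales
""").show()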

SparkSession

Spark 2.0 introduced a new entry point called SparkSession that essentially replaced both SQLContext and HiveContext. Additionally, it gives developers immediate access to the underlying SparkContext. In order to create a SparkSession with Hive support, all you have to do is:

Scala

import org.apache.spark.sql.SparkSession

val sparkSession = SparkSession.builder()
                               .appName("myApp")
                               .enableHiveSupport()
                               .getOrCreate()

// Access the SparkContext from the SparkSession
val sparkContext = sparkSession.sparkContext

Python

from pyspark.sql import SparkSession
spark_session = SparkSession.builder.enableHiveSupport().getOrCreate()

Two ways you can access the SparkContext from a SparkSession:

spark_context = spark_session.sparkContext
spark_context = spark_session._sc   # _sc is an internal attribute; prefer sparkContext
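
Since the SparkSession also takes over the SQLContext/HiveContext role, reading data and running SQL goes directly through it. A minimal Scala sketch, with a hypothetical JSON path:

// Read a DataFrame and query it through the session (path is a placeholder)
val df = sparkSession.read.json("/path/to/people.json")
df.createOrReplaceTempView("people")
sparkSession.sql("select count(*) as total from people").show()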
