Executor, Job, Stage, Task & RDD Persistence

Let us start an Application. For this demo, the Scala shell acts as the Driver (Application):

$ spark-shell

Connect to the web UI (localhost:4040) and explore all the tabs. Except for the Environment & Executors tabs, all other tabs are empty.


That clearly indicates we have an Executor running in the background to support our Application.
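
We can confirm this from the shell itself: sc, the SparkContext the shell creates for us, describes the running application. A minimal check using two standard SparkContext members:

scala> sc.appName   // "Spark shell" -- the application name the web UI reports
scala> sc.master    // e.g. "local[*]" when started without a cluster URL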


The First Run


scala> val data = sc.parallelize(1 to 10)
data: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[0] at parallelize at <console>

scala> data.count
res0: Long = 10
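
Before turning to the UI, note that the number of Tasks we are about to see corresponds to the number of partitions of the RDD (each partition is processed by one Task). A quick check, assuming the default partition count follows spark.default.parallelism:

scala> data.partitions.length   // one Task per partition in each Stage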

Let us check all the tabs.

Jobs Tab


Here we can see how our action (count) is broken down into sub-components (Job -> Stages -> Tasks). Every action is converted into a Job, which in turn is divided into Stages, with each Stage having its own set of Tasks.
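
count produces a single-stage Job here because no shuffle is involved. To see a Job split into more than one Stage, we can introduce a shuffle ourselves. The sketch below (the pair RDD and its keys are made up for illustration) uses reduceByKey, which adds a shuffle boundary, so the Jobs tab should show one Job with two Stages:

scala> val pairs = data.map(x => (x % 2, x))   // key the numbers by parity
scala> pairs.reduceByKey(_ + _).count          // shuffle boundary => one Job, two Stages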

Given below are snapshots of the other tabs, which are self-explanatory.

Stages Tab


Executors Tab


The Second Run

Let us once again run the action ‘count’ and see what happens. This time we get a new Job (Job Id 1) with its own Stages & Tasks.


scala> data.count
res1: Long = 10

Jobs Tab

Note: Clicking on the Description of each Job takes us to the Stages specific to that Job.

Stages Tab

Clicking on the Description of each Stage provides information on the Tasks related to that Stage.
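
Stage boundaries can also be inspected without the web UI: toDebugString, a standard RDD method, prints the RDD lineage, and its indentation changes exactly at shuffle boundaries, which is where Spark cuts a Job into Stages. For example, reusing the illustrative pairs RDD from earlier:

scala> pairs.reduceByKey(_ + _).toDebugString   // indentation marks the Stage split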

Executors Tab

Adding Persistence

Let us explore the concept of Persistence using the web UI. We will persist our RDD ‘data’ and run the action ‘count’ again. This time we are interested in the Storage tab (it has been empty so far).

scala> import org.apache.spark.storage.StorageLevel
import org.apache.spark.storage.StorageLevel

scala> data.persist(StorageLevel.MEMORY_ONLY)
res3: data.type = ParallelCollectionRDD[0] at parallelize at <console>:21

scala> data.count
res4: Long = 10
The Storage tab provides information on the RDDs that are persisted. The link in the RDD Name column provides more details on the RDD.
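
The storage level can also be read back programmatically via getStorageLevel, a standard RDD method:

scala> data.getStorageLevel   // reports the memory-only level we set above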

The Executors tab also provides information on the memory that is used to persist the data.
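
The same can be checked from the shell; a brief sketch (getExecutorMemoryStatus is a SparkContext method reporting, per executor, the maximum and remaining memory available for caching, and unpersist releases the cached blocks, emptying the Storage tab again):

scala> sc.getExecutorMemoryStatus   // executor -> (maxMem, remainingMem)
scala> data.unpersist()             // release the cached partitions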