Let us start an application. For this demo, the Scala shell acts as the driver (application).
Connect to the web UI (localhost:4040) and explore all the tabs. Except for the Environment and Executors tabs, all other tabs are empty.
The Executors tab clearly indicates we have an executor running in the background to support our application.
The First Run
$ val data = sc.parallelize(1 to 10)
data: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD at parallelize at <console>
$ data.count
res0: Long = 10
Let us check all the tabs.
We can see how our action (count) is broken down into sub-components (Job -> Stages -> Tasks). Any action is converted into a job, which in turn is divided into stages, with each stage having its own set of tasks.
Given below are snapshots of the other tabs, which are self-explanatory.
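To see more than one stage per job, we can introduce a shuffle. The sketch below (a hypothetical example, assuming a running spark-shell where `sc` is predefined) uses reduceByKey, a wide transformation, so the resulting job appears in the UI with two stages instead of one:

```scala
// Narrow transformations (map) stay within a stage; a wide transformation
// (reduceByKey) forces a shuffle, which starts a new stage.
val pairs = sc.parallelize(1 to 10).map(n => (n % 2, n))
val sums  = pairs.reduceByKey(_ + _)
sums.count   // one job, two stages in the Jobs/Stages tabs
```

Comparing this job with the earlier count in the Stages tab makes the shuffle boundary easy to spot.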
The Second Run
Let us run the action 'count' once again and see what happens. This time we get a new job (Job Id 1) with its own stages and tasks.
$ data.count
res1: Long = 10
Note: Clicking on the Description of each job takes us to the stages specific to that job.
Clicking on the Description of each stage provides information on the tasks related to that stage.
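The same Job -> Stages structure can also be inspected from the shell itself. As a sketch (assuming the same spark-shell session), RDD.toDebugString prints the lineage, where each indented block corresponds to a shuffle (stage) boundary:

```scala
// toDebugString shows the RDD lineage; indentation marks stage boundaries.
val rdd = sc.parallelize(1 to 10).map(n => (n % 2, n)).reduceByKey(_ + _)
println(rdd.toDebugString)
```

This is handy when the web UI is not reachable, for example on a remote cluster.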
Let us explore the concept of persistence using the web UI. Let us persist our RDD 'data' and run the action 'count' again. This time we are interested in the Storage tab (it has been empty so far).
$ import org.apache.spark.storage.StorageLevel
$ data.persist(StorageLevel.MEMORY_ONLY)
res3: data.type = ParallelCollectionRDD at parallelize at <console>:21
$ data.count
res4: Long = 10
The Executors tab also provides information on the memory used to persist data.
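Persistence can be verified and reversed from the shell as well. A minimal sketch, assuming the `data` RDD from the session above:

```scala
// Check the storage level of the cached RDD, then release its blocks.
// After unpersist, the entry disappears from the Storage tab.
println(data.getStorageLevel)
data.unpersist()
```

Unpersisting is useful once an RDD is no longer reused, since cached blocks otherwise keep occupying executor memory.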