Spark splits data into partitions, and computation is done in parallel for each partition. It is very important to understand how data is partitioned and when you need to manually adjust the partitioning to run a Spark application efficiently. In the Spark RDD API there are 2 methods available to increase or …
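The excerpt above is cut off, but the standard RDD methods for changing the number of partitions are repartition() and coalesce(). A minimal sketch, assuming a SparkContext named sc is already available as in the shell examples below:

// Build an RDD from a small range; the second argument fixes the initial partition count.
val rdd = sc.parallelize(1 to 100, 4)
println(rdd.partitions.length)        // 4

// repartition() triggers a full shuffle and can increase or decrease the partition count.
val widened = rdd.repartition(8)
println(widened.partitions.length)    // 8

// coalesce() only merges existing partitions, so it avoids a full shuffle
// and is the cheaper choice when reducing the partition count.
val narrowed = widened.coalesce(2)
println(narrowed.partitions.length)   // 2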

cache() and persist() are 2 methods available in Spark to improve the performance of Spark computations. They save intermediate results so those results can be reused in subsequent stages. These intermediate results are kept as RDDs in memory (the default) or on stable storage such as disk, and/or replicated. RDDs can be cached …
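A minimal sketch of the difference between the two calls, again assuming the shell's sc; the file path is only a placeholder:

import org.apache.spark.storage.StorageLevel

// cache() is shorthand for persist(StorageLevel.MEMORY_ONLY).
val logs = sc.textFile("hdfs:///tmp/sample.log")   // placeholder path
val errors = logs.filter(_.contains("ERROR")).cache()

// persist() lets you choose the storage level explicitly,
// e.g. spill to disk when memory is short and keep two replicas.
val warnings = logs.filter(_.contains("WARN")).persist(StorageLevel.MEMORY_AND_DISK_2)

// The first action materialises and stores the RDD; later actions reuse the stored copy.
println(errors.count())
println(errors.take(5).mkString("\n"))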

JSON is a quite common format for storing data. However, JSON can get untidy and parsing it can get tricky. Here are some examples of parsing nested data structures in JSON with Spark DataFrames (examples here done with Spark 1.6.0).   Sample JSON file: { "user": "gT35Hhhre9m", "dates": ["2016-01-29", "2016-01-28"], "status": …
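A minimal sketch of reading a file of such documents (one JSON object per line is assumed, and the file name is a placeholder); field names follow the sample above, and in Spark 1.6 the entry point is sqlContext rather than the later SparkSession:

import org.apache.spark.sql.functions.explode

// Read the JSON file; Spark infers the nested schema automatically.
val df = sqlContext.read.json("sample.json")   // placeholder path
df.printSchema()

// Top-level fields can be addressed directly by name,
// and array fields such as "dates" can be flattened with explode().
df.select("user", "status").show()
df.select(df("user"), explode(df("dates")).alias("date")).show()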

Apache Spark is an analytics engine which can process huge data volumes at a speed much faster than MapReduce, because intermediate data is kept in memory within Spark’s own processing framework instead of being written back to disk between steps. That is why it has been catching the attention of both professionals and the press. It was first developed at AMPLab …

What is Lazy Evaluation in Spark? As the name indicates, lazy evaluation in Spark means that execution will not start until an action is triggered. In Spark, lazy evaluation comes into the picture with transformations. Transformations are lazy in nature, meaning that when we call some …
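A minimal sketch of this behaviour in the shell (assuming sc as usual): the map() below only records a step in the lineage, and nothing runs on the executors until the action count() is called.

// Transformation: recorded in the DAG, but not executed yet.
val numbers = sc.parallelize(1 to 1000000)
val squares = numbers.map(n => n.toLong * n)   // no work has happened on the cluster so far

// Action: triggers the actual computation of the whole lineage.
val total = squares.count()
println(total)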

I have seen many Hadoop professionals searching the internet for Big Data architect roles and responsibilities. So here I have tried to help them by putting together most of the points. These are the main tasks an architect needs to perform, or the skill set one should have, to become a Big Data architect. …

Let’s look at two different ways to compute word counts, one using reduceByKey and the other using groupByKey:

val words = Array("a","b","c","a","a","b","c","a","a","b","c","a","a","b","c","a","a","b","c","a","a","b","c","a","a","b","c","a","a","b","c","a","a","b","c","a","a","b","c","a","a","b","c","a","a","b","c","a","a","b","c","a","a","b","c","a","a","b","c","a","a","b","c","a","a","b","c","a","a","b","c","a","a","b","c","a","a","b","c","a","a","b","c","a","a","b","c","a","a","b","c","a","a","b","c","a")
val pairs = sc.parallelize(words).map(line => (line, 1))
val wordCountsWithGroup = pairs.groupByKey().map(t => (t._1, t._2.sum)).collect()
val wordCountsWithReduce = pairs.reduceByKey(_ + _).collect()

While both of these functions will …

In this post, we will look at the following pseudo-set transformations: distinct(), union(), intersection(), subtract(), and cartesian(). Distinct: distinct() returns the distinct elements in the RDD. Warning: involves shuffling of data over the network. Union: union() returns an RDD containing data from both sources. Note: unlike the mathematical union, duplicates are …
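A minimal sketch of these transformations on two small RDDs (again assuming the shell's sc); the expected results are noted as comments:

val left  = sc.parallelize(Seq(1, 2, 2, 3, 4))
val right = sc.parallelize(Seq(3, 4, 5))

// distinct(): removes duplicates, shuffling data across the network.
left.distinct().collect()           // 1, 2, 3, 4 (in some order)

// union(): keeps duplicates and does not shuffle.
left.union(right).collect()         // 1, 2, 2, 3, 4, 3, 4, 5

// intersection(): elements present in both RDDs, duplicates removed.
left.intersection(right).collect()  // 3, 4

// subtract(): elements of the left RDD that are not in the right one.
left.subtract(right).collect()      // 1, 2, 2

// cartesian(): every pairing of the two RDDs; the result grows very quickly.
left.cartesian(right).count()       // 5 * 3 = 15 pairs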

Components of Spark

Following are some important components of Spark. Cluster Manager: used to run the Spark application in cluster mode. Application: the user program built on Spark; consists of a Driver Program and Executors. Driver Program: the program that has the SparkContext; acts as a coordinator for the application. Executors: run computation and store application data; are launched …
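As a rough sketch of how these pieces show up in code (the master URL and resource numbers below are placeholders): the driver program is simply the process that builds the SparkContext, the cluster manager is chosen via the master setting, and executor resources are requested through the configuration.

import org.apache.spark.{SparkConf, SparkContext}

object ComponentsDemo {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("components-demo")          // name shown in the web UI
      .setMaster("spark://master-host:7077")  // placeholder: cluster manager to connect to
      .set("spark.executor.instances", "2")   // placeholder: number of executors to launch
      .set("spark.executor.memory", "2g")     // placeholder: memory per executor

    // This process is the Driver: it owns the SparkContext and coordinates the executors.
    val sc = new SparkContext(conf)

    // The work below is split into tasks that run inside the Executors.
    val result = sc.parallelize(1 to 100).map(_ * 2).sum()
    println(result)

    sc.stop()
  }
}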

Let us start an Application. For this demo, the Scala shell acts as the Driver (Application): $ spark-shell  Connect to the web UI (localhost:4040) and explore all the tabs. Except for the Environment & Executors tabs, all other tabs are empty. That clearly indicates we have an Executor running in the background to support our Application. The …
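To see the remaining tabs fill in, it is enough to run a small job from the same shell; a minimal sketch (the numbers are arbitrary):

// Any action creates a job, so the Jobs and Stages tabs get populated.
val data = sc.parallelize(1 to 100000)
data.map(_ * 2).count()

// Caching an RDD and touching it with an action also populates the Storage tab.
val cached = data.filter(_ % 2 == 0).cache()
cached.count()

After running these lines, refresh localhost:4040 and the Jobs, Stages and Storage tabs should no longer be empty.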