Apache Spark is an analytics engine that can process huge volumes of data far faster than MapReduce, largely because it keeps data in memory instead of writing intermediate results to disk. That is why it has been catching the attention of both professionals and the press. It was first developed in 2009 at the AMPLab of the University of California, Berkeley, and was open-sourced a year later, making its source code freely available for modification and redistribution. Apache Spark has since grown into one of the largest open source communities in big data, with over 200 contributors from 50 organizations.
Every analytics engine gives its best results for particular use cases, so enterprises using such engines will most likely need a combination of tools to cover all of theirs. We were curious to see some of the use cases Spark is being utilized for.
Humongous amounts of data are created and processed every day, and companies need to be able to stream that data so it can be analyzed in real time. Spark Streaming provides that functionality in the following ways:
- Streaming ETL – Spark Streaming reads data, cleans it up, and transforms it before loading it into the target data warehouse, all on a real-time, continuous basis. This is Spark's streaming ETL (Extract, Transform and Load) functionality.
- Enrichment of Data – Apart from dynamic streaming data, Apache Spark also handles static data and can combine the two for richer analysis. For example, an advertiser would love to join static historical data with real-time behavioral data about customers to devise better-targeted promotions, and Spark helps them do exactly that.
- Trigger Detection – Real-time data streaming is more useful when the analytics engine can flag warning triggers to the user as they occur: a dip in a patient's vital signs in a hospital data system, for example, or a suspicious transaction in a banking data system. Spark can recognize and flag such triggers within its streaming data.
- Complex Session Analysis – Spark Streaming helps companies such as Netflix gain real-time insight into customer choices by analyzing live sessions (from the moment a customer logs in to the Netflix app, for instance), which in turn feeds useful recommendations back to the customer.
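The per-micro-batch logic behind the streaming ETL and trigger-detection points above can be sketched in plain Python. This is only an illustration of the idea, not the actual Spark Streaming API, and the record fields and vital-sign thresholds are hypothetical:

```python
# Illustrative micro-batch ETL with trigger detection (plain Python,
# not Spark Streaming API calls). Field names and bounds are invented.

def clean(record):
    """Transform step: drop malformed records, normalize the heart-rate field."""
    if "patient_id" not in record or "heart_rate" not in record:
        return None
    return {"patient_id": record["patient_id"],
            "heart_rate": int(record["heart_rate"])}

def detect_triggers(batch, low=40, high=140):
    """Flag records whose vital signs fall outside safe bounds."""
    alerts = []
    for rec in batch:
        if rec["heart_rate"] < low or rec["heart_rate"] > high:
            alerts.append(rec["patient_id"])
    return alerts

def process_micro_batch(raw_batch):
    """One streaming ETL step: extract -> clean -> flag triggers -> load."""
    cleaned = [r for r in (clean(x) for x in raw_batch) if r is not None]
    alerts = detect_triggers(cleaned)
    return cleaned, alerts  # 'cleaned' would be loaded into the warehouse
```

In a real Spark Streaming job, a function like `process_micro_batch` would be applied to each micro-batch of the incoming stream rather than called directly.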
Machine learning (or cognitive computing, as IBM refers to it) is all the rage today: the computing system uses the data it holds and receives to improve its own performance over time. Spark has a machine learning library (MLlib) that covers areas such as clustering and classification, and helps Spark provide sentiment analysis, customer segmentation and predictive intelligence. These run as repeated queries over different sets of data, and the recommendation engine's predictive and analytics performance improves with continued usage.
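To make the clustering idea behind customer segmentation concrete, here is a toy one-dimensional k-means in plain Python; MLlib's KMeans performs this kind of segmentation at cluster scale. The "customer spend" framing and all values here are invented for illustration:

```python
# Toy 1-D k-means (plain Python, not MLlib). Clusters scalar values,
# e.g. customer spend, into k segments.

def kmeans_1d(values, k=2, iters=20):
    # Spread the initial centers across the observed range.
    lo, hi = min(values), max(values)
    centers = [lo + (hi - lo) * i / (k - 1) for i in range(k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for v in values:
            # Assign each point to its nearest center.
            idx = min(range(k), key=lambda i: abs(v - centers[i]))
            clusters[idx].append(v)
        # Recompute each center as the mean of its cluster.
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return centers, clusters
```

Run on, say, `[10, 12, 11, 200, 210, 205]`, this separates low spenders from high spenders into two segments that could then be targeted differently.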
One impressive business case for this capability of Spark is network security. Spark Streaming, described above, lets users monitor data packets in real time before they are pushed to storage, calling out suspicious or malicious activity from known threat sources as soon as it appears. Once the data packets reach storage, Spark uses MLlib to analyze this data further, helping the network security system identify and uncover newer, hitherto unknown threats.
MapReduce was created to process data in batches, not in real time, and other SQL-on-Hadoop analytics engines such as Hive or Pig lack the necessary speed. This is where Spark comes in with its high-speed interactive analytics: it can handle exploratory queries without the need for sampling, and it offers APIs in popular languages such as SQL, R and Python.
The newest version of Apache Spark, Spark 2.0, introduces a functionality called Structured Streaming, which allows structured queries to be run against streaming data in real time. Users of Spark 2.0 can therefore run interactive queries against a web session while it is still in progress, a major step forward for web analytics. Further, machine learning algorithms can be applied to this structured streaming data, so that lessons learned from existing data are applied to newly arriving data.
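The core idea of Structured Streaming, a standing aggregate query whose result is updated incrementally as new micro-batches arrive, can be sketched in plain Python. This is not the Spark 2.0 API, and the page-view schema is hypothetical:

```python
# A minimal sketch of an incrementally maintained streaming aggregate
# (plain Python, not Structured Streaming itself).

from collections import defaultdict

class RunningCount:
    """Maintains the result of 'SELECT page, COUNT(*) GROUP BY page'
    over an unbounded stream of page-view events."""

    def __init__(self):
        self.counts = defaultdict(int)

    def update(self, batch):
        # Each incoming micro-batch only touches the keys it contains;
        # the rest of the standing result is left as-is.
        for event in batch:
            self.counts[event["page"]] += 1
        return dict(self.counts)
```

The query result can be read at any moment, mid-stream, which is what makes interactive queries over live sessions possible.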
Fog computing is another impressive use case for Apache Spark, but to understand how, we first need to understand the Internet of Things (IoT), which is top of everyone's mind right now. IoT involves connecting all our devices so that they can communicate with each other and provide solutions to their users. This entails parallel processing of many sets of data, in great quantities, from various devices and sensors, and the processing power currently available in the cloud might not be adequate as IoT spreads further. That is where fog computing comes in.
Instead of relying on cloud processing, fog computing pushes the processing work to the edge of the network, often embedding it in the devices themselves. But this primarily requires three things: very low latency, parallel processing for machine learning, and complex graph analytics algorithms. Apache Spark has answers to all three in its stack: it has Spark Streaming, it has Spark SQL (the successor to the Shark interactive query tool) for real-time queries, it has the machine learning library MLlib, and it has GraphX, an engine for graph analytics. That is why industry watchers predict that as IoT starts to converge, Spark has an opportunity to become the go-to infrastructure for fog computing.
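To give a flavor of the graph-analytics side, here is a toy version of one basic primitive, computing vertex degrees from an edge list, which GraphX would run in parallel across a cluster. This plain-Python sketch is only an illustration, not the GraphX API:

```python
# Vertex degrees from an undirected edge list (plain Python sketch).

from collections import defaultdict

def degrees(edges):
    """Count how many edges touch each vertex."""
    deg = defaultdict(int)
    for src, dst in edges:
        deg[src] += 1
        deg[dst] += 1
    return dict(deg)
```

Degree counts are a building block for richer analytics, such as finding hub devices in an IoT network, that GraphX expresses over distributed vertex and edge collections.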
Some Important Users Of Spark
We already discussed how Netflix uses Spark to gain better insights about its customers. Here are some other notable names using Spark:
- Uber: This ride-hailing giant uses Spark Streaming along with Kafka and HDFS to ETL (extract, transform and load) its huge volume of real-time data, turning discrete events into a mass of structured, usable data that can be analyzed further, for example to feed its surge-pricing algorithms.
- Pinterest: Millions of users pin their favorite topics on this social media site every day, and using the same ETL process flow described earlier, Pinterest is able to make useful recommendations to users browsing the site or engaging with Pins.
- Conviva: This streaming-video company handles about 4 million video feeds per month on average. In streaming video, buffering and poor picture quality drive customer churn, and managing live video traffic at that scale is a huge task. Spark helps Conviva optimize video traffic and consistently deliver a high-quality viewing experience.
A Word Of Caution Regarding Apache Spark
The in-memory capabilities of Apache Spark are not the best fit for every use case, even though Spark is more versatile than most other engines. Spark was not designed as a multi-user environment, so the memory at each user's disposal needs to be checked first: as more users are added, they must coordinate when and how much memory they use so that parallel projects can run smoothly. Apache suggests using other batch processing engines, such as Apache Hive, for this kind of situation.
The Final Word
Apache Spark's ecosystem is continuously developing, and will continue to do so. In parallel, enterprises are coming to terms with the pervasiveness of Big Data and thinking about how and where to use it profitably, which will present Apache Spark with more opportunities and use cases to expand its horizons across industries. Beyond the use cases described above, we are sure many more will come up in the near future that will help us understand whether Apache Spark is truly as great as it sounds. If you are keen to know more about the collaboration tools offered with QDS for Spark, or about Apache Spark-as-a-Service, then do contact us.