Table of Contents
Big Data is a major buzz word in the current technological forefront. Big Data technologies have given rise to the usage of cutting-edge research to practical applications. Machine Learning and Analytics is one such example. Prior to the adoption of Big Data technologies Artificial Intelligence and Machine Learning were limited to academic research. But Big Data has helped to bring these domains out of research labs to industry. Usage of these technologies help industry to maximize their profits and gain advantage over their competition. Big Data is not a single technology, instead it is a collection of frameworks, each having applicability in specific domain. This article discusses major Big Data frameworks and the probable domains/use-cases in which they could be used.
Connection between Big Data and Analytics
In current market scenario Big Data and Analytics are connected to such an extent that both the terms are used interchangeably. Analytics mainly deals with finding insights from data and doing predictions. Insights are mainly patterns in data that give information about the business that is otherwise not available directly. Using these patterns future outcomes of the process could be computed. Machine Learning is the main tool that helps compute unseen patterns from the data. Machine Learning and Artificial Intelligence are very important domains and existed long before the emergence of Big Data. However these domains were limit to academic research as these technologies required huge amount of resources. Also in order for these technologies to work they need huge amount of data. Prior to Big Data technologies it was tough to handle such a huge amount of data. Thus Analytics was not in much use. But with the reduction in computing cost and emergence of Big Data technologies, Analytics is gaining the momentum and tending to become main-stream practice.
Why Big Data is essentials for programmers?
Data forms the backbone of every project or product. Programmers mainly deal with the job extracting the data from a data store. Then apply some business logic on this extracted data and then display it in some suitable format. As we can see all the phases lie around data, data handling makes a major part of the life of a programmer. With the emergence of these big data technologies, programmers are bound to face them at one point or other. Now as Big Data domain comes with a range of frameworks, it becomes necessary for a programmer to understand these frameworks so that he/she could use them as per the given use-case.
Big Data Frameworks every programmer should know
Big Data domain covers a wide range of frameworks ranging from Machine Learning to File System to Databases. In this article with will be discussing major Big Data frameworks that a programmer should know to enhance his skills. In this tutorial we will only focus on a brief introduction and use cases of each framework. We will be discussing these frameworks under following criterion:
1. Introduction about framework.
2. Features of framework.
3. Use cases of framework.
Mentioned below are Big Data Frameworks every programmer should know. This is not an exhaustive list. This article only focuses to give a brief introduction to major big data frameworks.
Hadoop is the main project of Big Data ecosystem. Hadoop helps to perform distributed computing on a cluster of commodity(having normal cost) computers. The main idea behind hadoop is scalable distributed computing. By scalable we mean that we could add/remove extra nodes to the cluster as the amount of data increases or decreases respectively.
Hadoop has two major components:
1. HDFS: HDFS known as Hadoop Distributed File System is the file system used by Hadoop. HDFS gives a view of single directory structure to the user while under the hood the file system is distributed in nature.
2. Map-Reduce: It is the distributed programming environment provided by Hadoop. Map-Reduce is used to implement the application logic that will use the data stored on HDFS to produce results. Map-Reduce is based on parallel computing. The normal program that you as a programmer write for conventional system will not work on Map-Reduce. For Map-Reduce you have to convert your serial program to a parallel version.
a. Hadoop is written in Java and thus has APIs available for Java language.
b. For other languages there is a utility known as Hadoop Streaming through which other languages could talk to Hadoop.
c. Hadoop mainly works on Linux platform, however recently support for windows is also added.
1. Hadoop is mainly used where batch processing is needed. It is not great for real-time processing.
2. Building a Machine Learning model takes significant amount of time. So Hadoop could be used for model building. Further these models could be used with other real-time frameworks.
Spark is another Big Data framework. Spark supports In-Memory processing. Hadoop reads and writes data directly from disk thus wasting a significant amount of time in disk I/O. To tackle this scenario Spark stores intermediate results in memory thus reducing disk I/O and increasing speed of processing. As per the statistics shown on Spark home page, it could run a program up-to 100 times faster as compared to Hadoop.
1. Spark is written in Scala programming language.
2. You could write programs in Java, Scala and Python to work directly with Spark as Spark has separate APIs for all 3 programming languages. In future versions there is also supposed to be integration for R.
3. Spark could run with its own cluster management, with MESOS or with HDFS provided by Hadoop.
4. Support In-Memory processing.
1. Could be used where fast results are needed specially iterative tasks.
2. Performs extremely well for Machine Learning tasks as these are iterative in nature.
3. Spark could be used in all the scenarios where Hadoop could be used. However Spark provides better speed compared to Hadoop. Moreover its not a replacement of Hadoop instead it could run along side Hadoop.
Frameworks of Spark ecosystem:
Spark comes with a collection of utility frameworks listed below:
1. Spark SQL: It helps you to run SQL queries on distributed data. You as programmer don’t need to worry about making the query distributed. You just need to write the query and Sparks handles all the details of distributing the query and combining back the results. Spark SQL could also be used with JDBC and ODBC.
2. Spark Streaming: It is a streaming API. In scenarios like when you have to process live streams of Facebook or Twitter, you could use Spark Streaming.
3. MLlib: It is a Machine Learning API provided by Spark. It has various algorithms for tasks like: Classification, Regression, Clustering, Feature extraction etc. MLlib could be used to create intelligent applications for Analytics and forecasting. Major algorithms provided by MLlib are:
- Basic statistics: Summary statistics, Correlations, Stratified sampling, Hypothesis testing, Random data generation.
- Classification and regression: linear models (SVM, logistic regression, linear regression), naive Bayes, decision trees, ensembles of trees (Random Forests and Gradient-Boosted Trees), isotonic regression.
- Collaborative filtering: alternating least squares (ALS)
- Clustering: k-means, Gaussian mixture, power iteration clustering (PIC), latent Dirichlet allocation (LDA), streaming k-means,
- Dimensionality reduction: singular value decomposition (SVD), principal component analysis (PCA)
- Feature extraction and transformation
- Frequent pattern mining: FP-growth
- Optimization: stochastic gradient descent, limited-memory BFGS (L-BFGS)
4. GraphX: It is a Graph analytics library. GraphX could be used for Graph analytics like in Social Network Analysis or in domains of cognitive computing.
Mahout is a Machine Learning framework. Initially Mahout was based on Hadoop Map-Reduce but as per April 2015 it is ported to Spark. Mahout has a very rich set of machine learning algorithms:
1. Collaborative Filtering: User-Based Collaborative Filtering, Item-Based Collaborative Filtering, Matrix Factorization with ALS, Matrix Factorization with ALS on Implicit Feedback, Weighted Matrix Factorization, SVD++.
2. Classification: Logistic Regression – trained via SGD, Naive Bayes / Complementary Naive Bayes, Random Forest, Hidden Markov Models, Multilayer Perceptron.
3. Clustering: Canopy Clustering, k-Means Clustering, Fuzzy k-Means, Streaming k-Means, Spectral Clustering.
4. Dimensionality Reduction: Singular Value Decomposition, Lanczos Algorithm, Stochastic SVD, PCA (via Stochastic SVD), QR Decomposition.
5. Topic Models: Latent Dirichlet Allocation
6. Miscellaneous: RowSimilarityJob, ConcatMatrices, Collocations, Sparse TF-IDF Vectors from Text, XML Parsing, Email Archive Parsing, Lucene Integration, Evolutionary Processes.
1. Mahout could be used with Hadoop or Spark.
2. Main programming languages are Scala and Java.
1. Mainly used for Machine Learning analytics tasks like: Recommendation systems, Clustering systems, Classification systems etc.
HBase is distributed, column oriented database. It closely relates to the concept of Google’s BigTable. It is one of the popular NoSQL database. It comes under column oriented category as all the rows of a table need not have same columns (as in Relational Databases). In one row you could store a row with 3 columns and in next row you could store a row with 30 columns. So in a sense it has got flexible schema. You could write Map-Reduce programs and use HBase as their back-end. HBase has scalability property i.e. if the size of the table increases then table is divided into parts and these parts of table are stored on different nodes of the cluster.
1. It is good for reading data rather than writing/updating data.
2. Does not support SQL. Although it is a database still it has its domain specific language and does not support native SQL.
3. Has rich set of APIs for Java programming language.
1. Could be used where you don’t have predefined schema.
2. Could be used where columns to the table are dynamically added/deleted.
3. Could be used where you have a large amount of data that could not be handled by a single node.
Hive is a distributed Data-warehouse. Hive is used to provide a partial structure to unstructured data. It does not support SQL but have a SQL like language known as HiveQL. Under the hood when you write a HiveQL query and execute it, it is converted to a set of Map-Reduce tasks, then results of these Map-Reduce tasks are merged to give a transparent view to the end user.
1. Provides Java API.
2. Used HDFS as the back-end, thus limited by the speed of HDFS.
3. Uses Map-Reduce for data processing.
1. Could be used where you need a distributed data warehouse.
2. Queries can’t be executed in real-time, so could be used with application which don’t have strict time limits.
Pig is another data analysis platform for large datasets. It has its domain specific language called Pig Latin to query the data. It is used with Hadoop and works with HDFS.
1. Converts Pig Latin statements to Map-Reduce under the hood.
2. Allows user defined functions. Thus you could write your custom functions and use them while quering using Pig Latin.
3. Easy to use. It requires very less lines of code compared to Map-Reduce for same task.
4. Mainly suitable ETL jobs.
5. Uses lazy evaluation. It means part of code is only executed only when it is needed.
6. Supports creation of data pipelines in form of Directed Acyclic Graphs.
1. Could be used for data ETL(Extract Transform Load) with very less lines of code.
Logstash is a log processing framework. Events and logs like server logs, system utilization logs could be easily processed with Logstash. It could accept input from a wide variety of inputs like: collectd, eventlog, ganglia, log4j, rabbitmq, rackspace, redis, tcp, udp, twitter etc. It could also be used as a middle tier which accepts unstructured logs from a source. Then process them and adds structure to this unstructured data. Then pass this data to some output source like: elasticsearch, email, file, http, mongodb, nagios, rabbitmq, solr, etc.
1. Great tool for processing of unstructured data and logs.
2. Supports Java API.
3. Uses JSON for message passing.
1. Used in the processing of log data.