Big Data Architect Masters Program

The Big Data Architect Training is designed to help you master Big Data technologies by learning the concepts and implementation of Hadoop + Apache Storm + Apache Spark using Scala + Kafka and MongoDB. The entire program is a structured learning path recommended by leading industry experts and ensures that you transform yourself into a successful Big Data Architect.


  • About the Course
  • Curriculum
  • FAQs
  • Certification
  • Reviews

About the Course

The Big Data Architect Masters Program is designed to help you learn Big Data technologies such as Hadoop + Apache Storm + Apache Spark using Scala + Kafka and MongoDB in depth, through complete hands-on training.

Being a Big Data Architect requires you to be a master of multiple technologies, and this program will ensure you become an industry-ready Big Data Architect who can provide solutions for Big Data projects.

 

At NPN Training we believe in the philosophy "Learn by doing", hence we provide complete hands-on training with real-time project development.

 

The course includes CCA175 (Cloudera Certified Associate Spark and Hadoop Developer) certification training.

 

Work on a real-life Industry-based project

This program comes with a portfolio of industry-relevant POCs, use cases, and project work. Unlike other institutes, we do not pass off use cases as projects; we clearly distinguish between a use case and a project.

 

Social Media

 

Technologies used:  Hadoop, Hive (HQL)

Project #1: Discovering Insights from Social Bookmarking Websites +

We will use data accumulated from popular social bookmarking websites such as Reddit ("the front page of the web") and StumbleUpon, which let users bookmark, rate, review, and share links.

 

Project Statement: Analyze the data in the Hadoop ecosystem to:

 

  1. Ingest the data into HDFS and analyze it with MapReduce, Pig, and Hive to identify the highest-rated links based on user comments, likes, etc.
  2. Using MapReduce, convert the semi-structured (XML) data into a structured format and classify the user rating as positive or negative for each of the thousand links.
  3. Push the result into HDFS and then feed it into Pig, which divides the data into two sections: category data and rating data.
  4. Write a Hive query to analyze the data further and export the output to a relational database (RDBMS) using Sqoop.

 

 

 

Social Media

 

Technologies used:  Java, Apache Storm, Google API

Project #2: Building a Real-Time Complex Event Processing System +

Project Description:

In this project, you will build a real-time event processing system using Apache Storm, for workloads where even sub-second latencies matter for analysis, though not ultra-low-latency (nanosecond or picosecond) applications. A typical example is processing CDRs (Call Detail Records) from telecommunications, where millisecond response times are expected. Such systems often combine Apache Storm with HBase and HDFS.

 

Project Statement: You will create an Apache Storm application that processes the data in real time and performs the tasks below:

 

  1. Create a spout to read the real-time data generated by the network elements.
  2. Use the Google API to perform transformations such as converting latitude and longitude into region names.
  3. Perform computations to calculate important KPIs (Key Performance Indicators) on the real-time data (see the sketch after this list).
  4. Use HdfsBolt to push the data to HDFS for batch processing.
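The sketch below, written in Scala against Storm's Java API, illustrates task 3 only: a bolt that maintains a hypothetical KPI, the running count of dropped calls per cell tower. The field names towerId and callStatus are illustrative assumptions, not the actual CDR schema used in the project.

```scala
// A minimal KPI bolt sketch (assumes storm-core on the classpath).
import org.apache.storm.topology.base.BaseBasicBolt
import org.apache.storm.topology.{BasicOutputCollector, OutputFieldsDeclarer}
import org.apache.storm.tuple.{Fields, Tuple, Values}
import scala.collection.mutable

class DroppedCallKpiBolt extends BaseBasicBolt {
  // Running count of dropped calls per tower (kept per executor).
  private val dropsPerTower = mutable.Map.empty[String, Long].withDefaultValue(0L)

  override def execute(input: Tuple, collector: BasicOutputCollector): Unit = {
    val towerId = input.getStringByField("towerId")        // hypothetical field names
    if (input.getStringByField("callStatus") == "DROPPED") {
      dropsPerTower(towerId) += 1
      // Emit the running count downstream (e.g. to an HdfsBolt or an alerting bolt).
      collector.emit(new Values(towerId, Long.box(dropsPerTower(towerId))))
    }
  }

  override def declareOutputFields(declarer: OutputFieldsDeclarer): Unit =
    declarer.declare(new Fields("towerId", "droppedCalls"))
}
```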
 

 

Telecom Industry

 

Technologies used:  Java, Hadoop, Hive (HQL), Quartz scheduler

Project #3: Analysis of Call Detail Record +

Telecom companies such as Huawei and Ericsson collect usage transactions, network performance data, cell-site data, device information, and other data spread across the network, which can be used for analysis.

 

Project description: You will be given a CDR (Call Detail Record) which is a data record produced by a telephone exchange or other telecommunications equipment that documents the details of a telephone call or other telecommunications transaction (e.g., text message) that passes through that facility or device.

 


 

 

Retail Industry

 

Technologies used:  Hadoop, Hive QL

Project #4: Analysis of Customer Complaints about Products +
Project description: A publicly available dataset containing a few lakh observations with attributes such as CustomerId, Payment Mode, Product Details, Complaint, Location, Status of the complaint, etc. Analyze the data in the Hadoop ecosystem to:
  1. Get the number of complaints filed under each product
  2. Get the total number of complaints filed from a particular location
  3. Get the list of complaints, grouped by location, that did not receive a timely response


 

 

Education Industry

 

Technologies used:  Java, Hadoop, Hive (HQL), Quartz Scheduler

Project #5: Scholastic Assessment Analysis +

Project description: The data set is the SAT (College Board) 2010 School Level Results, which shows how students from different schools performed in the tests. It consists of the fields DBN, School Name, Number of Test Takers, Critical Reading Mean, Mathematics Mean, and Writing Mean; DBN is the unique field for this data set. We analyze this data against the following problem statements:

  1. Find the total number of test takers.
  2. Find the highest mean/average of the Critical Reading section and the school name.
  3. Find the highest mean/average of the Mathematics section and the school name.
  4. Find the highest mean/average of the Writing section and the school name.

 

Common across Industry

 

Technologies used:  Python

Project #6: Smart Data Generator +

Project description: Create a tool that generates dynamic mock data in real time based on a schema, which can then be fed into real-time processing systems such as Apache Storm or Spark Streaming.

 

Banking and Finance Industry

 

Technologies used:  Java, Apache Storm

Project #7: Bank Credit Card Authorization Using Storm +

Authorization hold (also card authorization, preauthorization) is the practice within the banking industry of verifying electronic transactions initiated with a debit card or credit card and holding this balance as unavailable until either the merchant clears the transaction, also called a settlement, or the hold "falls off."

 

 

Big Data & Hadoop 2.x

Module 01 - Understanding Big Data & Hadoop 2.x +
Learning Objectives - In this module, you will understand Big Data, the limitations of existing solutions to the Big Data problem, how Hadoop solves it, the common Hadoop ecosystem components, the Hadoop 2.x architecture, HDFS, and the anatomy of file write and read.

Topics -
  • Understanding what is Big Data
  • Combined storage + computation layer
  • Business Use case - Telecom
  • Challenges of Big Data
  • OLTP VS OLAP Applications
  • Limitations of existing Data Analytics
  • A combined storage compute layer
  • Introduction to Hadoop
  • Exploring Hadoop 2.x Core Components
  • Understanding Hadoop 2.x Daemon Services
    1. NameNode
    2. DataNode
    3. Secondary NameNode
    4. ResourceManager
    5. NodeManager
  • Understanding NameNode metadata
  • File Blocks in HDFS
  • Rack Awareness
  • Anatomy of File Read and File Write
  • Understanding HDFS Federation
  • Understanding the High Availability Feature in Hadoop 2.x
  • Exploring Big Data ecosystem

View Module Presentation

Check E-Learning for more Assignments + Use cases + Project work + Materials + Case studies

 

Module 02 - Exploring Administration + File System + YARN Commands +
Learning Objectives - In this module, you will learn formatting the NameNode, HDFS file system commands, MapReduce commands, different data loading techniques, cluster maintenance, etc.

Topics -
  • Analyzing ResourceManager and NameNode UI
  • Exploring HDFS File System Commands - [Hands-on]
  • Exploring Hadoop Admin Commands - [Hands-on]
  • Printing Hadoop Distributed File System
  • Running Map Reduce Program - [Hands-on]
  • Killing Job
  • Data Loading in Hadoop - [Hands-on]
    1. Copying Files from DFS to Unix File System
    2. Copying Files from Unix File System to DFS
    3. Understanding Parallel copying of data to HDFS - [Hands-on]
  • Executing MapReduce Jobs
  • Different techniques to move data to HDFS - [Hands-on]
  • Backup and Recovery of Hadoop cluster - [Activity]
  • Commissioning and Decommissioning a node in Hadoop cluster. - [Activity]
  • Understanding Hadoop Safe Mode - maintenance state of the NameNode - [Hands-on]
  • Configuring Trash in HDFS - [POC]
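As a complement to the shell-based data-loading techniques listed above, here is a minimal Scala sketch that performs the same copy programmatically through Hadoop's FileSystem API. It assumes a running HDFS reachable via the cluster's core-site.xml and a hypothetical local file /tmp/sample.txt.

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

object HdfsCopyExample {
  def main(args: Array[String]): Unit = {
    val conf = new Configuration()              // picks up core-site.xml / hdfs-site.xml from the classpath
    val fs   = FileSystem.get(conf)

    // Equivalent of: hdfs dfs -copyFromLocal /tmp/sample.txt /user/npn/input/
    fs.copyFromLocalFile(new Path("/tmp/sample.txt"), new Path("/user/npn/input/sample.txt"))

    // Equivalent of: hdfs dfs -ls /user/npn/input
    fs.listStatus(new Path("/user/npn/input")).foreach(s => println(s.getPath))

    fs.close()
  }
}
```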

Check E-Learning for more Assignments + Use cases + Project work + Materials + Case studies

View Module Presentation

 

Module 03 - MapReduce Programming +
Learning Objectives - In this module, you will understand how the MapReduce framework works.

Topics -
  • Understanding Key/Value in MapReduce
    1. What does it mean?
    2. Why key/value data?
  • Hadoop Topology Cluster
    1. The 0.20 MapReduce Java API
    2. The Reducer class
    3. The Mapper class
    4. The Driver class
  • Flow of Operations in MapReduce
  • Implementing Word Count Program - [Hands-on]
  • Exploring Default InputFormat - TextInputFormat
  • Submission & Initializing of MapReduce job - [Activity]
  • Handling MapReduce Job
  • Exploring Hadoop Datatypes
  • Understanding Data Locality
  • Serialization & DeSerialization
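To make the Mapper/Reducer/Driver flow above concrete, here is a minimal word-count sketch written in Scala against the Hadoop MapReduce Java API (the course itself codes MapReduce in Java); input and output paths are taken from the command line, and all class names are illustrative.

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.hadoop.io.{IntWritable, LongWritable, Text}
import org.apache.hadoop.mapreduce.{Job, Mapper, Reducer}
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat

// Mapper: emit (word, 1) for every token in the input line.
class TokenizerMapper extends Mapper[LongWritable, Text, Text, IntWritable] {
  private val one = new IntWritable(1)
  override def map(key: LongWritable, value: Text,
                   ctx: Mapper[LongWritable, Text, Text, IntWritable]#Context): Unit =
    value.toString.split("\\s+").filter(_.nonEmpty).foreach(w => ctx.write(new Text(w), one))
}

// Reducer: sum the counts received for each word.
class SumReducer extends Reducer[Text, IntWritable, Text, IntWritable] {
  override def reduce(key: Text, values: java.lang.Iterable[IntWritable],
                      ctx: Reducer[Text, IntWritable, Text, IntWritable]#Context): Unit = {
    var sum = 0
    val it = values.iterator()
    while (it.hasNext) sum += it.next().get()
    ctx.write(key, new IntWritable(sum))
  }
}

// Driver: configure and submit the job.
object WordCountDriver {
  def main(args: Array[String]): Unit = {
    val job = Job.getInstance(new Configuration(), "word count")
    job.setJarByClass(classOf[TokenizerMapper])
    job.setMapperClass(classOf[TokenizerMapper])
    job.setReducerClass(classOf[SumReducer])
    job.setOutputKeyClass(classOf[Text])
    job.setOutputValueClass(classOf[IntWritable])
    FileInputFormat.addInputPath(job, new Path(args(0)))
    FileOutputFormat.setOutputPath(job, new Path(args(1)))
    System.exit(if (job.waitForCompletion(true)) 0 else 1)
  }
}
```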

For more assignments check E-Learning

 

Module 04 - Hive and Hive QL +
Learning Objectives - In this module, you will learn Hive and its similarity to SQL, understand Hive concepts and Hive data types, and load and query data in Hive.

Topics -
  • A Walkthrough of Hive Architecture
  • Understanding Hive Query Patterns
  • Internal vs External tables
  • Different ways to describe Hive tables
  • [Use case] - Discussing where to use which type of table.
  • Different ways to load data into Hive tables - [Activity]
    1. Loading data from Local File System to hive Tables.
    2. Loading data from HDFS to Hive Tables.
  • Exploring Hive Complex Data types. - [Hands-on]
    1. Arrays
    2. Maps
    3. Structs
  • Exploring Hive built-in Functions.

For more assignments check E-Learning

 

Module 05 - Hive Optimization [Hadoop Development, Testing] +
Learning Objectives - In this module, you will understand Advanced Hive concepts such as Partitioning, Bucketing, Dynamic Partitioning, different Storage formats etc.

Topics -
  • Understanding Hive Complex Data types
    1. Arrays,
    2. Map
    3. Struct
  • Partitioning
  • [Use case] - Using a Telecom dataset to learn which fields to use for Partitioning.
  • Dynamic Partitioning
  • [Use case] - Using an IoT dataset to learn Dynamic Partitioning.
  • Hive Bucketing
  • Bucketing VS Partitioning
  • Dynamic Partitioning with Bucketing
  • Exploring different Input Formats in Hive
    1. TextFile Format - [Activity]
    2. SequenceFile Format - [Activity]
    3. RC File Format - [Activity]
    4. ORC Files in Hive - [Activity]
  • Using different file formats and capturing Performance reports - [POC]
  • Map-side join - [Hands-on]
  • Reduce-side join - [Hands-on]
  • [Use case] - Looking at different problems to which Map-side and Reduce-side joins can be applied.
  • Map-side join VS Reduce-side join - [Hands-on]
  • Writing custom UDF - [Hands-on]
  • Accessing Hive with JDBC - [Hands-on]
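The last topic, accessing Hive with JDBC, can look roughly like the Scala sketch below. It assumes HiveServer2 running on localhost:10000, a hypothetical table link_ratings with columns category and rating, and the Hive JDBC driver (org.apache.hive:hive-jdbc) on the classpath.

```scala
import java.sql.DriverManager

object HiveJdbcExample {
  def main(args: Array[String]): Unit = {
    Class.forName("org.apache.hive.jdbc.HiveDriver")   // register the Hive JDBC driver
    val conn = DriverManager.getConnection("jdbc:hive2://localhost:10000/default", "", "")
    val stmt = conn.createStatement()

    // Average rating per category -- Hive runs this as a distributed job on the cluster.
    val rs = stmt.executeQuery(
      "SELECT category, AVG(rating) AS avg_rating FROM link_ratings GROUP BY category")
    while (rs.next())
      println(s"${rs.getString("category")}\t${rs.getDouble("avg_rating")}")

    rs.close(); stmt.close(); conn.close()
  }
}
```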
More Assignments + Use cases + Project work + Materials can be found in E-Learning
 
Module 06 - Sqoop +
Learning Objectives - In this module, you will learn how to import and export data between traditional SQL databases, such as Oracle, and Hadoop using Sqoop.

Topics -

  • Sqoop Overview
  • How does Sqoop work
  • Sqoop JDBC Driver and Connectors
  • Sqoop Importing Data
  • Various Options to Import Data
    1. Table Import
    2. Binary Data Import
    3. Speed Up the Import
    4. Filtering Import
    5. Full Database Import

Check E-Learning for more Assignments + Use cases + Project work + Materials + Case studies
Understanding & Building Data pipeline Architecture using Pig and Hive +

Project Description - We will use the U.S. Department of Transportation's (DOT) Bureau of Transportation Statistics (BTS) data, which tracks the on-time performance of domestic flights operated by large air carriers. Summary information on the number of on-time, delayed, canceled, and diverted flights appears in DOT's monthly Air Travel Consumer Report, published about 30 days after the month's end, as well as in summary tables posted on the BTS website.

 

Architecture diagram -

 

Real-time Analytics using Apache Storm

Module 07 - Exploring Storm Technology Stack +
Learning Objectives - In this module you'll learn Storm Installation, different run modes of Storm, Spouts, Bolts, creating a simple Storm program and different topologies available in Storm.

Topics -
  • Understanding Complex Event Processing
  • Introduction to Apache Storm
  • Use cases of Apache Storm
  • Storm Architecture
  • Components of Storm Cluster
  • Storm Processes
    1. Nimbus
    2. Supervisor
    3. Worker Process
  • Understanding Topology
    1. Spout
    2. Bolt
  • Creating First Topology
  • Simple 'Hello World' Topology - [Hands-on]
    1. Implementing Spout
    2. Implementing Bolt
    3. Submitting the Topology
  • Adding Parallelism in a Storm topology - [Hands-on]
  • Reading data from a File - [Hands-on]
  • Representing Data using Tuples - [Hands-on]
  • Accessing Data from Tuples - [Hands-on]
  • Writing data to a File - [Hands-on]
  • LocalCluster VS StormSubmitter
  • Deploying a Storm Topology in a Cluster
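A minimal sketch of wiring and submitting a topology locally, in Scala against Storm's Java API. CdrSpout here is a hypothetical placeholder for the spout you write in the hands-on sessions, and DroppedCallKpiBolt refers to the bolt sketched under Project #2 above.

```scala
import org.apache.storm.{Config, LocalCluster}
import org.apache.storm.topology.TopologyBuilder
import org.apache.storm.tuple.Fields

object CdrTopology {
  def main(args: Array[String]): Unit = {
    val builder = new TopologyBuilder
    builder.setSpout("cdr-spout", new CdrSpout, 1)                 // CdrSpout: hypothetical spout emitting (towerId, callStatus)
    builder.setBolt("kpi-bolt", new DroppedCallKpiBolt, 2)         // bolt sketched under Project #2 above
           .fieldsGrouping("cdr-spout", new Fields("towerId"))     // Fields Grouping: all tuples of a tower go to one executor

    // LocalCluster runs the topology in-process; use StormSubmitter.submitTopology(...)
    // instead to deploy the same topology to a real cluster.
    val cluster = new LocalCluster()
    cluster.submitTopology("cdr-topology", new Config(), builder.createTopology())
    Thread.sleep(30000)                                            // let it run for 30 seconds
    cluster.killTopology("cdr-topology")
    cluster.shutdown()
  }
}
```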

Network Error Analysis - [Real-time industry use case]

Use case Description :

Given a CDR (Call Detail Record), a data record produced in the telecom industry, analyze the reasons for the various errors that occurred in the network:

1. Report the call drops that occurred due to network errors.
2. Report the call drops that occurred due to subscriber errors.
3. Raise a notification if certain error thresholds are reached.


 For more assignments check E-Learning

 

Module 08 - Stream Grouping +
Learning Objectives - In this module, you will learn the different stream groupings available in Apache Storm.

Topics - 
  • Understanding what is Stream Grouping
    1. Shuffle Grouping
    2. Fields Grouping
    3. Partial Key Grouping
    4. All Grouping
    5. Global Grouping
  • Implementing Shuffle Grouping - [Hands-on]
  • Implementing Fields Grouping - [Hands-on]
  • Implementing Partial Key Grouping - [Hands-on]
  • Implementing All Grouping - [Hands-on]
  • Implementing Global Grouping - [Hands-on]
For more assignments check E-Learning

 

Apache Spark using Scala

Module 09 - Introduction to Scala for Apache Spark +
Learning Objectives - In this module, you will understand the basics of Scala that are required for programming Spark applications. You will learn about the basic constructs of Scala such as variable types, control structures, collections, and more.

Topics -
  • Introduction to Scala REPL
  • Basic Scala operations
  • Exploring different Variable Types
    1. Mutable Variables - [Hands-on]
    2. Immutable Variables - [Hands-on]
  • Type Inference in Scala - [Hands-on]
  • Block Expressions
  • Exploring Lazy evaluation in Scala
  • Control Structures in Scala
  • Exploring different variants of for loop
    1. Enhanced for loop. - [Hands-on]
    2. For loop with yield. - [Hands-on]
    3. For Loop with if conditions : Pattern Guards - [Hands-on]
  • Match Expressions - [Hands-on]
  • Exploring Functions in Scala
  • Exploring Procedures in Scala

  • Collections in Scala
    1. Array
    2. ArrayBuffer
    3. Map
    4. Tuples
    5. Lists
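A few REPL-style one-liners, assuming nothing beyond the Scala standard library, tying together the constructs listed above: immutable vs mutable variables, for with yield and guards, match expressions, and the collections.

```scala
// Paste into the Scala REPL.
val pi = 3.14159                        // immutable variable, type inferred as Double
var counter = 0                         // mutable variable
counter += 1

// for with yield builds a new collection; the `if` clause acts as a guard.
val evenSquares = for (n <- 1 to 10 if n % 2 == 0) yield n * n   // Vector(4, 16, 36, 64, 100)

// Match expression with a pattern guard.
def describe(x: Any): String = x match {
  case 0               => "zero"
  case n: Int if n > 0 => "positive int"
  case s: String       => s"a string of length ${s.length}"
  case _               => "something else"
}

// Collections: Array, ArrayBuffer, Map, tuple, List.
import scala.collection.mutable.ArrayBuffer
val arr     = Array(1, 2, 3)
val buf     = ArrayBuffer("a", "b"); buf += "c"
val capital = Map("India" -> "New Delhi", "France" -> "Paris")
val pair    = ("NPN", 2017)             // a Tuple2: access with pair._1 and pair._2
val langs   = List("Scala", "Java", "Python").map(_.toUpperCase)
```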

For more assignments check E-Learning
Module 10 - OOPs and Functional Programming in Scala +
Learning Objectives - In this module, you will learn object-oriented programming in Scala (classes, constructors, singletons, companion objects, and traits) and functional programming concepts such as higher-order and anonymous functions.

Topics -

  • Class in Scala
  • Getters and Setters
  • Custom Getters and Setters
  • Properties with only Getters
  • Auxiliary Constructor
  • Primary Constructor
  • Singletons
  • Companion Objects
  • Extending a Class
  • Overriding Methods
  • Traits as Interfaces
  • Layered Traits
  • Functional Programming
  • Higher Order Functions
  • Anonymous Functions
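The sketch below, assuming only the Scala standard library, touches several of the topics above: a class with a primary and an auxiliary constructor, a companion object, a trait used as an interface, and a higher-order function applied with an anonymous function. All names are illustrative.

```scala
// Trait used as an interface.
trait Greeter {
  def greet(): String
}

// Class with a primary constructor (name, age) extending the trait.
class Person(val name: String, var age: Int) extends Greeter {
  def this(name: String) = this(name, 0)          // auxiliary constructor delegating to the primary one
  override def greet(): String = s"Hello, I am $name ($age)"
}

// Companion object acting as a factory (a common use of singletons).
object Person {
  def apply(name: String, age: Int): Person = new Person(name, age)
}

object Demo {
  // Higher-order function: takes another function as a parameter.
  def applyTwice(x: Int, f: Int => Int): Int = f(f(x))

  def main(args: Array[String]): Unit = {
    val p = Person("Asha", 30)                     // calls the companion object's apply
    println(p.greet())
    println(applyTwice(3, n => n * n))             // anonymous function: 3 -> 9 -> 81
  }
}
```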

For more assignments check E-Learning
 
Module 11 - Overview of Apache Spark +
Learning Objectives - In this module, you will get an overview of Apache Spark and its ecosystem, the RDD abstraction and its properties, partitions, the Spark architecture, and how Spark compares with Hadoop.

Topics -

  • Overview of Apache Spark
  • Features of Apache Spark
  • Exploring Data sharing in MapReduce
  • Exploring Data sharing in Apache Spark
  • Spark Ecosystem
  • Introduction to RDD
  • Exploring Properties of RDD
    1. Immutable
    2. Lazy evaluated
    3. Cacheable
    4. Type Inferred
  • Understanding Partitions in Spark
  • Characteristics of Partitions in Spark
  • Spark Architecture
  • Spark Modes
  • Hadoop VS Spark
  • Ecosystem of Hadoop VS Spark

 
 
Module 12 - Spark Common Operations +
Learning Objectives - In this module, you will install and start Spark, create RDDs, and work with transformations, actions, caching, and SparkSession.

Topics -
  • Installing Apache Spark on Windows - [Hands-on Activity]
  • Starting Spark Shell
  • Exploring different ways to start Spark
  • Understanding Spark UI - [Activity]
  • RDD Creations - [Hands-on]
    1. Loading a file
    2. Parallelize Collections
  • Exploring RDD Operations
    1. Transformations
    2. Actions
  • RDD Actions - [Hands-on]
    1. count()
    2. first()
    3. take(int)
    4. saveAsTextFile(path:String)
    5. reduce(func)
    6. collect()
  • RDD Transformations - [Hands-on]
    1. map(func)
    2. foreach(func)
    3. filter(func)
    4. coalesce(int)
  • Passing functions to Spark High Order functions
    1. Anonymous function - [Industry Practices]
    2. Passing Named function - [Industry Practices]
    3. Static singleton function - [Industry Practices]
  • [Use case] - Analyzing Movie lens dataset and performing Actions and Transformation
  • Chaining Transformation and Actions in Spark
  • Running Spark programs in eclipse using Maven - [Industry Practices]
  • Understanding and initializing SparkSession  i.e Spark 2.0 entry point
  • [Performance Improvement] - Understanding Spark Caching
  • Loading RDD
    1. textFile
    2. wholeTextFiles
  • Saving RDD
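A minimal sketch of the RDD workflow covered above, assuming Spark 2.x on the classpath and a hypothetical local file /tmp/ratings.csv: it creates a SparkSession, builds an RDD both from a collection and from a file, chains transformations and actions, and caches an intermediate result.

```scala
import org.apache.spark.sql.SparkSession

object RddBasics {
  def main(args: Array[String]): Unit = {
    // SparkSession is the Spark 2.0 entry point; sparkContext is still used for RDDs.
    val spark = SparkSession.builder()
      .appName("rdd-basics")
      .master("local[*]")                                   // run locally with all cores
      .getOrCreate()
    val sc = spark.sparkContext

    // RDD from a collection (parallelize) and from a file (textFile).
    val nums  = sc.parallelize(1 to 100, numSlices = 4)
    val lines = sc.textFile("/tmp/ratings.csv")             // hypothetical input file

    // Chaining transformations (lazy) and actions (eager).
    val evenSum = nums.filter(_ % 2 == 0).reduce(_ + _)
    println(s"sum of evens = $evenSum")

    val cleaned = lines.map(_.trim).filter(_.nonEmpty).cache()   // cached: reused by both actions below
    println(s"lines = ${cleaned.count()}, first = ${cleaned.first()}")
    cleaned.coalesce(1).saveAsTextFile("/tmp/ratings-clean")     // write out a single part file

    spark.stop()
  }
}
```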

 

For more assignments check E-Learning
 
Module 13 - Playing with RDD's +
Learning Objectives - In this module, you will work with RDDs in more depth: caching and persistence, the pair RDD extensions, and aggregate functions such as groupByKey and reduceByKey.

Topics -

  • RDD Caching and Persistence
  • reduce() vs fold()
  • Scala RDD Extensions
    1. DoubleRDDFunctions
    2. PairRDDFunctions
    3. OrderedRDDFunctions
    4. SequenceFileRDDFunctions
  • Exploring Aggregate Functions
  • groupByKey function
  • reduceByKey function
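The difference between groupByKey and reduceByKey, and between reduce() and fold(), in a short sketch; `sc` is assumed to be an existing SparkContext (for example inside the spark-shell), and the data is illustrative.

```scala
// Assumes `sc` is an existing SparkContext, e.g. inside spark-shell.
val sales = sc.parallelize(Seq(("IN", 100), ("US", 250), ("IN", 50), ("US", 75)))

// reduceByKey combines values per key on the map side before shuffling (usually preferred).
val totals = sales.reduceByKey(_ + _)                 // ("IN",150), ("US",325)

// groupByKey ships every value across the network, then we aggregate -- same result, more shuffle.
val totalsViaGroup = sales.groupByKey().mapValues(_.sum)

// reduce() needs no starting value; fold() takes a zero element of the same type.
val amounts = sales.values
println(amounts.reduce(_ + _))                        // 475
println(amounts.fold(0)(_ + _))                       // 475
```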

For more assignments check E-Learning
Module 14 - DataFrames and Spark SQL +
Learning Objectives - In this module, you will learn about Spark SQL which is used to process structured data with SQL queries. You will learn about data-frames and datasets in Spark SQL and perform SQL operations on data-frames.

Topics -
  • Introduction to SparkSQL
  • Features of Spark SQL
  • Overview of DataFrames
  • Creating a DataFrame from JSON file - [Hands-on]
  • Creating a custom schema and querying
  • Understanding DataFrame explain() function [Industry practices]
  • Registering DataFrame as a Table
  • Operations supported by DataFrames
  • Converting RDD to DataFrame
  • [Use case] Analysing Employee dataset
  • Exploring Pivots - [Industry practices]
  • Join Operations in DataFrame

E-Commerce Data Analysis - [Real-time industry use case]

Use case Description :

Given an E-commerce dataset of an organization's customer information together with an item dataset, we have to perform the analysis below.

1. Filter all the customers belonging to group 400.
2. Find the number of customers from each country and the total item price of all purchased items from each country (see the sketch below).
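A sketch of both tasks with the DataFrame API, assuming Spark 2.x and a hypothetical /tmp/customers.json with columns customer_id, group_id, country, and item_price; the real dataset's schema may differ.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{count, sum}

object ECommerceAnalysis {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("ecommerce").master("local[*]").getOrCreate()

    // Hypothetical schema: customer_id, group_id, country, item_price.
    val customers = spark.read.json("/tmp/customers.json")

    // 1. Filter all customers belonging to group 400.
    customers.filter(customers("group_id") === 400).show()

    // 2. Customers per country and total item price of purchases per country.
    customers.groupBy("country")
             .agg(count("customer_id").as("num_customers"),
                  sum("item_price").as("total_item_price"))
             .show()

    spark.stop()
  }
}
```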


 

For more assignments check E-Learning

 

Apache Kafka - Real-time Messaging System

Module 01 - Understanding Apache Kafka +
Learning Objectives - In this module, you will understand messaging systems, Kafka, and the Kafka architecture.

Topics -
  • Integration without Messaging System
  • Integration with Messaging System
  • What is Kafka
  • Download and install Kafka
  • Components of Messaging System
  • Exploring Kafka components
    1. Producer
    2. Consumer
    3. Broker
    4. Cluster
    5. Topic
    6. Partitions
    7. Offset
    8. Consumer groups
  • Installing Kafka
  • Kafka concepts - [Hands-on]
    1. Starting Zookeeper
    2. Starting Kafka Server
    3. Creating a topic
    4. Start a console producer
    5. Start a console consumer
    6. Sending and receiving messages

View Module Presentation

Check E-Learning for more Assignments + Use cases + Project work + Materials + Case studies

 

Module 02 - Exploring Kafka Core API +
Learning Objectives - In this module, you will explore Kafka's core APIs and write a Kafka producer and consumer in Java.

Topics -
  • Adding Kafka dependency to Maven - [Activity]
  • Exploring Kafka core API's
  • Exploring Kafka producer API
  • Kafka Producer coding using Java - [Hands-on]
  • Exploring Kafka consumer API
  • Kafka Consumer coding using Java - [Hands-on]
  • Changing the configuration of a topic - [Hands-on]
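The course codes the producer and consumer in Java; the sketch below drives the same Kafka client API from Scala. It assumes a broker on localhost:9092, a topic named test-topic, and org.apache.kafka:kafka-clients on the classpath.

```scala
import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

object SimpleProducer {
  def main(args: Array[String]): Unit = {
    val props = new Properties()
    props.put("bootstrap.servers", "localhost:9092")                // assumed broker address
    props.put("key.serializer",   "org.apache.kafka.common.serialization.StringSerializer")
    props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")

    val producer = new KafkaProducer[String, String](props)
    for (i <- 1 to 10)
      producer.send(new ProducerRecord[String, String]("test-topic", s"key-$i", s"message $i"))
    producer.flush()
    producer.close()
  }
}
```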

Check E-Learning for more Assignments + Use cases + Project work + Materials + Case studies

 

 

NoSQL - MongoDB

Module 01 - MongoDB Getting Started +
Learning Objectives -  In this module, you will get an understanding of basics of MongoDB.

Topics -
  • What is MongoDB
  • Difference between MongoDB and RDBMS
  • Installing MongoDB on Windows - [Activity]
  • Configuring MongoDB server with configuration file - [Activity]
  • Creating First Database - [Hands-on]
  • Creating Document and saving it to collection - [Hands-on]
  • Dropping a database - [Hands-on]
  • Creating a Collection - Using db.createCollection(name,options) - [Hands-on]
  • Understanding Capped Collections - [Industry Practices]
  • Dropping a Collection - [Hands-on]

For more Assignments + Use cases + Project work + Materials check E-Learning

Module 02 - MongoDB CRUD Operations - Create, Read, Update and Delete +
Learning Objectives -  In this module, you will perform CRUD operations.

Topics -
  • Reading/Inserting a document in a collection using a JavaScript file - [Hands-on]
  • Inserting Array of Documents - [Hands-on]
  • Reading a Document - Querying - [Hands-on]
  • Reading a Document with $lt, $gt operator - [Hands-on]
  • Other Query Operators - [Hands-on]
  • Updating Documents - [Hands-on]
  • Deleting documents - [Hands-on]

For more Assignments + Use cases + Project work + Materials check E-Learning

Module 03 - Connecting MongoDB with Java +
Learning Objectives -  In this module, you will connect to MongoDB using Java.

Topics -
  • Creating Maven Project & Adding dependencies for MongoDB-Java Driver - [Activity]
  • Connecting to MongoDB server - [Hands-on]
  • Displaying all databases - [Hands-on]
  • Creating a database and collection - [Hands-on]
  • Reading/Inserting a document in collection using Java - [Hands-on]
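The module uses the MongoDB Java driver; an equivalent sketch from Scala is shown below. It assumes a local mongod on the default port 27017 and org.mongodb:mongodb-driver-sync on the classpath; the database and collection names are illustrative.

```scala
import com.mongodb.client.MongoClients
import org.bson.Document

object MongoJavaDriverExample {
  def main(args: Array[String]): Unit = {
    val client = MongoClients.create("mongodb://localhost:27017")   // connect to the local server
    val db     = client.getDatabase("npn_training")                 // created lazily on first write
    val coll   = db.getCollection("students")

    // Insert a document.
    coll.insertOne(new Document("name", "Asha").append("course", "Big Data Architect"))

    // Read documents back (equivalent of db.students.find() in the mongo shell).
    val cursor = coll.find().iterator()
    while (cursor.hasNext) println(cursor.next().toJson)
    cursor.close()

    client.close()
  }
}
```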

For more Assignments + Use cases + Project work + Materials check E-Learning

How will I execute the Practicals?

We will help you set up NPN Training's Virtual Machine + the Cloudera Virtual Machine on your system with local access. Detailed installation guides for setting up the environment are provided in the E-Learning portal.


Is Java a pre-requisite to learn Big Data and Hadoop?

Java is not a hard prerequisite. We will provide you with video tutorials for Java, so you can start immediately; Java is only introduced in the course from the third week (MapReduce), which gives you enough time to clear your Java concepts beforehand.

NPN Training Certification Process:

1. Once you successfully complete the project (reviewed by an NPN Training expert), you will be awarded NPN Training's Big Data and Hadoop certificate.

2. Each certificate carries a unique certification ID, which can be validated through the Verify Certificate link.

Shreyas Gowda
Company:
Facebook



Best institute to learn the Big Data Architect course. Naveen follows more of a hands-on, problem-solving approach to programming and also guides us in solving many use cases, which helps us understand the concepts well. I would definitely recommend NPN Training!

Surabhi KS
Company:
Facebook



I opted for the Big Data Architect course at this institute. I had enquired at many other institutes before, and one of my friends referred me here. After the first few classes I was convinced that I had made the right decision by joining. The materials and assignments given are very helpful. He summarizes all the topics covered at the beginning and end of each class, which helps us remember them. I would definitely recommend this institute to my friends. It's totally worth the amount we pay.

Contact us


+91 8095918383 | +91 9535584691

Upcoming batches

  • Dec 09 - Big Data Architect Training - Timings: Weekend Saturday batch - Fees: 23,000 INR
  • Dec 23 - Big Data Architect Training - Timings: Weekend Saturday batch - Fees: 23,000 INR
  • Jan 13 - Big Data Architect Training - Timings: Weekend Saturday batch - Fees: 23,000 INR

Course Features

Big Data Architect Masters Program Training
4.8 stars - based on 150 reviews