Table of Contents: 1. HDFS  2. Block  3. Block Size  4. What happens when the block size is small  5. What happens when the block size is large  6. Important Points to consider while choosing Block Size
HDFS stands for Hadoop Distributed File System, the world's most reliable distributed storage system. HDFS is a …
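Since the post is about choosing a block size, here is a rough Java sketch (not taken from the post) of writing a file to HDFS with an explicit block size. dfs.blocksize is the standard client-side property; the 64 MB value, the /tmp path, and the BlockSizeDemo class name are made-up examples:

import java.io.BufferedWriter;
import java.io.OutputStreamWriter;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockSizeDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // dfs.blocksize is the HDFS block size property (128 MB by default in Hadoop 2.x).
        // Setting it in the client configuration only affects files written by this client.
        conf.setLong("dfs.blocksize", 64L * 1024 * 1024); // 64 MB, example value only

        FileSystem fs = FileSystem.get(conf);
        Path path = new Path("/tmp/blocksize-demo.txt"); // placeholder path

        try (FSDataOutputStream out = fs.create(path);
             BufferedWriter writer = new BufferedWriter(
                     new OutputStreamWriter(out, StandardCharsets.UTF_8))) {
            writer.write("hello hdfs");
        }

        // Read back the block size recorded for the file.
        FileStatus status = fs.getFileStatus(path);
        System.out.println("Block size: " + status.getBlockSize() + " bytes");
    }
}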

Spark splits data into partitions, and computation is done in parallel for each partition. It is very important to understand how data is partitioned and when you need to manually modify the partitioning to run a Spark application efficiently. In the Spark RDD API there are two methods available to increase or …
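The excerpt cuts off before naming the two RDD methods; assuming they are repartition() and coalesce(), a minimal Java sketch of how each changes the partition count could look like this (the PartitionDemo class and the sample numbers are invented for illustration):

import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class PartitionDemo {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("PartitionDemo").setMaster("local[*]");
        JavaSparkContext sc = new JavaSparkContext(conf);

        // Create an RDD with 4 partitions.
        JavaRDD<Integer> rdd = sc.parallelize(Arrays.asList(1, 2, 3, 4, 5, 6, 7, 8), 4);
        System.out.println("Initial partitions: " + rdd.getNumPartitions());

        // repartition() can increase or decrease the partition count (full shuffle).
        JavaRDD<Integer> more = rdd.repartition(8);
        System.out.println("After repartition(8): " + more.getNumPartitions());

        // coalesce() only decreases the partition count and avoids a full shuffle.
        JavaRDD<Integer> fewer = rdd.coalesce(2);
        System.out.println("After coalesce(2): " + fewer.getNumPartitions());

        sc.close();
    }
}

repartition() performs a full shuffle and can raise or lower the partition count, while coalesce() merges existing partitions and is therefore cheaper when you only need fewer of them.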

cache() and persist() are two methods available in Spark to improve the performance of Spark computations. These methods help save intermediate results so they can be reused in subsequent stages. These interim results are kept as RDDs in memory (the default) or on disk, and/or replicated. RDDs can be cached …
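As a rough Java illustration of the difference (the input path and the ERROR filter are made up): cache() is shorthand for persisting with the default MEMORY_ONLY level, while persist() lets you choose another storage level:

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.storage.StorageLevel;

public class CacheDemo {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("CacheDemo").setMaster("local[*]");
        JavaSparkContext sc = new JavaSparkContext(conf);

        // Hypothetical input path -- replace with a real file.
        JavaRDD<String> lines = sc.textFile("hdfs:///data/input.txt");
        JavaRDD<String> errors = lines.filter(line -> line.contains("ERROR"));

        // cache() keeps the RDD in memory only (shorthand for persist(MEMORY_ONLY)).
        errors.cache();

        // persist() lets you pick a storage level, e.g. spill to disk when memory is short.
        // errors.persist(StorageLevel.MEMORY_AND_DISK());

        // Both actions below reuse the cached partitions instead of re-reading the file.
        long total = errors.count();
        long distinct = errors.distinct().count();
        System.out.println(total + " / " + distinct);

        sc.close();
    }
}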

Kafka Commands

How to start ZooKeeper? > zkServer.sh start
How to start a Kafka broker? > kafka-server-start.sh /home/npntraining/opt/kafka_2.11-0.10.2.1/config/server.properties
How to create a topic? > kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 1 --partitions 1 --topic Hello-Kafka

Table of Contents: 1. Introduction  2. Output Delimiter Configuration Property  3. Example
Introduction: Hadoop's default output delimiter (the character separating the output key and value) is a tab ("\t"). This post explains how to change the default Hadoop output delimiter. Output Delimiter Configuration Property: The output delimiter of a Hadoop job can …
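The excerpt stops before showing the property itself; assuming the post uses the standard TextOutputFormat separator property (mapreduce.output.textoutputformat.separator in the new API, mapred.textoutputformat.separator in older releases), a minimal Java driver sketch could look like this (class name and paths are placeholders):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class CustomDelimiterDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Replace the default tab between key and value with a comma.
        // New-API property name; older releases use "mapred.textoutputformat.separator".
        conf.set("mapreduce.output.textoutputformat.separator", ",");

        Job job = Job.getInstance(conf, "custom delimiter example");
        job.setJarByClass(CustomDelimiterDriver.class);
        // Mapper/Reducer setup omitted -- this sketch only shows where the delimiter is configured.
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}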

When working with Apache Kafka, you might want to write data from a Kafka topic to a local text file. This is very easy to do with Kafka Connect, a framework that provides scalable and reliable streaming of data to and from Apache Kafka. With Kafka …

JSON is a very common way to store data. However, JSON can get messy, and parsing it can get difficult. Here are some examples of parsing nested data structures in JSON with Spark DataFrames (the examples here were done with Spark 1.6.0).   Sample JSON file: { "user": "gT35Hhhre9m", "dates": ["2016-01-29", "2016-01-28"], "status": …
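As a hedged sketch of what such parsing can look like in the Spark 1.6 Java API (the file name sample.json is a placeholder, and only the user and dates fields from the snippet are used, since the status field is truncated above):

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.SQLContext;

import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.explode;

public class NestedJsonDemo {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("NestedJsonDemo").setMaster("local[*]");
        JavaSparkContext sc = new JavaSparkContext(conf);
        SQLContext sqlContext = new SQLContext(sc); // Spark 1.6-style entry point

        // "sample.json" is a placeholder for a file shaped like the snippet above
        // (one JSON object per line).
        DataFrame df = sqlContext.read().json("sample.json");
        df.printSchema();

        // Flatten the nested "dates" array: one output row per (user, date) pair.
        DataFrame flattened = df.select(col("user"), explode(col("dates")).alias("date"));
        flattened.show();

        sc.close();
    }
}

explode() turns the nested dates array into one row per element, which is the usual first step when flattening nested JSON into a tabular DataFrame.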

Table of Contents: 1. What are the main components when using Quartz in Java?  2. How do you start a Quartz process which will start executing jobs? Ans. To begin a Quartz process you must initialize all the components (scheduler, job, and trigger) and then call the start method on the scheduler. 3. …
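As a rough Java illustration of that answer (Quartz 2.x builder API assumed; the HelloJob class and the job/trigger names are invented for the example):

import org.quartz.Job;
import org.quartz.JobBuilder;
import org.quartz.JobDetail;
import org.quartz.JobExecutionContext;
import org.quartz.Scheduler;
import org.quartz.SimpleScheduleBuilder;
import org.quartz.Trigger;
import org.quartz.TriggerBuilder;
import org.quartz.impl.StdSchedulerFactory;

public class QuartzStartDemo {

    // A hypothetical job that just prints a line each time it fires.
    public static class HelloJob implements Job {
        @Override
        public void execute(JobExecutionContext context) {
            System.out.println("HelloJob fired");
        }
    }

    public static void main(String[] args) throws Exception {
        // 1. Scheduler -- obtained from a factory.
        Scheduler scheduler = StdSchedulerFactory.getDefaultScheduler();

        // 2. Job -- describes what to execute.
        JobDetail job = JobBuilder.newJob(HelloJob.class)
                .withIdentity("helloJob", "demoGroup")
                .build();

        // 3. Trigger -- describes when to execute it (every 10 seconds here).
        Trigger trigger = TriggerBuilder.newTrigger()
                .withIdentity("helloTrigger", "demoGroup")
                .startNow()
                .withSchedule(SimpleScheduleBuilder.simpleSchedule()
                        .withIntervalInSeconds(10)
                        .repeatForever())
                .build();

        // Register the job with its trigger, then start the scheduler.
        scheduler.scheduleJob(job, trigger);
        scheduler.start();
    }
}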