Kafka Commands

Published on

How to start ZooKeeper?
> zkServer.sh start
How to start a Kafka broker?
> kafka-server-start.sh /home/npntraining/opt/kafka_2.11-0.10.2.1/config/server.properties
How to create a topic?
> kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 1 --partitions 1 --topic Hello-Kafka

Aggregation on Text Fields in the MapReduce Example

Published on

Table of Contents
1 Introduction
2 Here is how to set things up to run the above MapReduce job:
2.1 Create an Executable Jar containing your MapReduce classes
2.2 Create a working Hadoop instance
2.3 You must first have a working Hadoop installation to run this on. I personally like to create a Docker container using the sequenceiq/docker-spark image.
2.4 Create … Continue reading Aggregation on Text Fields in the MapReduce Example

Compressing Intermediary Results of Map Output in Hadoop

Published on

It is generally recommended to always compress intermediate map output. This is because I/O operations and network transfers are the biggest bottlenecks in Hadoop, and compression can help with both. Map output is written to local disk and then transferred (shuffled) across the network to the reducer nodes. At this point in a MapReduce … Continue reading Compressing Intermediary Results of Map Output in Hadoop
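As a rough sketch of how map output compression might be switched on from the command line: the two property names below are the Hadoop 2.x ones, while the jar name, driver class, and paths are hypothetical placeholders. It assumes the job's driver runs through ToolRunner so that `-D` options are honored, and that the native Snappy library is installed; any other available codec class could be substituted.

```shell
# Hadoop 2.x properties that enable compression of intermediate map output.
MAP_COMPRESS="mapreduce.map.output.compress=true"
# Assumption: the Snappy codec is available on the cluster nodes.
MAP_CODEC="mapreduce.map.output.compress.codec=org.apache.hadoop.io.compress.SnappyCodec"

# Hypothetical jar, driver class, and paths; replace with your own.
run_job_with_map_compression() {
  hadoop jar my-job.jar com.example.MyDriver \
    -D "$MAP_COMPRESS" \
    -D "$MAP_CODEC" \
    "$1" "$2"
}

# Example call (requires a working Hadoop installation):
# run_job_with_map_compression /input /output
```

Only the map-side intermediate data is affected by these settings; the final job output stays uncompressed unless configured separately.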

Changing the Output Delimiter in t

Published on

Table of Contents
1 Introduction
2 Output Delimiter Configuration Property
3 Example

Introduction
Hadoop’s default output delimiter (the character separating the output key and value) is tab (“\t”). This post explains the best approach to changing the default Hadoop output delimiter.

Output Delimiter Configuration Property
The output delimiter of a Hadoop job can easily be altered by changing … Continue reading Changing the Output Delimiter in t
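The excerpt is cut off before it names the property, so the following is only a sketch of one common way to do this, not necessarily the post's own example. For jobs that write with TextOutputFormat, the Hadoop 2.x property is `mapreduce.output.textoutputformat.separator` (Hadoop 1.x used `mapred.textoutputformat.separator`). The jar name, driver class, and paths are hypothetical, and the `-D` option is assumed to be picked up by a driver that runs through ToolRunner.

```shell
# Hadoop 2.x property for the key/value separator used by TextOutputFormat.
# Here the separator is set to a comma instead of the default tab.
SEPARATOR_OPT="mapreduce.output.textoutputformat.separator=,"

# Hypothetical jar, driver class, and paths; replace with your own.
run_job_with_comma_separator() {
  hadoop jar my-job.jar com.example.MyDriver \
    -D "$SEPARATOR_OPT" \
    "$1" "$2"
}

# Example call (requires a working Hadoop installation):
# run_job_with_comma_separator /input /output
```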

Working with Nested JSON in Spark

Published on

JSON is a very common way to store data. However, JSON can get messy, and parsing it can get tricky. Here are a few examples of parsing nested data structures in JSON using Spark DataFrames (the examples here were done with Spark 1.6.0).
Sample JSON File: { “user”: “gT35Hhhre9m”, “dates”: [“2016-01-29”, “2016-01-28”], “status”: “OK”, “reason”: “some reason”, “content”: … Continue reading Working with Nested JSON in Spark

Message Retention in Kafka

Published on

The retention period of records in Kafka is configurable. The default retention period is 7 days. Retention can be set per topic, so each topic in the cluster can have its own retention period. The broker-level default is set in the server.properties file of the Apache Kafka distribution. The attribute is log.retention.hours=168. Let's say the … Continue reading Message Retention in Kafka
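To make the numbers concrete, here is a small sketch that converts the 168-hour default into milliseconds and shows how a per-topic override could be applied with `kafka-configs.sh` and the topic-level `retention.ms` setting. The topic name and ZooKeeper address are assumptions borrowed from the Kafka commands earlier on this page, and the helper function name is hypothetical.

```shell
# Convert a retention period in hours to milliseconds,
# the unit used by the topic-level retention.ms setting.
retention_hours=168
retention_ms=$((retention_hours * 60 * 60 * 1000))
echo "retention.ms=${retention_ms}"   # prints retention.ms=604800000

# Hypothetical helper: override retention for a single topic.
# Assumes ZooKeeper is reachable at localhost:2181.
set_topic_retention() {
  kafka-configs.sh --zookeeper localhost:2181 --alter \
    --entity-type topics --entity-name "$1" \
    --add-config "retention.ms=$2"
}

# Example call (requires a running Kafka cluster):
# set_topic_retention Hello-Kafka "$retention_ms"
```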

What is a Distributed System

Published on

A distributed system is a model in which components located on networked computers communicate and coordinate their actions by passing messages. Some important points: Distributed systems are designed to distribute the load across the system and process it simultaneously. To achieve simultaneous processing, the load … Continue reading What is a Distributed System