HDFS: HDFS stands for Hadoop Distributed File System, the world's most reliable distributed storage system. HDFS is a file system designed for storing very large files. Block: In Hadoop, a file is split into small chunks known as blocks; these are considered the smallest unit of data in a file system. …
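As a quick illustration of blocks (not part of the original excerpt), the sketch below uses the Hadoop FileSystem API to print a file's block size and the location of each of its blocks; the file path is a hypothetical example.

```java
// Minimal sketch: inspecting the block size and block locations of an HDFS file.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockInfo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();     // picks up core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);          // handle to the configured file system
        Path file = new Path("/data/sample.txt");      // hypothetical file path

        FileStatus status = fs.getFileStatus(file);
        System.out.println("Block size (bytes): " + status.getBlockSize());

        // Each BlockLocation describes one block of the file and the hosts storing its replicas.
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation block : blocks) {
            System.out.println("Offset " + block.getOffset()
                    + ", length " + block.getLength()
                    + ", hosts " + String.join(",", block.getHosts()));
        }
        fs.close();
    }
}
```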

Introduction: Following is a simple MapReduce example which is a little different from the standard "Word Count" example in that it takes tab-delimited text and counts the occurrences of values in a certain field. More details about implementing it are described below. package com.npntraining.hadoop.mapreduce; import org.apache.hadoop.fs.Path; import org.apache.hadoop.io.IntWritable; …
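The listing above is truncated; as a rough sketch of a job of this shape, the code below counts the occurrences of the values in one tab-separated field. The field index, class names and overall structure are assumptions, not the post's actual code.

```java
package com.npntraining.hadoop.mapreduce;

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class FieldCount {

    // Emits (value-of-field, 1) for every input line that has the field.
    public static class FieldMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final int FIELD_INDEX = 1;          // assumed field to count
        private static final IntWritable ONE = new IntWritable(1);
        private final Text outKey = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String[] fields = value.toString().split("\t");
            if (fields.length > FIELD_INDEX) {
                outKey.set(fields[FIELD_INDEX]);
                context.write(outKey, ONE);
            }
        }
    }

    // Sums the counts for each distinct field value.
    public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "field count");
        job.setJarByClass(FieldCount.class);
        job.setMapperClass(FieldMapper.class);
        job.setCombinerClass(SumReducer.class);   // summing is associative, so the reducer doubles as combiner
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```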

Introduction: Hadoop's default output delimiter (the character separating the output key and value) is a tab ("\t"). This post explains the best approach to change the default Hadoop output delimiter.   Output Delimiter Configuration Property: The output delimiter of a Hadoop job can easily be changed through the mapred.textoutputformat.separator configuration property. This property …
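A minimal sketch of what that looks like in a job driver is shown below; the property name is the one mentioned above, while newer Hadoop releases expose the same setting as mapreduce.output.textoutputformat.separator.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class DelimiterDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Replace the default tab ("\t") with a comma as the key/value separator.
        // Newer releases use mapreduce.output.textoutputformat.separator instead.
        conf.set("mapred.textoutputformat.separator", ",");
        Job job = Job.getInstance(conf, "custom output delimiter");
        // ... mapper, reducer and input/output paths configured as usual ...
    }
}
```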

In this blog post we explain how to "skip header and footer rows in Hive". In Hive we can ignore N rows from the top and bottom of a file using the TBLPROPERTIES clause. The TBLPROPERTIES clause provides various features which can be set as per our need.   …
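As a hedged illustration, the snippet below creates a text table over Hive JDBC with the two relevant properties set; the connection URL, table name and columns are assumptions.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class SkipHeaderFooter {
    public static void main(String[] args) throws Exception {
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        try (Connection conn = DriverManager.getConnection(
                     "jdbc:hive2://localhost:10000/default", "", "");
             Statement stmt = conn.createStatement()) {
            stmt.execute(
                "CREATE TABLE IF NOT EXISTS sales (id INT, amount DOUBLE) " +
                "ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' " +
                "STORED AS TEXTFILE " +
                // Skip 1 row from the top and 1 row from the bottom of each file.
                "TBLPROPERTIES ('skip.header.line.count'='1', 'skip.footer.line.count'='1')");
        }
    }
}
```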

A distributed system is a model in which components located on networked computers communicate and coordinate their actions by passing messages. Some important points: Distributed systems are designed in such a way that they distribute the load within the system and process it simultaneously. To …

What kind of issues are you facing while using a cluster? 1. Lack of configuration management. 2. Poor allocation of resources. 3. Lack of a dedicated network. 4. Lack of monitoring and metrics. 5. Ignorance of which log files contain what information. 6. Inadvertent introduction of single points of failure. Cluster issues …

I have seen many Hadoop professionals searching the internet for Big Data architect roles and responsibilities, so here I have tried to help them by putting together most of the points. These are the main tasks an architect needs to perform, and the skill set needed, to become a Big Data architect. …

Use case: Given below is data regarding the electrical consumption of an organization. It contains the monthly electrical consumption and the annual average for various years. If the above data is given as input, we have to write applications to process it and produce results such as finding the …
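As one possible sketch of such an application (assuming each input line holds a year, twelve monthly readings and the annual average, all whitespace separated, and that the goal is the maximum monthly consumption per year), a job could look like this; class names and output are assumptions, not the post's actual code.

```java
package com.npntraining.hadoop.mapreduce;

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MaxConsumption {

    // Emits (year, monthly reading) for each of the twelve monthly columns;
    // the trailing annual-average column, if present, is ignored.
    public static class ConsumptionMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String[] parts = value.toString().trim().split("\\s+");
            if (parts.length < 13) {
                return;                                   // skip malformed lines
            }
            Text year = new Text(parts[0]);
            for (int month = 1; month <= 12; month++) {
                context.write(year, new IntWritable(Integer.parseInt(parts[month])));
            }
        }
    }

    // Keeps the largest monthly reading seen for each year.
    public static class MaxReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int max = Integer.MIN_VALUE;
            for (IntWritable v : values) {
                max = Math.max(max, v.get());
            }
            context.write(key, new IntWritable(max));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "max monthly consumption");
        job.setJarByClass(MaxConsumption.class);
        job.setMapperClass(ConsumptionMapper.class);
        job.setCombinerClass(MaxReducer.class);   // max is associative, so the reducer doubles as combiner
        job.setReducerClass(MaxReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```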

HDFS vs HBase

The table below gives the differences between HDFS and HBase.

HDFS | HBase
HDFS is a distributed file system suitable for storing large files. | HBase is a NoSQL database built on top of HDFS.
It doesn't support fast individual record lookups. | It provides fast lookups for large tables.
It provides high …
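To make the "fast lookups" point concrete, here is a hedged sketch of a single-row lookup with the HBase Java client; the table, column family and row key are hypothetical.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class SingleRowLookup {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();   // reads hbase-site.xml
        try (Connection connection = ConnectionFactory.createConnection(conf);
             Table table = connection.getTable(TableName.valueOf("users"))) {
            // Point lookup by row key; no scan over the underlying HDFS files is needed.
            Get get = new Get(Bytes.toBytes("user-42"));
            get.addColumn(Bytes.toBytes("info"), Bytes.toBytes("name"));
            Result result = table.get(get);
            byte[] name = result.getValue(Bytes.toBytes("info"), Bytes.toBytes("name"));
            System.out.println("name = " + (name == null ? "<missing>" : Bytes.toString(name)));
        }
    }
}
```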