HDFS stands for Hadoop Distributed File System. It is a reliable distributed storage system designed for storing very large files.
- In Hadoop, a file is split into chunks known as blocks. A block is the smallest unit of data that the file system stores.
- The default block size is 64 MB in Hadoop 1.x and 128 MB in Hadoop 2.x.
- The block size affects the performance of sequential reads and writes.
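To make the splitting concrete, here is a minimal Python sketch of how a file's size maps onto blocks. The function name `split_into_blocks` is made up for illustration and is not part of any Hadoop API. It assumes the final block holds only the leftover bytes, so it can be smaller than the configured block size.

```python
MB = 1024 * 1024


def split_into_blocks(file_size: int, block_size: int) -> list[int]:
    """Return the size in bytes of each block a file would be split into."""
    if file_size < 0 or block_size <= 0:
        raise ValueError("file_size must be >= 0 and block_size must be > 0")
    full_blocks, remainder = divmod(file_size, block_size)
    sizes = [block_size] * full_blocks
    if remainder:
        # The last block only holds the bytes that are left over.
        sizes.append(remainder)
    return sizes


blocks = split_into_blocks(300 * MB, 128 * MB)
print(len(blocks), [size // MB for size in blocks])  # 3 [128, 128, 44]
```

A 300 MB file with a 128 MB block size therefore occupies three blocks: two full ones and one 44 MB block.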
Hadoop does not bind the user to a particular block size; the right choice usually depends on the input data. To maximize throughput for a very large input file, a large block size (128 MB or even 256 MB) works best. For smaller files, on the other hand, a smaller block size is better.
In short: large blocks for large files, small blocks for small files. In industry, files come in many different sizes, and files with different block sizes can coexist on the same file system. To handle this, the "dfs.block.size" parameter can be set when a file is written; it overrides the default block size configured in hdfs-site.xml for that file. (In Hadoop 2.x the property is named "dfs.blocksize", and "dfs.block.size" remains as a deprecated alias.)
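As a rough illustration of a per-file override, the Python sketch below only builds the command line for an upload with a custom block size; it does not run anything. It assumes a Hadoop 2.x client where the `hdfs dfs` shell accepts the generic `-D property=value` option. The helper name `build_put_command` and both file paths are hypothetical.

```python
import shlex

MB = 1024 * 1024


def build_put_command(local_path, hdfs_path, block_size_bytes,
                      property_name="dfs.blocksize"):
    """Build an `hdfs dfs -put` command that overrides the block size for one upload."""
    if block_size_bytes <= 0:
        raise ValueError("block_size_bytes must be positive")
    return [
        "hdfs", "dfs",
        # Generic option: applies to this invocation only, hdfs-site.xml is untouched.
        "-D", f"{property_name}={block_size_bytes}",
        "-put", local_path, hdfs_path,
    ]


# Hypothetical paths, used only for illustration.
cmd = build_put_command("events.log", "/data/events.log", 256 * MB)
print(shlex.join(cmd))
# hdfs dfs -D dfs.blocksize=268435456 -put events.log /data/events.log
```

On an older cluster that only knows the original property name, passing `property_name="dfs.block.size"` produces the equivalent setting.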
What happens when the block size is small
- When the block size is small, the data is spread over more blocks, so reading or writing it requires more seeks.
- A large number of blocks also increases the overhead on the NameNode, which needs more memory to store the block metadata.
- A smaller block size also means more tasks have to be launched and executed in JVMs, which adds scheduling and startup overhead.
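The effect on NameNode memory can be estimated with some simple arithmetic. The sketch below assumes roughly 150 bytes of NameNode heap per block object, which is a commonly quoted rule of thumb rather than a measured value, and it treats 1 TB of data as a single file. Per-file and per-directory objects are ignored, and the function names are made up for illustration.

```python
MB = 1024 * 1024
TB = 1024 * 1024 * MB

# Assumption: rough rule-of-thumb heap cost per block object, not a measured figure.
ASSUMED_BYTES_PER_BLOCK = 150


def block_count(file_size: int, block_size: int) -> int:
    """Number of blocks needed for a file (integer ceiling division)."""
    return -(-file_size // block_size)


def estimated_block_metadata_bytes(total_data: int, block_size: int) -> int:
    """Estimated NameNode heap used by the block objects alone."""
    return block_count(total_data, block_size) * ASSUMED_BYTES_PER_BLOCK


for size_mb in (16, 64, 128, 256):
    blocks = block_count(TB, size_mb * MB)
    meta = estimated_block_metadata_bytes(TB, size_mb * MB)
    print(f"{size_mb:>4} MB blocks: {blocks:>6} blocks, ~{meta / MB:.2f} MB of metadata")
```

Halving the block size doubles the number of block objects the NameNode has to track, whatever the real per-object cost turns out to be.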
What happens when the block size is large
- When the block size is large, parallel processing takes a hit: there are fewer blocks to work on concurrently, and each block takes a long time to process, so the complete job can take much longer.
Hence, a moderate block size of 128 MB is a sensible starting point. Analyze and observe the performance of the cluster, and then increase or decrease the block size based on those observations.
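The trade-off between too many small tasks and too few large ones can be sketched with a simple model. It assumes one task per block (the usual default for splittable input), equally long tasks, and a fixed number of tasks the cluster can run at once. The names `Plan`, `plan_for`, and `parallel_slots` are hypothetical and only serve the illustration.

```python
from dataclasses import dataclass

MB = 1024 * 1024
GB = 1024 * MB


@dataclass
class Plan:
    block_size_mb: int
    tasks: int               # one task per block (assumption)
    waves: int               # rounds needed when only `parallel_slots` tasks run at once
    slot_utilization: float  # fraction of available slot-rounds that do real work


def plan_for(file_size: int, block_size: int, parallel_slots: int) -> Plan:
    if block_size <= 0 or parallel_slots <= 0:
        raise ValueError("block_size and parallel_slots must be positive")
    tasks = -(-file_size // block_size)        # ceiling division
    waves = -(-tasks // parallel_slots)
    utilization = tasks / (waves * parallel_slots) if waves else 0.0
    return Plan(block_size // MB, tasks, waves, utilization)


# Hypothetical cluster: 40 tasks can run in parallel; the input is a 10 GB file.
for size_mb in (64, 128, 256, 1024):
    print(plan_for(10 * GB, size_mb * MB, parallel_slots=40))
```

In this model, 64 MB blocks produce 160 tasks in four waves, while 1 GB blocks produce only 10 tasks and leave three quarters of the slots idle. A real cluster also needs to be measured, since task startup cost and data locality are not captured here.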
Important Points to consider while choosing Block Size
- A file typically has fewer blocks when the block size is larger. The advantage is that clients can read and write more data without interacting with the NameNode, which saves time.
- A larger block size also reduces the amount of metadata the NameNode holds, which reduces NameNode load.
- With fewer blocks, the file may be stored on fewer nodes in total, which can reduce the total throughput of parallel access.
- Fewer, larger blocks also mean longer tasks, which may prevent a job from reaching maximum parallelism.
- If a failure occurs while a larger block is being processed, more work has to be redone.