Each file, directory, or block occupies about 150 bytes of namenode memory. Each file also takes up at least one block, so a file costs about 300 bytes in effect. I am also assuming 3x replication, so I count 900 bytes per file. Assuming the namenode is the bottleneck, a cluster whose namenode has 32 GB of RAM can therefore support at most about 38 million files (32 GB / 900 bytes).
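The estimate above can be reproduced with a few lines of Python. This is only a sketch of the arithmetic: the function name and parameters are made up for illustration, the 150-byte figure is the rough approximation quoted above, and multiplying by the replication factor follows this answer's own assumption.

```python
# Rough figures taken from the estimate above; they are approximations.
BYTES_PER_OBJECT = 150  # one file, directory, or block entry in namenode memory
GIB = 1024 ** 3


def max_files(heap_bytes: int, blocks_per_file: int = 1,
              replication_multiplier: int = 3) -> int:
    """Upper bound on the file count if the whole heap held only metadata.

    replication_multiplier mirrors the assumption made in the text that
    3x replication triples the per-file cost (hypothetical parameter).
    """
    bytes_per_file = BYTES_PER_OBJECT * (1 + blocks_per_file) * replication_multiplier
    return heap_bytes // bytes_per_file


print(max_files(32 * GIB))  # 38177487, i.e. about 38 million
```

Raising `blocks_per_file` shows how quickly the ceiling drops once files span more than one block.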
In practice, however, the number will be much lower, because not all of the 32 GB will be available to the namenode for keeping the mapping. You can raise the limit by allocating more heap space to the namenode on that machine.
Replication also affects this, but to a lesser degree. Each additional replica adds about 16 bytes to the memory requirement.
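To see how small that effect is, here is a minimal sketch of the per-file cost under this model. It assumes the 16 bytes apply once per block for every replica beyond the first; that reading, and the names used, are my assumptions rather than anything taken from the Hadoop source.

```python
BYTES_PER_OBJECT = 150        # file entry or block entry (approximate)
BYTES_PER_EXTRA_REPLICA = 16  # assumed: per block, per replica beyond the first


def bytes_per_file(blocks: int = 1, replication: int = 3) -> int:
    """Approximate namenode memory for one file under the model above."""
    if blocks < 0 or replication < 1:
        raise ValueError("blocks must be >= 0 and replication >= 1")
    per_block = BYTES_PER_OBJECT + BYTES_PER_EXTRA_REPLICA * (replication - 1)
    return BYTES_PER_OBJECT + blocks * per_block


for r in (1, 2, 3):
    print(r, bytes_per_file(replication=r))
# 1 300
# 2 316
# 3 332
```

Under this model, going from one replica to three adds about 10% per single-block file instead of tripling the cost.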
(file metadata = 150 bytes) + (block metadata for the file = 150 bytes) = 300 bytes per file. So 1 million files, each with 1 block, consume 300 × 1,000,000 = 300,000,000 bytes = 300 MB at a replication factor of 1. With a replication factor of 3, this estimate becomes 900 MB.
So, as a rule of thumb, budget about 1 GB of namenode memory for every 1 million files.
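The same arithmetic as a small Python sketch, in case you want to plug in your own file counts. The helper names are hypothetical, and the numbers are the approximations from this answer (300 bytes per single-block file, multiplied by the replication factor, in decimal megabytes).

```python
import math

BYTES_PER_FILE_R1 = 300  # file entry + one block entry, replication factor 1
MB = 1000 ** 2           # decimal megabytes, matching the arithmetic above


def metadata_mb(num_files: int, replication: int = 1) -> float:
    """Estimated metadata size in MB, following this answer's assumptions."""
    return num_files * BYTES_PER_FILE_R1 * replication / MB


def heap_gb_rule_of_thumb(num_files: int) -> int:
    """Rule of thumb: about 1 GB of namenode memory per 1 million files."""
    return math.ceil(num_files / 1_000_000)


print(metadata_mb(1_000_000, replication=1))  # 300.0
print(metadata_mb(1_000_000, replication=3))  # 900.0
print(heap_gb_rule_of_thumb(38_000_000))      # 38
```

Rounding 900 MB up to 1 GB leaves a little headroom, which is where the rule of thumb comes from.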
There are several technical limits on the NameNode (NN), and hitting any of them will limit your scalability.
- Memory – The NameNode consumes about 150 bytes per block.
- IO – The NN does one IO for each change to the filesystem (creating a file, deleting a block, etc.), so your local disk must be able to sustain that rate. It is harder to estimate how much you need. Since memory already limits the number of blocks, you will not hit this limit unless your cluster is very big.
- CPU – The NameNode carries a considerable load keeping track of the health of all blocks on all datanodes. Each datanode periodically reports the state of all its blocks. Again, unless the cluster is very big, this should not be a problem.