What kind of issues do you face while using a cluster?
1. Lack of configuration management.
2. Poor allocation of resources.
3. Lack of a dedicated network.
4. Lack of monitoring and metrics.
5. Ignorance of what log files contain what information.
6. Drastic measures to address simple problems.
7. Inadvertent introduction of single points of failure.
8. Over-reliance on defaults.
Cluster issues are generally the admin team's responsibility. Other tasks that need to be managed daily are:
1. Managing space allocation among application users.
2. DistCp: data backups and migration.
3. Managing services and adding nodes using Ambari.
4. Changing cluster capacity.
5. User/group permission management.
6. Alerts and notifications.
7. Script configuration.
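Two of the daily tasks above, space management and DistCp backups, can be sketched as small shell wrappers. This is only an illustration: the helper names (run, set_space_quota, backup_with_distcp) and the DRY_RUN switch are assumptions made for this example, and any paths, sizes, or NameNode URIs passed in are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch only: wrappers around routine HDFS admin commands.
# DRY_RUN=1 (the default here) prints the command instead of running it;
# set DRY_RUN=0 on a real cluster to execute.

run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Cap the raw disk space (replicas included) that a directory may consume.
# usage: set_space_quota <size, e.g. 500g> <hdfs-dir>
set_space_quota() {
  run sudo -u hdfs hdfs dfsadmin -setSpaceQuota "$1" "$2"
}

# Copy a directory tree between clusters; -update copies only files
# that are missing or different at the destination.
# usage: backup_with_distcp <src-uri> <dst-uri>
backup_with_distcp() {
  run hadoop distcp -update "$1" "$2"
}
```

With the default dry-run setting, calling set_space_quota 500g /user/alice only prints the dfsadmin command, so it can be reviewed before it is run for real.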
What hard disk and RAM sizes do you recommend?
What kind of jobs have you used? Can you explain?
We mostly schedule jobs on the cluster nodes rather than running scripts manually each time.
1. Alert mails are triggered when a threshold value is reached.
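A threshold alert of this kind could look roughly like the sketch below. It assumes the job is scheduled with cron and that a mail command is installed on the node; neither is stated above, and the function names, script path, and recipient address are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch only: send a mail when local disk usage reaches a threshold.
# Assumed cron entry (hypothetical path), running every 15 minutes:
#   */15 * * * * /opt/scripts/disk_alert.sh

# Succeeds when usage has reached the threshold.
# usage: threshold_reached <used-percent> <threshold-percent>
threshold_reached() {
  [ "$1" -ge "$2" ]
}

# Print the used percentage of the filesystem holding a local path.
# df -P keeps each filesystem on one line; column 5 is the capacity, e.g. "45%".
# usage: disk_used_percent <path>
disk_used_percent() {
  df -P "$1" | awk 'NR==2 { sub("%", "", $5); print $5 }'
}

# Mail the recipient if the path's filesystem is at or above the threshold.
# usage: check_and_alert <path> <threshold-percent> <recipient>
check_and_alert() {
  used=$(disk_used_percent "$1")
  if threshold_reached "$used" "$2"; then
    echo "Disk usage on $1 is ${used}% (threshold $2%)" \
      | mail -s "Disk alert on $(hostname)" "$3"
  fi
}
```

The scheduled script would then end with a single call such as check_and_alert /data 80 admin@example.com.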
What troubleshooting issues have you faced?
Issues can be related to the cluster or to logs, for example:
1. IO exception errors
2. Cluster in safe mode
3. Host unreachable
4. Change in host identification
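For the safe mode and unreachable host cases, a first check might look like the sketch below. The function names are made up for this example. The script only reports the state and prints the leave command as a suggestion, because forcing the NameNode out of safe mode before the cause is understood can hide a real problem.

```shell
#!/usr/bin/env bash
# Sketch only: first checks for a cluster in safe mode or an unreachable host.

# Succeeds when the text on stdin reports safe mode as ON.
# `hdfs dfsadmin -safemode get` prints a line such as "Safe mode is ON".
safemode_is_on() {
  grep -q "Safe mode is ON"
}

# Report the NameNode safe mode state without changing it.
check_safemode() {
  if sudo -u hdfs hdfs dfsadmin -safemode get | safemode_is_on; then
    echo "NameNode is in safe mode. Once the cause is fixed, run:"
    echo "  sudo -u hdfs hdfs dfsadmin -safemode leave"
  else
    echo "NameNode is not in safe mode"
  fi
}

# Succeeds when the host answers a single ping.
# usage: host_reachable <hostname>
host_reachable() {
  ping -c 1 "$1" >/dev/null 2>&1
}
```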
Cluster maintenance & backup?
1. File system checks (recursive health check):
   sudo -u hdfs hadoop fsck /
2. HDFS Balancer utility:
   sudo -u hdfs hdfs balancer -threshold <threshold-value>
3. Adding or decommissioning nodes in the cluster
4. Node failures
5. Database and metadata backups (individual database dumps)
6. Purging older log files
7. Planning for unplanned downtime
8. Network issues (host unreachable)
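Two of the maintenance items above, the file system health check and log purging, can be scripted along these lines. This is a sketch: the function names, the log directory, and the retention period are assumptions for illustration, not values taken from any particular cluster.

```shell
#!/usr/bin/env bash
# Sketch only: helpers for routine maintenance.

# Succeeds when fsck output on stdin reports a healthy file system.
# fsck ends its report with a status line containing HEALTHY or CORRUPT.
fsck_is_healthy() {
  grep -q "is HEALTHY"
}

# Delete log files last modified more than <days> days ago.
# usage: purge_old_logs <log-dir> <days>
purge_old_logs() {
  find "$1" -type f -name '*.log*' -mtime +"$2" -exec rm -f {} +
}
```

On a cluster node this could be used as sudo -u hdfs hadoop fsck / | fsck_is_healthy || echo "fsck reported problems", and purge_old_logs /var/log/hadoop 30 to keep roughly a month of logs (the directory is a hypothetical example).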
Have you used any monitoring tools?
Ganglia is a scalable distributed monitoring system for high-performance computing systems such as clusters and Grids. We haven’t used
Ganglia is more concerned with gathering metrics and tracking them over time, while Nagios focuses on being an alerting mechanism.