Hadoop Cluster performance tuning is little hectic, because hadoop framework uses all type of resource for processing and analyzing data. So tuning its parameter for good performance is not static one. Parameter values should be change based on clusters following items for better performance:
- · Operating System
- · Processor and its number of cores
- · Memory (RAM)
- · Number of nodes in cluster
- · Storage capacity of each node
- · Network bandwidth
- · Amount of input data
- · Number of jobs in business logic
Recommended OS for hadoop clusters is Linux, because windows and other GUI based OS runs lot of GUI (Graphical user interface) processes and will occupy most of the memory.
Storage capacity of each node should have at-least 5GB extra after storing distributed HDFS input data. For Example if input data in 1 TB and with 1000 node cluster means, (1024GB x 3(replication factor))/1000 nodes = approx 3GB of distributed data in each node, so it is recommended to have at-least 8GB of storage in each node. Because each data node writes log and need some space for swapping memory.
Network bandwidth is recommended to have at-least 100 Mbps, as well known while processing and loading data into HDFS, Hadoop moves lot of data over network. Lower bandwidth channel also degrade the performance of hadoop cluster.
Number of nodes requires for cluster is depends on amount of data to be processed and capacity of each node. For example node with 2GB Memory and 2 core processor can process 1GB of data in average time. It can also process 2 data block (of 256MB 0r 512MB) simultaneously. For Example: To process 5TB of data, it is recommended to have 1000 nodes with 4-to-8 Core processor and 8-to-10 GB of memory in each node to produce result in few minutes.
Hadoop Parameters:
Data block size (Chunk size):
dfs.block.size parameter will be in hdfs-site.xml file, parameter value is mentioned in number of bytes. Block size should be chosen completely based on each node memory capacity. If memory is less then set smaller block size. Because TaskTracker, bring whole block of data to memory while processing. So for 512MB RAM, it is advised to set block size as 64MB or 128MB. If it is dual core processor then TaskTracker can process 2 block of data at same time, so two data block will be bring to memory while processing, so it should be planned according to that, for this have to set concurrent tasktracker parameter also.
Number of Maps and Reducer:
mapred.reduce.tasks & mapred.map.tasks parameter will be in mapred-site.xml file. By default, number of maps will be equal to number of data block. For example, if input data is 2GB and block size is 256MB means, while processing 8 Maps will run. It won’t bother about memory capacity and number of processor. So we need to tune this parameter to number of nodes*number of cores in each node.
Number of Maps = Total number of processor core available in cluster.
As per above example it runs 8 Maps, if that cluster have only 4 processor core, then multiple thread will start running and keep swapping the memory data, which will degrade the performance of hadoop cluster. In same way set number of reducer to number of core in cluster. After mapping job is over, most of nodes go idle and few nodes working for reducer to complete, to make reducer job to complete fast, set its value to number of nodes or number of core processor.
Logging Level:
HADOOP_ROOT_LOGGER = ERROR set this value in hadoop script file. By default its set to INFO mode, in information mode, hadoop will log all information about including all event, jobs, tasks completed, IO info, warning and error. It won’t increase huge performance improvement, but it will help to reduce number of log file I/Os and give small improvement in performance.
Above suggestions are observed with Hadoop cluster with Hive querying, please leave a comment and recommend this post by clicking Facebook ‘Like’ button and ‘+1’ at bottom of this page.
No comments:
Post a Comment