Wednesday, January 23, 2013

Hadoop Performance Tuning (Hadoop-Hive) Part 2



[Note: This post is the second part of Hadoop performance tuning. If you reached this page directly, please click here for part 1.]

I tested these parameters with the Hadoop and Hive frameworks using SQL-based queries. To check the performance improvement from each configuration parameter, I used a sample dataset of 100 million records and ran some complex queries through the Hive interface on top of Hadoop. In this part 2 we will look at a few more Hadoop configuration parameters that help get the maximum performance out of a Hadoop cluster.

Map Output Compression (mapred.compress.map.output)
By default this value is set to false. It is recommended to set this parameter to true for clusters with a large amount of input data to be processed, because compression makes data transfer between nodes faster. Map output does not move directly to the reducer; it is first written to disk, so this setting also saves disk space and speeds up disk read/write. It is not recommended to set this parameter to true for small amounts of input data, because compressing and decompressing would add processing time. For big data, however, compression and decompression time is small compared to the time saved in data transfer and disk read/write.
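For example, the property can be switched on per session from the Hive CLI; a minimal sketch, assuming the pre-YARN mapred.* property names used in this post (the same value can also be set cluster-wide in mapred-site.xml):

    -- Enable compression of the intermediate map output
    SET mapred.compress.map.output=true;
    -- Hive's own switch for compressing intermediate data between job stages
    SET hive.exec.compress.intermediate=true;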

Once we set the above configuration parameter to true, other dependent parameters become relevant, such as the compression technique (codec) and the compression type.

Compression method, technique, or codec (mapred.map.output.compression.codec)
The default value for this parameter is org.apache.hadoop.io.compress.DefaultCodec. Another available codec is org.apache.hadoop.io.compress.GzipCodec. DefaultCodec takes more time but gives more compression. The LZO method takes less time, but the amount of compression is less. Our own codec can also be added: add the codec or compression library that is best suited to your input data type.
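As an illustration, the codec can be picked in the same Hive session. GzipCodec ships with Hadoop; the LZO line is an assumption that the separate hadoop-lzo library (com.hadoop.compression.lzo.LzoCodec) is installed on the cluster:

    -- Use the gzip codec for map output (slower, better compression)
    SET mapred.map.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec;
    -- Or, if hadoop-lzo is installed, trade compression ratio for speed:
    -- SET mapred.map.output.compression.codec=com.hadoop.compression.lzo.LzoCodec;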

The mapred.map.output.compression.type parameter identifies the basis on which data should be compressed. The user can set either RECORD or BLOCK. RECORD is the default type, in which each individual value is compressed on its own. BLOCK is the recommended type, in which data is compressed a block of key-value pairs at a time; this gives better compression and helps when sorting data on the reducer side. In Cloudera's Hadoop distribution, the default type is set to BLOCK for better performance.
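A minimal sketch of setting the recommended type, assuming your Hadoop version honors the mapred.map.output.compression.type property named above:

    -- Compress blocks of key-value pairs rather than each record separately
    SET mapred.map.output.compression.type=BLOCK;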

There are three more configuration parameters:

1. mapred.output.compress
2. mapred.output.compression.type 
3. mapred.output.compression.codec

The same rules described above apply here, but these parameters are meant for the final MapReduce job output. The first three parameters (covered earlier) control compression of the intermediate map output alone, whereas these three specify whether the overall job output should be compressed, and with which type and codec.
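Putting it together, a sketch of compressing the final job output from a Hive session; hive.exec.compress.output is Hive's own switch for this and is assumed to be available in your Hive version:

    -- Compress the final job output, not just the intermediate map output
    SET mapred.output.compress=true;
    SET mapred.output.compression.type=BLOCK;
    SET mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec;
    -- Hive-level switch for compressing final query output
    SET hive.exec.compress.output=true;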

The above suggestions were observed on a Hadoop cluster with Hive querying. Please leave a comment and recommend this post by clicking the Facebook 'Like' button and '+1' at the bottom of this page.
