Wednesday, January 23, 2013

Hadoop with Hive


Many Hadoop distributions have emerged recently. By this I mean that companies such as Cloudera are releasing their own versions of Hadoop by building a layer over the original Apache Hadoop distribution. We can call these “customized” versions of Apache Hadoop. The core, however, remains the same across the different Hadoop flavors. The Apache Software Foundation (ASF) continues to improve Hadoop by hosting many smaller sub-projects under it, which encourages the development of open source tools around Hadoop. Hive is one of Hadoop’s more prominent child projects.
Hive is a data warehouse infrastructure, initially developed by Facebook. Combining Hadoop with Hive gives us the advantages of a distributed file system, Map-Reduce, and SQL. To process huge amounts of data in Hadoop, we normally have to write a new Map-Reduce program (job) for each process or operation. For users with a limited number of operations, or with sequences of the same operation, this is easy enough. But for users whose requirements change frequently, the challenge is that every new requirement means writing a new Map-Reduce program. Unfortunately, for unstructured data this is the only way.
But structured data, such as log files (log4j), relational data, and other similarly predictable sets, can be stored in table-like structures. This is where Apache Hive really shines. Hive is a layer running on top of Hadoop that helps process the data in Hadoop using SQL-like queries written in Hive Query Language (HQL). When data is loaded into HDFS through Hive as a table, Hive also stores metadata about the input, which describes the structure of the input data. Note that Hive must be installed on the Hadoop master node. Hive converts an input query into a Map-Reduce job and submits it to Hadoop, making it easy for users to analyze and process data.
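
To make the query-to-job idea concrete, here is a minimal Python sketch of how a query such as SELECT level, COUNT(*) FROM logs GROUP BY level could be broken into map, shuffle, and reduce steps. This is only an illustration of the Map-Reduce pattern, not Hive's actual query planner; the `logs` table, its columns, and all function names are hypothetical.

```python
from collections import defaultdict

# Hypothetical rows of a `logs` table with columns (level, message).
rows = [
    ("INFO", "service started"),
    ("ERROR", "disk full"),
    ("INFO", "request served"),
    ("WARN", "slow response"),
    ("ERROR", "timeout"),
]


def map_phase(row):
    """Emit (key, 1) for the GROUP BY column, as a mapper would."""
    level, _message = row
    yield level, 1


def shuffle(pairs):
    """Group emitted values by key, as the framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Aggregate all values for one key; here the aggregate is COUNT(*)."""
    return key, sum(values)


def run_group_by_count(table_rows):
    """Run the three phases in sequence and return {level: count}."""
    pairs = [pair for row in table_rows for pair in map_phase(row)]
    grouped = shuffle(pairs)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


result = run_group_by_count(rows)
print(result)
```

Without Hive, a change in the question (say, counting by message instead of level) would mean rewriting these functions by hand; with Hive, only the query text changes.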

Hive Prerequisites

  • Hadoop 0.20 and above
  • Java 1.6 and above
  • MySQL or the lightweight Derby database on the master node, used only to store Hive metadata

[Figure: Hadoop with Hive diagram]

Advantages of Hive

  • Supports rich data types such as Array (list), Map, and Struct, in addition to the basic data types.
  • Provides a Web UI and a Command Line Interface (CLI) for querying data. Both are helpful tools for developers and learners to test and debug their queries.
  • The Thrift server that comes with Hive supports JDBC and ODBC connections, so applications can use Hive on Hadoop as a backend database. Thrift handles the cross-language communication, which allows programs written in many different languages to interact with Hadoop.
  • Even for complex structured input data, we can write our own SerDe (serializer and deserializer) programs to parse the input data, store its table structure in the metadata repository, and load the data into the Hadoop Distributed File System (HDFS).
  • Supports queries with SQL filters, Joins, Group By, Order By, Inner Table, Functions, and other SQL-like operators. Using HQL, we can also redirect query output to a new table. Along with these SQL features, we can attach our own functions and Map-Reduce programs as part of an HQL query.
  • Partition and Bucket: partitioning splits data into different chunks based on ranges of input values, which allows queries to skip unwanted data. Bucketing splits data based on a hash function. Both help improve query performance.
  • Apache is developing optimizers for Hive for better performance. We can also improve Hadoop and Hive performance by tuning a few configuration parameters based on our application requirements. To learn more, read my recent article on Hadoop and Hive Performance Tuning.
  • Hive is used by major companies like Facebook, Yahoo, and Amazon. Hadoop and Hive play a major role in the proliferation of cloud computing. Amazon provides S3 (Simple Storage Service) and Elastic MapReduce as cloud services; Elastic MapReduce offers cloud servers pre-installed with Hadoop and Hive. It allows us to load our data into Hadoop (Elastic MapReduce) and execute queries on it with the help of Hive. Amazon Elastic MapReduce is a successful product that uses Hadoop and Hive jointly.
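
The partition and bucket idea from the list above can be sketched in a few lines of Python. This is a simplified model under stated assumptions, not Hive's storage code: the rows, the choice of `event_date` as the partition column and `user_id` as the bucketing column, the bucket count, and the use of CRC32 as a stand-in hash function are all hypothetical.

```python
import zlib
from collections import defaultdict

# Hypothetical rows: (event_date, user_id, action).
rows = [
    ("2013-01-21", "alice", "login"),
    ("2013-01-21", "bob", "view"),
    ("2013-01-22", "alice", "view"),
    ("2013-01-22", "carol", "login"),
    ("2013-01-23", "bob", "logout"),
]

NUM_BUCKETS = 4  # assumed bucket count, chosen only for illustration


def bucket_for(user_id, num_buckets=NUM_BUCKETS):
    """Pick a bucket from a hash of the bucketing column."""
    # CRC32 stands in for Hive's own hash; it is deterministic across runs.
    return zlib.crc32(user_id.encode("utf-8")) % num_buckets


def load(table_rows):
    """Lay rows out as partition -> bucket -> rows, like directories and files."""
    table = defaultdict(lambda: defaultdict(list))
    for event_date, user_id, action in table_rows:
        table[event_date][bucket_for(user_id)].append((user_id, action))
    return table


def query(table, event_date, user_id):
    """Read only the matching partition and bucket, skipping everything else."""
    partition = table.get(event_date, {})
    bucket = partition.get(bucket_for(user_id), [])
    return [action for uid, action in bucket if uid == user_id]


table = load(rows)
print(query(table, "2013-01-22", "alice"))
```

The lookup touches one partition and one bucket rather than scanning every row, which is the source of the performance gain described above.
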
With more and more Hadoop distributions appearing in the “wild”, it’s clear that this project isn’t going anywhere anytime soon. If anything, it will only gain momentum as more and more companies switch to Hadoop to handle their large data repositories. Hive is a relatively mature Hadoop sub-project companion that facilitates easy data analysis, ad-hoc queries, and manipulation of large datasets stored in Hadoop. These two are a “Match Made in Heaven”!
The observations above come from working with a Hadoop cluster queried through Hive.
