Wednesday, January 2, 2013

Include values at execution time in Hive QL / Dynamically substitute values in Hive


When you work with data warehousing it is very common to come across scenarios where you'd like to supply values at run time. In production environments, when we have to run a Hive job we usually write our series of Hive operations as HQL in a file and trigger it with the hive -f option, either from a shell script or from a workflow management system like Oozie. Let's keep this discussion limited to triggering the Hive job from the shell, as that is the basic case.
Say I have a Hive job called hive_job.hql; normally, from a shell, I'd trigger it as
hive -f hive_job.hql
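
For reference, hive_job.hql is nothing special, just a plain text file of Hive statements executed top to bottom. A minimal sketch (using the test_table from the example later in this post) would be something like

-- hive_job.hql
SELECT COUNT(*) FROM test_table;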

If I need to set some Hive config parameters, say I need to enable compression in Hive, then I'd include the corresponding arguments as well:
hive -f hive_job.hql -hiveconf hive.exec.compress.output=true -hiveconf mapred.output.compress=true -hiveconf mapred.output.compression.codec=com.hadoop.compression.lzo.LzopCodec
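
As a side note, if these settings never change between runs you could also put them inside the HQL file itself as SET statements instead of passing them from the command line:

SET hive.exec.compress.output=true;
SET mapred.output.compress=true;
SET mapred.output.compression.codec=com.hadoop.compression.lzo.LzopCodec;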

Now the final one: say my Hive QL performs some operation on a date range, and this date range is varying/dynamic and has to be accepted from the CLI each time. We can achieve this with the following steps:
1. Pass the variables as config parameters
2. Refer to these config parameters in your Hive query

Let me make it more specific with a small example. I need to perform some operation on records in a table called ‘test_table’ which has a column/field named ‘created_date’ (i.e. I need to filter records based on created_date). The date range for the operation is supplied at run time. It is achieved as follows.
1. Pass the variables as config parameters
Here we need to pass two parameters, the start date and the end date, to get all the records within the specified date range:
hive -f hive_job.hql -hiveconf start_date=2011-01-01 -hiveconf end_date=2011-12-31
2. Refer to these config parameters in your Hive query
In our Hive QL the start date and end date are written as placeholders that get replaced with the actual values at execution time:
SELECT * FROM test_table t1 WHERE t1.created_date >= '${hiveconf:start_date}' AND t1.created_date <= '${hiveconf:end_date}';
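
With the values passed in step 1, the placeholders are substituted before execution, so the query that Hive actually runs is

SELECT * FROM test_table t1 WHERE t1.created_date >= '2011-01-01' AND t1.created_date <= '2011-12-31';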


Let us look into one more example: deciding on the number of reducers at run time, along with the previous set of requirements (the date range) and compression. I have two components here:

· Shell script that triggers the .hql
hive_trigger.sh
#!/bin/bash
# The script accepts 3 parameters, in order: number of reducers, start date and end date.
NUM_REDUCERS=$1
BEGIN_DATE=$2
CLOSE_DATE=$3
hive -f hive_job.hql -hiveconf mapred.reduce.tasks="$NUM_REDUCERS" -hiveconf hive.exec.compress.output=true -hiveconf mapred.output.compress=true -hiveconf mapred.output.compression.codec=com.hadoop.compression.lzo.LzopCodec -hiveconf start_date="$BEGIN_DATE" -hiveconf end_date="$CLOSE_DATE"

· HQL that accepts parameters at run time
hive_job.hql
SELECT * FROM test_table t1 WHERE t1.created_date >= '${hiveconf:start_date}' AND t1.created_date <= '${hiveconf:end_date}';

The components are ready; you can now trigger your shell script, for example as
./hive_trigger.sh 50 2011-01-01 2011-12-31
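
A couple of closing notes. First, since the script blindly reads $1, $2 and $3, you may want to add a small argument check (my addition, not strictly required) so that a missing parameter doesn't launch a Hive job with empty values:

if [ "$#" -ne 3 ]; then
    echo "Usage: $0 <num_reducers> <start_date> <end_date>" >&2
    exit 1
fi

Second, newer Hive versions (0.8 onwards, if I remember correctly) also provide a separate namespace for user variables via --hivevar, which keeps your own variables apart from Hive config settings. The idea is the same:

hive -f hive_job.hql --hivevar start_date=2011-01-01 --hivevar end_date=2011-12-31

and in the HQL you would refer to them as ${hivevar:start_date} and ${hivevar:end_date}.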
