Friday, January 4, 2013

Implementing basic SQL Update statement in Hive


Hive is not meant for point to point queries and hence sql update functionality would be least required in hive that should be the reason hive doesn’t have update functionality for rows or rather individual columns in a row. There would be cases you find a much more suitable use case in hive, but the same can’t be implemented as it includes an update statement on a few rows specified by a condition. We can implement the basic sql row update in hive by following a few series of steps as follows

1.       Load all the rows that has to be updated into a hdfs dir
a.       Here in the select statement that fetches the rows that satisfy the condition, we specify the column names one by one rather than a Select *
b.      In place of the colum whose value is to be updated we include the new value hard coded
2.       Remove these rows loaded into hdfs dir in step 1 from table
a.       Do an insert overwrite on the same table to include only those record that do not satisfies the condition
3.       Load the updated rows from hdfs dir in Step into into table

Let’s look at the same through an example
                We have a hive table employee with the following columns employeeId, employeeName, experienceMonths, salary and visaEligibility. We need to do an operation equivalent to the SQL query below in hive.

SQL Query
update table employee set visaEligibility = 'YES' where experienceMonths >36 ;

Equivalent Hive QL
#load the rows that fall in the condition into an hdfs dir with the corresponding column value modified
INSERT OVERWRITE DIRECTORY '/userdata/bejoy/employeetemp' SELECT employeeId,employeeName, experienceMonths ,salary,'YES' FROM employee WHERE experienceMonths > =36;

#removes the records from table that has been loaded into hdfs dir
#use the negate condition to the one provided in previous step
INSERT OVERWRITE TABLE employee SELECT * FROM employee WHEREexperienceMonths < 36;

#load the rows from hdfs dir on to table on top of its current contents
LOAD DATA INPATH '/userdata/bejoy/employeetemp' INTO TABLE employee;

A better Approach
                When I understood some more functionalities of hive, it was evident to me that there is a better way to solve this out using CASE statement. The same result could be obtained in a single hive query as stated below

INSERT OVERWRITE TABLE employee SELECT employeeId,employeeName, experienceMonths ,salary, CASE WHEN experienceMonths >=36 THEN ‘YES’ ELSE visaEligibility END AS visaEligibility FROM employee;

The approaches and solutions described in this post are for non partitioned tables. Ideally in real rime scenarios the data volume would be too large to handle in a single partition so we have to go in for multiple partitions. But the same approaches could be used out there as well.

No comments:

Post a Comment