Joins are a very frequently used tool in SQL, though sometimes appear a little intimidating to those new to the language. Essentially a join allows rows in multiple tables to be logically combined together based on a conditional statement. Hive has rich support for joins which we will now examine.
join.hql
:SELECT t1.sighted, t2.full_name FROM ufodata t1 JOIN states t2 ON (LOWER(t2.abbreviation) = LOWER(SUBSTR( t1.sighting_location, (LENGTH(t1.sighting_location)-1)))) LIMIT 5 ;
$ hive -f join.hql
You will receive the following response:
OK 20060930 Alaska 20051018 Alaska 20050707 Alaska 20100112 Alaska 20100625 Alaska Time taken: 33.255 seconds
The actual join
query is relatively straightforward; we want to extract the sighted date and location for a series of records but instead of the raw location field, we wish to map this into the full state name. The HiveQL file we created performs such a query. The join itself is specified by the standard JOIN
keyword and the matching condition is contained in the ON
clause.
Things are complicated by a restriction of Hive in that it only supports equijoins, that is, those where the ON clause contains an equality check. It is not possible to specify a join condition using operators such as >
, ?
, <
, or as we would have preferred to use here, the LIKE
keyword.
Instead, therefore, we have an opportunity to introduce several of Hive's built-in functions, in particular, those to convert a string to lowercase (LOWER
), to extract a substring from a string (SUBSTR
) and to return the number of characters in a string (LENGTH
).
We know that most location entries are of the form "city, state_abbreviation." So we use SUBSTR
to extract the third and second from last characters in the string, using length
to calculate the indices. We convert both the state abbreviation and extracted string to lower case via LOWER
because we cannot assume that all entries in the sighting table will correctly use uniform capitalization.
After executing the script, we get the expected sample lines of output that indeed include the sighting date and full state name instead of the abbreviation.
Note the use of the LIMIT
clause that simply constrains how many output rows will be returned from the query. This is also an indication that HiveQL is most similar to SQL dialects such as those found in open source databases such as MySQL.
This example shows an inner join; Hive also supports left and right outer joins as well as left semi joins. There are a number of subtleties around the use of joins in Hive (such as the aforementioned equijoin restriction) and you should really read through the documentation on the Hive homepage if you are likely to use joins, especially when using very large tables.
As well as the string functions we used previously, Hive also has functions such as RLIKE
and REGEXP_EXTRACT
that provide direct support for Java-like regular expression manipulation. Rewrite the preceding join specification using regular expressions to make a more accurate and elegant join statement.
18.191.189.23