Time for action – performing a join

Joins are a very frequently used tool in SQL, though sometimes appear a little intimidating to those new to the language. Essentially a join allows rows in multiple tables to be logically combined together based on a conditional statement. Hive has rich support for joins which we will now examine.

  1. Create the following as join.hql:
    SELECT t1.sighted, t2.full_name
    FROM ufodata t1 JOIN states t2
    ON (LOWER(t2.abbreviation) = LOWER(SUBSTR( t1.sighting_location, (LENGTH(t1.sighting_location)-1)))) 
    LIMIT 5 ;
  2. Execute the query:
    $ hive -f join.hql
    

    You will receive the following response:

    OK
    20060930  Alaska
    20051018  Alaska
    20050707  Alaska
    20100112  Alaska
    20100625  Alaska
    Time taken: 33.255 seconds
    

What just happened?

The actual join query is relatively straightforward; we want to extract the sighted date and location for a series of records but instead of the raw location field, we wish to map this into the full state name. The HiveQL file we created performs such a query. The join itself is specified by the standard JOIN keyword and the matching condition is contained in the ON clause.

Things are complicated by a restriction of Hive in that it only supports equijoins, that is, those where the ON clause contains an equality check. It is not possible to specify a join condition using operators such as >, ?, <, or as we would have preferred to use here, the LIKE keyword.

Instead, therefore, we have an opportunity to introduce several of Hive's built-in functions, in particular, those to convert a string to lowercase (LOWER), to extract a substring from a string (SUBSTR) and to return the number of characters in a string (LENGTH).

We know that most location entries are of the form "city, state_abbreviation." So we use SUBSTR to extract the third and second from last characters in the string, using length to calculate the indices. We convert both the state abbreviation and extracted string to lower case via LOWER because we cannot assume that all entries in the sighting table will correctly use uniform capitalization.

After executing the script, we get the expected sample lines of output that indeed include the sighting date and full state name instead of the abbreviation.

Note the use of the LIMIT clause that simply constrains how many output rows will be returned from the query. This is also an indication that HiveQL is most similar to SQL dialects such as those found in open source databases such as MySQL.

This example shows an inner join; Hive also supports left and right outer joins as well as left semi joins. There are a number of subtleties around the use of joins in Hive (such as the aforementioned equijoin restriction) and you should really read through the documentation on the Hive homepage if you are likely to use joins, especially when using very large tables.

Note

This is not a criticism of Hive alone; joins are incredibly powerful tools but it is probably fair to say that badly written joins or those created in ignorance of critical constraints have brought more relational databases to a grinding halt than any other type of SQL query.

Have a go hero – improve the join to use regular expressions

As well as the string functions we used previously, Hive also has functions such as RLIKE and REGEXP_EXTRACT that provide direct support for Java-like regular expression manipulation. Rewrite the preceding join specification using regular expressions to make a more accurate and elegant join statement.

Hive and SQL views

Another powerful SQL feature supported by Hive is views. These are useful when instead of a static table the contents of a logical table are specified by a SELECT statement and subsequent queries can then be executed against this dynamic view (hence the name) of the underlying data.

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset
18.117.7.131