Time for action – importing data from a raw query

Let's see an example of an import where a raw SQL statement is used to select the data to be imported.

  1. Delete any existing output directory:
    $ hadoop fs -rmr employees
    
  2. Drop any existing Hive employee table:
    $ hive -e 'drop table employees'
    
  3. Import data using an explicit query:
    $ sqoop import --connect jdbc:mysql://10.0.0.100/hadooptest \
        --username hadoopuser -P \
        --target-dir employees \
        --query 'select first_name, dept, salary, timestamp(start_date) as start_date from employees where $CONDITIONS' \
        --hive-import --hive-table employees \
        --map-column-hive start_date=timestamp -m 1
    
  4. Examine the created table:
    $ hive -e "describe employees"
    

    You will receive the following response:

    OK
    first_name  string  
    dept  string  
    salary  int  
    start_date  timestamp  
    Time taken: 2.591 seconds
    
  5. Examine the data:
    $ hive -e "select * from employees"
    

    You will receive the following response:

    OK
    Alice  Engineering  50000  2009-03-12 00:00:00
    Bob  Sales  35000  2011-10-01 00:00:00
    Camille  Marketing  40000  2003-04-20 00:00:00
    David  Executive  75000  2001-03-20 00:00:00
    Erica  Support  34000  2011-07-07 00:00:00
    Time taken: 2.709 seconds
    

What just happened?

To achieve our goal, we used a very different form of the Sqoop import. Instead of specifying the desired table and then either letting Sqoop import all columns or a specified subset, here we use the --query option to define an explicit SQL statement.

In the statement, we select all the columns from the source table but apply the timestamp() function to convert the start_date column to the correct type. (Note that this function simply adds a 00:00:00 time element to the date.) We alias the result of this function, which allows us to name it in the type mapping option.
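Running the function directly in the mysql command-line client shows the conversion; the literal date here is just an arbitrary example:

    mysql> SELECT TIMESTAMP('2009-03-12');
    +-------------------------+
    | TIMESTAMP('2009-03-12') |
    +-------------------------+
    | 2009-03-12 00:00:00     |
    +-------------------------+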

Because we have no --table option, we have to add --target-dir to tell Sqoop the name of the directory it should create on HDFS.

The WHERE clause in the SQL is required by Sqoop even though we are not actually using it. Having no --table option does not just remove Sqoop's ability to auto-generate the name of the target directory; it also means that Sqoop does not know from where data is being retrieved and, hence, how to partition the data across multiple mappers. The $CONDITIONS variable is used in conjunction with the --split-by option; specifying the latter gives Sqoop the column it needs to partition the query results appropriately across mappers.

We take a different route here and instead explicitly set the number of mappers to 1, which obviates the need for an explicit partitioning clause.

After executing Sqoop, we examine the table definition in Hive, which as before, has the correct datatypes for all columns. We then look at the data, and this is now successful, with the start_date column data being appropriately converted into the TIMESTAMP values.

Note

When we mentioned in the Sqoop and HDFS section that Sqoop provided mechanisms to restrict the data extracted from the database, we were referring to the --query, --where, and --columns options. Note that these can be used by any Sqoop import regardless of the destination.
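For instance, a sketch of a table-based import that restricts both the columns and the rows retrieved might look like the following; the target directory name and the salary threshold are illustrative only:

    $ sqoop import --connect jdbc:mysql://10.0.0.100/hadooptest \
        --username hadoopuser -P \
        --table employees \
        --columns first_name,salary \
        --where 'salary >= 40000' \
        --target-dir employee_subset -m 1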

Have a go hero

Though it truly is not needed for such a small data set, the $CONDITIONS variable is an important tool. Modify the preceding Sqoop command to use multiple mappers with an explicit split column.

Sqoop and Hive partitions

In Chapter 8, A Relational View on Data with Hive, we talked a lot about Hive partitions and highlighted how important they are in allowing query optimization for very large tables. The good news is that Sqoop can support Hive partitions; the bad news is that the support is not complete.

To import data from a relational database into a partitioned Hive table, we use the --hive-partition-key option to specify the partition column and the --hive-partition-value option to specify the value for the partition into which this Sqoop command will import data.

This is excellent, but it does restrict each Sqoop statement to importing into a single Hive partition; there is currently no support for Hive auto-partitioning. Instead, if a data set is to be imported into multiple partitions of a table, we need to use a separate Sqoop statement for each partition.
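As a sketch, an import into a single partition might look like the following; the employees_by_dept table name is illustrative, and the table is assumed to be partitioned on a dept column:

    $ sqoop import --connect jdbc:mysql://10.0.0.100/hadooptest \
        --username hadoopuser -P \
        --table employees \
        --columns first_name,salary \
        --where "dept = 'Engineering'" \
        --hive-import --hive-table employees_by_dept \
        --hive-partition-key dept \
        --hive-partition-value Engineering -m 1

Note that the partition column itself is not included in the imported column list; in Hive, the partition key is a virtual column whose value comes from the partition, not from the data files.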

Field and line terminators

Until now, we have been implicitly relying on some defaults, but we should discuss them at this point. Our original text file was tab-separated, but you may have noticed that the data we imported onto HDFS earlier was comma-separated. Yet if you look at the files under /user/hive/warehouse/employees (remember, this is the default location on HDFS where Hive keeps its source files), the records use ASCII code 001 as the field separator. What is going on?

In the first instance, we let Sqoop use its defaults, which in this case means using a comma to separate fields and a newline character to terminate records. However, when Sqoop is importing into Hive, it instead employs the Hive defaults, which include using the 001 code (^A) to separate fields.
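You can confirm this by dumping one of the warehouse files through od; the exact part file name may differ on your cluster:

    $ hadoop fs -cat /user/hive/warehouse/employees/part-m-00000 | od -c | head

The field boundaries appear as 001 bytes in the output, with \n terminating each record.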

We can explicitly set separators using the following Sqoop options:

  • --fields-terminated-by: This is the separator between fields
  • --lines-terminated-by: The line (record) terminator
  • --escaped-by: The character used to escape other characters (for example, \)
  • --enclosed-by: The character enclosing each field (for example, ")
  • --optionally-enclosed-by: Similar to the preceding option, but the enclosing character is not mandatory
  • --mysql-delimiters: A shortcut to use the MySQL defaults

This may look a little intimidating, but it's not as obscure as the terminology may suggest, and the concepts and syntax should be familiar to those with SQL experience. The first few options are pretty self-explanatory; where it gets less clear is when talking of enclosing and optionally enclosing characters.

This is really about (usually free-form) data where a given field may include characters that have special meaning, for example, a string column in a comma-separated file whose values themselves contain commas. In such a case, we could enclose the string columns within quotes to allow the commas within the field. If all fields need such enclosing characters, we would use the first form; if it is only required for a subset of the fields, it can be specified as optional.
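Putting a few of these options together, a sketch of an import that writes comma-separated records with optional quoting might look like the following; the target directory name is illustrative only:

    $ sqoop import --connect jdbc:mysql://10.0.0.100/hadooptest \
        --username hadoopuser -P \
        --table employees \
        --target-dir employees_csv \
        --fields-terminated-by ',' \
        --lines-terminated-by '\n' \
        --optionally-enclosed-by '"' \
        --escaped-by '\\' -m 1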
