Time for action – a more selective import

Let's see how this works by performing an import that is limited by a conditional expression.

  1. Delete any existing employee import directory:
    $ hadoop fs -rmr employees
    

    You will receive the following response:

    Deleted hdfs://head:9000/user/hadoop/employees
    
  2. Import selected columns with a predicate:
    $ sqoop import --connect jdbc:mysql://10.0.0.100/hadooptest \
    --username hadoopuser -P \
    --table employees --columns first_name,salary \
    --where "salary > 45000" \
    --hive-import --hive-table salary
    

    You will receive the following response:

    12/05/23 15:02:03 INFO hive.HiveImport: Hive import complete.
    
  3. Examine the created table:
    $ hive -e "describe salary"
    

    You will receive the following response:

    OK
    first_name  string  
    salary  int  
    Time taken: 2.57 seconds
    
  4. Examine the imported data:
    $ hive -e "select * from salary"
    

    You will see the following output:

    OK
    Alice  50000
    David  75000
    Time taken: 2.754 seconds
    

What just happened?

This time, our Sqoop command added the --columns option, which takes a comma-separated list of the columns to include in the import.

We also used the --where option, which lets us specify, as free text, a WHERE clause that is applied to the SQL Sqoop uses to extract data from the database.

The combination of these options means that our Sqoop command imports only the names and salaries of those employees whose salary exceeds the threshold specified in the WHERE clause.
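
Conceptually, this is equivalent to Sqoop issuing a query along the following lines against MySQL. This is only a sketch for illustration; the statement Sqoop actually generates also includes the boundary conditions it adds to split the work across its parallel map tasks:

    SELECT first_name, salary
    FROM employees
    WHERE salary > 45000;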

We execute the command, see it complete successfully, and then examine the table created in Hive. We see that it contains only the specified columns, and we then display the table contents to verify that the WHERE predicate was also applied correctly.
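
As an additional sanity check that the predicate was applied, a quick aggregate query makes the lower bound explicit; this is just a suggested check, and any value it returns should be above 45000:

    $ hive -e "select min(salary) from salary"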

Datatype issues

In Chapter 8, A Relational View on Data with Hive, we mentioned that Hive does not support all the common SQL datatypes. The DATE and DATETIME types in particular are not currently implemented, though there are open Hive issues tracking them, so hopefully they will be added in the future. We saw this impact our first Hive import earlier in this chapter: though the start_date column was of type DATE in MySQL, the Sqoop import flagged a conversion warning, and the resulting column in Hive was of type STRING.

Sqoop has an option that helps here: --map-column-hive lets us explicitly tell Sqoop what type to use when creating a column in the generated Hive table.
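
For example, a minimal sketch of such an import might look like the following. This assumes the employees table used earlier in this chapter; the option takes comma-separated column=type pairs, and here we simply make the conversion of start_date to STRING explicit rather than relying on Sqoop's default behavior:

    $ sqoop import --connect jdbc:mysql://10.0.0.100/hadooptest \
    --username hadoopuser -P \
    --table employees \
    --hive-import --hive-table employees \
    --map-column-hive start_date=STRING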
