Time for action – a more selective import

Let's see how this works by performing an import that is limited by a conditional expression.

  1. Delete any existing employee import directory:
    $ hadoop fs -rmr employees
    

    You will receive the following response:

    Deleted hdfs://head:9000/user/hadoop/employees
    
  2. Import selected columns with a predicate:
    $ sqoop import --connect jdbc:mysql://10.0.0.100/hadooptest \
    --username hadoopuser -P \
    --table employees --columns first_name,salary \
    --where "salary > 45000" \
    --hive-import --hive-table salary
    

    You will receive the following response:

    12/05/23 15:02:03 INFO hive.HiveImport: Hive import complete.
    
  3. Examine the created table:
    $ hive -e "describe salary"
    

    You will receive the following response:

    OK
    first_name  string  
    salary  int  
    Time taken: 2.57 seconds
    
  4. Examine the imported data:
    $ hive -e "select * from salary"
    

    You will see the following output:

    OK
    Alice  50000
    David  75000
    Time taken: 2.754 seconds
    

What just happened?

This time, our Sqoop command added the --columns option, which takes a comma-separated list of the columns to include in the import.

We also used the --where option, which lets us specify, as free text, a WHERE clause that is applied to the SQL Sqoop uses to extract data from the database.

The combination of these options means that our Sqoop command imports only the names and salaries of those employees whose salary exceeds the threshold specified in the WHERE clause.
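
Conceptually, this is equivalent to Sqoop issuing a query along the following lines against MySQL. This is only a sketch for illustration; the statement Sqoop actually generates also includes the boundary conditions it adds to split the work across its parallel map tasks:

    SELECT first_name, salary
    FROM employees
    WHERE salary > 45000;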

We execute the command, see it complete successfully, and then examine the table created in Hive. We see that it contains only the specified columns, and we then display the table contents to verify that the WHERE predicate was also applied correctly.
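
As an additional sanity check that the predicate was applied, a quick aggregate query makes the lower bound explicit; this is just a suggested check, and any value it returns should be above 45000:

    $ hive -e "select min(salary) from salary"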

Datatype issues

In Chapter 8, A Relational View on Data with Hive, we mentioned that Hive does not support all the common SQL datatypes. The DATE and DATETIME types in particular are not currently implemented, though there are open Hive issues tracking them, so hopefully they will be added in the future. We saw this impact our first Hive import earlier in this chapter: though the start_date column was of type DATE in MySQL, the Sqoop import flagged a conversion warning, and the resulting column in Hive was of type STRING.

Sqoop has an option that helps here: --map-column-hive lets us explicitly tell Sqoop what type to use when creating a column in the generated Hive table.
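
For example, a minimal sketch of such an import might look like the following. This assumes the employees table used earlier in this chapter; the option takes comma-separated column=type pairs, and here we simply make the conversion of start_date to STRING explicit rather than relying on Sqoop's default behavior:

    $ sqoop import --connect jdbc:mysql://10.0.0.100/hadooptest \
    --username hadoopuser -P \
    --table employees \
    --hive-import --hive-table employees \
    --map-column-hive start_date=STRING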
