Let's see how this works by performing an import that is limited by a conditional expression.
$ hadoop fs -rmr employees
You will receive the following response:
Deleted hdfs://head:9000/user/hadoop/employees
$ sqoop import --connect jdbc:mysql://10.0.0.100/hadooptest \
    --username hadoopuser -P --table employees \
    --columns first_name,salary --where "salary > 45000" \
    --hive-import --hive-table salary
You will receive the following response:
12/05/23 15:02:03 INFO hive.HiveImport: Hive import complete.
$ hive -e "describe salary"
You will receive the following response:
OK
first_name	string
salary	int
Time taken: 2.57 seconds
$ hive -e "select * from salary"
You will see the following output:
OK
Alice	50000
David	75000
Time taken: 2.754 seconds
This time, our Sqoop command added the --columns option, which takes a comma-separated list specifying the columns to include in the import. We also used the --where option, which allows the free-text specification of a WHERE clause to be applied to the SQL used to extract data from the database.
The combination of these options means that our Sqoop command imports only the names and salaries of those employees whose salary exceeds the threshold specified in the WHERE clause.
We execute the command, see it complete successfully, and then examine the table created in Hive. It does indeed contain only the specified columns, and displaying the table contents verifies that the WHERE predicate was also applied correctly.
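Conceptually, Sqoop combines the two options into a single extraction query. As a sanity check before a large import, you could run the equivalent query directly against MySQL to preview exactly which rows Sqoop will pull; this is a hypothetical verification step (the mysql client invocation and credentials are assumptions based on the connect string used above):

```shell
# Preview the rows Sqoop will import: --columns becomes the SELECT
# list and --where becomes the WHERE clause of the generated query.
mysql -h 10.0.0.100 -u hadoopuser -p hadooptest \
  -e "SELECT first_name, salary FROM employees WHERE salary > 45000"
```

If this query returns unexpected rows, fixing the predicate here is far cheaper than re-running the full import.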
In Chapter 8, A Relational View on Data with Hive, we mentioned that Hive does not support all the common SQL datatypes. The DATE and DATETIME types in particular are not currently implemented, though they do exist as identified Hive issues, so hopefully they will be added in the future. We saw this impact our first Hive import earlier in this chapter: though the start_date column was of type DATE in MySQL, the Sqoop import flagged a conversion warning, and the resultant column in Hive was of type STRING.
Sqoop has an option that is of use here: we can use --map-column-hive to explicitly tell Sqoop what type to give a column in the generated Hive table.
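A sketch of how this might look for our earlier import, assuming we map start_date to Hive's TIMESTAMP type (the choice of TIMESTAMP here is an illustration; whether it suits your data depends on the Hive version and the column's contents):

```shell
# Hypothetical re-import that overrides the default DATE-to-STRING
# conversion: --map-column-hive takes column=type pairs and forces
# the named column to that type in the generated Hive table.
sqoop import --connect jdbc:mysql://10.0.0.100/hadooptest \
    --username hadoopuser -P --table employees \
    --hive-import --hive-table employees \
    --map-column-hive start_date=timestamp
```

Multiple mappings can be supplied as a comma-separated list of column=type pairs.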