Let's see an example of an import where a raw SQL statement is used to select the data to be imported.
$ hadoop fs -rmr employees
$ hive -e 'drop table employees'
sqoop import --connect jdbc:mysql://10.0.0.100/hadooptest --username hadoopuser -P --target-dir employees --query 'select first_name, dept, salary, timestamp(start_date) as start_date from employees where $CONDITIONS' --hive-import --hive-table employees --map-column-hive start_date=timestamp -m 1
$ hive -e "describe employees"
You will receive the following response:
OK
first_name    string
dept          string
salary        int
start_date    timestamp
Time taken: 2.591 seconds
$ hive -e "select * from employees"
You will receive the following response:
OK
Alice    Engineering    50000    2009-03-12 00:00:00
Bob      Sales          35000    2011-10-01 00:00:00
Camille  Marketing      40000    2003-04-20 00:00:00
David    Executive      75000    2001-03-20 00:00:00
Erica    Support        34000    2011-07-07 00:00:00
Time taken: 2.709 seconds
To achieve our goal, we used a very different form of the Sqoop import. Instead of specifying the desired table and then either letting Sqoop import all columns or a specified subset, here we use the --query option to define an explicit SQL statement.
In the statement, we select all the columns from the source table but apply the timestamp() function to convert the start_date column to the correct type. (Note that this function simply adds a 00:00:00 time element to the date.) We alias the result of this function, which allows us to name it in the type mapping option.
Because we have no --table option, we have to add --target-dir to tell Sqoop the name of the directory it should create on HDFS.
The WHERE clause in the SQL is required by Sqoop even though we are not actually using it. Having no --table option does not just remove Sqoop's ability to auto-generate the name of the export directory; it also means that Sqoop does not know from where data is being retrieved and, hence, how to partition the data across multiple mappers. The $CONDITIONS variable is used in conjunction with the --split-by option; specifying the latter names the column Sqoop should use to partition the query across mappers. We take a different route here and instead explicitly set the number of mappers to 1, which obviates the need for an explicit partitioning clause.
After executing Sqoop, we examine the table definition in Hive, which, as before, has the correct datatypes for all columns. We then look at the data, and this is now successful, with the start_date column data being appropriately converted into TIMESTAMP values.
Though it truly is not needed for such a small data set, the $CONDITIONS variable is an important tool. Modify the preceding Sqoop statement to use multiple mappers with an explicit partitioning statement.
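One possible solution to this exercise is sketched below. The choice of salary as the splitting column is illustrative only; any numeric column with a reasonably even distribution would do. Sqoop substitutes per-mapper range predicates for $CONDITIONS, splitting on the column named by --split-by:

```shell
# Sketch only: re-run the free-form query import with four mappers,
# letting Sqoop partition the query on the salary column.
sqoop import \
  --connect jdbc:mysql://10.0.0.100/hadooptest \
  --username hadoopuser -P \
  --target-dir employees \
  --query 'select first_name, dept, salary, timestamp(start_date) as start_date from employees where $CONDITIONS' \
  --split-by salary \
  --hive-import --hive-table employees \
  --map-column-hive start_date=timestamp \
  -m 4
```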
In Chapter 8, A Relational View on Data with Hive, we talked a lot about Hive partitions and highlighted how important they are in allowing query optimization for very large tables. The good news is that Sqoop can support Hive partitions; the bad news is that the support is not complete.
To import data from a relational database into a partitioned Hive table, we use the --hive-partition-key option to specify the partition column and the --hive-partition-value option to specify the value for the partition into which this Sqoop command will import data.
This is excellent but does require each Sqoop statement to import into a single Hive partition; there is currently no support for Hive auto-partitioning. Instead, if a data set is to be imported into multiple partitions in a table, we need to use a separate Sqoop statement for insertion into each partition.
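As a sketch, suppose we want dept as the Hive partition key (the table and column names here are illustrative). Loading one department's rows into its partition might look like the following; note that the partition column is excluded from the imported columns, since Hive stores its value in the partition path rather than in the data files:

```shell
# Sketch: each invocation loads exactly one Hive partition.
# To cover Sales, Marketing, and so on, repeat the command with a
# different --hive-partition-value and matching --where clause.
sqoop import \
  --connect jdbc:mysql://10.0.0.100/hadooptest \
  --username hadoopuser -P \
  --table employees \
  --columns first_name,salary,start_date \
  --where "dept = 'Engineering'" \
  --hive-import --hive-table employees_partitioned \
  --hive-partition-key dept \
  --hive-partition-value Engineering \
  -m 1
```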
Until now, we have been implicitly relying on some defaults, but we should discuss them at this point. Our original text file was tab-separated, but you may have noticed that the data we imported onto HDFS was comma-separated. And if you look in the files under /user/hive/warehouse/employees (remember, this is the default location on HDFS where Hive keeps its source files), the records use ASCII code 001 as the separator. What is going on?
In the first instance, we let Sqoop use its defaults, which in this case means using a comma to separate fields and a newline (\n) to terminate records. However, when Sqoop is importing into Hive, it instead employs the Hive defaults, which include using ASCII code 001 (^A) to separate fields.
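You can see what this non-printing separator looks like with a quick shell experiment (the sample record below is made up for illustration):

```shell
# Write one record using Hive's default field separator, the
# non-printing byte \001 (Ctrl-A), just as Sqoop's Hive import does.
printf 'Alice\001Engineering\00150000\n' > /tmp/sample_row

# Splitting on the same byte recovers the individual fields.
awk -F "$(printf '\001')" '{print $1, $3}' /tmp/sample_row
# prints: Alice 50000
```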
We can explicitly set separators using the following Sqoop options:

- --fields-terminated-by: the separator between fields
- --lines-terminated-by: the line terminator
- --escaped-by: the character used to escape others (for example, \)
- --enclosed-by: the character enclosing fields (for example, ")
- --optionally-enclosed-by: similar to the preceding option, but not mandatory
- --mysql-delimiters: a shortcut to use the MySQL defaults

This may look a little intimidating, but it's not as obscure as the terminology may suggest, and the concepts and syntax should be familiar to those with SQL experience. The first few options are pretty self-explanatory; where it gets less clear is when talking of enclosing and optionally enclosing characters.
This is really about (usually free-form) data where a given field may include characters that have special meanings, for example, a string column in a comma-separated file whose values themselves contain commas. In such a case, we could enclose the string columns within quotes to allow the commas within the field. If all fields need such enclosing characters, we would use the first form; if only a subset of the fields requires it, it could be specified as optional.
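Putting a few of these options together, the following is a sketch of an import that writes comma-separated records with quoting and escaping enabled (the option values are illustrative, not the only sensible choices):

```shell
# Sketch: comma-separated fields, each optionally wrapped in double
# quotes, with backslash as the escape character for embedded quotes.
sqoop import \
  --connect jdbc:mysql://10.0.0.100/hadooptest \
  --username hadoopuser -P \
  --table employees \
  --fields-terminated-by ',' \
  --escaped-by '\\' \
  --optionally-enclosed-by '"' \
  -m 1
```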