Let's demonstrate this by importing data into a MySQL table from an HDFS file.
Create a tab-separated file called newemployees.tsv with the following entries:

Frances	Operations	34000	2012-03-01
Greg	Engineering	60000	2003-11-18
Harry	Intern	22000	2012-05-15
Iris	Executive	80000	2001-04-08
Jan	Support	28500	2009-03-30
$ hadoop fs -mkdir edata
$ hadoop fs -put newemployees.tsv edata/newemployees.tsv
$ echo "select count(*) from employees" | mysql -u hadoopuser -p hadooptest
You will receive the following response:
Enter password:
count(*)
5
$ sqoop export --connect jdbc:mysql://10.0.0.100/hadooptest --username hadoopuser -P --table employees --export-dir edata --input-fields-terminated-by '\t'
You will receive the following response:
12/05/27 07:52:22 INFO mapreduce.ExportJobBase: Exported 5 records.
$ echo "select count(*) from employees" | mysql -u hadoopuser -p hadooptest
You will receive the following response:
Enter password:
count(*)
10
$ echo "select * from employees" | mysql -u hadoopuser -p hadooptest
You will receive the following response:
Enter password:
first_name	dept	salary	start_date
Alice	Engineering	50000	2009-03-12
…
Frances	Operations	34000	2012-03-01
Greg	Engineering	60000	2003-11-18
Harry	Intern	22000	2012-05-15
Iris	Executive	80000	2001-04-08
Jan	Support	28500	2009-03-30
We first created a data file containing information on five more employees. We created a directory for our data on HDFS into which we copied the new file.
Before running the export, we confirmed that the table in MySQL contained the original five employees only.
The Sqoop command has a similar structure to before, the biggest change being the use of the export
command. As the name suggests, the export command moves data from Hadoop into a relational database.
We used several options similar to those used before, mainly to specify the database connection, the username and password needed to connect, and the table into which to insert the data.
Because we are exporting data from HDFS, we needed to specify the location of the files to be exported, which we did via the --export-dir
option. All files contained within that directory will be exported; the data need not be in a single file, as Sqoop will include every file in the directory within its MapReduce job. By default, Sqoop uses four mappers; if you have a large number of files, it may be more effective to increase this number, but do test to ensure that the load on the database remains under control.
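As a sketch of tuning the mapper count (the -m/--num-mappers option is standard Sqoop; the value of 8 is simply an illustrative choice, not a recommendation), the same export could be run as:

```shell
# Run the export with eight parallel mappers instead of the default four.
# More mappers means more concurrent INSERT load on the MySQL server, so
# increase this cautiously and monitor the database while the job runs.
sqoop export \
    --connect jdbc:mysql://10.0.0.100/hadooptest \
    --username hadoopuser -P \
    --table employees \
    --export-dir edata \
    --input-fields-terminated-by '\t' \
    -m 8
```

This command requires a live Hadoop cluster and MySQL server, so treat it as a template rather than something to paste verbatim.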
The final option passed to Sqoop specified the field terminator used in the source files, in this case, the tab character. It is your responsibility to ensure the data files are properly formatted; Sqoop will assume each record contains the same number of fields as the table has columns (though null values are acceptable), separated by the specified field separator character.
After the Sqoop command completed successfully, we saw that it reported exporting five records. We then used the mysql
tool to check the number of rows now in the table and viewed the data to confirm that our old friends had been joined by the new employees.
Though similar conceptually and in the command-line invocations, there are a number of important differences between Sqoop imports and exports that are worth exploring.
Firstly, Sqoop imports can assume much more about the data being processed; through either explicitly named tables or added predicates, there is much information about both the structure and type of the data. Sqoop exports, however, are given only a location of source files and the characters used to separate and enclose fields and records. While Sqoop imports into Hive can automatically create a new table based on the provided table name and structure, a Sqoop export must be into an existing table in the relational database.
Even though our earlier demonstration with dates and timestamps showed there are some sharp edges, Sqoop imports can also rely on the source data complying with the defined column types; it could not have been inserted into the database otherwise. Sqoop exports, in contrast, effectively have access only to fields of characters, with no understanding of the real datatypes. If you have the luxury of very clean and well-formatted data, this may never matter, but for the rest of us, there will be a need to consider data exports and type conversions, particularly in terms of null and default values. The Sqoop documentation covers these options in some detail and is worth a read.
Our preceding example was very straightforward; we added an entire new set of data that can happily coexist with the existing contents of the table. Sqoop exports by default do a series of appends, adding each record as a new row in the table.
However, what if we later want to update data when, for example, our employees get increased salaries at the end of the year? With the database table defining first_name
as a primary key, any attempt to insert a new row with the same name as an existing employee will fail with a primary key constraint violation.
In such cases, we can set the Sqoop --update-key
option to specify the primary key, and Sqoop will generate UPDATE
statements based on this key (it can be a comma-separated list of keys), as opposed to INSERT
statements adding new rows.
If we also want the option of an update that adds new rows for non-existing data, we can set the --update-mode
option to allowinsert
.
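As a sketch of these two options together (both --update-key and --update-mode are documented Sqoop flags; the connection details and file layout are assumed to match the earlier examples), an export that updates existing employees by name and inserts any new ones might look like:

```shell
# Rows whose first_name already exists in the table are rewritten as
# UPDATE statements keyed on that column; records with no matching
# first_name are inserted as new rows thanks to allowinsert mode.
sqoop export \
    --connect jdbc:mysql://10.0.0.100/hadooptest \
    --username hadoopuser -P \
    --table employees \
    --export-dir edata \
    --input-fields-terminated-by '\t' \
    --update-key first_name \
    --update-mode allowinsert
```

Without --update-mode allowinsert, records whose key does not match any existing row are silently skipped rather than inserted, so the mode matters whenever the export mixes updates with new data.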
Create another data file that contains three new employees as well as updated salaries for two of the existing employees. Use Sqoop in export mode both to add the new employees and to apply the needed updates.
Given the
preceding example, it may not be surprising to learn that Sqoop does not currently have any direct support to export a Hive table into a relational database. More precisely, there are no explicit equivalents to the --hive-import
option we used earlier.
However, in some cases, we can work around this. If a Hive table stores its data in text format, we can point Sqoop at the location of the table's data files on HDFS. For tables referring to external data, this may be straightforward, but once Hive tables use complex partitioning, the directory structure becomes more involved.
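As a hedged sketch of that workaround, the command below points Sqoop directly at a text-format Hive table's files. The warehouse path shown is the common default (/user/hive/warehouse), but your installation may differ, and Hive's default text format delimits fields with the Ctrl-A character rather than tabs:

```shell
# Export a text-format Hive table by targeting its HDFS data files directly.
# Hive's default field delimiter for text tables is Ctrl-A, passed here in
# octal escape form; adjust if the table was created with custom delimiters.
sqoop export \
    --connect jdbc:mysql://10.0.0.100/hadooptest \
    --username hadoopuser -P \
    --table employees \
    --export-dir /user/hive/warehouse/employees \
    --input-fields-terminated-by '\001'
```

For a partitioned table, the data sits in per-partition subdirectories beneath this path, which is where the approach starts to become unwieldy.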
Hive can also store tables as binary SequenceFiles, and a current limitation is that Sqoop cannot transparently export from tables stored in this format.