Time for action – importing data from Hadoop into MySQL

Let's demonstrate this by importing data into a MySQL table from an HDFS file.

  1. Create a tab-separated file named newemployees.tsv with the following entries:
    Frances  Operations  34000  2012-03-01
    Greg  Engineering  60000  2003-11-18
    Harry  Intern  22000  2012-05-15
    Iris  Executive  80000  2001-04-08
    Jan  Support  28500  2009-03-30
  2. Create a new directory on HDFS and copy the file into it:
    $ hadoop fs -mkdir edata
    $ hadoop fs -put newemployees.tsv edata/newemployees.tsv
    
  3. Confirm the current number of records in the employee table:
    $ echo "select count(*) from employees" | 
    mysql –u hadoopuser –p hadooptest
    

    You will receive the following response:

    Enter password: 
    count(*)
    5
    
  4. Run a Sqoop export:
    $ sqoop export --connect jdbc:mysql://10.0.0.100/hadooptest 
    --username hadoopuser  -P --table employees 
    --export-dir edata --input-fields-terminated-by '\t'
    

    You will receive the following response:

    12/05/27 07:52:22 INFO mapreduce.ExportJobBase: Exported 5 records.
    
  5. Check the number of records in the table after the export:
    Echo "select count(*) from employees" 
    | mysql -u hadoopuser -p hadooptest
    

    You will receive the following response:

    Enter password: 
    count(*)
    10
    
  6. Check the data:
    $ echo "select * from employees" 
    | mysql -u hadoopuser -p hadooptest
    

    You will receive the following response:

    Enter password: 
    first_name  dept  salary  start_date
    Alice  Engineering  50000  2009-03-12
    
    Frances  Operations  34000  2012-03-01
    Greg  Engineering  60000  2003-11-18
    Harry  Intern  22000  2012-05-15
    Iris  Executive  80000  2001-04-08
    Jan  Support  28500  2009-03-30
    

What just happened?

We first created a data file containing information on five more employees. We created a directory for our data on HDFS into which we copied the new file.

Before running the export, we confirmed that the table in MySQL contained the original five employees only.

The Sqoop command has a similar structure to the ones used before, the biggest change being the use of the export command. As the name suggests, a Sqoop export moves data from Hadoop into a relational database.

We used several of the same options as before, mainly to specify the database connection, the username and password needed to connect, and the table into which to insert the data.

Because we are exporting data from HDFS, we needed to specify the location containing the files to be exported, which we do via the --export-dir option. All files contained within that directory will be exported; the data does not need to be in a single file, as Sqoop will include every file in the directory within its MapReduce job. By default, Sqoop uses four mappers; if you have a large number of files, it may be more effective to increase this number, but do test to ensure that the load on the database remains under control.
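
The mapper count can be changed with Sqoop's -m (or --num-mappers) option. The following sketch simply re-runs our export with eight mappers; the value of eight is an arbitrary illustration and should be tuned to what your cluster and database can handle:
    # Same export as before, but with eight parallel map tasks
    $ sqoop export --connect jdbc:mysql://10.0.0.100/hadooptest \
    --username hadoopuser -P --table employees \
    --export-dir edata --input-fields-terminated-by '\t' \
    -m 8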

The final option passed to Sqoop specified the field terminator used in the source files, in this case the tab character. It is your responsibility to ensure the data files are properly formatted; Sqoop assumes each record has the same number of fields as the table has columns (though null values are acceptable), separated by the specified field separator character.
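
If the source files were laid out differently, these input-parsing options would change to match. As a purely illustrative sketch, exporting a hypothetical comma-separated directory (csvdata here) whose fields may be enclosed in double quotes could look like the following:
    # Illustrative only: CSV-style input with optionally quoted fields
    $ sqoop export --connect jdbc:mysql://10.0.0.100/hadooptest \
    --username hadoopuser -P --table employees \
    --export-dir csvdata \
    --input-fields-terminated-by ',' \
    --input-optionally-enclosed-by '"'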

After watching the Sqoop command complete successfully, we saw that it reported exporting five records. Using the mysql tool, we then checked the number of rows now in the table and viewed the data to confirm that our old friends are now joined by the new employees.

Differences between Sqoop imports and exports

Though similar conceptually and in the command-line invocations, there are a number of important differences between Sqoop imports and exports that are worth exploring.

Firstly, Sqoop imports can assume much more about the data being processed; through either explicitly named tables or added predicates, Sqoop has a good deal of information about both the structure and the types of the data. Sqoop exports, however, are given only the location of the source files and the characters used to separate and enclose fields and records. While Sqoop imports into Hive can automatically create a new table based on the provided table name and structure, a Sqoop export must write into an existing table in the relational database.
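
Because the target table must already exist, something equivalent to the following would have been run in MySQL before the first export; this is only a sketch, with column types assumed from the sample data rather than taken from the original schema (first_name as the primary key matches the behaviour described below):
    # Minimal sketch of a compatible table; column types are assumptions
    $ mysql -u hadoopuser -p hadooptest -e "CREATE TABLE employees (
      first_name VARCHAR(100) NOT NULL PRIMARY KEY,
      dept VARCHAR(100),
      salary INT,
      start_date DATE)"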

Even though our earlier demonstration with dates and timestamps showed there are some sharp edges, Sqoop imports can also be confident that the source data complies with the defined column types; otherwise it could not have been inserted into the database in the first place. Sqoop exports, in contrast, effectively only have access to fields of characters with no knowledge of the real datatypes. If you have the luxury of very clean and well-formatted data, this may never matter, but for the rest of us, there will be a need to consider data exports and type conversions, particularly in terms of null and default values. The Sqoop documentation covers these options in some detail and is worth a read.
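
For example, Sqoop's export arguments include options describing how nulls are represented in the source files. Assuming the HDFS data used the common convention (Hive's default) of writing nulls as \N, which our sample files do not actually contain, the export could be told about it like this:
    # Tell Sqoop how nulls are encoded in the source files
    $ sqoop export --connect jdbc:mysql://10.0.0.100/hadooptest \
    --username hadoopuser -P --table employees \
    --export-dir edata --input-fields-terminated-by '\t' \
    --input-null-string '\\N' --input-null-non-string '\\N'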

Inserts versus updates

Our preceding example was very straightforward; we added an entirely new set of data that can happily coexist with the existing contents of the table. Sqoop exports by default do a series of appends, adding each record as a new row in the table.

However, what if we later want to update data when, for example, our employees get increased salaries at the end of the year? With the database table defining first_name as a primary key, any attempt to insert a new row with the same name as an existing employee will fail with a primary key constraint violation.

In such cases, we can set the Sqoop --update-key option to specify the primary key, and Sqoop will generate UPDATE statements based on this key (it can be a comma-separated list of keys), as opposed to INSERT statements adding new rows.

Note

In this mode, any record that does not match an existing key value will silently be ignored, and Sqoop will not flag errors if a statement updates more than one row.

If we also want the export to insert new rows for keys that do not yet exist, while still updating existing ones, we can set the --update-mode option to allowinsert.
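
Putting these options together, a sketch of such an export might look like the following; the edata2 directory is a hypothetical location holding the changed and new records:
    # Update rows matched on first_name; insert records with new keys
    $ sqoop export --connect jdbc:mysql://10.0.0.100/hadooptest \
    --username hadoopuser -P --table employees \
    --export-dir edata2 --input-fields-terminated-by '\t' \
    --update-key first_name --update-mode allowinsert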

Have a go hero

Create another data file that contains three new employees as well as updated salaries for two of the existing employees. Use a Sqoop export to both add the new employees and apply the needed updates.

Sqoop and Hive exports

Given the preceding example, it may not be surprising to learn that Sqoop does not currently have any direct support to export a Hive table into a relational database. More precisely, there are no explicit equivalents to the --hive-import option we used earlier.

However, in some cases, we can work around this. If a Hive table stores its data in text format, we can point Sqoop at the location of the table's data files on HDFS. In the case of tables referring to external data, this may be straightforward, but once we start seeing Hive tables with complex partitioning, the directory structure becomes more involved.
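
As a rough sketch, a non-partitioned, text-format Hive table named employees would, assuming Hive's defaults are in use, keep its files under /user/hive/warehouse/employees and separate fields with the Ctrl-A character (octal 001), so an export could be pointed there directly; the table name, location, and delimiter here are all assumptions about that default setup:
    # Hive's default text format uses Ctrl-A (\001) as the field separator
    $ sqoop export --connect jdbc:mysql://10.0.0.100/hadooptest \
    --username hadoopuser -P --table employees \
    --export-dir /user/hive/warehouse/employees \
    --input-fields-terminated-by '\001'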

Hive can also store tables as binary SequenceFiles, and a current limitation is that Sqoop cannot transparently export from tables stored in this format.
