Sqoop 2

Sqoop 2 provides all the capabilities described above for Sqoop 1, along with several new capabilities that do not exist in Sqoop 1. These include the Sqoop shell, support for multiple databases as metadata repositories, and the concepts of links and remote submission of Sqoop jobs.

Sqoop 2 runs from within an embedded container and depends on Hadoop environment variables to locate all dependencies.

Here, we refer to version 1.99.5 of Sqoop 2, primarily because it worked slightly better than the latest Sqoop 2 releases at the time of authoring this book.

To install Sqoop 2, perform the following steps:

  1. Download the Sqoop 1.99.5 (yes, this is officially called Sqoop 2) binary distribution for Hadoop 2 using the following command:
wget http://archive.apache.org/dist/sqoop/1.99.5/sqoop-1.99.5-bin-hadoop200.tar.gz
  2. Once the download is complete, extract the contents of the tarball into a user directory with the following command. Let us refer to the extracted Sqoop folder as ${SQOOP2_HOME} and update the ~/.bashrc file accordingly (a sample set of entries is shown after this list):
tar -zxvf ${DOWNLOAD_DIR}/sqoop-1.99.5-bin-hadoop200.tar.gz
  3. Configure the file located at ${SQOOP2_HOME}/server/conf/catalina.properties to contain all the absolute classpath directories of our Hadoop install. A sample of such a configuration in our case is given below for reference (/data1/home/centos/hadoop-2.7.3 below refers to ${HADOOP_HOME}; make sure absolute paths are given, as this file does not resolve environment variables correctly):
common.loader=${catalina.base}/lib,${catalina.base}/lib/*.jar,${catalina.home}/lib,${catalina.home}/lib/*.jar,${catalina.home}/../lib/*.jar,/data1/home/centos/hadoop-2.7.3/share/hadoop/common/lib/*.jar,/data1/home/centos/hadoop-2.7.3/share/hadoop/common/*.jar,/data1/home/centos/hadoop-2.7.3/share/hadoop/hdfs/*.jar,/data1/home/centos/hadoop-2.7.3/share/hadoop/hdfs/lib/*.jar,/data1/home/centos/hadoop-2.7.3/share/hadoop/mapreduce/*.jar,/data1/home/centos/hadoop-2.7.3/share/hadoop/mapreduce/lib/*.jar,/data1/home/centos/hadoop-2.7.3/share/hadoop/yarn/*.jar,/data1/home/centos/hadoop-2.7.3/share/hadoop/yarn/lib/*.jar
  4. Configure ${SQOOP2_HOME}/server/conf/sqoop.properties to point to the Hadoop configuration directory as shown:
org.apache.sqoop.submission.engine.mapreduce.configuration.directory=/data1/home/centos/hadoop-2.7.3/etc/hadoop
  5. Make the Sqoop 2 scripts executable with the following command:
chmod +x ${SQOOP2_HOME}/bin/*
  6. Verify the Sqoop 2 configuration with the following command (if you see the error Caused by: java.sql.SQLNonTransientConnectionException: No current connection, it can be ignored; the configuration is still good):
${SQOOP2_HOME}/bin/sqoop2-tool verify
  7. Launch the Sqoop 2 server with the following command:
${SQOOP2_HOME}/bin/sqoop2-server start
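
As referenced in step 2, the following is a minimal sketch of the ~/.bashrc entries we assume for the rest of this section; the install paths are examples from our environment and should be adjusted to match yours:

# Sample ~/.bashrc entries (the paths below are assumptions from our environment)
export HADOOP_HOME=/data1/home/centos/hadoop-2.7.3
export SQOOP2_HOME=/data1/home/centos/sqoop-1.99.5-bin-hadoop200
export PATH=$PATH:$SQOOP2_HOME/bin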

Hue comes with native integration for Sqoop 2; however, at the time of authoring this book, there were incompatibilities between Hue and Sqoop 2. These incompatibilities have been raised and are currently registered as bugs in the Hue JIRA. One such issue is provided here for reference: https://issues.cloudera.org/browse/HUE-5128.

However, we observed that once Sqoop 2 was set up and working, we could use the Sqoop 2 shell and view the configured jobs and connectors in Hue. Some screenshots have been provided as an indicative reference. Once these issues are fixed in Hue, this will be a very strong capability for actively managing Sqoop 2 via Hue.
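
For readers who prefer the command line while the Hue issues remain open, the following is an indicative sketch of inspecting connectors, links, and jobs from the Sqoop 2 shell; it assumes the server started above is listening on the default port 12000 on localhost:

${SQOOP2_HOME}/bin/sqoop2-shell
sqoop:000> set server --host localhost --port 12000 --webapp sqoop
sqoop:000> show version --all    # confirms both client and server versions, proving connectivity
sqoop:000> show connector        # lists the connectors available on the server
sqoop:000> show link             # lists the links configured in the repository
sqoop:000> show job              # lists the jobs configured in the repository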

In the images below, we see a few resource links that can be configured via Hue for Sqoop 2 (in the Data Browser menu item, click on Sqoop Transfer). Here, we configure a database link and an HDFS link.

Figure 20: RDBMS to HDFS via Sqoop 2 and Hue

Clicking on Next navigates Hue to the next screen, which captures the database connection details and finally saves the database link to the repository.

Figure 21: RDBMS Link

Similarly, an HDFS link can also be configured with an HDFS link name and an HDFS URI, as shown:

Figure 22: HDFS Link
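
The same links can also be created from the Sqoop 2 shell. The sketch below is indicative only: create link starts an interactive prompt, the connector IDs are placeholders (check the output of show connector on your install), and the exact prompts and options can vary between Sqoop 2 versions:

sqoop:000> show connector                      # note the IDs of the JDBC and HDFS connectors
sqoop:000> create link -c <jdbc-connector-id>  # prompts for link name, JDBC driver, connection string, username, password
sqoop:000> create link -c <hdfs-connector-id>  # prompts for link name and HDFS URI
sqoop:000> show link                           # confirm both links were saved to the repository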

Like Sqoop 1, Sqoop 2 also needs a metadata repository. By default, Sqoop 2 uses an embedded Apache Derby database (https://db.apache.org/derby/); however, external databases can also be configured via the sqoop.properties file. A sample of this file is provided in the following code block, highlighting the Sqoop 2 repository configuration:

# External connectors load path
# "/path/to/external/connectors/": Add all the connector JARs in the specified folder
#
org.apache.sqoop.connector.external.loadpath=
org.apache.sqoop.repository.jdbc.handler=org.apache.sqoop.repository.postgresql.PostgresqlRepositoryHandler
org.apache.sqoop.repository.jdbc.maximum.connections=4
org.apache.sqoop.repository.jdbc.url=jdbc:postgresql://192.168.43.28/sqoop?schema=SQOOP
org.apache.sqoop.repository.jdbc.driver=org.postgresql.Driver
org.apache.sqoop.repository.jdbc.user=sqoop
org.apache.sqoop.repository.jdbc.password=sqoop
org.apache.sqoop.repository.jdbc.transaction.isolation=READ_COMMITTED

As shown in the code snippet above, we need to create a user called sqoop for this configuration to work. Cloudera provides a reference page on how to set up this user at the following location: https://goo.gl/F6iJsb.

We have reproduced those setup instructions (from Cloudera) below, with very minor modifications for the specific PostgreSQL version we are working with in this book. A full setup and detailing of Sqoop 2 is outside the scope of this book:

$ psql -U postgres
Password for user postgres: *****
postgres=# CREATE ROLE sqoop LOGIN ENCRYPTED PASSWORD 'sqoop'
NOSUPERUSER INHERIT CREATEDB NOCREATEROLE;
CREATE ROLE

postgres=# CREATE DATABASE "sqoop" WITH OWNER = sqoop
ENCODING = 'UTF8'
TABLESPACE = pg_default
LC_COLLATE = 'en_US.UTF-8'
LC_CTYPE = 'en_US.UTF-8'
CONNECTION LIMIT = -1;
CREATE DATABASE

postgres=# \q
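
Once the role and database are in place, Sqoop 2 also needs the PostgreSQL JDBC driver on its classpath; one way to achieve this is to add the directory containing the driver JAR to the common.loader entry in catalina.properties shown earlier (the driver location below is an assumption for illustration). The server can then be re-verified and restarted against the new repository:

# Assumption: the PostgreSQL JDBC driver JAR has been placed in /data1/home/centos/jdbc/
# and that directory (as /data1/home/centos/jdbc/*.jar) appended to common.loader
${SQOOP2_HOME}/bin/sqoop2-server stop
${SQOOP2_HOME}/bin/sqoop2-tool verify
${SQOOP2_HOME}/bin/sqoop2-server start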