This section covers activity to load data into Impala tables in two steps. First step moves the file, which we have created previously, from the local filesystem to HDFS. In the second step, the data is transferred into Impala tables, from the source file located on HDFS.
Now we will make sure that the preceding files are located in our Linux files system as follows:
[cloudera@localhost ~]$ ls -l *.txt -rw-rw-r-- 1 cloudera cloudera 985 Oct 15 18:48 automobiles.txt -rw-rw-r-- 1 cloudera cloudera 932 Oct 15 18:49 motorcycles.txt
Our next step is to move these files from the local filesystem to HDFS in separate folders. Depending on your Hadoop version, you can use either Hadoop fs
or hdfs dfs
; however, I have used hdfs dfs
as follows:
[cloudera@localhost ~]$ hdfs dfs -mkdir /user/cloudera/automobiles [cloudera@localhost ~]$ hdfs dfs -mkdir /user/cloudera/motorcycles [cloudera@localhost ~]$ hdfs dfs -ls /user/cloudera/ Found 2 items drwxr-xr-x - cloudera cloudera 0 2013-10-15 19:16 /user/cloudera/automobiles drwxr-xr-x - cloudera cloudera 0 2013-10-15 19:16 /user/cloudera/motorcycles [cloudera@localhost ~]$ hdfs dfs -moveFromLocal automobiles.txt /user/cloudera/automobiles/automobiles.txt [cloudera@localhost ~]$ hdfs dfs -moveFromLocal motorcycles.txt /user/cloudera/motorcycles/motorcycles.txt [cloudera@localhost ~]$ hdfs dfs -ls /user/cloudera/motorcycles/ Found 1 items -rw-r--r-- 3 cloudera cloudera 932 2013-10-15 19:19 /user/cloudera/motorcycles/motorcycles.txt [cloudera@localhost ~]$ hdfs dfs -ls /user/cloudera/automobiles/ Found 1 items -rw-r--r-- 3 cloudera cloudera 985 2013-10-15 19:17 /user/cloudera/automobiles/automobiles.txt
Now, we will load the preceding data into two separate tables in two different steps, to learn various ways of loading data. The tables we are using here are external tables instead of internal. For automobile data, I will load them directly from a script into the automobiles
table; and then I will load motorcycle data in the motorcycles
table inside the Impala shell. In the script, I will add another empty table, automakers
. Later, we will join a list of automakers from both tables. All of this processing will be done in a database named autos
.
Here is the SQL script to create a database autos
first, create the automobiles
table, and then load the whole dataset from HDFS. I am also creating an empty table automakers
in the autos_script.sql
script as follows:
USE default; DROP DATABASE IF EXISTS autos; CREATE DATABASE autos; USE autos; DROP TABLE IF EXISTS automobiles; CREATE EXTERNAL TABLE automobiles ( make STRING, model STRING, year INTEGER, fuelType STRING, numOfDoors INTEGER, design STRING, type STRING, cylinders INTEGER, horsePower INTEGER, city_hwy_mpg STRING, price FLOAT ) ROW FORMAT DELIMITED FIELD TERMINATED BY ',' STORED AS TEXTFILE LOCATION '/user/cloudera/automobiles/automobiles.txt'; CREATE EXTERNAL TABLE IF NOT EXISTS automakers ( autoMaker STRING ) ROW FORMAT DELIMITED FIELD TERMINATED BY ',';
Let's understand the preceding script. In the first line, we are setting the default database and then removing the autos
database if it exists. Next, we are creating a totally new autos
database and then using the auto
database. After that, we are dropping the automobiles
table if it exists. Then, we are creating a new automobiles
table using the schema as defined earlier. Then, we are passing the data source text files so the table can be populated with appropriate content. At last, we will also create another table named automaker
and use the IF NOT EXISTS
syntax. This syntax means that this particular table is created only if it does not exist.
Now, we are going to execute the preceding SQL script with the Impala shell using the following command syntax:
$impala-shell -i Impala-Server-Name:PORT -f SQL_Script_Name.sql
Here is the execution of the preceding SQL script:
[cloudera@localhost ~]$ impala-shell -i localhost.localdomain:21000 -f autos_setup.sql Connected to localhost.localdomain:21000 Server version: impalad version 1.0.1 RELEASE (build df844fb967cec8740f08dfb8b21962bc053527ef) Query: use default Query: drop DATABASE IF EXISTS autos Query: create DATABASE autos Query: use autos Query: drop TABLE IF EXISTS automobiles Query: create EXTERNAL TABLE automobiles ( make STRING, model STRING, autoyear INTEGER, fuelType STRING, numOfDoors INTEGER, design STRING, autoType STRING, cylinders INTEGER, horsePower INTEGER, city_hwy_mpg STRING, price FLOAT) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS TEXTFILE LOCATION '/user/cloudera/automobiles' Query: drop TABLE IF EXISTS automakers Query: create EXTERNAL TABLE automakers (autoMaker STRING)
3.145.169.109