Now, we will try to check which maker is common between both tables and their models by using the join
SQL command as follows:
[Hadoop.testdomain:21000] > select automobiles.make, automobiles.model, motorcycles.make,motorcycles.model from automobiles JOIN motorcycles USING (make); Query finished, fetching results ... +-------+-------+-------+---------+ | make | model | make | model | +-------+-------+-------+---------+ | BMW | M3 | BMW | R1200RT | | BMW | X5 | BMW | R1200RT | | Honda | Civic | Honda | VFR400 | | Honda | Civic | Honda | CB900 | | Honda | Civic | Honda | CBR600 | +-------+-------+-------+---------+ Returned 5 row(s) in 0.40s
Now, we will insert automakers from both the automobiles
and motorcycles
tables into the automakers table by using the Insert Overwrite
and Insert Into
SQL statements as follows:
[Hadoop.testdomain:21000] > INSERT OVERWRITE TABLE automakers SELECT distinct(make) from automobiles; Inserted 10 rows in 2.00s [Hadoop.testdomain:21000] > INSERT INTO TABLE automakers SELECT distinct(make) from motorcycles; Inserted 10 rows in 0.71s
In the automobiles
table, I have included a field named city_hwy_mpg
. It includes miles per gallon in the city and highway, separated by a dash (-
). Here, I want to show you how to use the STRING
manipulation to get both highway and city miles per gallon values along with the LIMIT
and WHERE
clauses:
[Hadooptestdomain:21000] > select city_hwy_mpg, substr(city_hwy_mpg,1,2), substr(city_hwy_mpg, instr(city_hwy_mpg, "-")+1 , 5) from automobiles WHERE price > 15000 LIMIT 2; Query finished, fetching results ... +--------------+----------------------------+-------------------------------------------------------+ | city_hwy_mpg | substr(city_hwy_mpg, 1, 2) | substr(city_hwy_mpg, instr(city_hwy_mpg, '-') + 1, 5) | +--------------+----------------------------+-------------------------------------------------------+ | 22-30 | 22 | 30 | | 24-32 | 24 | 32 | +--------------+----------------------------+-------------------------------------------------------+ Returned 2 row(s) in 0.35s
In the preceding code snippet, the substr
SQL command is used to select the first two letters that represent the city_mpg
value. After this, instr
is used to find the location of -
and then the same substr
is used to get the hwy_mpg
part of the value.
Now, let's use the COUNT
and DISTINCT
clause to collect unique automakers as follows:
[Hadoop.testdomain:21000] > select count(distinct(make)) from automobiles; Query finished, fetching results ... +----------------------+ | count(distinct make) | +----------------------+ | 10 | +----------------------+ Returned 1 row(s) in 0.48s
The preceding distinct
SQL command removes the duplicate make
values to make them unique and after that the count
SQL command counts all these items to return the result.
Now, you will see how to use the
EXPLAIN
clause to understand a query execution as follows:
[Hadoop.testdomain:21000] > explain select * from motorcycles where price > 10000 and price < 20000; PLAN FRAGMENT 0 PARTITION: UNPARTITIONED 1:EXCHANGE tuple ids: 0 PLAN FRAGMENT 1 PARTITION: RANDOM STREAM DATA SINK EXCHANGE ID: 1 UNPARTITIONED 0:SCAN HDFS table=autos.motorcycles #partitions=1 size=889B predicates: price > 10000.0, price < 20000.0 tuple ids: 0
The explain
command takes the remaining part of the query and then shows how it is going to handle the query. If you pass the select
command with explain
, you will get results about the query execution plan from the optimizer. Depending upon your query, explain
will show you if a partition is used, where the data is stored, how the results are fetched from the table, and if any index is used or not. You can use the result from the explain
SQL command to do an interactive analysis of your query and then modify it for faster execution as and if needed. You can also use the results from explain
to troubleshoot specific issues.
3.22.181.47