In this chapter, we will go over a use case to see Impala concepts in action. This way you can experience a real-world scenario using Impala, and understand how and where to use Impala statements in real-world applications. In this chapter, I will be using a scenario as described in the following sections.
We are going to deal with information related to automobiles. We have two data files that contain information about automobiles and motorcycles in two separate text files. The following conceptual image shows that within the Autos database, there are two tables named Motorcycles and Automobiles.
So far, it is imprinted on your mind that Impala is running on DataNode, and the files in our project are stored on HDFS. First we will load these files from HDFS to Impala and then we will use SQL statements to process this information through multiple queries.
Let's take a look at this example dataset, which has a list of automobile names and their properties as defined in the schema. The following is the first text file, which has automobile-specific data:
automobiles.txt
make
, model
, year
, fuel-type
, numOfDoors
, design
, type
, cylinders
, horsepower
, city_hwy_mpg
, price
Here is the data in the automobiles.txt
file:
Audi,A4,2011,gas,4,sedan,casual,6,476,22-30,45000 Jeep,Compass,2007,gas,3,suv,sport,6,170,24-32,22000 Dodge,Challenger,2013,gas,4,coupe,casual,6,210,20-30,28000 Chevrolet,Volt,2014,electric,4,sedan,casual,0,180,35-40,35000 Toyota,Prius,2013,hybrid,4,sedan,casual,4,134,51-48,32000 BMW,M3,2010,gas,2,coupe,sport,6,300,18-28,41000 BMW,X5,2005,gas,4,suv,sport,6,265,19-26,55000 Toyota,Camry,2009,gas,4,sedan,casual,6,178,25-35,26000 Toyota,Camry,2014,hybrid,4,sedan,sport,6,200,43-39,30000 Honda,Civic,2013,gas,2,coupe,sport,4,140,28-36,18000 Nissan,Leaf,2014,electric,4,sedan,casual,0,107,129-102,29000 Audi,Q7,2013,gas,4,suv,sport,6,333,16-22,60000 Audi,A7 TDI,2014,diesel,4,sedan,sport,6,240,30-25,58000 Mercedes,CLA,2013,gas,4,sedan,casual,4,208,22-28,29000 Fisker,Karma,2014,electric,2,coupe,sport,0,260,85-90,100000
Now let's take a look at another example dataset, which has a list of motorcycle names and their properties as defined in the schema. The following is the second text file, which includes motorcycles-specific data:
motorcycles.txt
make
, model
, year
, fuelType
, wheels
, body
, style
, cc_rpm
, highSpeed
, auto
, price
Here is the data in the motorcycles.txt
file:
Honda,CBR600,1990,gas,2,casual,sport,599,165,false,12000 BMW,R1200RT,1990,gas,2,casual,sport,1170,135,false,20000 Honda,CB900,1995,gas,2,casual,sport,919,135,false,10000 Honda,VFR400,1998,gas,2,casual,sport,399,130,false,8000 KTM,Super Duke,2005,gas,2,casual,sport,999,151,false,25000 Triumph,Speed Triple,2001,gas,2,deluxe,sport,1050,150,false,23000 Suzuki,RGV250,2000,gas,2,deluxe,sport,249,127,false,7500 Triumph,Daytona,gas,2005,2,deluxe,sport,675,156,false,16000 Triumph,Street Triple,2010,gas,2,casual,sport,123,141,true,12000 Ducati,1098,2010,gas,2,deluxe,sport,1099,180,false,30000 Harley,Tri Glide,2012,gas,3,deluxe,luxury,1600,180,false,35000 Harley,Iron,2012,gas,casual,luxury,1500,200,false,14000 Bramo,Icon,2012,electric,2,casual,sport,4500,80,false,20000 Zero,Police,2013,electric,2,casual,sport,4300,95,false,25000 Can-am,spider,2014,gas,3,deluxe,luxury,998,120,false,22000
You cannot use a dash (-
) with the schema name; instead use an underscore. In the preceding example's dataset schema, you can see some of the schema definition are using underscores while others are not.
In your data, you must not have spaces between a separator; otherwise, numeric values cannot be processed correctly or you will get NULL
instead.
3.145.35.194