Search in book...
Toggle Font Controls
Create new playlist

Name your new playlist

Playlist description (optional)
Sign In

Email address

Password

Forgot Password?

or

Continue with Facebook

Continue with Google
Sign Up

Full Name

Email address

Confirm Email Address

Password

or

Continue with Facebook

Continue with Google

Chapter 4. Impala Walkthrough with an Example

In this chapter, we will go over a use case to see Impala concepts in action. This way you can experience a real-world scenario using Impala, and understand how and where to use Impala statements in real-world applications. In this chapter, I will be using a scenario as described in the following sections.

Creating an example scenario

We are going to deal with information related to automobiles. We have two data files that contain information about automobiles and motorcycles in two separate text files. The following conceptual image shows that within the Autos database, there are two tables named Motorcycles and Automobiles.

So far, it is imprinted on your mind that Impala is running on DataNode, and the files in our project are stored on HDFS. First we will load these files from HDFS to Impala and then we will use SQL statements to process this information through multiple queries.

Example dataset one – automobiles (automobiles.txt)

Let's take a look at this example dataset, which has a list of automobile names and their properties as defined in the schema. The following is the first text file, which has automobile-specific data:

File: automobiles.txt

Schema: make, model, year, fuel-type, numOfDoors, design, type, cylinders, horsepower, city_hwy_mpg, price

Here is the data in the automobiles.txt file:

Audi,A4,2011,gas,4,sedan,casual,6,476,22-30,45000
Jeep,Compass,2007,gas,3,suv,sport,6,170,24-32,22000
Dodge,Challenger,2013,gas,4,coupe,casual,6,210,20-30,28000
Chevrolet,Volt,2014,electric,4,sedan,casual,0,180,35-40,35000
Toyota,Prius,2013,hybrid,4,sedan,casual,4,134,51-48,32000
BMW,M3,2010,gas,2,coupe,sport,6,300,18-28,41000
BMW,X5,2005,gas,4,suv,sport,6,265,19-26,55000
Toyota,Camry,2009,gas,4,sedan,casual,6,178,25-35,26000
Toyota,Camry,2014,hybrid,4,sedan,sport,6,200,43-39,30000
Honda,Civic,2013,gas,2,coupe,sport,4,140,28-36,18000
Nissan,Leaf,2014,electric,4,sedan,casual,0,107,129-102,29000
Audi,Q7,2013,gas,4,suv,sport,6,333,16-22,60000
Audi,A7 TDI,2014,diesel,4,sedan,sport,6,240,30-25,58000
Mercedes,CLA,2013,gas,4,sedan,casual,4,208,22-28,29000
Fisker,Karma,2014,electric,2,coupe,sport,0,260,85-90,100000

Example dataset two – motorcycles (motorcycles.txt)

Now let's take a look at another example dataset, which has a list of motorcycle names and their properties as defined in the schema. The following is the second text file, which includes motorcycles-specific data:

File: motorcycles.txt

Schema: make, model, year, fuelType, wheels, body, style, cc_rpm, highSpeed, auto, price

Here is the data in the motorcycles.txt file:

Honda,CBR600,1990,gas,2,casual,sport,599,165,false,12000
BMW,R1200RT,1990,gas,2,casual,sport,1170,135,false,20000
Honda,CB900,1995,gas,2,casual,sport,919,135,false,10000
Honda,VFR400,1998,gas,2,casual,sport,399,130,false,8000
KTM,Super Duke,2005,gas,2,casual,sport,999,151,false,25000
Triumph,Speed Triple,2001,gas,2,deluxe,sport,1050,150,false,23000
Suzuki,RGV250,2000,gas,2,deluxe,sport,249,127,false,7500
Triumph,Daytona,gas,2005,2,deluxe,sport,675,156,false,16000
Triumph,Street Triple,2010,gas,2,casual,sport,123,141,true,12000
Ducati,1098,2010,gas,2,deluxe,sport,1099,180,false,30000
Harley,Tri Glide,2012,gas,3,deluxe,luxury,1600,180,false,35000
Harley,Iron,2012,gas,casual,luxury,1500,200,false,14000
Bramo,Icon,2012,electric,2,casual,sport,4500,80,false,20000
Zero,Police,2013,electric,2,casual,sport,4300,95,false,25000
Can-am,spider,2014,gas,3,deluxe,luxury,998,120,false,22000

Data and schema considerations

You cannot use a dash (-) with the schema name; instead use an underscore. In the preceding example's dataset schema, you can see some of the schema definition are using underscores while others are not.

In your data, you must not have spaces between a separator; otherwise, numeric values cannot be processed correctly or you will get NULL instead.

Note

Before we go further, I would like to make sure it is understood that the preceding data is created by me, specifically for reference purpose to demonstrate Impala query processing. This data is completely fabricated and does not reflect any correct information.

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.

Table of Contents for 4. Impala Walkthrough with an Example

Create new playlist

Sign In