Chapter 4. Impala Walkthrough with an Example

In this chapter, we will go over a use case to see Impala concepts in action. This way you can experience a real-world scenario using Impala, and understand how and where to use Impala statements in real-world applications. In this chapter, I will be using a scenario as described in the following sections.

Creating an example scenario

We are going to deal with information related to automobiles. We have two data files that contain information about automobiles and motorcycles in two separate text files. The following conceptual image shows that within the Autos database, there are two tables named Motorcycles and Automobiles.

Creating an example scenario

So far, it is imprinted on your mind that Impala is running on DataNode, and the files in our project are stored on HDFS. First we will load these files from HDFS to Impala and then we will use SQL statements to process this information through multiple queries.

Example dataset one – automobiles (automobiles.txt)

Let's take a look at this example dataset, which has a list of automobile names and their properties as defined in the schema. The following is the first text file, which has automobile-specific data:

  • File: automobiles.txt
  • Schema: make, model, year, fuel-type, numOfDoors, design, type, cylinders, horsepower, city_hwy_mpg, price

    Here is the data in the automobiles.txt file:

    Audi,A4,2011,gas,4,sedan,casual,6,476,22-30,45000
    Jeep,Compass,2007,gas,3,suv,sport,6,170,24-32,22000
    Dodge,Challenger,2013,gas,4,coupe,casual,6,210,20-30,28000
    Chevrolet,Volt,2014,electric,4,sedan,casual,0,180,35-40,35000
    Toyota,Prius,2013,hybrid,4,sedan,casual,4,134,51-48,32000
    BMW,M3,2010,gas,2,coupe,sport,6,300,18-28,41000
    BMW,X5,2005,gas,4,suv,sport,6,265,19-26,55000
    Toyota,Camry,2009,gas,4,sedan,casual,6,178,25-35,26000
    Toyota,Camry,2014,hybrid,4,sedan,sport,6,200,43-39,30000
    Honda,Civic,2013,gas,2,coupe,sport,4,140,28-36,18000
    Nissan,Leaf,2014,electric,4,sedan,casual,0,107,129-102,29000
    Audi,Q7,2013,gas,4,suv,sport,6,333,16-22,60000
    Audi,A7 TDI,2014,diesel,4,sedan,sport,6,240,30-25,58000
    Mercedes,CLA,2013,gas,4,sedan,casual,4,208,22-28,29000
    Fisker,Karma,2014,electric,2,coupe,sport,0,260,85-90,100000

Example dataset two – motorcycles (motorcycles.txt)

Now let's take a look at another example dataset, which has a list of motorcycle names and their properties as defined in the schema. The following is the second text file, which includes motorcycles-specific data:

  • File: motorcycles.txt
  • Schema: make, model, year, fuelType, wheels, body, style, cc_rpm, highSpeed, auto, price

    Here is the data in the motorcycles.txt file:

    Honda,CBR600,1990,gas,2,casual,sport,599,165,false,12000
    BMW,R1200RT,1990,gas,2,casual,sport,1170,135,false,20000
    Honda,CB900,1995,gas,2,casual,sport,919,135,false,10000
    Honda,VFR400,1998,gas,2,casual,sport,399,130,false,8000
    KTM,Super Duke,2005,gas,2,casual,sport,999,151,false,25000
    Triumph,Speed Triple,2001,gas,2,deluxe,sport,1050,150,false,23000
    Suzuki,RGV250,2000,gas,2,deluxe,sport,249,127,false,7500
    Triumph,Daytona,gas,2005,2,deluxe,sport,675,156,false,16000
    Triumph,Street Triple,2010,gas,2,casual,sport,123,141,true,12000
    Ducati,1098,2010,gas,2,deluxe,sport,1099,180,false,30000
    Harley,Tri Glide,2012,gas,3,deluxe,luxury,1600,180,false,35000
    Harley,Iron,2012,gas,casual,luxury,1500,200,false,14000
    Bramo,Icon,2012,electric,2,casual,sport,4500,80,false,20000
    Zero,Police,2013,electric,2,casual,sport,4300,95,false,25000
    Can-am,spider,2014,gas,3,deluxe,luxury,998,120,false,22000

Data and schema considerations

You cannot use a dash (-) with the schema name; instead use an underscore. In the preceding example's dataset schema, you can see some of the schema definition are using underscores while others are not.

In your data, you must not have spaces between a separator; otherwise, numeric values cannot be processed correctly or you will get NULL instead.

Note

Before we go further, I would like to make sure it is understood that the preceding data is created by me, specifically for reference purpose to demonstrate Impala query processing. This data is completely fabricated and does not reflect any correct information.

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset
3.145.35.194