DataFrames

I have already mentioned that a DataFrame is based on a columnar format. Temporary tables can be created from it, but I will expand on this in the next section. The data frame offers many methods for data manipulation and processing. I have based the Scala code used here on the code in the last section, so I will just show you the working lines and the output. It is possible to display a data frame schema as shown here:

    adultDataFrame.printSchema()

root
 |-- age: string (nullable = true)
 |-- workclass: string (nullable = true)
 |-- fnlwgt: string (nullable = true)
 |-- education: string (nullable = true)
 |-- educational-num: string (nullable = true)
 |-- marital-status: string (nullable = true)
 |-- occupation: string (nullable = true)
 |-- relationship: string (nullable = true)
 |-- race: string (nullable = true)
 |-- gender: string (nullable = true)
 |-- capital-gain: string (nullable = true)
 |-- capital-loss: string (nullable = true)
 |-- hours-per-week: string (nullable = true)
 |-- native-country: string (nullable = true)
 |-- income: string (nullable = true)
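
Note that the schema reports every column as a string, including numeric ones such as age. The following is a minimal sketch of how a column could be converted with cast. It assumes Spark 2.x or later (where SparkSession is available) and uses three rows from this dataset as a hypothetical stand-in for adultDataFrame; the names sample and typed are my own, not part of the original code:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.IntegerType

// Assumption: a local SparkSession (Spark 2.x+); in spark-shell the existing session is reused
val spark = SparkSession.builder().master("local[*]").appName("cast-sketch").getOrCreate()
import spark.implicits._

// Hypothetical stand-in for adultDataFrame: all columns are strings, as in the schema above
val sample = Seq(
  ("25", "Private", "<=50K"),
  ("38", "Private", "<=50K"),
  ("28", "Local-gov", ">50K")
).toDF("age", "workclass", "income")

// Replace the string age column with an integer version of itself
val typed = sample.withColumn("age", col("age").cast(IntegerType))

typed.printSchema() // age should now be reported as integer rather than string
```

With a typed column in place, numeric comparisons and aggregations no longer depend on implicit string conversion.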

It is possible to use the select method to choose which columns are returned from the data. I have limited the number of rows shown here, but you get the idea:

    adultDataFrame.select("workclass", "age", "education", "income").show()

workclass         age education     income
 Private          25   11th          <=50K
 Private          38   HS-grad       <=50K
 Local-gov        28   Assoc-acdm    >50K
 Private          44   Some-college  >50K
 none             18   Some-college  <=50K
 Private          34   10th          <=50K
 none             29   HS-grad       <=50K
 Self-emp-not-inc 63   Prof-school   >50K
 Private          24   Some-college  <=50K
 Private          55   7th-8th       <=50K

It is possible to restrict the rows returned from the DataFrame using the filter method. Here, I have added the occupation column to the output, and kept only the rows where the worker's age is greater than 30:

    adultDataFrame
      .select("workclass","age","education","occupation","income")
      .filter( adultDataFrame("age") > 30 )
      .show()

workclass         age education     occupation         income
 Private          38   HS-grad       Farming-fishing    <=50K
 Private          44   Some-college  Machine-op-inspct  >50K
 Private          34   10th          Other-service      <=50K
 Self-emp-not-inc 63   Prof-school   Prof-specialty     >50K
 Private          55   7th-8th       Craft-repair       <=50K
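
Because age is stored as a string, the comparison with 30 above relies on Spark converting the values implicitly. The following is a minimal sketch that makes the conversion explicit. It assumes Spark 2.x or later and uses four rows from the earlier select output as a hypothetical stand-in for adultDataFrame; sample and over30 are illustrative names:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

// Assumption: a local SparkSession (Spark 2.x+); in spark-shell the existing session is reused
val spark = SparkSession.builder().master("local[*]").appName("filter-sketch").getOrCreate()
import spark.implicits._

// Hypothetical stand-in for adultDataFrame, with age held as a string
val sample = Seq(
  ("Private", "25", "11th", "<=50K"),
  ("Private", "38", "HS-grad", "<=50K"),
  ("Local-gov", "28", "Assoc-acdm", ">50K"),
  ("Private", "44", "Some-college", ">50K")
).toDF("workclass", "age", "education", "income")

// Cast age to an integer before comparing, instead of relying on implicit conversion
val over30 = sample
  .select("workclass", "age", "education", "income")
  .filter(col("age").cast("int") > 30)

over30.show() // only the rows with age 38 and 44 remain
```

Applying filter before select, rather than after it, would also allow the condition to use columns that are not part of the final output.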

There is also a groupBy method which, combined with count, gives the number of rows in each group. As this is an income-based dataset, I think that the volumes within the wage brackets would be interesting. I have also used a bigger dataset to give more meaningful results:

    adultDataFrame
      .groupBy("income")
      .count()
      .show()

income count
 <=50K 24720
 >50K  7841
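
The grouped counts can also be brought back to the driver for further use, rather than only printed. The following is a minimal sketch, assuming Spark 2.x or later; the ten income values are those from the ten-row select output shown earlier, and the names incomes, counts, and countsByBracket are my own:

```scala
import org.apache.spark.sql.SparkSession

// Assumption: a local SparkSession (Spark 2.x+); in spark-shell the existing session is reused
val spark = SparkSession.builder().master("local[*]").appName("groupby-sketch").getOrCreate()
import spark.implicits._

// Income column of the ten rows displayed by the earlier select example
val incomes = Seq(
  "<=50K", "<=50K", ">50K", ">50K", "<=50K",
  "<=50K", "<=50K", ">50K", "<=50K", "<=50K"
).toDF("income")

// groupBy followed by count adds a column named "count" of type long
val counts = incomes.groupBy("income").count()

// Collect the small result into an ordinary Scala Map on the driver
val countsByBracket: Map[String, Long] =
  counts.collect().map(row => row.getString(0) -> row.getLong(1)).toMap

println(countsByBracket)
```

Collecting is only sensible here because the grouped result is tiny; the full data set should stay distributed.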

This is interesting, but what if I want to compare income brackets with occupation, and sort the results for a better understanding? The following example shows how this can be done, and gives the resulting counts. Among the occupations listed, Exec-managerial has the largest number of >50K earners (1,968), just ahead of Prof-specialty (1,859). This example also sorts the output by the occupation column; show displays only the first 20 rows by default, so the list stops at Prof-specialty:

    adultDataFrame
      .groupBy("income","occupation")
      .count()
      .sort("occupation")
      .show()

income occupation         count
 >50K   Adm-clerical      507
 <=50K  Adm-clerical      3263
 <=50K  Armed-Forces      8
 >50K   Armed-Forces      1
 <=50K  Craft-repair      3170
 >50K   Craft-repair      929
 <=50K  Exec-managerial   2098
 >50K   Exec-managerial   1968
 <=50K  Farming-fishing   879
 >50K   Farming-fishing   115
 <=50K  Handlers-cleaners 1284
 >50K   Handlers-cleaners 86
 >50K   Machine-op-inspct 250
 <=50K  Machine-op-inspct 1752
 >50K   Other-service     137
 <=50K  Other-service     3158
 >50K   Priv-house-serv   1
 <=50K  Priv-house-serv   148
 >50K   Prof-specialty    1859
 <=50K  Prof-specialty    2281
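
Sorting by the count column, rather than by occupation, makes the largest groups easier to spot. The following is a minimal sketch, assuming Spark 2.x or later; it starts from a handful of the grouped counts printed above instead of the full data set, and the names grouped and topEarners are illustrative:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

// Assumption: a local SparkSession (Spark 2.x+); in spark-shell the existing session is reused
val spark = SparkSession.builder().master("local[*]").appName("sort-sketch").getOrCreate()
import spark.implicits._

// A few of the grouped rows from the output above, standing in for the groupBy result
val grouped = Seq(
  (">50K", "Adm-clerical", 507L),
  (">50K", "Craft-repair", 929L),
  (">50K", "Exec-managerial", 1968L),
  (">50K", "Prof-specialty", 1859L),
  ("<=50K", "Exec-managerial", 2098L),
  ("<=50K", "Prof-specialty", 2281L)
).toDF("income", "occupation", "count")

// Keep the higher income bracket and order by count, largest first
val topEarners = grouped
  .filter(col("income") === ">50K")
  .sort(col("count").desc)

topEarners.show()
```

On the real data, the same filter and sort calls could be chained directly after count() in the example above.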

So, SQL-like actions can be carried out against DataFrames, including select, filter, sort, group by, and print. The next section shows how tables can be created from the DataFrames, and how the SQL-based actions are carried out against them.
