I have already mentioned that a DataFrame is based on a columnar format. Temporary tables can be created from it, but I will expand on this in the next section. The DataFrame offers many methods for data manipulation and processing. I have based the Scala code used here on the code in the last section, so I will just show you the working lines and the output. A DataFrame schema can be displayed as shown here:
adultDataFrame.printSchema()

root
 |-- age: string (nullable = true)
 |-- workclass: string (nullable = true)
 |-- fnlwgt: string (nullable = true)
 |-- education: string (nullable = true)
 |-- educational-num: string (nullable = true)
 |-- marital-status: string (nullable = true)
 |-- occupation: string (nullable = true)
 |-- relationship: string (nullable = true)
 |-- race: string (nullable = true)
 |-- gender: string (nullable = true)
 |-- capital-gain: string (nullable = true)
 |-- capital-loss: string (nullable = true)
 |-- hours-per-week: string (nullable = true)
 |-- native-country: string (nullable = true)
 |-- income: string (nullable = true)
The select method returns a subset of the columns in the data. I have limited the number of rows in the output here, but you get the idea:
adultDataFrame.select("workclass","age","education","income").show()

workclass         age  education     income
Private           25   11th          <=50K
Private           38   HS-grad       <=50K
Local-gov         28   Assoc-acdm    >50K
Private           44   Some-college  >50K
none              18   Some-college  <=50K
Private           34   10th          <=50K
none              29   HS-grad       <=50K
Self-emp-not-inc  63   Prof-school   >50K
Private           24   Some-college  <=50K
Private           55   7th-8th       <=50K
The filter method restricts the rows returned from the DataFrame. Here, I have added the occupation column to the output and filtered on the worker's age:
adultDataFrame
  .select("workclass","age","education","occupation","income")
  .filter( adultDataFrame("age") > 30 )
  .show()

workclass         age  education     occupation         income
Private           38   HS-grad       Farming-fishing    <=50K
Private           44   Some-college  Machine-op-inspct  >50K
Private           34   10th          Other-service      <=50K
Self-emp-not-inc  63   Prof-school   Prof-specialty     >50K
Private           55   7th-8th       Craft-repair       <=50K
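The schema printed earlier reports every column, including age, as a string, so the comparison in the filter presumably relies on Spark converting the values for me. A minimal sketch of making that conversion explicit is shown here. It assumes Spark 2.x or later with a SparkSession, which may differ from the context created in the last section. The object name AgeFilterSketch, the helper olderThan, and the small sample built from rows of the select output are illustrative only and are not part of the original code:

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col

object AgeFilterSketch {

  // Hypothetical helper: cast the string age column to an integer,
  // then keep only the rows whose age is above the given limit.
  def olderThan(df: DataFrame, limit: Int): DataFrame =
    df.withColumn("age", col("age").cast("int"))
      .filter(col("age") > limit)

  def main(args: Array[String]): Unit = {
    // Local session for illustration; an assumption, not the original setup.
    val spark = SparkSession.builder()
      .appName("AgeFilterSketch")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Rows copied from the select output, with age left as a string
    // to mirror the printed schema.
    val sample = Seq(
      ("Private", "25", "11th", "<=50K"),
      ("Private", "38", "HS-grad", "<=50K"),
      ("Local-gov", "28", "Assoc-acdm", ">50K"),
      ("Private", "44", "Some-college", ">50K")
    ).toDF("workclass", "age", "education", "income")

    olderThan(sample, 30).show()
    spark.stop()
  }
}
```

Casting first also means that later numeric operations on age, such as averages or sorting, work on integers rather than on text.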
There is also a groupBy method for determining volume counts within a dataset. As this is an income-based dataset, I think that volumes within the wage brackets would be interesting. I have also used a bigger dataset to give more meaningful results:
adultDataFrame
  .groupBy("income")
  .count()
  .show()

income  count
<=50K   24720
>50K    7841
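These counts are easier to compare as proportions: 7841 of the 32561 rows, roughly 24 percent, fall into the >50K bracket. The following sketch adds such a share column to the grouped output. It again assumes Spark 2.x or later; the object IncomeShareSketch, the helper withShare, and the column name share are my own illustrative names, and the input simply repeats the two counts shown above:

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, sum}

object IncomeShareSketch {

  // Hypothetical helper: divide each group count by the overall total.
  // Expects a DataFrame with a numeric (long) "count" column.
  def withShare(counts: DataFrame): DataFrame = {
    val total = counts.agg(sum("count")).first().getLong(0)
    counts.withColumn("share", col("count") / total)
  }

  def main(args: Array[String]): Unit = {
    // Local session for illustration; an assumption, not the original setup.
    val spark = SparkSession.builder()
      .appName("IncomeShareSketch")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // The two bracket counts from the groupBy output.
    val counts = Seq(("<=50K", 24720L), (">50K", 7841L)).toDF("income", "count")

    withShare(counts).show()
    spark.stop()
  }
}
```

On the real data, the same helper could be applied to the result of adultDataFrame.groupBy("income").count().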
This is interesting, but what if I want to compare income brackets with occupation, and sort the results for a better understanding? The following example shows how this can be done, and gives the example data volumes. It shows that there is a high volume of managerial roles in the >50K bracket compared to the other occupations listed. This example also sorts the output by the occupation column:
adultDataFrame
  .groupBy("income","occupation")
  .count()
  .sort("occupation")
  .show()

income  occupation         count
>50K    Adm-clerical       507
<=50K   Adm-clerical       3263
<=50K   Armed-Forces       8
>50K    Armed-Forces       1
<=50K   Craft-repair       3170
>50K    Craft-repair       929
<=50K   Exec-managerial    2098
>50K    Exec-managerial    1968
<=50K   Farming-fishing    879
>50K    Farming-fishing    115
<=50K   Handlers-cleaners  1284
>50K    Handlers-cleaners  86
>50K    Machine-op-inspct  250
<=50K   Machine-op-inspct  1752
>50K    Other-service      137
<=50K   Other-service      3158
>50K    Priv-house-serv    1
<=50K   Priv-house-serv    148
>50K    Prof-specialty     1859
<=50K   Prof-specialty     2281
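Sorting by occupation places the two brackets side by side, but sorting by the count column makes the largest groups easier to spot. The sketch below restricts the grouped counts to one bracket and orders them largest first. As before, it assumes Spark 2.x or later; OccupationRankSketch and largestFirst are hypothetical names, and the input is a handful of rows copied from the output above rather than the full dataset:

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, desc}

object OccupationRankSketch {

  // Hypothetical helper: keep one income bracket and order its
  // occupation counts from largest to smallest.
  def largestFirst(counts: DataFrame, bracket: String): DataFrame =
    counts.filter(col("income") === bracket)
      .orderBy(desc("count"))

  def main(args: Array[String]): Unit = {
    // Local session for illustration; an assumption, not the original setup.
    val spark = SparkSession.builder()
      .appName("OccupationRankSketch")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // A few grouped rows taken from the output above.
    val counts = Seq(
      (">50K", "Adm-clerical", 507L),
      ("<=50K", "Adm-clerical", 3263L),
      (">50K", "Craft-repair", 929L),
      (">50K", "Exec-managerial", 1968L),
      (">50K", "Prof-specialty", 1859L)
    ).toDF("income", "occupation", "count")

    largestFirst(counts, ">50K").show()
    spark.stop()
  }
}
```

On the real data, the helper could take the result of adultDataFrame.groupBy("income","occupation").count() as its input.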
So, SQL-like actions can be carried out against DataFrames, including select, filter, sort, group by, and print. The next section shows how tables can be created from the DataFrames, and how the SQL-based actions are carried out against them.