Aggregation operations

We have seen how to apply an operation to every row in a DataFrame to create a new column, and we have seen how to use filters to build new DataFrames with a subset of rows from the original DataFrame. The last set of operations on DataFrames is grouping operations, equivalent to the GROUP BY statement in SQL. Let's calculate the average BMI for smokers and non-smokers. We must first tell Spark to group the DataFrame by a column (the isSmoker column, in this case), and then apply an aggregation operation (averaging, in this case) to reduce each group:

scala> val smokingDF = readingsWithBmiDF.groupBy("isSmoker").agg(avg("BMI"))
smokingDF: org.apache.spark.sql.DataFrame = [isSmoker: boolean, AVG(BMI): double]

This has created a new DataFrame with two columns: the grouping column and the column over which we aggregated. Let's show this DataFrame:

scala> smokingDF.show
+--------+------------------+
|isSmoker|          AVG(BMI)|
+--------+------------------+
|    true|23.733355491389517|
|   false|23.095078245456424|
+--------+------------------+
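The agg method is not limited to a single aggregation: it accepts several at once, producing one output column per group for each aggregation. The following is a minimal sketch (it assumes, as in the examples above, that the aggregation functions have been imported from org.apache.spark.sql.functions):

// A single agg call can compute several aggregations per group,
// giving one output column for each.
readingsWithBmiDF.groupBy("isSmoker").agg(
  avg("BMI"), min("BMI"), max("BMI"), count("BMI")
).show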

Besides averaging, several operators are available for aggregating each group. We outline some of the more important ones below; for a full list, consult the Aggregate functions section of http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.functions$:

avg(column): Group averages of the values in the specified column.

count(column): Number of elements in each group in the specified column.

countDistinct(column, ...): Number of distinct elements in each group. This can also accept multiple columns to return the count of unique elements across several columns.

first(column), last(column): First/last element in each group.

max(column), min(column): Largest/smallest element in each group.

sum(column): Sum of the values in each group.
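
As noted above, countDistinct accepts multiple columns. Here is a minimal sketch, reusing the heightCm and age columns that appear in this section's other examples:

// Distinct counts per group: first over a single column,
// then over (heightCm, age) pairs.
readingsDF.groupBy("isSmoker").agg(
  countDistinct("heightCm"),          // distinct heights in each group
  countDistinct("heightCm", "age")    // distinct (height, age) pairs
).show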

Each aggregation operator takes either the name of a column, as a string, or an expression of type Column. The latter allows aggregation of compound expressions. If we wanted the average height, in meters, of the smokers and non-smokers in our sample, we could use:

scala> readingsDF.groupBy("isSmoker").agg { 
  avg($"heightCm"/100.0) 
}.show
+--------+-----------------------+
|isSmoker|AVG((heightCm / 100.0))|
+--------+-----------------------+
|    true|                  1.715|
|   false|     1.6949999999999998|
+--------+-----------------------+

We can also use compound expressions to define the column on which to group. For instance, to count the number of patients in each age group, increasing by decade, we can use:

scala> readingsDF.groupBy(floor($"age"/10)).agg(count("*")).show
+-----------------+--------+
|FLOOR((age / 10))|count(1)|
+-----------------+--------+
|              4.0|       3|
|              2.0|       1|
+-----------------+--------+

We have used the shorthand "*" to count every row in each group.
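
Auto-generated column names such as AVG((heightCm / 100.0)) are unwieldy to refer to downstream. One way to tidy them, sketched below, is to alias the aggregated expression with the standard as method on Column (the name avgHeightM is our own choice):

// Alias the aggregated expression to get a readable column name.
readingsDF.groupBy("isSmoker").agg(
  avg($"heightCm" / 100.0).as("avgHeightM")
).show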
