groupBy

A common task in data analysis is to group data into categories and then perform calculations on the resulting groups.

A quick way to understand grouping is to imagine being asked to assess, at short notice, what supplies your office needs. You could look around, group the items you see into types, such as pens, paper, and staplers, and then analyze what you have against what you need.

Let's run the groupBy function on the DataFrame to print the aggregate count for each State:

scala> statesPopulationDF.groupBy("State").count.show(5)
+---------+-----+
| State|count|
+---------+-----+
| Utah| 7|
| Hawaii| 7|
|Minnesota| 7|
| Ohio| 7|
| Arkansas| 7|
+---------+-----+
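
Note that groupBy by itself returns a grouped dataset (RelationalGroupedDataset in recent Spark versions) rather than a plain DataFrame; an aggregate such as count turns it back into a regular DataFrame, so you can keep chaining transformations. As a minimal sketch (the descending orderBy here is only illustrative, not part of the original example):

scala> statesPopulationDF.groupBy("State").count.orderBy($"count".desc).show(5)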

You can also use groupBy and then apply any of the aggregate functions seen previously, such as min, max, avg, stddev, and so on:

scala> import org.apache.spark.sql.functions._
scala> statesPopulationDF.groupBy("State").agg(min("Population"), avg("Population")).show(5)

+---------+---------------+--------------------+
| State|min(Population)| avg(Population)|
+---------+---------------+--------------------+
| Utah| 2775326| 2904797.1428571427|
| Hawaii| 1363945| 1401453.2857142857|
|Minnesota| 5311147| 5416287.285714285|
| Ohio| 11540983|1.1574362714285715E7|
| Arkansas| 2921995| 2957692.714285714|
+---------+---------------+--------------------+
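
The same pattern works with the other aggregates mentioned above. As a sketch, the following computes max and stddev per State; the alias calls are an assumption added here only to give the output columns readable headers:

scala> statesPopulationDF.groupBy("State").agg(max("Population").alias("max_pop"), stddev("Population").alias("stddev_pop")).show(5)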
