groupBy

A common task in data analysis is to group data into categories and then perform calculations on the resulting groups.

A quick way to understand grouping is to imagine being asked to assess, at short notice, what supplies your office needs. You could look around, group the items you see into types, such as pens, paper, and staplers, and then analyze what you have against what you need.

Let's run the groupBy function on the DataFrame to print the aggregate count for each State:

scala> statesPopulationDF.groupBy("State").count.show(5)
+---------+-----+
| State|count|
+---------+-----+
| Utah| 7|
| Hawaii| 7|
|Minnesota| 7|
| Ohio| 7|
| Arkansas| 7|
+---------+-----+
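
Note that groupBy by itself returns a grouped dataset (RelationalGroupedDataset in recent Spark versions) rather than a plain DataFrame; an aggregate such as count turns it back into a regular DataFrame, so you can keep chaining transformations. As a minimal sketch (the descending orderBy here is only illustrative, not part of the original example):

scala> statesPopulationDF.groupBy("State").count.orderBy($"count".desc).show(5)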

You can also use groupBy and then apply any of the aggregate functions seen previously, such as min, max, avg, stddev, and so on:

scala> import org.apache.spark.sql.functions._
scala> statesPopulationDF.groupBy("State").agg(min("Population"), avg("Population")).show(5)

+---------+---------------+--------------------+
| State|min(Population)| avg(Population)|
+---------+---------------+--------------------+
| Utah| 2775326| 2904797.1428571427|
| Hawaii| 1363945| 1401453.2857142857|
|Minnesota| 5311147| 5416287.285714285|
| Ohio| 11540983|1.1574362714285715E7|
| Arkansas| 2921995| 2957692.714285714|
+---------+---------------+--------------------+
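
The same pattern works with the other aggregates mentioned above. As a sketch, the following computes max and stddev per State; the alias calls are an assumption added here only to give the output columns readable headers:

scala> statesPopulationDF.groupBy("State").agg(max("Population").alias("max_pop"), stddev("Population").alias("stddev_pop")).show(5)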
