Count

Count is the most basic aggregate function: it counts the rows in which the specified column is non-null. Its extension, countDistinct, counts only the distinct values, so duplicates are eliminated.
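To make the difference concrete, here is a minimal sketch that runs count and countDistinct over a tiny DataFrame. The object name CountNullsSketch, the helper stateCounts, and the sample rows are hypothetical and are not part of the book's statesPopulationDF dataset; the sketch assumes a local Spark installation on the classpath.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{count, countDistinct, lit}

object CountNullsSketch {
  // Returns (all rows, non-null State values, distinct non-null State values).
  def stateCounts(df: DataFrame): (Long, Long, Long) = {
    val row = df.select(
      count(lit(1)).as("rows"),                    // counts every row
      count("State").as("non_null_states"),        // skips rows where State is null
      countDistinct("State").as("distinct_states") // skips nulls and duplicates
    ).head()
    (row.getLong(0), row.getLong(1), row.getLong(2))
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("count-nulls-sketch")
      .getOrCreate()
    import spark.implicits._

    // Hypothetical sample: one duplicated state and one missing state.
    val df = Seq[(Option[String], Int)](
      (Some("Alabama"), 2010),
      (Some("Alabama"), 2011),
      (Some("Alaska"), 2010),
      (None, 2010)
    ).toDF("State", "Year")

    println(stateCounts(df)) // (4,3,2)
    spark.stop()
  }
}
```

With this sample, the three numbers differ: four rows in total, three with a State value, and two distinct states. On a column with no nulls, count on the column and a count of all rows give the same result.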

The count API has several overloads, listed here. Which one to use depends on the use case:

def count(columnName: String): TypedColumn[Any, Long]
Aggregate function: returns the number of items in a group.

def count(e: Column): Column
Aggregate function: returns the number of items in a group.

def countDistinct(columnName: String, columnNames: String*): Column
Aggregate function: returns the number of distinct items in a group.

def countDistinct(expr: Column, exprs: Column*): Column
Aggregate function: returns the number of distinct items in a group.
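The four signatures above can be exercised side by side. The following is a rough sketch rather than the book's own example: the object name CountOverloadsSketch, the helper aggregate, and the sample rows are hypothetical, and a local Spark installation is assumed. It shows that the String overloads take column names, while the Column overloads also accept expressions such as upper(col("State")).

```scala
import org.apache.spark.sql.{DataFrame, Row, SparkSession}
import org.apache.spark.sql.functions.{col, count, countDistinct, upper}

object CountOverloadsSketch {
  // Applies each overload once and returns the single result row.
  def aggregate(df: DataFrame): Row =
    df.agg(
      count("State"),                    // count(columnName: String)
      count(col("Year")),                // count(e: Column)
      countDistinct("State", "Year"),    // distinct (State, Year) combinations
      countDistinct(upper(col("State"))) // distinct values of an expression
    ).head()

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("count-overloads-sketch")
      .getOrCreate()
    import spark.implicits._

    // Hypothetical sample: a repeated row and a state name in two spellings.
    val df = Seq(
      ("Alabama", 2010),
      ("alabama", 2011),
      ("Alaska", 2010),
      ("Alaska", 2010)
    ).toDF("State", "Year")

    println(aggregate(df)) // [4,4,3,2]
    spark.stop()
  }
}
```

Passing several names to countDistinct counts distinct combinations of those columns, which is why the repeated ("Alaska", 2010) row is counted once. The Column overload is the one to reach for when the value to count has to be computed first.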

Let's look at examples of invoking count and countDistinct on the DataFrame to print the row counts:

scala> import org.apache.spark.sql.functions._
scala> statesPopulationDF.select(col("*")).agg(count("State")).show

scala> statesPopulationDF.select(count("State")).show
+------------+
|count(State)|
+------------+
|         350|
+------------+

scala> statesPopulationDF.select(col("*")).agg(countDistinct("State")).show
scala> statesPopulationDF.select(countDistinct("State")).show
+---------------------+
|count(DISTINCT State)|
+---------------------+
|                   50|
+---------------------+