approx_count_distinct

An approximate distinct count is much faster than an exact one, because counting distinct records exactly usually requires expensive shuffles and other wide operations across the cluster. While the approximate count is not 100% accurate, many use cases perform equally well without an exact count.
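The speed comes from keeping a small, fixed-size sketch instead of the full set of values. Spark's implementation is based on the HyperLogLog++ algorithm; as a rough illustration of the underlying idea (plain Scala in the spirit of the HyperLogLog family, not Spark's actual code), each bucket only remembers the longest run of leading zero bits seen in the hashes routed to it, so duplicates never grow the state:

```scala
import scala.util.hashing.MurmurHash3

// A toy HyperLogLog-style estimator, for illustration only (Spark's real
// implementation is HyperLogLog++). Each of m buckets remembers only the
// longest run of leading zero bits it has seen, so memory stays fixed no
// matter how many rows flow through, and duplicates change nothing.
object DistinctSketch {
  private val m = 1024 // buckets; relative error shrinks roughly as 1/sqrt(m)

  def estimate(items: Iterator[String]): Long = {
    val maxRank = new Array[Int](m)
    for (item <- items) {
      val h = MurmurHash3.stringHash(item)  // 32-bit hash of the value
      val bucket = h & (m - 1)              // low 10 bits pick a bucket
      val w = h >>> 10                      // remaining 22 bits
      // rank = position of the first 1-bit in the 22-bit window (1-based)
      val rank = if (w == 0) 23 else Integer.numberOfLeadingZeros(w) - 9
      if (rank > maxRank(bucket)) maxRank(bucket) = rank
    }
    // Harmonic-mean estimate with the usual bias-correction constant.
    val alpha = 0.7213 / (1 + 1.079 / m)
    val z = maxRank.map(r => math.pow(2.0, -r)).sum
    var e = alpha * m * m / z
    val empty = maxRank.count(_ == 0)
    if (e <= 2.5 * m && empty > 0)          // small-range (linear counting) correction
      e = m * math.log(m.toDouble / empty)
    math.round(e)
  }
}
```

Feeding this sketch many rows with repeated values typically yields an estimate within a few percent of the true distinct count, while holding only `m` integers of state.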

The approx_count_distinct API has several overloads, shown as follows; which one you use depends on the specific use case. The rsd parameter is the maximum estimation error allowed (the relative standard deviation); when omitted, it defaults to 0.05 (5%).

def approx_count_distinct(columnName: String, rsd: Double): Column
Aggregate function: returns the approximate number of distinct items in a group.

def approx_count_distinct(e: Column, rsd: Double): Column
Aggregate function: returns the approximate number of distinct items in a group.

def approx_count_distinct(columnName: String): Column
Aggregate function: returns the approximate number of distinct items in a group.

def approx_count_distinct(e: Column): Column
Aggregate function: returns the approximate number of distinct items in a group.
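Since Spark builds a HyperLogLog++ sketch under the hood, the rsd argument effectively controls how large a sketch it keeps. A rough, illustrative sizing based on the standard HyperLogLog error relation (relative error ≈ 1.04/√m for m buckets; Spark's internal sizing logic may differ in detail):

```scala
// Illustrative sizing only: the standard HyperLogLog relation says the
// relative standard deviation of the estimate is about 1.04 / sqrt(m),
// where m is the number of sketch buckets. Solving for m shows the bucket
// count a given rsd implies. (Not Spark's exact internal sizing.)
def bucketsFor(rsd: Double): Int = {
  val m = math.ceil(math.pow(1.04 / rsd, 2)).toInt
  Integer.highestOneBit(m * 2 - 1) // round up to a power of two
}

// A looser rsd means a much smaller sketch:
// bucketsFor(0.05) -> 512 buckets, bucketsFor(0.2) -> 32 buckets
```

This is why passing a larger rsd (as in the 0.2 example further down) trades accuracy for memory and speed.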

Let's look at an example of invoking approx_count_distinct on the DataFrame to print the approximate number of distinct states:

import org.apache.spark.sql.functions._
scala> statesPopulationDF.select(col("*")).agg(approx_count_distinct("State")).show

+----------------------------+
|approx_count_distinct(State)|
+----------------------------+
| 48|
+----------------------------+

scala> statesPopulationDF.select(approx_count_distinct("State", 0.2)).show
+----------------------------+
|approx_count_distinct(State)|
+----------------------------+
| 49|
+----------------------------+
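Another reason approx_count_distinct avoids a wide shuffle is that the sketches are mergeable: each partition builds its own bucket array over its local rows, and combining two partial results is just an element-wise max, so only fixed-size arrays cross the network instead of the raw values. A minimal sketch of the merge step in plain Scala (illustrative, not Spark internals):

```scala
// Merging two partial sketches (bucket arrays of max leading-zero ranks)
// is an element-wise max; the result is the same sketch you would have
// built from the union of the two partitions' rows.
def mergeSketches(a: Array[Int], b: Array[Int]): Array[Int] =
  a.zip(b).map { case (x, y) => math.max(x, y) }
```

Because the merge is associative and commutative, partial sketches can be combined in any order during aggregation.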