Window functions

Window functions allow you to perform aggregations over a window of data rather than entire data or some filtered data. The use cases of such window functions are:

  • Cumulative sum
  • Delta from previous value for same key
  • Weighted moving average

The best way to understand window functions is to imagine a sliding window over the larger dataset universe. You can specify a window looking at three rows T-1, T, and T+1, and by performing a simple calculation. You can also specify a window of latest/most recent ten values:

The API for the window specification requires three properties, the partitionBy(), orderBy(), and the rowsBetween(). The partitionBy chunks the data into the partitions/groups as specified by partitionBy(). orderBy() is used to order the data within each partition of data.

The rowsBetween() specifies the window frame or the span of the sliding window to perform the calculations.

To try out the windows function, there are certain packages that are needed. You can import the necessary packages using import directives, as follows:

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.functions.max

Now, you are ready to write some code to learn about the window functions. Let's create a window specification for the partitions sorted by Population and partitioned by State. Also, specify that we want to consider all rows until the current row as part of the Window.

 val windowSpec = Window
.partitionBy("State")
.orderBy(col("Population").desc)
.rowsBetween(Window.unboundedPreceding, Window.currentRow)

Compute the rank over the window specification. The result will be a rank (row number) added to each row, as long as it falls within the Window specified. In this example, we chose to partition by State and then order the rows of each State further by descending order. Hence, all State rows have their own rank numbers assigned.

import org.apache.spark.sql.functions._
scala> statesPopulationDF.select(col("State"), col("Year"), max("Population").over(windowSpec), rank().over(windowSpec)).sort("State", "Year").show(10)

+-------+----+------------------------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------+
| State|Year|max(Population) OVER (PARTITION BY State ORDER BY Population DESC NULLS LAST ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)|RANK() OVER (PARTITION BY State ORDER BY Population DESC NULLS LAST ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)|
+-------+----+------------------------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------+
|Alabama|2010| 4863300| 6|
|Alabama|2011| 4863300| 7|
|Alabama|2012| 4863300| 5|
|Alabama|2013| 4863300| 4|
|Alabama|2014| 4863300| 3|
..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset
3.141.201.26