ntiles

ntile is a popular window function that divides the ordered rows of each window partition into n buckets of nearly equal size and labels every row with its bucket number. For example, predictive analytics often uses deciles (10 buckets) to order the data and then divide it into 10 parts so that each part holds a fair share of the data. Because the bucket of a row depends on its position within an ordered partition, this is a natural fit for the window function approach, which makes ntile a good example of how window functions can help.

For example, to partition statesPopulationDF by State (using the window specification shown previously), order each partition by population, and then divide it into two portions, we can use ntile over the windowSpec:

import org.apache.spark.sql.functions._
scala> statesPopulationDF.select(col("State"), col("Year"), ntile(2).over(windowSpec), rank().over(windowSpec)).sort("State", "Year").show(10)

+-------+----+-----------------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------+
| State|Year|ntile(2) OVER (PARTITION BY State ORDER BY Population DESC NULLS LAST ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)|RANK() OVER (PARTITION BY State ORDER BY Population DESC NULLS LAST ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)|
+-------+----+-----------------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------+
|Alabama|2010| 2| 6|
|Alabama|2011| 2| 7|
|Alabama|2012| 2| 5|
|Alabama|2013| 1| 4|
|Alabama|2014| 1| 3|
|Alabama|2015| 1| 2|
|Alabama|2016| 1| 1|
| Alaska|2010| 2| 7|
| Alaska|2011| 2| 6|
| Alaska|2012| 2| 5|
+-------+----+-----------------------------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------

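As shown above, we used a window function together with ntile() to divide the rows of each State into two nearly equal portions. Alabama's seven rows, for instance, are split into a first bucket of four rows (ranks 1 to 4) and a second bucket of three rows (ranks 5 to 7).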
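
Since Alabama has seven rows, the two portions cannot be exactly the same size. The following is a minimal sketch in plain Scala of the bucket arithmetic that matches the output above. It is not Spark's implementation, and ntileBuckets is a hypothetical helper introduced only for illustration. It assumes that leftover rows are handed out one per bucket, starting with the first bucket:

```scala
// Hypothetical helper: returns the 1-based bucket number for each of
// `rowCount` ordered rows when they are split into `n` buckets.
def ntileBuckets(rowCount: Int, n: Int): Seq[Int] = {
  require(n > 0, "n must be positive")
  val base  = rowCount / n   // minimum number of rows per bucket
  val extra = rowCount % n   // the first `extra` buckets get one more row
  (1 to n).flatMap { bucket =>
    val size = if (bucket <= extra) base + 1 else base
    Seq.fill(size)(bucket)
  }
}

// Seven rows in two buckets: four rows in bucket 1, three in bucket 2.
println(ntileBuckets(7, 2).mkString(", "))  // prints: 1, 1, 1, 1, 2, 2, 2
```

Read against the rank column, the first four values correspond to ranks 1 to 4 and the last three to ranks 5 to 7, which is the pattern shown for Alabama.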


A popular use of this function is to compute the deciles used in data science models.
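
As a rough sketch of the decile case, the same pattern can be applied with ntile(10). The sample data, the decileSpec window, and the decile column name below are assumptions made for illustration and are not part of the original dataset. The session is created with getOrCreate(), so inside spark-shell it should simply reuse the existing session:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, ntile}

val spark = SparkSession.builder()
  .master("local[*]")
  .appName("decile-sketch")
  .getOrCreate()
import spark.implicits._

// Hypothetical sample data: 20 rows for one state with distinct populations.
val sampleDF = (1 to 20)
  .map(i => ("Alabama", 2000 + i, 4000000L + i * 10000L))
  .toDF("State", "Year", "Population")

// Order each state's rows by population, largest first.
val decileSpec = Window.partitionBy("State").orderBy(col("Population").desc)

// Label every row with its decile (1 = largest populations, 10 = smallest).
val withDecile = sampleDF.withColumn("decile", ntile(10).over(decileSpec))

withDecile.show(20)
```

With 20 rows in the partition, each of the 10 deciles holds exactly two rows. The decile column can then be used as a grouping key or as a feature in a model.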