Covariance

Covariance is a measure of the joint variability of two random variables. If the greater values of one variable mainly correspond with the greater values of the other variable, and the same holds for the lesser values, then the variables tend to show similar behavior and the covariance is positive. If the opposite is true, and the greater values of one variable mainly correspond with the lesser values of the other variable, then the covariance is negative.

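Spark provides covariance as two aggregate functions, covar_pop and covar_samp. Each is overloaded to take either column names or Column objects, as follows. Which one to use depends on whether the data represents the whole population or a sample of it.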

def covar_pop(columnName1: String, columnName2: String): Column
Aggregate function: returns the population covariance for two columns.

def covar_pop(column1: Column, column2: Column): Column
Aggregate function: returns the population covariance for two columns.

def covar_samp(columnName1: String, columnName2: String): Column
Aggregate function: returns the sample covariance for two columns.

def covar_samp(column1: Column, column2: Column): Column
Aggregate function: returns the sample covariance for two columns.
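
The population and sample variants differ only in the divisor. As a sketch in standard notation (the symbols n, x_i, y_i, and the means are introduced here for illustration and are not part of the Spark API), for n paired observations:

```latex
\[
\operatorname{cov}_{\text{pop}}(X, Y) = \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y}),
\qquad
\operatorname{cov}_{\text{samp}}(X, Y) = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})
\]
```

For the same data, the sample covariance is therefore slightly larger in magnitude than the population covariance, and the gap shrinks as n grows.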

Let's look at an example of invoking covar_pop on the statesPopulationDF DataFrame to calculate the covariance between the Year and Population columns:

scala> import org.apache.spark.sql.functions._
scala> statesPopulationDF.select(covar_pop("Year", "Population")).show
+---------------------------+
|covar_pop(Year, Population)|
+---------------------------+
|         183977.56000006935|
+---------------------------+
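
To see both functions side by side, here is a minimal sketch on a tiny, made-up DataFrame. The object name CovarianceSketch, the columns x and y, and the data are assumptions for illustration and are not part of the states population dataset. The sketch also starts its own local SparkSession rather than relying on the one the shell provides.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, covar_pop, covar_samp}

// Hypothetical standalone example; not taken from the original text.
object CovarianceSketch {
  def main(args: Array[String]): Unit = {
    // Local session for illustration only
    val spark = SparkSession.builder()
      .appName("CovarianceSketch")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Toy data (assumed): y is exactly 2 * x, so the two columns move together
    val df = Seq((1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)).toDF("x", "y")

    df.select(
      covar_pop("x", "y").as("pop"),             // String overload, divides by n
      covar_samp(col("x"), col("y")).as("samp")  // Column overload, divides by n - 1
    ).show()

    spark.stop()
  }
}
```

Because y rises with x here, both results are positive: the population covariance works out to 2.5 and the sample covariance to about 3.33 (10 / 3).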
