Fundamentally, divergences are algorithms that compute the similarity between two probability distributions. In the field of information theory, divergences are used to estimate the minimum discrimination information.
Although divergences are not usually defined as dimension-reduction techniques, they are a vital tool for measuring the redundancy of information between features.
Let's consider a set of observations, X, with a feature set {fi}. Two features that are highly correlated generate redundant information (or similar information gain). Therefore, it is conceivable to remove one of these two features from the training set without incurring a loss of information.
The list of divergences is quite extensive; the two most commonly used measures, the Kullback-Leibler divergence and mutual information, are described in this section.
The Kullback-Leibler (KL) divergence is the most commonly used and best-known divergence measure. It computes the relative entropy between two distributions, and for this reason it is also known as relative entropy.
The reader can find a Spark implementation of the Kullback-Leibler divergence in the Apache Spark extensibility section of Chapter 17, Apache Spark MLlib.
In the context of measuring the similarity between two features, p and q, given their frequency distributions over the observations, Pi and Qi, the Kullback-Leibler divergence is computed as follows:
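Using the standard definition of relative entropy, with Pi and Qi denoting the frequencies of p and q in bucket i:

\[ D_{KL}(P \parallel Q) = \sum_i P_i \log\frac{P_i}{Q_i} \]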
Let's implement the Kullback-Leibler algorithm to compute the similarity of the two datasets:
Let's start by defining a generic Divergence trait:
trait Divergence {
  def divergence(nSteps: Int): Double
}
The KL algorithm is implemented by the KullbackLeibler class that takes three parameters: the two observed features, p and q, and an optional normalization factor (line 1):
class KullbackLeibler[@specialized(Double) T: ToDouble](
    p: Seq[T],
    q: Seq[T],
    normalized: Boolean = true  //1
  )(implicit ordering: Ordering[T]) extends Divergence {  //2

  // Conversion of the parameterized type T into a Double
  private val toDouble = implicitly[ToDouble[T]].apply _

  val (min, max): (Double, Double) = (  //3
    Math.min(toDouble(p.min), toDouble(q.min)),
    Math.max(toDouble(p.max), toDouble(q.max))
  )

  override def divergence(nSteps: Int): Double = { ... }
}
The class cannot assume that the features have Scala Double as a data type. Therefore, the constructor defines the following items:
- A context bound, ToDouble, on the parameterized type T, which provides the conversion of a feature value into a Double
- An implicit ordering of the type T, used to extract the minimum and maximum values of the observations (line 2)

The frequency distribution for each dataset requires the computation of the minimum and maximum values across the two observations, p and q (line 3).
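The ToDouble type class is not shown in this section; it is assumed to be defined in the companion source code. A minimal sketch of what it might look like is shown below (the actual definition may differ):

trait ToDouble[T] {
  def apply(t: T): Double
}

object ToDouble {
  // Implicit instances for the numeric types used in this example
  implicit val doubleToDouble: ToDouble[Double] = (x: Double) => x
  implicit val intToDouble: ToDouble[Int] = (n: Int) => n.toDouble
}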
The KL computation is implemented by the overridden divergence method:
override def divergence(nSteps: Int): Double = {
  val freq1 = frequencies(p, nSteps)  //4
  val freq2 = frequencies(q, nSteps)  //5
  val kl = freq1.zip(freq2)./:(0.0) {
    case (sum, (f1, f2)) => {  //6
      val num = if (f2 == 0) 1 else f2
      val den = if (f1 == 0) 1 else f1
      sum + num * Math.log(num.toDouble / den)
    }
  }
  if (normalized) kl / Math.max(p.size, q.size) else kl  //7
}
The divergence method computes the histograms for the two feature observations (lines 4 and 5). The KL formula is applied cumulatively to each pair of bucket frequencies, f1 and f2 (line 6), and summed by a fold. The resulting value is optionally normalized (line 7).
The generation of the histogram is implemented by the frequencies method:
def frequencies(input: Seq[T], nSteps: Int): Array[Int] = {
val histogram = new Histogram(min, max)
histogram.frequencies(nSteps, input.map(toDouble(_)))
}
Handling null frequencies
There is no guarantee that the histograms contain no empty buckets, for which the computation would generate a Double.NaN value (the case of log(0) or a division by zero). There are three options to handle this condition; the one chosen here substitutes a count of 1 for an empty bucket.
The implementation of the ancillary Histogram class and related methods is available in the companion source code.
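For illustration only, a minimal version of such a Histogram class might simply count values in nSteps equal-width bins over [min, max]; the actual implementation in the companion source code may differ:

class Histogram(min: Double, max: Double) {
  // Count the values falling into each of the nSteps equal-width
  // bins over [min, max]
  def frequencies(nSteps: Int, values: Seq[Double]): Array[Int] = {
    val counts = Array.fill(nSteps)(0)
    val binWidth = (max - min) / nSteps
    values.foreach { x =>
      // Clamp the index so that x == max falls into the last bin
      val index = Math.min(((x - min) / binWidth).toInt, nSteps - 1)
      counts(index) += 1
    }
    counts
  }
}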
The purpose of the test is to evaluate the impact of the number of histogram buckets, nSteps, on the accuracy of the Kullback-Leibler divergence as a reliable estimator of the similarity between two features. The algorithm is tested with two identical random gamma distributions (line 8):
val numPoints = 1000000
val nSteps = 100

def gammaDistribution(  //8
    shape: Double,
    scale: Double): Seq[Double] = {
  val gamma = new GammaDistribution(shape, scale)
  Seq.fill(numPoints)(gamma.density(Random.nextDouble))
}

val divergence = new KullbackLeibler[Double](
  gammaDistribution(2.0, 1.0),
  gammaDistribution(2.0, 1.0)
).divergence(nSteps)
The test produces the following output for various values of nSteps:
| nSteps | KL-divergence |
| --- | --- |
| 50 | 2.691 |
| 100 | 0.477 |
| 250 | 6.138e-4 |
| 500 | 7.902e-8 |
As expected, the divergence between the two datasets decreased significantly as the number of buckets used in the histogram increased.
Mutual information is a symmetric divergence that extends the Kullback-Leibler divergence. It measures the similarity between two random distributions, X and Y, by computing the KL divergence of the joint probability of these distributions over the product of their marginal probabilities, p(X) and p(Y). Contrary to the KL divergence, mutual information is symmetric:
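Following the standard definition, the mutual information is the KL divergence between the joint distribution and the product of the marginals:

\[ I(X;Y) = D_{KL}\big(p(X,Y) \parallel p(X)\,p(Y)\big) = \sum_{x}\sum_{y} p(x,y)\,\log\frac{p(x,y)}{p(x)\,p(y)} = I(Y;X) \]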