Fundamentally, divergences are algorithms that compute the similarity between two probability distributions. In the field of information theory, divergences are used to estimate the minimum discrimination information.
Although divergences are not usually defined as dimension-reduction techniques, they are a vital tool for measuring the redundancy of information between features.
Let's consider a set of observations, X, with a feature set {fi}. Two features that are highly correlated generate redundant information (or similar information gain). Therefore, it is conceivable to remove one of these two features from the training set without incurring a loss of information.
The list of divergences is quite extensive; the two most commonly used measures, the Kullback-Leibler divergence and mutual information, are described in this section.
The Kullback-Leibler (KL) divergence is the most commonly used and best-known divergence measure. It computes the relative entropy between two distributions, and for this reason it is also known as relative entropy.
The reader can find a Spark implementation of the Kullback-Leibler divergence in the Apache Spark extensibility section of Chapter 17, Apache Spark MLlib.
In the context of measuring the similarity between two features, p and q, given their frequency distributions over the observations, Pi and Qi, the Kullback-Leibler divergence is computed as follows:
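Using the standard definition of relative entropy, with Pi and Qi denoting the frequencies of p and q in bucket i:

\[ D_{KL}(P \parallel Q) = \sum_i P_i \log\frac{P_i}{Q_i} \]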
Let's implement the Kullback-Leibler algorithm to compute the similarity of the two datasets:
Let's start by defining a generic Divergence trait:
trait Divergence {
  def divergence(nSteps: Int): Double
}
The KL algorithm is implemented by the KullbackLeibler class that takes three parameters: the two observed features, p and q, and an optional normalization factor (line 1):
class KullbackLeibler[@specialized(Double) T: ToDouble](
    p: Seq[T],
    q: Seq[T],
    normalized: Boolean = true  //1
  )(implicit ordering: Ordering[T]) extends Divergence {  //2

  // Conversion of the parameterized type T into a Double
  private val toDouble = implicitly[ToDouble[T]].apply _

  val (min, max): (Double, Double) = (  //3
    Math.min(toDouble(p.min), toDouble(q.min)),
    Math.max(toDouble(p.max), toDouble(q.max))
  )

  override def divergence(nSteps: Int): Double = { ... }
}
The class cannot assume that the features have Scala Double as a data type. Therefore, the constructor defines the following items:
- A context bound, ToDouble, on the parameterized type T, which provides the conversion of a feature value into a Double
- An implicit ordering of the type T, used to extract the minimum and maximum values of the observations (line 2)

The frequency distribution for each dataset requires the computation of the minimum and maximum values across the two observations, p and q (line 3).
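The ToDouble type class is not shown in this section; it is assumed to be defined in the companion source code. A minimal sketch of what it might look like is shown below (the actual definition may differ):

trait ToDouble[T] {
  def apply(t: T): Double
}

object ToDouble {
  // Implicit instances for the numeric types used in this example
  implicit val doubleToDouble: ToDouble[Double] = (x: Double) => x
  implicit val intToDouble: ToDouble[Int] = (n: Int) => n.toDouble
}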
The KL computation is implemented by the overridden divergence method:
override def divergence(nSteps: Int): Double = {
  val freq1 = frequencies(p, nSteps)  //4
  val freq2 = frequencies(q, nSteps)  //5
  val kl = freq1.zip(freq2)./:(0.0) {
    case (sum, (f1, f2)) => {  //6
      val num = if (f2 == 0) 1 else f2
      val den = if (f1 == 0) 1 else f1
      sum + num * Math.log(num.toDouble / den)
    }
  }
  if (normalized) kl / Math.max(p.size, q.size) else kl  //7
}
The divergence method computes the histograms for the two feature observations (lines 4 and 5). The KL formula is applied cumulatively to each pair of bucket frequencies, f1 and f2 (line 6), and summed by a fold. The resulting value is optionally normalized (line 7).
The generation of the histogram is implemented by the frequencies method:
def frequencies(input: Seq[T], nSteps: Int): Array[Int] = {
val histogram = new Histogram(min, max)
histogram.frequencies(nSteps, input.map(toDouble(_)))
}
Handling null frequencies
There is no guarantee that the histograms contain no empty buckets, for which the computation would generate a Double.NaN value (the case of log(0) or a division by zero). There are three options to handle this condition; the one chosen here substitutes a count of 1 for an empty bucket.
The implementation of the ancillary Histogram class and related methods is available in the companion source code.
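For illustration only, a minimal version of such a Histogram class might simply count values in nSteps equal-width bins over [min, max]; the actual implementation in the companion source code may differ:

class Histogram(min: Double, max: Double) {
  // Count the values falling into each of the nSteps equal-width
  // bins over [min, max]
  def frequencies(nSteps: Int, values: Seq[Double]): Array[Int] = {
    val counts = Array.fill(nSteps)(0)
    val binWidth = (max - min) / nSteps
    values.foreach { x =>
      // Clamp the index so that x == max falls into the last bin
      val index = Math.min(((x - min) / binWidth).toInt, nSteps - 1)
      counts(index) += 1
    }
    counts
  }
}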
The purpose of the test is to evaluate the impact of the number of histogram buckets, nSteps, on the accuracy of the Kullback-Leibler divergence as a reliable estimator of the similarity between two features. The algorithm is tested with two identical random gamma distributions (line 8):
val numPoints = 1000000
val nSteps = 100

def gammaDistribution(  //8
    shape: Double,
    scale: Double): Seq[Double] = {
  val gamma = new GammaDistribution(shape, scale)
  Seq.fill(numPoints)(gamma.density(Random.nextDouble))
}

val divergence = new KullbackLeibler[Double](
  gammaDistribution(2.0, 1.0),
  gammaDistribution(2.0, 1.0)
).divergence(nSteps)
The test produces the following output for various values of nSteps:
| nSteps | KL-divergence |
| --- | --- |
| 50 | 2.691 |
| 100 | 0.477 |
| 250 | 6.138e-4 |
| 500 | 7.902e-8 |
As expected, the divergence between the two datasets decreased significantly as the number of buckets used in the histogram increased.
Mutual information is a symmetric divergence that extends the Kullback-Leibler divergence. It measures the similarity between two random distributions, X and Y, by computing the KL divergence of the joint probability of these distributions over the product of their marginal probabilities, p(X) and p(Y). Contrary to the KL divergence, mutual information is symmetric:
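Following the standard definition, the mutual information is the KL divergence between the joint distribution and the product of the marginals:

\[ I(X;Y) = D_{KL}\big(p(X,Y) \parallel p(X)\,p(Y)\big) = \sum_{x}\sum_{y} p(x,y)\,\log\frac{p(x,y)}{p(x)\,p(y)} = I(Y;X) \]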