d([x_1, x_2, \cdots, x_n], [y_1, y_2, \cdots, y_n]) = \left( \sum_{i=1}^{n} |x_i - y_i|^r \right)^{1/r}    (2.3.4)
The case r = 2 is the usual L2-norm just mentioned. Another common
distance measure is the L1-norm, or Manhattan distance. There, the distance
between two points is the sum of the magnitudes of the differences in each
dimension. It is called “Manhattan distance” because it is the distance one
would have to travel between points if one were constrained to travel along
grid lines, as on the streets of a city such as Manhattan.
Another interesting distance measure is the L∞-norm, which is the limit
as r approaches infinity of the Lr-norm. As r gets larger, only the dimension
with the largest difference matters, so formally, the L∞-norm is defined as
the maximum of |x_i - y_i| over all dimensions i.
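To make these norms concrete, the following short Python sketch (an illustrative example that is not part of the original text; the vectors x and y are made up) computes the L1, L2, and L∞ distances between two points with NumPy.

    import numpy as np

    x = np.array([1.0, 5.0, 2.0])
    y = np.array([4.0, 1.0, 2.0])
    diff = np.abs(x - y)             # per-dimension magnitudes |x_i - y_i| = [3, 4, 0]

    l1 = diff.sum()                  # Manhattan (L1) distance: 3 + 4 + 0 = 7
    l2 = np.sqrt((diff ** 2).sum())  # Euclidean (L2) distance: sqrt(9 + 16 + 0) = 5
    linf = diff.max()                # L-infinity distance: largest difference = 4

    print(l1, l2, linf)              # 7.0 5.0 4.0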
2.3.3 Minkowski Distance
The Minkowski distance is a metric on Euclidean space which can be
considered as a generalization of the Euclidean distance. The Minkowski
distance of order p between two points
P = (x_1, x_2, \cdots, x_n) \quad \text{and} \quad Q = (y_1, y_2, \cdots, y_n) \in \mathbb{R}^n    (2.3.5)
is defined as:
\left( \sum_{i=1}^{n} |x_i - y_i|^p \right)^{1/p}    (2.3.6)
The Minkowski distance is a metric as a result of the Minkowski
inequality. Minkowski distance is typically used with p being 1 or 2. The
latter is the Euclidean distance, while the former is sometimes known as
the Manhattan distance. In the limiting case of p reaching infinity we obtain
the Chebyshev distance:
\lim_{p \to \infty} \left( \sum_{i=1}^{n} |x_i - y_i|^p \right)^{1/p} = \max_{i=1}^{n} |x_i - y_i|    (2.3.7)
Similarly, when p reaches negative infinity we have
\lim_{p \to -\infty} \left( \sum_{i=1}^{n} |x_i - y_i|^p \right)^{1/p} = \min_{i=1}^{n} |x_i - y_i|    (2.3.8)
The Minkowski distance is often used when variables are measured
on ratio scales with an absolute zero value. Variables with a wider range
can overpower the result. Even a few outliers with large values can bias the
result and mask the similarity contributed by variables with a lower upper
bound.
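As a sketch of Eqs. (2.3.6)-(2.3.8), the following Python function (a hypothetical helper written for this illustration, not taken from the text) evaluates the Minkowski distance for an arbitrary order p and shows numerically that a large positive p approaches the maximum per-dimension difference (the Chebyshev distance), while a large negative p approaches the minimum.

    import numpy as np

    def minkowski(x, y, p):
        """Minkowski distance of order p, Eq. (2.3.6)."""
        diff = np.abs(np.asarray(x, dtype=float) - np.asarray(y, dtype=float))
        return (diff ** p).sum() ** (1.0 / p)

    x, y = [0.0, 3.0, 4.0], [2.0, 0.0, 0.0]   # per-dimension differences: 2, 3, 4
    print(minkowski(x, y, 1))      # 9.0, the Manhattan distance
    print(minkowski(x, y, 2))      # ~5.39, the Euclidean distance
    print(minkowski(x, y, 100))    # ~4.0, approaching max |x_i - y_i|
    print(minkowski(x, y, -100))   # ~2.0, approaching min |x_i - y_i|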
2.3.4 Chebyshev Distance
In mathematics, Chebyshev distance is a metric defi ned on a vector space
where the distance between two vectors is the greatest of their differences
along any coordinate dimension. It is also known as chessboard distance,
since in the game of chess the minimum number of moves needed by
the king to go from one square on a Chessboard to another equals the
Chebyshev distance between the centers of the squares. The Chebyshev
distance between two vectors or points p and q, with standard coordinates
p_i and q_i, respectively, is

D_{\text{Chebyshev}}(p, q) = \max_i |p_i - q_i|.    (2.3.9)
This equals the limit of the L_p metrics:

\lim_{k \to \infty} \left( \sum_{i=1}^{n} |p_i - q_i|^k \right)^{1/k}.    (2.3.10)

In one dimension, all L_p metrics are equal: they are just the absolute value
of the difference.
Mathematically, the Chebyshev distance is a metric induced by the
supremum norm or uniform norm. It is an example of an injective metric.
In two dimensions, i.e., plane geometry, if the points p and q have Cartesian
coordinates (x_1, y_1) and (x_2, y_2), their Chebyshev distance is

D_{\text{Chess}} = \max(|x_2 - x_1|, |y_2 - y_1|).    (2.3.11)
In fact, the Manhattan, Euclidean, and Chebyshev distances above are all
special cases of the Minkowski distance.
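As a brief illustration (a made-up example, not from the text), the chessboard interpretation can be checked directly in Python: a king standing on square (1, 1) needs max(|4 - 1|, |6 - 1|) = 5 moves to reach square (4, 6).

    def chebyshev(p, q):
        """Chebyshev (chessboard) distance, Eq. (2.3.9)."""
        return max(abs(pi - qi) for pi, qi in zip(p, q))

    print(chebyshev((1, 1), (4, 6)))   # 5 king moves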
2.3.5 Mahalanobis Distance
In statistics, Mahalanobis distance is another distance measure. It is based
on correlations between variables by which different patterns can be
identifi ed and analyzed. It gauges similarity of an unknown sample set to
a known one. It differs from Euclidean distance in that it takes into account
the correlations of the data set and is scale-invariant. In other words, it is a
multivariate effect size. Formally, the Mahalanobis distance of a multivariate
vector x = (x_1, x_2, x_3, \cdots, x_N)^T from a group of values with mean
\mu = (\mu_1, \mu_2, \mu_3, \cdots, \mu_N)^T and covariance matrix S is defined as:

D_M(x) = \sqrt{(x - \mu)^T S^{-1} (x - \mu)}.    (2.3.12)
Mahalanobis distance can also be defined as a dissimilarity measure
between two random vectors x and y of the same distribution with the
covariance matrix S:

d(x, y) = \sqrt{(x - y)^T S^{-1} (x - y)}.    (2.3.13)
If the covariance matrix is the identity matrix, the Mahalanobis distance
reduces to the Euclidean distance. If the covariance matrix is diagonal,
then the resulting distance measure is called the normalized Euclidean
distance:
d(x, y) = \sqrt{\sum_{i=1}^{N} \frac{(x_i - y_i)^2}{s_i^2}},    (2.3.14)

where s_i is the standard deviation of the x_i and y_i over the sample set.
Mahalanobis’ discovery was prompted by the problem of identifying the
similarities of skulls based on measurements. Today, it is widely used
in cluster analysis and classification techniques.
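The following Python sketch (an illustration with made-up data, not part of the original text) evaluates Eq. (2.3.12) for a point against the sample mean and covariance of a small two-dimensional data set.

    import numpy as np

    # A small two-dimensional sample; each row is one observation.
    data = np.array([[2.0, 2.0],
                     [2.0, 5.0],
                     [6.0, 5.0],
                     [7.0, 3.0],
                     [4.0, 7.0],
                     [6.0, 4.0]])

    mu = data.mean(axis=0)             # sample mean vector
    S = np.cov(data, rowvar=False)     # sample covariance matrix
    S_inv = np.linalg.inv(S)

    x = np.array([5.0, 5.0])
    d_M = np.sqrt((x - mu) @ S_inv @ (x - mu))   # Eq. (2.3.12)
    print(d_M)

If S is the identity matrix, the same expression reduces to the ordinary Euclidean distance, matching the remark above.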
2.4 Similarity Measures
2.4.1 Cosine Similarity
In some applications, such as relevance ranking of documents in a
keyword search, the classic vector space model is used. Relevance can be
calculated, under the assumptions of document similarity theory, by
comparing the deviation of angles between each document vector and the
original query vector, where the query is represented as the same kind of
vector as the documents.
An important problem that arises when we search for similar items of
any kind is that there may be far too many pairs of items to test each pair
for their degree of similarity, even if computing the similarity of any one
pair can be made very easy. Finally, we explore notions of “similarity” that
are not expressible as the intersection of sets. This study leads us to consider
the theory of distance measures in arbitrary spaces. Cosine similarity is
often used to compare documents in text mining.
In addition, it is used to measure cohesion within clusters in the
field of data mining. The cosine distance makes sense in spaces that have
dimensions, including Euclidean spaces and discrete versions of Euclidean
spaces, such as spaces where points are vectors with integer components or
boolean (0 or 1) components. In such a space, points may be thought of as
directions. We do not distinguish between a vector and a multiple of that
vector. Then the cosine distance between two points is the angle that the
vectors to those points make. This angle will be in the range of 0º to 180º,
regardless of how many dimensions the space has.
We can calculate the cosine distance by first computing the cosine of the
angle, and then applying the arc-cosine function to translate to an angle in
the 0–180º range. Given two vectors x and y, the cosine of the angle between
them is the dot product of x and y divided by the L2-norms of x and y (i.e.,
their Euclidean distances from the origin). Recall that the dot product of
vectors x = [x_1, x_2, \cdots, x_n] and y = [y_1, y_2, \cdots, y_n] is \sum_{i=1}^{n} x_i y_i;
the cosine similarity is defined as:

\text{CosSim}(x, y) = \frac{x \cdot y}{\|x\| \, \|y\|}    (2.4.1)
We must show that the cosine distance is indeed a distance measure. We
have defined the angle between two vectors to lie in the range of 0º to 180º,
so no negative distance value is possible. Two vectors have an angle of zero
if and only if they point in the same direction, possibly with different
magnitudes. Symmetry is obvious: the angle between x and y is the
same as the angle between y and x. The triangle inequality is best argued
by physical reasoning.
One way to rotate from x to y is to rotate to z and thence to y. The sum of
those two rotations cannot be less than the rotation directly from x to y.
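A minimal Python sketch of Eq. (2.4.1) (the vectors below are hypothetical examples): the cosine of the angle is the dot product divided by the product of the L2-norms, and applying the arc-cosine gives the cosine distance as an angle between 0º and 180º.

    import numpy as np

    def cosine_similarity(x, y):
        """Eq. (2.4.1): dot product divided by the product of the L2-norms."""
        x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
        return x @ y / (np.linalg.norm(x) * np.linalg.norm(y))

    x, y = [1.0, 2.0, 0.0], [2.0, 4.0, 1.0]
    cos = cosine_similarity(x, y)
    angle = np.degrees(np.arccos(np.clip(cos, -1.0, 1.0)))   # cosine distance in degrees
    print(cos, angle)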
2.4.2 Adjusted Cosine Similarity
Although cosine similarity captures the relative pattern of an individual's
ratings, it only distinguishes differences in direction across dimensions and
cannot measure the difference in the values themselves. This can lead to the
following situation: with content rated on a 5-star scale, two users X and Y
rate two resources (1, 2) and (4, 5) respectively. Their cosine similarity is
0.98, which suggests they are very similar. Judging from the scores, however,
X seems to dislike both resources while Y likes them. The reason for this is
that a distance metric measures the absolute distance between points in
space, based directly on their coordinates, whereas cosine similarity depends
on the angle between the vectors and reflects differences in direction, not
location. Hence the adjusted cosine similarity: every dimension value is
reduced by an average value. For example, the average of X's and Y's scores
is 3, so after adjustment the vectors become (-2, -1) and (1, 2); the cosine
similarity is then -0.8, which is negative and far from similar, but clearly
more in line with reality.
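These numbers can be reproduced directly; the short Python check below is an illustration only and is not part of the original text.

    import numpy as np

    def cos_sim(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

    X, Y = np.array([1.0, 2.0]), np.array([4.0, 5.0])
    print(cos_sim(X, Y))                  # ~0.98: the raw ratings look very similar

    mean = np.concatenate([X, Y]).mean()  # the overall rating average is 3
    print(cos_sim(X - mean, Y - mean))    # -0.8: the adjusted vectors disagree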
Based on the above exposition, computing similarity using the basic cosine
measure in the item-based case has one important drawback: the difference in
rating scales between different users is not taken into account. The adjusted
cosine similarity offsets this drawback by subtracting the corresponding
user average from each co-rated pair. Formally, the similarity between items
i and j using this scheme is given by
\text{sim}(i, j) = \frac{\sum_{u \in U} (R_{u,i} - \bar{R}_u)(R_{u,j} - \bar{R}_u)}{\sqrt{\sum_{u \in U} (R_{u,i} - \bar{R}_u)^2} \, \sqrt{\sum_{u \in U} (R_{u,j} - \bar{R}_u)^2}}.    (2.4.2)
Here \bar{R}_u is the average of the u-th user's ratings.
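A sketch of Eq. (2.4.2) in Python (the rating matrix R below is a made-up example; rows are users, columns are items, and for simplicity every user is assumed to have rated both items being compared):

    import numpy as np

    def adjusted_cosine(R, i, j):
        """Adjusted cosine similarity between items i and j, Eq. (2.4.2).
        R[u, k] is user u's rating of item k; all users here co-rate i and j."""
        centered = R - R.mean(axis=1, keepdims=True)   # subtract each user's average rating
        num = (centered[:, i] * centered[:, j]).sum()
        den = np.sqrt((centered[:, i] ** 2).sum()) * np.sqrt((centered[:, j] ** 2).sum())
        return num / den

    R = np.array([[5.0, 4.0, 1.0],
                  [4.0, 5.0, 2.0],
                  [1.0, 2.0, 5.0]])
    print(adjusted_cosine(R, 0, 1))   # positive: items 0 and 1 are rated alike
    print(adjusted_cosine(R, 0, 2))   # negative: items 0 and 2 are rated oppositely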
2.4.3 Kullback-Leibler Divergence
In probability theory and information theory, the Kullback-Leibler
divergence is a nonsymmetric measure of the difference between two
probability distributions P and Q. KL measures the expected number of extra
bits required to code samples from P when using a code based on Q, rather
than using a code based on P. Typically P represents the “true” distribution
of data, observations, or a precisely calculated theoretical distribution.
The measure Q typically represents a theory, model, description, or
approximation of P.
Although it is often intuited as a metric or distance, the KL divergence
is not a true metric—for example, it is not symmetric: the KL from P to Q
is generally not the same as the KL from Q to P. However, its infinitesimal
form, specifically its Hessian, is a metric tensor: it is the Fisher information
metric.
For probability distributions P and Q of a discrete random variable,
their KL divergence is defined to be

D_{\text{KL}}(P \,\|\, Q) = \sum_i P(i) \ln \frac{P(i)}{Q(i)}.    (2.4.3)
In words, it is the average of the logarithmic difference between the
probabilities P and Q, where the average is taken using the probabilities P.
The KL divergence is only defined if P and Q both sum to 1 and if Q(i) > 0
for any i such that P(i) > 0. If the quantity 0 ln 0 appears in the formula,
it is interpreted as zero. For distributions P and Q of a continuous random
variable, the KL divergence is defined to be the integral:
D_{\text{KL}}(P \,\|\, Q) = \int p(x) \ln \frac{p(x)}{q(x)} \, dx,    (2.4.4)
where p and q denote the densities of P and Q. More generally, if P and Q
are probability measures over a set X, and Q is absolutely continuous with
respect to P, then the Kullback-Leibler divergence from P to Q is defined
as

D_{\text{KL}}(P \,\|\, Q) = -\int_X \ln \frac{dQ}{dP} \, dP.    (2.4.5)