d([x_1, x_2, \cdots, x_n], [y_1, y_2, \cdots, y_n]) = \left( \sum_{i=1}^{n} |x_i - y_i|^r \right)^{1/r}    (2.3.4)
The case r = 2 is the usual L2-norm just mentioned. Another common
distance measure is the L1-norm, or Manhattan distance. There, the distance
between two points is the sum of the magnitudes of the differences in each
dimension. It is called “Manhattan distance” because it is the distance one
would have to travel between points if one were constrained to travel along
grid lines, as on the streets of a city such as Manhattan.
Another interesting distance measure is the L∞-norm, which is the limit
as r approaches infinity of the Lr-norm. As r gets larger, only the dimension
with the largest difference matters, so formally, the L∞-norm is defined as
the maximum of |x_i - y_i| over all dimensions i.
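To make these norms concrete, the following short Python sketch (an illustrative example that is not part of the original text; the vectors x and y are made up) computes the L1, L2, and L∞ distances between two points with NumPy.

    import numpy as np

    x = np.array([1.0, 5.0, 2.0])
    y = np.array([4.0, 1.0, 2.0])
    diff = np.abs(x - y)             # per-dimension magnitudes |x_i - y_i| = [3, 4, 0]

    l1 = diff.sum()                  # Manhattan (L1) distance: 3 + 4 + 0 = 7
    l2 = np.sqrt((diff ** 2).sum())  # Euclidean (L2) distance: sqrt(9 + 16 + 0) = 5
    linf = diff.max()                # L-infinity distance: largest difference = 4

    print(l1, l2, linf)              # 7.0 5.0 4.0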
2.3.3 Minkowski Distance
The Minkowski distance is a metric on Euclidean space which can be
considered as a generalization of the Euclidean distance. The Minkowski
distance of order p between two points
P = (x_1, x_2, \cdots, x_n) \quad \text{and} \quad Q = (y_1, y_2, \cdots, y_n) \in \mathbb{R}^n    (2.3.5)
is defined as:
\left( \sum_{i=1}^{n} |x_i - y_i|^p \right)^{1/p}    (2.3.6)
The Minkowski distance is a metric as a result of the Minkowski
inequality. Minkowski distance is typically used with p being 1 or 2. The
latter is the Euclidean distance, while the former is sometimes known as
the Manhattan distance. In the limiting case of p reaching infinity we obtain
the Chebyshev distance:
\lim_{p \to \infty} \left( \sum_{i=1}^{n} |x_i - y_i|^p \right)^{1/p} = \max_{i=1}^{n} |x_i - y_i|    (2.3.7)
Similarly, when p reaches negative infinity we have
\lim_{p \to -\infty} \left( \sum_{i=1}^{n} |x_i - y_i|^p \right)^{1/p} = \min_{i=1}^{n} |x_i - y_i|    (2.3.8)
The Minkowski distance is often used when variables are measured
on ratio scales with an absolute zero value. Variables with a wider range
can overpower the result. Even a few outliers with large values can bias the
result and mask the similarity contributed by variables with a lower upper
bound.
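As a sketch of Eqs. (2.3.6)-(2.3.8), the following Python function (a hypothetical helper written for this illustration, not taken from the text) evaluates the Minkowski distance for an arbitrary order p and shows numerically that a large positive p approaches the maximum per-dimension difference (the Chebyshev distance), while a large negative p approaches the minimum.

    import numpy as np

    def minkowski(x, y, p):
        """Minkowski distance of order p, Eq. (2.3.6)."""
        diff = np.abs(np.asarray(x, dtype=float) - np.asarray(y, dtype=float))
        return (diff ** p).sum() ** (1.0 / p)

    x, y = [0.0, 3.0, 4.0], [2.0, 0.0, 0.0]   # per-dimension differences: 2, 3, 4
    print(minkowski(x, y, 1))      # 9.0, the Manhattan distance
    print(minkowski(x, y, 2))      # ~5.39, the Euclidean distance
    print(minkowski(x, y, 100))    # ~4.0, approaching max |x_i - y_i|
    print(minkowski(x, y, -100))   # ~2.0, approaching min |x_i - y_i|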
2.3.4 Chebyshev Distance
In mathematics, Chebyshev distance is a metric defi ned on a vector space
where the distance between two vectors is the greatest of their differences
along any coordinate dimension. It is also known as chessboard distance,
since in the game of chess the minimum number of moves needed by
the king to go from one square on a Chessboard to another equals the
Chebyshev distance between the centers of the squares. The Chebyshev
distance between two vectors or points p and q, with standard coordinates
p_i and q_i, respectively, is

D_{\text{Chebyshev}}(p, q) = \max_i |p_i - q_i|.    (2.3.9)
This equals the limit of the L_p metrics:

\lim_{k \to \infty} \left( \sum_{i=1}^{n} |p_i - q_i|^k \right)^{1/k}.    (2.3.10)

In one dimension, all L_p metrics are equal: they are just the absolute value
of the difference.
Mathematically, the Chebyshev distance is a metric induced by the
supremum norm or uniform norm. It is an example of an injective metric.
In two dimensions, i.e., plane geometry, if the points p and q have Cartesian
coordinates (x_1, y_1) and (x_2, y_2), their Chebyshev distance is

D_{\text{Chess}} = \max(|x_2 - x_1|, |y_2 - y_1|).    (2.3.11)
In fact, the Manhattan, Euclidean, and Chebyshev distances above are all
special cases of the Minkowski distance.
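As a brief illustration (a made-up example, not from the text), the chessboard interpretation can be checked directly in Python: a king standing on square (1, 1) needs max(|4 - 1|, |6 - 1|) = 5 moves to reach square (4, 6).

    def chebyshev(p, q):
        """Chebyshev (chessboard) distance, Eq. (2.3.9)."""
        return max(abs(pi - qi) for pi, qi in zip(p, q))

    print(chebyshev((1, 1), (4, 6)))   # 5 king moves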
2.3.5 Mahalanobis Distance
In statistics, Mahalanobis distance is another distance measure. It is based
on correlations between variables by which different patterns can be
identifi ed and analyzed. It gauges similarity of an unknown sample set to
a known one. It differs from Euclidean distance in that it takes into account
the correlations of the data set and is scale-invariant. In other words, it is a
multivariate effect size. Formally, the Mahalanobis distance of a multivariate
vector x = (x_1, x_2, x_3, \cdots, x_N)^T from a group of values with mean
\mu = (\mu_1, \mu_2, \mu_3, \cdots, \mu_N)^T and covariance matrix S is defined as:

D_M(x) = \sqrt{(x - \mu)^T S^{-1} (x - \mu)}.    (2.3.12)
Mahalanobis distance can also be defined as a dissimilarity measure
between two random vectors x and y of the same distribution with the
covariance matrix S:

d(x, y) = \sqrt{(x - y)^T S^{-1} (x - y)}.    (2.3.13)
If the covariance matrix is the identity matrix, the Mahalanobis distance
reduces to the Euclidean distance. If the covariance matrix is diagonal,
then the resulting distance measure is called the normalized Euclidean
distance:
d(x, y) = \sqrt{\sum_{i=1}^{N} \frac{(x_i - y_i)^2}{s_i^2}},    (2.3.14)

where s_i is the standard deviation of the x_i and y_i over the sample set.
Mahalanobis’ discovery was prompted by the problem of identifying the
similarities of skulls based on measurements. Today, it is widely used
in cluster analysis and classification techniques.
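The following Python sketch (an illustration with made-up data, not part of the original text) evaluates Eq. (2.3.12) for a point against the sample mean and covariance of a small two-dimensional data set.

    import numpy as np

    # A small two-dimensional sample; each row is one observation.
    data = np.array([[2.0, 2.0],
                     [2.0, 5.0],
                     [6.0, 5.0],
                     [7.0, 3.0],
                     [4.0, 7.0],
                     [6.0, 4.0]])

    mu = data.mean(axis=0)             # sample mean vector
    S = np.cov(data, rowvar=False)     # sample covariance matrix
    S_inv = np.linalg.inv(S)

    x = np.array([5.0, 5.0])
    d_M = np.sqrt((x - mu) @ S_inv @ (x - mu))   # Eq. (2.3.12)
    print(d_M)

If S is the identity matrix, the same expression reduces to the ordinary Euclidean distance, matching the remark above.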
2.4 Similarity Measures
2.4.1 Cosine Similarity
In some applications, such as relevance ranking of documents in a
keyword search, the classic vector space model is used. Relevance can be
calculated, under the assumptions of document similarity theory, by
comparing the deviation of angles between each document vector and the
original query vector, where the query is represented as the same kind of
vector as the documents.
An important problem that arises when we search for similar items of
any kind is that there may be far too many pairs of items to test each pair
for their degree of similarity, even if computing the similarity of any one
pair can be made very easy. Finally, we explore notions of “similarity” that
are not expressible as the intersection of sets. This study leads us to consider
the theory of distance measures in arbitrary spaces. Cosine similarity is
often used to compare documents in text mining.
In addition, it is used to measure cohesion within clusters in the
field of data mining. The cosine distance makes sense in spaces that have
dimensions, including Euclidean spaces and discrete versions of Euclidean
spaces, such as spaces where points are vectors with integer components or
boolean (0 or 1) components. In such a space, points may be thought of as
directions. We do not distinguish between a vector and a multiple of that
vector. Then the cosine distance between two points is the angle that the
vectors to those points make. This angle will be in the range of 0º to 180º,
regardless of how many dimensions the space has.
We can calculate the cosine distance by first computing the cosine of the
angle, and then applying the arc-cosine function to translate to an angle in
the 0–180º range. Given two vectors x and y, the cosine of the angle between
them is the dot product of x and y divided by the L2-norms of x and y (i.e.,
their Euclidean distances from the origin). Recall that the dot product of
vectors x = [x_1, x_2, \cdots, x_n] and y = [y_1, y_2, \cdots, y_n] is \sum_{i=1}^{n} x_i y_i;
the cosine similarity is defined as:

\text{CosSim}(x, y) = \frac{x \cdot y}{\|x\| \, \|y\|}    (2.4.1)
We must show that the cosine distance is indeed a distance measure. We
have defined the angle between two vectors to lie in the range of 0º to 180º,
so no negative distance value is possible. Two vectors have an angle of zero
if and only if they point in the same direction, possibly with different
magnitudes. Symmetry is obvious: the angle between x and y is the
same as the angle between y and x. The triangle inequality is best argued
by physical reasoning.
One way to rotate from x to y is to rotate to z and thence to y. The sum of
those two rotations cannot be less than the rotation directly from x to y.
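A minimal Python sketch of Eq. (2.4.1) (the vectors below are hypothetical examples): the cosine of the angle is the dot product divided by the product of the L2-norms, and applying the arc-cosine gives the cosine distance as an angle between 0º and 180º.

    import numpy as np

    def cosine_similarity(x, y):
        """Eq. (2.4.1): dot product divided by the product of the L2-norms."""
        x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
        return x @ y / (np.linalg.norm(x) * np.linalg.norm(y))

    x, y = [1.0, 2.0, 0.0], [2.0, 4.0, 1.0]
    cos = cosine_similarity(x, y)
    angle = np.degrees(np.arccos(np.clip(cos, -1.0, 1.0)))   # cosine distance in degrees
    print(cos, angle)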
2.4.2 Adjusted Cosine Similarity
Although cosine similarity captures the relative pattern of an individual's
ratings, it only distinguishes differences in direction across dimensions and
cannot measure the difference in the values themselves. This can lead to the
following situation: with content rated on a 5-star scale, two users X and Y
rate two resources (1, 2) and (4, 5) respectively. Their cosine similarity is
0.98, which suggests they are very similar. Judging from the scores, however,
X seems to dislike both resources while Y likes them. The reason for this is
that a distance metric measures the absolute distance between points in
space, based directly on their coordinates, whereas cosine similarity depends
on the angle between the vectors and reflects differences in direction, not
location. Hence the adjusted cosine similarity: every dimension value is
reduced by an average value. For example, the average of X's and Y's scores
is 3, so after adjustment the vectors become (-2, -1) and (1, 2); the cosine
similarity is then -0.8, which is negative and far from similar, but clearly
more in line with reality.
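These numbers can be reproduced directly; the short Python check below is an illustration only and is not part of the original text.

    import numpy as np

    def cos_sim(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

    X, Y = np.array([1.0, 2.0]), np.array([4.0, 5.0])
    print(cos_sim(X, Y))                  # ~0.98: the raw ratings look very similar

    mean = np.concatenate([X, Y]).mean()  # the overall rating average is 3
    print(cos_sim(X - mean, Y - mean))    # -0.8: the adjusted vectors disagree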
Based on the above exposition, computing similarity using the basic cosine
measure in the item-based case has one important drawback: the difference in
rating scales between different users is not taken into account. The adjusted
cosine similarity offsets this drawback by subtracting the corresponding
user average from each co-rated pair. Formally, the similarity between items
i and j using this scheme is given by
\text{sim}(i, j) = \frac{\sum_{u \in U} (R_{u,i} - \bar{R}_u)(R_{u,j} - \bar{R}_u)}{\sqrt{\sum_{u \in U} (R_{u,i} - \bar{R}_u)^2} \, \sqrt{\sum_{u \in U} (R_{u,j} - \bar{R}_u)^2}}.    (2.4.2)
Here \bar{R}_u is the average of the u-th user's ratings.
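A sketch of Eq. (2.4.2) in Python (the rating matrix R below is a made-up example; rows are users, columns are items, and for simplicity every user is assumed to have rated both items being compared):

    import numpy as np

    def adjusted_cosine(R, i, j):
        """Adjusted cosine similarity between items i and j, Eq. (2.4.2).
        R[u, k] is user u's rating of item k; all users here co-rate i and j."""
        centered = R - R.mean(axis=1, keepdims=True)   # subtract each user's average rating
        num = (centered[:, i] * centered[:, j]).sum()
        den = np.sqrt((centered[:, i] ** 2).sum()) * np.sqrt((centered[:, j] ** 2).sum())
        return num / den

    R = np.array([[5.0, 4.0, 1.0],
                  [4.0, 5.0, 2.0],
                  [1.0, 2.0, 5.0]])
    print(adjusted_cosine(R, 0, 1))   # positive: items 0 and 1 are rated alike
    print(adjusted_cosine(R, 0, 2))   # negative: items 0 and 2 are rated oppositely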
2.4.3 Kullback-Leibler Divergence
In probability theory and information theory, the Kullback-Leibler
divergence is a nonsymmetric measure of the difference between two
probability distributions P and Q. KL measures the expected number of extra
bits required to code samples from P when using a code based on Q, rather
than using a code based on P. Typically P represents the “true” distribution
of data, observations, or a precisely calculated theoretical distribution.
The measure Q typically represents a theory, model, description, or
approximation of P.
Although it is often intuited as a metric or distance, the KL divergence
is not a true metric—for example, it is not symmetric: the KL from P to Q
is generally not the same as the KL from Q to P. However, its infinitesimal
form, specifically its Hessian, is a metric tensor: it is the Fisher information
metric.
For probability distributions P and Q of a discrete random variable,
their KL divergence is defined to be

D_{\text{KL}}(P \,\|\, Q) = \sum_i P(i) \ln \frac{P(i)}{Q(i)}.    (2.4.3)
In words, it is the average of the logarithmic difference between the
probabilities P and Q, where the average is taken using the probabilities P.
The KL divergence is only defined if P and Q both sum to 1 and if Q(i) > 0
for any i such that P(i) > 0. If the quantity 0 ln 0 appears in the formula,
it is interpreted as zero. For distributions P and Q of a continuous random
variable, the KL divergence is defined to be the integral:
D_{\text{KL}}(P \,\|\, Q) = \int p(x) \ln \frac{p(x)}{q(x)} \, dx,    (2.4.4)
where p and q denote the densities of P and Q. More generally, if P and Q
are probability measures over a set X, and Q is absolutely continuous with
respect to P, then the Kullback-Leibler divergence from P to Q is defined
as

D_{\text{KL}}(P \,\|\, Q) = -\int_X \ln \frac{dQ}{dP} \, dP.    (2.4.5)