x_i = a_{i,1} s_1 + · · · + a_{i,k} s_k + · · · + a_{i,n} s_n,

weighted by the mixing weights a_{i,k}. The same generative model can be written in vectorial form as x = Σ_{k=1}^{n} s_k a_k, where the observed random vector x is represented by the basis vectors a_k = (a_{1,k}, · · · , a_{m,k})^T. The basis vectors a_k form the columns of the mixing matrix A = (a_1, · · · , a_n) and the generative formula can be written as x = As, where s = (s_1, · · · , s_n)^T.
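Before turning to estimation, a minimal numerical sketch may help; the sine/uniform sources and the 2 × 2 mixing matrix below are illustrative assumptions, not taken from the text:

```python
import numpy as np

rng = np.random.default_rng(0)

# Two hypothetical independent sources: a sine wave and uniform noise.
t = np.linspace(0, 1, 1000)
s = np.vstack([np.sin(2 * np.pi * 5 * t),
               rng.uniform(-1, 1, size=t.shape)])

# An illustrative square (n = m = 2) mixing matrix A.
A = np.array([[1.0, 0.5],
              [0.3, 2.0]])

X = A @ s          # each observation x = As mixes the sources
print(X.shape)     # (2, 1000): two observed mixtures, one per row
```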
Given the model and realizations (samples) x_1, · · · , x_N of the random vector x, the task is
to estimate both the mixing matrix A and the sources s. This is done by
adaptively calculating the weight vectors w and setting up a cost function which either maximizes the non-Gaussianity of the calculated s_k = w^T x or
minimizes the mutual information. In some cases, prior knowledge of the probability distributions of the sources can be used in the cost function. The
original sources s can be recovered by multiplying the observed signals x
with the inverse of the mixing matrix, W = A^{-1}, also known as the unmixing
matrix. Here it is assumed that the mixing matrix is square (n = m). If the
number of basis vectors is greater than the dimensionality of the observed
vectors, n > m, the task is overcomplete but still solvable with the pseudoinverse. In the linear noisy ICA model, with the added assumption of zero-mean, uncorrelated Gaussian noise n ~ N(0, Σ) with diagonal covariance Σ, the ICA model takes the form x = As + n. In the nonlinear ICA model, the mixing of the sources need not be linear: using a nonlinear mixing function f(·|θ) with parameters θ, the model becomes x = f(s|θ) + n.
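In practice, the square, noise-free model can be estimated with off-the-shelf tools. The sketch below uses scikit-learn's FastICA, one widely used estimator based on maximizing non-Gaussianity; the shapes and parameter choices are illustrative assumptions, continuing the mixing sketch above.

```python
import numpy as np
from sklearn.decomposition import FastICA

# Continuing the sketch above: X.T holds one observation x per row.
# FastICA estimates s by maximizing the non-Gaussianity of w^T x.
ica = FastICA(n_components=2, random_state=0)
S_est = ica.fit_transform(X.T)   # estimated sources, up to order and scale
A_est = ica.mixing_              # estimated mixing matrix A
W_est = np.linalg.pinv(A_est)    # unmixing matrix; equals A^{-1} when square
```

The recovered sources match the true ones only up to permutation and scaling, a well-known ambiguity of the ICA model.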
2.5.3 Non-negative Matrix Factorization
Non-negative matrix factorization (NMF) is a group of algorithms in
multivariate analysis and linear algebra where a matrix X is factorized into (usually) two matrices W and H, such that X ≈ WH.
Factorization of matrices is generally non-unique, and a number
of different methods of doing so have been developed (e.g., principal
component analysis and singular value decomposition) by incorporating
different constraints; non-negative matrix factorization differs from these
methods in that it enforces the constraint that the factors W and H must be
non-negative, i.e., all elements must be equal to or greater than zero.
Let matrix V be the product of the matrices W and H such that:
WH = V
Matrix multiplication can be implemented as linear combinations of
column vectors in W with coefficients supplied by cell values in H. Each
column in V can be computed as follows:
v_i = Σ_{j=1}^{N} H_{ji} w_j
where N is the number of columns in W, v_i is the ith column vector of the product matrix V, H_{ji} is the cell value in the jth row and ith column of the matrix H, and w_j is the jth column of the matrix W. When multiplying matrices, the factor matrices can be of significantly lower rank than the product matrix, and it is this property that forms the basis of NMF. If we can factorize a matrix into factors of significantly lower rank than the original matrix, then the column vectors of the first factor matrix can be considered as spanning vectors of the vector space defined by the original matrix.
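A short numpy check makes this concrete (the 4 × 2 and 2 × 3 dimensions below are arbitrary illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(1)
W = rng.random((4, 2))   # non-negative factor with N = 2 columns
H = rng.random((2, 3))   # non-negative factor
V = W @ H

# Column i of V is sum_j H[j, i] * W[:, j], matching the formula above.
i = 0
v_i = sum(H[j, i] * W[:, j] for j in range(W.shape[1]))
assert np.allclose(v_i, V[:, i])
```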
Here is an example based on a text-mining application:
Let the input matrix (the matrix to be factored) be V with 10000 rows
and 500 columns where words are in rows and documents are in
columns. In other words, we have 500 documents indexed by 10000
words. It follows that a column vector v in V represents a document.
Assume we ask the algorithm to find 10 features in order to generate a features matrix W with 10000 rows and 10 columns and a coefficients matrix H with 10 rows and 500 columns.
The product of W and H is a matrix with 10000 rows and 500 columns,
the same shape as the input matrix V and, if the factorization worked,
also a reasonable approximation to the input matrix V.
From the treatment of matrix multiplication above it follows that each column in the product matrix WH is a linear combination of the 10 column vectors in the features matrix W with coefficients supplied by the coefficients matrix H.
This last point is the basis of NMF because we can consider each original
document in our example as being built from a small set of hidden features.
NMF generates these features.
It is useful to think of each feature (column vector) in the features matrix W as a document archetype comprising a set of words where each word's cell value defines the word's rank in the feature: the higher a word's cell value, the higher the word's rank in the feature. A column in the coefficients matrix H represents an original document with a cell value defining the document's rank for a feature. This follows because each row in H represents a feature. We can now reconstruct a document (column vector) from our input matrix by a linear combination of our features (column vectors in W), where each feature is weighted by the feature's cell value from the document's column in H.
2.5.4 Singular Value Decomposition
A number of data sets are naturally described in matrix form. Examples range from microarrays to collaborative filtering data, to the set of pairwise distances of a cloud of points. In many of these examples, singular value decomposition (SVD) provides an efficient way to construct a low-rank approximation, thus achieving both dimensionality reduction and effective denoising. SVD is also an important tool in the design of approximate linear algebra algorithms for massive data sets.
In linear algebra, the singular value decomposition (SVD) is a factorization of a real or complex matrix, with many useful applications in signal processing and statistics. Formally, the singular value decomposition of an m × n real or complex matrix M is a factorization of the form

M = UΣV*,

where U is an m × m real or complex unitary matrix, Σ is an m × n rectangular diagonal matrix with non-negative real numbers on the diagonal, and V* (the conjugate transpose of V) is an n × n real or complex unitary matrix. The diagonal entries Σ_{i,i} of Σ are known as the singular values of M. The m columns of U and the n columns of V are called the left-singular vectors and right-singular vectors of M, respectively.
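A minimal numpy sketch (the matrix values are arbitrary) shows the factorization and the shapes involved:

```python
import numpy as np

M = np.array([[1.0, 0.0, 2.0],
              [0.0, 3.0, 0.0]])        # an arbitrary 2 x 3 real matrix

U, s, Vt = np.linalg.svd(M)            # s holds the singular values
Sigma = np.zeros(M.shape)
Sigma[:len(s), :len(s)] = np.diag(s)   # rebuild the 2 x 3 diagonal Sigma

assert np.allclose(M, U @ Sigma @ Vt)  # M = U Sigma V*
```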
The singular value decomposition and the eigendecomposition are closely related. Namely:
The left-singular vectors of M are eigenvectors of MM*.
The right-singular vectors of M are eigenvectors of M*M.
The non-zero singular values of M (found on the diagonal entries of Σ) are the square roots of the non-zero eigenvalues of both M*M and MM*.
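These relations can be verified numerically; the sketch below reuses the arbitrary real matrix M from above, for which the conjugate transpose M* is simply the transpose:

```python
import numpy as np

M = np.array([[1.0, 0.0, 2.0],
              [0.0, 3.0, 0.0]])        # same arbitrary real matrix as above
U, s, Vt = np.linalg.svd(M)

eig_MMt = np.linalg.eigvalsh(M @ M.T)  # eigenvalues of MM* (2 values)
eig_MtM = np.linalg.eigvalsh(M.T @ M)  # eigenvalues of M*M (3 values)

# Non-zero eigenvalues of both products equal the squared singular values.
assert np.allclose(np.sort(s**2), np.sort(eig_MMt))
assert np.allclose(np.sort(s**2), np.sort(eig_MtM)[-len(s):])
```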
Applications which employ the SVD include computing the pseudoinverse, least squares fitting of data, matrix approximation, and determining the rank, range, and null space of a matrix.
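For example, the Moore-Penrose pseudoinverse follows directly from the SVD by transposing the factors and inverting the non-zero singular values; a brief sketch using the same arbitrary matrix as above:

```python
import numpy as np

M = np.array([[1.0, 0.0, 2.0],
              [0.0, 3.0, 0.0]])

U, s, Vt = np.linalg.svd(M, full_matrices=False)   # thin SVD
s_inv = np.where(s > 1e-12, 1.0 / s, 0.0)          # invert non-zero values
M_pinv = Vt.T @ np.diag(s_inv) @ U.T               # M+ = V diag(1/s) U*

assert np.allclose(M_pinv, np.linalg.pinv(M))      # matches numpy's pinv
```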
2.6 Chapter Summary
In this chapter, we have systematically presented the mathematical
foundations used in this book. We start with data organization and
distribution followed by the intensive discussion on distance and similarity
measures. This chapter also covers the important issue of dimensionality
reduction approaches that is commonly used in vector space models. The
aim of this chapter is to lay down a solid foundation for readers to better
understand the techniques and algorithms mentioned in later chapters.