Fei Wang
IBM T. J. Watson Research Center
Yorktown Heights, NY
[email protected]
Distance metric learning is a fundamental problem in data mining and knowledge discovery, and it is of key importance for many real world applications. For example, information retrieval utilizes the learned distance metric to measure the relevance between the candidate data and the query; clinical decision support uses the learned distance metric to measure pairwise patient similarity [19, 23, 24]; pattern recognition can use the learned distance metric to match most similar patterns. In a broader sense, distance metric learning lies in the heart of many data classification problems. As long as a proper distance metric is learned, we can always adopt k-Nearest Neighbor (kNN) classifier [4] to classify the data. In recent years, many studies have demonstrated [12 , 27, 29], either theoretically or empirically, that learning a good distance metric can greatly improve the performance of data classification tasks.
In this survey, we will give an overview of the existing supervised distance metric learning approaches and point out their strengths and limitations, as well as present challenges and future research directions. We focus on supervised algorithms because they are under the same setting as data classification. We will categorize those algorithms from the aspect of linear/nonlinear, local/global, transductive/inductive, and also the computational technology involved.
In the rest of this chapter, we will first introduce the definition of distance metric in Section 18.2. Then we will overview existing supervised metric learning algorithms in Section 18.3, followed by discussions and conclusions in Section 18.5. Table 18.1 summarizes the notations and symbols that will be used throughout the paper.
The Meanings of Various Symbols That Will be Used Throughout This Chapter
Symbol |
Meaning |
---|---|
n |
number of data |
d |
data dimensionality |
xi |
the i-th data vector |
X |
data matrix |
M |
precision matrix of the generalized Mahalanobis distance |
wi |
the i-th projection vector |
W |
projection matrix |
Ni |
the neighborhood of xi |
ϕ(·) |
nonlinear mapping used in kernel methods |
K |
kernel matrix |
L |
Laplacian matrix |
Before describing different types of distance metric learning algorithms, we first define necessary notations and concepts on distance metric learning.
Throughout the paper, we use χ to represent a set of data points. If x, y, z ∈ χ are data vectors with the same dimensionality, we call D : χ × χ → ℝ a Distance Metric if it satisfies the following four properties:1
If we relax the coincidence condition to if x = y ⇒ D(x, y) = 0, then D is called a Pseudo Metric. There are many well-known distance metrics, such as the Euclidean and Mahalanobis distance below:
D(x,y)=√(x−y)⊤(x−y).(18.1)
D(x,y)=√(x−y)⊤S(x−y)(18.2)
where S is the inverse of the data covariance matrix (also referred to as the precision matrix).3
Most of the recent distance metric learning algorithms can be viewed as learning a generalized Mahalanobis distance defined as below:
Generalized Mahalanobis Distance (GMD). A GMD measures the distance between data vectors x and y by
D(x,y)=√(x−y)⊤M(x−y)(18.3)
where M is some arbitrary Symmetric Positive Semi-Definite (SPSD) matrix.
The major goal of learning a GMD is to learn a proper M. As M is SPSD, we can decompose M as M = UΛU⊤ with eigenvalue decomposition, where U is a matrix collecting all eigenvectors of M, and Λ is a diagonal matrix with all eigenvalues of M on its diagonal line. Let W = UΛ1/2, then we have
D(x,y)=√(x−y)⊤WW⊤(x−y)=√(W⊤(x−y))⊤(W⊤(x−y))=√(˜x−˜y)⊤(˜x−˜y)(18.4)
where ˜x=W⊤x. Therefore GMD is equivalent to the Euclidean distance in some projected space. Based on this observation, we define distance metric learning as follows:
Distance Metric Learning. The problem of learning a distance function D for a pair of data points x and y is to learn a mapping function f, such that f(x) and f(y) will be in the Euclidean space and D(x,y)=||f(x)−f(y)||, where || ⋅ || is the l2 norms.
With this definition, we can also categorize whether a distance metric learning algorithm is linear or nonlinear based on whether the projection is linear or nonlinear.
In this section we will review the existing supervised distance metric learning algorithms, which learn distance metrics on both data points and their supervision information, which is usually in the form of (1) Data labels, which indicate the class information each training data point belongs to. The assumption is that distance between data points with the same label should be closer to distance between data points from different labels. (2) Pairwise constraints indicate whether a pair of data points should belong to the same class (must-links) or not (cannot-links). Before going into the details of those algorithms, we first categorize those supervised methodologies into different types according to their characteristics, which is shown in Table 18.2.
Supervised Distance Metric Learning Algorithms
Note: ED stands for Eigenvalue Decomposition, QP stands for Quadratic Programming, GD stands for Gradient Descent, and all other abbreviations can be found in the main text.
Linear Discriminant Analysis (LDA) [6] is one of the most popular supervised linear embedding methods. It seeks for the projection directions under which the data from different classes are well separated. More concretely, supposing that the data set belongs to C different classes, LDA defines the compactness matrix and scatterness matrix as
∑C=1C∑c1nc∑xi∈c(xi−ˉxc)(xi−ˉxc)⊤(18.5)
∑S=1C∑c(ˉxc−ˉx)(ˉxc−ˉx)⊤.(18.6)
The goal of LDA is to find a W that maximizes
minW⊤W=Itr(W⊤∑CW)tr(W⊤∑SW).(18.7)
By expanding the numerator and denominator of the above expression, we can observe that the numerator corresponds to the sum of distances between each data point to its class center after projection, and the denominator represents the sum of distances between every class center to the entire data center after projection. Therefore, minimizing the objective will maximize the between-class scatterness while minimizing the within-class scatterness after projection. Solving problem (18.7) directly is hard, some researchers [8, 10] have done research on this topic. LDA is a linear and global method. The learned distance between xi and xj is the Euclidean distance between W⊤xi and W⊤xj.
Kernelization of LDA: Similar to the case of PCA, we can extend LDA to the nonlinear case via the kernel trick, which is called Kernel Discriminant Analysis (KDA)) [14]. After mapping the data into the feature space using ϕ, we can compute the compactness and scatterness matrices as
∑ϕC=1C∑c1nc∑xi∈c(ϕ(xi)−ˉϕc)(ϕ(xi)−ˉϕc)⊤(18.8)
∑S=1C∑c(ˉϕc−ˉϕ)(ˉϕc−ˉϕ)⊤.(18.9)
Suppose the projection matrix we want to get is wϕ in the feature space, then with the representer theorem
Wϕ=Φα(18.10)
where Φ=[ϕ(x1),ϕ(x2),⋯,ϕ(xn)] and α is the coefficient vector over all ϕ(xi) for 1≤i≤n. We define K=Φ⊤Φ as the kernel matrix.
Then
(Wϕ)⊤∑ϕCWϕ=α⊤[1CC∑c=11nc∑xi∈cΦ⊤(ϕ(xi)−ˉϕc)(ϕ(xi)−ˉϕc)⊤Φ]α=α⊤[1CC∑c=11nc∑xi∈c(K.i−ˉK.c)⊤]α=α⊤MCα(18.11)
where K.i = and ˉK.c=1nc∑xi∈cK.i.
(Wϕ)⊤∑ϕSWϕ=α⊤[1C∑c(ˉK.c−ˉK.*)(ˉK.c−ˉK.*)⊤]α=α⊤MSα(18.12)
where K.*=1nc∑ni=1K.i Therefore we can get α by solving
minα⊤α=Itr(α⊤MCα)tr(α⊤MSα).(18.13)
MMDA can be viewed as an extension of the Support Vector Machine (SVM) algorithm [21]. Supposing data points X=[x1,x2,⋯,xn] come from two classes, the label of xi is li∈{−1,+1}. As a recap, the goal of SVM is to find the maximum-margin hyperplane that divides the points with li = 1 from those with li = −1 (thus it is a supervised method). Any hyperplane can be written as a point x satisfying
w⊤x−b=0(18.14)
where b/||w|| corresponds to the distance of the hyperplane from the origin. SVM aims to choose the w and b to maximize the distance between the parallel hyperplanes that are as far apart as possible while still separating the data, which is usually referred to as the margin of the two classes. These hyperplanes can be described by the equations
w⊤x−b=1orw⊤x−b=−1.(18.15)
The distance between the two parallel hyperplanes is 2/||w||. Then if the data from two classes are clearly separated, the goal of SVM is to solve the following optimization problem to find the hyperplane that maximizes the margin between two classes
minw,b12||w||2(18.16)s.t.li(w⊤xi−b)≥1(∀i=1,2,⋯,n).
However in reality the two classes may not be perfectly separable, i.e., there might be some overlapping between them. Then we need soft margin SVM, which aims at solving
minw,b,ξ12||w||2+Cn∑i=1ξi(18.17)s.t.li(w⊤xi−b)≥1−ξi(∀i=1,2,⋯,n)
where {ξi}≥0 are slack variables used to penalize the margin on the overlapping region.
MMDA aims to solve more than one projection directions, which aims to solve the following optimization problem
minW,b,ξr≥012d∑r=1||wr||2+Cnd∑r=1n∑i=1ξri(18.18)s.t.∀i=1,…,n, r=1,…,dli((wr)Txi+b)≥1−ξri,WTW = I.
Therefore MMDA is a global and linear approach. One can also apply the kernel trick to make it nonlinear; the details can be found in [11]. The learned distance between xi and xj is just the Euclidean distance between W⊤xi and W⊤xj.
Both LDA and MMDA use data labels as the supervision information. As we introduced at the beginning of Section 18.3, another type of supervision information we considered is pairwise constraints. The data label information is more strict in the sense that we can convert data labels into pairwise constraints, but not vice versa.
One of the earliest researches that makes use of pairwise constraints for learning a proper distance metric is the Learning with Side-Information (LSI) approach [28]. We denote the set of must-link constraints as M and the set of cannot-link constraints as C; then the goal of LSI is to solve the following optimization problem
maxM∑(xi,xj)∈C(xi−xj)⊤M(xi−xj)(18.19)s.t.∑(xu,xv)∈M(xu−xv)⊤M(xu−xv)≤1M≽0.
This is a quadratic optimization problem and [28] proposed an iterative projected gradient ascent method to solve it. As M is positive semi-definite, we can always factorize it as M=WW⊤. Thus LSI is a global and linear approach. The learned distance formulation is exactly the general Mahalanobis distance with precision matrix M.
Relevant Component Analysis (RCA) [18] is another representative distance metric learning algorithm utilizing pairwise data constraints. The goal of RCA is to find a transformation that amplifies relevant variability and suppresses irrelevant variability. We consider that data variability is correlated with a specific task if the removal of this variability from the data deteriorates (on average) the results of clustering or retrieval. Variability is irrelevant if it is maintained in the data but not correlated with the specific task [18]. We also define small clusters called chunklets, which are connected components derived by all the must-links. The specific steps involved in RCA include:
C=1pk∑j=1nj∑i=1(xji−ˉmj)(xji−ˉmj)⊤.(18.20)
Therefore, RCA is a global, linear approach.
Information theoretic objective is one mechanism to develop a supervised distance metric. Information Theoretic Metric Learning (ITML) [3] is one such representative algorithm. Suppose we have an initial generalized Mahalanobis distance parameterized by precision matrix M0, a set M of must-link constraints, and a set C of cannot-link constraints. ITML solves the following optimization problem
minM≽0dlogdet(M,M0)(18.21)s.t.(xi−xj)⊤M(xi−xj)≥l,(xi,xj)∈C(xu−xv)⊤M(xu−xv)⋝u,(xu,xv)∈M
where
dlogdet(M,M0)=tr(MM−10)−logdet(MM−10)−n(18.22)
where dlogdet is the LogDet divergence, which is also known as Stein’s loss. It can be shown that Stein’s loss is the unique scale invariant loss-function for which the uniform minimum variance unbiased estimator is also a minimum risk equivariant estimator [3]. The authors in [3] also proposed an efficient Bregman projection approach to solve problem (18.21). ITML is a global and linear approach. The learned distance metric is the Mahalanobis distance with precision matrix M.
All the supervised approaches we introduced above are global methods. Next we will also introduce several representative local supervised metric learning algorithms. First we will overview Neighborhood Component Analysis (NCA) [7]. Similar as in SNE, described in unsupervised metric learning, each point xi selects another point xj as its neighbor with some probability pij, and inherits its class label from the point it selects. NCA defines the probability that point i selects point j as a neighbor:
pij=exp(−||W⊤xi−W⊤xj||2)∑k≠iexp(−||W⊤xi−W⊤xk||2).(18.23)
Under this stochastic selection rule, NCA computes the probability that point i will be correctly classified
pi=∑j∈ℒipij(18.24)
where ℒi={j|li=lj} that is the set of points in the same class as point i.
The objective NCA maximizes is the expected number of points correctly classified under this scheme:
J(W)=∑ipi=∑i∑j∈ℒipij(18.25)
[7] proposed a truncated gradient descent approach to minimize J(W). NCA is a local and linear approach. The learned distance between xi and xj is the Euclidean distance between W⊤xi and W⊤xj.
Average Neighborhood Margin Maximization (ANMM) [25] is another local supervised metric learning approach, which aims to find projection directions where the local class discriminability is maximized. To define local discriminability, [25] first defines the following two types of neighborhoods:
Definition 1(Homogeneous Neighborhoods). For a data point xi, its ξ nearest homogeneous neighborhood Nio is the set of ξ most similar4 data, which are in the same class with xi.
Definition 2(Heterogeneous Neighborhoods). For a data point xi, its ζ nearest heterogeneous neighborhood Nie is the set of ζ most similar data, which are not in the same class with xi.
Then the average neighborhood margin γi for xi is defined as
γi=∑k:xk∈Nei||yi−yk|||Nei|−∑j:xj∈Noi||yi−yj|||Noi|,
where |⋅| represents the cardinality of a set. This margin measures the difference between the average distance from xi to the data points in its heterogeneous neighborhood and the average distance from it to the data points in its homogeneous neighborhood. The maximization of such a margin can push the data points whose labels are different from xi away from xi while pulling the data points having the same class label with xi towards xi.
Therefore, the total average neighborhood margin can be defined as
γ=∑iγi=∑i(∑k:xk∈Nei||yi−yk||2|Nei|−∑j:xj∈Noi||yi−yj||2|Noi|)(18.26)
and the ANMM criterion is to maximize γ. By replacing yi=W⊤xi, [25] obtains the optimal W by performing eigenvalue decomposition of some discriminability matrix. Thus ANMM is a local and linear approach. The learned distance between xi and xj is the Euclidean distance between WTxi and WTxj. The authors in [25] also proposed a kernelized version of ANMM to handle nonlinear data called KANMM, thus KANMM is a local and nonlinear approach.
The last local supervised metric learning approach we want to introduce is the Large Margin Nearest Neighbor Classifier (LMNN) [26]. The goal of LMNN is similar to that of ANMM, i.e., to pull the data with the same labels closer while pushing data with different labels far apart. LMNN deploys a different margin formulation. Specifically, LMNN defines the pull energy term as
εpull=∑xj∈Noi||W⊤(xi−xj)||2(18.27)
which is the sum of pairwise distances between a data point xi and the data point in xi’s homogeneous neighborhood after projection. LMNN defines the push energy as
εpull=∑i∑xj∈Noi∑l(1−δil)[1+||W⊤(xi−xj)||2−||W⊤(xi−xl)||2]+(18.28)
where δil = 1 is the labels of xi and xl are the same, and δil = 0 otherwise. The intuition is to require that the data points from different classes should be at least separated from it by the distance 1. This formulation is very similar to the margin formulation in multiclass SVM [2] LMNN also pushes the data with different labels to at least distance 1 from its homogeneous neighborhood. The goal of LMNN is to minimize
ε = μεpull+(1−μ)εpush.(18.29)
The authors in [26] proposed a semi-definite programming technique to solve for M = WWT. Thus LMNN is a local and linear approach. The learned distance between xi and xj is the Euclidean distance between WTxi and WTxj.
Although supervised metric learning has successfully been applied in many applications, it is human labor intensive and time consuming to get the supervision information. Also most of the above mentioned approaches require eigenvalue decomposition or quadratic optimization. In this section we will briefly review two advanced topics in metric learning: semi-supervised approaches and online learning.
Semi-supervised approaches aim to learn a proper distance metric from the data where the supervision information is only available on a small portion of the data. Those algorithms utilize both data with and without supervision information in the learning process. Therefore one straightforward way one can think of to construct a semi-supervised algorithm is to deploy an objective in the form of λ1ℒ(D)+λ2U(D), where U(D) is constructed on the entire data, and L(D) is constructed on the data with supervision information only. Finally, we put some constraints on the learned distance metric to balance both parts.
Laplacian Regularized Metric Learning (LMRL) [9] is one semi-supervised distance metric learning approach. LRML adopts data smoothness term as the unsupervised term U(D);in terms of the supervised term, LRML chooses an ANMM type of objective as L(D). The optimization problem that LRML aims to solve is
minMt︸U(D)+γ1t2−γ2t3︸ℒ(D)(18.30)s.t.t1≤tM≽0
where the smoothness term is
t1=∑i,j||W⊤xi−W⊤xj||2ωij=tr(W⊤XLX⊤W)=tr(XLX⊤M)(18.31)
where M=WW⊤. The supervised terms consisting of compactness and scatterness are
t2=∑(xi,xj)∈M||W⊤xi−W⊤xj||2=tr [M ∑(xi,xj)∈M(xi−xj)(xi−xj)⊤](18.32)
t3=∑(xi,xj)∈C||W⊤xi−W⊤xj||2=tr [M ∑(xi,xj)∈C(xi−xj)(xi−xj)⊤](18.33)
where M and C are the sets of must-links and cannot-links, respectively.
[9] proposed a semi-definite programming approach for solving problem (18.30). LRML is a mixture of local (its unsupervised part) and global (its supervised part) approach, and it is linear. The learned distance is the Mahalanobis distance with precision matrix M.
Similarly, Constraint Margin Maximization (CMM) [22] selects the same supervised term as LRML, but a different PCA-type unsupervised term in its objective. Specifically, the optimization problem CMM aims to solve is
maxWt4︸U(D)−γ1t2−γ2t3︸ℒ(D)(18.34)s.t.W⊤W=I
where the unsupervised term is t4=tr(W⊤XΣX⊤W) is the PCA-like term. Note that before we apply CMM, all data points need to be centered, i.e., their mean should be subtracted from the data matrix. The intuition of CMM is to maximally unfold the data points in the projection space while at the same time satisfying those pairwise constraints. The authors in [22] showed that the optimal W can be obtained by eigenvalue decomposition to some matrix. Therefore CMM is a global and linear approach. Wang et. al. [22] also showed how to derive its kernelized version for handling nonlinear data. The learned distance between xi and xj is the Euclidean distance between W⊤xi and W⊤xj.
Most of the distance metric learning approaches involve expensive optimization procedures such as eigen-decomposition and semi-definite programming. One way to make those algorithms more efficient is the online learning [16] strategy, which incorporates the data points into the learning process in a sequential manner. More concretely, online learning updates the learned distance metric iteratively. At each iteration, only one or a small batch of data are involved. Another scenario where the online learning strategy can be naturally applied is to learn distance metrics for streaming data, where the data are coming in a streaming fashion so that the distance metric needs to be updated iteratively. Next we present two examples of online distance metric learning approaches.
Pseudo-Metric Online Learning (POLA) [17] falls into the category of supervised metric learning with pairwise constraints. More specifically, there is a must-link constraint set M and a cannot-link constraint set C. POLA assigns a label lij for each pair (xi,xj) in M or C, such that if (xi,xj)∈M, lij=1. if (xi,xj)∈C, lij=−1. Then it introduces a threshold b and constructs the following constraints
∀(xi,xj)∈M, lij=1⇒(xi−xj)⊤M(xi−xj)≤b−1(18.35)
∀(xi,xj)∈C, lij=−1⇒(xi−xj)⊤M(xi−xj)≥b−1(18.36)
which can be unified as
lij[b−(xi−xj)⊤M(xi−xj)]≥1(18.37)
Note that this formulation is similar to the constraint in standard SVM. Then the objective of POLA is the follow hinge loss
Jij(M,b)=maxC1ij&C2(0,lij[(xi−xj)⊤M(xi−xj)−b]+1)(18.38)
The two constraint sets are defined as
POLA operates in an iterative way: First POLA initializes M as a zero matrix, then at each step, it randomly picks one data pair from the constraint set (either M or C), and then does projections on Cij1 and C2 alternatively. By projecting M and b onto Cij1, POLA gets the updating rules for M and b as
where
By projecting (M, b) onto C2, POLA updates M as
where and are the smallest eigenvalue-eigenvector pair of . Therefore, POLA incorporates the data in constraint sets in a sequential manner.
Another technique we want to review here is the Online Information Theoretic Metric Learning (OITML) approach [3]. This method also falls into the category of supervised metric learning. It is the online version of the ITML approach we introduced in Section 18.3.5. Suppose at time t + 1, we need to randomly pick a pair of data from the constraint set, and minimize the following objective
where and is the logdet divergence. [3] showed that Mt+1 can be updated with the following rule
where
POLA and OITML are only two examples of online distance metric learning.
Till now we have introduced the state-of-the-art supervised distance metric learning algorithms of recent years. All of those supervised metric learning approaches we listed above require some pre-specified free parameters, and all of them involve some expensive computational procedures such as eigenvalue decomposition or semi-definite programming. One exception is the ITML approach, as it deploys a Bregman projection strategy that may make the solution relatively efficient. However, ITML is sensitive to the initial choice of M0, which makes it difficult to apply in the case when we do not have enough prior knowledge. In practice, according to the type of supervision information provided, we can select a proper supervised metric learning approach that can handle that supervision information. However, in many real world applications, it may be expensive and time consuming to get that supervision information. Next we survey semi-supervised approaches that can leverage both labeled and unlabeled information. For the future of distance metric learning research, we believe the following directions are promising.
[1] Yair Censor. Parallel Optimization: Theory, algorithms, and applications. Oxford University Press, 1997.
[2] Koby Crammer and Yoram Singer. On the algorithmic implementation of multiclass kernel-based vector machines. Journal of Machine Learning Research, 2:265–292, 2001.
[3] Jason V. Davis, Brian Kulis, Prateek Jain, Suvrit Sra, and Inderjit S. Dhillon. Information-theoretic metric learning. In Proceedings of International Conference on Machine Learning, pages 209–216, 2007.
[4] Richard O. Duda, Peter E. Hart, and David G. Stork. Pattern Classification (2nd Edition). Wiley-Interscience, 2nd edition, 2001.
[5] Charles Elkan. Bilinear models of affinity. Personal note, 2011.
[6] Keinosuke Fukunaga. Introduction to Statistical Pattern Recognition, Second Edition (Computer Science and Scientific Computing Series). Academic Press, 1990.
[7] Jacob Goldberger, Sam Roweis, Geoff Hinton, and Ruslan Salakhutdinov. Neighborhood component analysis. In Advances in Neural Information Processing Systems, 17:513–520, 2004.
[8] Y. Guo, S. Li, J. Yang, T. Shu, and L. Wu. A generalized foley-sammon transform based on generalized fisher discriminant criterion and its application to face recognition. Pattern Recognition Letters, 24(1–3): 147–158, 2003.
[9] Steven C. H. Hoi, Wei Liu, and Shih-Fu Chang. Semi-supervised distance metric learning for collaborative image retrieval. In Proceedings of IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pages 1–7, 2008.
[10] Yangqing Jia, Feiping Nie, and Changshui Zhang. Trace ratio problem revisited. IEEE Transactions on Neural Networks, 20(4):729–735, 2009.
[11] András Kocsor, Kornel Kovaács, and Csaba Szepesvári. Margin maximizing discriminant analysis. In Proceedings of European Conference on Machine Learning, volume 3201 of Lecture Notes in Computer Science, pages 227–238. Springer, 2004.
[12] Brian Kulis. Metric learning. Tutorial at International Conference on Machine Learning, 2010.
[13] Zhen Li, Liangliang Cao, Shiyu Chang, John R. Smith, and Thomas S. Huang. Beyond mahalanobis distance: Learning second-order discriminant function for people verification. In Prcoeedings of Computer Vision and Pattern Recognition Workshops (CVPRW), 2012 IEEE Computer Society Conference on, pages 45–50, 2012.
[14] S. Mika, G. Ratsch, J. Weston, B. Schölkopf, and K. R. Müllers. Fisher discriminant analysis with kernels. Neural Networks for Signal Processing IX, 1999. Proceedings of the 1999 IEEE Signal Processing Society Workshop, pages 41–48, 1999.
[15] Jagdish J. Modi. Parallel algorithms and matrix computation. Oxford University Press, 1989.
[16] Shai Shalev-Shwartz. Online Learning: Theory, Algorithms, and Applications. The Hebrew University of Jerusalem. Phd thesis, July 2007.
[17] Shai Shalev-Shwartz, Yoram Singer, and Andrew Y. Ng. Online and batch learning of pseudo-metrics. In Proceedings of International Conference on Machine Learning, pages 94–101, 2004.
[18] N. Shental, T. Hertz, D. Weinshall, and M. Pavel. Adjustment learning and relevant component analysis. In Proceedings ofEuropean Conference on Computer Vision, pages 776–790, 2002.
[19] Jimeng Sun, Daby Sow, Jianying Hu, and Shahram Ebadollahi. Localized supervised metric learning on temporal physiological data. In ICPR, pages 4149–4152, 2010.
[20] Ivor W. Tsang, Pak ming Cheung, and James T. Kwok. Kernel relevant component analysis for distance metric learning. In IEEE International Joint Conference on Neural Networks (IJCNN, pages 954–959, 2005.
[21] Vladimir N. Vapnik. The Nature of Statistical Learning Theory. Springer New York Inc., New York, NY, 1995.
[22] Fei Wang, Shouchun Chen, Changshui Zhang, and Tao Li. Semi-supervised metric learning by maximizing constraint margin, CIKM Conference, pages 1457–1458, 2008.
[23] Fei Wang, Jimeng Sun, and Shahram Ebadollahi. Integrating distance metrics learned from multiple experts and its application in patient similarity assessment. In SDM, 2011.
[24] Fei Wang, Jimeng Sun, Jianying Hu, and Shahram Ebadollahi. imet: Interactive metric learning in healthcare applications. In SDM, 2011.
[25] Fei Wang and Changshui Zhang. Feature extraction by maximizing the average neighborhood margin. In Proceedings of IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2007.
[26] Kilian Q. Weinberger, John Blitzer, and Lawrence K. Saul. Distance metric learning for large margin nearest neighbor classification. In Advances in Neural Information Processing Systems, 2005.
[27] Michael Werman, Ofir Pele, and Brian Kulis. Distance functions and metric learning. Tutorial at European Conference on Computer Vision, 2010.
[28] Eric P. Xing, Andrew Y. Ng, Michael I. Jordan, and Stuart Russell. Distance metric learning, with application to clustering with side-information. In Advances in Neural Information Processing Systems 15, 15:505–512, 2002.
[29] Liu Yang and Rong Jin. Distance metric learning: A comprehensive survey. Technical report, Department of Computer Science and Engineering, Michigan State University, 2006.
1http://en.wikipedia.org/wiki/Metric_(mathematics)
2http://en.wikipedia.org/wiki/Mahalanobis_distance
3http://en.wikipedia.org/wiki/Covariance_matrix
4In this paper two data vectors are considered to be similar if the Euclidean distance between them is small; two data tensors are considered to be similar if the Frobenius norm of their difference tensor is small.
3.133.134.17