Chapter 18

Distance Metric Learning for Data Classification

Fei Wang

IBM T. J. Watson Research Center
Yorktown Heights, NY
[email protected]

18.1 Introduction

Distance metric learning is a fundamental problem in data mining and knowledge discovery, and it is of key importance for many real world applications. For example, information retrieval utilizes the learned distance metric to measure the relevance between the candidate data and the query; clinical decision support uses the learned distance metric to measure pairwise patient similarity [19, 23, 24]; pattern recognition can use the learned distance metric to match the most similar patterns. In a broader sense, distance metric learning lies at the heart of many data classification problems. As long as a proper distance metric is learned, we can always adopt the k-Nearest Neighbor (kNN) classifier [4] to classify the data. In recent years, many studies have demonstrated [12, 27, 29], either theoretically or empirically, that learning a good distance metric can greatly improve the performance of data classification tasks.

In this survey, we give an overview of existing supervised distance metric learning approaches, point out their strengths and limitations, and present challenges and future research directions. We focus on supervised algorithms because they operate under the same setting as data classification. We categorize those algorithms along the following dimensions: linear versus nonlinear, local versus global, transductive versus inductive, and the computational technique involved.

In the rest of this chapter, we first introduce the definition of a distance metric in Section 18.2. Then we overview existing supervised metric learning algorithms in Section 18.3 and advanced topics in Section 18.4, followed by conclusions and discussions in Section 18.5. Table 18.1 summarizes the notations and symbols that will be used throughout this chapter.

Table 18.1

The Meanings of Various Symbols That Will Be Used Throughout This Chapter

Symbol    Meaning
n         number of data points
d         data dimensionality
x_i       the i-th data vector
X         data matrix
M         precision matrix of the generalized Mahalanobis distance
w_i       the i-th projection vector
W         projection matrix
N_i       the neighborhood of x_i
ϕ(·)      nonlinear mapping used in kernel methods
K         kernel matrix
L         Laplacian matrix

18.2 The Definition of Distance Metric Learning

Before describing the different types of distance metric learning algorithms, we first define the necessary notation and concepts for distance metric learning.

Throughout this chapter, we use $\chi$ to represent a set of data points. If $x, y, z \in \chi$ are data vectors with the same dimensionality, we call $D: \chi \times \chi \rightarrow \mathbb{R}$ a Distance Metric if it satisfies the following four properties:1

  • Nonnegativity: D(x, y) ≥ 0
  • Coincidence: D(x, y) = 0 if and only if x = y
  • Symmetry: D(x, y) = D(y, x)
  • Subadditivity: D(x, y) + D(y, z) ≥ D(x, z)

If we relax the coincidence condition to "if $x = y$ then $D(x, y) = 0$," then $D$ is called a Pseudo Metric. There are many well-known distance metrics, such as the Euclidean and Mahalanobis distances below:

  • Euclidean distance, which measures the distance between x and y by

    $D(x,y)=\sqrt{(x-y)^\top (x-y)}$.    (18.1)

  • Mahalanobis distance,2 which measures the distance between x and y by

    $D(x,y)=\sqrt{(x-y)^\top S (x-y)}$    (18.2)

where S is the inverse of the data covariance matrix (also referred to as the precision matrix).3

Most of the recent distance metric learning algorithms can be viewed as learning a generalized Mahalanobis distance defined as below:

Generalized Mahalanobis Distance (GMD). A GMD measures the distance between data vectors x and y by

$D(x,y)=\sqrt{(x-y)^\top M (x-y)}$    (18.3)

where M is some arbitrary Symmetric Positive Semi-Definite (SPSD) matrix.

The major goal of learning a GMD is to learn a proper M. As M is SPSD, we can decompose it with the eigenvalue decomposition as $M = U\Lambda U^\top$, where U is a matrix collecting all eigenvectors of M, and $\Lambda$ is a diagonal matrix with all eigenvalues of M on its diagonal. Let $W = U\Lambda^{1/2}$; then we have

$D(x,y)=\sqrt{(x-y)^\top WW^\top (x-y)}=\sqrt{(W^\top(x-y))^\top (W^\top(x-y))}=\sqrt{(\tilde{x}-\tilde{y})^\top(\tilde{x}-\tilde{y})}$    (18.4)

where $\tilde{x}=W^\top x$. Therefore a GMD is equivalent to the Euclidean distance in some projected space. Based on this observation, we define distance metric learning as follows:
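To make the equivalence in Equation (18.4) concrete, the following minimal NumPy sketch (our illustration, not code from the chapter; function and variable names are our own) checks numerically that a GMD with an SPSD matrix M equals the Euclidean distance after projecting with $W = U\Lambda^{1/2}$.

```python
# Verify Eq. (18.4): GMD under an SPSD matrix M equals the Euclidean
# distance between the projected points W^T x and W^T y.
import numpy as np

rng = np.random.default_rng(0)
d = 5
A = rng.standard_normal((d, d))
M = A @ A.T                          # an arbitrary SPSD matrix

x = rng.standard_normal(d)
y = rng.standard_normal(d)

# GMD computed directly from M
gmd = np.sqrt((x - y) @ M @ (x - y))

# Eigen-decomposition M = U diag(lam) U^T, projection W = U diag(sqrt(lam))
lam, U = np.linalg.eigh(M)
W = U @ np.diag(np.sqrt(np.clip(lam, 0, None)))

# Euclidean distance between W^T x and W^T y
euc = np.linalg.norm(W.T @ x - W.T @ y)

assert np.allclose(gmd, euc)
```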

Distance Metric Learning. The problem of learning a distance function D for a pair of data points x and y is to learn a mapping function f, such that f(x) and f(y) lie in a Euclidean space and $D(x,y)=\|f(x)-f(y)\|$, where $\|\cdot\|$ is the $\ell_2$ norm.

With this definition, we can also categorize whether a distance metric learning algorithm is linear or nonlinear based on whether the projection is linear or nonlinear.

18.3 Supervised Distance Metric Learning Algorithms

In this section we review the existing supervised distance metric learning algorithms, which learn distance metrics from both the data points and their supervision information. The supervision is usually given in one of two forms: (1) Data labels, which indicate the class that each training data point belongs to. The assumption is that the distance between data points with the same label should be smaller than the distance between data points with different labels. (2) Pairwise constraints, which indicate whether a pair of data points should belong to the same class (must-links) or not (cannot-links). Before going into the details of those algorithms, we first categorize these supervised methodologies into different types according to their characteristics, as shown in Table 18.2.

Table 18.2

Supervised Distance Metric Learning Algorithms

Method        Local/Global   Linear/Nonlinear   Computational Technique
NCA [7]       Local          Linear             GD
ANMM [25]     Local          Linear             ED
KANMM [25]    Local          Nonlinear          ED
LMNN [26]     Local          Linear             Other Optimization
KLMNN [26]    Local          Nonlinear          Other Optimization
LDA [6]       Global         Linear             ED
KLDA [14]     Global         Nonlinear          ED
LSI [28]      Global         Linear             GD
ITML [3]      Global         Linear             Other Optimization
MMDA [11]     Global         Linear             QP
KMMDA [11]    Global         Nonlinear          QP
RCA [18]      Global         Linear             ED
KRCA [20]     Global         Nonlinear          ED

Note: ED stands for Eigenvalue Decomposition, QP stands for Quadratic Programming, GD stands for Gradient Descent, and all other abbreviations can be found in the main text.

18.3.1 Linear Discriminant Analysis (LDA)

Linear Discriminant Analysis (LDA) [6] is one of the most popular supervised linear embedding methods. It seeks projection directions under which the data from different classes are well separated. More concretely, supposing that the data points belong to C different classes, LDA defines the compactness matrix and the scatterness matrix as

$C=\frac{1}{C}\sum_c \frac{1}{n_c}\sum_{x_i \in c}(x_i-\bar{x}_c)(x_i-\bar{x}_c)^\top$    (18.5)

$S=\frac{1}{C}\sum_c (\bar{x}_c-\bar{x})(\bar{x}_c-\bar{x})^\top$    (18.6)

where $n_c$ is the number of data points in class c, $\bar{x}_c$ is the mean of class c, and $\bar{x}$ is the mean of the entire data set.

The goal of LDA is to find a W that minimizes

$\min_{W^\top W=I}\ \frac{\operatorname{tr}(W^\top C W)}{\operatorname{tr}(W^\top S W)}$.    (18.7)

By expanding the numerator and the denominator of the above expression, we can observe that the numerator corresponds to the sum of distances from each data point to its class center after projection, and the denominator corresponds to the sum of distances from every class center to the center of the entire data set after projection. Therefore, minimizing the objective maximizes the between-class scatterness while minimizing the within-class scatterness after projection. Solving problem (18.7) directly is hard; this trace-ratio problem has been studied in [8, 10]. LDA is a linear and global method. The learned distance between $x_i$ and $x_j$ is the Euclidean distance between $W^\top x_i$ and $W^\top x_j$.
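The following minimal NumPy sketch (our illustration, not code from [6]; the function name is our own) builds the compactness and scatterness matrices of Equations (18.5)-(18.6) and uses the common ratio-trace relaxation, i.e., the top eigenvectors of $C^{-1}S$, rather than the exact trace-ratio solvers of [8, 10].

```python
# LDA sketch under the ratio-trace relaxation of problem (18.7).
import numpy as np

def lda_projection(X, labels, n_components):
    """X: (n, d) data matrix, labels: (n,) class labels."""
    classes = np.unique(labels)
    n_classes = len(classes)
    d = X.shape[1]
    x_bar = X.mean(axis=0)

    C_mat = np.zeros((d, d))   # within-class compactness, Eq. (18.5)
    S_mat = np.zeros((d, d))   # between-class scatterness, Eq. (18.6)
    for c in classes:
        Xc = X[labels == c]
        xc_bar = Xc.mean(axis=0)
        diff = Xc - xc_bar
        C_mat += diff.T @ diff / len(Xc)
        db = (xc_bar - x_bar)[:, None]
        S_mat += db @ db.T
    C_mat /= n_classes
    S_mat /= n_classes

    # Relaxation: eigenvectors of C^{-1} S with the largest eigenvalues
    vals, vecs = np.linalg.eig(np.linalg.pinv(C_mat) @ S_mat)
    order = np.argsort(-vals.real)
    W = vecs[:, order[:n_components]].real
    return W   # learned metric: Euclidean distance between W^T x_i and W^T x_j
```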

Kernelization of LDA: Similar to the case of PCA, we can extend LDA to the nonlinear case via the kernel trick, which yields Kernel Discriminant Analysis (KDA) [14]. After mapping the data into the feature space using $\phi$, we can compute the compactness and scatterness matrices as

$C^\phi=\frac{1}{C}\sum_c \frac{1}{n_c}\sum_{x_i \in c}(\phi(x_i)-\bar{\phi}_c)(\phi(x_i)-\bar{\phi}_c)^\top$    (18.8)

$S^\phi=\frac{1}{C}\sum_c (\bar{\phi}_c-\bar{\phi})(\bar{\phi}_c-\bar{\phi})^\top$.    (18.9)

Suppose the projection matrix we want to obtain in the feature space is $W^\phi$; then by the representer theorem

$W^\phi=\Phi\alpha$    (18.10)

where $\Phi=[\phi(x_1),\phi(x_2),\ldots,\phi(x_n)]$ and $\alpha$ is the coefficient matrix over all $\phi(x_i)$ for $1\le i\le n$. We define $K=\Phi^\top\Phi$ as the kernel matrix.

Then

$(W^\phi)^\top C^\phi W^\phi=\alpha^\top\left[\frac{1}{C}\sum_{c=1}^{C}\frac{1}{n_c}\sum_{x_i\in c}\Phi^\top(\phi(x_i)-\bar{\phi}_c)(\phi(x_i)-\bar{\phi}_c)^\top\Phi\right]\alpha=\alpha^\top\left[\frac{1}{C}\sum_{c=1}^{C}\frac{1}{n_c}\sum_{x_i\in c}(K_{\cdot i}-\bar{K}_{\cdot c})(K_{\cdot i}-\bar{K}_{\cdot c})^\top\right]\alpha=\alpha^\top M_C\alpha$    (18.11)

where $K_{\cdot i}=\Phi^\top\phi(x_i)$ is the i-th column of K and $\bar{K}_{\cdot c}=\frac{1}{n_c}\sum_{x_i\in c}K_{\cdot i}$. Similarly,

$(W^\phi)^\top S^\phi W^\phi=\alpha^\top\left[\frac{1}{C}\sum_c(\bar{K}_{\cdot c}-\bar{K}_{\cdot *})(\bar{K}_{\cdot c}-\bar{K}_{\cdot *})^\top\right]\alpha=\alpha^\top M_S\alpha$    (18.12)

where $\bar{K}_{\cdot *}=\frac{1}{n}\sum_{i=1}^{n}K_{\cdot i}$. Therefore we can get $\alpha$ by solving

$\min_{\alpha^\top\alpha=I}\ \frac{\operatorname{tr}(\alpha^\top M_C\alpha)}{\operatorname{tr}(\alpha^\top M_S\alpha)}$.    (18.13)
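As with the linear case, the construction of $M_C$ and $M_S$ can be sketched directly from a precomputed kernel matrix. The following is our own minimal NumPy illustration, again assuming the ratio-trace relaxation rather than the exact solver of [14]; the function name is our own.

```python
# KDA sketch: build M_C and M_S from the kernel matrix K and solve the
# relaxed problem with the top eigenvectors of M_C^{-1} M_S.
import numpy as np

def kda_coefficients(K, labels, n_components):
    """K: (n, n) kernel matrix, labels: (n,) class labels. Returns alpha."""
    n = K.shape[0]
    classes = np.unique(labels)
    K_bar_all = K.mean(axis=1)                # \bar{K}_{.*}

    M_C = np.zeros((n, n))
    M_S = np.zeros((n, n))
    for c in classes:
        idx = np.where(labels == c)[0]
        K_bar_c = K[:, idx].mean(axis=1)      # \bar{K}_{.c}
        for i in idx:
            diff = K[:, i] - K_bar_c
            M_C += np.outer(diff, diff) / len(idx)
        db = K_bar_c - K_bar_all
        M_S += np.outer(db, db)
    M_C /= len(classes)
    M_S /= len(classes)

    vals, vecs = np.linalg.eig(np.linalg.pinv(M_C) @ M_S)
    order = np.argsort(-vals.real)
    return vecs[:, order[:n_components]].real  # columns of alpha
```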

18.3.2 Margin Maximizing Discriminant Analysis (MMDA)

MMDA can be viewed as an extension of the Support Vector Machine (SVM) algorithm [21]. Suppose the data points $X=[x_1,x_2,\ldots,x_n]$ come from two classes, and the label of $x_i$ is $l_i\in\{-1,+1\}$. As a recap, the goal of SVM is to find the maximum-margin hyperplane that divides the points with $l_i=1$ from those with $l_i=-1$ (thus it is a supervised method). Any hyperplane can be written as the set of points x satisfying

$w^\top x-b=0$    (18.14)

where $b/\|w\|$ corresponds to the distance of the hyperplane from the origin. SVM chooses w and b so that the two parallel hyperplanes that still separate the data are as far apart as possible; their separation is usually referred to as the margin between the two classes. These hyperplanes can be described by the equations

$w^\top x-b=1\quad\text{or}\quad w^\top x-b=-1$.    (18.15)

The distance between these two parallel hyperplanes is $2/\|w\|$. If the data from the two classes are clearly separated, the goal of SVM is to solve the following optimization problem to find the hyperplane that maximizes the margin between the two classes:

$\min_{w,b}\ \frac{1}{2}\|w\|^2$    (18.16)
$\text{s.t.}\ l_i(w^\top x_i-b)\ge 1\quad(i=1,2,\ldots,n)$.

However, in reality the two classes may not be perfectly separable, i.e., there might be some overlap between them. In that case we need the soft-margin SVM, which solves

$\min_{w,b}\ \frac{1}{2}\|w\|^2+C\sum_{i=1}^{n}\xi_i$    (18.17)
$\text{s.t.}\ l_i(w^\top x_i-b)\ge 1-\xi_i\quad(i=1,2,\ldots,n)$

where $\xi_i\ge 0$ are slack variables that penalize margin violations in the overlapping region.

MMDA seeks more than one projection direction, which leads to the following optimization problem:

$\min_{W,b,\xi_r\ge 0}\ \frac{1}{2}\sum_{r=1}^{d}\|w_r\|^2+\frac{C}{n}\sum_{r=1}^{d}\sum_{i=1}^{n}\xi_{ri}$    (18.18)
$\text{s.t.}\ \forall\, i=1,\ldots,n,\ r=1,\ldots,d:\ l_i((w_r)^\top x_i+b)\ge 1-\xi_{ri},\quad W^\top W=I.$

Therefore MMDA is a global and linear approach. One can also apply the kernel trick to make it nonlinear; the details can be found in [11]. The learned distance between $x_i$ and $x_j$ is just the Euclidean distance between $W^\top x_i$ and $W^\top x_j$.

18.3.3 Learning with Side Information (LSI)

Both LDA and MMDA use data labels as the supervision information. As we introduced at the beginning of Section 18.3, the other type of supervision information we consider is pairwise constraints. Label information is stronger in the sense that we can convert data labels into pairwise constraints, but not vice versa.

One of the earliest works that makes use of pairwise constraints for learning a proper distance metric is the Learning with Side-Information (LSI) approach [28]. We denote the set of must-link constraints as $\mathcal{M}$ and the set of cannot-link constraints as $\mathcal{C}$; then the goal of LSI is to solve the following optimization problem

$\max_{M}\ \sum_{(x_i,x_j)\in\mathcal{C}}\sqrt{(x_i-x_j)^\top M(x_i-x_j)}$    (18.19)
$\text{s.t.}\ \sum_{(x_u,x_v)\in\mathcal{M}}(x_u-x_v)^\top M(x_u-x_v)\le 1,\quad M\succeq 0.$

This is a convex optimization problem, and [28] proposed an iterative projected gradient ascent method to solve it. As M is positive semi-definite, we can always factorize it as $M=WW^\top$. Thus LSI is a global and linear approach. The learned distance is exactly the generalized Mahalanobis distance with precision matrix M.
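To make the optimization concrete, here is a heavily simplified sketch of the projected gradient idea behind problem (18.19); the exact iterative projection scheme of [28] differs in its details, so treat this purely as an illustration (function name, step size, and iteration count are our own choices).

```python
# Simplified projected gradient ascent for the LSI objective (18.19).
import numpy as np

def lsi_metric(X, must_links, cannot_links, n_iter=200, lr=0.1):
    d = X.shape[1]
    M = np.eye(d)
    for _ in range(n_iter):
        # Gradient ascent step on the cannot-link objective
        G = np.zeros((d, d))
        for i, j in cannot_links:
            v = X[i] - X[j]
            dist = np.sqrt(max(v @ M @ v, 1e-12))
            G += np.outer(v, v) / (2.0 * dist)
        M += lr * G
        # Project back onto the PSD cone
        lam, U = np.linalg.eigh(M)
        M = U @ np.diag(np.clip(lam, 0, None)) @ U.T
        # Rescale so the must-link constraint (<= 1) holds
        s = sum((X[u] - X[v]) @ M @ (X[u] - X[v]) for u, v in must_links)
        if s > 1:
            M /= s
    return M
```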

18.3.4 Relevant Component Analysis (RCA)

Relevant Component Analysis (RCA) [18] is another representative distance metric learning algorithm utilizing pairwise constraints. The goal of RCA is to find a transformation that amplifies relevant variability and suppresses irrelevant variability. Data variability is considered correlated with a specific task if removing it from the data deteriorates (on average) the results of clustering or retrieval; variability is irrelevant if it is maintained in the data but is not correlated with the specific task [18]. RCA also relies on small clusters called chunklets, which are the connected components derived from the must-link constraints. The specific steps involved in RCA are:

  • Construct chunklets according to the equivalence (must-link) constraints, so that the data points within each chunklet are pairwise connected by must-link constraints.
  • Assume a total of p points in k chunklets, where chunklet j consists of points $\{x_{ji}\}_{i=1}^{n_j}$ and its mean is $\bar{m}_j$. RCA computes the following weighted within-chunklet covariance matrix:

    $C=\frac{1}{p}\sum_{j=1}^{k}\sum_{i=1}^{n_j}(x_{ji}-\bar{m}_j)(x_{ji}-\bar{m}_j)^\top$.    (18.20)

  • Compute the whitening transformation $W=C^{-1/2}$ and apply it to the original data points: $\tilde{x}=Wx$. Alternatively, use the inverse of C as the precision matrix of a generalized Mahalanobis distance.

Therefore, RCA is a global, linear approach.
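Because RCA is essentially a closed-form procedure, the steps above translate almost directly into code. The following minimal NumPy sketch is our own illustration (chunklets are assumed to be given as lists of row indices); it implements Equation (18.20) and the whitening transform.

```python
# RCA sketch: chunklet covariance (Eq. 18.20) followed by whitening W = C^{-1/2}.
import numpy as np

def rca_whitening(X, chunklets):
    d = X.shape[1]
    p = sum(len(ch) for ch in chunklets)
    C = np.zeros((d, d))
    for ch in chunklets:
        Xc = X[ch]
        diff = Xc - Xc.mean(axis=0)           # subtract the chunklet mean m_j
        C += diff.T @ diff
    C /= p                                     # Eq. (18.20)

    # Whitening transform W = C^{-1/2} via eigen-decomposition
    lam, U = np.linalg.eigh(C)
    W = U @ np.diag(1.0 / np.sqrt(np.clip(lam, 1e-12, None))) @ U.T
    return W                                   # x_tilde = W x (or use C^{-1} as precision matrix)
```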

18.3.5 Information Theoretic Metric Learning (ITML)

An information-theoretic objective is another mechanism for developing a supervised distance metric, and Information Theoretic Metric Learning (ITML) [3] is one such representative algorithm. Suppose we have an initial generalized Mahalanobis distance parameterized by the precision matrix $M_0$, a set $\mathcal{M}$ of must-link constraints, and a set $\mathcal{C}$ of cannot-link constraints. ITML solves the following optimization problem

$\min_{M\succeq 0}\ d_{\log\det}(M,M_0)$    (18.21)
$\text{s.t.}\ (x_i-x_j)^\top M(x_i-x_j)\ge l,\quad \forall (x_i,x_j)\in\mathcal{C}$
$\qquad (x_u-x_v)^\top M(x_u-x_v)\le u,\quad \forall (x_u,x_v)\in\mathcal{M}$

where

$d_{\log\det}(M,M_0)=\operatorname{tr}(MM_0^{-1})-\log\det(MM_0^{-1})-d$    (18.22)

where $d_{\log\det}$ is the LogDet divergence, which is also known as Stein's loss. It can be shown that Stein's loss is the unique scale-invariant loss function for which the uniform minimum variance unbiased estimator is also a minimum risk equivariant estimator [3]. The authors in [3] also proposed an efficient Bregman projection approach to solve problem (18.21). ITML is a global and linear approach. The learned distance metric is the generalized Mahalanobis distance with precision matrix M.

18.3.6 Neighborhood Component Analysis (NCA)

All the supervised approaches we introduced above are global methods. Next we introduce several representative local supervised metric learning algorithms, starting with Neighborhood Component Analysis (NCA) [7]. Similar to Stochastic Neighbor Embedding (SNE), an unsupervised embedding method, NCA lets each point $x_i$ select another point $x_j$ as its neighbor with some probability $p_{ij}$ and inherit its class label from the point it selects. NCA defines the probability that point i selects point j as a neighbor as

$p_{ij}=\frac{\exp(-\|W^\top x_i-W^\top x_j\|^2)}{\sum_{k\ne i}\exp(-\|W^\top x_i-W^\top x_k\|^2)}$.    (18.23)

Under this stochastic selection rule, NCA computes the probability that point i will be correctly classified

$p_i=\sum_{j\in\mathcal{C}_i}p_{ij}$    (18.24)

where $\mathcal{C}_i=\{j\,|\,l_i=l_j\}$ is the set of points in the same class as point i.

The objective NCA maximizes is the expected number of points correctly classified under this scheme:

$J(W)=\sum_i p_i=\sum_i\sum_{j\in\mathcal{C}_i}p_{ij}$.    (18.25)

[7] proposed a truncated gradient-based approach to maximize J(W). NCA is a local and linear approach. The learned distance between $x_i$ and $x_j$ is the Euclidean distance between $W^\top x_i$ and $W^\top x_j$.
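The following minimal NumPy sketch (our illustration; the function name is our own) evaluates the NCA objective of Equations (18.23)-(18.25) for a given W; the gradient-based optimization used in [7] is omitted.

```python
# Evaluate the NCA objective J(W) of Eqs. (18.23)-(18.25).
import numpy as np

def nca_objective(W, X, labels):
    """W: (d, k) projection, X: (n, d), labels: (n,). Returns J(W)."""
    Z = X @ W                                              # rows are W^T x_i
    sq = ((Z[:, None, :] - Z[None, :, :]) ** 2).sum(-1)    # pairwise squared distances
    np.fill_diagonal(sq, np.inf)                           # a point never picks itself
    P = np.exp(-sq)
    P /= P.sum(axis=1, keepdims=True)                      # Eq. (18.23)
    same = labels[:, None] == labels[None, :]
    p_i = (P * same).sum(axis=1)                           # Eq. (18.24)
    return p_i.sum()                                       # Eq. (18.25)
```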

18.3.7 Average Neighborhood Margin Maximization (ANMM)

Average Neighborhood Margin Maximization (ANMM) [25] is another local supervised metric learning approach, which aims to find projection directions where the local class discriminability is maximized. To define local discriminability, [25] first defines the following two types of neighborhoods:

Definition 1 (Homogeneous Neighborhood). For a data point $x_i$, its ξ nearest homogeneous neighborhood $N_i^o$ is the set of the ξ most similar4 data points that are in the same class as $x_i$.

Definition 2 (Heterogeneous Neighborhood). For a data point $x_i$, its ζ nearest heterogeneous neighborhood $N_i^e$ is the set of the ζ most similar data points that are not in the same class as $x_i$.

Then the average neighborhood margin $\gamma_i$ for $x_i$ is defined as

$\gamma_i=\sum_{k:x_k\in N_i^e}\frac{\|y_i-y_k\|^2}{|N_i^e|}-\sum_{j:x_j\in N_i^o}\frac{\|y_i-y_j\|^2}{|N_i^o|}$,

where $|\cdot|$ represents the cardinality of a set and $y_i$ is the projection of $x_i$. This margin measures the difference between the average distance from $x_i$ to the data points in its heterogeneous neighborhood and the average distance from it to the data points in its homogeneous neighborhood. Maximizing such a margin pushes the data points whose labels are different from $x_i$ away from $x_i$, while pulling the data points with the same class label as $x_i$ towards it.

Therefore, the total average neighborhood margin can be defined as

$\gamma=\sum_i\gamma_i=\sum_i\left(\sum_{k:x_k\in N_i^e}\frac{\|y_i-y_k\|^2}{|N_i^e|}-\sum_{j:x_j\in N_i^o}\frac{\|y_i-y_j\|^2}{|N_i^o|}\right)$    (18.26)

and the ANMM criterion is to maximize γ. By substituting $y_i=W^\top x_i$, [25] obtains the optimal W by performing an eigenvalue decomposition of a discriminability matrix. Thus ANMM is a local and linear approach. The learned distance between $x_i$ and $x_j$ is the Euclidean distance between $W^\top x_i$ and $W^\top x_j$. The authors in [25] also proposed a kernelized version of ANMM, called KANMM, to handle nonlinear data; thus KANMM is a local and nonlinear approach.
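A minimal sketch of ANMM is given below (our illustration; function name and parameters are our own). It builds the heterogeneous "scatterness" and homogeneous "compactness" matrices implied by Equation (18.26) and takes the top eigenvectors of their difference, which follows the eigenvalue-decomposition solution described above; the exact matrix construction in [25] may differ in details.

```python
# ANMM sketch: maximize tr(W^T (S - C) W) with W^T W = I.
import numpy as np

def anmm_projection(X, labels, xi=5, zeta=5, n_components=2):
    n, d = X.shape
    dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    np.fill_diagonal(dists, np.inf)
    S = np.zeros((d, d))   # heterogeneous-neighborhood scatterness
    C = np.zeros((d, d))   # homogeneous-neighborhood compactness
    for i in range(n):
        same = np.where((labels == labels[i]) & (np.arange(n) != i))[0]
        other = np.where(labels != labels[i])[0]
        homo = same[np.argsort(dists[i, same])[:xi]]     # N_i^o
        hete = other[np.argsort(dists[i, other])[:zeta]] # N_i^e
        for k in hete:
            v = X[i] - X[k]
            S += np.outer(v, v) / len(hete)
        for j in homo:
            v = X[i] - X[j]
            C += np.outer(v, v) / len(homo)
    lam, U = np.linalg.eigh(S - C)                       # top eigenvectors of S - C
    return U[:, np.argsort(-lam)[:n_components]]
```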

18.3.8 Large Margin Nearest Neighbor Classifier (LMNN)

The last local supervised metric learning approach we introduce is the Large Margin Nearest Neighbor classifier (LMNN) [26]. The goal of LMNN is similar to that of ANMM, i.e., to pull the data points with the same labels closer while pushing data points with different labels far apart, but LMNN deploys a different margin formulation. Specifically, LMNN defines the pull energy term as

$\varepsilon_{\text{pull}}=\sum_{x_j\in N_i^o}\|W^\top(x_i-x_j)\|^2$    (18.27)

which is the sum of the distances between a data point $x_i$ and the data points in $x_i$'s homogeneous neighborhood after projection. LMNN defines the push energy as

$\varepsilon_{\text{push}}=\sum_i\sum_{x_j\in N_i^o}\sum_l(1-\delta_{il})\left[1+\|W^\top(x_i-x_j)\|^2-\|W^\top(x_i-x_l)\|^2\right]_+$    (18.28)

where $\delta_{il}=1$ if the labels of $x_i$ and $x_l$ are the same, $\delta_{il}=0$ otherwise, and $[\cdot]_+$ denotes the hinge function $\max(0,\cdot)$. The intuition is that data points from different classes should be pushed at least a unit distance away from $x_i$'s homogeneous neighborhood; this formulation is very similar to the margin formulation in multiclass SVM [2]. The goal of LMNN is to minimize

$\varepsilon=\mu\varepsilon_{\text{pull}}+(1-\mu)\varepsilon_{\text{push}}$.    (18.29)

The authors in [26] proposed a semi-definite programming technique to solve for $M=WW^\top$. Thus LMNN is a local and linear approach. The learned distance between $x_i$ and $x_j$ is the Euclidean distance between $W^\top x_i$ and $W^\top x_j$.
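For illustration, the sketch below (ours, not from [26]; function name and inputs are our own) evaluates the LMNN objective of Equations (18.27)-(18.29) for a fixed projection W and given target neighborhoods; the semi-definite program that actually learns $M = WW^\top$ is not reproduced.

```python
# Evaluate the LMNN objective (Eqs. 18.27-18.29) for a fixed projection W.
import numpy as np

def lmnn_loss(W, X, labels, target_neighbors, mu=0.5):
    """target_neighbors[i]: indices of x_i's homogeneous neighborhood N_i^o."""
    Z = X @ W
    pull, push = 0.0, 0.0
    n = X.shape[0]
    for i in range(n):
        for j in target_neighbors[i]:
            d_ij = ((Z[i] - Z[j]) ** 2).sum()
            pull += d_ij                                     # Eq. (18.27)
            for l in range(n):
                if labels[l] != labels[i]:                   # (1 - delta_il) term
                    d_il = ((Z[i] - Z[l]) ** 2).sum()
                    push += max(0.0, 1.0 + d_ij - d_il)      # hinge [.]_+ in Eq. (18.28)
    return mu * pull + (1 - mu) * push                       # Eq. (18.29)
```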

18.4 Advanced Topics

Although supervised metric learning has been successfully applied in many applications, obtaining the supervision information is labor intensive and time consuming. In addition, most of the above-mentioned approaches require eigenvalue decomposition or quadratic optimization. In this section we briefly review two advanced topics in metric learning that address these issues: semi-supervised approaches and online learning.

18.4.1 Semi-Supervised Metric Learning

Semi-supervised approaches aim to learn a proper distance metric from data where the supervision information is only available on a small portion of the data points. These algorithms utilize both the data with and without supervision information in the learning process. Therefore, one straightforward way to construct a semi-supervised algorithm is to deploy an objective of the form $\lambda_1 L(D)+\lambda_2 U(D)$, where $U(D)$ is constructed on the entire data set and $L(D)$ is constructed only on the data with supervision information, with some constraints on the learned distance metric to balance both parts.

18.4.1.1 Laplacian Regularized Metric Learning (LRML)

Laplacian Regularized Metric Learning (LRML) [9] is one such semi-supervised distance metric learning approach. LRML adopts a data smoothness term as the unsupervised term $U(D)$; for the supervised term $L(D)$, LRML chooses an ANMM-type objective. The optimization problem that LRML aims to solve is

$\min_{M,t}\ \underbrace{t}_{U(D)}+\underbrace{\gamma_1 t_2-\gamma_2 t_3}_{L(D)}$    (18.30)
$\text{s.t.}\ t_1\le t,\quad M\succeq 0$

where the smoothness term is

$t_1=\sum_{i,j}\|W^\top x_i-W^\top x_j\|^2\omega_{ij}=\operatorname{tr}(W^\top XLX^\top W)=\operatorname{tr}(XLX^\top M)$    (18.31)

with $M=WW^\top$, $\omega_{ij}$ the similarity between $x_i$ and $x_j$, and L the graph Laplacian matrix. The supervised terms, consisting of compactness and scatterness, are

$t_2=\sum_{(x_i,x_j)\in\mathcal{M}}\|W^\top x_i-W^\top x_j\|^2=\operatorname{tr}\left[M\sum_{(x_i,x_j)\in\mathcal{M}}(x_i-x_j)(x_i-x_j)^\top\right]$    (18.32)

$t_3=\sum_{(x_i,x_j)\in\mathcal{C}}\|W^\top x_i-W^\top x_j\|^2=\operatorname{tr}\left[M\sum_{(x_i,x_j)\in\mathcal{C}}(x_i-x_j)(x_i-x_j)^\top\right]$    (18.33)

where $\mathcal{M}$ and $\mathcal{C}$ are the sets of must-links and cannot-links, respectively.

[9] proposed a semi-definite programming approach for solving problem (18.30). LRML is a mixture of a local (its unsupervised part) and a global (its supervised part) approach, and it is linear. The learned distance is the generalized Mahalanobis distance with precision matrix M.
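As an illustration of how the three terms are assembled, the sketch below (ours; function name and inputs are our own, and the similarity weights $\omega_{ij}$ are assumed to be given) builds the matrices behind $t_1$, $t_2$, and $t_3$; solving the semi-definite program (18.30) itself, as done in [9], is omitted.

```python
# Assemble the matrices underlying t1, t2, t3 of Eqs. (18.31)-(18.33).
import numpy as np

def lrml_terms(X, weights, must_links, cannot_links):
    """X: (n, d) data matrix (rows are points); weights: (n, n) similarities omega_ij."""
    d = X.shape[1]
    Deg = np.diag(weights.sum(axis=1))
    L = Deg - weights                                 # graph Laplacian
    smooth = X.T @ L @ X                              # X L X^T in the chapter's column-wise notation
    compact = sum((np.outer(X[i] - X[j], X[i] - X[j]) for i, j in must_links),
                  np.zeros((d, d)))
    scatter = sum((np.outer(X[i] - X[j], X[i] - X[j]) for i, j in cannot_links),
                  np.zeros((d, d)))
    # t1 = tr(smooth @ M), t2 = tr(compact @ M), t3 = tr(scatter @ M)
    return smooth, compact, scatter
```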

18.4.1.2 Constraint Margin Maximization (CMM)

Similarly, Constraint Margin Maximization (CMM) [22] adopts the same supervised term as LRML, but a different, PCA-type unsupervised term in its objective. Specifically, the optimization problem CMM aims to solve is

$\max_{W}\ \underbrace{t_4}_{U(D)}-\gamma_1 t_2+\gamma_2 t_3$    (18.34)
$\text{s.t.}\ W^\top W=I$

where the unsupervised term $t_4=\operatorname{tr}(W^\top XX^\top W)$ is the PCA-like term. Note that before applying CMM, all data points need to be centered, i.e., the data mean should be subtracted from every data point. The intuition of CMM is to maximally unfold the data points in the projection space while at the same time satisfying the pairwise constraints. The authors in [22] showed that the optimal W can be obtained by an eigenvalue decomposition of a certain matrix; therefore CMM is a global and linear approach. Wang et al. [22] also showed how to derive a kernelized version for handling nonlinear data. The learned distance between $x_i$ and $x_j$ is the Euclidean distance between $W^\top x_i$ and $W^\top x_j$.
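Under the assumption (stated above) that the objective reduces to $\operatorname{tr}\!\left(W^\top(XX^\top-\gamma_1 S_{\mathcal{M}}+\gamma_2 S_{\mathcal{C}})W\right)$ with $W^\top W=I$, where $S_{\mathcal{M}}$ and $S_{\mathcal{C}}$ are the constraint matrices from Equations (18.32)-(18.33), the optimal W consists of top eigenvectors. The following minimal sketch (ours; names and defaults are our own) computes it.

```python
# CMM sketch: top eigenvectors of X X^T - gamma1 * S_M + gamma2 * S_C (centered data).
import numpy as np

def cmm_projection(X, must_links, cannot_links, gamma1=1.0, gamma2=1.0, n_components=2):
    """X: (n, d) with rows already centered."""
    d = X.shape[1]
    S_M = sum((np.outer(X[i] - X[j], X[i] - X[j]) for i, j in must_links),
              np.zeros((d, d)))
    S_C = sum((np.outer(X[i] - X[j], X[i] - X[j]) for i, j in cannot_links),
              np.zeros((d, d)))
    A = X.T @ X - gamma1 * S_M + gamma2 * S_C
    lam, U = np.linalg.eigh(A)
    return U[:, np.argsort(-lam)[:n_components]]
```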

18.4.2 Online Learning

Most distance metric learning approaches involve expensive optimization procedures such as eigenvalue decomposition and semi-definite programming. One way to make those algorithms more efficient is the online learning strategy [16], which incorporates the data points into the learning process in a sequential manner. More concretely, online learning updates the learned distance metric iteratively, and at each iteration only one data point or a small batch of data points is involved. Another scenario where the online learning strategy applies naturally is learning distance metrics for streaming data, where the data arrive in a streaming fashion and the distance metric needs to be updated continuously. Next we present two examples of online distance metric learning approaches.

18.4.2.1 Pseudo-Metric Online Learning Algorithm (POLA)

The Pseudo-metric Online Learning Algorithm (POLA) [17] falls into the category of supervised metric learning with pairwise constraints. More specifically, there is a must-link constraint set $\mathcal{M}$ and a cannot-link constraint set $\mathcal{C}$. POLA assigns a label $l_{ij}$ to each pair $(x_i,x_j)$ in $\mathcal{M}$ or $\mathcal{C}$, such that $l_{ij}=1$ if $(x_i,x_j)\in\mathcal{M}$ and $l_{ij}=-1$ if $(x_i,x_j)\in\mathcal{C}$. Then it introduces a threshold b and constructs the following constraints

$\forall(x_i,x_j)\in\mathcal{M},\ l_{ij}=1:\quad (x_i-x_j)^\top M(x_i-x_j)\le b-1$    (18.35)

$\forall(x_i,x_j)\in\mathcal{C},\ l_{ij}=-1:\quad (x_i-x_j)^\top M(x_i-x_j)\ge b+1$    (18.36)

which can be unified as

$l_{ij}\left[b-(x_i-x_j)^\top M(x_i-x_j)\right]\ge 1$    (18.37)

Note that this formulation is similar to the constraint in a standard SVM. The objective of POLA is then the following hinge loss

$J_{ij}(M,b)=\max\left\{0,\ l_{ij}\left[(x_i-x_j)^\top M(x_i-x_j)-b\right]+1\right\}$    (18.38)

The two constraint sets are defined as

$C_1^{ij}=\left\{(M,b)\in\mathbb{R}^{n^2+1}:\ J_{ij}(M,b)=0\right\}$    (18.39)

$C_2=\left\{(M,b)\in\mathbb{R}^{n^2+1}:\ M\succeq 0,\ b\ge 1\right\}$    (18.40)

POLA operates in an iterative way: it first initializes M as the zero matrix; then at each step it randomly picks one data pair from the constraint sets (either $\mathcal{M}$ or $\mathcal{C}$) and projects onto $C_1^{ij}$ and $C_2$ alternately. By projecting (M, b) onto $C_1^{ij}$, POLA obtains the updating rules for M and b as

$\hat{M}=M-l_{ij}\alpha_{ij}u_{ij}u_{ij}^\top$    (18.41)

$\hat{b}=b+\alpha_{ij}l_{ij}$    (18.42)

where

$u_{ij}=x_i-x_j$    (18.43)

$\alpha_{ij}=\frac{J_{ij}(M,b)}{\|u_{ij}\|_2^4+1}$    (18.44)

By projecting (M, b) onto C2, POLA updates M as

$\hat{M}=M-\lambda\mu\mu^\top$    (18.45)

where $\lambda=\min\{\tilde{\lambda},0\}$ and $(\tilde{\lambda},\mu)$ is the smallest eigenvalue-eigenvector pair of $\hat{M}$. In this way, POLA incorporates the data in the constraint sets in a sequential manner.
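A single POLA iteration is simple enough to sketch directly. The code below (our illustration; the function name is our own) applies the updates (18.41)-(18.45) for one constraint pair; the clipping of b reflects the definition of $C_2$.

```python
# One POLA iteration for a single constraint pair (Eqs. 18.38-18.45).
import numpy as np

def pola_step(M, b, x_i, x_j, l_ij):
    u = x_i - x_j                                    # Eq. (18.43)
    loss = max(0.0, l_ij * (u @ M @ u - b) + 1.0)    # hinge loss, Eq. (18.38)
    if loss > 0:
        alpha = loss / (np.dot(u, u) ** 2 + 1.0)     # Eq. (18.44), ||u||^4 + 1
        M = M - l_ij * alpha * np.outer(u, u)        # Eq. (18.41)
        b = b + alpha * l_ij                         # Eq. (18.42)
    # Projection onto C_2: remove the negative part of the smallest eigenvalue of M
    lam, U = np.linalg.eigh(M)
    lam_neg = min(lam[0], 0.0)
    M = M - lam_neg * np.outer(U[:, 0], U[:, 0])     # Eq. (18.45)
    b = max(b, 1.0)                                  # C_2 also requires b >= 1
    return M, b
```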

18.4.2.2 Online Information Theoretic Metric Learning (OITML)

Another technique we review here is the Online Information Theoretic Metric Learning (OITML) approach [3]. This method also falls into the category of supervised metric learning; it is the online version of the ITML approach introduced in Section 18.3.5. Suppose at time t + 1 we randomly pick a pair of data points $(x_i,x_j)$ from the constraint set and minimize the following objective

$M_{t+1}=\arg\min_M R(M,M_t)+\eta_t\,\ell(D_t,\hat{D}_t)$    (18.46)

where $D_t$ is the target distance for the chosen pair, $\hat{D}_t=(x_i-x_j)^\top M(x_i-x_j)$ is the distance predicted under the current metric, $\ell(\cdot,\cdot)$ is a loss function, and $R(M,M_t)=d_{\log\det}(M,M_t)$ is the LogDet divergence. [3] showed that $M_{t+1}$ can be updated with the following rule

$M_{t+1}\leftarrow M_t-\frac{2\eta_t(D_t-\hat{D}_t)M_t(x_i-x_j)(x_i-x_j)^\top M_t}{1+2\eta_t(D_t-\hat{D}_t)(x_i-x_j)^\top M_t(x_i-x_j)}$    (18.47)

where

$\eta_t=\begin{cases}\eta_0, & D_t-\hat{D}_t\le 0\\ \min\left\{\eta_0,\ \dfrac{1}{2(D_t-\hat{D}_t)}\left(\dfrac{1}{(x_i-x_j)^\top\left(I+(M_t^{-1}-I)\right)^{-1}(x_i-x_j)}\right)\right\}, & \text{otherwise}\end{cases}$    (18.48)
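The update (18.47) is likewise easy to sketch. The code below (our illustration; the function name is our own) applies it with a fixed step size $\eta_0$; the adaptive step-size rule (18.48) is omitted for simplicity.

```python
# One OITML update following Eq. (18.47), with a fixed step size eta.
import numpy as np

def oitml_step(M_t, x_i, x_j, D_t, eta=0.1):
    """D_t: target distance for the pair; returns M_{t+1}."""
    z = x_i - x_j
    D_hat = z @ M_t @ z                       # current distance under M_t
    Mz = M_t @ z                              # M_t is symmetric, so M_t z z^T M_t = Mz Mz^T
    coeff = 2.0 * eta * (D_t - D_hat)
    return M_t - (coeff * np.outer(Mz, Mz)) / (1.0 + coeff * (z @ Mz))
```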

POLA and OITML are only two examples of online distance metric learning.

18.5 Conclusions and Discussions

So far we have introduced the state-of-the-art supervised distance metric learning algorithms of recent years. All of the supervised metric learning approaches listed above require some pre-specified free parameters, and most of them involve expensive computational procedures such as eigenvalue decomposition or semi-definite programming. One exception is the ITML approach, which deploys a Bregman projection strategy that makes the solution relatively efficient. However, ITML is sensitive to the initial choice of $M_0$, which makes it difficult to apply when we do not have enough prior knowledge. In practice, according to the type of supervision information provided, we can select a proper supervised metric learning approach that can handle that information. However, in many real world applications it may be expensive and time consuming to obtain such supervision information, which is the motivation for the semi-supervised approaches surveyed in Section 18.4.1 that leverage both labeled and unlabeled data. For the future of distance metric learning research, we believe the following directions are promising.

  • Large-scale distance metric learning. Most existing distance metric learning approaches involve computationally expensive procedures. How can we make distance metric learning efficient and practical on large-scale data? Promising solutions include online learning and distributed learning. We have introduced recent works on online distance metric learning in Section 18.4.2. For parallel/distributed distance metric learning algorithms, as the major computational techniques involved are eigenvalue decomposition and quadratic programming, we can adopt parallel matrix computation/optimization algorithms [1, 15] to make distance metric learning more scalable and efficient.
  • Empirical evaluations. Although many distance metric learning algorithms have been proposed, there is still a lack of systematic comparisons and evidence of the utility of many of them in real world applications. Such empirical studies would help showcase the practical value of distance metric learning algorithms. Some recent works have started developing and applying distance metric learning in healthcare for measuring similarity among patients [19, 23, 24].
  • General formulation. As can be seen from this survey, most existing distance metric learning algorithms assume the learned distance metric is Euclidean in the projected space. Such an assumption may not be sufficient for real world applications, as there is no guarantee that the Euclidean distance is the most appropriate way to describe pairwise data relationships. There have already been some initial efforts in this direction [5, 13], and it is definitely worth exploring further.

Bibliography

[1] Yair Censor. Parallel Optimization: Theory, algorithms, and applications. Oxford University Press, 1997.

[2] Koby Crammer and Yoram Singer. On the algorithmic implementation of multiclass kernel-based vector machines. Journal of Machine Learning Research, 2:265–292, 2001.

[3] Jason V. Davis, Brian Kulis, Prateek Jain, Suvrit Sra, and Inderjit S. Dhillon. Information-theoretic metric learning. In Proceedings of International Conference on Machine Learning, pages 209–216, 2007.

[4] Richard O. Duda, Peter E. Hart, and David G. Stork. Pattern Classification (2nd Edition). Wiley-Interscience, 2nd edition, 2001.

[5] Charles Elkan. Bilinear models of affinity. Personal note, 2011.

[6] Keinosuke Fukunaga. Introduction to Statistical Pattern Recognition, Second Edition (Computer Science and Scientific Computing Series). Academic Press, 1990.

[7] Jacob Goldberger, Sam Roweis, Geoff Hinton, and Ruslan Salakhutdinov. Neighborhood component analysis. Advances in Neural Information Processing Systems, 17:513–520, 2004.

[8] Y. Guo, S. Li, J. Yang, T. Shu, and L. Wu. A generalized foley-sammon transform based on generalized fisher discriminant criterion and its application to face recognition. Pattern Recognition Letters, 24(1–3): 147–158, 2003.

[9] Steven C. H. Hoi, Wei Liu, and Shih-Fu Chang. Semi-supervised distance metric learning for collaborative image retrieval. In Proceedings of IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pages 1–7, 2008.

[10] Yangqing Jia, Feiping Nie, and Changshui Zhang. Trace ratio problem revisited. IEEE Transactions on Neural Networks, 20(4):729–735, 2009.

[11] András Kocsor, Kornél Kovács, and Csaba Szepesvári. Margin maximizing discriminant analysis. In Proceedings of the European Conference on Machine Learning, volume 3201 of Lecture Notes in Computer Science, pages 227–238. Springer, 2004.

[12] Brian Kulis. Metric learning. Tutorial at International Conference on Machine Learning, 2010.

[13] Zhen Li, Liangliang Cao, Shiyu Chang, John R. Smith, and Thomas S. Huang. Beyond Mahalanobis distance: Learning second-order discriminant function for people verification. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), pages 45–50, 2012.

[14] S. Mika, G. Rätsch, J. Weston, B. Schölkopf, and K.-R. Müller. Fisher discriminant analysis with kernels. In Neural Networks for Signal Processing IX: Proceedings of the 1999 IEEE Signal Processing Society Workshop, pages 41–48, 1999.

[15] Jagdish J. Modi. Parallel algorithms and matrix computation. Oxford University Press, 1989.

[16] Shai Shalev-Shwartz. Online Learning: Theory, Algorithms, and Applications. PhD thesis, The Hebrew University of Jerusalem, July 2007.

[17] Shai Shalev-Shwartz, Yoram Singer, and Andrew Y. Ng. Online and batch learning of pseudo-metrics. In Proceedings of International Conference on Machine Learning, pages 94–101, 2004.

[18] N. Shental, T. Hertz, D. Weinshall, and M. Pavel. Adjustment learning and relevant component analysis. In Proceedings of the European Conference on Computer Vision, pages 776–790, 2002.

[19] Jimeng Sun, Daby Sow, Jianying Hu, and Shahram Ebadollahi. Localized supervised metric learning on temporal physiological data. In ICPR, pages 4149–4152, 2010.

[20] Ivor W. Tsang, Pak-Ming Cheung, and James T. Kwok. Kernel relevant component analysis for distance metric learning. In Proceedings of the IEEE International Joint Conference on Neural Networks (IJCNN), pages 954–959, 2005.

[21] Vladimir N. Vapnik. The Nature of Statistical Learning Theory. Springer New York Inc., New York, NY, 1995.

[22] Fei Wang, Shouchun Chen, Changshui Zhang, and Tao Li. Semi-supervised metric learning by maximizing constraint margin. In Proceedings of the ACM Conference on Information and Knowledge Management (CIKM), pages 1457–1458, 2008.

[23] Fei Wang, Jimeng Sun, and Shahram Ebadollahi. Integrating distance metrics learned from multiple experts and its application in patient similarity assessment. In SDM, 2011.

[24] Fei Wang, Jimeng Sun, Jianying Hu, and Shahram Ebadollahi. iMet: Interactive metric learning in healthcare applications. In SDM, 2011.

[25] Fei Wang and Changshui Zhang. Feature extraction by maximizing the average neighborhood margin. In Proceedings of IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2007.

[26] Kilian Q. Weinberger, John Blitzer, and Lawrence K. Saul. Distance metric learning for large margin nearest neighbor classification. In Advances in Neural Information Processing Systems, 2005.

[27] Michael Werman, Ofir Pele, and Brian Kulis. Distance functions and metric learning. Tutorial at European Conference on Computer Vision, 2010.

[28] Eric P. Xing, Andrew Y. Ng, Michael I. Jordan, and Stuart Russell. Distance metric learning, with application to clustering with side-information. Advances in Neural Information Processing Systems, 15:505–512, 2002.

[29] Liu Yang and Rong Jin. Distance metric learning: A comprehensive survey. Technical report, Department of Computer Science and Engineering, Michigan State University, 2006.

1http://en.wikipedia.org/wiki/Metric_(mathematics)

2http://en.wikipedia.org/wiki/Mahalanobis_distance

3http://en.wikipedia.org/wiki/Covariance_matrix

4In this chapter, two data vectors are considered to be similar if the Euclidean distance between them is small; two data tensors are considered to be similar if the Frobenius norm of their difference tensor is small.
