Zhangyang Wang⁎; Ding Liu† ⁎Department of Computer Science and Engineering, Texas A&M University, College Station, TX, United States
†Beckman Institute for Advanced Science and Technology, Urbana, IL, United States
Deep learning has achieved prevailing success across a wide range of machine learning and computer vision fields. On the other hand, sparsity and low-rankness have been popular regularizations in classical machine learning. This chapter is intended as a brief introduction to the basics of deep learning, and then focuses on its inherent connections to the concepts of sparsity and low-rankness.
Sparsity; Low rank; Deep learning
Machine learning makes computers learn from data without being explicitly programmed. However, classical machine learning algorithms often find it challenging to extract semantic features directly from raw data, e.g., due to the well-known “semantic gap” [1], and therefore rely on domain experts to hand-craft well-engineered feature representations on which the models operate more effectively. In contrast, the recently popular deep learning relies on multilayer neural networks to derive semantically meaningful representations, by composing multiple simple features into a sophisticated concept, and thus requires far less hand-engineering of features and expert knowledge. Taking image classification as an example [2], a deep learning-based image classification system represents an object by gradually extracting edges, textures, and structures from lower- to middle-level hidden layers, and the representation becomes more and more associated with the target semantic concept as the model grows deeper. Driven by the emergence of big data and hardware acceleration, deep learning can extract increasingly abstract, high-level representations from raw inputs, gaining the power to solve complicated, even traditionally intractable, problems. Deep learning has achieved tremendous success in visual object recognition [2–5], face recognition and verification [6,7], object detection [8–11], image restoration and enhancement [12–17], clustering [18], emotion recognition [19], aesthetics and style recognition [20–23], scene understanding [24,25], speech recognition [26], machine translation [27], image synthesis [28], and even playing Go [29] and poker [30].
A basic neural network is composed of a set of perceptrons (artificial neurons), each of which maps inputs to output values with a simple activation function. Among recent deep neural network architectures, convolutional neural networks (CNNs) and recurrent neural networks (RNNs) are the two main streams, differing in their connectivity patterns. CNNs deploy convolution operations on hidden layers for weight sharing and parameter reduction. CNNs can extract local information from grid-like input data, and have mainly shown successes in computer vision and image processing, with many popular instances such as LeNet [31], AlexNet [2], VGG [32], GoogLeNet [33], and ResNet [34]. RNNs are dedicated to processing sequential input data of variable length. An RNN produces an output at each time step; the hidden neurons at each time step are computed from the current input and the hidden neurons of the previous time step. To avoid vanishing/exploding gradients when RNNs model long-term dependencies, long short-term memory (LSTM) [35] and gated recurrent units (GRUs) [36] with controllable gates are widely used in practical applications. Interested readers are referred to a comprehensive deep learning textbook [37].
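As a minimal, purely illustrative sketch of the recurrence just described (the names, dimensions, and random initialization below are hypothetical and not taken from any specific model), a vanilla RNN step can be written in a few lines of NumPy:

```python
import numpy as np

def rnn_step(x_t, h_prev, W_xh, W_hh, b_h):
    """One vanilla RNN step: the new hidden state depends on the
    current input x_t and the previous hidden state h_prev."""
    return np.tanh(x_t @ W_xh + h_prev @ W_hh + b_h)

# Toy dimensions: input size 4, hidden size 3 (illustrative only).
rng = np.random.default_rng(0)
W_xh = rng.normal(size=(4, 3))
W_hh = rng.normal(size=(3, 3))
b_h = np.zeros(3)

h = np.zeros(3)
for x_t in rng.normal(size=(5, 4)):   # a length-5 input sequence
    h = rnn_step(x_t, h, W_xh, W_hh, b_h)
```

LSTM and GRU cells replace the single `tanh` update above with gated updates, which is what mitigates the vanishing/exploding-gradient issue over long sequences.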
In signal processing, the classical way to represent a multidimensional signal is to express it as a linear combination of the components in a basis, which is either chosen in advance or learned from data. The goal of linearly transforming a signal with respect to a basis is to obtain a more predictable pattern in the resultant linear coefficients. With an appropriate basis, such coefficients often exhibit desired characteristics of the signals. One important observation is that, for most natural signals such as images and audio, most of the coefficients are zero or close to zero if the basis is properly selected: the technique is usually termed sparse coding, and the basis is called the dictionary [38]. A sparse prior can have many interpretations in various contexts, such as smoothness, feature selection, etc. Ensured by theories from compressive sensing [39], under certain favorable conditions the sparse solutions can be reliably obtained by minimizing the convex $\ell_1$-norm, instead of the more straightforward but intractable $\ell_0$-norm. Beyond the element-wise sparsity model, more elaborate structured sparse models have also been developed [40,41]. Learning the basis (dictionary) from data further boosts the power of sparse coding [42–44].
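For concreteness, below is a minimal sketch of $\ell_1$-regularized sparse coding solved by the classical iterative shrinkage-thresholding algorithm (ISTA); the dictionary, dimensions, and parameter values are illustrative assumptions, not quantities from the text.

```python
import numpy as np

def soft_threshold(u, tau):
    """Element-wise soft shrinkage: sign(u) * max(|u| - tau, 0)."""
    return np.sign(u) * np.maximum(np.abs(u) - tau, 0.0)

def ista(x, D, lam=0.1, n_iter=500):
    """Solve min_a 0.5*||x - D a||_2^2 + lam*||a||_1 with ISTA."""
    L = np.linalg.norm(D, 2) ** 2          # Lipschitz constant of the gradient
    a = np.zeros(D.shape[1])
    for _ in range(n_iter):
        grad = D.T @ (D @ a - x)
        a = soft_threshold(a - grad / L, lam / L)
    return a

# Toy example: a random overcomplete dictionary and a sparse ground truth.
rng = np.random.default_rng(0)
D = rng.normal(size=(20, 50)) / np.sqrt(20)
a_true = np.zeros(50)
a_true[[3, 17, 41]] = [1.0, -0.5, 2.0]
x = D @ a_true
a_hat = ista(x, D, lam=0.01)               # most entries of a_hat are (near) zero
```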
More generally, sparsity is an instance of the well-received principle of parsimony, i.e., preferring a simple representation over a more complex one. The sparsity level (number of nonzero elements) is a natural measure of representation complexity for vector-valued features. For matrix-valued features, the matrix rank provides another notion of parsimony, assuming that high-dimensional data lie close to a low-dimensional subspace or manifold. Similarly to sparse optimization, a series of works have shown that rank minimization can be achieved through convex optimization [45] or efficient heuristics [46], paving the path to high-dimensional data analysis such as video processing [47–52].
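The low-rank counterpart of soft shrinkage is singular value thresholding, the proximal operator of the nuclear norm that underlies the convex rank-minimization approaches cited above; the following small sketch (matrix sizes and threshold chosen only for illustration) shows the operation.

```python
import numpy as np

def svd_threshold(M, tau):
    """Singular value thresholding: shrink each singular value of M by tau,
    which promotes low rank (the proximal operator of the nuclear norm)."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    s = np.maximum(s - tau, 0.0)
    return U @ np.diag(s) @ Vt

# Toy example: a rank-2 matrix plus small noise.
rng = np.random.default_rng(0)
M = rng.normal(size=(30, 2)) @ rng.normal(size=(2, 40)) + 0.01 * rng.normal(size=(30, 40))
M_lr = svd_threshold(M, tau=1.0)
print(np.linalg.matrix_rank(M_lr, tol=1e-6))   # typically recovers a rank close to 2
```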
Beyond their proven success in conventional machine learning algorithms, sparse and low-rank structures are widely found to be effective for regularizing deep learning, improving model generalization, training behavior, data efficiency [53], and/or compactness [54]. For example, adding an $\ell_2$ (or $\ell_1$) weight decay term limits the magnitude of the network weights. Another popular tool to avoid overfitting, dropout [2], is a simple regularization approach that improves the generalization of deep networks by randomly setting hidden neurons to zero during training, which can be viewed as a stochastic form of enforcing sparsity. Besides, the inherent sparse properties of both deep network weights and activations have been widely observed and utilized for compressing deep models [55] and improving their energy efficiency [56,57]. As for low-rankness, much research has also been devoted to learning low-rank convolutional filters [58] and to low-rank network compression [59].
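The two regularizers mentioned above can be sketched in a few lines; the layer sizes, decay coefficient, and dropout rate below are hypothetical placeholders chosen only to make the mechanics explicit.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(128, 64))        # hypothetical layer weights
h = rng.normal(size=128)              # hypothetical hidden activations

# l2 (or l1) weight decay: add a penalty on the weights to the task loss.
task_loss = 0.0                       # placeholder for the data term
lam = 1e-4
loss_l2 = task_loss + lam * np.sum(W ** 2)
loss_l1 = task_loss + lam * np.sum(np.abs(W))   # encourages sparse weights

# Dropout at training time: randomly zero hidden neurons (stochastic sparsity).
keep_prob = 0.5
mask = rng.random(h.shape) < keep_prob
h_dropped = h * mask / keep_prob      # "inverted dropout" scaling
```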
Our focus in this book is to explore a deeper structural connection between sparse/low-rank models and deep models. While many examples will be detailed in the remainder of the book, we here briefly state the main idea. We start from the following regularized regression form, which represents a large family of feature learning models, such as ridge regression, sparse coding, and low-rank representation:

$$a = \arg\min_{a} \; F(x, Da) + r(a). \tag{1.1}$$

Here $x \in \mathbb{R}^n$ denotes the input data, $a \in \mathbb{R}^m$ is the feature to learn, and $D \in \mathbb{R}^{n \times m}$ is the representation basis. The function $F(\cdot,\cdot)$ defines the form of feature representation (the data-fidelity term). The regularization term $r(\cdot): \mathbb{R}^m \rightarrow \mathbb{R}$ further incorporates the problem-specific prior knowledge. Not surprisingly, many instances of Eq. (1.1) could be solved by a similar class of iterative algorithms

$$a^{k+1} = \mathcal{N}\big(L_1(x) + L_2(a^{k})\big), \tag{1.2}$$
where $a^{k}$ denotes the intermediate output of the $k$th iteration, $L_1$ and $L_2$ are linear (or convolutional) operators, while $\mathcal{N}$ is a simple nonlinear operator. Equation (1.2) could be expressed by a recursive system, whose fixed point is expected to be the solution $a$ of Eq. (1.1). Furthermore, the recursive system could be unfolded and truncated to $k$ iterations, to construct a $(k+1)$-layer feed-forward network. Without any further tuning, the resulting architecture will output a $k$-iteration approximation of the exact solution $a$. We use the sparse coding model [38] as a popular instance of Eq. (1.1), which corresponds to $F(x, Da) = \frac{1}{2}\|x - Da\|_2^2$ and $r(a) = \lambda\|a\|_1$, with $D$ normalized such that $\|D\|_2 \le 1$ by default. Then, the concrete function forms are given as ($u$ is a vector and $u_i$ is its $i$th element)

$$L_1(x) = D^T x, \quad L_2(a) = (I - D^T D)\,a, \quad \mathcal{N}(u)_i = \operatorname{sign}(u_i)\max(|u_i| - \lambda, 0), \tag{1.3}$$
where $\mathcal{N}$ is an element-wise soft shrinkage function. The unfolded and truncated version of Eq. (1.3) was first proposed in [60], called the learned iterative shrinkage and thresholding algorithm (LISTA). Recent works [61,18,62–64] followed LISTA and developed various models, and many jointly optimized the unfolded model with discriminative tasks [65].
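The following NumPy sketch illustrates how Eq. (1.3) is unfolded and truncated into a few feed-forward layers. Here the operators are initialized from a dictionary $D$ as in ISTA, whereas in LISTA the matrices and the threshold would instead be learned from data end-to-end; the dimensions and values are illustrative assumptions only.

```python
import numpy as np

def soft_threshold(u, theta):
    """Element-wise soft shrinkage, the nonlinearity N(u) in Eq. (1.3)."""
    return np.sign(u) * np.maximum(np.abs(u) - theta, 0.0)

def lista_forward(x, W, S, theta, n_layers=3):
    """Unfolded, truncated iteration a^{k+1} = N(W x + S a^k),
    i.e., Eq. (1.2) turned into a feed-forward network."""
    a = soft_threshold(W @ x, theta)          # first layer, with a^0 = 0
    for _ in range(n_layers):
        a = soft_threshold(W @ x + S @ a, theta)
    return a

# ISTA-based initialization; in LISTA, W, S, and theta are trained jointly.
rng = np.random.default_rng(0)
D = rng.normal(size=(20, 50)) / np.sqrt(20)
L = np.linalg.norm(D, 2) ** 2
W = D.T / L
S = np.eye(50) - D.T @ D / L
theta = 0.01 / L
a_hat = lista_forward(rng.normal(size=20), W, S, theta)
```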
A simple but interesting variant comes out by enforcing the nonnegativity constraint $a \ge 0$ in the sparse coding model: by redefining the function forms as

$$L_1(x) = D^T x - \lambda, \quad L_2(a) = (I - D^T D)\,a, \quad \mathcal{N}(u)_i = \max(u_i, 0), \tag{1.4}$$

Eq. (1.2) could be adapted to solve the nonnegative sparse coding problem.
A by-product of applying nonnegativity is that the original sparsity coefficient $\lambda$ now occurs in $L_1$ as the bias term of this layer, rather than appearing in $\mathcal{N}$ as in Eq. (1.3). As a result, $\mathcal{N}$ in Eq. (1.4) now has exactly the same form as the popular neuron of the rectified linear unit (ReLU) [2]. We further make an aggressive approximation of Eq. (1.4), by truncating Eq. (1.2) to a single iteration and assuming $a^0 = 0$, and have

$$a = \max(D^T x - \lambda, 0). \tag{1.5}$$
Note that even if a nonzero $a^0$ is assumed, the resulting constant term $L_2(a^0)$ could be absorbed into the bias term $-\lambda$. Equation (1.5) is exactly a fully-connected layer followed by ReLU neurons, one of the most standard building blocks in existing deep models. Convolutional layers could be derived similarly by looking at a convolutional sparse coding model [66] rather than a linear one. Such a hidden structural resemblance reveals the potential to bridge many sparse and low-rank models with current successful deep models, potentially enhancing the generalization, compactness, and interpretability of the latter.
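A tiny numerical check makes the equivalence concrete: Eq. (1.5) is literally a fully-connected layer followed by ReLU, with weights $W = D^T$ and bias $b = -\lambda$. The dictionary, input, and $\lambda$ below are made-up values for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
D = rng.normal(size=(20, 50)) / np.sqrt(20)   # hypothetical dictionary
x = rng.normal(size=20)
lam = 0.1

# One truncated step of nonnegative sparse coding, Eq. (1.5).
a_sc = np.maximum(D.T @ x - lam, 0.0)

# The same computation written as a standard network layer: a = ReLU(W x + b).
W = D.T
b = -lam * np.ones(D.shape[1])
a_nn = np.maximum(W @ x + b, 0.0)

assert np.allclose(a_sc, a_nn)                # identical outputs
```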
In the remainder of this book, Chapter 2 will first introduce the bi-level sparse coding model, using the example of hyperspectral image classification. Chapters 3, 4 and 5 will then present three concrete examples (classification, superresolution, and clustering), to show how (bi-level) sparse coding models could be naturally converted to and trained as deep networks. From Chapter 6 to Chapter 9, we will delve into the extensive applications of deep learning aided by sparsity and low-rankness, in signal processing, dimensionality reduction, action recognition, style recognition and kinship understanding, respectively.