Chapter 1

Introduction

Zhangyang Wang, Department of Computer Science and Engineering, Texas A&M University, College Station, TX, United States
Ding Liu, Beckman Institute for Advanced Science and Technology, Urbana, IL, United States

Abstract

Deep learning has achieved prevailing success across a wide range of machine learning and computer vision fields. Meanwhile, sparsity and low-rankness have been popular regularizations in classical machine learning. This chapter is intended as a brief introduction to the basics of deep learning, and then focuses on its inherent connections to the concepts of sparsity and low-rankness.

Keywords

Sparsity; Low rank; Deep learning

1.1 Basics of Deep Learning

Machine learning makes computers learn from data without being explicitly programmed. However, classical machine learning algorithms often find it challenging to extract semantic features directly from raw data, e.g., due to the well-known “semantic gap” [1], and therefore call for assistance from domain experts who hand-craft well-engineered feature representations, on which the models operate more effectively. In contrast, the recently popular deep learning relies on multilayer neural networks to derive semantically meaningful representations, building up a sophisticated concept from multiple simple features; it thus requires less hand-engineering and expert knowledge. Taking image classification as an example [2], a deep learning-based image classification system represents an object by gradually extracting edges, textures, and structures from lower- to middle-level hidden layers, which become more and more associated with the target semantic concept as the model grows deeper. Driven by the emergence of big data and hardware acceleration, increasingly abstract and high-level representations can be extracted from raw inputs, empowering deep learning to solve complicated, even traditionally intractable, problems. Deep learning has achieved tremendous success in visual object recognition [2–5], face recognition and verification [6,7], object detection [8–11], image restoration and enhancement [12–17], clustering [18], emotion recognition [19], aesthetics and style recognition [20–23], scene understanding [24,25], speech recognition [26], machine translation [27], image synthesis [28], and even playing Go [29] and poker [30].

A basic neural network is composed of a set of perceptrons (artificial neurons), each of which maps inputs to output values with a simple activation function. Among recent deep neural network architectures, convolutional neural networks (CNNs) and recurrent neural networks (RNNs) are the two main streams, differing in their connectivity patterns. CNNs deploy convolution operations on hidden layers for weight sharing and parameter reduction. They can extract local information from grid-like input data, and have mainly shown successes in computer vision and image processing, with many popular instances such as LeNet [31], AlexNet [2], VGG [32], GoogLeNet [33], and ResNet [34]. RNNs are dedicated to processing sequential input data of variable length. An RNN produces an output at each time step; its hidden state at each step is computed from the current input and the hidden state at the previous step. To avoid vanishing/exploding gradients in RNNs when modeling long-term dependencies, long short-term memory (LSTM) [35] and gated recurrent unit (GRU) [36] cells with controllable gates are widely used in practical applications. Interested readers are referred to a comprehensive deep learning textbook [37].
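
To make the recurrence concrete, here is a minimal NumPy sketch of a vanilla RNN step; all shapes, weights, and values are illustrative choices for the example, not taken from any cited model:

```python
import numpy as np

def rnn_step(x_t, h_prev, W_xh, W_hh, b):
    # New hidden state from the current input and the previous hidden state.
    return np.tanh(x_t @ W_xh + h_prev @ W_hh + b)

rng = np.random.default_rng(0)
xs = rng.standard_normal((5, 3))          # a length-5 sequence of 3-dim inputs
W_xh = 0.1 * rng.standard_normal((3, 4))  # input-to-hidden weights
W_hh = 0.1 * rng.standard_normal((4, 4))  # hidden-to-hidden weights, shared over time
b = np.zeros(4)

h = np.zeros(4)
for x_t in xs:                            # unroll the recurrence over time steps
    h = rnn_step(x_t, h, W_xh, W_hh, b)
```

Note that the same weight matrices are reused at every time step; it is exactly this repeated multiplication by W_hh that causes vanishing/exploding gradients over long sequences, which LSTM and GRU gates mitigate.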

1.2 Basics of Sparsity and Low-Rankness

In signal processing, the classical way to represent a multidimensional signal is to express it as a linear combination of the components in a basis (either chosen in advance or learned). The goal of linearly transforming a signal with respect to a basis is to have a more predictable pattern in the resultant linear coefficients. With an appropriate basis, such coefficients often exhibit desired characteristics. One important observation is that, for most natural signals such as images and audio, most of the coefficients are zero or close to zero if the basis is properly selected: the technique is usually termed sparse coding, and the basis is called the dictionary [38]. A sparse prior can have many interpretations in various contexts, such as smoothness, feature selection, etc. As ensured by theories from compressive sensing [39], under certain favorable conditions, the sparse solutions can be reliably obtained using the ℓ1-norm, instead of the more straightforward but intractable ℓ0-norm. Beyond the element-wise sparsity model, more elaborate structured sparse models have also been developed [40,41]. Learning the dictionary from data further boosts the power of sparse coding [42–44].
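
As a concrete illustration, the following NumPy sketch solves a small sparse coding problem with the classical iterative shrinkage-thresholding algorithm (ISTA); the dictionary, signal, and parameter values are arbitrary choices for the example:

```python
import numpy as np

def soft_threshold(u, lam):
    # Element-wise soft shrinkage: the proximal operator of lam * ||.||_1.
    return np.sign(u) * np.maximum(np.abs(u) - lam, 0.0)

def ista(x, D, lam=0.1, n_iter=200):
    # Minimize 0.5 * ||x - D a||^2 + lam * ||a||_1 by gradient steps
    # followed by soft shrinkage, assuming ||D||_2 <= 1 (step size 1).
    a = np.zeros(D.shape[1])
    for _ in range(n_iter):
        a = soft_threshold(a + D.T @ (x - D @ a), lam)
    return a

rng = np.random.default_rng(0)
D = rng.standard_normal((20, 50))
D /= np.linalg.norm(D, 2)                 # normalize the spectral norm to 1
a_true = np.zeros(50)
a_true[[3, 17, 40]] = [1.0, -2.0, 1.5]    # a 3-sparse ground-truth code
x = D @ a_true                            # the observed signal
a_hat = ista(x, D, lam=0.01)
```

Despite the underdetermined system (50 unknowns, 20 equations), the ℓ1 penalty drives most recovered coefficients toward zero, as the compressive sensing theory [39] predicts.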

More generally, sparsity belongs to the well-received principle of parsimony, i.e., preferring a simple representation to a more complex one. The sparsity level (number of nonzero elements) is a natural measure of representation complexity for vector-valued features. In the case of matrix-valued features, the matrix rank provides another notion of parsimony, assuming that high-dimensional data lie close to a low-dimensional subspace or manifold. Similarly to sparse optimization, a series of works has shown that rank minimization can be achieved through convex optimization [45] or efficient heuristics [46], paving the path to high-dimensional data analysis such as video processing [47–52].
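
The matrix analogue of soft shrinkage is singular value thresholding, the proximal operator of the nuclear norm that underlies convex rank minimization [45]. A minimal sketch, where the matrix sizes, rank, and threshold are illustrative:

```python
import numpy as np

def svt(M, tau):
    # Singular value thresholding: soft-shrink the spectrum of M by tau,
    # the low-rank analogue of element-wise soft shrinkage.
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    s = np.maximum(s - tau, 0.0)
    return (U * s) @ Vt

rng = np.random.default_rng(0)
L = rng.standard_normal((30, 4)) @ rng.standard_normal((4, 30))  # rank-4 matrix
M = L + 0.01 * rng.standard_normal((30, 30))                     # small dense noise
L_hat = svt(M, tau=0.5)   # threshold kills the noise spectrum, keeps the rank-4 part
```

Because the noise singular values fall below the threshold while the four signal singular values stay well above it, the recovered L_hat is exactly rank 4 and close to L.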

1.3 Connecting Deep Learning to Sparsity and Low-Rankness

Beyond their proven success in conventional machine learning algorithms, sparse and low-rank structures are widely found to be effective for regularizing deep learning, improving model generalization, training behaviors, data efficiency [53], and/or compactness [54]. For example, adding an ℓ1 (or ℓ2) weight decay term limits the magnitudes of the neurons' weights. Another popular tool to avoid overfitting, dropout [2], is a simple regularization approach that improves the generalization of deep networks by randomly setting hidden neurons to zero during training, which can be viewed as a stochastic form of enforcing sparsity. Besides, the inherent sparse properties of both deep network weights and activations have been widely observed and utilized for compressing deep models [55] and improving their energy efficiency [56,57]. As for low-rankness, much research has been devoted to learning low-rank convolutional filters [58] and to low-rank network compression [59].
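
For instance, dropout can be sketched in a few lines of NumPy; this is the common "inverted dropout" formulation, and the rate and sizes are illustrative:

```python
import numpy as np

def dropout(h, p, rng, train=True):
    # During training, zero each activation independently with probability p
    # (a stochastic sparsity constraint) and rescale the survivors so the
    # expected activation is unchanged; at test time, act as the identity.
    if not train:
        return h
    mask = rng.random(h.shape) >= p
    return h * mask / (1.0 - p)

rng = np.random.default_rng(0)
h = np.ones(10000)
h_drop = dropout(h, p=0.5, rng=rng)   # about half the activations become zero
```

The rescaling by 1/(1 − p) is what allows the same network to be used unchanged at test time, since the expected value of each activation matches its training-time counterpart.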

The focus of this book is to explore a deeper structural connection between sparse/low-rank models and deep models. While many examples will be detailed in the remainder of the book, we briefly state the main idea here. We start from the following regularized regression form, which represents a large family of feature learning models, such as ridge regression, sparse coding, and low-rank representation:

a = argmin_a (1/2) ||x - Φ(D, a)||_F^2 + Ψ(a).    (1.1)

Here x ∈ R^n denotes the input data, a ∈ R^m is the feature to learn, and D ∈ R^{n×m} is the representation basis. The function Φ(D, a): R^{n×m} × R^m → R^n defines the form of the feature representation. The regularization term Ψ(a): R^m → R further incorporates problem-specific prior knowledge. Not surprisingly, many instances of Eq. (1.1) can be solved by a similar class of iterative algorithms

z^{k+1} = N(L_1(x) + L_2(z^k)),    (1.2)

where z^k ∈ R^m denotes the intermediate output of the kth iteration, k = 0, 1, ..., L_1 and L_2 are linear (or convolutional) operators, while N is a simple nonlinear operator. Equation (1.2) can be expressed as a recursive system, whose fixed point is expected to be the solution a of Eq. (1.1). Furthermore, the recursive system can be unfolded and truncated to k iterations, to construct a (k+1)-layer feed-forward network. Without any further tuning, the resulting architecture will output a k-iteration approximation of the exact solution a. We use the sparse coding model [38] as a popular instance of Eq. (1.1), which corresponds to Ψ(a) = λ||a||_1 and Φ(D, a) = Da, with ||D||_2 = 1 by default. Then, the concrete function forms are given as (u is a vector and u_i is its ith element)

L_1(x) = D^T x,  L_2(z^k) = (I - D^T D) z^k,  N(u)_i = sign(u_i)(|u_i| - λ)_+,    (1.3)

where N is an element-wise soft shrinkage function. The unfolded and truncated version of Eq. (1.3) was first proposed in [60], called the learned iterative shrinkage and thresholding algorithm (LISTA). Recent works [18,61,62–64] followed LISTA and developed various models, many of which jointly optimize the unfolded model with discriminative tasks [65].
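
The unfolding can be sketched directly in NumPy. Below, the weights are fixed at their analytic ISTA values W_e = D^T and S = I - D^T D; in LISTA proper [60] these matrices (and the threshold) would instead be learned end-to-end. All sizes and values are illustrative:

```python
import numpy as np

def soft(u, lam):
    # The element-wise soft shrinkage operator N of Eq. (1.3).
    return np.sign(u) * np.maximum(np.abs(u) - lam, 0.0)

def unfolded_ista(x, D, lam, k):
    # Truncating the recursion of Eq. (1.2) to k iterations yields a
    # (k+1)-layer feed-forward network with tied weights W_e and S.
    W_e = D.T
    S = np.eye(D.shape[1]) - D.T @ D
    z = soft(W_e @ x, lam)               # first layer (z^0 = 0)
    for _ in range(k):
        z = soft(W_e @ x + S @ z, lam)   # each layer applies Eq. (1.2)
    return z

rng = np.random.default_rng(0)
D = rng.standard_normal((16, 32))
D /= np.linalg.norm(D, 2)                # enforce ||D||_2 = 1
x = rng.standard_normal(16)

def objective(z, lam=0.05):
    return 0.5 * np.linalg.norm(x - D @ z) ** 2 + lam * np.abs(z).sum()

z_3 = unfolded_ista(x, D, lam=0.05, k=3)      # shallow truncation
z_deep = unfolded_ista(x, D, lam=0.05, k=300) # close to the fixed point
```

Since each layer is one monotone ISTA step, the deeper truncation can only lower the objective of Eq. (1.1); what learning the weights buys, per [60], is reaching a comparable accuracy in far fewer layers.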

A simple but interesting variant arises by enforcing Ψ(a) = λ||a||_1 with a ≥ 0, and Φ(D, a) = Da with ||D||_2 = 1; Eq. (1.2) can then be adapted to solve the nonnegative sparse coding problem

L_1(x) = D^T x - λ,  L_2(z^k) = (I - D^T D) z^k,  N(u)_i = max(u_i, 0).    (1.4)

A by-product of applying nonnegativity is that the original sparsity coefficient λ now occurs in L_1 as the bias term of this layer, rather than appearing in N as in Eq. (1.3). As a result, N in Eq. (1.4) now has exactly the same form as the popular rectified linear unit (ReLU) neuron [2]. We further make an aggressive approximation of Eq. (1.4), by setting k = 0 and assuming z^0 = 0, and have

z = N(D^T x - λ).    (1.5)

Note that even if a nonzero z^0 is assumed, it can be absorbed into the bias term -λ. Equation (1.5) is exactly a fully-connected layer followed by ReLU neurons, one of the most standard building blocks in existing deep models. Convolutional layers can be derived similarly by starting from a convolutional sparse coding model [66] rather than a linear one. Such a hidden structural resemblance reveals the potential to bridge many sparse and low-rank models with current successful deep models, potentially enhancing the generalization, compactness, and interpretability of the latter.
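
The equivalence stated by Eq. (1.5) is easy to verify numerically: one iteration of Eq. (1.4) from z^0 = 0 coincides exactly with a fully-connected layer (weights D^T, bias -λ) followed by ReLU. The dictionary and values below are illustrative:

```python
import numpy as np

def relu(u):
    # The ReLU neuron, identical in form to N in Eq. (1.4).
    return np.maximum(u, 0.0)

rng = np.random.default_rng(0)
D = rng.standard_normal((8, 16))
D /= np.linalg.norm(D, 2)     # enforce ||D||_2 = 1
x = rng.standard_normal(8)
lam = 0.1

# One step of Eq. (1.4) starting from z^0 = 0:
z0 = np.zeros(16)
z_sc = relu((D.T @ x - lam) + (np.eye(16) - D.T @ D) @ z0)

# A fully-connected layer with weights D^T and bias -lam, followed by ReLU:
z_fc = relu(D.T @ x - lam)
```

The two outputs agree element-wise, since the L_2 term vanishes at z^0 = 0.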

1.4 Organization

In the remainder of this book, Chapter 2 will first introduce the bi-level sparse coding model, using the example of hyperspectral image classification. Chapters 3, 4 and 5 will then present three concrete examples (classification, superresolution, and clustering), to show how (bi-level) sparse coding models could be naturally converted to and trained as deep networks. From Chapter 6 to Chapter 9, we will delve into the extensive applications of deep learning aided by sparsity and low-rankness, in signal processing, dimensionality reduction, action recognition, style recognition and kinship understanding, respectively.

References

[1] R. Zhao, W.I. Grosky, Narrowing the semantic gap-improved text-based web document retrieval using visual features, IEEE Transactions on Multimedia 2002;4(2):189–200.

[2] A. Krizhevsky, I. Sutskever, G.E. Hinton, Imagenet classification with deep convolutional neural networks, NIPS. 2012.

[3] Z. Wang, S. Chang, Y. Yang, D. Liu, T.S. Huang, Studying very low resolution recognition using deep networks, Proceedings of the IEEE conference on computer vision and pattern recognition. 2016:4792–4800.

[4] D. Liu, B. Cheng, Z. Wang, H. Zhang, T.S. Huang, Enhance visual recognition under adverse conditions via deep networks, arXiv preprint arXiv:1712.07732; 2017.

[5] Z. Wu, Z. Wang, Z. Wang, H. Jin, Towards privacy-preserving visual recognition via adversarial training: a pilot study, arXiv preprint arXiv:1807.08379; 2018.

[6] N. Bodla, J. Zheng, H. Xu, J. Chen, C.D. Castillo, R. Chellappa, Deep heterogeneous feature fusion for template-based face recognition, 2017 IEEE winter conference on applications of computer vision, WACV 2017. Santa Rosa, CA, USA, March 24–31, 2017. 2017:586–595.

[7] R. Ranjan, A. Bansal, H. Xu, S. Sankaranarayanan, J. Chen, C.D. Castillo, et al., Crystal loss and quality pooling for unconstrained face verification and recognition, CoRR 2018. arXiv:1804.01159 [abs].

[8] S. Ren, K. He, R. Girshick, J. Sun, Faster R-CNN: towards real-time object detection with region proposal networks, Advances in neural information processing systems. 2015:91–99.

[9] J. Yu, Y. Jiang, Z. Wang, Z. Cao, T. Huang, Unitbox: an advanced object detection network, Proceedings of the 2016 ACM on multimedia conference. ACM; 2016:516–520.

[10] J. Gao, Q. Wang, Y. Yuan, Embedding structured contour and location prior in siamesed fully convolutional networks for road detection, Robotics and automation (ICRA), 2017 IEEE international conference on. IEEE; 2017:219–224.

[11] H. Xu, X. Lv, X. Wang, Z. Ren, N. Bodla, R. Chellappa, Deep regionlets for object detection, The European conference on computer vision (ECCV). 2018.

[12] R. Timofte, E. Agustsson, L. Van Gool, M.H. Yang, L. Zhang, B. Lim, et al., NTIRE 2017 challenge on single image super-resolution: methods and results, Computer vision and pattern recognition workshops (CVPRW), 2017 IEEE conference on. IEEE; 2017:1110–1121.

[13] B. Li, X. Peng, Z. Wang, J. Xu, D. Feng, AOD-Net: all-in-one dehazing network, Proceedings of the IEEE international conference on computer vision. 2017:4770–4778.

[14] B. Li, X. Peng, Z. Wang, J. Xu, D. Feng, An all-in-one network for dehazing and beyond, arXiv preprint arXiv:1707.06543; 2017.

[15] B. Li, X. Peng, Z. Wang, J. Xu, D. Feng, End-to-end united video dehazing and detection, arXiv preprint arXiv:1709.03919; 2017.

[16] D. Liu, B. Wen, J. Jiao, X. Liu, Z. Wang, T.S. Huang, Connecting image denoising and high-level vision tasks via deep learning, arXiv preprint arXiv:1809.01826; 2018.

[17] R. Prabhu, X. Yu, Z. Wang, D. Liu, A. Jiang, U-finger: multi-scale dilated convolutional network for fingerprint image denoising and inpainting, arXiv preprint arXiv:1807.10993; 2018.

[18] Z. Wang, S. Chang, J. Zhou, M. Wang, T.S. Huang, Learning a task-specific deep architecture for clustering, SDM 2016.

[19] B. Cheng, Z. Wang, Z. Zhang, Z. Li, D. Liu, J. Yang, et al., Robust emotion recognition from low quality and low bit rate video: a deep learning approach, arXiv preprint arXiv:1709.03126; 2017.

[20] Z. Wang, J. Yang, H. Jin, E. Shechtman, A. Agarwala, J. Brandt, et al., DeepFont: identify your font from an image, Proceedings of the 23rd ACM international conference on multimedia. ACM; 2015:451–459.

[21] Z. Wang, J. Yang, H. Jin, E. Shechtman, A. Agarwala, J. Brandt, et al., Real-world font recognition using deep network and domain adaptation, arXiv preprint arXiv:1504.00028; 2015.

[22] Z. Wang, S. Chang, F. Dolcos, D. Beck, D. Liu, T.S. Huang, Brain-inspired deep networks for image aesthetics assessment, arXiv preprint arXiv:1601.04155; 2016.

[23] T.S. Huang, J. Brandt, A. Agarwala, E. Shechtman, Z. Wang, H. Jin, et al., Deep learning for font recognition and retrieval, Applied cloud deep semantic recognition. Auerbach Publications; 2018:109–130.

[24] C. Farabet, C. Couprie, L. Najman, Y. LeCun, Learning hierarchical features for scene labeling, IEEE Transactions on Pattern Analysis and Machine Intelligence 2013;35(8):1915–1929.

[25] Q. Wang, J. Gao, Y. Yuan, A joint convolutional neural networks and context transfer for street scenes labeling, IEEE Transactions on Intelligent Transportation Systems 2017.

[26] G. Saon, H.K.J. Kuo, S. Rennie, M. Picheny, The IBM 2015 English conversational telephone speech recognition system, arXiv preprint arXiv:1505.05899; 2015.

[27] I. Sutskever, O. Vinyals, Q.V. Le, Sequence to sequence learning with neural networks, Advances in neural information processing systems. 2014:3104–3112.

[28] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, et al., Generative adversarial nets, Advances in neural information processing systems. 2014:2672–2680.

[29] D. Silver, A. Huang, C.J. Maddison, A. Guez, L. Sifre, G. Van Den Driessche, et al., Mastering the game of go with deep neural networks and tree search, Nature 2016;529(7587):484–489.

[30] M. Moravčík, M. Schmid, N. Burch, V. Lisỳ, D. Morrill, N. Bard, et al., DeepStack: expert-level artificial intelligence in no-limit poker, arXiv preprint arXiv:1701.01724; 2017.

[31] Y. LeCun, et al., LeNet-5, convolutional neural networks, URL: http://yann.lecun.com/exdb/lenet; 2015.

[32] K. Simonyan, A. Zisserman, Very deep convolutional networks for large-scale image recognition, arXiv preprint arXiv:1409.1556; 2014.

[33] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, et al., Going deeper with convolutions, Proceedings of the IEEE conference on computer vision and pattern recognition. 2015:1–9.

[34] K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image recognition, Proceedings of the IEEE conference on computer vision and pattern recognition. 2016:770–778.

[35] F.A. Gers, J. Schmidhuber, F. Cummins, Learning to forget: continual prediction with LSTM, Neural Computation 2000;12(10):2451–2471.

[36] J. Chung, C. Gulcehre, K. Cho, Y. Bengio, Empirical evaluation of gated recurrent neural networks on sequence modeling, arXiv preprint arXiv:1412.3555; 2014.

[37] I. Goodfellow, Y. Bengio, A. Courville, Deep learning. MIT Press; 2016.

[38] Z. Wang, J. Yang, H. Zhang, Z. Wang, Y. Yang, D. Liu, et al., Sparse coding and its applications in computer vision. World Scientific; 2015.

[39] R.G. Baraniuk, Compressive sensing [lecture notes], IEEE Signal Processing Magazine 2007;24(4):118–121.

[40] J. Huang, T. Zhang, D. Metaxas, Learning with structured sparsity, Journal of Machine Learning Research Nov. 2011;12:3371–3412.

[41] H. Xu, J. Zheng, A. Alavi, R. Chellappa, Template regularized sparse coding for face verification, 23rd International conference on pattern recognition, ICPR 2016. Cancún, Mexico, December 4–8, 2016. 2016:1448–1454.

[42] H. Xu, J. Zheng, A. Alavi, R. Chellappa, Cross-domain visual recognition via domain adaptive dictionary learning, CoRR 2018. arXiv:1804.04687 [abs].

[43] H. Xu, J. Zheng, R. Chellappa, Bridging the domain shift by domain adaptive dictionary learning, Proceedings of the British machine vision conference 2015, BMVC 2015. Swansea, UK, September 7–10, 2015. 2015 p. 96.1–96.12.

[44] H. Xu, J. Zheng, A. Alavi, R. Chellappa, Learning a structured dictionary for video-based face recognition, 2016 IEEE winter conference on applications of computer vision, WACV 2016. Lake Placid, NY, USA, March 7–10, 2016. 2016:1–9.

[45] E.J. Candès, X. Li, Y. Ma, J. Wright, Robust principal component analysis? Journal of the ACM (JACM) 2011;58(3):11.

[46] Z. Wen, W. Yin, Y. Zhang, Solving a low-rank factorization model for matrix completion by a nonlinear successive over-relaxation algorithm, Mathematical Programming Computation 2012:1–29.

[47] Z. Wang, H. Li, Q. Ling, W. Li, Robust temporal-spatial decomposition and its applications in video processing, IEEE Transactions on Circuits and Systems for Video Technology 2013;23(3):387–400.

[48] H. Li, Z. Lu, Z. Wang, Q. Ling, W. Li, Detection of blotch and scratch in video based on video decomposition, IEEE Transactions on Circuits and Systems for Video Technology 2013;23(11):1887–1900.

[49] Z. Yu, H. Li, Z. Wang, Z. Hu, C.W. Chen, Multi-level video frame interpolation: exploiting the interaction among different levels, IEEE Transactions on Circuits and Systems for Video Technology 2013;23(7):1235–1248.

[50] Z. Yu, Z. Wang, Z. Hu, H. Li, Q. Ling, Video error concealment via total variation regularized matrix completion, Image processing (ICIP), 2012 19th IEEE international conference on. IEEE; 2012:1633–1636.

[51] Z. Yu, Z. Wang, Z. Hu, Q. Ling, H. Li, Video frame interpolation using 3-d total variation regularized completion, Image processing (ICIP), 2012 19th IEEE international conference on. IEEE; 2012:857–860.

[52] Z. Wang, H. Li, Q. Ling, W. Li, Mixed Gaussian-impulse video noise removal via temporal-spatial decomposition, Circuits and systems (ISCAS), 2012 IEEE international symposium on. IEEE; 2012:1851–1854.

[53] X. Zhang, Z. Wang, D. Liu, Q. Ling, DADA: deep adversarial data augmentation for extremely low data regime classification, arXiv preprint arXiv:1809.00981; 2018.

[54] J. Wu, Y. Wang, Z. Wu, Z. Wang, A. Veeraraghavan, Y. Lin, Deep k-means: re-training and parameter sharing with harder cluster assignments for compressing deep convolutions, arXiv preprint arXiv:1806.09228; 2018.

[55] S. Han, H. Mao, W.J. Dally, Deep compression: compressing deep neural networks with pruning, trained quantization and Huffman coding, arXiv preprint arXiv:1510.00149; 2015.

[56] B. Liu, M. Wang, H. Foroosh, M. Tappen, M. Pensky, Sparse convolutional neural networks, Proceedings of the IEEE conference on computer vision and pattern recognition. 2015:806–814.

[57] Y. Lin, C. Sakr, Y. Kim, N. Shanbhag, PredictiveNet: an energy-efficient convolutional neural network via zero prediction, Circuits and systems (ISCAS), 2017 IEEE international symposium on. IEEE; 2017:1–4.

[58] Y. Ioannou, D. Robertson, J. Shotton, R. Cipolla, A. Criminisi, Training CNNs with low-rank filters for efficient image classification, arXiv preprint arXiv:1511.06744; 2015.

[59] T.N. Sainath, B. Kingsbury, V. Sindhwani, E. Arisoy, B. Ramabhadran, Low-rank matrix factorization for deep neural network training with high-dimensional output targets, Acoustics, speech and signal processing (ICASSP), 2013 IEEE international conference on. IEEE; 2013:6655–6659.

[60] K. Gregor, Y. LeCun, Learning fast approximations of sparse coding, ICML. 2010.

[61] Z. Wang, Q. Ling, T. Huang, Learning deep ℓ0 encoders, AAAI 2016.

[62] Z. Wang, S. Chang, D. Liu, Q. Ling, T.S. Huang, D3: deep dual-domain based fast restoration of jpeg-compressed images, IEEE CVPR. 2016.

[63] Z. Wang, Y. Yang, S. Chang, Q. Ling, T.S. Huang, Learning a deep ℓ∞ encoder for hashing, IJCAI 2016.

[64] D. Liu, Z. Wang, B. Wen, J. Yang, W. Han, T.S. Huang, Robust single image super-resolution via deep networks with sparse prior, IEEE TIP 2016.

[65] A. Coates, A.Y. Ng, The importance of encoding versus training with sparse coding and vector quantization, ICML. 2011.

[66] B. Wohlberg, Efficient convolutional sparse coding, ICASSP. IEEE; 2014.
