Ding Liu⁎; Thomas S. Huang† ⁎Beckman Institute for Advanced Science and Technology, Urbana, IL, United States
†Department of Electrical and Computer Engineering, University of Illinois at Urbana-Champaign, Champaign, IL, United States
Recently, deep learning has been successfully applied in numerous areas of computer vision, including low-level image restoration problems. For single image super-resolution (SR), which is an ill-posed problem that tries to recover a high-resolution image from its low-resolution observation, a number of models based on deep neural networks have been proposed and obtained superior performance that overshadows all previously handcrafted models. To regularize the solution of the problem, older methods have focused on using good priors from natural images such as sparse representation, or directly learning the priors from a large data set with models such as deep neural networks. In this chapter, we argue that domain expertise represented by the conventional sparse coding model is still valuable, and it can be combined with the key ingredients of deep learning to achieve further improvements. We demonstrate that a sparse coding model particularly designed for super-resolution can be incarnated as a neural network with the merit of end-to-end optimization over training data. The network has a cascaded structure which boosts the SR performance for both fixed and incremental scaling factors. The interpretation of the network based on sparse representation leads to much more efficient and effective training, as well as reduced model size. Moreover, we develop training and testing schemes that can be extended for robust handling of images with additional degradation such as noise and blurring. A subjective assessment is conducted and analyzed in order to thoroughly evaluate various SR techniques. Our proposed model is tested on several benchmark datasets, and it significantly outperforms existing state-of-the-art methods for various scaling factors both quantitatively and qualitatively.
Furthermore, we propose the method of learning a mixture of SR inference modules in a unified framework to tackle the problem of single image SR. Specifically, a number of SR inference modules specialized in different image local patterns are first independently applied on the LR image to obtain various HR estimates, and the resultant HR estimates are adaptively aggregated to form the final HR image. By selecting neural networks as the SR inference module, the whole procedure can be incorporated into a unified network and optimized jointly. Extensive experiments are conducted to investigate the relation between restoration performance and various network architecture designs. Compared with other recent image SR approaches, this method consistently achieves superior restoration results on a wide range of images while allowing more flexible design choices.
Sparse coding; Image super-resolution; Deep learning; Neural network
Single image super-resolution (SR) aims at obtaining a high-resolution (HR) image from a low-resolution (LR) input image by inferring all the missing high frequency contents. With the known variables in LR images greatly outnumbered by the unknowns in HR images, SR is a highly ill-posed problem and the current techniques are far from being satisfactory for many real applications [1,2], such as surveillance, medical imaging and consumer photo editing [3].
To regularize the solution of SR, people have exploited various priors of natural images. Analytical priors, such as bicubic interpolation, work well for smooth regions; while image models based on statistics of edges [4] and gradients [5] can recover sharper structures. Sparse priors are utilized in the patch-based methods [6–8]. HR patch candidates are recovered from similar examples in the LR image itself at different locations and across different scales [9,10].
More recently, inspired by the success achieved by deep learning [11] in other computer vision tasks, people began to use neural networks with deep architecture for image SR. Multiple layers of collaborative auto-encoders are stacked together in [12,13] for robust matching of self-similar patches. Deep convolutional neural networks (CNN) [14] and deconvolutional networks [15] are designed that directly learn the nonlinear mapping from the LR space to HR space in a way similar to coupled sparse coding [7]. As these deep networks allow end-to-end training of all the model components between LR input and HR output, significant improvements have been observed over their shallow counterparts.
The networks in [12,14] are built with generic architectures, which means all their knowledge about SR is learned from training data. On the other hand, people's domain expertise for the SR problem, such as natural image prior and image degradation model, is largely ignored in deep learning based approaches. It is then worthwhile to investigate whether domain expertise can be used to design better deep model architectures, or whether deep learning can be leveraged to improve the quality of handcrafted models.
In this section, we extend the conventional sparse coding model [6] using several key ideas from deep learning, and show that domain expertise is complementary to large learning capacity in further improving SR performance. First, based on the learned iterative shrinkage and thresholding algorithm (LISTA) [16], we implement a feed-forward neural network in which each layer strictly corresponds to one step in the processing flow of sparse coding based image SR. In this way, the sparse representation prior is effectively encoded in our network structure; at the same time, all the components of sparse coding can be trained jointly through back-propagation. This simple model, which is named sparse coding based network (SCN), achieves notable improvement over the generic CNN model [14] in terms of both recovery accuracy and human perception, and yet has a compact model size. Moreover, with the correct understanding of each layer's physical meaning, we have a more principled way to initialize the parameters of SCN, which helps to improve optimization speed and quality.
A single network is only able to perform image SR by a particular scaling factor. In [14], different networks are trained for different scaling factors. In this section, we propose a cascade of multiple SCNs to achieve SR for arbitrary factors. This approach, motivated by the self-similarity based SR approach [9], not only increases the scaling flexibility of our model, but also reduces artifacts for large scaling factors. Moreover, inspired by the multi-pass scheme of image denoising [17], we demonstrate that the SR results can be further enhanced by cascading multiple SCNs for SR of a fixed scaling factor. A cascade of SCNs (CSCN) can also benefit from the end-to-end training of deep network with a specially designed multi-scale cost function.
In practical SR scenarios, the real LR measurements usually suffer from various types of corruption, such as noise and blurring. Sometimes the degradation process is even too complicated or unclear. We propose several schemes using our SCN to robustly handle such practical SR cases. When the degradation mechanism is unknown, we fine-tune the generic SCN with the requirement of only a small amount of real training data and manage to adapt our model to the new scenario. When the forward model for LR generation is clear, we propose an iterative SR scheme incorporating SCN with additional regularization based on priors from the degradation mechanism.
Subjective assessment is important to the SR technology because commercial products equipped with such technology are usually evaluated subjectively by the end users. In order to thoroughly compare our model with other prevailing SR methods, we conduct a systematic subjective evaluation of these methods, in which the assessment results are statistically analyzed and one score is given for each method.
In the following, we first review the literature in Sect. 4.1.2 and introduce the SCN model in Sect. 4.1.3. Then the cascade scheme of SCN models is detailed in Sect. 4.1.4. The method for robustly handling images with additional degradation such as noise and blurring is discussed in Sect. 4.1.5. Implementation details are provided in Sect. 4.1.6. Extensive experimental results are reported in Sect. 4.1.7, and the subjective evaluation is described in Sect. 4.1.8. Finally, conclusion and future work are presented in Sect. 4.1.9.
Single image SR is the task of recovering an HR image from only one LR observation. A comprehensive review can be found in [18]. Generally, existing methods can be classified into three categories: interpolation based [19], image statistics based [4,20], and example based methods [6,21].
Interpolation based methods include bilinear, bicubic and Lanczos filtering [19], which usually run very fast because of the low algorithm complexity. However, the simplicity of these methods leads to the failure of modeling the complex mapping between the LR feature space and the corresponding HR feature space, generating overly-smoothed unsatisfactory regions.
Image statistics based methods utilize the statistical edge information to reconstruct HR images [4,20]. They rely on the priors of edge statistics in images while facing the shortcoming of losing high-frequency detail information, especially in the case of large upscaling factors.
The current most popular and successful approaches are built on example based learning techniques, which aim to learn the correspondence between the LR feature space and HR feature space through a large number of representative exemplar pairs. The pioneer work in this area includes [22].
Given the origin of exemplar pairs, these methods can be further categorized into three classes: self-example based [9,10], external-example based [6,21], and combinations of the two [23]. Self-example based methods only exploit the single input LR image as reference, and extract exemplar pairs merely from the LR image across different scales to predict the HR image. Such methods usually work well on images containing repetitive patterns or textures, but they lack the richness of image structures outside the input image and thus fail to generate satisfactory predictions for images of other classes. Huang et al. [24] extend this idea by building self-dictionaries for handling geometric transformations.
External-example based methods first utilize the exemplar pairs extracted from a large external dataset, in order to learn the universal image characteristics between the LR feature space and HR feature space, and then apply the learned mapping for SR. Usually, representative patches from external datasets are compactly embodied in pre-trained dictionaries. One representative approach is the sparse coding based method [6,7]. For example, in [7] two coupled dictionaries are trained for the LR feature space and HR patch feature space, respectively, such that the LR patch over LR dictionary and its corresponding HR patch over HR dictionary share the same sparse representation. Although it is able to capture the universal LR–HR correspondence from external datasets and recover fine details and sharpened edges, it suffers from the high computational cost when solving complicated nonlinear optimization problems.
Timofte et al. [21,25] propose a neighbor embedding approach for SR, and formulate the problem as a least squares optimization with $\ell_2$ norm regularization, which drastically reduces the computational complexity compared with [6,7]. Neighbor embedding approaches approximate HR patches as a weighted average of similar training patches in a low dimensional manifold.
Random forest is built for SR without dictionary learning in [26,27]. Such an approach achieves fast inference time but usually suffers from the huge model size.
In this section, we first introduce the background of sparse coding for image SR and its network implementation. Then we illustrate the design of our proposed sparse coding based network and its advantage over previous models.
The sparse representation based SR method [6] models the transform from each local patch $y \in \mathbb{R}^{m_y}$ in the bicubic-upscaled LR image to the corresponding patch $x \in \mathbb{R}^{m_x}$ in the HR image. The dimension $m_y$ is not necessarily the same as $m_x$ when image features other than raw pixels are used to represent patch y. It is assumed that the LR (HR) patch y (x) can be represented with respect to an overcomplete dictionary $D_y$ ($D_x$) using some sparse linear coefficients $\alpha_y$ ($\alpha_x$) $\in \mathbb{R}^n$, which are known as the sparse code. Since the degradation process from x to y is nearly linear, the patch pair can share the same sparse code $\alpha_y = \alpha_x = \alpha$ if the dictionaries $D_y$ and $D_x$ are defined properly. Therefore, for an input LR patch y, the HR patch can be recovered as
$$x = D_x \alpha, \quad \text{s.t. } \alpha = \arg\min_{z} \|y - D_y z\|_2^2 + \lambda \|z\|_1, \tag{4.1}$$
where $\|\cdot\|_1$ denotes the $\ell_1$ norm, which is convex and sparsity-inducing, and λ is a regularization coefficient.
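As a minimal sketch of this recovery step, the following NumPy code solves the $\ell_1$-regularized problem with plain ISTA and then maps the resulting code through the HR dictionary. The dictionary sizes, the random dictionaries, and the function names `ista` and `sr_patch` are illustrative assumptions, not the settings or code used in this chapter.

```python
import numpy as np

def soft_threshold(a, theta):
    """Element-wise shrinkage: sign(a) * max(|a| - theta, 0)."""
    return np.sign(a) * np.maximum(np.abs(a) - theta, 0.0)

def ista(y, D_y, lam, n_iter=200):
    """Approximately minimize ||y - D_y z||_2^2 + lam * ||z||_1 with ISTA."""
    # Lipschitz constant of the gradient of the quadratic term.
    L = 2.0 * np.linalg.norm(D_y, 2) ** 2
    z = np.zeros(D_y.shape[1])
    for _ in range(n_iter):
        grad = 2.0 * D_y.T @ (D_y @ z - y)
        z = soft_threshold(z - grad / L, lam / L)
    return z

def sr_patch(y, D_y, D_x, lam):
    """Recover an HR patch as D_x @ alpha, with alpha the sparse code of y."""
    alpha = ista(y, D_y, lam)
    return D_x @ alpha, alpha

# Toy data: hypothetical sizes and random unit-norm dictionaries.
rng = np.random.default_rng(0)
m_y, m_x, n = 20, 16, 40
D_y = rng.standard_normal((m_y, n))
D_y /= np.linalg.norm(D_y, axis=0)
D_x = rng.standard_normal((m_x, n))
D_x /= np.linalg.norm(D_x, axis=0)
alpha_true = np.zeros(n)
alpha_true[[3, 17, 29]] = [1.5, -2.0, 1.0]
y = D_y @ alpha_true
x_hat, alpha_hat = sr_patch(y, D_y, D_x, lam=0.05)
```

The iterative solver is the expensive part of this pipeline, which is the cost that a fixed-depth network approximation is meant to avoid.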
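In order to learn the dictionary pair ($D_y$, $D_x$), the goal is to minimize the recovery error of x and y, and thus the loss function L in [7] is defined as
$$L = \frac{1}{2}\left(\gamma \|x - D_x z\|_2^2 + (1-\gamma)\|y - D_y z\|_2^2\right), \tag{4.2}$$
where γ ($0 < \gamma \le 1$) balances the two reconstruction errors. Then the optimal dictionary pair $\{D_x^*, D_y^*\}$ can be found by minimizing the empirical expectation of (4.2) over all the training LR/HR pairs,
$$\min_{D_x, D_y} \frac{1}{N}\sum_{i=1}^{N} L(D_x, D_y, x_i, y_i) \quad \text{s.t. } z_i = \arg\min_{\alpha} \|y_i - D_y \alpha\|_2^2 + \lambda\|\alpha\|_1,\ i = 1, \dots, N;\quad \|D_x(:,k)\|_2 \le 1,\ \|D_y(:,k)\|_2 \le 1,\ k = 1, \dots, n. \tag{4.3}$$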
Since the objective function in (4.2) is highly nonconvex, the dictionary pair is usually learned alternatively while keeping one of them fixed [7]. The authors of [28,23] also incorporated patch-level self-similarity with dictionary pair learning.
There is an intimate connection between sparse coding and neural networks, which has been well studied in [29,16]. A feed-forward neural network as illustrated in Fig. 4.1 is proposed in [16] to efficiently approximate the sparse code α of input signal y as it would be obtained by solving (4.1) for a given dictionary $D_y$. The network has a finite number of recurrent stages, each of which updates the intermediate sparse code according to
$$z_{k+1} = h_\theta(W y + S z_k), \tag{4.4}$$
where $h_\theta$ is an element-wise shrinkage function defined as $[h_\theta(a)]_i = \operatorname{sign}(a_i)(|a_i| - \theta_i)_+$ with positive thresholds θ.
Different from the iterative shrinkage and thresholding algorithm (ISTA) [30,31] which finds an analytical relationship between network parameters (weights W, S and thresholds θ) and sparse coding parameters ($D_y$ and λ), the authors of [16] learn all the network parameters from training data using a back-propagation algorithm called learned ISTA (LISTA). In this way, a good approximation of the underlying sparse code can be obtained within a fixed number of recurrent stages.
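The recurrent structure can be sketched in a few lines of NumPy. This is only an illustration of the forward pass, assuming the update $z \leftarrow h_\theta(Wy + Sz)$ started from $z = 0$; the sizes and random weights are hypothetical, and in LISTA the weights would be learned by back-propagation rather than sampled.

```python
import numpy as np

def shrink(a, theta):
    """Element-wise shrinkage h_theta(a) = sign(a) * max(|a| - theta, 0)."""
    return np.sign(a) * np.maximum(np.abs(a) - theta, 0.0)

def lista_forward(y, W, S, theta, k):
    """Run k recurrent stages z <- h_theta(W y + S z), starting from z = 0."""
    b = W @ y                      # computed once and reused by every stage
    z = np.zeros(S.shape[0])
    for _ in range(k):
        z = shrink(b + S @ z, theta)
    return z

rng = np.random.default_rng(0)
m_y, n, k = 8, 16, 2               # hypothetical sizes
W = rng.standard_normal((n, m_y))
S = 0.1 * rng.standard_normal((n, n))
theta = np.full(n, 0.5)
y = rng.standard_normal(m_y)
z = lista_forward(y, W, S, theta, k)
```

Because W, S and θ are shared across stages, the cost of inference is fixed by k regardless of how well the code has converged.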
Given the fact that sparse coding can be effectively implemented with an LISTA network, it is straightforward to build a multi-layer neural network that mimics the processing flow of the sparse coding based SR method [6]. Similar to most patch-based SR methods, our sparse coding based network (SCN) takes the bicubic-upscaled LR image $I_y$ as input, and outputs the full HR image $I_x$. Fig. 4.2 shows the main network structure, and each layer is described in the following.
The input image $I_y$ first goes through a convolutional layer H which extracts features for each LR patch. There are $m_y$ filters of spatial size $s_y \times s_y$ in this layer, so that our input patch size is $s_y \times s_y$ and its feature representation y has $m_y$ dimensions.
Each LR patch y is then fed into an LISTA network with k recurrent stages to obtain its sparse code $\alpha \in \mathbb{R}^n$. Each stage of LISTA consists of two linear layers parameterized by $W \in \mathbb{R}^{n \times m_y}$ and $S \in \mathbb{R}^{n \times n}$, and a nonlinear neuron layer with activation function $h_\theta$. The activation thresholds $\theta \in \mathbb{R}^n$ are also to be updated during training, which complicates the learning algorithm. To confine all the tunable parameters to the linear layers, we use a simple trick to rewrite the activation function as
$$[h_\theta(a)]_i = \operatorname{sign}(a_i)\,\theta_i\,(|a_i|/\theta_i - 1)_+ = \theta_i\, h_1(a_i/\theta_i). \tag{4.5}$$
Equation (4.5) indicates that the original neuron with an adjustable threshold can be decomposed into two linear scaling layers and a unit-threshold neuron, as shown in Fig. 4.2(top-right). The weights of the two scaling layers are diagonal matrices defined by θ and its element-wise reciprocal, respectively.
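The decomposition can be checked numerically. The sketch below, with hypothetical random inputs, compares a shrinkage neuron with per-element thresholds against the sequence "scale by 1/θ, apply a unit-threshold neuron, scale by θ".

```python
import numpy as np

def shrink(a, theta):
    """Element-wise shrinkage sign(a) * max(|a| - theta, 0)."""
    return np.sign(a) * np.maximum(np.abs(a) - theta, 0.0)

def shrink_decomposed(a, theta):
    """Scale by 1/theta, apply the unit-threshold neuron, scale back by theta."""
    return theta * shrink(a / theta, 1.0)

rng = np.random.default_rng(0)
a = 3.0 * rng.standard_normal(8)
theta = rng.uniform(0.1, 2.0, size=8)   # thresholds must be positive
max_diff = np.max(np.abs(shrink(a, theta) - shrink_decomposed(a, theta)))
```

The two forms agree up to floating-point rounding, so the thresholds can be treated as ordinary linear-layer weights during training.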
The sparse code α is then multiplied with HR dictionary $D_x \in \mathbb{R}^{m_x \times n}$ in the next linear layer, reconstructing HR patch x of size $s_x \times s_x = m_x$.
In the final layer G, all the recovered patches are put back to the corresponding positions in the HR image $I_x$. This is realized via a convolutional filter of $m_x$ channels with spatial size $s_g \times s_g$. The size $s_g$ is determined as the number of neighboring patches that overlap with the same pixel in each spatial direction. The filter will assign appropriate weights to the overlapped recoveries from different patches and take their weighted average as the final prediction in $I_x$.
As illustrated in Fig. 4.2(bottom), after some simple reorganizations of the layer connections, the network described above has some adjacent linear layers which can be merged into a single layer. This helps to reduce the computation load and redundant parameters in the network. The layers H and G are not merged because we apply additional nonlinear normalization operations on patches y and x, which will be detailed in Sect. 4.1.6.
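One way to see why the scaling layers can be absorbed into their neighbors is to substitute $u = z/\theta$ in the recurrence. The sketch below is our own illustration of such a merge (the exact reorganization in Fig. 4.2 may differ); the sizes and random weights are hypothetical. It leaves three linear layers around unit-threshold neurons and reproduces the output of the unmerged network.

```python
import numpy as np

def h(a, theta):
    """Element-wise shrinkage with threshold theta."""
    return np.sign(a) * np.maximum(np.abs(a) - theta, 0.0)

rng = np.random.default_rng(1)
m_y, m_x, n, k = 10, 6, 12, 3           # hypothetical sizes
W = rng.standard_normal((n, m_y))
S = 0.1 * rng.standard_normal((n, n))
D_x = rng.standard_normal((m_x, n))
theta = rng.uniform(0.2, 1.0, n)
y = rng.standard_normal(m_y)

# Unmerged network: thresholded recurrence followed by the HR dictionary.
z = np.zeros(n)
for _ in range(k):
    z = h(W @ y + S @ z, theta)
x_ref = D_x @ z

# Merged network: scaling layers folded into the adjacent linear layers.
W_m = W / theta[:, None]                     # diag(1/theta) @ W
S_m = S * theta[None, :] / theta[:, None]    # diag(1/theta) @ S @ diag(theta)
D_m = D_x * theta[None, :]                   # D_x @ diag(theta)
u = np.zeros(n)
for _ in range(k):
    u = h(W_m @ y + S_m @ u, 1.0)            # unit-threshold neuron
x_merged = D_m @ u
```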
Thus, there are a total of 5 trainable layers in our network: 2 convolutional layers H and G, and 3 linear layers shown as gray boxes in Fig. 4.2. The k recurrent layers share the same weights and are therefore conceptually regarded as one. Note that all the linear layers are actually implemented as convolutional layers applied on each patch with filter spatial size of $1 \times 1$, a structure similar to the network in network [32]. Also note that all these layers have only weights but no biases (zero biases).
Mean squared error (MSE) is employed as the cost function to train the network, and our optimization objective can be expressed as
$$\min_{\Theta} \sum_i \left\| \mathrm{SCN}\big(I_y^{(i)}; \Theta\big) - I_x^{(i)} \right\|_2^2, \tag{4.6}$$
where $I_y^{(i)}$ and $I_x^{(i)}$ form the ith pair of LR/HR training data, and $\mathrm{SCN}(I_y; \Theta)$ denotes the HR image for $I_y$ predicted using the SCN model with parameter set Θ. All the parameters are optimized through the standard back-propagation algorithm. Although it is possible to use other cost terms that are more correlated with human visual perception than MSE, our experimental results show that simply minimizing MSE also leads to improvement in subjective quality.
The construction of our SCN follows exactly each step in the sparse coding based SR method [6]. If the network parameters are set according to the dictionaries learned in [6], it can reproduce almost the same results. However, after training, SCN learns a more complex regression function and can no longer be converted to an equivalent sparse coding model. The advantage of SCN comes from its ability to jointly optimize all the layer parameters from end to end; while in [6] some variables are manually designed and some are optimized individually by fixing all the others.
Technically, our network is also a CNN and it has similar layers as the CNN model proposed in [14] for patch extraction and reconstruction. The key difference is that we have an LISTA subnetwork specifically designed to enforce sparse representation prior; while in [14] a generic rectified linear unit (ReLU) [33] is used for nonlinear mapping. Since SCN is designed based on our domain knowledge in sparse coding, we are able to obtain a better interpretation of the filter responses and have a better way to initialize the filter parameters in training. We will see in the experiments that all these contribute to better SR results, faster training speed and smaller model size than a vanilla CNN.
In this section, we investigate two different network cascade techniques in order to fully exploit our SCN model in SR applications.
First, we observe that the SR results can be further improved by cascading multiple SCNs trained for the same objective in (4.6), which is inspired by the multi-pass scheme in [17]. The only difference for training these SCNs is to replace the bicubic interpolated input by its latest HR estimate, while the target output remains the same.
The first SCN acts as a function approximator to model the nonlinear mapping from the bicubic upscaled image to the ground-truth image. The following SCN acts as another function approximator, with the starting point changed to a better estimate: the output of its previous SCN.
In other words, the cascade of SCNs as a whole can be considered as a new deeper network having more powerful learning capability, which is able to better approximate the mapping from the LR inputs to the HR counterparts, and these SCNs can be trained jointly to pursue even better SR performance.
Like most SR models learned from external training examples, the SCN discussed previously can only upscale images by a fixed factor. A separate model needs to be trained for each scaling factor to achieve the best performance, which limits the flexibility and scalability in practical use. One way to overcome this difficulty is to repeatedly enlarge the image by a fixed scale until the resulting HR image reaches a desired size. This practice is commonly adopted in the self-similarity based methods [9,10,12], but is not so popular in other cases for the fear of error accumulation during repetitive upscaling.
In our case, however, it is observed that a cascade of SCNs trained for small scaling factors can generate even better SR results than a single SCN trained for a large scaling factor, especially when the target scaling factor is large (greater than 2). This is illustrated by the example in Fig. 4.3. Here an input image is magnified 4 times in two ways: with a single SCN×4 model through the processing flow (A) → (B) → (D); and with a cascade of two SCN×2 models through (A) → (C) → (E). It can be seen that the input to the second cascaded SCN×2 in (C) is already sharper and contains fewer artifacts than the bicubic×4 input to the single SCN×4 in (B), which naturally leads to a better final result in (E) than that in (D).
To get a better understanding of the above observation, we can draw a loose analogy between the SR process and a communication system. Bicubic interpolation is like a noisy channel through which an image is “transmitted” from the LR domain to HR domain. And our SCN model (or any SR algorithm) behaves as a receiver which recovers clean signals from noisy observations. A cascade of SCNs is then like a set of relay stations that enhance signal-to-noise ratio before the signal becomes too weak for further transmission. Therefore, cascading will work only when each SCN can restore enough useful information to compensate for the new artifacts it introduces as well as the magnified artifacts from previous stages.
Taking into account the two aforementioned cascade techniques, we can consider the cascade of all SCNs as a deeper network (CSCN), in which the final output of the consecutive SCNs of the same ground truth is connected to the input of the next SCN with bicubic interpolation in between. To construct the cascade, besides stacking several SCNs trained individually with respect to (4.6), we can also optimize all of them jointly as shown in Fig. 4.4. Without loss of generality, we assume that each stage in Sect. 4.1.4.2 has the same scaling factor s. Let $\hat{I}_{j,k}$ ($j, k = 1, 2, \dots$) denote the output image of the jth SCN in the kth stage upscaled by a total of $\times s^k$ times. In the same stage, each output of the SCNs is compared with the associated ground-truth image $I_k$ according to the MSE cost, leading to a multi-scale objective function:
$$\min_{\{\Theta_{j,k}\}} \sum_i \sum_j \sum_k \left\| \mathrm{SCN}\big(\hat{I}^{(i)}_{j-1,k}; \Theta_{j,k}\big) - I^{(i)}_k \right\|_2^2, \tag{4.7}$$
where i denotes the data index, and j, k denote the SCN index. For simplicity of notation, $\hat{I}_{0,k}$ specially denotes the bicubic interpolated image of the final output in the $(k-1)$th stage upscaled by a total of $\times s^{k-1}$ times. This multi-scale objective function makes full use of the supervision information in all scales, sharing a similar idea as heterogeneous networks [34]. All the layer parameters in (4.7) could be optimized from end to end by back-propagation. The SCNs that share the same training objective can be trained simultaneously, taking advantage of the merit of deep learning. For the SCNs with different training objectives, we use a greedy algorithm to train them sequentially from the beginning of the cascade, so that we do not need to compute the gradient of the bicubic layers. Applying back-propagation through a bicubic layer or its trainable surrogate will be considered in future work.
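To make the structure of the multi-scale cost concrete, the following sketch sums the squared error of every SCN output against the ground truth of its own stage. The nested-list layout `outputs[k][j]` and the function name `multiscale_mse` are hypothetical conventions chosen for illustration, and the toy arrays stand in for real network outputs.

```python
import numpy as np

def multiscale_mse(outputs, targets):
    """Sum of squared errors over every SCN output in every stage.

    outputs[k][j] is the output of the j-th SCN in stage k (hypothetical layout);
    targets[k] is the ground-truth image at the scale of stage k.
    """
    total = 0.0
    for stage_outputs, target in zip(outputs, targets):
        for out in stage_outputs:
            total += float(np.sum((out - target) ** 2))
    return total

# Toy example: two stages (x2 then x4), two SCNs per stage.
t2, t4 = np.ones((4, 4)), np.ones((8, 8))
outputs = [[t2 + 0.5, t2 + 0.1], [t4 + 0.2, t4]]
loss = multiscale_mse(outputs, [t2, t4])
```

Every intermediate output contributes a supervised term, so earlier SCNs receive a training signal of their own rather than only through the final output.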
Most recent SR works generate the LR images for both training and testing by downscaling HR images using bicubic interpolation [6,21]. However, this assumption of the forward model may not always hold in practice. For example, real LR measurements are usually blurred, or corrupted with noise. Sometimes, the LR generation mechanism may be complicated, or even unknown. We now investigate a practical SR problem, and propose two approaches to handle such non-ideal LR measurements, using the generic SCN. In the case that the underlying mechanism of the real LR generation is unclear or complicated, we propose a data-driven approach by fine-tuning the learned generic SCN with a limited number of real LR measurements, as well as their corresponding HR counterparts. On the other hand, if the real training samples are unavailable but the LR generation mechanism is clear, we formulate this inverse problem as a regularized HR image reconstruction problem which can be solved using iterative methods. The proposed methods demonstrate the robustness of our SCN model to different SR scenarios. In the following, we elaborate the details of these two approaches.
Deep learning models can be efficiently transferred from one task to another by reusing the intermediate representation in the original neural network [35]. This method has been proven successful on a number of high-level vision tasks, even if there is a limited amount of training data in the new task [36].
The success of super-resolution algorithms usually highly depends on the accuracy of the model of the imaging process. When the underlying mechanism of the generation of LR images is not clear, we can take advantage of the aforementioned merit of deep learning models by learning our model in a data-driven manner, to adapt it for a particular task. Specifically, we start training from the generic SCN model while using very limited amount of training data from a new SR scenario, and manage to adapt it to the new SR scenario and obtain promising results. In this way, it is demonstrated that the SCN has strong capability of learning complex mappings between the non-ideal LR measurements and their HR counterparts, as well as the high flexibility of adapting to various SR tasks.
The second approach considers the case that the mechanism of generating the real LR images is relatively simple and clear, indicating that training data is always available if we synthesize LR images with the known degradation process. We propose an iterative SR scheme which incorporates the generic SCN model with additional regularization based on task-related priors (e.g., the known kernel for deblurring, or the data sparsity for denoising). In this section, we specifically discuss handling blurred and noisy LR measurements in detail as examples, though the iterative SR methods can be generalized to other practical imaging models.
The real LR images can be generated with various types of blurring. Directly applying the generic SCN model is obviously not optimal. Instead, with the known blurring kernel, we propose to estimate the regularized version of the HR image $\hat{I}$ based on the directly upscaled image $\tilde{I}$ by the learned SCN as follows:
$$\hat{I} = \arg\min_{I} \|I - \tilde{I}\|_2^2, \quad \text{s.t. } D B I = I_y^0, \tag{4.8}$$
where $I_y^0$ is the original blurred LR input, and the operators B and D are blurring and subsampling, respectively. Similar to the previous work [6], we use back-projection to iteratively estimate the regularized HR input on which our model can perform better. Specifically, given the regularized estimate $\hat{I}^{i-1}$ at iteration $i-1$, we estimate a less blurred LR image $I_y^{i-1}$ by downsampling $\hat{I}^{i-1}$ using bicubic interpolation. The image $\tilde{I}^i$ upscaled by the learned SCN serves as the regularizer for the ith iteration as follows:
$$\hat{I}^i = \arg\min_{I} \|I - \tilde{I}^i\|_2^2 + \|D B I - I_y^0\|_2^2. \tag{4.9}$$
Here we use a penalty method to form an unconstrained problem. The upscaled HR image $\tilde{I}^i$ can be computed as $\mathrm{SCN}(I_y^{i-1}; \Theta)$. The same process is repeated until convergence. We have applied the proposed iterative scheme to LR images generated from Gaussian blurring and subsampling as an example. The empirical performance is illustrated in Sect. 4.1.7.
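The regularization step can be illustrated on a 1-D toy problem. The sketch below assumes the penalized objective has the form $\|I - \tilde{I}\|_2^2 + \|DBI - I_y^0\|_2^2$ with unit penalty weight, builds D and B as explicit matrices, and uses the closed-form minimizer. The signal length, blur kernel, and the noisy stand-in for the SCN output are all hypothetical; a real implementation would operate on images and would not form these matrices.

```python
import numpy as np

def blur_matrix(n, kernel):
    """Circular 1-D convolution matrix for a short symmetric kernel."""
    B = np.zeros((n, n))
    r = len(kernel) // 2
    for i in range(n):
        for t, w in enumerate(kernel):
            B[i, (i + t - r) % n] += w
    return B

def regularized_estimate(I_tilde, I_y0, A):
    """Closed-form minimizer of ||I - I_tilde||^2 + ||A I - I_y0||^2."""
    n = A.shape[1]
    return np.linalg.solve(np.eye(n) + A.T @ A, I_tilde + A.T @ I_y0)

n, s = 16, 2
B = blur_matrix(n, np.array([0.25, 0.5, 0.25]))
D = np.eye(n)[::s]                  # subsampling: keep every s-th sample
A = D @ B                           # combined degradation operator

rng = np.random.default_rng(0)
I_true = rng.standard_normal(n)
I_y0 = A @ I_true                   # observed blurred LR signal
I_tilde = I_true + 0.3 * rng.standard_normal(n)   # stand-in for the SCN output
I_hat = regularized_estimate(I_tilde, I_y0, A)

res_before = np.linalg.norm(A @ I_tilde - I_y0)
res_after = np.linalg.norm(A @ I_hat - I_y0)
```

The regularized estimate stays close to the upscaled signal while being more consistent with the observed LR data, which is the role this step plays inside each iteration.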
Noise is a ubiquitous cause of corruption in image acquisition. State-of-the-art image denoising methods usually adopt priors such as patch similarity [37], patch sparsity [38,17], or both [39], as a regularizer in image restoration. In this section, we propose a regularized noisy image upscaling scheme, for specifically handling noisy LR images, in order to obtain improved SR quality. Though any denoising algorithm can be used in our proposed scheme, here we apply spatial similarity combined with transform domain image patch group-sparsity as our regularizer [39], to form the regularized iterative SR problem as an example.
Similar to the method in Sect. 4.1.5.2, we iteratively estimate the less noisy HR image from the denoised LR image. Given the denoised LR estimate at iteration , we directly upscale it, using the learned generic SCN, to obtain the HR image . It is then downsampled using bicubic interpolation, to generate the LR image , which is used in the fidelity term in the ith iteration of LR image denoising. The same process is repeated until convergence. The iterative LR image denoising problem is formulated as follows:
where the operator generates the 3D vectorized tensor, which groups the jth overlapping patch from the LR image I, together with the spatially similar patches within its neighborhood by block matching [39]. The codes of the patch groups in the domain of 3D sparsifying transform are sparse, which is enforced by the norm penalty [40]. The weight τ controls the sparsity level, which normally depends on the remaining noise level in [41,40].
In (4.10), we use the patch group sparsity as our denoising regularizer. The 3D sparsifying transform can be one of the commonly used analytical transforms, such as the discrete cosine transform (DCT) or wavelets. The state-of-the-art BM3D denoising algorithm [39] is based on such an approach, but is further improved by more sophisticated engineering stages. In order to achieve the best practical SR quality, we demonstrate the empirical performance comparison using BM3D as the regularizer in Sect. 4.1.7. Additionally, our proposed iterative method is a general practical SR framework, which is not dedicated to SCN. One can conveniently extend it to other SR methods, which generate in the ith iteration. A performance comparison of these methods is given in Sect. 4.1.7.
We determine the number of nodes in each layer of our SCN mainly according to the corresponding settings used in sparse coding [7]. Unless otherwise stated, we use input LR patch size $s_y = 9$, LR feature dimension $m_y = 100$, dictionary size $n = 128$, output HR patch size $s_x = 5$, and patch aggregation filter size $s_g = 5$. All the convolution layers have a stride of 1. Each LR patch y is normalized by its mean and variance, and the same mean and variance are used to restore the final HR patch x. We crop $56 \times 56$ regions from each image to obtain fixed-sized input samples to the network, which produces outputs of size $44 \times 44$.
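The relation between input and output sample sizes follows from the two spatial convolutions. The short calculation below assumes, for illustration, a 56×56 input crop, a 9×9 patch extraction filter in layer H, a 5×5 aggregation filter in layer G, stride 1 and no padding; the helper name `valid_conv_out` is ours.

```python
def valid_conv_out(size, kernel, stride=1):
    """Output length of an unpadded ('valid') convolution along one axis."""
    return (size - kernel) // stride + 1

crop = 56                               # assumed input crop size
after_H = valid_conv_out(crop, 9)       # patch extraction layer H
after_G = valid_conv_out(after_H, 5)    # patch aggregation layer G
```

Under these assumptions layer H shrinks each side to 48 and layer G to 44, while the intermediate linear layers act per patch and leave the spatial size unchanged.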
To reduce the number of parameters, we implement the LR patch extraction layer H as the combination of two layers: the first layer has 4 trainable filters, each of which is shifted to 25 fixed positions by the second layer. Similarly, the patch combination layer G is also split into a fixed layer which aligns pixels in overlapping patches and a trainable layer whose weights are used to combine overlapping pixels. In this way, the number of parameters in these two layers are reduced by more than an order, and there is no observable loss in performance.
We employ a standard stochastic gradient descent algorithm to train our networks with mini-batch size of 64. Based on the understanding of each layer's role in sparse coding, we use Haar-like gradient filters to initialize layer H, and use uniform weights to initialize layer G. All the remaining three linear layers are related to the dictionary pair ($D_x$, $D_y$) in sparse coding. To initialize them, we first randomly set $D_x$ and $D_y$ with Gaussian noise, and then find the corresponding layer weights as in ISTA [30]:
$$w_1 = C \cdot D_y^T, \quad w_2 = I - D_y^T D_y, \quad w_3 = (CL)^{-1} \cdot D_x, \tag{4.11}$$
where $w_1$, $w_2$ and $w_3$ denote the weights of the three subsequent layers after layer H; L is the upper bound on the largest eigenvalue of $D_y^T D_y$, and C is the threshold value before normalization. We empirically set $L = C = 5$.
The proposed models are all trained using the CUDA ConvNet package [11] on a workstation with 12 Intel Xeon 2.67 GHz CPUs and 1 GTX680 GPU. Training an SCN usually takes less than one day. Note that this package is customized for classification networks, and its efficiency can be further optimized for our SCN model.
In testing, to make the entire image covered by output samples, we crop input samples with overlap and extend the boundary of the original image by reflection. Note that we shave the image border in the same way as [14] for objective evaluations to ensure fair comparison. Only the luminance channel is processed with our method, and bicubic interpolation is applied to the chrominance channels, as their high frequency components are less noticeable to human eyes. To achieve arbitrary scaling factors using CSCN, we upscale an image by a factor of 2 repeatedly until it is at least as large as the desired size. Bicubic interpolation is then used to downscale it to the target resolution if necessary.
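The arbitrary-scale procedure can be made concrete with a small calculation. The sketch below only computes how many ×2 passes are needed and the residual bicubic downscaling factor; it does not perform any image processing, and the function names are hypothetical.

```python
import math

def num_x2_passes(scale):
    """Number of x2 upscalings needed so the result is at least `scale` times larger."""
    if scale <= 1:
        return 0
    return math.ceil(math.log2(scale))

def cascade_plan(scale):
    """Return (number of x2 passes, residual downscale factor applied afterwards)."""
    passes = num_x2_passes(scale)
    reached = 2 ** passes
    # The residual factor is at most 1; it equals 1.0 when no downscaling is needed.
    return passes, scale / reached
```

For example, a target factor of 3 requires two ×2 passes (reaching ×4) followed by a downscale by 0.75, while a target factor of 4 requires two passes and no downscaling.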
When reporting our best results in Sect. 4.1.7.2, we also use the multi-view testing strategy commonly employed in image classification. For patch-based image SR, multi-view testing is implicitly used when predictions from multiple overlapping patches are averaged. Here, besides sampling overlapping patches, we also add more views by flipping and transposing the patch. Such a strategy is found to improve SR performance for general algorithms, purely at the cost of additional computation.
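The flip-and-transpose views can be illustrated with a short sketch. It assumes a patch is a 2D list of numbers and that `model` is any callable mapping a 2D patch to a 2D output. The name `multi_view_predict` and the helpers are hypothetical, and averaging over all eight flip/transpose combinations is one plausible reading of the strategy rather than the exact set of views used in the experiments.

```python
from itertools import product

def transpose(p):
    return [list(row) for row in zip(*p)]

def flip_lr(p):
    return [row[::-1] for row in p]

def flip_ud(p):
    return p[::-1]

def multi_view_predict(model, patch):
    """Average model outputs over the 8 flip/transpose views of a 2D patch."""
    total, count = None, 0
    for t, lr, ud in product((False, True), repeat=3):
        view = patch
        if t:
            view = transpose(view)
        if lr:
            view = flip_lr(view)
        if ud:
            view = flip_ud(view)
        out = model(view)
        # Undo the transforms in reverse order so all outputs are aligned.
        if ud:
            out = flip_ud(out)
        if lr:
            out = flip_lr(out)
        if t:
            out = transpose(out)
        if total is None:
            total = [[0.0] * len(out[0]) for _ in out]
        for i, row in enumerate(out):
            for j, v in enumerate(row):
                total[i][j] += v
        count += 1
    return [[v / count for v in row] for row in total]
```

Each view costs one additional forward pass, which is why the gain comes purely at the expense of computation.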
We evaluate and compare the performance of our models using the same data and protocols as in [21], which are commonly adopted in SR literature. All our models are learned from a training set with 91 images, and tested on Set5 [42], Set14 [43] and BSD100 [44] which contain 5, 14 and 100 images, respectively. We have also trained on other, larger data sets and observed only a marginal performance change (around 0.1 dB). The original images are downsized by bicubic interpolation to generate LR–HR image pairs for both training and evaluation. The training data are augmented with translation, rotation and scaling.
We first visualize the four filters learned in the first layer H in Fig. 4.5. The filter patterns do not change much from the initial first- and second-order gradient operators. Some additional small coefficients are introduced in a highly structured form that capture richer high frequency details.
The performance of several networks during training is measured on Set5 in Fig. 4.6. Our SCN improves significantly over sparse coding (SC) [7], as it leverages data more effectively with end-to-end training. The SCN initialized according to (4.11) can converge faster and better than the same model with random initialization, which indicates that the understanding of SCN based on sparse coding can help its optimization. We also train a CNN model [14] of the same size as SCN, but find its convergence speed much slower. It is reported in [14] that training a CNN takes back-propagations (equivalent to mini-batches here). To achieve the same performance as CNN, our SCN requires less than 1% back-propagations.
The network size of SCN is mainly determined by the dictionary size n. Besides the default value , we have tried other sizes and plot their performance versus the number of network parameters in Fig. 4.7. The PSNR of SCN does not drop too much as n decreases from 128 to 64, but the model size and computation time can be reduced significantly, as shown in Table 4.1. Fig. 4.7 also shows the performance of CNN with various sizes. Our smallest SCN can achieve higher PSNR than the largest model (CNN-L) in [45] while only using about 20% of parameters.
Table 4.1
Time consumption for SCN to upscale the “baby” image from 256 × 256 to 512 × 512 using different dictionary size n
n | 64 | 96 | 128 | 256 | 512 |
---|---|---|---|---|---|
time (s) | 0.159 | 0.192 | 0.230 | 0.445 | 1.214 |
Different numbers of recurrent stages k have been tested for SCN, and we find increasing k from 1 to 3 only improves performance by less than 0.1 dB. As a tradeoff between speed and accuracy, we use throughout the section.
In Table 4.2, different network structures with cascade for scalable SR in Sect. 4.1.4.2 (in each row) are compared at different scaling factors (in each column). SCN×a denotes the model trained with fixed scaling factor a without any cascade technique. For a fixed a, we use SCN×a as a basic module and apply it one or more times to super-resolve images for different upscaling factors, which is shown in each row of Table 4.2. It is observed that SCN×2 can perform as well as the scale-specific model for small scaling factor (1.5), and much better for large scaling factors (3 and 4). Note that the cascade of SCN×1.5 does not lead to good results since artifacts quickly get amplified through many repetitive upscalings. Therefore, we use SCN×2 as the default building block for CSCN, and drop the notation ×2 when there is no ambiguity. The last row in Table 4.2 shows that a CSCN trained using the multi-scale objective in (4.7) can further improve the SR results for scaling factors 3 and 4, as the second SCN in the cascade is trained to be robust to the artifacts generated by the first one.
Table 4.2
PSNR of different network cascading schemes on Set5, evaluated for different scaling factors in each column
scaling factor | ×1.5 | ×2 | ×3 | ×4 |
---|---|---|---|---|
SCN×1.5 | 40.14 | 36.41 | 30.33 | 29.02 |
SCN×2 | 40.15 | 36.93 | 32.99 | 30.70 |
SCN×3 | 39.88 | 36.76 | 32.87 | 30.63 |
SCN×4 | 39.69 | 36.54 | 32.76 | 30.55 |
CSCN | 40.15 | 36.93 | 33.10 | 30.86 |
As shown in [45], the amount of training data plays an important role in the field of deep learning. In order to evaluate the effect of various amounts of data on training CSCN, we change the training set from a relatively small set of 91 images (Set91) [21] to two other sets: 199 of the 200 training images in the BSD500 dataset (BSD200) [44], and a subset of 7500 images from the ILSVRC2013 dataset [71]. A model of exactly the same architecture without any cascade is trained on each data set, and another 100 images from the ILSVRC2013 dataset are included as an additional test set. From Table 4.3, we can observe that the CSCN trained on BSD200 consistently outperforms its counterpart trained on Set91 by around 0.1 dB on all test data. However, the performance of the model trained on ILSVRC2013 is only slightly different from the one trained on BSD200, which shows the saturation of the performance as the amount of training data increases. The inferior quality of images in ILSVRC2013 may be a hurdle to further improving the performance. Therefore, our method is robust to training data and can benefit marginally from a larger set of training images.
Table 4.3
Effect of various training sets on the PSNR of ×2 upscaling with single view SCN. Each row is a training set and each column is a test set
Training Set | Set5 | Set14 | BSD100 | ILSVRC (100) |
---|---|---|---|---|
Set91 | 36.93 | 32.56 | 31.40 | 32.13 |
BSD200 | 36.97 | 32.69 | 31.55 | 32.27 |
ILSVRC (7.5k) | 36.84 | 32.67 | 31.51 | 32.31 |
We compare the proposed CSCN with other recent SR methods on all the images in Set5, Set14 and BSD100 for different scaling factors. Table 4.4 shows the PSNR and structural similarity (SSIM) [46] for adjusted anchored neighborhood regression (A+) [25], CNN [14], CNN trained with larger model size and much more data (CNN-L) [45], the proposed CSCN, and CSCN with our multi-view testing (CSCN-MV). We do not list other methods [7,21,43,47,24] whose performance is worse than A+ or CNN-L.
Table 4.4
PSNR (SSIM) comparison on three test data sets among different methods. indicates the best and indicates the second best performance. The performance gain of our best model over all the others' best is shown in the last row. (For interpretation of the colors in the tables, the reader is referred to the web version of this chapter)
Method | Set5 ×2 | Set5 ×3 | Set5 ×4 | Set14 ×2 | Set14 ×3 | Set14 ×4 | BSD100 ×2 | BSD100 ×3 | BSD100 ×4 |
---|---|---|---|---|---|---|---|---|---|
A+ [25] | 36.55 (0.9544) | 32.59 (0.9088) | 30.29 (0.8603) | 32.28 (0.9056) | 29.13 (0.8188) | 27.33 (0.7491) | 30.78 (0.8773) | 28.18 (0.7808) | 26.77 (0.7085) |
CNN [14] | 36.34 (0.9521) | 32.39 (0.9033) | 30.09 (0.8530) | 32.18 (0.9039) | 29.00 (0.8145) | 27.20 (0.7413) | 31.11 (0.8835) | 28.20 (0.7794) | 26.70 (0.7018) |
CNN-L [45] | 36.66 (0.9542) | 32.75 (0.9090) | 30.49 (0.8628) | 32.45 (0.9067) | 29.30 (0.8215) | 27.50 (0.7513) | 31.36 (0.8879) | 28.41 (0.7863) | 26.90 (0.7103) |
CSCN | | | | | | | | | |
CSCN-MV | | | | | | | | | |
Our Improvement | 0.55 (0.0029) | 0.59 (0.0083) | 0.65 (0.0161) | 0.35 (0.0034) | 0.27 (0.0048) | 0.31 (0.0106) | 0.24 (0.0036) | 0.19 (0.0042) | 0.24 (0.0088) |
It can be seen from Table 4.4 that CSCN performs consistently better than all previous methods in both PSNR and SSIM, and with multi-view testing the results can be further improved. CNN-L improves over CNN by increasing model parameters and training data. However, it is still not as good as CSCN, which has a much smaller model size and is trained on a much smaller data set. Clearly, the better model structure of CSCN makes it less dependent on model capacity and training data in improving performance. Our models are generally more advantageous for large scaling factors due to the cascade structure. A larger performance gain is observed on Set5 than on the other two test sets because the statistics of Set5 are more similar to those of the training set.
The visual qualities of the SR results generated by sparse coding (SC) [7], CNN and CSCN are compared in Fig. 4.8. Our approach produces image patterns with sharper boundaries and richer textures, and is free of the ringing artifacts observable in the other two methods.
Fig. 4.9 shows the SR results on the “chip” image compared among more methods including the self-example based method (SE) [10] and the deep network cascade (DNC) [12]. SE and DNC can generate very sharp edges on this image, but also introduce artifacts and blurs on corners and fine structures due to the lack of self-similar patches. On the contrary, the CSCN method recovers all the structures of the characters without any distortion.
We evaluate the performance of the proposed practical SR methods in Sect. 4.1.5, by providing the empirical results of several experiments for the two aforementioned approaches.
The proposed method in Sect. 4.1.5.1 is data-driven, and thus the generic SCN can be easily adapted for a particular task, with a small amount of training samples. We demonstrate the performance of this method in the application of enlarging low-DPI scanned document images with heavy noise. We first obtain several pairs of LR and HR images by scanning a document under two settings of 150 and 300 DPI. Then we fine-tune our generic CSCN model using only one pair of scanned images for a few iterations. Fig. 4.11 illustrates the visualization of the upscaled image from the 150 DPI scanned image. As shown by the SR results in Fig. 4.11, the CSCN before adaptation is very sensitive to LR measurement corruption, so the enlarged texts in (B) are much more corrupted than they are in the nearest neighbor upscaled image (A). However, the adapted CSCN model removes almost all the artifacts and can restore clear texts in (C), which is promising for practical applications such as quality enhancement of online scanned books and restoration of legacy documents.
We now show experimental results of practical SR for blurred and noisy LR images, using the proposed regularized iterative methods in Sect. 4.1.5.2. We first compare the SR performance on blurry images using the proposed method in Sect. 4.1.5.2 with several other recent methods [50,48,49], using the same test images and settings. All these methods are designed for blurry LR input, while our model is trained on sharp LR input. As shown in Table 4.5, our model achieves much better results than the competitors. Note the speed of our model is also much faster than the conventional sparse coding based methods.
Table 4.5
PSNR of ×3 upscaling on LR images with different blurring kernels
Kernel | Gaussian | | | Gaussian | | |
---|---|---|---|---|---|---|
Method | CSR [48] | NLM [49] | SCN | CSR [48] | GSC [50] | SCN |
Butterfly | 27.87 | 26.93 | 28.70 | 28.19 | 25.48 | 29.03 |
Parrots | 30.17 | 29.93 | 30.75 | 30.68 | 29.20 | 30.83 |
Parthenon | 26.89 | – | 27.06 | 27.23 | 26.44 | 27.40 |
Bike | 24.41 | 24.38 | 24.81 | 24.72 | 23.78 | 25.11 |
Flower | 29.14 | 28.86 | 29.50 | 29.54 | 28.30 | 29.78 |
Girl | 33.59 | 33.44 | 33.57 | 33.68 | 33.13 | 33.65 |
Hat | 31.09 | 30.81 | 31.32 | 31.33 | 30.29 | 31.62 |
Leaves | 26.99 | 26.47 | 27.45 | 27.60 | 24.78 | 27.87 |
Plants | 33.92 | 33.27 | 34.35 | 34.00 | 32.33 | 34.53 |
Raccoon | 29.09 | – | 28.99 | 29.29 | 28.81 | 29.16 |
Average | 29.32 | 29.26 | 29.65 | 29.63 | 28.25 | 29.90 |
To test the performance of upscaling noisy LR images, we simulate additive Gaussian noise for the LR input images at 4 different noise levels (σ = 5, 10, 15, 20) as the noisy input images. We compare the practical SR results in Set5 obtained from the following algorithms: directly using SCN, our proposed iterative SCN method using BM3D as denoising regularizer (iterative BM3D-SCN), and fine-tuning SCN with additional noisy training pairs. Note that knowing the underlying corruption model of a real LR image (e.g., noise distribution or blurring kernel), one can always synthesize real training pairs for fine-tuning the generic SCN. In other words, once the iterative SR method is feasible, one can always apply our proposed data-driven method for SR alternatively. However, the converse is not true. Therefore, the knowledge of the corruption model of real measurements can be considered as a stronger assumption, compared to providing real training image pairs. Correspondingly, the SR performances of these two methods are evaluated when both can be applied. We also provide the results of methods directly using another generic SR model: CNN-L [45], and the similar iterative SR method involving CNN-L (iterative BM3D-CNN-L).
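The noise simulation can be sketched in a few lines. This is a minimal illustration, assuming pixel intensities on a 0 to 255 scale stored as a flat list; the function name and the optional `seed` argument are hypothetical, and no clipping or rounding is applied because the text does not say whether it was used.

```python
import random

def add_gaussian_noise(pixels, sigma, seed=None):
    """Add zero-mean Gaussian noise with standard deviation `sigma` to each pixel."""
    rng = random.Random(seed)
    return [p + rng.gauss(0.0, sigma) for p in pixels]

# The four noise levels correspond to the sigma columns of Table 4.6.
noise_levels = [5, 10, 15, 20]
```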
The practical SR results are listed in Table 4.6. We observe improved PSNR using our proposed regularized iterative SR method over all noise levels. The proposed iterative BM3D-SCN achieves much higher PSNR than the method of directly using SCN. The performance gap (in terms of SR PSNR) between iterative BM3D-SCN and direct SCN becomes larger as the noise level increases. A similar observation can be made when comparing iterative BM3D-CNN-L and direct CNN-L. Compared to the method of fine-tuning SCN, the iterative BM3D-SCN method demonstrates better empirical performance, with about 0.3 dB improvement on average. The iterative BM3D-CNN-L method provides results comparable to those of the iterative BM3D-SCN method, which demonstrates that our proposed regularized iterative SCN scheme can be easily extended to other SR methods, and is able to effectively handle noisy LR measurements.
Table 4.6
PSNR values for ×2 upscaling noisy LR images in Set5 by directly using SCN (Direct SCN), directly using CNN-L (Direct CNN-L), SCN after fine-tuning on new noisy training data (Fine-tuning SCN), the iterative method of BM3D & SCN (Iterative BM3D-SCN), and the iterative method of BM3D & CNN-L (Iterative BM3D-CNN-L)
σ | 5 | 10 | 15 | 20 |
---|---|---|---|---|
Direct SCN | 30.23 | 25.11 | 21.81 | 19.45 |
Direct CNN-L | 30.47 | 25.32 | 21.91 | 19.46 |
Fine-tuning SCN | 33.03 | 31.00 | 29.46 | 28.44 |
Iterative BM3D-SCN | 33.51 | 31.22 | 29.65 | 28.61 |
Iterative BM3D-CNN-L | 33.42 | 31.16 | 29.62 | 28.59 |
An example of upscaling noisy LR images using the aforementioned methods is demonstrated in Fig. 4.10. Both fine-tuning SCN and iterative BM3D-SCN are able to significantly suppress the additive noise, while many artifacts induced by noise are observed in the SR result of direct SCN. It is notable that the fine-tuning SCN method performs better at recovering texture, while the iterative BM3D-SCN method is preferable in smooth regions.
Beyond quantitative evaluation, subjective perception is an important metric for evaluating SR techniques for commercial use. In order to more thoroughly compare various SR methods and quantify the subjective perception, we utilize an online platform for subjective evaluation of SR results from several methods [23], including bicubic, SC [7], SE [10], self-example regression (SER) [51], CNN [14] and CSCN. Each participant is invited to conduct several pair-wise comparisons of SR results from different methods. The two SR methods whose results are displayed in each pair are randomly selected. Ground-truth HR images are also included when they are available as references. For each pair, the participant needs to select the better one in terms of perceptual quality. A snapshot of our evaluation webpage is shown in Fig. 4.12.
Specifically, there are SR results over 6 images with different scaling factors: “kid”×4, “chip”×4, “statue”×4, “lion”×3, “temple”×3 and “train”×3. The images are shown in Fig. 4.13. All the visual comparison results are then summarized into a winning matrix for 7 methods (including ground truth). A Bradley–Terry [52] model is fitted based on these results and the subjective score is estimated for each method according to this model. In the Bradley–Terry model, the probability that an object X is favored over Y is assumed to be
where and are the subjective scores for X and Y. The scores s for all the objects can be jointly estimated by maximizing the log-likelihood of the pairwise comparison observations:
where is the th element in the winning matrix W, meaning the number of times when method i is favored over method j. We use the Newton–Raphson method to solve Eq. (4.13) and set the score for ground truth method as 1 to avoid the scale ambiguity.
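The score estimation can be sketched in code under stated assumptions. The sketch below uses the common logistic parametrization of the Bradley–Terry model, in which X is preferred over Y with probability 1 / (1 + exp(s_Y − s_X)); this parametrization is an assumption here and may differ in detail from the one used in the experiments. For brevity it maximizes the log-likelihood with plain gradient ascent rather than the Newton–Raphson method mentioned above, and it keeps the score of a reference method (standing in for the ground truth) fixed at 1. The function names, the step size and the iteration count are hypothetical.

```python
import math

def bt_prob(s_x, s_y):
    """Probability that X is preferred over Y under the assumed logistic form."""
    return 1.0 / (1.0 + math.exp(s_y - s_x))

def fit_bradley_terry(wins, ref=0, ref_score=1.0, lr=0.01, iters=5000):
    """Estimate scores from a winning matrix; wins[i][j] counts i preferred over j."""
    n = len(wins)
    s = [ref_score] * n
    for _ in range(iters):
        grad = [0.0] * n
        for i in range(n):
            for j in range(n):
                if i == j:
                    continue
                p = bt_prob(s[i], s[j])
                # Derivative of the log-likelihood with respect to s[i].
                grad[i] += wins[i][j] * (1.0 - p) - wins[j][i] * p
        for i in range(n):
            if i != ref:  # the reference score stays fixed to remove the ambiguity
                s[i] += lr * grad[i]
    return s
```

With a two-method winning matrix in which the reference wins 3 of 4 comparisons, the fitted score difference approaches log 3, as expected from the closed-form maximum-likelihood solution.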
Now we describe the detailed experiment results. We have a total of 270 participants giving 720 pairwise comparisons over six images with different scaling factors, which are shown in Fig. 4.13. Not every participant completed all the comparisons but their partial responses are still useful.
Fig. 4.14 shows the estimated scores for the six SR methods in our evaluation, with the score for ground truth method normalized to 1. As expected, all the SR methods have much lower scores than ground truth, showing the great challenge in the SR problem. The bicubic interpolation is significantly worse than other SR methods. The proposed CSCN method outperforms other previous state-of-the-art methods by a large margin, demonstrating its superior visual quality. It should be noted that the visual difference between some image pairs is very subtle. Nevertheless, the human subjects are able to perceive such difference when seeing the two images side by side, and therefore make consistent ratings. The CNN model becomes less competitive in the subjective evaluation than it is in PSNR comparison. This indicates that the visually appealing image appearance produced by CSCN should be attributed to the regularization from sparse representation, which cannot be easily learned by merely minimizing reconstruction error as in CNN.
We propose a new approach for image SR by combining the strengths of sparse representation and deep networks, and make considerable improvement over existing deep and shallow SR models both quantitatively and qualitatively. Besides producing outstanding SR results, the domain knowledge in the form of sparse coding can also benefit training speed and model compactness. Furthermore, we investigate the cascade of networks for both fixed and incremental scaling factors so as to enhance SR performance. In addition, the robustness to real SR scenarios is discussed for handling non-ideal LR measurements. More generally, our observation is in line with other recent extensions made to CNN with better domain knowledge for different tasks.
In future work, we will apply the SCN model to other problems where sparse coding can be useful. The interaction between deep networks for low- and high-level vision tasks, such as in [53], will also be explored. Another interesting direction to explore is video super-resolution [54], which is the task of inferring a high-resolution video sequence from a low-resolution one. This problem has drawn growing attention in both the research community and industry recently. From the research perspective, this problem is challenging because video signals vary in both temporal and spatial dimensions. In the meantime, with the prevalence of high-definition (HD) displays such as HDTV in the market, there is an increasing need for converting low quality video sequences to high definition so that they can be played on HD displays in a visually pleasant manner.
There are two types of relation that are utilized for video SR: the intra-frame spatial relation and the inter-frame temporal relation. Neural network based models have successfully demonstrated the strong capability of modeling the spatial relation. Compared with the intra-frame spatial relation, the inter-frame temporal relation is more important for video SR, as research on vision systems suggests that the human vision system is more sensitive to motion [55]. Thus it is essential for a video SR algorithm to capture and model the effect of motion information on visual perception. Sparse priors have been shown to be useful for video SR [56]. In the future, we will try employing the sparse coding domain knowledge in deep network models to utilize the temporal relation among consecutive LR video frames.
The main difficulty of single image SR resides in the loss of much information in the degradation process. Since the known variables from the LR image are usually greatly outnumbered by the unknown variables of the HR image, the problem is highly ill-posed.
A large number of single image SR methods have been proposed in the literature, including interpolation based methods [57], edge model based methods [4] and example based methods [58,9,6,21,45,24]. Since the former two usually suffer a sharp drop in restoration performance at large upscaling factors, example based methods have drawn great attention from the community recently. They usually learn the mapping from LR images to HR images in a patch-by-patch manner, with the help of sparse representation [6,23], random forest [26] and so on. The neighbor embedding method [58,21] and the neural network based method [45] are two representatives of this category.
Neighbor embedding is proposed in [58,42] which estimates HR patches as a weighted average of local neighbors with the same weights as in LR feature space, based on the assumption that LR/HR patch pairs share similar local geometry in low-dimensional nonlinear manifolds. The coding coefficients are first acquired by representing each LR patch as a weighted average of local neighbors, and then the HR counterpart is estimated by the multiplication of the coding coefficients with the corresponding training HR patches. Anchored neighborhood regression (ANR) is utilized in [21] to improve the neighbor embedding methods, which partitions the feature space into a number of clusters using the learned dictionary atoms as a set of anchor points. A regressor is then learned for each cluster of patches. This approach has demonstrated superiority over the counterpart of global regression in [21]. Other variants of learning a mixture of SR regressors can be found in [25,59,60].
Recently, neural network based models have demonstrated a strong capability for single image SR [12,45,61], due to their large model capacity and the end-to-end learning strategy, which removes the need for hand-crafted features.
In this section, we propose a method to combine the merits of the neighborhood embedding methods and the neural network based methods via learning a mixture of neural networks for single image SR. The entire image signal space can be partitioned into several subspaces, and we dedicate one SR module to the image signals in each subspace, the synergy of which allows for a better capture of the complex relation between the LR image signal and its HR counterpart than the generic model. In order to take advantage of the end-to-end learning strategy of neural network based methods, we choose neural networks as the SR inference modules and incorporate these modules into one unified network, and design a branch in the network to predict the pixel-level weights for HR estimates from each SR module before they are adaptively aggregated to form the final HR image.
A systematic analysis of different network architectures is conducted with the focus on the relation between SR performance and various network architectures via extensive experiments, where the benefit of utilizing a mixture of SR models is demonstrated. Our proposed approach is contrasted with other current popular approaches on a large number of test images, and achieves state-of-the-art performance consistently along with more flexibility of model design choices.
The section is organized as follows. The proposed method is introduced and explained in detail in Sect. 4.2.2. Implementation details are provided in Sect. 4.2.3. Section 4.2.4 describes our experimental results, in which we analyze thoroughly different network architectures and compare the performance of our method with other current SR methods both quantitatively and qualitatively. Finally, in Sect. 4.2.5, we conclude the section and discuss the future work. The more detailed version of this work can be found in [62].
First we give an overview of our method. The LR image serves as the input to our method. There are several SR inference modules in our method. Each of them, , is dedicated to inferring a certain class of image patches, and applied on the LR input image to predict an HR estimate. We also devise an adaptive weight module, T, to adaptively combine at the pixel-level the HR estimates from SR inference modules. When we select neural networks as the SR inference modules, all the components can be incorporated into a unified neural network and jointly learned. The final estimated HR image is adaptively aggregated from the estimates of all SR inference modules. Thanks to the multi-branch design of our network, the super-resolution performance improves over that of its single-branch counterpart, as will be shown in Sect. 4.2.4. The overview of our method is shown in Fig. 4.15.
Now we will introduce the network architecture in detail.
SR Inference Module. Taking the LR image as input, each SR inference module is designed to better capture the complex relation between a certain class of LR image signals and its HR counterpart, while predicting an HR estimate. For the sake of inference accuracy, we choose as the SR inference module a recent sparse coding based network (SCN) in [61], which implicitly incorporates the sparse prior into neural networks via employing the learned iterative shrinkage and thresholding algorithm (LISTA), and closely mimics the sparse coding based image SR method [7]. The architecture of SCN is shown in Fig. 4.2. Note that the design of the SR inference module is not limited to SCN, and all other neural network based SR models, e.g., SRCNN [45], can work as the SR inference module as well. The output of serves as an estimate to the final HR frame.
Adaptive Weight Module. The goal of this module is to model the selectivity of the HR estimates from every SR inference module. We propose assigning pixel-wise aggregation weights of each HR estimate, and again the design of this module is open to any operation in the field of neural networks. Taking into account the computation cost and efficiency, we utilize only three convolutional layers for this module, and ReLU is applied on the filter responses to introduce nonlinearity. This module finally outputs the pixel-level weight maps for all the HR estimates.
Aggregation. Each SR inference module's output is multiplied pixel-wise by its corresponding weight map from the adaptive weight module, and then these products are summed up to form the final estimated HR frame. If we use y to denote the LR input image, a function with parameters to represent the behavior of the adaptive weight module, and a function with parameters to represent the output of SR inference module , the final estimated HR image can be expressed as
where ⊙ denotes the point-wise multiplication.
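The aggregation step can be illustrated with a minimal sketch, assuming each HR estimate and each weight map is a 2D list of the same size. The function name `aggregate` is hypothetical, and the weights are used as given; whether they are normalized across modules is not assumed here.

```python
def aggregate(estimates, weight_maps):
    """Pixel-wise weighted sum of N HR estimates using N weight maps."""
    if len(estimates) != len(weight_maps):
        raise ValueError("need one weight map per HR estimate")
    height, width = len(estimates[0]), len(estimates[0][0])
    out = [[0.0] * width for _ in range(height)]
    for est, wmap in zip(estimates, weight_maps):
        for r in range(height):
            for c in range(width):
                # Point-wise product of a weight map and its HR estimate.
                out[r][c] += wmap[r][c] * est[r][c]
    return out
```

In the unified network, both the estimates and the weight maps are functions of the same LR input, so the whole expression remains differentiable and can be trained end to end.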
In training, our model tries to minimize the loss between the target HR frame and the predicted output, as
where represents the output of our model, is the jth HR image and is the corresponding LR image; Θ is the set of all parameters in our model.
If we plug Eq. (4.14) into Eq. (4.15), then the cost function can be expanded as:
We conduct experiments following the protocols in [21]. Different learning based methods use different training data in the literature. We choose 91 images proposed in [6] to be consistent with [25,26,61]. These training data are augmented with translation, rotation and scaling, providing approximately 8 million training samples of pixels.
Our model is tested on three benchmark data sets, which are Set5 [42], Set14 [43] and BSD100 [44]. The ground-truth images are downscaled by bicubic interpolation to generate LR–HR image pairs for both training and testing.
Following the convention in [21,61], we convert each color image into the YCbCr colorspace and only process the luminance channel with our model; bicubic interpolation is applied to the chrominance channels, because the human visual system is more sensitive to details in intensity than in color.
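As an illustration of the color conversion, the sketch below uses the full-range BT.601 coefficients of the JPEG/JFIF convention for 8-bit values. This choice is an assumption: the exact YCbCr variant used in the experiments is not stated, and other variants (for example, a limited-range conversion) yield slightly different luminance values. The function name is hypothetical.

```python
def rgb_to_ycbcr(r, g, b):
    """Full-range BT.601 (JFIF) conversion for 8-bit RGB values."""
    y = 0.299 * r + 0.587 * g + 0.114 * b
    cb = 128.0 - 0.168736 * r - 0.331264 * g + 0.5 * b
    cr = 128.0 + 0.5 * r - 0.418688 * g - 0.081312 * b
    return y, cb, cr
```

Only the first returned channel would be passed to the SR model; the two chrominance channels would be upscaled with bicubic interpolation.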
Each SR inference module adopts the network architecture of SCN, while the filters of all three convolutional layers in the adaptive weight module have the spatial size of and the numbers of filters of three layers are set to be and N, which is the number of SR inference modules.
Our network is trained on a machine with 12 Intel Xeon 2.67 GHz CPUs and one Nvidia TITAN X GPU. For the adaptive weight module, we employ a constant learning rate of and initialize the weights from Gaussian distribution, while we stick to the learning rate and the initialization method in [61] for the SR inference modules. The standard gradient descent algorithm is employed to train our network with a batch size of 64 and the momentum of 0.9.
We train our model for the upscaling factor of 2. For larger upscaling factors, we adopt the model cascade technique in [61] and apply ×2 models several times until the resulting image is at least as large as the desired size. The resulting image is downsized via bicubic interpolation to the target resolution if necessary.
In this section, we first analyze the architecture of our proposed model and then compare it with several other recent SR methods. Finally, we provide a runtime analysis of our approach and other competing methods.
In this section we investigate the relation between various numbers of SR inference modules and SR performance. For the sake of our analysis, we increase the number of inference modules as we decrease the module capacity of each of them, so that the total model capacity is approximately consistent and thus the comparison is fair. Since the chosen SR inference module, SCN [61], closely mimics the sparse coding based SR method, we can reduce the module capacity of each inference module by decreasing the embedded dictionary size n (i.e., the number of filters in SCN) for sparse representation. We compare the following cases:
The average Peak Signal-to-Noise Ratio (PSNR) and structural similarity (SSIM) [46] are measured to quantitatively compare the SR performance of these models over Set5, Set14 and BSD100 for various upscaling factors (×2, ×3 and ×4), and the results are displayed in Table 4.7.
Table 4.7
PSNR (in dB) / SSIM comparisons on Set5, Set14 and BSD100 for ×2, ×3 and ×4 upscaling factors among various network architectures
Benchmark | Scale | SCN | MSCN-2 | MSCN-4 |
---|---|---|---|---|
Set5 | ×2 | 36.93 / 0.9552 | / | / |
Set5 | ×3 | 33.10 / | / | / 0.9130 |
Set5 | ×4 | 30.86 / | / 0.8709 | / |
Set14 | ×2 | 32.56 / 0.9069 | / | / |
Set14 | ×3 | 29.41 / 0.8235 | / | / |
Set14 | ×4 | 27.64 / 0.7578 | / | / |
BSD100 | ×2 | 31.40 / 0.8884 | / | / |
BSD100 | ×3 | 28.50 / 0.7885 | / | / |
BSD100 | ×4 | 27.03 / 0.7161 | / | / |
It can be observed that MSCN-2 usually outperforms the original SCN network, and that MSCN-4 achieves the best SR performance, improving marginally over MSCN-2. This demonstrates the effectiveness of our approach, namely that each SR inference module is able to super-resolve its own class of image signals better than one single generic inference model.
To further analyze the adaptive weight module, we select several input images, namely butterfly, zebra and barbara, and visualize the four weight maps, one for each SR inference module in the network. Moreover, we record the index of the maximum weight across all weight maps at every pixel and generate a max label map. These results are displayed in Fig. 4.16.
From these visualizations it can be seen that weight map 4 shows high responses in many uniform regions, and thus mainly contributes to the low frequency regions of the HR predictions. In contrast, weight maps 1, 2 and 3 have large responses in regions with various edges and textures, and restore the high frequency details of the HR predictions. These weight maps reveal that the sub-networks work in a complementary manner to construct the final HR predictions. In the max label map, similar structures and patterns usually share the same label, indicating that such similar textures and patterns tend to be super-resolved by the same inference module.
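The pixel-level aggregation and the max label map can be sketched as follows. This is an illustration under stated assumptions rather than the exact implementation: the final HR prediction is taken to be a plain pixel-wise weighted sum of the module outputs (no normalization of the weights is assumed), images and weight maps are 2-D lists of equal size, and labels are numbered from 1 to match the weight map numbering used above. The function names are hypothetical.

```python
def aggregate(hr_estimates, weight_maps):
    """Pixel-wise weighted sum of N HR estimates using N weight maps."""
    height, width = len(hr_estimates[0]), len(hr_estimates[0][0])
    return [
        [
            sum(w[y][x] * f[y][x] for f, w in zip(hr_estimates, weight_maps))
            for x in range(width)
        ]
        for y in range(height)
    ]


def max_label_map(weight_maps):
    """At each pixel, the 1-based index of the weight map with the largest weight."""
    height, width = len(weight_maps[0]), len(weight_maps[0][0])
    return [
        [
            1 + max(range(len(weight_maps)), key=lambda i: weight_maps[i][y][x])
            for x in range(width)
        ]
        for y in range(height)
    ]
```

When two weight maps tie at a pixel, this sketch assigns the lower index; how ties are resolved in the visualization is not stated in the text.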
We conduct experiments on all the images in Set5, Set14 and BSD100 for different upscaling factors (×2, ×3 and ×4) to compare our approach quantitatively and qualitatively with a number of state-of-the-art image SR methods. Table 4.8 shows the PSNR and SSIM for adjusted anchored neighborhood regression (A+) [25], SRCNN [45], RFL [26], SelfEx [24] and our proposed model, MSCN-4, which consists of four SCN modules. The single generic SCN without multi-view testing in [61] is also included as the baseline. Note that all the methods use the same 91 images [6] for training except SRCNN [45], which uses 395,909 images from ImageNet as training data.
Table 4.8
PSNR (SSIM) comparison on three test data sets for various upscaling factors among different methods. The performance gain of our best model over all the other models' best is shown in the last row
Method | Set5 ×2 | Set5 ×3 | Set5 ×4 | Set14 ×2 | Set14 ×3 | Set14 ×4 | BSD100 ×2 | BSD100 ×3 | BSD100 ×4 |
---|---|---|---|---|---|---|---|---|---|
A+ [25] | 36.55 (0.9544) | 32.59 (0.9088) | 30.29 (0.8603) | 32.28 (0.9056) | 29.13 (0.8188) | 27.33 (0.7491) | 31.21 (0.8863) | 28.29 (0.7835) | 26.82 (0.7087) |
SRCNN [45] | 36.66 (0.9542) | 32.75 (0.9090) | 30.49 (0.8628) | 32.45 (0.9067) | 29.30 (0.8215) | 27.50 (0.7513) | 31.36 (0.8879) | 28.41 (0.7863) | 26.90 (0.7103) |
RFL [26] | 36.54 (0.9537) | 32.43 (0.9057) | 30.14 (0.8548) | 32.26 (0.9040) | 29.05 (0.8164) | 27.24 (0.7451) | 31.16 (0.8840) | 28.22 (0.7806) | 26.75 (0.7054) |
SelfEx [24] | 36.49 (0.9537) | 32.58 (0.9093) | 30.31 (0.8619) | 32.22 (0.9034) | 29.16 (0.8196) | 27.40 (0.7518) | 31.18 (0.8855) | 28.29 (0.7840) | 26.84 (0.7106) |
SCN [61] | | | | | | | | | |
MSCN-4 | | | | | | | | | |
Our improvement | 0.23 (0.0013) | 0.23 (0.0011) | 0.22 (0.0008) | 0.29 (0.0010) | 0.24 (0.0034) | 0.23 (0.0046) | 0.25 (0.0044) | 0.16 (0.0056) | 0.16 (0.0068) |
It can be observed that our proposed model consistently achieves the best SR performance over the three data sets for all upscaling factors. It outperforms SCN, which obtains the second best results, by about 0.2 dB across all the data sets, owing to the power of multiple inference modules.
We compare the visual quality of SR results among various methods in Fig. 4.17. The region inside the bounding box is zoomed in and shown for the sake of visual comparison. Our proposed model MSCN-4 is able to recover sharper edges and generate fewer artifacts in the SR inferences.
Besides SR performance, inference time is an important factor for SR algorithms. The relation between the SR performance and the inference time of our approach is analyzed in this section. Specifically, we measure the average inference time of different network structures in our method for upscaling factor ×2 on Set14. The inference time versus the PSNR values is displayed in Fig. 4.18, where several other current SR methods [24,26,45,25] are included as reference (the inference time of SRCNN is taken from the publicly available, slower CPU implementation). We can see that, generally, the more modules our network has, the more inference time is needed and the better the SR results are. By adjusting the number of SR inference modules in our network structure, we can achieve a tradeoff between SR performance and computational complexity. Nevertheless, even our slowest network remains superior to the other previous SR methods in terms of inference time.
In this section, we propose to jointly learn a mixture of deep networks for single image super-resolution, each of which serves as an SR inference module to handle a certain class of image signals. An adaptive weight module is designed to predict pixel-level aggregation weights of the HR estimates. Various network architectures are analyzed in terms of SR performance and inference time, which validates the effectiveness of our proposed model design. Extensive experiments demonstrate that our proposed model achieves outstanding SR performance while offering more flexibility of design.
Recent SR approaches increase the network depth in order to boost SR accuracy [63–67]. Kim et al. [63] proposed a very deep CNN with residual architecture to achieve outstanding SR performance, which utilizes broader contextual information with larger model capacity. Another network designed by Kim et al. [64] uses a recursive architecture with skip connections to boost image SR performance while exploiting only a small number of model parameters. Tai et al. [65] observed that many residual SR learning algorithms are based on either global residual learning or local residual learning, which are insufficient for very deep models. Instead, they proposed a model that applies both global and local learning while remaining parameter efficient via recursive learning. More recently, Tong et al. [67] proposed making use of Densely Connected Networks (DenseNet) [68] instead of ResNet [69] as the building block for image SR. Besides developing deeper networks, we show that increasing the number of parallel branches inside the network can achieve the same goal.
In the future, this approach to image super-resolution will be explored to facilitate other high-level vision tasks [53]. While visual recognition research has made tremendous progress in recent years, most models are trained, applied, and evaluated on high-quality (HQ) visual data, such as the LFW [70] and ImageNet [71] benchmarks. However, in many emerging applications such as autonomous driving, intelligent video surveillance and robotics, the performance of visual sensing and analytics can be seriously degraded by various corruptions in complex unconstrained scenarios, such as limited resolution. Therefore, image super-resolution may provide one solution to feature enhancement for improving the performance of high-level vision tasks [72].