Chapter 6

Signal Processing

Zhangyang Wang, Ding Liu, Thomas S. Huang
Department of Computer Science and Engineering, Texas A&M University, College Station, TX, United States
Beckman Institute for Advanced Science and Technology, Urbana, IL, United States
Department of Electrical and Computer Engineering, University of Illinois at Urbana-Champaign, Champaign, IL, United States

Abstract

The first part of this chapter conducts an investigation of compressive sensing (CS) in the context of deep learning. We present a joint end-to-end optimization form of CS, and then propose a feed-forward pipeline to efficiently solve it as a deep neural network, called Deeply Optimized Compressive Sensing (DOCS). Rather than being a data-driven “black box”, DOCS is designed to be reminiscent of classical CS baselines. Yet compared to traditional CS methods involving iterative recovery algorithms, DOCS dramatically boosts both the computational efficiency and the signal reconstruction performance. DOCS also outperforms several deep network approaches for CS, owing to its task-specific architecture design.

The second part of this chapter presents experiments using a deep learning model for speech denoising. We propose a very lightweight procedure that can predict clean speech spectra when presented with noisy speech inputs, and we show how various parameter choices impact the quality of the denoised signal. Through our experiments we conclude that such a structure can perform better than some comparable single-channel approaches and that it is able to generalize well across various speakers, noise types and signal-to-noise ratios.

Keywords

Compressive sensing; Speech denoising; Deep learning

6.1 Deeply Optimized Compressive Sensing

6.1.1 Background

Consider a signal x ∈ R^d, which admits a sparse representation α ∈ R^n over a dictionary D ∈ R^{d×n}, with ||α||_0 ≪ n. Compressive sensing (CS) [1,2] revealed the remarkable fact that a signal which is sparsely represented over an appropriate D can be recovered from far fewer measurements than the Nyquist–Shannon sampling theorem requires. This implied the potential for joint signal sampling and compression, and triggered the development of many novel devices [3]. Given the sampling (or sensing) matrix P ∈ R^{m×d} with m ≪ d, CS samples y ∈ R^m from x as

y = Px = PDα.    (6.1)

Considering realistic CS settings where the acquisition and transmission of y may be noisy, the original x can be recovered from its compressed form y by finding its sparse representation α through solving the convex problem (λ is a constant)

min_α  λ||α||_1 + (1/2)||y − PDα||_2^2.    (6.2)
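For concreteness, the following is a minimal NumPy sketch of the iterative shrinkage and thresholding algorithm (ISTA) applied to (6.2); the step size is derived from the spectral norm of M = PD, while the iteration count is an illustrative placeholder rather than a value prescribed in this chapter.

    import numpy as np

    def soft_threshold(z, tau):
        # element-wise soft shrinkage: sign(z) * max(|z| - tau, 0)
        return np.sign(z) * np.maximum(np.abs(z) - tau, 0.0)

    def ista(y, M, lam, n_iter=200):
        """Approximately solve min_a lam*||a||_1 + 0.5*||y - M a||_2^2 (here M = PD)."""
        L = np.linalg.norm(M, 2) ** 2        # Lipschitz constant of the smooth term's gradient
        a = np.zeros(M.shape[1])
        for _ in range(n_iter):
            a = soft_threshold(a + M.T @ (y - M @ a) / L, lam / L)
        return a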

CS is feasible only if P and D are chosen wisely, so that (1) x can be sufficiently sparsely represented over D; and (2) P and D have low mutual coherence [4]. The choice of D has been well studied. While earlier literature adopted standard orthogonal transforms for D, such as the Discrete Cosine Transform (DCT), where d = n, overcomplete dictionaries with d < n have become dominant because they yield sparser representations of x. An overcomplete D can either be a prespecified set of functions (overcomplete DCT, wavelets, etc.), or designed by adapting its content to fit a given set of signal examples. More recently, the development of dictionary learning has led to powerful data-driven methods that learn D from training data and achieve improved performance.

Another fundamental problem in CS is how to choose P so that x can be recovered with high probability. A random projection matrix was shown early on to be a good choice, since its columns are almost incoherent with any basis D. Rules for designing deterministic projection matrices are studied in, for example, [5–8]. Elad [5] minimized the t-averaged mutual coherence, which is suboptimal. Xu et al. [6] proposed a solver based on the Welch bound [9]. More recently, [8] developed a method that minimizes the mutual coherence directly.

While obtaining CS measurements is trivial (a multiplication with P), the computational load is transferred to the recovery process. Equation (6.2) can be solved via either convex optimization [10] or greedy algorithms [11]. Their iterative nature leads to an inherently sequential structure and thus considerable latency, which is a major bottleneck for computational efficiency. Very recently, it has been advocated to replace iterative algorithms with feed-forward deep neural networks, in order to achieve improved and much faster signal recovery [12,13]. However, the limited existing works appear to largely overlook the classical CS pipeline: they instead apply data-driven deep models as “black boxes” that map y back to x (or equivalently, its sparse representation α) through regression.

6.1.2 An End-to-End Optimization Model of CS

Assume we are given a targeted class of signals X ∈ R^{d×t}, where each column x_i ∈ R^d stands for an input signal. Correspondingly, Y ∈ R^{m×t} and A ∈ R^{n×t} denote the collections of CS samples y_i = Px_i = PDα_i and their sparse representations α_i, respectively. CS samples Y from the input signals X on the encoder side, and then reconstructs Xˆ from Y on the decoder side, so that Xˆ is as close to X as possible. Conceptually, such a pipeline is divided into the following four stages:

  •  Stage I: Representation seeks the sparsest representation A ∈ R^{n×t} of X, with regard to D.
  •  Stage II: Measurement obtains Y from A by Y = PDA.
  •  Stage III: Recovery recovers Aˆ from Y by solving (6.2).
  •  Stage IV: Reconstruction reconstructs Xˆ = DAˆ.

Stages I and IV are usually implemented by designing a transformation pair. We introduce an analysis dictionary G ∈ R^{n×d}, jointly with the synthesis dictionary D, so that X = DA and A = GX. Stages I and II belong to the encoder and are usually combined into one step in practice, Y = PX. While A can be treated as an “auxiliary” variable in real CS pipelines, we write Stages I and II separately in order to emphasize the inherent sparsity prior on X. Stages III and IV constitute the decoder. It is assumed that P is shared by the encoder and the decoder.1
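Under these definitions, the testing-time pipeline reduces to two matrix products wrapped around a sparse-recovery step. The sketch below is only meant to make the data flow explicit; recover stands for any Stage III solver, for instance a thin wrapper around the ISTA routine sketched earlier.

    import numpy as np

    def cs_encode(x, P):
        # Stages I + II combined: the encoder only computes y = P x
        return P @ x

    def cs_decode(y, P, D, recover):
        a_hat = recover(y, P @ D)   # Stage III: sparse recovery w.r.t. M = PD
        return D @ a_hat            # Stage IV: reconstruction x_hat = D a_hat

For example, recover could be lambda y, M: ista(y, M, lam=0.1), where the value of λ is a placeholder.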

To the best of our knowledge, a joint end-to-end optimization of the CS encoder and decoder is absent. During the training stage, given the training data X, we express the entire CS pipeline as

min_{D,G,P}  ||X − DAˆ||_F^2 + γ||X − DGX||_F^2,
s.t.  Aˆ ∈ argmin_A  (1/2)||Y − PDA||_F^2 + λ||A||_1,   Y = PX.    (6.3)

Equation (6.3) is a bi-level optimization problem [14]. In the lower level, we incorporate Stage II as the equality constraint and Stage III as the ℓ1-minimization. In the upper level, the reconstruction error between the original X and the reconstruction DAˆ is measured. Furthermore, the synthesis and analysis dictionary pair has to satisfy the reconstruction constraint. Their weighted sum is the overall objective of (6.3) to be minimized. The obtained D, G, and P are applied at testing time, where an input x goes through Stages I–IV and is reconstructed at the end.

Many classical CS pipelines use fixed sparsifying transformation pairs, such as DCT and Inverse DCT (IDCT), as D and G, while P is some random matrix. More recently, there have been some model variants that try to learn a fraction of parameters:

  •  Case I: given D and G, learning P. This approach is used in [6,8]. The authors did not pay much attention to sparsifying the representation α.
  •  Case II: given Y = PX, learning D and P. This corresponds to optimizing the decoder only. Duarte-Carvajalino and Sapiro [7] proposed the coupled-SVD algorithm to jointly optimize P and D, but their formulation (Eq. (19) in [7]) did not minimize the CS reconstruction error directly. The recently proposed deep learning-based CS models, such as [13,12], also fall into this case (see the experiments in Sect. 6.1.4 for their comparisons with DOCS).

In contrast, our primary goal is to jointly optimize the pipeline from end to end, which is expected to outperform the above “partial” solutions. Solving the bi-level optimization (6.3) directly is possible [15], yet requires heavy computation and comes with little theoretical guarantee.2 Furthermore, to apply the learned CS model at testing time, it is still necessary to iteratively solve (6.2) in order to recover α.

6.1.3 DOCS: Feed-Forward and Jointly Optimized CS

We aim to transform the insights of (6.3) into a fully feed-forward CS pipeline, which is expected to ensure a reduced computational complexity. The general feed-forward pipeline is called Deeply Optimized Compressive Sensing (DOCS), as illustrated in Fig. 6.1A.

Figure 6.1 (A) An illustration of the feed-forward CS pipeline, where Stages I, II and IV are all simplified to matrix multiplications. (B) The LISTA structure implementing the iterative sparse recovery algorithm as a feed-forward network (with a truncation of K = 3 stages).

While Stages I, II and IV are naturally feed-forward, Stage III usually relies on iterative algorithms for solving (6.2) and constitutes the major computational bottleneck. Letting M = PD ∈ R^{m×n} for notational simplicity, we exploit a feed-forward approximation called the Learned Iterative Shrinkage and Thresholding Algorithm (LISTA) [16]. LISTA was proposed to efficiently approximate the ℓ1-based sparse code α of the input x. It can be implemented as a neural network, as illustrated in Fig. 6.1B. The network architecture is obtained by unfolding and truncating the iterative shrinkage and thresholding algorithm (ISTA). The network has K stages, each of which updates the intermediate sparse code z_k (k = 0, …, K−1) according to

z_{k+1} = s_θ(Wx + Sz_k),   where W = M^T, S = I − M^T M, θ = λ,    (6.4)

and s_θ is an element-wise soft shrinkage operator.3 The network parameters W, S and θ can be initialized from M and λ of the original model (6.2), and tuned further by back-propagation [16]. For convenience, we disentangle M and D and treat them as two separate variables during training.
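In the DOCS decoder, the LISTA input is the measurement vector y. The NumPy sketch below shows the initialization from (6.4) and a K-stage forward pass; K = 3 is the default used later in the experiments, while the value of λ is a placeholder.

    import numpy as np

    def soft(z, theta):
        # element-wise soft shrinkage s_theta
        return np.sign(z) * np.maximum(np.abs(z) - theta, 0.0)

    def lista_init(M, lam):
        # initialization from the original model (6.2): W = M^T, S = I - M^T M, theta = lam
        W = M.T
        S = np.eye(M.shape[1]) - M.T @ M
        theta = lam * np.ones(M.shape[1])
        return W, S, theta

    def lista_forward(y, W, S, theta, K=3):
        z = soft(W @ y, theta)              # first stage, starting from z_0 = 0
        for _ in range(K - 1):
            z = soft(W @ y + S @ z, theta)
        return z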

After transforming (6.2) into the LISTA form, and combining Stages I and II into one linear layer P, the four-stage CS pipeline becomes a fully-connected, feed-forward network. Note that Stages I, II and IV are all linear, and only Stage III contains nonlinear “neurons” s_θ. We are aware of the possibility of adding nonlinear neurons to Stages I, II and IV, but choose to stick to their “original forms” in order to remain faithful to (6.2). We then jointly tune all parameters, G, W, S, and D, from training data using back-propagation.4 The overall loss function of DOCS is ||X − DAˆ||_F^2 + γ||X − DGX||_F^2. The entire DOCS model is jointly updated by the stochastic gradient descent algorithm.
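A minimal PyTorch sketch of the resulting network and its joint loss follows. It is a simplified stand-in for the model described above: the threshold initialization is arbitrary, the DCT/IDCT and Gaussian initializations used in the experiments are omitted, and mean squared error serves as a (scaled) surrogate for the Frobenius-norm terms.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class DOCS(nn.Module):
        def __init__(self, d, n, m, K=3):
            super().__init__()
            self.K = K
            self.G = nn.Linear(d, n, bias=False)   # analysis dictionary (Stage I)
            self.P = nn.Linear(d, m, bias=False)   # sensing matrix (Stage II)
            self.W = nn.Linear(m, n, bias=False)   # LISTA input transform
            self.S = nn.Linear(n, n, bias=False)   # LISTA transform, shared across stages
            self.theta = nn.Parameter(1e-2 * torch.ones(n))
            self.D = nn.Linear(n, d, bias=False)   # synthesis dictionary (Stage IV)

        def soft(self, z):
            return torch.sign(z) * torch.relu(torch.abs(z) - self.theta)

        def forward(self, x):
            y = self.P(x)                          # Stages I + II: y = P x
            z = self.soft(self.W(y))               # Stage III: K LISTA stages
            for _ in range(self.K - 1):
                z = self.soft(self.W(y) + self.S(z))
            return self.D(z)                       # Stage IV: x_hat = D z

    def docs_loss(model, x, gamma=5.0):
        # reconstruction error plus the synthesis/analysis consistency term
        x_hat = model(x)
        return F.mse_loss(x_hat, x) + gamma * F.mse_loss(model.D(model.G(x)), x)

The model can then be updated with stochastic gradient descent, e.g., torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9), where the learning rate is a placeholder.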

Complexity  Since the training of DOCS can be performed offline, we are mainly concerned with its testing complexity. On the encoder side, the time complexity is simply O(md). On the decoder side, there are K + 1 trainable layers, making the time complexity O(mn + (K−1)n^2 + nd).

Related Work  There have been abundant works investigating CS models [1,2] and developing iterative recovery algorithms [10,11]. However, few of them benefit from optimizing all the stages from end to end. It was only recently that feed-forward networks were considered for efficient recovery. Mousavi et al. [12] applied a stacked denoising autoencoder (SDA) to map CS measurements back to the original signals. Kulkarni et al. [13] proposed a convolutional neural network (CNN) to reconstruct images from CS measurements, followed by an extra global enhancement step as postprocessing. However, both prior works relied on off-the-shelf deep architectures that are not much customized for CS, and focused on optimizing the decoder only. Lately, [17,18] started looking at encoder–decoder networks for the specific problem of video CS.

Beyond LISTA, there is growing interest in bridging iterative optimization algorithms and deep learning models. In [19], the authors leveraged a similar idea to construct fast trainable regressors as feed-forward network approximations. It was later extended in [20] to an algorithm for learned deterministic fixed-complexity pursuits for sparse and low-rank models. Lately, [21] modeled ℓ0 sparse approximation as a feed-forward neural network. The authors extended the strategy to graph-regularized ℓ1-approximation in [22], and to ℓ∞-based minimization in [23]. However, there has been no systematic study along this direction for the full CS pipeline.

6.1.4 Experiments

Settings  We implement all deep models with CUDA ConvNet [24], on a workstation with 12 Intel Xeon 2.67 GHz CPUs and 1 Titan X GPU. For all models, we use a batch size of 256 and a momentum of 0.9. The training is regularized by weight decay (ℓ2-penalty multiplier set to 5×10^−4) and dropout (ratio set to 0.5); γ is set to 5, while D and G are initialized using overcomplete DCT and IDCT, respectively. We also initialize P as a Gaussian random matrix. In addition to DOCS, we further design the following two baselines:

  •  Baseline I: D and G are fixed to be overcomplete DCT and IDCT, respectively. While Stage II uses a random P, Stage III runs the state-of-the-art iterative sparse recovery algorithm TVAL3 [25].5 This constitutes the most basic CS baseline, with no parameters learned from data.
  •  Baseline II: D and G are replaced with the matrices learned by DOCS. Stages II and III remain the same as for Baseline I.

Simulation  We generate x ∈ R^d with an inherent sparse structure for simulation purposes: we produce T-sparse vectors α_0 ∈ R^n (i.e., only T out of n entries are nonzero). The locations of the nonzero entries are chosen randomly, and their values are drawn from a uniform distribution on [−1, 1]. We then generate x = Gα_0 + n_0, where G ∈ R^{d×n} is an overcomplete DCT dictionary [26], and n_0 ∈ R^d is random Gaussian noise with zero mean and standard deviation 0.02. For each model, we generate 10,000 samples for training and 1000 for testing, over which the reported results are averaged. The default number of LISTA stages is K = 3.
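A short NumPy sketch of this data generation procedure is given below; the overcomplete DCT construction shown here is one common recipe and should be read as an assumption rather than the exact dictionary of [26].

    import numpy as np

    def overcomplete_dct(d, n):
        # redundant DCT-like dictionary with n unit-norm atoms of dimension d
        t = np.arange(d)[:, None]
        k = np.arange(n)[None, :]
        G = np.cos(np.pi * t * k / n)
        return G / np.linalg.norm(G, axis=0, keepdims=True)

    def make_samples(num, d=30, n=60, T=5, noise_std=0.02, seed=0):
        rng = np.random.default_rng(seed)
        G = overcomplete_dct(d, n)
        X = np.zeros((d, num))
        for i in range(num):
            alpha = np.zeros(n)
            support = rng.choice(n, size=T, replace=False)     # random support of size T
            alpha[support] = rng.uniform(-1.0, 1.0, size=T)    # values drawn from [-1, 1]
            X[:, i] = G @ alpha + noise_std * rng.standard_normal(d)
        return X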

Reconstruction Error  The DOCS model and the two baselines are tested under three settings: (1) m = 6, 8, …, 16, d = 30, n = 60; (2) m = 10, 15, …, 35, d = 60, n = 120; (3) m = 10, 20, …, 60, d = 90, n = 180. T is fixed to 5 in all cases. As shown in Fig. 6.2, DOCS reduces the reconstruction error dramatically compared to Baselines I and II. Comparing Baselines I and II, it is evident that the learned D and G lead to sparser representations than prefixed transforms and thus allow for more accurate recovery. Thanks to the end-to-end tuning, DOCS further produces smaller errors than Baseline II, reflecting that the learned sparse recovery can fit the data better.

Figure 6.2 Reconstruction error w.r.t. the number of measurements m.

Efficiency  We compare the running time of the different models during testing; all timings are collected in CPU mode. The decoders of DOCS recover the original signals in no more than 1 ms in all of the above test cases, which is almost 1000 times faster than the iterative recovery algorithm [25] used in Baselines I and II. The speedup achieved by DOCS is not solely due to the different implementations: DOCS relies on only a fixed number of simple matrix multiplications, whose computational complexity grows strictly linearly with the number of measurements m, given that n, d and K are fixed. Besides, its feed-forward nature makes DOCS amenable to GPU parallelization.

We further evaluate how the reconstruction error and the testing and training times vary with K, for DOCS in the setting m = 60, d = 90, n = 180, T = 30, as shown in Fig. 6.3. With end-to-end tuning, Fig. 6.3A shows that small K values, such as 2 and 3, already perform sufficiently well at reduced complexity. Fig. 6.3B shows that the actual testing time grows nearly linearly with K, which conforms to the model complexity analysis.

Figure 6.3 Reconstruction error, testing time, and training time (ms) w.r.t. K.

Experiments on Image Reconstruction  We use the disjoint training set (200 images) and test set (200 images) of the BSD500 database [27] as our training set; its validation set (100 images) is used for validation. During training, we first divide each original image into overlapping 8×8 patches as inputs x, i.e., d = 64. Parameter n is fixed at 128, while m = 4, 8, 12, 16. For a test image, we sample 8×8 blocks with a stride of 4 and apply the models in a patch-wise manner. The final result is obtained by aggregating all patches, with the overlapping regions averaged. We use the 29 images in the LIVE1 dataset [28] (converted to gray scale) to evaluate both quantitative and qualitative performance. Parameter K is set to 3.
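The patch-wise testing procedure can be sketched as follows; model_fn is a placeholder for the trained encoder-plus-decoder applied to a vectorized 8×8 patch, and image borders not covered by a full stride are ignored for brevity.

    import numpy as np

    def reconstruct_image(img, model_fn, patch=8, stride=4):
        H, W = img.shape
        out = np.zeros((H, W))
        weight = np.zeros((H, W))
        for i in range(0, H - patch + 1, stride):
            for j in range(0, W - patch + 1, stride):
                x = img[i:i + patch, j:j + patch].reshape(-1)   # d = 64
                x_hat = model_fn(x).reshape(patch, patch)       # CS sampling + recovery
                out[i:i + patch, j:j + patch] += x_hat
                weight[i:i + patch, j:j + patch] += 1.0
        return out / np.maximum(weight, 1e-8)                   # average overlapping regions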

We implement a CNN baseline based on [13],6 and an SDA baseline following [12].7 Both of them optimize only Stages III and IV: they are fed with Y = PX and try to reconstruct X, where the same random P is used for training and testing. In contrast, DOCS jointly optimizes Stages I to IV from end to end. None of the compared methods use postprocessing. The training details of the CNN and SDA remain the same as in their original papers, except for those specified above. Table 6.1 compares the average PSNR results on the LIVE1 dataset. We observe that ReconNet does not perform well, possibly because its CNN-based architecture favors larger d, and also due to the absence of BM3D postprocessing. SDA obtains better reconstruction than ReconNet as m grows larger. DOCS gains a 4–5 dB margin over both ReconNet and SDA. Figs. 6.4 and 6.5 further compare two groups of reconstructed images visually. DOCS retains more subtle features, such as the fine textures on the Monarch wings, with suppressed artifacts. In contrast, both the CNN and the SDA introduce noticeable blur and distortion.

Table 6.1

Averaged PSNR comparison (dB) on the LIVE1 dataset

m     CNN      SDA      DOCS
4     20.76    20.62    22.66
8     23.34    23.84    26.43
12    24.62    24.40    27.58
16    24.86    25.42    29.05


Figure 6.4 Visual comparison of various methods on Parrots at m = 8. PSNR values are also shown.
Figure 6.5 Visual comparison of various methods on Monarch at m = 8. PSNR values are also shown.

6.1.5 Conclusion

We have discussed the DOCS model in this section, which enhances CS in the context of deep learning. Our methodology can be extended to derive deep feed-forward network solutions to many less tractable optimization problems, with gains in both efficiency and performance. We advocate such task-specific design of deep networks for a wider variety of applications. Our immediate future work concerns how to incorporate more domain-specific priors of CS, such as the incoherence between P and D.

6.2 Deep Learning for Speech Denoising8

6.2.1 Introduction

The goal of speech denoising is to produce noise-free speech signals from noisy recordings, improving the perceived quality of the speech component and increasing its intelligibility. Here we investigate the problem of monaural speech denoising. It is a challenging and ill-posed problem: with only a single channel of information available, an infinite number of solutions is possible. Speech denoising can be utilized in various applications where background noise is present in communications. The accuracy of automatic speech recognition (ASR) can also be enhanced by speech denoising. A number of techniques have been proposed based on different assumptions about the signal and noise characteristics, including spectral subtraction [29], statistical model-based estimation [30], Wiener filtering [31], the subspace method [32] and non-negative matrix factorization (NMF) [33]. In this section we introduce a lightweight learning-based approach that removes noise from single-channel recordings using a deep neural network.

Neural networks, acting as nonlinear filters, have been applied to this problem in the past; see, for example, the early work of [34] using shallow neural networks (SNNs) for speech denoising. However, at that time, constraints on computational power and on the size of training data resulted in relatively small network implementations that limited denoising performance.

Over the last few years, advances in computer hardware and machine learning algorithms have made it possible to increase the depth and width of neural networks. Deep neural networks (DNNs) have achieved many state-of-the-art results in speech recognition [35] and speech separation [36]. DNNs containing multiple hidden layers of nonlinearity have shown great potential for capturing the complex relationships between noisy and clean utterances across various speakers, noise types and noise levels. More recently, Xu et al. [37] proposed a regression-based DNN speech enhancement framework that uses restricted Boltzmann machines (RBMs) for pretraining.

In this section we explore the use of DNNs for speech denoising, and propose a simpler training and denoising procedure that does not necessitate RBM pretraining or complex recurrent structures. We use a DNN that operates on the spectral domain of speech signals, and predicts clean speech spectra when presented with noisy input spectra. A series of experiments is conducted to compare the denoising performance under different parameter settings. Our results show that our simplified approach can perform better than other popular supervised single-channel denoising approaches and that it results in a very efficient processing model which forgoes computationally costly estimation steps.

6.2.2 Neural Networks for Spectral Denoising

In the following sections we introduce our model's structure, some domain-specific choices that we make, and a training procedure optimized for this task.

6.2.2.1 Network Architecture

The core concept in this section is to compute a regression between a noisy signal frame and a clean signal frame in the frequency domain. To do so we start with the obvious choice of using frames from a magnitude short-time Fourier transform (STFT). Using these features allows us to abstract away many of the phase uncertainties and to focus on “turning off” the parts of the input spectral frames that are purely noise [34].

More precisely, for a speech signal s(t) and a noise signal n(t) we construct a corresponding mixture signal m(t) = s(t) + n(t). We compute the STFTs of these time series to obtain the vectors s_t, n_t and m_t, which are the spectral frames corresponding to time t (each element of these vectors corresponds to a frequency bin). These vectors constitute our training data set, with m_t being the input and its corresponding s_t being the target output.

We then design a neural network with L layers which outputs a spectral frame prediction y_t when presented with m_t. This is akin to a denoising autoencoder (DAE) [38], although in this case we do not aim to find an efficient hidden representation; instead we strive to predict the spectrum of the clean signal when provided with the spectrum of the noisy signal. The runtime denoising process is defined by

h_t^{(l)} = f_l(W^{(l)} h_t^{(l−1)} + b^{(l)})    (6.5)

with l signifying the layer index (from 1 to L), and with h_t^{(0)} = m_t and y_t = h_t^{(L)}. The function f_l(·) is known as the activation function and can take various forms depending on our goals, but it is traditionally a sigmoid or some piecewise linear function; we explore this selection in a later section. Likewise, the number of layers L can range from 1 (which forms a shallow network) to as many as we deem necessary (which comes with a higher computational burden and the need for more training data).

For L = 1 and f_l(·) being the identity function, this model collapses to linear regression, whereas using nonlinear f_l(·)'s and multiple layers yields a deep nonlinear regression (or a regression deep neural network).

6.2.2.2 Implementation Details

The parameters that need to be estimated in order to obtain a functioning system are the set of W^{(l)} matrices and b^{(l)} vectors, known as the layer weights and biases, respectively. Fixed parameters that we do not learn include the number of layers L and the choice of activation functions f_l(·). In order to perform training, we need to specify a cost function between the network predictions and the target outputs, which we optimize and which provides a means to see how well the model has adapted to the training data.

Activation Function

For the activation function the most common choices are the hyperbolic tangent and the logistic sigmoid function. However, the outputs that we wish to predict are spectral magnitude values, which lie in the interval [0, ∞). This means that we should prefer an activation function that produces outputs in that interval. A popular choice that satisfies this preference is the rectified linear activation, defined as y = max(x, 0), i.e., the maximum of the input and 0. In our experience, however, this is a particularly difficult function to work with, since it exhibits a zero derivative for negative values and is very likely to result in nodes that get “stuck” at a zero output once they reach that state. Instead we use a modified version, defined as

f(x) = { x            if x ≥ ε,
       { εx/(1 − ε)   if x < ε,

where ε is a sufficiently small number (in our simulations set to 10^−5). This modification introduces a slight ramp from −∞ to ε, which guarantees that the derivative points (albeit weakly) towards positive values and provides a way to escape the zero state once a node is in it.
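A direct transcription of this modified rectifier, under the piecewise form reconstructed above, is:

    import numpy as np

    EPS = 1e-5  # the small constant epsilon used in our simulations

    def modified_relu(x, eps=EPS):
        # identity above eps; a shallow ramp of slope ~eps below it, so the
        # derivative never vanishes completely for negative inputs
        return np.where(x >= eps, x, eps * x / (1.0 - eps))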

Cost Function

For the cost function we select the mean squared error (MSE) between the target and predicted vectors, E||y_t − s_t||^2. Although a choice such as the KL divergence or the Itakura–Saito divergence might seem more appropriate for measuring differences between spectra, in our experiments we find them to ultimately perform worse than the MSE.

Training Strategy

Once the above network characteristics have been specified, we can use a variety of methods to estimate the model parameters. Traditional choices include the back-propagation algorithm, as well as more sophisticated procedures such as conjugate gradient methods and optimization approaches such as Levenberg–Marquardt [39]. Additionally, there is a trend towards including a pretraining step using an RBM analogy for each layer [40]. In our experiments for this specific task, we find many of the sophisticated approaches to be either numerically unstable, computationally too expensive, or plainly redundant. We obtain the most rapid and reliable convergence using the resilient back-propagation algorithm [41]. Combined with the modified activation function presented above, it requires no pretraining and converges in roughly the same number of iterations as conjugate gradient algorithms, with far fewer computational requirements. The initial parameter values are set using the Nguyen–Widrow procedure [42]. For most of the experiments we train our models for 1000 iterations, which is usually sufficient to achieve convergence. The details regarding the training data are discussed in the experimental results section.
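For illustration, the sketch below sets up the regression network and a resilient back-propagation training loop in PyTorch. It assumes the noisy and clean magnitude spectra are stored row-wise in tensors M_in and S_out (filled here with random placeholders), uses a standard rectifier in place of the modified one described above, and omits the Nguyen–Widrow initialization; it is a simplified stand-in, not the exact configuration used in our experiments.

    import torch
    import torch.nn as nn

    d = 513                                    # e.g., 513 magnitude bins for a 1024-pt FFT
    net = nn.Sequential(
        nn.Linear(d, 2000), nn.ReLU(),         # single hidden layer with 2000 units
        nn.Linear(2000, d), nn.ReLU(),         # nonnegative output spectra
    )
    opt = torch.optim.Rprop(net.parameters())  # resilient back-propagation (RPROP)
    loss_fn = nn.MSELoss()

    # placeholder data; replace with actual noisy/clean magnitude frame pairs
    M_in, S_out = torch.rand(1000, d), torch.rand(1000, d)

    for it in range(1000):                     # full-batch training iterations
        opt.zero_grad()
        loss = loss_fn(net(M_in), S_out)
        loss.backward()
        opt.step()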

6.2.2.3 Extracting Denoised Signals

After training a model, denoising is performed as follows: the magnitude spectral frames of the noisy speech signal are extracted and presented as inputs. If the model is properly trained, we obtain a prediction of the clean signal's magnitude spectrum for each noisy spectrum that we analyze. In order to invert that magnitude spectrum back to the time domain, we apply the phase of the mixture spectrum to it and use the inverse STFT with overlap-add to synthesize the denoised signal in the time domain. For all our experiments we use a square-root Hann window for both the analysis and synthesis transforms, and a hop size of 25% of the Fourier window length.
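The analysis/resynthesis path can be sketched with SciPy's STFT routines as follows; net_fn is a placeholder for the trained frame-wise predictor, and the 16 kHz sampling rate and 1024-pt window match the settings used elsewhere in this section.

    import numpy as np
    from scipy.signal import stft, istft

    def denoise_signal(mixture, net_fn, fs=16000, nfft=1024):
        win = np.sqrt(np.hanning(nfft))        # square-root Hann analysis/synthesis window
        hop = nfft // 4                        # hop size of 25% of the window length
        _, _, Z = stft(mixture, fs=fs, window=win, nperseg=nfft, noverlap=nfft - hop)
        mag, phase = np.abs(Z), np.angle(Z)
        # predict the clean magnitude for every noisy frame
        clean = np.stack([net_fn(mag[:, t]) for t in range(mag.shape[1])], axis=1)
        # reapply the mixture phase and invert with overlap-add
        _, x_hat = istft(clean * np.exp(1j * phase), fs=fs, window=win,
                         nperseg=nfft, noverlap=nfft - hop)
        return x_hat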

6.2.2.4 Dealing with Gain

One potential problem with this scheme is that the network might not be able to extrapolate when presented with data at significantly larger scales (e.g., 10× louder). When using large data sets, there is a high probability that we will see enough spectra at various low gains to adequately perform regression at lower scales, but we will not observe spectra louder than some threshold, which means that we will not be able to denoise very loud signals. One approach is to standardize the gain of the involved spectra to lie inside a specific range, but we can instead employ some simple modifications that help us extrapolate better.

In order to do so we perform the following steps. We first normalize all the input and output spectra to have the same ℓ1 norm (we arbitrarily choose unit norm). In the training process we add one more output node that is trained to predict the output gain of the speech signal. The target output gain values are also normalized to have unit variance over an utterance, in order to impose invariance to the scale of the desired output signal. With this modification, in order to obtain the spectrum of the denoised signal we multiply the output of the gain node with the speech spectrum predicted by all the other nodes. Because of the normalization of the predicted gain, we will not recover the clean signal with its exact gain, but rather a denoised signal that has roughly the same amplitude modulation, up to a constant scaling factor. In the next section we show how this method compares to simply training on unnormalized spectra.
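A minimal sketch of this normalization, assuming spectral frames are stored as columns, is shown below; the runtime rescaling by the predicted gain is indicated in the final comment.

    import numpy as np

    def normalize_frames(frames, eps=1e-12):
        # scale each spectral frame (column) to unit l1 norm; return frames and their gains
        gains = np.sum(np.abs(frames), axis=0) + eps
        return frames / gains, gains

    def gain_targets(clean_gains):
        # normalize the clean-speech gains to unit variance over the utterance
        return clean_gains / (np.std(clean_gains) + 1e-12)

    # at runtime: denoised frame = (predicted unit-norm spectrum) * (predicted gain)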

6.2.3 Experimental Results

We now present the results of experiments that explore the effects of relevant signal and network parameters, as well as the degradation in performance when the training data set does not adequately represent the testing data.

6.2.3.1 Experimental Setup

The experiments are set up using the following recipe. We use 100 utterances from the TIMIT database, spanning ten different speakers. We also maintain a set of five noise types: Airport, Train, Subway, Babble, and Drill. We then generate a number of noisy speech recordings by selecting random subsets of the noises and overlaying them on the speech signals. While constructing the noisy mixtures we also specify the signal-to-noise ratio (SNR) of each recording. Once we complete the generation of the noisy signals, we split them into a training set and a test set.

During the denoising process we can specify multiple parameters that have a direct effect on separation quality and are linked to the network's structure. In this section we present the subset that we find to be most important. These include the number of input nodes, the number of hidden layers and the number of their nodes, the activation functions, and the number of prior input frames to take into account.

Of course, the number of parameters is quite large, and considering all possible combinations is an intractable task. In the following experiments we therefore perform single-parameter sweeps while keeping the rest of the parameters fixed at a set of sensible choices based on our observations. The fixed parameters are: an input frame size of 1024 pts, a single hidden layer with 2000 units, the rectified linear activation with the modification described above, 0 dB SNR inputs, no input normalization, and no temporal memory.

For all parameter sweeps we show the resulting signal-to-distortion ratio (SDR), signal-to-interference ratio (SIR) and signal-to-artifacts ratio (SAR) as computed from the BSS-EVAL toolbox [43]. We additionally compute the short-time objective intelligibility measure (STOI) which is a quantitative estimate of the intelligibility of the denoised speech [44]. For all these measures higher values are better.

6.2.3.2 Network Structure Analysis

In this section we present the effects of the network's structure on performance. We focus on four parameters that we find to be the most crucial, namely input window size, number of layers, activation function and temporal memory.

The number of input nodes is directly related to the size of the analysis window that we use, which is the same as the size of the FFT that we use to transform the time domain data to the frequency domain. In Fig. 6.6 we show the effects of different window sizes. We see that a window of about 64 ms (1024 pts) produces the best result.

Figure 6.6 Comparing different input FFT sizes we see that for speech signals sampled at 16 kHz we obtain the optimal results with 1024 pts. As with all figures in this section, the bars show average values and the vertical lines on the bars denote minimum and maximum observed values from our experiments.

Another important parameter is the depth and width of the network, i.e., the number of hidden layers and their corresponding numbers of nodes. In Fig. 6.7 we show the results for various settings, ranging from a simple shallow network to a two-hidden-layer network with 2000 nodes per layer. We note that with more units we tend to see an increase in the SIR, but this trend stops after a while. It is not clear whether this effect relates to the number of training data points that we use. Regardless, the SDR, SAR and STOI seem to require more hidden layers with more units. Consolidating both observations, we note that a single hidden layer with 2000 units is optimal.

Figure 6.7 Comparing different network structures we see that a single hidden layer with 2000 units seems to perform best. Entries corresponding to a single legend number denote a single hidden layer with that many hidden units. Entries corresponding to two legend numbers denote a two hidden layer network with the two numbers being the units in the first and second hidden layer, respectively.

We also examine the effect of various activation functions with the results shown in Fig. 6.8. The ones that we consider are the rectified linear activation (with the modifications described above), the hyperbolic tangent and the logistic sigmoid functions. For all cases it seems that the modified rectified linear activation is consistently the best performer.

Figure 6.8 Comparing different activation functions we see that the rectified linear activation outperforms other common functions. The legend entries show the activation function for the hidden and the output layer, with “relu” being the rectified linear, “tanh” being the hyperbolic tangent and “logs” being the logistic sigmoid.

Finally, we examine the effect of a convolutive structure on the input, as shown in Fig. 6.9. We do so using a model that receives as input the current analysis window as well as an arbitrary number of past windows; the number of past windows ranges from 0 to 14 in our experiments. We observe a familiar pattern in the measured results, where the SIR improves at the expense of a diminishing SDR/SAR/STOI. Overall we conclude that an input of two consecutive frames is a good choice, although even a simple memoryless model performs reasonably well.
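Such a convolutive input can be formed by concatenating each frame with its immediate predecessors, as in the sketch below (frames are columns; the earliest frames are zero-padded):

    import numpy as np

    def stack_context(frames, n_past=2):
        # concatenate each spectral frame with its n_past predecessors along the feature axis
        d, T = frames.shape
        padded = np.concatenate([np.zeros((d, n_past)), frames], axis=1)
        context = [padded[:, n_past - k: n_past - k + T] for k in range(n_past + 1)]
        return np.concatenate(context, axis=0)   # shape: (d * (n_past + 1), T)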

Figure 6.9 Using a convolutive form that takes into account prior input frames, we note that although SIR performance increases as we include more past frames there is an overall degradation in quality after more than two frames.

6.2.3.3 Analysis of Robustness to Variations

In order to evaluate the robustness of this model, we test it under a variety of situations in which it is presented with unseen data, such as unseen SNRs, speakers and noise types.

In Fig. 6.10 we show the robustness of this model under various SNRs. The model is trained on 0 dB SNR mixtures and is evaluated on mixtures ranging from 20 dB SNR to −18 dB SNR. We additionally test both the method that trains on the raw input data and the method that uses the gain prediction model described above; in Fig. 6.10 these two methods are compared using the front and back bars. Note that the shown values are absolute, not improvements over the input mixture. As we see, for positive SNRs we get a much improved SIR and a relatively constant SDR/SAR/STOI, and training on the raw inputs seems to work better. For negative SNRs we still get an improvement, although it is not as drastic as before; we also note that in these cases training with gain prediction tends to perform better.

Figure 6.10 Using multiple SNR inputs and testing on a network that is trained on 0 dB SNR. Note that the results are absolute, i.e., we do not show the improvement. All results are shown using pairs of bars. The left/back bars in each pair show the results when we train on raw data, and the right/front bars show the results when we do the gain prediction.

Next we evaluate this method's robustness to data that is unseen during training. These tests provide a glimpse of how well we can expect this approach to work when applied to noise and speakers on which it is not trained. We perform three experiments: one where the testing noise is not seen in training, one where the testing speaker is not seen in training, and one where neither the testing noise nor the testing speaker is seen in training. For the unseen-noise case we train the model on mixtures with Babble, Airport, Train and Subway noises, and evaluate it on mixtures that include a Drill noise (which is significantly different from the training noises in both spectral and temporal structure). For the unknown-speaker case we simply hold out some of the speakers from the training data, and for the case where both the noise and the speaker are unseen we use a combination of the above. The results of these experiments are shown in Fig. 6.11. For the case where the speaker is unknown we see only a mild degradation in performance, which means that this approach can easily be used in speaker-variant situations. With the unseen noise we observe a larger degradation, which is expected due to the drastically different nature of that noise type. Even then, the results remain good compared to other single-channel denoising approaches. The case where both the noise and the speaker are unknown performs at about the same level as the unseen-noise case, which once again reaffirms our conclusion that this approach is very good at generalizing across speakers.

Figure 6.11 In this figure we compare the performance of our network when used on data that is not represented in training. We show the results of separation with known speakers and noise, with unseen speakers, with unseen noise, and with unseen speakers and noise.

6.2.3.4 Comparison with NMF

We present one more plot that shows how this approach compares to another popular supervised single-channel denoising approach. In Fig. 6.12 we compare our performance to a nonnegative matrix factorization (NMF) model trained on the speakers and noise at hand [33]. For the NMF model we use what we find to be the optimal number of basis functions for this task. It is clear that our proposed method significantly outperforms this approach.

Figure 6.12 Comparison of the proposed approach with NMF-based denoising.

Based on the above experiments we can draw a series of conclusions. Primarily, we see that this approach is a viable one, being adequately robust to unseen mixing situations (both in terms of gains and types of sources). We also see that a deep or convolutive structure is not crucial, although it does offer a minor performance advantage. In terms of activation functions, we note that the rectified linear activation performs the best. Our proposed approach provides a very efficient runtime denoising process, comprising only a linear transform of the input frame followed by a max operation per layer. This brings our approach to the same level of computational complexity as spectral subtraction, while offering a significant advantage in denoising performance. Unlike methods such as NMF-based denoising, no estimation is performed at runtime, which makes for a significantly more lightweight process.

Of course, our experiments are not exhaustive, but they do provide some guidelines on what structure to use to achieve good denoising results. We expect that further experiments covering more of the available options, in both training and postprocessing, can yield even better performance.

6.2.4 Conclusion and Future Work

We build a deep neural network to learn the mapping from the noisy speech signal to its clean counterpart, and conduct a series of experiments to investigate its usefulness. Our proposed model demonstrates a clear advantage over the NMF competitor.

Speech denoising is one instance of monaural source separation, i.e., source separation from monaural recordings, which also includes related tasks such as speech separation and singing voice separation. Monaural source separation is important for a number of real-world applications; for example, in singing voice separation, extracting the singing voice from the music accompaniment can enhance the accuracy of chord recognition and pitch estimation. In the future, we may exploit the connection between speech denoising and these related tasks, and adopt similar deep learning-based approaches to solve them.

References

[1] E.J. Candès, J. Romberg, T. Tao, Robust uncertainty principles: exact signal reconstruction from highly incomplete frequency information, Information Theory, IEEE Transactions on 2006;52(2):489–509.

[2] E.J. Candès, M.B. Wakin, An introduction to compressive sampling, Signal Processing Magazine, IEEE 2008;25(2):21–30.

[3] M.B. Wakin, J.N. Laska, M.F. Duarte, D. Baron, S. Sarvotham, D. Takhar, et al., An architecture for compressive imaging, ICIP. IEEE; 2006.

[4] R. Gribonval, M. Nielsen, Sparse representations in unions of bases, Information Theory, IEEE Transactions on 2003;49(12):3320–3325.

[5] M. Elad, Optimized projections for compressed sensing, IEEE TIP 2007.

[6] J. Xu, Y. Pi, Z. Cao, Optimized projection matrix for compressive sensing, EURASIP Journal on Advances in Signal Processing 2010;2010:43.

[7] J.M. Duarte-Carvajalino, G. Sapiro, Learning to sense sparse signals: simultaneous sensing matrix and sparsifying dictionary optimization, IEEE TIP 2009;18(7):1395–1408.

[8] Z. Lin, C. Lu, H. Li, Optimized projections for compressed sensing via direct mutual coherence minimization, arXiv preprint arXiv:1508.03117; 2015.

[9] L.R. Welch, Lower bounds on the maximum cross correlation of signals (corresp.), Information Theory, IEEE Transactions on 1974;20(3):397–399.

[10] E.J. Candes, T. Tao, Near-optimal signal recovery from random projections: universal encoding strategies? Information Theory, IEEE Transactions on 2006;52(12):5406–5425.

[11] A. Beck, M. Teboulle, A fast iterative shrinkage-thresholding algorithm for linear inverse problems, SIAM Journal on Imaging Sciences 2009;2(1):183–202.

[12] A. Mousavi, A.B. Patel, R.G. Baraniuk, A deep learning approach to structured signal recovery, arXiv preprint arXiv:1508.04065; 2015.

[13] K. Kulkarni, S. Lohit, P. Turaga, R. Kerviche, A. Ashok, ReconNet: non-iterative reconstruction of images from compressively sensed random measurements, arXiv preprint arXiv:1601.06892; 2016.

[14] J. Mairal, F. Bach, J. Ponce, Task-driven dictionary learning, IEEE TPAMI 2012.

[15] Z. Wang, Y. Yang, S. Chang, J. Li, S. Fong, T.S. Huang, A joint optimization framework of sparse coding and discriminative clustering, IJCAI. 2015.

[16] K. Gregor, Y. LeCun, Learning fast approximations of sparse coding, ICML. 2010:399–406.

[17] M. Iliadis, L. Spinoulas, A.K. Katsaggelos, Deep fully-connected networks for video compressive sensing, arXiv preprint arXiv:1603.04930; 2016.

[18] M. Iliadis, L. Spinoulas, A.K. Katsaggelos, DeepBinaryMask: learning a binary mask for video compressive sensing, arXiv preprint arXiv:1607.03343; 2016.

[19] P. Sprechmann, R. Litman, T.B. Yakar, A.M. Bronstein, G. Sapiro, Supervised sparse analysis and synthesis operators, NIPS. 2013.

[20] P. Sprechmann, A. Bronstein, G. Sapiro, Learning efficient sparse and low rank models, IEEE TPAMI 2015.

[21] Z. Wang, Q. Ling, T. Huang, Learning deep ℓ0 encoders, AAAI. 2016.

[22] Z. Wang, S. Chang, J. Zhou, M. Wang, T.S. Huang, Learning a task-specific deep architecture for clustering, SDM 2016.

[23] Z. Wang, Y. Yang, S. Chang, Q. Ling, T. Huang, Learning a deep ℓ∞ encoder for hashing, IJCAI. 2016.

[24] A. Krizhevsky, I. Sutskever, G.E. Hinton, ImageNet classification with deep convolutional neural networks, NIPS. 2012.

[25] C. Li, W. Yin, H. Jiang, Y. Zhang, An efficient augmented Lagrangian method with applications to total variation minimization, Computational Optimization and Applications 2013;56(3):507–530.

[26] M. Aharon, M. Elad, A. Bruckstein, K-SVD: an algorithm for designing overcomplete dictionaries for sparse representation, IEEE TSP 2006;54(11):4311–4322.

[27] P. Arbelaez, M. Maire, C. Fowlkes, J. Malik, Contour detection and hierarchical image segmentation, IEEE TPAMI 2011;33(5):898–916.

[28] H.R. Sheikh, Z. Wang, L. Cormack, A.C. Bovik, LIVE image quality assessment database release 2, 2005.

[29] S. Boll, Suppression of acoustic noise in speech using spectral subtraction, Acoustics, Speech and Signal Processing, IEEE Transactions on 1979;27(2):113–120.

[30] Y. Ephraim, D. Malah, Speech enhancement using a minimum-mean square error short-time spectral amplitude estimator, Acoustics, Speech and Signal Processing, IEEE Transactions on 1984;32(6):1109–1121.

[31] P. Scalart, et al., Speech enhancement based on a priori signal to noise estimation, Acoustics, speech, and signal processing, 1996. ICASSP-96. Conference proceedings., 1996 IEEE international conference on. IEEE; 1996:629–632.

[32] Y. Ephraim, H.L. Van Trees, A signal subspace approach for speech enhancement, Speech and Audio Processing, IEEE Transactions on 1995;3(4):251–266.

[33] K.W. Wilson, B. Raj, P. Smaragdis, A. Divakaran, Speech denoising using nonnegative matrix factorization with priors, ICASSP. 2008:4029–4032.

[34] E.A. Wan, A.T. Nelson, Networks for speech enhancement, Handbook of neural networks for speech processing. Boston, USA: Artech House; 1999.

[35] G. Hinton, L. Deng, D. Yu, G.E. Dahl, Ar. Mohamed, N. Jaitly, et al., Deep neural networks for acoustic modeling in speech recognition: the shared views of four research groups, Signal Processing Magazine, IEEE 2012;29(6):82–97.

[36] P.S. Huang, M. Kim, M. Hasegawa-Johnson, P. Smaragdis, Deep learning for monaural speech separation, ICASSP. 2014:1562–1566.

[37] Y. Xu, J. Du, L. Dai, C. Lee, An experimental study on speech enhancement based on deep neural networks, IEEE Signal Processing Letters 2014;21(1).

[38] P. Vincent, H. Larochelle, Y. Bengio, P.A. Manzagol, Extracting and composing robust features with denoising autoencoders, Proceedings of the 25th international conference on machine learning. ACM; 2008:1096–1103.

[39] S.S. Haykin, Neural networks and learning machines, vol. 3. Pearson; 2009.

[40] D. Erhan, Y. Bengio, A. Courville, P.A. Manzagol, P. Vincent, S. Bengio, Why does unsupervised pre-training help deep learning? The Journal of Machine Learning Research 2010;11:625–660.

[41] M. Riedmiller, H. Braun, A direct adaptive method for faster backpropagation learning: the RPROP algorithm, IEEE international conference on neural networks. 1993:586–591.

[42] D. Nguyen, B. Widrow, Improving the learning speed of 2-layer neural networks by choosing initial values of the adaptive weights, Neural networks, 1990., 1990 IJCNN international joint conference on, vol. 3. 1990:21–26 10.1109/IJCNN.1990.137819.

[43] C. Févotte, R. Gribonval, E. Vincent, BSS Eval, a toolbox for performance measurement in (blind) source separation, URL: http://bass-db.gforge.inria.fr/bss_eval; 2010.

[44] C.H. Taal, R.C. Hendriks, R. Heusdens, J. Jensen, An algorithm for intelligibility prediction of time–frequency weighted noisy speech, Audio, Speech, and Language Processing, IEEE Transactions on 2011;19(7):2125–2136.


1  Practical CS systems transmit the “seed” of the random sampling matrix, together with the CS samples.

2  A nonconvex convergence proof of SGD assumes three times differentiable cost functions [14].

3  [s_θ(u)]_i = sign(u_i) max(|u_i| − θ_i, 0) (u is a vector and u_i is its ith element).

4  During tuning, W and S are not tied, but S is shared by the K LISTA stages.

5  We use the code made available by the authors on their website. Parameters are set to the default values.

6  We adopt a feature map size of 8×8, and accordingly replace the 7×7 and 11×11 filters with 3×3 filters.

7  We construct a 3-layer SDA whose layers have the same dimensions as the three trainable layers of LISTA.

8  ©Reprinted, with permission, from Liu, Ding, Smaragdis, Paris, and Kim, Minje, “Experiments on deep learning for speech denoising,” Fifteenth Annual Conference of the International Speech Communication Association (2014).
