Chapter 11
Machine Learning for Audio

P. Bhattacharya, P. Nowak, and U. Zölzer

11.1 Introduction

Machine learning can be described as a set of methods or algorithms that offer the capability to learn from data automatically and develop flexible parameterized models. In classical machine learning, the important features of the input are manually designed and extracted from data, and the system automatically learns to map these features to the requisite outputs. This learning is usually achieved through parameter adaptation and/or hyper-parameter tuning based on a predefined goal or the optimization of a cost function. Classical machine learning is used in, and works well for, simple pattern recognition problems. The bulk of the effort or computation lies in the design of optimal features for the system. Once the features are handcrafted from a dataset, a generic classification or regression model is used to obtain the output. Some examples of classical machine learning algorithms include linear regression, logistic regression, k-nearest neighbors, and simple decision trees. Representation learning goes a step further and mostly eliminates the need for handcrafted features. Instead, the required features are automatically discovered from the data. With the evolution of graphics processing units, new machine learning methods supporting more complex models have come to the forefront, with deep learning being one of the major topics of research. Deep learning is part of a broader family of machine learning methods, primarily based on artificial neural networks with representation learning. In deep learning, there are multiple levels of feature extraction from raw or processed data. These features are automatically extracted and combined across the various levels to produce the output. Each level can be composed of linear and nonlinear, modifiable, and parameterized operators that extract and represent features from the representations in the preceding level. As the number of levels in the model increases, so does the complexity of the overall model, which allows for a more refined feature representation. Deep learning architectures such as the deep neural network (DNN), the recurrent neural network (RNN), and the convolutional neural network (CNN) have been applied to multiple fields, including computer vision, audio recognition, natural language processing, social network filtering, machine translation, medical image analysis, material inspection, and gaming. In most of these areas, they have produced substantially better results than previous methods.

11.2 Unsupervised and Supervised Learning

Machine learning algorithms can be primarily classified into supervised, unsupervised, and reinforcement learning, of which the first two are of particular interest and are used widely in signal processing applications. Figure 11.1 shows a minimal representation of an unsupervised and a supervised system. Unsupervised learning is the task of identifying previously undetected patterns in a dataset with no labels and with a minimum of human supervision. In contrast to supervised learning, which usually makes use of human-labeled data, unsupervised learning methods primarily extract underlying statistical or semantic features from the input data and allow for modeling of probability densities over inputs. In audio processing, feature extraction across various domains [Ler12] is very important for applications like audio analysis and fingerprinting, decomposition and separation, and content-based information retrieval [Mül15]. The underlying features or statistics can be analyzed and used to categorize incoming data, identify outliers, or generate latent representations. A clustering method, like the k-means algorithm [Mac67] or a mixture model, is a classical unsupervised approach that groups unlabeled data into categories. It tries to recognize the common features in the data and categorizes a new input based on the presence or absence of such features. Hence, such methods can be used in anomaly detection, where an input does not belong to any group [Lu16]. Other popular unsupervised methods like principal component analysis (PCA), independent component analysis (ICA), and non-negative matrix factorization (NMF) are also used extensively in audio processing and acoustics. ICA is a method which recovers unobserved signals from observed mixtures under the assumption of mutual independence. ICA or its faster variants are used, among others, for blind source separation in audio [Chi06] and music classification [Poh06]. NMF is another method which has a clustering or grouping property, and is also used for blind source separation [Vir07] and music analysis [Fev09]. Deep unsupervised learning is usually based on deep neural networks without explicit input/label pairs for training. Some models employed in deep unsupervised learning include the autoencoder, the deep belief network (DBN), and the self-organizing map (SOM), which are primarily used for dimension reduction or the creation of latent variables. A SOM is a neural network which produces a low-dimensional representation of a large input space and is useful for visualization. Autoencoders are used to generate latent variables which can be of a smaller size compared to the original high-fidelity input and hence can be used in audio compression [Min19]. Moreover, autoencoders, belief networks, and similar networks can learn underlying latent representations or embeddings from the input data, which can be used for classification tasks. An example of a neural network for speaker verification can be found in [Sny17], where the network learns to create embeddings, termed x-vectors.


Figure 11.1 A minimal illustration of unsupervised and supervised learning models.
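As a minimal illustration of the unsupervised setting described above, the following sketch clusters frame-wise audio feature vectors with the classical k-means algorithm; the random feature matrix, the feature dimension, and the number of clusters are purely illustrative assumptions and are not taken from any of the cited works.

```python
import numpy as np

def kmeans(features, n_clusters=4, n_iter=100, seed=0):
    """Cluster the rows of a feature matrix with the classical k-means algorithm."""
    rng = np.random.default_rng(seed)
    # Initialize the centroids with randomly chosen feature vectors
    centroids = features[rng.choice(len(features), n_clusters, replace=False)]
    for _ in range(n_iter):
        # Assign every frame to its nearest centroid (Euclidean distance)
        dists = np.linalg.norm(features[:, None, :] - centroids[None, :, :], axis=-1)
        labels = dists.argmin(axis=1)
        # Move each centroid to the mean of its assigned frames
        new_centroids = np.array([features[labels == c].mean(axis=0)
                                  if np.any(labels == c) else centroids[c]
                                  for c in range(n_clusters)])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

# Hypothetical usage: 500 frames of 13-dimensional features (e.g., MFCCs)
features = np.random.randn(500, 13)
labels, centroids = kmeans(features, n_clusters=4)
```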

Supervised learning is the task of learning a generalized mapping function of a system based on example input–label pairs. It models a function from labeled training data consisting of a set of examples. In supervised learning, each example is a pair consisting of an input and a desired output value. Supervised models have a predefined cost function that compares the predicted and desired output values. Based on the difference or similarity between the two, the model parameters are iteratively updated with the goal of minimizing the difference or maximizing the similarity. Linear and logistic regression are two of the basic supervised learning methods, based on least-squares minimization, least absolute deviation, or maximum likelihood estimation. These methods and their improved versions are used to estimate model parameters or coefficients to fit given data or criteria. Other well-known methods include k-nearest neighbors (k-NN) and decision trees, where the latter is used in multiple audio applications [Lav09, Aka12]. The support vector machine (SVM) is another popular learning method mostly used for classification tasks, and it has been applied to perform audio classification and retrieval [GL03, LL01]. Deep learning methods gradually came to the forefront with DNNs, CNNs, and generative adversarial networks (GANs). Deep learning architectures are increasingly being used for a variety of audio applications like audio enhancement [Mee18], music recognition [Han17], pitch estimation [Zha16], virtual analog modeling [Mar20], and more.

This chapter focuses on deep learning methods and illustrates the concept of backpropagation on simple neural and convolutional network architectures through detailed derivations. Additionally, several example applications are provided to illustrate the usability and performance of these methods in multiple areas of audio processing.

11.3 Gradient Descent and Backpropagation

Most supervised or semi-supervised machine learning algorithms include an objective function based on a ground truth or a prior, which is usually minimized to find an optimal set of parameters. The derivation that follows from this optimization criterion leads to the method of steepest descent, where the parameter update is proportional to the negative gradient of the objective function. In deep learning, the proposed models are usually multilayered and consist of differentiable parameterized functions. To find the optimal set of parameters, the respective gradients of the objective function with respect to the parameters need to be calculated. The required gradient is eventually broken into a product of multiple local gradients from the hidden layers by the chain rule of derivatives. This method is generally known as backpropagation. The following section describes an artificial neural network (ANN) and exhibits how backpropagation works through the network layers.

11.3.1 Feedforward Artificial Neural Network

An artificial neural network is based on a collection of connected units or nodes called artificial neurons, which loosely model the neurons in a biological brain. Each connection, like the synapses in a biological brain, can transmit a signal to other neurons. An artificial neuron that receives a signal then processes it and can stimulate the connected neurons. In ANN implementations, the signal at a connection is a real number and the output of each neuron is computed by some nonlinear function of the sum of its inputs. Neurons and their connections typically have a weight that is adapted as learning proceeds. The weight increases or decreases the strength of the signal at a connection. In the case of multilayer perceptrons (MLP) [Ros61], neurons may have a threshold such that a signal is sent, or the neuron is fired, only if the aggregate surpasses the threshold. Typically, neurons are aggregated into layers and different layers may perform different transformations on their inputs. Signals travel from the first layer, or the input layer, to the last layer, or the output layer, possibly after traversing a number of intermediate layers, also referred to as hidden layers. An example neural network with one hidden layer is illustrated in Fig. 11.2. Every node of one layer is connected to every node of the succeeding layer by scalar weights, which results in a densely connected structure and a matrix of scalar weights. The dimensions of a weight matrix are defined by the dimensions of the connected layers. Given an input element $x_i$, hidden element $y_j$, hidden activation $a_j$, and output element $y_k$, the operations are given by

(11.1) $y_j = \sum_{i=1}^{I} w_{ji} \cdot x_i + b_j,$
(11.2) $a_j = \sigma(y_j),$
(11.3) $y_k = \sum_{j=1}^{J} w_{kj} \cdot a_j + b_k,$

where $w_{ji}$ and $b_j$ represent the associated scalar weight between $x_i$ and $y_j$ and the bias, respectively, and $w_{kj}$ and $b_k$ represent the associated scalar weight between $a_j$ and $y_k$ and the corresponding bias, respectively. Equation (11.2) refers to an activation operation, where $\sigma(\cdot)$ refers to a sigmoid activation function described in the next section.
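The forward pass of Eqs. (11.1)–(11.3) can be written compactly with NumPy matrix–vector products, as in the following sketch; the layer sizes and the random initialization are illustrative assumptions.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Illustrative layer sizes: I inputs, J hidden units, K outputs
I, J, K = 8, 16, 4
rng = np.random.default_rng(0)
W1, b1 = 0.1 * rng.standard_normal((J, I)), np.zeros(J)   # w_ji and b_j
W2, b2 = 0.1 * rng.standard_normal((K, J)), np.zeros(K)   # w_kj and b_k

x = rng.standard_normal(I)       # input vector
y_hidden = W1 @ x + b1           # Eq. (11.1)
a = sigmoid(y_hidden)            # Eq. (11.2)
y_out = W2 @ a + b2              # Eq. (11.3)
```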

Activation Function

An activation function performs a nonlinear operation on each neuron individually, retaining the dimensions of that layer. The most commonly used activations in ANNs are the sigmoid, Tanh, and ReLU functions, as illustrated in Fig. 11.3. The sigmoid function is given by

(11.4) $\sigma(x) = \frac{1}{1 + e^{-x}},$

where $x$ is the input to the layer. The derivative of this function is given by

(11.5) $\frac{\partial \sigma(x)}{\partial x} = \sigma'(x) = \sigma(x) \left( 1 - \sigma(x) \right).$


Figure 11.2 A feedforward neural network with one hidden layer.


Figure 11.3 Activation functions.

The Tanh function $T(x)$ is an activation or gating unit given by

(11.6) $T(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}},$

and its derivative is given by

(11.7) $\frac{\partial T(x)}{\partial x} = T'(x) = 1 - T^2(x).$

Rectified linear unit (ReLU) is another activation function, which is given by

(11.8) $R(x) = \max(0, x),$

which is continuous but not differentiable at $x = 0$. The corresponding approximate derivative is given by

(11.9) $\frac{\partial R(x)}{\partial x} = R'(x) = \begin{cases} 1, & x > 0, \\ 0, & x \leq 0. \end{cases}$
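The activation functions and derivatives of Eqs. (11.4)–(11.9) translate directly into NumPy; the following sketch only restates the formulas above.

```python
import numpy as np

def sigmoid(x):                      # Eq. (11.4)
    return 1.0 / (1.0 + np.exp(-x))

def d_sigmoid(x):                    # Eq. (11.5)
    s = sigmoid(x)
    return s * (1.0 - s)

def tanh(x):                         # Eq. (11.6)
    return np.tanh(x)

def d_tanh(x):                       # Eq. (11.7)
    return 1.0 - np.tanh(x) ** 2

def relu(x):                         # Eq. (11.8)
    return np.maximum(0.0, x)

def d_relu(x):                       # Eq. (11.9), derivative at x = 0 set to zero
    return (x > 0).astype(float)
```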

Backpropagation

Based on the network shown in Fig. 11.2, the parameter elements $w_{kj}$, $b_k$, $w_{ji}$, and $b_j$, for any $i, j, k$, need to be updated and the goal is to calculate the corresponding gradients. The loss function is assumed to be a sum of squared errors given by

(11.10) $E = \frac{1}{2} \sum_{k=1}^{K} \left( d_k - y_k \right)^2,$

where $d_k$ denotes a ground truth element or label. The gradient $\delta_{w_{kj}}$, required for updating the weight parameter $w_{kj}$, is expressed and extended according to the chain rule of derivatives as

(11.11) $\delta_{w_{kj}} = \frac{\partial E}{\partial w_{kj}} = \frac{\partial E}{\partial y_k} \cdot \frac{\partial y_k}{\partial w_{kj}}$
(11.12) $= -\left( d_k - y_k \right) \cdot \frac{\partial}{\partial w_{kj}} \left( \sum_{j=1}^{J} w_{kj} \cdot a_j + b_k \right)$
(11.13) $= -\left( d_k - y_k \right) \cdot a_j = -e_k \cdot a_j,$

where $e_k = d_k - y_k$ denotes the output error element.

In a similar way, the gradient $\delta_{b_k}$, to update the bias term $b_k$, is given by

(11.14) $\delta_{b_k} = \frac{\partial E}{\partial b_k} = -\left( d_k - y_k \right) = -e_k.$

Additionally, the partial gradient of the loss function with respect to the hidden activation, $\delta_{a_j}$, which will be backpropagated to the previous layer, is given by

(11.15) $\delta_{a_j} = \frac{\partial E}{\partial a_j} = \sum_{k=1}^{K} \frac{\partial E}{\partial y_k} \cdot \frac{\partial y_k}{\partial a_j}$
(11.16) $= \sum_{k=1}^{K} -\left( d_k - y_k \right) \cdot \frac{\partial}{\partial a_j} \left( \sum_{j=1}^{J} w_{kj} \cdot a_j + b_k \right)$
(11.17) $= \sum_{k=1}^{K} -e_k \cdot w_{kj}.$

In the next step, it is necessary to calculate the gradient $\delta_{w_{ji}}$, required for updating the weight parameter $w_{ji}$, and it can be expressed and extended according to the chain rule of derivatives. Assuming that a sigmoid activation is used and with the help of Eqs. (11.17) and (11.3), it can be derived as

(11.18) $\delta_{w_{ji}} = \frac{\partial E}{\partial w_{ji}} = \sum_{k=1}^{K} \frac{\partial E}{\partial y_k} \cdot \frac{\partial y_k}{\partial a_j} \cdot \frac{\partial a_j}{\partial y_j} \cdot \frac{\partial y_j}{\partial w_{ji}}$
(11.19) $= \sum_{k=1}^{K} -e_k \cdot w_{kj} \cdot \sigma'(y_j) \cdot \frac{\partial}{\partial w_{ji}} \left( \sum_{i=1}^{I} w_{ji} \cdot x_i + b_j \right)$
(11.20) $= \sum_{k=1}^{K} -e_k \cdot w_{kj} \cdot \sigma'(y_j) \cdot x_i.$

The expression $\sigma'(\cdot)$ in the above equations is the local derivative of the sigmoid activation function, as given by Eq. (11.5). The gradient $\delta_{b_j}$ is calculated similarly and is given by

(11.21) $\delta_{b_j} = \frac{\partial E}{\partial b_j} = \sum_{k=1}^{K} \frac{\partial E}{\partial y_k} \cdot \frac{\partial y_k}{\partial a_j} \cdot \frac{\partial a_j}{\partial y_j} \cdot \frac{\partial y_j}{\partial b_j}$
(11.22) $= \sum_{k=1}^{K} -e_k \cdot w_{kj} \cdot \sigma'(y_j) \cdot \frac{\partial}{\partial b_j} \left( \sum_{i=1}^{I} w_{ji} \cdot x_i + b_j \right)$
(11.23) $= \sum_{k=1}^{K} -e_k \cdot w_{kj} \cdot \sigma'(y_j).$

In the next step, the parameters can be iteratively updated with the help of a simple gradient descent, given by

(11.24) $w_{kj} = w_{kj} - \eta_w \cdot \delta_{w_{kj}},$
(11.25) $b_k = b_k - \eta_b \cdot \delta_{b_k},$
(11.26) $w_{ji} = w_{ji} - \eta_w \cdot \delta_{w_{ji}},$
(11.27) $b_j = b_j - \eta_b \cdot \delta_{b_j},$

where $\eta_w$ and $\eta_b$ denote the step sizes or learning rates for the weights and the biases, respectively. The learning rate is usually a small fractional value and $\eta_b$ is usually lower than $\eta_w$. Gradient descent or stochastic gradient descent usually converges very slowly when used in a deep learning method across large datasets. Hence, faster gradient descent methods like Nesterov momentum [Nes83], adaptive gradient or Adagrad [Duc11], RMSProp [Dau15], and adaptive moment estimation or Adam [Kin15] are primarily used for large-scale deep learning applications.
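A complete training step for the network of Fig. 11.2, following Eqs. (11.10)–(11.27), may be sketched as below; the vectorized form of the gradients, the learning rates, and the use of plain gradient descent instead of the faster optimizers mentioned above are illustrative choices.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def train_step(x, d, W1, b1, W2, b2, eta_w=0.01, eta_b=0.001):
    # Forward pass, Eqs. (11.1)-(11.3)
    y_hidden = W1 @ x + b1
    a = sigmoid(y_hidden)
    y_out = W2 @ a + b2
    # Output error e_k = d_k - y_k and loss, Eq. (11.10)
    e = d - y_out
    loss = 0.5 * np.sum(e ** 2)
    # Gradients for the output layer, Eqs. (11.13) and (11.14)
    dW2 = -np.outer(e, a)
    db2 = -e
    # Backpropagated gradients, Eqs. (11.17) and (11.20)-(11.23)
    delta_a = -(W2.T @ e)
    delta_y = delta_a * a * (1.0 - a)      # sigma'(y_j) = a_j (1 - a_j)
    dW1 = np.outer(delta_y, x)
    db1 = delta_y
    # Gradient descent updates, Eqs. (11.24)-(11.27)
    W2 -= eta_w * dW2
    b2 -= eta_b * db2
    W1 -= eta_w * dW1
    b1 -= eta_b * db1
    return loss
```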

11.3.2 Convolutional Neural Network

A CNN is an artificial neural network which contains at least one convolutional unit that performs a convolution or cross-correlation between the input and a set of pre-defined filters. The filter coefficients, also referred to as weights, are initialized using one of the widely used initialization methods [Glo10, He15] and are subsequently updated iteratively. The primary advantage of such a network over a feedforward artificial neural network is that a convolutional layer contains fewer parameters compared to a corresponding fully connected layer. A CNN also contains hidden layers and an activation function after each convolutional layer. An example network with one hidden layer is illustrated in Fig. 11.4, which contains a convolutional unit and a fully connected unit. The hidden layer is produced by the filtering operation between the nodes in the input layer and the pre-initialized filters in the convolutional unit. The filtering operation is followed by a ReLU activation function, which is commonly used in a CNN. Finally, the hidden activations are densely connected with the neurons of the output layer. Given an input element $x_j$, hidden element $y_j$, hidden activation $a_j$, and output element $y_k$, the operations are given by

(11.28) $y_j = \sum_{l=-\frac{h-1}{2}}^{\frac{h-1}{2}} w_l \cdot x_{j+l} + b,$
(11.29) $a_j = R(y_j),$
(11.30) $y_k = \sum_{j=1}^{J} w_{kj} \cdot a_j + b_k,$

where $w_l$ and $b$ represent the associated filter coefficient of the filter having a length $h$ and the scalar bias value, respectively, and $w_{kj}$ and $b_k$ represent the associated scalar weights between $a_j$ and $y_k$ and the corresponding bias, respectively. The convolutional unit can also contain multiple filters and bias values, resulting in a hidden layer containing multiple stacked vectors instead of one vector.


Figure 11.4 A convolutional neural network with one hidden layer.

Based on the network shown in Fig. 11.4, the parameter elements $w_{kj}$, $b_k$, $w_l$, and $b$ need to be updated and the goal is to calculate the corresponding gradients. The gradient $\delta_{w_{kj}}$, required for updating the weight parameter $w_{kj}$, is already given by Eq. (11.13), while the gradient $\delta_{b_k}$, to update the bias term $b_k$, is given by Eq. (11.14). Additionally, the partial gradient of the loss function with respect to the hidden activation, $\delta_{a_j}$, which will be backpropagated to the previous layer, is given by Eq. (11.17). In the next step, it is necessary to calculate the gradient $\delta_{w_m}$, required for updating the filter coefficient $w_m$, and it can be expressed and extended according to the chain rule of derivatives. With the help of Eqs. (11.17) and (11.30), it can be derived as

(11.31) $\delta_{w_m} = \frac{\partial E}{\partial w_m} = \sum_{j=1}^{J} \sum_{k=1}^{K} \frac{\partial E}{\partial y_k} \cdot \frac{\partial y_k}{\partial a_j} \cdot \frac{\partial a_j}{\partial y_j} \cdot \frac{\partial y_j}{\partial w_m}$
(11.32) $= \sum_{j=1}^{J} \sum_{k=1}^{K} -e_k \cdot w_{kj} \cdot R'(y_j) \cdot \frac{\partial}{\partial w_m} \left( \sum_{l=-\frac{h-1}{2}}^{\frac{h-1}{2}} w_l \cdot x_{j+l} + b \right)$
(11.33) $= \sum_{j=1}^{J} \sum_{k=1}^{K} -e_k \cdot w_{kj} \cdot R'(y_j) \cdot \left( \sum_{l=-\frac{h-1}{2}}^{\frac{h-1}{2}} \frac{\partial w_l}{\partial w_m} \cdot x_{j+l} \right)$
(11.34) $= \sum_{j=1}^{J} \delta_{y_j} \cdot x_{j+m},$

where the expression $\delta_{y_j}$ is given by

(11.35) $\delta_{y_j} = \sum_{k=1}^{K} -e_k \cdot w_{kj} \cdot R'(y_j),$

and the expression $R'(y_j)$ refers to the local derivative of the ReLU activation function, as given by Eq. (11.9). Equation (11.34) leads to a cross-correlation operation between the incoming gradient and the input vector. The gradient $\delta_b$ is calculated similarly and is given by

(11.36) $\delta_b = \frac{\partial E}{\partial b} = \sum_{j=1}^{J} \sum_{k=1}^{K} \frac{\partial E}{\partial y_k} \cdot \frac{\partial y_k}{\partial a_j} \cdot \frac{\partial a_j}{\partial y_j} \cdot \frac{\partial y_j}{\partial b}$
(11.37) $= \sum_{j=1}^{J} \delta_{y_j}.$

After the calculation of the required gradients, the parameters are updated by stochastic gradient descent or a faster parameter update algorithm.
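The 1-D convolutional layer and its filter and bias gradients, Eqs. (11.28)–(11.37), can be sketched in NumPy as follows; the zero padding to preserve the signal length, the odd filter length, and the variable names are assumptions made for illustration.

```python
import numpy as np

def conv_forward(x, w, b):
    """Eqs. (11.28)-(11.29): 1-D filtering with zero padding and ReLU."""
    h = len(w)                        # odd filter length
    pad = (h - 1) // 2
    xp = np.pad(x, pad)
    y = np.array([np.dot(w, xp[j:j + h]) for j in range(len(x))]) + b
    a = np.maximum(0.0, y)            # ReLU activation
    return y, a

def conv_filter_grads(x, y, delta_a, h):
    """Eqs. (11.34)-(11.37): gradients of the filter and bias, where delta_a
    holds the gradient backpropagated from the dense layer, Eq. (11.17)."""
    pad = (h - 1) // 2
    xp = np.pad(x, pad)
    delta_y = delta_a * (y > 0)       # Eq. (11.35), R'(y_j) applied elementwise
    # Cross-correlation between the incoming gradient and the input, Eq. (11.34)
    dw = np.array([np.dot(delta_y, xp[m:m + len(x)]) for m in range(h)])
    db = np.sum(delta_y)              # Eq. (11.37)
    return dw, db
```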

11.4 Applications

The following sections describe several deep learning applications in the area of audio signal processing. In the first example, the application proposed in [Bha20] is described, where a cascade of parametric peak and shelving filters is adapted to model head-related transfer functions. The iterative optimization method is sample based and uses the instantaneous backpropagation method. In the second example, the sparse feedback delay network (FDN) with Schroeder allpass filters, described in Section 7.3.3, is adapted to model a desired room impulse response (RIR) or simulate an RIR based on a desired reverberation time. The final application of the section describes a CNN which is used to reduce Gaussian noise in audio or speech signals.

11.4.1 Parametric Filter Adaptation

Parametric peak and shelving filters, as introduced in Section 6.2.2, can be cascaded to perform audio processing tasks like equalization, spectral shaping, and modeling of complex transfer functions. An independent optimization of the mentioned parameters of each individual filter in this cascaded structure can be performed with the help of a backpropagation algorithm. Earlier research in this area includes the work done in [Gao92], which introduces a backpropagation-based adaptive IIR filter, and in [Ros99], which trained a recursive filter with a derivative function to adapt a controller. In the context of neural networks, a cascaded structure of FIR and IIR filters with multilayer perceptrons was used in [Bac91] for time-series modeling based on a simplified instantaneous backpropagation through time (IBPTT). A similar adaptive IIR-multilayer perceptron (IIR-MLP) network was also described in [Cam96] based on causal backpropagation through time (CBPTT). In recent years, backpropagation has been used extensively in CNNs as well as in recurrent neural networks, the latter being recursive in nature. However, in the aforementioned literature, the adaptation is primarily performed directly on the filter coefficients. This work is one of the few methods that adapt the control parameters with deep learning [Ner20, Väl19, Eng20], and the application described in this section was introduced in [Bha20].


Figure 11.5 A cascaded structure of peak and shelving filters.


Figure 11.6 Signal flow graph of an LF/HF shelving filter.

The proposed cascaded structure with $M$ filters is illustrated in Fig. 11.5, where the first and the last filters are shelving filters while the remaining filters are peak filters. Peak and shelving filters are described in Section 6.2.2 along with their illustrations and transfer functions. From the transfer functions defining low-frequency and high-frequency shelving filters (LFS and HFS) in Eq. (6.59) and Eq. (6.71), one can derive the signal flow graph in Fig. 11.6 and the corresponding difference equation can be calculated to be

(11.38) $y(n) = \frac{H_0}{2} \left[ x(n) \pm y_1(n) \right] + x(n),$

with the positive sign for the LFS and the negative sign for the HFS, where $y_1(n)$ defines the output signal of a first-order allpass filter according to

(11.39) $y_1(n) = a_{B/C}\, x_h(n) + x_h(n-1),$
(11.40) $x_h(n) = x(n) - a_{B/C}\, x_h(n-1).$
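A direct per-sample implementation of the shelving filter difference equations (11.38)–(11.40) could look like the following sketch; it assumes that the gain factor $H_0$ and the allpass coefficient $a_{B/C}$ have already been computed from the gain and cutoff frequency as in Section 6.2.2.

```python
import numpy as np

def shelving_filter(x, H0, a_bc, lfs=True):
    """Per-sample LFS/HFS shelving filter, Eqs. (11.38)-(11.40).
    H0 and the allpass coefficient a_bc are assumed to be precomputed."""
    x = np.asarray(x, dtype=float)
    sign = 1.0 if lfs else -1.0           # + for LFS, - for HFS
    y = np.zeros_like(x)
    xh_prev = 0.0                         # x_h(n-1)
    for n in range(len(x)):
        xh = x[n] - a_bc * xh_prev                    # Eq. (11.40)
        y1 = a_bc * xh + xh_prev                      # Eq. (11.39), allpass output
        y[n] = H0 / 2.0 * (x[n] + sign * y1) + x[n]   # Eq. (11.38)
        xh_prev = xh
    return y
```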

Similarly, the signal flow graph in Fig. 11.7 and the difference equation of a peak filter can be derived from Eq. (6.81) as

(11.41) $y(n) = \frac{H_0}{2} \left[ x(n) - y_2(n) \right] + x(n),$

where $y_2(n)$ defines the output signal of a second-order allpass filter according to

(11.42) $y_2(n) = -a_{B/C}\, x_h(n) + d \left( 1 - a_{B/C} \right) x_h(n-1) + x_h(n-2),$
(11.43) $x_h(n) = x(n) - d \left( 1 - a_{B/C} \right) x_h(n-1) + a_{B/C}\, x_h(n-2).$

Figure 11.7 Signal flow graph of a peak filter.
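Analogously, the peak filter of Eqs. (11.41)–(11.43) can be sketched per sample as below; $H_0$, the allpass coefficient $a_{B/C}$, and the coefficient $d$ are assumed to be precomputed from the gain, center frequency, and bandwidth as in Section 6.2.2.

```python
import numpy as np

def peak_filter(x, H0, a_bc, d):
    """Per-sample peak filter, Eqs. (11.41)-(11.43). H0, a_bc, and d are
    assumed to be precomputed from the filter control parameters."""
    x = np.asarray(x, dtype=float)
    y = np.zeros_like(x)
    xh1, xh2 = 0.0, 0.0                   # x_h(n-1), x_h(n-2)
    for n in range(len(x)):
        xh = x[n] - d * (1.0 - a_bc) * xh1 + a_bc * xh2       # Eq. (11.43)
        y2 = -a_bc * xh + d * (1.0 - a_bc) * xh1 + xh2        # Eq. (11.42)
        y[n] = H0 / 2.0 * (x[n] - y2) + x[n]                  # Eq. (11.41)
        xh2, xh1 = xh1, xh
    return y
```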

The partial derivative of a pre-defined cost function, with respect to the filter parameters in a cascaded structure, needs to be calculated as a product of multiple local derivatives according to the chain rule. Given the cascaded structure and a reference system under test, a global instantaneous cost or loss function $C(n)$ for the $n$th sample can be defined, as shown in Fig. 11.8. The derivative of the cost function with respect to a parameter $p_{M-1}$ can be written as

(11.44) $\frac{\partial C(n)}{\partial p_{M-1}} = \frac{\partial C(n)}{\partial y(n)} \cdot \frac{\partial y(n)}{\partial x_{M-1}(n)} \cdot \frac{\partial x_{M-1}(n)}{\partial p_{M-1}},$

according to the chain rule of derivatives, where $y(n)$ represents the predicted output of the cascaded filter structure, $y_d(n)$ represents the desired output of the system under test, $x_{M-1}(n)$ represents the output of the $(M-1)$th filter in the cascade, and $p_{M-1}$ represents any control parameter like gain, bandwidth, or center frequency of the $(M-1)$th peak filter.


Figure 11.8 Block diagram of the model.


Figure 11.9 Illustration of backpropagation through the cascaded structure.

This will result in a simplified instantaneous backpropagation algorithm [Bac91], which is illustrated in Fig. 11.9. To calculate the above derivative, with respect to the instantaneous cost function, it is necessary to calculate the local derivative $\frac{\partial C(n)}{\partial y(n)}$ of the cost function, the local derivative $\frac{\partial y(n)}{\partial x_{M-1}(n)}$ of a filter output with respect to its input, and the local derivative $\frac{\partial x_{M-1}(n)}{\partial p_{M-1}}$ of a filter output with respect to its parameter $p_{M-1}$. Hence, in general, the above three types of local gradients need to be calculated to adapt the cascaded structure. In the following sections, the local gradients of the filter output against the filter input and the control parameters are derived for shelving and peak filters. Finally, the cascaded structure is used for modeling head-related transfer function (HRTF) magnitudes as an example application.

Shelving Filter

For a shelving filter, local derivatives of the filter output are calculated against the filter input, the gain, and the cutoff frequency. Referring to Eq. (11.38), the derivative of a low-frequency shelving (LFS) and a high-frequency shelving (HFS) filter output $y(n)$, with respect to its input $x(n)$, is calculated as

(11.45) $\frac{\partial y(n)}{\partial x(n)} = \frac{H_0}{2} \left[ 1 \pm a_{B/C} \right] + 1.$

The derivative of the shelving filter output, with respect to the filter gain $G$, for the boost case is calculated as

(11.46) $\frac{\partial y(n)}{\partial G} = \frac{\left[ x(n) \pm y_1(n) \right]}{2} \frac{\partial H_0}{\partial G}$
(11.47) $= \frac{\left[ x(n) \pm y_1(n) \right]}{2} \frac{\partial}{\partial G} \left[ 10^{\frac{G}{20}} - 1 \right]$
(11.48) $= \frac{\left[ x(n) \pm y_1(n) \right]}{40}\, 10^{\frac{G}{20}} \ln(10).$

The derivative of the filter output, with respect to the filter gain $G$, for the cut case is different from the boost case because of the dependence between the gain parameter and the parameter for the cutoff frequency that can be seen in Eqs. (6.47) and (6.53). With the help of Eq. (11.38), it can be calculated as

(11.49) $\frac{\partial y(n)}{\partial G} = \frac{\left[ x(n) \pm y_1(n) \right]}{2} \frac{\partial H_0}{\partial G} \pm \frac{H_0}{2} \frac{\partial y_1(n)}{\partial G}$
(11.50) $= \frac{\left[ x(n) \pm y_1(n) \right]}{40}\, 10^{\frac{G}{20}} \ln(10) \pm \frac{H_0}{2} \frac{\partial y_1(n)}{\partial G},$

where the expression $\frac{\partial y_1(n)}{\partial G}$ of Eq. (11.50) can be extended with the help of Eq. (11.39) as

(11.51) $\frac{\partial y_1(n)}{\partial G} = \frac{\partial a_C}{\partial G}\, x_h(n) + a_C\, \frac{\partial x_h(n)}{\partial G} + \frac{\partial x_h(n-1)}{\partial G},$

with

(11.52) $\frac{\partial a_C}{\partial G} = \frac{-\ln(10)\, V_0 \tan\!\left( \pi \frac{f_c}{f_S} \right)}{10 \left[ \tan\!\left( \pi \frac{f_c}{f_S} \right) + V_0 \right]^2}, \qquad \text{(LFS)}$
(11.53) $\frac{\partial a_C}{\partial G} = \frac{\ln(10)\, V_0 \tan\!\left( \pi \frac{f_c}{f_S} \right)}{10 \left[ V_0 \tan\!\left( \pi \frac{f_c}{f_S} \right) + 1 \right]^2}, \qquad \text{(HFS)}$

and

(11.54) $\frac{\partial x_h(n)}{\partial G} = -\frac{\partial a_C}{\partial G}\, x_h(n-1) - a_C\, \frac{\partial x_h(n-1)}{\partial G}.$

From Eq. (11.54), $\frac{\partial x_h(n-1)}{\partial G}$ is calculated by employing the chain rule of derivatives and IBPTT with the initialization $\left. \frac{\partial x_h(k)}{\partial G} \right|_{k=0} = 0$.
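The recursive IBPTT computation of this state gradient for the cut case of a shelving filter, following Eqs. (11.51)–(11.54), can be sketched as below; the per-sample loop and the variable names are illustrative assumptions.

```python
def shelving_state_gradient(x, a_c, da_c_dG):
    """Track d x_h(n)/dG recursively for the shelving cut case, Eq. (11.54),
    with the IBPTT initialization d x_h(k)/dG = 0 for k = 0."""
    xh_prev, dxh_prev = 0.0, 0.0          # x_h(n-1) and its gradient
    dxh_trace = []
    for n in range(len(x)):
        xh = x[n] - a_c * xh_prev                     # Eq. (11.40)
        dxh = -da_c_dG * xh_prev - a_c * dxh_prev     # Eq. (11.54)
        dxh_trace.append(dxh)
        xh_prev, dxh_prev = xh, dxh
    return dxh_trace
```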

Finally, the derivative of the shelving filter output, with respect to the cutoff frequency $f_c$, is calculated as

(11.55) $\frac{\partial y(n)}{\partial f_c} = \pm \frac{H_0}{2} \frac{\partial y_1(n)}{\partial f_c},$

where

(11.56) $\frac{\partial y_1(n)}{\partial f_c} = \frac{\partial a_{B/C}}{\partial f_c}\, x_h(n) + a_{B/C}\, \frac{\partial x_h(n)}{\partial f_c} + \frac{\partial x_h(n-1)}{\partial f_c},$

with

(11.57) $\frac{\partial a_B}{\partial f_c} = \frac{2 \pi \sec\!\left( 2 \pi \frac{f_c}{f_S} \right) \left[ \sec\!\left( 2 \pi \frac{f_c}{f_S} \right) - \tan\!\left( 2 \pi \frac{f_c}{f_S} \right) \right]}{f_S},$
(11.58) $\frac{\partial a_C}{\partial f_c} = \frac{2 \pi V_0 \sec\!\left( \pi \frac{f_c}{f_S} \right)^2}{f_S \left[ \tan\!\left( \pi \frac{f_c}{f_S} \right) + V_0 \right]^2}, \qquad \text{(LFS)}$
(11.59) $\frac{\partial a_C}{\partial f_c} = \frac{2 \pi V_0 \sec\!\left( \pi \frac{f_c}{f_S} \right)^2}{f_S \left[ V_0 \tan\!\left( \pi \frac{f_c}{f_S} \right) + 1 \right]^2}, \qquad \text{(HFS)}$

and

(11.60) $\frac{\partial x_h(n)}{\partial f_c} = -\frac{\partial a_{B/C}}{\partial f_c}\, x_h(n-1) - a_{B/C}\, \frac{\partial x_h(n-1)}{\partial f_c}.$

From Eq. (11.60), $\frac{\partial x_h(n-1)}{\partial f_c}$ is calculated by employing the chain rule of derivatives and IBPTT with the initialization $\left. \frac{\partial x_h(k)}{\partial f_c} \right|_{k=0} = 0$.

Peak Filter

For a peak filter, local derivatives of the filter output are calculated against the filter input, the gain, the center frequency, and the bandwidth. Referring to Eq. (11.41), the derivative of a second-order peak filter output $y(n)$, with respect to its input $x(n)$, is calculated as

(11.61) $\frac{\partial y(n)}{\partial x(n)} = \frac{H_0}{2} \left[ 1 + a_{B/C} \right] + 1.$

As done in the case of shelving filters, the derivative of the peak filter output, with respect to the filter gain $G$, for the boost case will result in

(11.62) $\frac{\partial y(n)}{\partial G} = \frac{\left[ x(n) - y_2(n) \right]}{40}\, 10^{\frac{G}{20}} \ln(10).$

For the cut case as well, the derivative of the peak filter output, with respect to the filter gain $G$, is derived in a similar manner. With the help of Eq. (11.41), the derivation leads to

(11.63) $\frac{\partial y(n)}{\partial G} = \frac{\left[ x(n) - y_2(n) \right]}{40}\, 10^{\frac{G}{20}} \ln(10) - \frac{H_0}{2} \frac{\partial y_2(n)}{\partial G},$

and the expression $\frac{\partial y_2(n)}{\partial G}$ from the above equation can be extended with the help of Eq. (11.42) as

(11.64) $\frac{\partial y_2(n)}{\partial G} = -\frac{\partial a_C}{\partial G}\, x_h(n) - a_C\, \frac{\partial x_h(n)}{\partial G} - d\, \frac{\partial a_C}{\partial G}\, x_h(n-1) + \frac{\partial x_h(n-2)}{\partial G} + d \left( 1 - a_C \right) \frac{\partial x_h(n-1)}{\partial G},$

with

(11.65) $\frac{\partial a_C}{\partial G} = \frac{-\ln(10)\, V_0 \tan\!\left( \pi \frac{f_b}{f_S} \right)}{10 \left[ \tan\!\left( \pi \frac{f_b}{f_S} \right) + V_0 \right]^2},$

and

(11.66) $\frac{\partial x_h(n)}{\partial G} = d\, \frac{\partial a_C}{\partial G}\, x_h(n-1) - d \left( 1 - a_C \right) \frac{\partial x_h(n-1)}{\partial G} + \frac{\partial a_C}{\partial G}\, x_h(n-2) + a_C\, \frac{\partial x_h(n-2)}{\partial G}.$

From Eq. (11.66), the expressions $\frac{\partial x_h(n-1)}{\partial G}$ and $\frac{\partial x_h(n-2)}{\partial G}$ are calculated with the chain rule of derivatives and IBPTT with the initializations $\left. \frac{\partial x_h(k)}{\partial G} \right|_{k=0} = 0$ and $\left. \frac{\partial x_h(k)}{\partial G} \right|_{k=-1} = 0$.

The derivative of the peak filter output, with respect to the center frequency $f_c$, leads to an expression similar to Eq. (11.55), given by

(11.67) $\frac{\partial y(n)}{\partial f_c} = -\frac{H_0}{2} \frac{\partial y_2(n)}{\partial f_c},$

where

(11.68) $\frac{\partial y_2(n)}{\partial f_c} = -a_{B/C}\, \frac{\partial x_h(n)}{\partial f_c} + \frac{\partial d}{\partial f_c} \left( 1 - a_{B/C} \right) x_h(n-1) + d \left( 1 - a_{B/C} \right) \frac{\partial x_h(n-1)}{\partial f_c} + \frac{\partial x_h(n-2)}{\partial f_c},$

with

(11.69) $\frac{\partial d}{\partial f_c} = \frac{2 \pi \sin\!\left( 2 \pi \frac{f_c}{f_S} \right)}{f_S},$

and

(11.70) $\frac{\partial x_h(n)}{\partial f_c} = -\frac{\partial d}{\partial f_c} \left( 1 - a_{B/C} \right) x_h(n-1) - d \left( 1 - a_{B/C} \right) \frac{\partial x_h(n-1)}{\partial f_c} + a_{B/C}\, \frac{\partial x_h(n-2)}{\partial f_c}.$

From Eq. (11.70), the expressions $\frac{\partial x_h(n-1)}{\partial f_c}$ and $\frac{\partial x_h(n-2)}{\partial f_c}$ are calculated with the chain rule of derivatives and IBPTT with the initializations $\left. \frac{\partial x_h(k)}{\partial f_c} \right|_{k=0} = 0$ and $\left. \frac{\partial x_h(k)}{\partial f_c} \right|_{k=-1} = 0$.

Finally, the derivative of the peak filter output, with respect to the bandwidth $f_b$, is calculated as

(11.71) $\frac{\partial y(n)}{\partial f_b} = -\frac{H_0}{2} \frac{\partial y_2(n)}{\partial f_b}.$

From the above equation, $\frac{\partial y_2(n)}{\partial f_b}$ is calculated as

(11.72) $\frac{\partial y_2(n)}{\partial f_b} = -\frac{\partial a_{B/C}}{\partial f_b}\, x_h(n) - a_{B/C}\, \frac{\partial x_h(n)}{\partial f_b} - d\, \frac{\partial a_{B/C}}{\partial f_b}\, x_h(n-1) + \frac{\partial x_h(n-2)}{\partial f_b} + d \left( 1 - a_{B/C} \right) \frac{\partial x_h(n-1)}{\partial f_b},$

with

(11.73) $\frac{\partial a_B}{\partial f_b} = \frac{2 \pi \sec\!\left( 2 \pi \frac{f_b}{f_S} \right) \left[ \sec\!\left( 2 \pi \frac{f_b}{f_S} \right) - \tan\!\left( 2 \pi \frac{f_b}{f_S} \right) \right]}{f_S},$
(11.74) $\frac{\partial a_C}{\partial f_b} = \frac{2 \pi V_0 \sec\!\left( \pi \frac{f_b}{f_S} \right)^2}{f_S \left[ \tan\!\left( \pi \frac{f_b}{f_S} \right) + V_0 \right]^2},$

and

(11.75) $\frac{\partial x_h(n)}{\partial f_b} = d\, \frac{\partial a_{B/C}}{\partial f_b}\, x_h(n-1) - d \left( 1 - a_{B/C} \right) \frac{\partial x_h(n-1)}{\partial f_b} + \frac{\partial a_{B/C}}{\partial f_b}\, x_h(n-2) + a_{B/C}\, \frac{\partial x_h(n-2)}{\partial f_b}.$

From Eq. (11.75), the expressions $\frac{\partial x_h(n-1)}{\partial f_b}$ and $\frac{\partial x_h(n-2)}{\partial f_b}$ are calculated with the chain rule of derivatives and IBPTT with the initializations $\left. \frac{\partial x_h(k)}{\partial f_b} \right|_{k=0} = 0$ and $\left. \frac{\partial x_h(k)}{\partial f_b} \right|_{k=-1} = 0$.

Cascaded Structure

With the help of the above formulations, the necessary gradients for the parameter updates can be derived. As an example, the derivative of the cost function, e.g. the squared-error function given by

(11.76) $C(n) = \left[ y_d(n) - y(n) \right]^2,$

with respect to the gain $G_{M-1}$ of the $(M-1)$th peak filter in the filter cascade, as shown in Fig. 11.9, is given by

(11.77) $\frac{\partial C(n)}{\partial G_{M-1}} = \frac{\partial C(n)}{\partial y(n)} \frac{\partial y(n)}{\partial x_{M-1}(n)} \frac{\partial x_{M-1}(n)}{\partial G_{M-1}},$

according to Eq. (11.44). If the boost case is assumed, then with the help of the derivative of the cost function from Eq. (11.76), with respect to $y(n)$, given by

(11.78) $\frac{\partial C(n)}{\partial y(n)} = -2 \left[ y_d(n) - y(n) \right],$

Eq. (11.61), and Eq. (11.62), the required gradient for parameter update is derived as

(11.79) $\frac{\partial C(n)}{\partial G_{M-1}} = e(n) \left( \frac{H_0}{2} \left[ 1 + a_{B_{M-1}} \right] + 1 \right) \left( K \left[ x_{M-2}(n) - y_{2_{M-1}}(n) \right] \right),$

where

(11.80) $e(n) = \frac{\partial C(n)}{\partial y(n)},$
(11.81) $K = \frac{10^{\frac{G_{M-1}}{20}} \ln(10)}{40} = \frac{V_{0_{M-1}} \ln(10)}{40}.$

The gain $G_{M-1}$ is subsequently updated by the above expression, preferably with the adaptive momentum update for faster convergence.
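A minimal sketch of this gradient step is given below. It assumes the boost case, the usual parametric-filter relation $H_0 = V_0 - 1$, and that the error $e(n)$, the input $x_{M-2}(n)$ of the filter, the intermediate signal $y_{1_{M-1}}(n)$, and the coefficient $a_{B_{M-1}}$ are available from the forward pass; all function and variable names are illustrative.

```python
import numpy as np

def peak_gain_gradient(e, x_prev, y1, a_B, G, H0):
    """Per-sample gradient of the squared error w.r.t. the peak-filter gain G in dB,
    following Eqs. (11.79)-(11.81) for the boost case. The signal values e, x_prev,
    y1 and the coefficient a_B are assumed to come from the forward pass."""
    V0 = 10.0 ** (G / 20.0)               # linear gain corresponding to G in dB
    K = V0 * np.log(10.0) / 40.0          # Eq. (11.81)
    return e * (H0 / 2.0 * (1.0 + a_B) + 1.0) * K * (x_prev - y1)   # Eq. (11.79)

# hypothetical update step with illustrative values for one sample
G = 4.0
grad = peak_gain_gradient(e=0.1, x_prev=0.3, y1=0.25, a_B=-0.9,
                          G=G, H0=10 ** (G / 20) - 1)
G -= 1e-2 * grad      # plain gradient descent; adaptive momentum converges faster
```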

Head‐related Transfer Function Modeling

An HRTF is a direction dependent transfer function between an external sound source and the human ear. The inverse Fourier transform of the HRTF is the head‐related impulse response (HRIR). In spatial audio through headphones, mono signals are filtered with the corresponding HRIRs to create a virtual sound that is localized in a certain direction. To achieve a good resolution of the 3‐D space, HRIRs have to be saved for a high number of directions, which results in a large amount of stored data. Hence, parametric IIR filters can be used to model the HRTF magnitudes with a lower number of saved parameters.


Figure 11.10 Block diagram illustrating the initialization method for HRTF modeling.


Figure 11.11 Magnitude responses: in the top plot, the whole filter cascade; and in the bottom plot, all individual filter stages of the desired HRTF, the initial HRTF estimate, and the final approximation of the right ear of ‘Subject‐008’, for an azimuth $\varphi = 20^\circ$ and elevation $\theta = 0^\circ$.

For the initialization of the cascaded filter structure, an approximate number of required peak filters needs to be determined. Afterward, the initial parameter values are allocated, as illustrated in Fig. 11.10. First, the magnitude response is smoothed and its mean is subtracted. Based on their prominence and proximity to one another, a finite number of peaks and notches are selected. The initial center frequency of a peak filter is determined by the position of a peak or notch, while the initial cutoff frequencies of the shelving filters are determined by the slopes in the magnitude response. The gain of a filter is initialized with the magnitude of the transfer function at the position of the peak or notch. To reduce the summation effect arising from the cascade, the gain of every peak filter is scaled by a fractional factor depending on the magnitude. Gains of notches in the positive half and of peaks in the negative half are converted to small negative and positive values, respectively. Finally, the bandwidth of a peak filter is initialized based on the average local gradient of the magnitude response around the peak's position. The number of filters can be reduced or increased based on the peak and notch selection step. A major drawback of such an initialization is that it usually proposes more than the optimal number of filters. Additionally, the peak-picking method is insufficient in flat regions. However, a direct initialization is simple, and the simultaneous filter update improves the run-time.
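The following sketch illustrates this peak/notch-based initialization under simplifying assumptions; the smoothing length, prominence threshold, gain scaling, and bandwidth heuristic are illustrative placeholders and not the values used in the text, and the special handling of notches in the positive half and peaks in the negative half is omitted.

```python
import numpy as np
from scipy.signal import find_peaks

def init_peak_filters(mag_db, freqs, prominence_db=1.0, min_dist_bins=4):
    """Rough sketch of the peak/notch-based initialization of the filter cascade.
    mag_db and freqs are the magnitude response in dB and the frequency axis."""
    smoothed = np.convolve(mag_db, np.ones(9) / 9, mode="same")   # simple smoothing
    centered = smoothed - np.mean(smoothed)                       # remove the mean

    peaks, _ = find_peaks(centered, prominence=prominence_db, distance=min_dist_bins)
    notches, _ = find_peaks(-centered, prominence=prominence_db, distance=min_dist_bins)

    filters = []
    for idx in np.sort(np.concatenate([peaks, notches])):
        gain = 0.5 * centered[idx]                  # scale to reduce summation effects
        grad = np.mean(np.abs(np.gradient(centered)[max(idx - 3, 0):idx + 3]))
        bandwidth = 100.0 + 1000.0 / (grad + 1e-3)  # steeper slope -> narrower filter
        filters.append({"fc": freqs[idx], "G": gain, "fb": bandwidth})
    return filters
```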

To adapt the cascaded structure, a log‐spectral distance between the estimated and the desired magnitude responses is considered as the objective function, which is given by

(11.82) C(k) = \left( |Y_d(k)|_{\mathrm{dB}} - |Y(k)|_{\mathrm{dB}} \right)^2,

where $Y_d(k)$ and $Y(k)$ denote the magnitude responses of the desired and estimated output signals, respectively, in dB, and $k$ denotes a frequency bin. The differentiation through the Fourier transform between the time and frequency domains can be performed with the help of Wirtinger calculus, as demonstrated in [Car17] and [Wan20].

For the evaluation, a subset of HRIRs from the CIPIC database [Alg01] is chosen to evaluate the HRTF magnitude approximation. In the first step, the HRIRs are converted to HRTFs and the magnitude responses are treated as the desired signal for the filter cascade. Before performing the discrete Fourier transform, the HRIRs are padded with zeros to a length of 1024 to achieve a better frequency resolution. Afterward, the aforementioned initialization is performed. Owing to differences in the HRTFs between subjects and directions, every transfer function needs a unique initialization, which can result in a different number of peak filters and initial parameter values. After the initialization of the cascaded structure, the filters are trained and updated for 100 epochs with the Adam method [Kin15]. However, for most HRTFs, a smaller number of epochs is sufficient to achieve a good approximation. The learning rate during the update is selected as $\eta = 10^{-1}$. Additionally, the learning rate is dropped by a factor of 0.99 every time the error in an epoch is higher than the error in the previous epoch. These hyper-parameters might change for a few exceptional cases where convergence requires different values.
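A schematic version of this adaptation loop is shown below. The cascaded filter structure is replaced by a toy differentiable model and the log-spectral distance by a stand-in squared error, so only the Adam update with $\eta = 10^{-1}$ and the 0.99 learning-rate drop are taken from the text.

```python
import torch

# Minimal sketch of the adaptation loop with Adam and a learning-rate drop.
# theta and the toy model stand in for the cascaded filter parameters and the
# magnitude response of the filter cascade.
theta = torch.randn(8, requires_grad=True)
target = torch.randn(513)
model = lambda p: torch.sin(torch.linspace(0, 1, 513).outer(p).sum(dim=1))

optimizer = torch.optim.Adam([theta], lr=1e-1)          # eta = 1e-1, as in the text
scheduler = torch.optim.lr_scheduler.MultiplicativeLR(optimizer, lambda _: 0.99)
prev_err = float("inf")
for epoch in range(100):
    optimizer.zero_grad()
    err = torch.mean((target - model(theta)) ** 2)      # stand-in for Eq. (11.82)
    err.backward()
    optimizer.step()
    if err.item() > prev_err:                           # drop the learning rate by
        scheduler.step()                                # 0.99 when the error rises
    prev_err = err.item()
```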

In Fig. 11.11, the magnitude responses of the desired HRTF, the initial HRTF estimate, and the final approximation of the right ear of ‘Subject‐008’, for an azimuth $\varphi = 20^\circ$ and elevation $\theta = 0^\circ$, are plotted. As can be seen, the algorithm is able to reproduce the desired magnitude response with 17 peak and two shelving filters. A more detailed evaluation and discussion can be found in [Bha20].

11.4.2 Room Simulation

In the following application, a sparse FDN, introduced in Section 7.3.3, is adapted to simulate a desired RIR. This section provides some derivations related to the proposed FDN with a cascaded allpass filter and delay line in each of the $K$ branches and a sparse diagonal feedback matrix. Figure 11.12 shows an overview of the proposed FDN. The corresponding difference equations are given as

(11.83) y_k(n) = s_k(n - M_k) - m \cdot s_k(n),
(11.84) s_k(n) = x_k(n) + m \cdot s_k(n - M_k),
(11.85) x_k(n) = y_{k-1}(n - D_{k-1}), \qquad k = 2, \ldots, K,
(11.86) x_1(n) = x(n) + g \cdot y_K(n - D_K),
(11.87) y(n) = \sum_{k=1}^{K} a_k \cdot y_k(n - D_k),

where $y(n)$ denotes the overall output of the network, $a_k$ denotes the $k$th coefficient of a mixing vector, $y_k(n)$ denotes the output of the $k$th Schroeder allpass, $M_k$ and $D_k$ denote the associated delays inside the allpass and the subsequent delay line of the $k$th branch, respectively, $s_k(n)$ denotes a state of the $k$th Schroeder allpass, $m$ denotes the allpass coefficient, $x_k(n)$ denotes the input to the $k$th Schroeder allpass, $x(n)$ denotes the input signal to the overall network, and $g$ and $a_k$ are the control parameters. The output of the FDN is sent to a loss function given by
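The difference equations can be implemented directly, as in the following sketch; the delays, mixing coefficients, and signal length are illustrative, and the allpass coefficient and feedback gain default to the values $m = 0.5$ and $g = 0.3$ used later in the text.

```python
import numpy as np

def sparse_fdn(x, M, D, a, m=0.5, g=0.3):
    """Direct implementation of Eqs. (11.83)-(11.87): K branches, each a
    Schroeder allpass with delay M[k] and coefficient m, followed by a delay
    line D[k]; a single feedback path weighted by g closes the loop, and the
    branch outputs are combined with the mixing vector a."""
    K, N = len(M), len(x)
    pad = max(M) + max(D) + 1                  # negative indices land in this
    s = np.zeros((K, N + pad))                 # zero region, which is never written
    yk = np.zeros((K, N + pad))
    y = np.zeros(N)
    for n in range(N):
        for k in range(K):
            if k == 0:                         # Eq. (11.86): input plus feedback from branch K
                xk = x[n] + g * yk[K - 1, n - D[K - 1]]
            else:                              # Eq. (11.85): chained through the delay lines
                xk = yk[k - 1, n - D[k - 1]]
            s[k, n] = xk + m * s[k, n - M[k]]              # Eq. (11.84)
            yk[k, n] = s[k, n - M[k]] - m * s[k, n]        # Eq. (11.83)
        y[n] = sum(a[k] * yk[k, n - D[k]] for k in range(K))   # Eq. (11.87)
    return y

# impulse response of a small toy network with prime allpass delays
h = sparse_fdn(np.r_[1.0, np.zeros(4799)],
               M=[211, 223, 241], D=[17, 19, 23], a=[0.4, 0.3, 0.3])
```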

(11.88) E = \sum_{n=1}^{N} \left[ d(n) - y(n) \right]^2,

where $d(n)$ denotes the desired signal that works as a ground truth or label. To adapt $g$, the expression $\frac{\partial E}{\partial g}$ has to be calculated with the chain rule of derivatives and backpropagation as given by

(11.89) \frac{\partial E}{\partial g} = \sum_{n=1}^{N} \frac{\partial E}{\partial y(n)} \cdot \frac{\partial y(n)}{\partial g}
(11.90) = \sum_{n=1}^{N} \sum_{k=1}^{K} e(n) \cdot a_k \cdot \frac{\partial y_k(n - D_k)}{\partial g}.

In Eq. (11.89), the following expressions are substituted as

(11.93) \frac{\partial E}{\partial y(n)} = -2 \left[ d(n) - y(n) \right] = e(n),
(11.94) \frac{\partial y(n)}{\partial g} = \sum_{k=1}^{K} a_k \cdot \frac{\partial y_k(n - D_k)}{\partial g},

while the following expressions from Eq. (11.91) are given by

(11.95) \frac{\partial y_k(n - D_k)}{\partial s_k(n - D_k)} = \frac{\partial s_k(n - D_k - M_k)}{\partial s_k(n - D_k)} - m \cdot \frac{\partial s_k(n - D_k)}{\partial s_k(n - D_k)} = -m,
(11.96) \frac{\partial s_k(n - D_k)}{\partial x_k(n - D_k)} = \frac{\partial x_k(n - D_k)}{\partial x_k(n - D_k)} + m \cdot \frac{\partial s_k(n - D_k - M_k)}{\partial x_k(n - D_k)} = 1.

From Eq. (11.92), the expression $\frac{\partial x_k(n - D_k)}{\partial g}$ can be derived further and has to be expressed in the form of $\frac{\partial x_1(\cdot)}{\partial g}$, because $x_1(\cdot)$ is a function of $g$. With the help of Eq. (11.85), the derivation can be written as

(11.97) \frac{\partial x_k(n - D_k)}{\partial g} = \frac{\partial y_{k-1}(n - D_k - D_{k-1})}{\partial g}
(11.98) = -m \cdot \frac{\partial x_{k-1}(n - D_k - D_{k-1})}{\partial g}
(11.99) = -m \cdot \frac{\partial y_{k-2}(n - D_k - D_{k-1} - D_{k-2})}{\partial g}
(11.100) = (-m)^2 \cdot \frac{\partial x_{k-2}(n - D_k - D_{k-1} - D_{k-2})}{\partial g}.

The expression in Eq. (11.100) can be extended and expressed in a more general form given by

(11.101) \frac{\partial x_k(n - D_k)}{\partial g} = (-m)^p \cdot \frac{\partial x_{k-p}(n - D_k - \cdots - D_{k-p})}{\partial g}.

With $p = k - 1$, the expression in the above equations can be written as a partial derivative of $x_1$, which is differentiable with respect to $g$, and is given by

(11.102) \frac{\partial x_k(n - D_k)}{\partial g} = (-m)^{k-1} \cdot \frac{\partial x_1(n - D_k - \cdots - D_1)}{\partial g}
(11.103) = (-m)^{k-1} \cdot \left[ y_K(n - D_k - \cdots - D_1 - D_K) + g \cdot \frac{\partial y_K(n - D_k - \cdots - D_1 - D_K)}{\partial g} \right].

The second expression in Eq. (11.103) can be extended further and can be neglected for a large $K$ because $m = 0.5$ is usually a small fraction. Combining Eq. (11.92) and Eq. (11.103) leads to the approximate expression given by

(11.104) \frac{\partial E}{\partial g} = -\sum_{n=1}^{N} \sum_{k=1}^{K} e(n) \cdot a_k \cdot m \cdot \frac{\partial x_k(n - D_k)}{\partial g}
(11.105) = \sum_{n=1}^{N} \sum_{k=1}^{K} e(n) \cdot a_k \cdot (-m)^k \cdot y_K\!\left(n - \sum_{i=1}^{k} D_i - D_K\right).

Figure 11.12 Feedback delay network.

In addition to controlling $g$, the parameters $a_k$ of the mixing vector can also be adapted, and the expression $\frac{\partial E}{\partial a_k}$ needs to be derived. This derivation is given by

(11.106) \frac{\partial E}{\partial a_k} = \sum_{n=1}^{N} \frac{\partial E}{\partial y(n)} \cdot \frac{\partial y(n)}{\partial a_k}
(11.107) = \sum_{n=1}^{N} e(n) \cdot y_k(n - D_k).

Finally, the respective updates could be performed with gradient descent as

(11.108) g = g - \eta_g \cdot \frac{\partial E}{\partial g},
(11.109) a_k = a_k - \eta_a \cdot \frac{\partial E}{\partial a_k},

where $\eta_g$ and $\eta_a$ are the respective learning rates and $\eta_a$ is usually chosen as a very small fraction compared with $\eta_g$.
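A possible implementation of these update rules, using the approximate gradient of Eq. (11.105) and the gradient of Eq. (11.107), is sketched below; it assumes the branch outputs $y_k(n)$ and the error signal $e(n)$ have been stored during the forward pass and that every accumulated delay is shorter than the signal length.

```python
import numpy as np

def fdn_gradients(e, yk, a, D, m=0.5):
    """Approximate gradients of the loss in Eq. (11.88) w.r.t. g and a_k,
    following Eqs. (11.105) and (11.107). e(n) = -2*[d(n) - y(n)] is the error
    signal, yk holds the branch outputs y_k(n) from the forward pass (K x N),
    and a, D are the mixing coefficients and branch delays."""
    D = np.asarray(D)
    K, N = yk.shape
    cum_D = np.cumsum(D)                          # sum_{i=1}^{k} D_i
    grad_g, grad_a = 0.0, np.zeros(K)
    for k in range(K):                            # k = 0 corresponds to k = 1 in the text
        lag = cum_D[k] + D[K - 1]                 # total delay in Eq. (11.105)
        yK_shift = np.concatenate([np.zeros(lag), yk[K - 1, :N - lag]])
        grad_g += np.sum(e * a[k] * (-m) ** (k + 1) * yK_shift)
        yk_shift = np.concatenate([np.zeros(D[k]), yk[k, :N - D[k]]])
        grad_a[k] = np.sum(e * yk_shift)          # Eq. (11.107)
    return grad_g, grad_a

# gradient-descent step of Eqs. (11.108)-(11.109), with eta_a chosen much smaller than eta_g:
# g -= eta_g * grad_g
# a -= eta_a * grad_a
```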

The update can occur for a finite number of iterations or it can stop based on a minimum error threshold. Instead of a desired RIR, the optimization can also be driven by an approximate desired energy decay curve (EDC) of an impulse response. A normalized EDC for an impulse response $h(n)$ in a discrete case is defined as

(11.110) \mathrm{EDC}(n) = 10 \cdot \log_{10}\!\left( \frac{\sum_{k=n}^{N} h(k)^2}{\sum_{k=1}^{N} h(k)^2} \right).

The EDC is used for estimating reverberation times such as $T_{60}$, and for a long RIR, it decays almost linearly before it subsides rapidly near the end. Hence, in the absence of a desired RIR, a normalized EDC can be constructed based on a desired reverberation time with the assumption of linearity, and this approximate EDC can be used as a ground truth. In this simulation, the input $x(n)$ is an impulse and therefore the estimated output $y(n)$ is an impulse response. The EDC of $y(n)$ is then calculated and its deviation from the approximate ground truth is used as an error to drive the FDN parameter adaptation. The mixing coefficients $a_k$ are initialized randomly and scaled, while the initial value of the parameter $g$ is selected as 0.3.
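The EDC of Eq. (11.110) and the linear approximation used as ground truth can be computed as in the following sketch; the exponentially decaying noise burst serves only as a stand-in for an estimated impulse response.

```python
import numpy as np

def edc_db(h):
    """Normalized energy decay curve of an impulse response h, Eq. (11.110)."""
    energy = np.cumsum(h[::-1] ** 2)[::-1]          # sum_{k=n}^{N} h(k)^2
    return 10.0 * np.log10(energy / energy[0] + 1e-12)

def approx_edc_db(T60, N, fs):
    """Linear approximation of a desired EDC: -60 dB is reached after T60 seconds.
    This is the idealized ground truth discussed in the text."""
    n = np.arange(N)
    return -60.0 * n / (T60 * fs)

# example: EDC of an exponentially decaying noise burst versus a 0.5 s target
fs = 48000
h = np.random.randn(fs) * np.exp(-np.arange(fs) / (0.5 * fs / 6.9))
err = np.mean((edc_db(h) - approx_edc_db(0.5, fs, fs)) ** 2)   # drives the adaptation
```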

Figure 11.13 shows a simple example of a simulation based on a desired RIR as the ground truth. The first plot shows the normalized decay curves for the final estimated impulse response after 20 iterations, the desired or ground-truth impulse response, and the initial impulse response. The second plot shows the final RIR, which is broadband in nature. The EDC indicates a reverberation time $T_{60}$ of approximately 0.67 seconds, and the density of the impulse response depends on the number of branches in the network. In this simulation, the respective delays inside the allpass filters are prime numbers, in terms of number of samples, and they are selected as $M_1 > 200$ and $M_1 < M_k < 1.5 \cdot M_1$. If the number of values $M_k$ is very high, a subset of these primes is selected to reduce the number of branches and accelerate convergence. However, reducing the number of branches leads to a reduction of the impulse response density. The delay after an allpass filter is selected as $D_k \leq \frac{M_k}{10}$.
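A sketch of this delay selection is given below; it uses sympy to enumerate the primes, and the maximum number of branches and the exact thinning strategy are illustrative choices, not those of the original simulation.

```python
import numpy as np
from sympy import primerange

def select_delays(M1=211, ratio=1.5, max_branches=16, delay_div=10):
    """Sketch of the delay selection described above: prime allpass delays with
    M1 > 200 and M1 < Mk < 1.5*M1, thinned to limit the number of branches,
    and delay lines Dk <= Mk/10."""
    primes = np.array(list(primerange(M1, int(ratio * M1))))
    if len(primes) > max_branches:                     # keep a subset to reduce
        idx = np.linspace(0, len(primes) - 1, max_branches).astype(int)
        primes = primes[idx]                           # the number of branches
    M = primes
    D = np.maximum(1, M // delay_div)                  # Dk <= Mk / 10
    return M, D

M, D = select_delays()
```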


Figure 11.13 Results, in terms of energy decay curves (EDCs) and the final estimated RIR of a room simulation by the FDN based on a desired RIR as ground truth.


Figure 11.14 Results in terms of EDCs and the final estimated RIR of a room simulation by the FDN based on a desired reverberation time $T_{60}$ as ground truth.

Figure 11.14 shows an example of a simulation based on a desired $T_{60}$. The ground truth is created by a linear interpolation on the basis of the proposed $T_{60}$ of 1.5 s and is treated as an approximation of the expected EDC. It is, however, noteworthy that such an approximation of the ground truth is not entirely well posed, owing to its dissimilarity with real EDCs near the tail end. Through the iterations, the EDC of the estimated impulse response gradually drifts towards the approximated ground truth and can improve further with more iterations. The corresponding final impulse response after 10 iterations is also shown in the plot. The method can be extended to stereo by adding an additional mixing vector of coefficients. A pseudo-random initialization of each mixing vector ensures that the individual impulse responses are decorrelated to avoid a narrow sound field while rendering audio; alternatively, they can be decorrelated during adaptation, if required.


Figure 11.15 Illustration of the denoising model during the training phase (top plot) and the testing phase (bottom plot).

11.4.3 Audio Denoising

Noise reduction is one of the most widely performed low-level applications in audio signal processing. One of the earliest approaches to noise reduction is the method of spectral subtraction [Bol79], which is quite fast and is suitable for reducing stationary noise. Another popular approach to noise reduction is Wiener filtering based on the minimum mean squared error, where an estimation of the a priori signal-to-noise ratio (SNR) is required [EM84]. Other methods based on adaptive filtering [Tak03], wavelets [Soo97], adaptive time-frequency block thresholding [Yu08], and statistical modeling [God01] have been proposed for efficient audio restoration. Additionally, companding-based noise reduction employed by Dolby and the dynamic noise limiters introduced by Philips are some of the commercially used denoising systems. In recent years, deep learning methods have been successfully used in audio denoising, and notable deep neural network architectures like the CNN [Par17], WaveNet [Ret18], and the recurrent neural network (RNN) [Val18] have shown very promising results. In this section, a convolutional neural network model is described for an offline audio denoising application against additive Gaussian noise.

Denoising Model

Figure 11.15 illustrates the proposed denoising model. In the training phase, Gaussian noise, denoted by $r(n)$, is initially added to the clean audio signal sampled at 48 kHz and denoted by $x(n)$. It is ensured that the overall SNR in each audio file is 5 dB. The noisy audio is transformed with a short-time Fourier transform (STFT) using a Hamming window of 1024 samples, 75% overlap, and a frequency resolution of 46.875 Hz. After the transform, its magnitude response is calculated, and the coefficients are normalized between 0 and 1. With a unit stride, consecutive coefficient vectors of the magnitude response, denoted by $|X_N(m,k)|$ and centered around the $m$th frame, are concatenated along the depth to produce a sequential input for the CNN. The model predicts the denoised coefficient vector of the magnitude response, denoted by $|Y(m,k)|$, while the corresponding magnitude response from the clean audio, denoted by $|X(m,k)|$, is used as the ground truth. The CNN is trained for several epochs with a combination of loss functions. In the testing phase, the aforementioned data pre-processing is performed for the test audio and the trained model is used to estimate the denoised magnitude response. The phase response of the noisy audio is then combined with this magnitude response and the inverse STFT is performed to reconstruct the denoised audio signal.
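The pre-processing described above can be sketched as follows; the global min–max normalization and the context of five frames are assumptions consistent with the description, and the exact normalization used in the original system may differ.

```python
import numpy as np
from scipy.signal import stft

def make_cnn_input(x, snr_db=5.0, n_fft=1024, context=5, fs=48000):
    """Sketch of the pre-processing: add white Gaussian noise at the given SNR,
    compute an STFT (Hamming window, 75% overlap), normalize the magnitudes to
    [0, 1], and stack consecutive frames along the depth."""
    noise = np.random.randn(len(x))
    noise *= np.sqrt(np.mean(x ** 2) / (10 ** (snr_db / 10) * np.mean(noise ** 2)))
    _, _, X = stft(x + noise, fs=fs, window="hamming",
                   nperseg=n_fft, noverlap=3 * n_fft // 4)
    mag = np.abs(X)                                    # 513 bins x frames
    mag = (mag - mag.min()) / (mag.max() - mag.min() + 1e-12)
    half = context // 2
    frames = [mag[:, m - half:m + half + 1]            # context frames around m
              for m in range(half, mag.shape[1] - half)]
    return np.stack(frames), np.angle(X)               # magnitudes and noisy phase
```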

Network Architecture

The CNN has a feedforward architecture with inception and attention modules, as illustrated in Fig. 11.16. The input structure contains five consecutive vectors of 513 coefficients each, centered on the $m$th frame index, and it is sent to an inception module [Sze15] containing three parallel convolution layers. The filter groups in the layers have a spatial dimension of $3 \times 1$ but different dilation factors of 0, 1, and 2, and they produce output feature maps with depths of 32, 16, and 16, respectively. Dilation of a filter refers to the insertion of zeros between the filter coefficients. This increases its receptive field without increasing the number of training parameters. Multiple dilation factors help in aggregating multiresolution features, which improves the network performance. The padding is adjusted according to the dilation factors and filter sizes so that the output feature maps have the same spatial dimensions. The features generated by the parallel convolution layers are concatenated along the depth to create an output of 64 feature maps, which is rectified by a ReLU layer. These feature maps are sent to a residual inception–attention block, which features the aforementioned inception module with an additional batch normalization operation after each convolution layer, followed by an attention module. The attention mechanism was introduced as an improvement in neural machine translation systems [Bah15]. Although the mechanism is primarily used in RNNs or LSTMs for natural language processing [Che16], attention has emerged as an important module in computer vision [Xu15] and audio processing [Hao19].
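A possible PyTorch realization of the inception module is sketched below. Since the text counts dilation as the number of inserted zeros, the dilation factors 0, 1, and 2 are mapped to the PyTorch dilation arguments 1, 2, and 3; the batch normalization used inside the residual blocks is omitted here.

```python
import torch
import torch.nn as nn

class InceptionModule(nn.Module):
    """Sketch of the inception module: three parallel 3x1 convolutions with
    different dilations whose outputs are concatenated along the depth. Padding
    is chosen so that all branches keep the spatial size of 513 x 1."""
    def __init__(self, d_in):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Conv2d(d_in, 32, kernel_size=(3, 1), dilation=(1, 1), padding=(1, 0)),
            nn.Conv2d(d_in, 16, kernel_size=(3, 1), dilation=(2, 1), padding=(2, 0)),
            nn.Conv2d(d_in, 16, kernel_size=(3, 1), dilation=(3, 1), padding=(3, 0)),
        ])
        self.act = nn.ReLU()

    def forward(self, x):                      # x: (batch, d_in, 513, 1)
        return self.act(torch.cat([b(x) for b in self.branches], dim=1))

y = InceptionModule(d_in=5)(torch.randn(1, 5, 513, 1))   # -> (1, 64, 513, 1)
```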


Figure 11.16 Illustration of the CNN architecture, where C0, C1, and C2 denote convolution layers with filter sizes of $3 \times 1 \times d_{in} \times 32$, $3 \times 1 \times d_{in} \times 16$, and $3 \times 1 \times d_{in} \times 16$ and dilation factors of 0, 1, and 2, respectively, C denotes a convolution layer with a filter size of $3 \times 1 \times 64 \times 1$, R denotes a ReLU activation, CC denotes concatenation, BN denotes batch normalization, GAP denotes global average pooling, and NN denotes a neural network.

In the current CNN, a simple self‐attention mechanism is used. At the beginning of this module, a global average pooling is performed across each feature map of the incoming three‐dimensional input, and the operation is given by

(11.111) x_j = \sum_{i=1}^{H} X_{ij},

where $X_{ij}$ denotes the $i$th element in the $j$th feature map, $x_j$ denotes the corresponding scalar output, and $H$ denotes the height of the feature map, while its width is 1. Pooling over every feature map of the tensor results in an output vector, which is processed by a simple neural network with one hidden layer and a ReLU activation function. At the output of the neural network, a sigmoid function, given by Eq. (11.4), is used to limit the values between 0 and 1. As an alternative, a softmax function can also be used to obtain a normalized output vector. Finally, each coefficient of this vector acts as an independent multiplier to tune each feature map of the incoming three-dimensional structure.
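This channel-wise attention can be sketched as follows; the pooling is implemented as an average, which differs from the plain sum in Eq. (11.111) only by a constant factor, and the hidden-layer size is an illustrative choice.

```python
import torch
import torch.nn as nn

class AttentionModule(nn.Module):
    """Sketch of the self-attention described above: global average pooling over
    each feature map, a small fully connected network with one hidden layer and
    ReLU, a sigmoid to confine the weights to (0, 1), and channel-wise rescaling."""
    def __init__(self, channels=64, hidden=16):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)             # one scalar per feature map
        self.net = nn.Sequential(
            nn.Linear(channels, hidden), nn.ReLU(),
            nn.Linear(hidden, channels), nn.Sigmoid(),  # weights between 0 and 1
        )

    def forward(self, x):                               # x: (batch, channels, H, 1)
        w = self.net(self.pool(x).flatten(1))           # (batch, channels)
        return x * w[:, :, None, None]                  # rescale each feature map

y = AttentionModule()(torch.randn(1, 64, 513, 1))       # same shape as the input
```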

A series of inception–attention blocks is cascaded before a final convolution and ReLU layer, which estimates a spectral mask; the mask is multiplied by the input coefficient vector corresponding to the $m$th frame to suppress noise. The result is sent to the loss layer, where the error is calculated with respect to the ground-truth coefficient vector.

Model Evaluation

To train the model, a combination of losses is used. The first loss is the mean squared error between the magnitude coefficients given by

(11.112) L_{\mathrm{mse}}(m) = \frac{1}{K} \sum_{k} \left( |X(m,k)| - |Y(m,k)| \right)^2,

where $|X(m,k)|$ and $|Y(m,k)|$ denote the magnitude response vectors of the desired and estimated output signals, respectively, $k$ denotes a frequency bin index, $m$ denotes the frame index, and $K$ denotes the number of bins. The second loss function is the absolute error between the absolute values of the transformed magnitude response coefficients corresponding to the estimation and the ground truth. This loss function is given by

(11.113) L_{\mathrm{dft}}(m) = \frac{1}{K} \sum_{k} \Big| \big| \mathrm{DFT}\left( |X(m,k)| \right) \big| - \big| \mathrm{DFT}\left( |Y(m,k)| \right) \big| \Big|,

where $|\cdot|$ denotes the absolute value. This loss function improves the noise suppression capability, particularly in unvoiced or silent regions, and reduces unpleasant artifacts.
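A sketch of the combined loss for a single frame is given below; the relative weighting of the two terms is an assumption, since the text only states that the losses are combined.

```python
import torch

def denoising_loss(X_mag, Y_mag, w_dft=0.1):
    """Sketch of the combined loss: the mean squared error of Eq. (11.112) plus
    the DFT-domain absolute error of Eq. (11.113), applied to the desired and
    estimated magnitude vectors of one frame. The weight w_dft is illustrative."""
    l_mse = torch.mean((X_mag - Y_mag) ** 2)                        # Eq. (11.112)
    l_dft = torch.mean(torch.abs(torch.abs(torch.fft.rfft(X_mag))
                                 - torch.abs(torch.fft.rfft(Y_mag))))   # Eq. (11.113)
    return l_mse + w_dft * l_dft

loss = denoising_loss(torch.rand(513), torch.rand(513))
```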

To train the denoising model, a dataset is built with audio files containing both high-fidelity speech and music sampled at 48 kHz. Short speech files are collected from the PTDB-TUG [Pir11] dataset, and music files of longer duration are collected from multiple classical music datasets including the Bach10 [Dua10] and Mirex-Su [Su16] datasets. Two networks are trained with speech and music files, respectively. To train the first network, 200 speech files are selected, and some examples from the remaining speech files are selected for testing. The second network is trained with 20 music files, while the remaining music files are selected for testing. In this example, based on experiments, a CNN model with eight inception–attention blocks is used for speech denoising and a model with twelve inception–attention blocks for denoising music signals. The models are trained for 70 epochs with the adaptive momentum update. The performance of the models can be evaluated by different objective metrics. The improvement in SNR is an indicator of denoising performance, while the perceptual evaluation of audio quality (PEAQ) [Thi00] and of speech quality (PESQ) [ITU01] are important methods to evaluate audio and speech quality. The PEAQ method provides an objective difference grade (ODG) which rates an audio signal between −4 and 0. An ODG score of 0 indicates imperceptible differences compared with the reference, and a score of −4 indicates an audio signal with annoying artifacts. Similarly, PESQ provides a mean opinion score (MOS) that ranges from 1, which indicates poor speech quality, to 5, which indicates excellent speech quality. The test dataset contains 54 audio files, of which the first 50 files contain short speech and the remaining files contain music of relatively longer duration. Table 11.1 illustrates the performance of the CNNs for speech and music signals. The average SNR of the denoised speech signals produced by the CNN indicates an improvement of 13.8 dB. Similarly, the average ODG score of the speech after noise suppression is improved by 0.62. To perform the wideband PESQ evaluation, the speech signals are resampled to 16 kHz. The average MOS of the denoised signals is improved by nearly 1. The average SNR of the denoised music signals produced by the CNN shows an improvement of approximately 11 dB. However, the average ODG score of the music signals after noise suppression does not show an improvement similar to that of speech. This indicates that music is relatively more difficult to denoise than speech because of a low number of pauses and the presence of more high-frequency content, which gets suppressed along with the noise and affects the perceptual quality. Figure 11.17 illustrates an example of denoising on a female speech signal corrupted with noise. The SNR of the denoised speech is 20.9 dB, its ODG score is −2.91 compared with a score of −3.91 for the noisy speech, and its MOS improves from 1.28 to 2.44. Figure 11.18 illustrates an example of denoising on a musical piece from the Mirex-Su dataset corrupted with noise. The overall improvement in SNR is approximately 10.8 dB, while the ODG score of the denoised signal is −3.45 compared with a score of −3.9 for the noisy signal. Figure 11.19 shows the corresponding spectral representations of the noisy, denoised, and original signals.

Table 11.1 Average performance of the denoising CNNs on noisy speech and music

Audio              SNR (dB)   ODG     MOS
Noisy Speech       5          −3.91   1.48
Denoised Speech    18.8       −3.29   2.49
Noisy Music        5          −3.91   −
Denoised Music     16.3       −3.64   −

Figure 11.17 Example of noise suppression in an audio file from the PTDB‐TUG speech dataset.


Figure 11.18 Example of noise suppression in an audio file from the Mirex‐Su dataset.


Figure 11.19 Spectrogram of the example noisy, denoised, and original audio file from the Mirex‐Su dataset.

The applications described in this chapter are a few of the many audio processing applications which can be categorized as supervised regression problems. Similar applications with deep learning include audio enhancement in terms of super‐resolution [SD21], supervised audio source separation [LM19], and neural source modeling [Wan20]. Supervised audio classification problems, however, categorize audio signals or inherent features within audio signals into distinct classes which can be used for further processing. Classification problems are also extensively researched and some of the notable applications include pitch detection [SZZ16], music genre classification [Ora17], and speaker recognition [Sny18], among others.

11.5 Exercises

  1. Write a script for a simple feedforward neural network with an input layer of dimension 1, an output layer of dimension 1, a hidden layer of dimension 5, and an activation and a loss function.
    1. Train the network with stochastic gradient descent for 50 epochs to fit the function $y = \max(\sin(x), -0.2)$ for $-1 \leq x \leq 1$, monitor the error, and test the network.
    2. Perform the training with a higher number of epochs.
    3. Perform the training with multiple hidden layers.
  2. Replace a fully connected layer with a convolution layer and train the network for the previous example.
  3. Perform the above exercise and the examples from the chapter using a deep learning toolbox in a framework of your choice (MATLAB, PyTorch, TensorFlow).

References

  1. [Aka12] M. Akamine and J. Ajmera: A decision‐tree‐based algorithm for speech/music classification and segmentation. EURASIP Journal on Audio, Speech, and Music Processing, 10, Feb 2012.
  2. [Alg01] V. R. Algazi, R. O. Duda, D. M. Thompson, and C. Avendano: The CIPIC HRTF database. Proceedings of the 2001 IEEE Workshop on the Applications of Signal Processing to Audio and Acoustics, pages 99–102, Oct 2001.
  3. [Bac91] A.D. Back and A. C. Tsoi: FIR and IIR synapses, a new neural network architecture for time series modeling. Neural Computation, 3(3):375–385, 1991.
  4. [Bah15] D. Bahdanau, K. Cho, and Y. Bengio: Neural machine translation by jointly learning to align and translate. In 3rd International Conference on Learning Representations, May 2015.
  5. [Bha20] P. Bhattacharya, P. Nowak, and U. Zölzer: Optimization of cascaded parametric peak and shelving filters with backpropagation algorithm. Digital Audio Effects 2020 (DaFX), 2020.
  6. [Bol79] S. F. Boll: Suppression of acoustic noise in speech using spectral subtraction. IEEE Transactions on Acoustics, Speech, Signal Processing, ASSP‐27(2):113–120, Apr 1979.
  7. [Chi06] J‐T. Chien and B‐C. Chen: A new independent component analysis for speech recognition and separation. IEEE Transactions on Audio, Speech, and Language Processing, 14(4):1245–1254, 2006.
  8. [Che16] J. Cheng, L. Dong, and M. Lapata: Long short‐term memory networks for machine reading. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 551–561, Nov 2016.
  9. [Car17] H. Caracalla and A. Roebel: Gradient conversion between time and frequency domains using Wirtinger calculus. In Digital Audio Effects 2017 (DaFX), Edinburgh, United Kingdom, Sep 2017.
  10. [Cam96] P. Campolucci, A. Uncini, and F. Piazza: Fast adaptive IIR‐MLP neural networks for signal processing application. Acoustics, Speech, and Signal Processing, IEEE International Conference on, 6:3529–3532, Jun 1996.
  11. [Dau15] Y. N. Dauphin, H. de Vries, J. Chung, and Y. Bengio: Rmsprop and equilibrated adaptive learning rates for non‐convex optimization. CoRR, abs/1502.04390, 2015.
  12. [Duc11] J. Duchi, E. Hazan, and Y. Singer: Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12:2121–2159, Jul 2011.
  13. [Dua10] Z. Duan, B. Pardo, and C. Zhang: Multiple fundamental frequency estimation by modeling spectral peaks and non‐peak regions. IEEE Trans. Audio Speech Language Process., 18(8):2121–2133, 2010.
  14. [EM84] Y. Ephraim and D. Malah: Speech enhancement using a minimum mean‐square error short‐time spectral amplitude estimator. IEEE Transactions on Acoustics, Speech, Signal Processing, ASSP‐32(6):1109–1121, Dec 1984.
  15. [Eng20] J. Engel, L. Hantrakul, C. Gu, and A. Roberts: DDSP: Differentiable Digital Signal Processing. International Conference on Learning Representation, pages 1–19, 2020.
  16. [Fev09] C. Fevotte, N. Bertin, and J‐L. Durrieu: Nonnegative matrix factorization with the Itakura‐Saito divergence: With application to music analysis. Neural Computation, 21(3):793–830, 2009.
  17. [Glo10] X. Glorot and Y. Bengio: Understanding the difficulty of training deep feedforward neural networks. Journal of Machine Learning Research, 9:249–256, 2010.
  18. [GL03] G. Guodong and S.Z. Li: Content‐based audio classification and retrieval by support vector machines. IEEE Transactions on Neural Networks, 14(1):209–215, 2003.
  19. [Gao92] F.X.Y. Gao and W.M. Snelgrove: An adaptive backpropagation cascade IIR filter. IEEE Transactions on Circuits and Systems II: Analog and Digital Signal Processing, 39(9):606–610, Sep 1992.
  20. [God01] S.J. Godsill, P.J. Wolfe, and W.N.W. Fong: Statistical model‐based approaches to audio restoration and analysis. Journal of New Music Research, 30(4):323–338, 2001.
  21. [Han17] Y. Han, J. Kim, and K. Lee: Deep convolutional neural networks for predominant instrument recognition in polyphonic music. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 25(1):208–221, 2017.
  22. [Hao19] X. Hao, C. Shan, Y. Xu, S. Sun, and L. Xie: An attention‐based neural network approach for single channel speech enhancement. In ICASSP 2019 ‐ 2019 IEEE International Conference on Acoustics, Speech and Signal Processing, pages 6895–6899, 2019.
  23. [He15] K. He, X. Zhang, S. Ren, and J. Sun: Delving deep into rectifiers: Surpassing human‐level performance on imagenet classification, 2015.
  24. [ITU01] ITU: Perceptual evaluation of speech quality (PESQ): An objective method for end‐to‐end speech quality assessment of narrow‐band telephone networks and speech codecs, 2001.
  25. [Kin15] D. P. Kingma and J. Ba: Adam: A method for stochastic optimization. In 3rd International Conference on Learning Representations, ICLR, May 2015.
  26. [Lav09] Y. Lavner and M.E.P. Davies: A decision‐tree‐based algorithm for speech/music classification and segmentation. EURASIP Journal on Audio, Speech, and Music Processing, 239892, Jun 2009.
  27. [Ler12] A. Lerch: An introduction to audio content analysis: applications in signal processing and music informatics. John Wiley & Sons, Ltd., 2012.
  28. [LL01] L. Lu, S.Z. Li, and H‐J. Zhang: Content‐based audio segmentation using support vector machines. In IEEE International Conference on Multimedia and Expo, pages 749–752, 2001.
  29. [LM19] Y. Luo and N. Mesgarani: Conv‐TasNet: Surpassing ideal time–frequency magnitude masking for speech separation. IEEE/ACM Trans. Audio, Speech and Lang. Proc., 27(8):1256–1266, Aug 2019.
  30. [Lu16] Y‐C. Lu, C‐W. Wu, A. Lerch, and C‐T. Lu: Automatic outlier detection in music genre datasets. In Proceedings of the 17th International Society for Music Information Retrieval Conference, ISMIR, pages 101–107, Aug 2016.
  31. [Mül15] M. Müller: Fundamentals of Music Processing: Audio, Analysis, Algorithms, Applications. Springer Publishing Company, Incorporated, 2015.
  32. [Mac67] J. MacQueen: Some methods for classification and analysis of multivariate observations. In Proceedings of the 5th Berkeley Symposium on Math, Statistics, and Probability, pages 281–297, 1967.
  33. [Mar20] M.A.R. Martinez, E. Benetos, and J.D. Reiss: Deep learning for black‐box modeling of audio effects. Applied Sciences, 10(2), 2020.
  34. [Mee18] H. S. Meet, N. Shah, and A. H. Patil: Time‐frequency masking‐based speech enhancement using generative adversarial network. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 5039–5043, 2018.
  35. [Min19] G. Min, C. Zhang, X. Zhang, and W. Tan: Deep vocoder: Low bit rate compression of speech with deep autoencoder. In 2019 IEEE International Conference on Multimedia Expo Workshops (ICMEW), pages 372–377, 2019.
  36. [Ner20] S. Nercessian: Neural parametric equalizer matching using differentiable biquads. Digital Audio Effects 2020 (DaFX), 2020.
  37. [Nes83] Y. Nesterov: A method for unconstrained convex minimization problem with the rate of convergence O(1/k²). Doklady ANSSSR, pages 543–547, 1983.
  38. [Ora17] S. Oramas, O. Nieto, F. Barbieri, and X. Serra: Multi‐label music genre classification from audio, text, and images using deep features. CoRR, abs/1707.04916, 2017.
  39. [Poh06] T. Pohle, P. Knees, M. Schedl, and G. Widmer: Independent component analysis for music similarity computation. In ISMIR 2006, 7th International Conference on Music Information Retrieval, pages 228–233, Oct 2006.
  40. [Par17] Se Rim Park and Jin Won Lee: A fully convolutional neural network for speech enhancement. In Proc. Interspeech 2017, pages 1993–1997, 2017.
  41. [Pir11] G. Pirker, M. Wohlmayr, S. Petrik, and F. Pernkopf: A pitch tracking corpus with evaluation on multipitch tracking scenario. In Interspeech 2011, pages 1509–1512, 2011.
  42. [Ros61] F. Rosenblatt: Principles of Neurodynamics. Perceptrons and the Theory of Brain Mechanisms. Defense Technical Information Center, 1961.
  43. [Ros99] R.E. Rose: Training a recursive filter by use of derivative function, US patent US5905659A, May 1999.
  44. [Ret18] D. Rethage, J. Pons, and X. Serra: A wavenet for speech denoising. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 5069–5073, 2018.
  45. [SD21] S. Sulun and M.E.P. Davies: On filter generalization for music bandwidth extension using deep neural networks. IEEE Journal of Selected Topics in Signal Processing, 15(1):132–142, 2021.
  46. [Sny17] D. Snyder, D. Garcia‐Romero, D. Povey, and S. Khudanpur: Deep neural network embeddings for text‐independent speaker verification. In Proc. Interspeech 2017, pages 999–1003, 2017.
  47. [Sny18] D. Snyder, D. Garcia‐Romero, G. Sell, D. Povey, and S. Khudanpur: X‐vectors: Robust dnn embeddings for speaker recognition. In IEEE International Conference on Acoustics, Speech and Signal Processing, pages 5329–5333, 2018.
  48. [Soo97] I. Y. Soon, S. N. Koh, and C. K. Yeo: Wavelet for speech denoising. In Proceedings of IEEE TENCON '97, IEEE Region 10 Annual Conference of Speech and Image Technologies for Computing and Telecommunications, volume 2, pages 479–482, 1997.
  49. [Sze15] C. Szegedy, L. Wei, J. Yangqing, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich: Going deeper with convolutions. In 2015 IEEE Conference on Computer Vision and Pattern Recognition, pages 1–9, 2015.
  50. [Su16] L. Su and Y‐H. Yang: Escaping from the abyss of manual annotation: New methodology of building polyphonic datasets for automatic music transcription. In Music, Mind, and Embodiment, pages 309–321. Springer International Publishing, 2016.
  51. [SZZ16] H. Su, H. Zhang, X. Zhang, and G. Gao. Convolutional neural network for robust pitch determination. In IEEE International Conference on Acoustics, Speech and Signal Processing, pages 579–583, 2016.
  52. [Thi00] T. Thiede, W. C. Treurniet, R. Bitto, C. Schmidmer, T. Sporer, J. G. Beerends, and C. Colomes: PEAQ ‐ the ITU standard for objective measurement of perceived audio quality. Journal of the audio engineering society, 48(1/2):3–29, Feb 2000.
  53. [Tak03] H. Takeshi, M. Takahiro, I. Yoshihisa, and H. Tetsuya: Musical noise reduction using an adaptive filter. Acoustical Society of America Journal, 114(4), Oct 2003.
  54. [Val18] J‐M. Valin: A hybrid dsp/deep learning approach to real‐time full‐band speech enhancement. In 2018 IEEE 20th International Workshop on Multimedia Signal Processing (MMSP), pages 1–5, 2018.
  55. [Väl19] V. Välimäki and J. Rämö: Neurally controlled graphic equalizer. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 27(12), Dec 2019.
  56. [Vir07] T. Virtanen: Monaural sound source separation by nonnegative matrix factorization with temporal continuity and sparseness criteria. IEEE Transactions on Audio, Speech, and Language Processing, 15(3):1066–1074, 2007.
  57. [Wan20] X. Wang, S. Takaki, and J. Yamagishi: Neural source‐filter waveform models for statistical parametric speech synthesis. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 28:402–415, 2020.
  58. [Xu15] K. Xu, J. Ba, R. Kiros, K. Cho, A. C. Courville, R. Salakhutdinov, R. S. Zemel, and Y. Bengio: Show, attend and tell: Neural image caption generation with visual attention. In Proceedings of the 32nd International Conference on Machine Learning, volume 37, pages 2048–2057, Jul 2015.
  59. [Yu08] G. Yu, S. Mallat, and E. Bacry: Audio denoising by time‐frequency block thresholding. IEEE Transactions on Signal Processing, 56(5):1830–1839, 2008.
  60. [Zha16] J. Zhang, J. Tang, and Li‐R. Dai: RNN‐BLSTM based multi‐pitch estimation. In Interspeech 2016, pages 1785–1789, 2016.