Generating music with dilated ConvNets, WaveNet, and NSynth

WaveNet is a deep generative model for producing raw audio waveforms. This breakthrough technology was introduced by Google DeepMind (https://deepmind.com/) to teach computers how to speak (https://deepmind.com/blog/wavenet-generative-model-raw-audio/). The results are truly impressive, and online you can find examples of synthetic voices where the computer learns to talk with the voice of celebrities, such as Matt Damon.

So, you might wonder why learning to synthesize audio is so difficult. Well, each digital sound we hear is based on 16,000 samples per second (sometimes 48,000 or more), and building a predictive model that learns to reproduce each sample based on all the previous ones is a very difficult challenge. Nevertheless, there are experiments showing that WaveNet has improved the current state-of-the-art Text-to-Speech (TTS) systems, reducing the difference with respect to human voices by 50 percent for both US English and Mandarin Chinese.

What is even cooler is that DeepMind showed that WaveNet can also be used to teach computers how to generate the sound of musical instruments, such as piano music.

Now for some definitions. TTS systems are typically divided into two different classes:

  • Concatenative TTS, where individual speech fragments are first memorized and then recombined when the voice has to be reproduced. However, this approach does not scale, because only the memorized fragments can be reproduced; new speakers or different types of audio cannot be generated without memorizing new fragments from scratch.
  • Parametric TTS, where a model stores all the characteristic features of the audio to be synthesized. Before WaveNet, the audio generated with parametric TTS sounded less natural than that of concatenative TTS. WaveNet improved the state of the art by directly modeling the production of audio samples, instead of relying on the intermediate signal processing algorithms used in the past.

In principle, WaveNet can be seen as a stack of 1D convolutional layers (we saw 2D convolutions for images in Chapter 4) with a constant stride of one and no pooling layers. Note that the input and the output have, by construction, the same dimension, so ConvNets are well suited to modeling sequential data such as audio. However, it has been shown that in order for the output neurons to reach a large receptive field, it is necessary to either use a massive number of large filters or increase the depth of the network prohibitively. Remember that the receptive field of a neuron in a layer is the cross-section of the previous layer from which the neuron receives its inputs. For this reason, pure ConvNets are not so effective at learning how to synthesize audio.

The key intuition behind WaveNet is the so-called dilated causal convolution (sometimes known as atrous convolution), which simply means that some input values are skipped when the filter of a convolutional layer is applied. Atrous is a bastardization of the French expression à trous, meaning with holes, so an AtrousConvolution is a convolution with holes. As an example, in one dimension a filter w of size 3 with a dilation (hole) size of 1 would compute the following sum:

w[0]x[0] + w[1]x[2] + w[2]x[4]
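To make the idea of holes concrete, here is a minimal NumPy sketch of the sum above (the function name and the toy signal are illustrative, not part of any library); note that a hole size of 1 between taps corresponds to dilation_rate=2 in most deep learning APIs:

import numpy as np

def dilated_conv1d(x, w, dilation=2):
    # Apply filter w to signal x, skipping (dilation - 1) samples between taps
    out_len = len(x) - dilation * (len(w) - 1)
    return np.array([sum(w[k] * x[t + k * dilation] for k in range(len(w)))
                     for t in range(out_len)])

x = np.arange(10, dtype=float)            # toy input signal
w = np.array([0.25, 0.5, 0.25])           # filter of size 3
print(dilated_conv1d(x, w, dilation=2))   # first output = w[0]x[0] + w[1]x[2] + w[2]x[4]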

In short, in a D-dilated convolution the stride is usually 1, but nothing prevents you from using other strides. An example is given in the following figure with increasing dilation (hole) sizes = 0, 1, 2:

An example of a dilated network

Thanks to this simple idea of introducing 'holes', it is possible to stack multiple dilated convolutional layers with exponentially increasing dilation factors and learn long-range input dependencies without having an excessively deep network.

A WaveNet is therefore a ConvNet where the convolutional layers have various dilation factors, allowing the receptive field to grow exponentially with depth and thus efficiently cover thousands of audio timesteps.
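The growth of the receptive field is easy to quantify: for a stack of dilated causal convolutions, the receptive field is 1 + the sum of (kernel size - 1) x dilation over all layers. The following sketch, assuming a WaveNet-style schedule (kernel size 2 with dilations doubling from 1 to 512, repeated three times; these exact numbers are an assumption for illustration), shows that roughly three thousand timesteps are covered with only 30 layers:

# Receptive field of a stack of dilated causal convolutions
kernel_size = 2
dilations = [2 ** i for i in range(10)] * 3    # 1, 2, 4, ..., 512, repeated three times

receptive_field = 1 + sum((kernel_size - 1) * d for d in dilations)
print(receptive_field)    # 3070 timesteps covered by only 30 layers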

During training, the inputs are sounds recorded from human speakers. The waveforms are quantized to a fixed integer range. A WaveNet defines an initial convolutional layer that accesses only the current and previous inputs. Then there is a stack of dilated ConvNet layers, still accessing only the current and previous inputs. At the end, there is a series of dense layers that combine the previous results, followed by a softmax activation function for categorical outputs.
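The following is a simplified tf.keras sketch of this layer stack. The number of filters, the dilation schedule, and the 256-way quantization are assumptions made for illustration; the published model also uses gated activations and residual/skip connections, which are omitted here for brevity:

import tensorflow as tf

n_classes = 256        # quantized amplitude levels (an assumption for this sketch)
inputs = tf.keras.Input(shape=(None, 1))

# Initial causal convolution: each output sees only the current and previous sample
x = tf.keras.layers.Conv1D(32, kernel_size=2, padding='causal')(inputs)

# Stack of dilated causal convolutions with exponentially growing dilation factors
for dilation in [1, 2, 4, 8, 16, 32, 64, 128, 256, 512]:
    x = tf.keras.layers.Conv1D(32, kernel_size=2, padding='causal',
                               dilation_rate=dilation, activation='relu')(x)

# 1x1 convolutions play the role of the per-timestep dense layers,
# followed by a softmax over the quantized output classes
x = tf.keras.layers.Conv1D(128, kernel_size=1, activation='relu')(x)
outputs = tf.keras.layers.Conv1D(n_classes, kernel_size=1, activation='softmax')(x)

model = tf.keras.Model(inputs, outputs)
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy')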

At each step, a value is predicted by the network and fed back into the input, and at the same time a new prediction for the next step is computed. The loss function is the cross-entropy between the output at the current step and the input at the next step.
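Both points can be sketched in a few lines, reusing the hypothetical model from the previous snippet; the helper names below are made up for illustration, and quantized_audio is assumed to be a 1D NumPy array of integer class indices:

import numpy as np

# Training target: the class at step t + 1 is predicted from the inputs up to step t
def make_training_pair(quantized_audio):
    x = quantized_audio[:-1].reshape(1, -1, 1).astype('float32')  # inputs (raw class indices, for simplicity)
    y = quantized_audio[1:].reshape(1, -1, 1)                     # shifted targets for the cross-entropy loss
    return x, y

# Generation: predict the next sample, feed it back into the input, and repeat
def generate(model, seed, n_steps=100):
    samples = list(seed)
    for _ in range(n_steps):
        x = np.array(samples, dtype='float32').reshape(1, -1, 1)
        probs = model.predict(x, verbose=0)[0, -1]     # distribution over the next sample
        probs = probs / probs.sum()                    # renormalize against float rounding
        samples.append(np.random.choice(len(probs), p=probs))
    return samples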

NSynth (https://magenta.tensorflow.org/nsynth) is an evolution of WaveNet recently released by the Google Brain group which, instead of being causal, aims at seeing the entire context of the input chunk. The neural network is truly complex, as depicted in the following image, but for the sake of this introductory discussion it is sufficient to know that the network learns how to reproduce its input using an approach based on reducing the error during the encoding/decoding phases:

An example of the NSynth architecture, as seen at https://magenta.tensorflow.org/nsynth