Building an audio recognition model using siamese networks

In the last tutorial, we saw how to use siamese networks to recognize a face. Now we will see how to use siamese networks to recognize audio. We will train our network to differentiate between the sound of a dog and the sound of a cat. The dataset of cat and dog audio can be downloaded from here: https://www.kaggle.com/mmoreaux/audio-cats-and-dogs#cats_dogs.zip.

Once we have downloaded the data, we organize it into three folders: Dogs, Sub_dogs, and Cats. In the Dogs and Sub_dogs folders, we place the dog barking audio clips, and in the Cats folder, we place the cat audio clips. The objective of our network is to recognize whether an audio clip is a dog barking or some other sound. As we know, a siamese network needs its input as a pair: we select one audio clip from each of the Dogs and Sub_dogs folders and mark them as a genuine pair, and we select one audio clip from each of the Dogs and Cats folders and mark them as an imposter pair. That is, (Dogs, Sub_dogs) is a genuine pair and (Dogs, Cats) is an imposter pair.
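
Before moving on, it is worth confirming that the folders are laid out the way the upcoming code expects. The data/audio/ prefix is the path this tutorial assumes; adjust it if you extracted the dataset elsewhere. A quick sanity check:

#sanity check on the folder layout assumed in this tutorial
import glob

for folder in ['Dogs', 'Sub_dogs', 'Cats']:
    files = glob.glob('data/audio/' + folder + '/*.wav')
    print(folder, len(files))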

Now, we will see, step by step, how to train our siamese network to recognize whether an audio clip is a dog barking or a different sound.

For better understanding, you can check the complete code, which is available as a Jupyter Notebook with an explanation here: https://github.com/sudharsan13296/Hands-On-Meta-Learning-With-Python/blob/master/02.%20Face%20and%20Audio%20Recognition%20using%20Siamese%20Networks/2.5%20Audio%20Recognition%20using%20Siamese%20Network.ipynb.

First, we will load all of the necessary libraries:

#basic imports
import glob
import IPython
from random import randint

#data processing
import librosa
import numpy as np

#modelling
from sklearn.model_selection import train_test_split

from keras import backend as K
from keras.layers import Activation
from keras.layers import Input, Lambda, Dense, Dropout, Flatten
from keras.models import Model
from keras.optimizers import RMSprop

Before going ahead, we load and listen to the audio clips:

IPython.display.Audio("data/audio/Dogs/dog_barking_0.wav")

IPython.display.Audio("data/audio/Cats/cat_13.wav")

So, how can we feed this raw audio to our network? How can we extract meaningful features from the raw audio? As we know, neural networks accept only vectorized input, so we need to convert our audio into a feature vector. How can we do that? Well, there are several mechanisms through which we can generate embeddings for audio. One popular mechanism is Mel-Frequency Cepstral Coefficients (MFCC). MFCCs represent the short-term power spectrum of an audio signal, obtained by taking a linear cosine transform of the log power spectrum on a nonlinear mel scale of frequency. To learn more about MFCC, check out this nice tutorial: http://practicalcryptography.com/miscellaneous/machine-learning/guide-mel-frequency-cepstral-coefficients-mfccs/.
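
As a quick aside, the mel scale mentioned above is a nonlinear mapping of frequency that approximates human pitch perception. librosa exposes the conversion directly, so we can get a feel for it; this snippet is only illustrative and is not needed for the rest of the tutorial:

import librosa

#convert a frequency in Hz to its value on the mel scale, and back again
print(librosa.hz_to_mel(440.0))
print(librosa.mel_to_hz(100.0))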

We will use the MFCC function from the librosa library for generating the audio embeddings. So, we define a function called audio2vector, which returns the audio embeddings given an audio file:

def audio2vector(file_path, max_pad_len=400):

    #read the audio file
    audio, sr = librosa.load(file_path, mono=True)

    #reduce the shape
    audio = audio[::3]

    #extract the audio embeddings using MFCC
    mfcc = librosa.feature.mfcc(audio, sr=sr)

    #as the audio embeddings length varies for different audio, we keep the maximum length as 400
    #pad them with zeros
    pad_width = max_pad_len - mfcc.shape[1]
    mfcc = np.pad(mfcc, pad_width=((0, 0), (0, pad_width)), mode='constant')

    return mfcc

We will load one audio file and see the embeddings:

audio_file = 'data/audio/Dogs/dog_barking_0.wav'
audio2vector(audio_file)
array([[-297.54905127, -288.37618855, -314.92037769, ...,    0.,    0.,    0.],
       [  23.05969394,    9.55913148,   37.2173831 , ...,    0.,    0.,    0.],
       [-122.06299523, -115.02627567, -108.18703056, ...,    0.,    0.,    0.],
       ...,
       [  -6.40930836,   -2.8602708 ,   -2.12551478, ...,    0.,    0.,    0.],
       [   0.70572914,    4.21777791,    4.62429301, ...,    0.,    0.,    0.],
       [  -6.08997702,  -11.40687886,  -18.2415214 , ...,    0.,    0.,    0.]])
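
Each embedding is a two-dimensional array: one row per MFCC coefficient (librosa extracts 20 by default) and one column per frame, padded to our maximum length of 400. We can verify this quickly:

print(audio2vector(audio_file).shape)
#expected: (20, 400) with librosa's default of 20 MFCC coefficients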

Now that we understand how to generate audio embeddings, we need to create the data for our siamese network. As we know, a siamese network accepts its data as pairs, so we define a function for getting our data. We will create the genuine pairs as (Dogs, Sub_dogs) with the label 1, and the imposter pairs as (Dogs, Cats) with the label 0:

def get_data():

    pairs = []
    labels = []

    Dogs = glob.glob('data/audio/Dogs/*.wav')
    Sub_dogs = glob.glob('data/audio/Sub_dogs/*.wav')
    Cats = glob.glob('data/audio/Cats/*.wav')

    np.random.shuffle(Sub_dogs)
    np.random.shuffle(Cats)

    for i in range(min(len(Cats), len(Sub_dogs))):

        #imposter pair
        if (i % 2) == 0:
            pairs.append([audio2vector(Dogs[randint(0,3)]), audio2vector(Cats[i])])
            labels.append(0)

        #genuine pair
        else:
            pairs.append([audio2vector(Dogs[randint(0,3)]), audio2vector(Sub_dogs[i])])
            labels.append(1)

    return np.array(pairs), np.array(labels)

X, Y = get_data()

Next, we split our data for training and testing with 75% training and 25% testing proportions:

X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.25)
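
Each element of X_train holds one pair of embeddings stacked along the second axis, so its shape is (number of pairs, 2, 20, 400), assuming the default 20 MFCC coefficients; we will slice the two sides of each pair out of this axis when we train the model:

print(X_train.shape)
#(number of training pairs, 2, 20, 400)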

Now that we have successfully generated our data, we build our siamese network. We define our base network, which is used for feature extraction; it consists of three dense layers with a dropout layer after each of the first two:

def build_base_network(input_shape):

    input = Input(shape=input_shape)
    x = Flatten()(input)
    x = Dense(128, activation='relu')(x)
    x = Dropout(0.1)(x)
    x = Dense(128, activation='relu')(x)
    x = Dropout(0.1)(x)
    x = Dense(128, activation='relu')(x)

    return Model(input, x)

Next, we feed the audio pair to the base network, which will return the features:

input_dim = X_train.shape[2:]
audio_a = Input(shape=input_dim)
audio_b = Input(shape=input_dim)

base_network = build_base_network(input_dim)
feat_vecs_a = base_network(audio_a)
feat_vecs_b = base_network(audio_b)

feat_vecs_a and feat_vecs_b are the feature vectors of our audio pair. Next, we feed these feature vectors to the energy function to compute a distance between them, and we use Euclidean distance as our energy function:

def euclidean_distance(vects):
    x, y = vects
    return K.sqrt(K.sum(K.square(x - y), axis=1, keepdims=True))


def eucl_dist_output_shape(shapes):
    shape1, shape2 = shapes
    return (shape1[0], 1)

distance = Lambda(euclidean_distance, output_shape=eucl_dist_output_shape)([feat_vecs_a, feat_vecs_b])

Next, we set the number of epochs to 13 and use RMSprop as our optimizer:

epochs = 13
rms = RMSprop()

model = Model(inputs=[audio_a, audio_b], outputs=distance)

Lastly, we define our loss function as contrastive_loss and compile the model:

def contrastive_loss(y_true, y_pred):
    margin = 1
    return K.mean(y_true * K.square(y_pred) + (1 - y_true) * K.square(K.maximum(margin - y_pred, 0)))

model.compile(loss=contrastive_loss, optimizer=rms)
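
To get an intuition for this loss, here is a quick NumPy check outside the Keras graph: genuine pairs (label 1) are penalized by the squared distance, while imposter pairs (label 0) are penalized only while their distance is still below the margin:

#a plain NumPy version of the contrastive loss, for intuition only
def contrastive_loss_np(y_true, distance, margin=1.0):
    return np.mean(y_true * distance**2 + (1 - y_true) * np.maximum(margin - distance, 0)**2)

print(contrastive_loss_np(np.array([1.0]), np.array([0.2])))   #genuine pair, small distance -> small loss (0.04)
print(contrastive_loss_np(np.array([0.0]), np.array([0.2])))   #imposter pair, small distance -> large loss (0.64)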

Now, we train our model:

audio_1 = X_train[:, 0]
audio_2 = X_train[:, 1]

model.fit([audio_1, audio_2], y_train, validation_split=.25,
          batch_size=128, verbose=2, epochs=epochs)

You can see how the loss decreases over the epochs:

Train on 8 samples, validate on 3 samples
Epoch 1/13
 - 0s - loss: 23594.8965 - val_loss: 1598.8439
Epoch 2/13
 - 0s - loss: 62360.9570 - val_loss: 816.7302
Epoch 3/13
 - 0s - loss: 17967.6230 - val_loss: 970.0378
Epoch 4/13
 - 0s - loss: 20030.3711 - val_loss: 358.9078
Epoch 5/13
 - 0s - loss: 11196.0547 - val_loss: 339.9991
Epoch 6/13
 - 0s - loss: 3837.2898 - val_loss: 381.9774
Epoch 7/13
 - 0s - loss: 2037.2965 - val_loss: 303.6652
Epoch 8/13
 - 0s - loss: 1434.4321 - val_loss: 229.1388
Epoch 9/13
 - 0s - loss: 2553.0562 - val_loss: 215.1207
Epoch 10/13
 - 0s - loss: 1046.6870 - val_loss: 197.1127
Epoch 11/13
 - 0s - loss: 569.4632 - val_loss: 183.8586
Epoch 12/13
 - 0s - loss: 759.0131 - val_loss: 162.3362
Epoch 13/13
 - 0s - loss: 819.8594 - val_loss: 120.3017
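
Once training is done, we can use the model to score the test pairs: a small distance means the pair is predicted to be genuine, that is, both sounds are a dog barking. The 0.5 threshold below is only an illustrative choice; in practice, you would tune it on validation data:

pred_distance = model.predict([X_test[:, 0], X_test[:, 1]])

#classify a pair as genuine when the predicted distance is below the threshold
pred_labels = (pred_distance.ravel() < 0.5).astype(int)
print('Test accuracy:', np.mean(pred_labels == y_test))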