Data pre-processing

To build a model that can generate lyrics, we need a large amount of lyrics data, which can easily be extracted from various sources. We collected the lyrics of roughly 10,000 songs and stored them in a text file called lyrics_data.txt. You can find the data file in the GitHub repository.

Now that we have our data available, we need to convert this raw text into a one-hot encoding:

import numpy as np
import codecs

# Class to perform all pre-processing operations
class Preprocessing:
    def __init__(self):
        self.vocabulary = {}
        self.binary_vocabulary = {}
        self.char_lookup = {}
        self.size = 0
        self.separator = '->'

    # Reads the data file, builds the character vocabulary, and creates
    # its one-hot encoding, which can then be dumped to a vocab file
    def generate(self, input_file_path):
        input_file = codecs.open(input_file_path, 'r', 'utf_8')
        index = 0
        # Assign a unique integer index to every distinct character
        for line in input_file:
            for char in line:
                if char not in self.vocabulary:
                    self.vocabulary[char] = index
                    self.char_lookup[index] = char
                    index += 1
        input_file.close()
        self.set_vocabulary_size()
        self.create_binary_representation()

    # Loads a previously dumped vocab file back into memory
    def retrieve(self, input_file_path):
        input_file = codecs.open(input_file_path, 'r', 'utf_8')
        buffer = ""
        for line in input_file:
            try:
                separator_position = len(buffer) + line.index(self.separator)
                buffer += line
                key = buffer[:separator_position]
                value = buffer[separator_position + len(self.separator):]
                value = np.fromstring(value, sep=',')

                self.binary_vocabulary[key] = value
                self.vocabulary[key] = np.where(value == 1)[0][0]
                self.char_lookup[np.where(value == 1)[0][0]] = key

                buffer = ""
            except ValueError:
                # Separator not found yet: the entry spans multiple lines,
                # so keep buffering until the separator appears
                buffer += line
        input_file.close()
        self.set_vocabulary_size()

    # Below are some helper functions to perform pre-processing.
    def create_binary_representation(self):
        for key, value in self.vocabulary.items():
            binary = np.zeros(self.size)
            binary[value] = 1
            self.binary_vocabulary[key] = binary

    def set_vocabulary_size(self):
        self.size = len(self.vocabulary)
        print("Vocabulary size: {}".format(self.size))

    def get_serialized_binary_representation(self):
        string = ""
        np.set_printoptions(threshold=np.inf)
        for key, value in self.binary_vocabulary.items():
            array_as_string = np.array2string(value, separator=',', max_line_width=self.size * self.size)
            string += "{}{}{}\n".format(key, self.separator, array_as_string[1:len(array_as_string) - 1])
        return string
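
A minimal driver sketch for the class above (the filenames here are assumptions; any dataset path works) builds the vocabulary from lyrics_data.txt and writes the serialized encoding to lyrics_data.txt.vocab:

# Hypothetical driver: build the vocabulary and dump the .vocab file
preprocessing = Preprocessing()
preprocessing.generate('lyrics_data.txt')

output_file = codecs.open('lyrics_data.txt.vocab', 'w', 'utf_8')
output_file.write(preprocessing.get_serialized_binary_representation())
output_file.close()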

So the overall objective of the pre-processing module is to convert the raw text data into a one-hot encoding. After the pre-processing module executes successfully, a binary file will be dumped as {dataset_filename}.vocab. This vocab file is one of the mandatory files that must be fed into the model during training, along with the dataset.
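
For intuition, a hypothetical three-character vocabulary would serialize into entries like the following: one key, the '->' separator, and the one-hot vector as comma-separated values per line. This is the format that retrieve() parses back into memory:

a->1.,0.,0.
b->0.,1.,0.
c->0.,0.,1.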
