Data pre-processing

To build a model that can generate lyrics, we need a large amount of lyrics data, which can easily be extracted from various sources. We collected the lyrics of roughly 10,000 songs and stored them in a text file called lyrics_data.txt. You can find the data file in the GitHub repository.

Now that we have our data available, we need to convert this raw text into a one-hot encoding:

import numpy as np
import codecs

# Class to perform all pre-processing operations
class Preprocessing:
    def __init__(self):
        self.vocabulary = {}
        self.binary_vocabulary = {}
        self.char_lookup = {}
        self.size = 0
        self.separator = '->'

    # Reads the data file, builds the character vocabulary, and creates
    # its one-hot encoding, which can then be dumped to a vocab file
    def generate(self, input_file_path):
        input_file = codecs.open(input_file_path, 'r', 'utf_8')
        index = 0
        # Assign a unique integer index to every distinct character
        for line in input_file:
            for char in line:
                if char not in self.vocabulary:
                    self.vocabulary[char] = index
                    self.char_lookup[index] = char
                    index += 1
        input_file.close()
        self.set_vocabulary_size()
        self.create_binary_representation()

    # Loads a previously dumped vocab file back into memory
    def retrieve(self, input_file_path):
        input_file = codecs.open(input_file_path, 'r', 'utf_8')
        buffer = ""
        for line in input_file:
            try:
                separator_position = len(buffer) + line.index(self.separator)
                buffer += line
                key = buffer[:separator_position]
                value = buffer[separator_position + len(self.separator):]
                value = np.fromstring(value, sep=',')

                self.binary_vocabulary[key] = value
                self.vocabulary[key] = np.where(value == 1)[0][0]
                self.char_lookup[np.where(value == 1)[0][0]] = key

                buffer = ""
            except ValueError:
                # Separator not found yet: the entry spans multiple lines,
                # so keep buffering until the separator appears
                buffer += line
        input_file.close()
        self.set_vocabulary_size()

    # Below are some helper functions to perform pre-processing.
    def create_binary_representation(self):
        for key, value in self.vocabulary.items():
            binary = np.zeros(self.size)
            binary[value] = 1
            self.binary_vocabulary[key] = binary

    def set_vocabulary_size(self):
        self.size = len(self.vocabulary)
        print("Vocabulary size: {}".format(self.size))

    def get_serialized_binary_representation(self):
        string = ""
        np.set_printoptions(threshold=np.inf)
        for key, value in self.binary_vocabulary.items():
            array_as_string = np.array2string(value, separator=',', max_line_width=self.size * self.size)
            string += "{}{}{}\n".format(key, self.separator, array_as_string[1:len(array_as_string) - 1])
        return string
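
A minimal driver sketch for the class above (the filenames here are assumptions; any dataset path works) builds the vocabulary from lyrics_data.txt and writes the serialized encoding to lyrics_data.txt.vocab:

# Hypothetical driver: build the vocabulary and dump the .vocab file
preprocessing = Preprocessing()
preprocessing.generate('lyrics_data.txt')

output_file = codecs.open('lyrics_data.txt.vocab', 'w', 'utf_8')
output_file.write(preprocessing.get_serialized_binary_representation())
output_file.close()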

So the overall objective of the pre-processing module is to convert the raw text data into a one-hot encoding. After the pre-processing module executes successfully, a binary file will be dumped as {dataset_filename}.vocab. This vocab file is one of the mandatory files that must be fed into the model during training, along with the dataset.
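
For intuition, a hypothetical three-character vocabulary would serialize into entries like the following: one key, the '->' separator, and the one-hot vector as comma-separated values per line. This is the format that retrieve() parses back into memory:

a->1.,0.,0.
b->0.,1.,0.
c->0.,0.,1.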
