Working with data and creating models in IBM PowerAI
This chapter illustrates practical examples of neural networks, how to implement them, and the advantages of using IBM PowerAI. It describes why you must understand the context of your project and learn that model definition is just one piece of the overall project.
This chapter describes understanding a project in detail, from a business perspective to the technical levels. This context enables the technical team to verify and confirm the project feasibility before any initial decisions are implemented.
The chapter then describes data preparation. After the project goals and direction are finalized, this stage is the next step. It is a critical early project phase that, if done incorrectly, directly impacts the model's accuracy. This chapter highlights many aspects of this phase and describes how a different model can change the method that is used for data preparation.
The final sections of the chapter illustrate example coded neural network models and the data preparation process, and show how that process provides the foundation to grow and advance the model.
This chapter contains the following topics:
5.1, "Knowing your requirements and data"
5.2, "Why is it so important to prepare your data"
5.3, "Sentiment analysis by using TensorFlow on IBM PowerAI"
5.4, "Word suggestions by using long and short term memory on TensorFlow"
5.1 Knowing your requirements and data
When a department in a company decides to implement a cognitive solution by using machine learning (ML) capabilities, it is a common oversight to not think about the data they have: what kind of data, age of that data, governance, and sensitivity. Also, it is tempting for the technical team to start deciding about algorithms without taking time to appreciate what the data is.
Although it is possible to implement a cognitive system without understanding your data, the result might not be the most accurate or efficient. The first critical task is to understand and agree on the actual business needs and requirements. Employing design thinking sessions can assist with this part of the process. All parties that are involved in the project can highlight and discuss requirements, priorities, concerns, and expectations.
With the business requirements and expectations discussed and agreed upon, the next phase focuses on the data itself. There must be a clear understanding across the involved teams to appreciate the data that is used. People must have agreed expectations of what information the data provides.
This data analysis phase should be done during the discovery and viability of the project and not after committing to the project. Making decisions and dependencies based on assumptions of the data can be an inhibitor to a successful project.
An example of a situation where data can be a barrier for the project is a case where the system must be able to predict whether an operation is fraudulent, and the expectation is that suitable historic data exists. However, when the data is checked, the team finds that there are no connections between the historic records, making it impossible to identify whether a specific transaction is fraudulent. The data that they have is simple fields, such as amount, name, and address. There are no fraud tables registering past fraud or indications that characterize fraud.
If there is such a situation, it does not mean that the project can never happen, but it might require more phases or even a project on the existing systems to generate more detailed data and start registering fraud data in the expected format. It might be possible after some data generation to use that data to feed the ML algorithm.
After analysis of the data concludes that the source meets requirements and can be used to fulfill expectations, the next step is the ML algorithm definition. This chapter does not go into the details of which kind of ML algorithm is best for each case or which neural network architecture fits better for such cases. Those topics are described in 1.2, “Neural networks overview” on page 5.
Regarding neural networks, there is typically an architecture that is suited to each specific problem area. Testing with more than one option can help your decision-making process.
5.2 Why is it so important to prepare your data
After the technical team has a clear objective about what to predict or classify (verified data, data in one place, and the neural network architecture (or architectures) to use), it is time to prepare your data to feed your neural network.
To prepare the data, think of two phases: the first phase is where all the transformation and processing is applied to the data set to extract and load it into an accessible environment or proper database. This process is a classical extract, transform, and load (ETL) process that is required to isolate the data with which you want to work.
The second phase is about making the transformed data ready to be imported by the neural network. This process usually requires considerable processing, and it is normal to have some data expansion because you must represent your data in different formats. This section does not describe specific details because this phase's required approach depends on the format of the data (images, sound, text, or numbers) and the architecture of the selected neural network.
A neural network is more efficient when it uses numbers as input because it is a mathematical model. In this context, it is necessary to convert your data into numbers while keeping it all organized, structured, and in the format that the neural network architecture expects, which is usually a vector array.
For example, if your data was an image, you might convert it to grayscale, then extract its matrix and convert it into an array. This process must be done on all images so that each row of your data represents an image. In addition, it is necessary to create the labels so that the neural network can learn from the data. These labels must be converted into numbers and stored as a hot vector so that they can be fed into the neural network.
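As a rough illustration of that idea, the following sketch converts one image into a row of numbers and one label into a hot vector. The file name and the label set are hypothetical, and Pillow and NumPy are assumed to be available; it is a sketch of the concept, not the chapter's actual code.

import numpy as np
from PIL import Image

labels = ['cat', 'dog']                       # hypothetical label set

def image_to_row(path):
    img = Image.open(path).convert('L')       # convert the image to grayscale
    return np.asarray(img).flatten()          # matrix -> flat array (one row per image)

def label_to_hot_vector(label):
    hot = np.zeros(len(labels))
    hot[labels.index(label)] = 1              # only the label's position is set to 1
    return hot

row = image_to_row('example.jpg')             # hypothetical file name
target = label_to_hot_vector('dog')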
This approach is a high-level path of several options to handle the data and get it ready. The following sections show a couple of scenarios to prepare the data and create a neural network model with the data.
5.3 Sentiment analysis by using TensorFlow on IBM PowerAI
This section illustrates a feed-forward neural network (FFNN) model that identifies whether a sentence is negative or positive. The sample uses TensorFlow on IBM PowerAI to create and train the model.
5.3.1 Example data set
The data set that is used is from a Stanford project called Sentiment140. This is also where the data set can be downloaded from, and it contains 1.6 million rows with positive and negative sentences.
 
Note: At the time of writing, the project website mentioned that there can be negative, positive, and neutral sentences. After we analyzed the data set, we found only positive and negative sentences, so we used only those two options to create and train our model.
All the information that is related to the data set format and the code that was created was valid at the time of writing. If you attempt to work through the same example, you must make changes if the data set layout is later updated.
The data set contains two CSV files with six fields each, one for training data and a smaller one for testing. Both files are structured according to the following format (taken from the project website):
0 The polarity of the tweet (0 = negative, 2 = neutral, 4 = positive).
1 The ID of the tweet (2087).
2 The date of the tweet (Sat May 16 23:58:44 Coordinated Universal Time 2009).
3 The query (lyx). If there is no query, then this value is NO_QUERY.
4 The user that tweeted (robotickilldozr).
5 The text of the tweet (Lyx is cool).
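Putting those fields together, a raw row of the data set looks similar to the following line (assembled here from the example values in the list above; the polarity value shown is only illustrative):

"0","2087","Sat May 16 23:58:44 UTC 2009","NO_QUERY","robotickilldozr","Lyx is cool"

Because the tweet text itself can contain commas, the cleanup code later in this section relies on a regular expression rather than a simple split on commas.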
All the sentences in the data set were extracted from Twitter, and their sentiment was not identified by a human who later annotated the file with the correct label. Instead, code was used to find sentences that contain emoticons: sentences with positive emoticons are assumed to have positive sentiment, and sentences with negative emoticons are assumed to have negative sentiment. You can use your own sentence data set and label the sentences as needed. For more information about the data set and the project, see the Sentiment140 Project.
5.3.2 How the code is structured
The code is divided into three files: one prepares the data, the second creates the FFNN model, and the last uses the created and trained model. For more information, see Figure 5-1.
Figure 5-1 How the code is structured
Considering that you are dealing with 1.6 million rows of data, it is important to use persistence so that not all of the data is kept in memory, and so that in case of a failure there is no need to start the process from the beginning.
The following sections describe each one of the steps and show some snippets of code.
5.3.3 Data preparation
This phase ensures that the data is transformed and at the end of the flow is in the proper format that is expected by the FFNN.
Figure 5-2 shows the data preparation phase as a general view to describe the code details.
Figure 5-2 Data preparation flow
Cleaning the data set
The first thing that you must do is to remove the unnecessary columns, such as the Twitter ID, date, and others, and keep only the sentiment and the Twitter text. To do this task, use the clean_dataset function, which is shown in Example 5-1.
Example 5-1 Functions that are responsible for cleaning the original data set
import os
import re

import numpy as np
import pandas as pd


def shuffler(input_ds, output_ds):
    # Read the <SP>-separated temporary file, shuffle the rows, and write
    # the result by using the 'µ' character as the field separator.
    df_source = pd.read_csv(input_ds, '<SP>', error_bad_lines=False)
    df_shuffled = df_source.iloc[np.random.permutation(len(df_source))]
    df_shuffled.to_csv(output_ds, 'µ', index=False)


def clean_dataset(ds, ods):
    with open(ds, 'r', 30000, 'latin-1') as raw_ds:
        with open('tempds.csv', 'w', 20000) as cleaned_ds:
            for line in raw_ds:
                # Keep only the polarity (first field) and the tweet text
                # (last field) of each row.
                result = re.search(r'^"(\d)",.*,"(.*)"$', line)
                new_line = result.group(1) + '<SP>' + result.group(2) + '\n'
                cleaned_ds.write(new_line)
    shuffler('tempds.csv', ods)
    os.remove('tempds.csv')
    print("data set cleanup done")
The function expects a file as input and generates a new data set whose two fields are separated by a <SP> string to avoid character conflicts. Even though the data set is a CSV file, the tweet text field itself contains commas, so the split function alone does not work; a regular expression is used to extract what is needed.
The clean_dataset function goes through every line of the data set, applies the regular expression, and writes the new line into a temporary file. The last step is to shuffle the generated data set to ensure that the neural network is trained properly. For this example, we use the Pandas library and then write to the output file that is provided in the ods parameter.
This approach avoids keeping too much data in memory, especially for large data sets. A buffer of 30000 bytes is defined, and the file is read in chunks, processed, and written to the target file. For the shuffle process, the Pandas library is used to handle the large file.
Generating the dictionary of words
With the cleaned data set generated, the next step is to create the dictionary of words (known as a lexicon). In this example, we use the create_word_dict function, as shown in Example 5-2.
Example 5-2 Code that is used to create the dictionary of words
import pickle
from collections import Counter

from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

# Module-level lemmatizer used by the examples in this section (assumed to be
# NLTK's WordNetLemmatizer, as described in the text).
lemm = WordNetLemmatizer()


def create_word_dict(source_ds):
    word_dict = []
    with open(source_ds, 'r', 30000, 'latin-1') as ds:
        for line in ds:
            text = line.split('µ')[1]
            words = word_tokenize(text.lower())
            lemm_words = [lemm.lemmatize(w) for w in words]
            word_dict += list(lemm_words)
    word_count = Counter(word_dict)
    # Keep only the words that are neither too frequent nor too rare.
    cleaned_word_dict = [word for word in word_count
                         if 1000 > word_count[word] > 60]
    dict_size = len(cleaned_word_dict)

    print("Word dictionary size: {}".format(dict_size))
    with open('word_dict.pickle', 'wb') as wd:
        pickle.dump(cleaned_word_dict, wd)

    print("Word dictionary generated and saved")
    return dict_size
In this phase, we go sentence by sentence, extracting all the words by using the word_tokenize function from the NLTK library, passing each word through stemming or lemmatization (lemmatization in this example), removing the words that are not needed, and then generating a new list with one occurrence of each word. This list is necessary to generate the sentence vector for each sentence in the next step of the flow.
 
Stemming and lemmatization: Both processes focus on word standardization, transforming words into a common base form. Our example neural network expects numeric values as input, and each different word increases the size of the generated vector by one. For example, the words focus, focuses, focused, and focusing all have the same general meaning, and the way each is used is not relevant here.
Each of the processes (stemming and lemmatization) achieves this task in a different way. Stemming usually works by chopping off the end of the word to reach a common base word. Lemmatization uses a more robust set of rules that parse a word by using vocabulary and morphological analysis to produce a common base word, which is usually the root of that word.
Which process to use depends on your needs and your processing power. Stemming is less processor-intensive, but generates less accurate results. In contrast, lemmatization requires more processing power, but provides more accurate results. A small sketch after this note contrasts the two approaches.
For more information, see Stanford’s Stemming and Lemmatization.
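The following sketch contrasts the two approaches on the example words above. It assumes that the NLTK WordNet data is downloaded (for example, with nltk.download('wordnet')); the exact outputs depend on the NLTK version.

from nltk.stem import PorterStemmer, WordNetLemmatizer

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

for w in ['focus', 'focuses', 'focused', 'focusing']:
    # Stemming chops the suffix; lemmatization uses vocabulary rules
    # (here treating each word as a verb).
    print(w, '->', stemmer.stem(w), '/', lemmatizer.lemmatize(w, pos='v'))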
The create_word_dict function starts by reading the CSV file that is generated in 5.3.3, “Data preparation” on page 125 and splits the lines to get only the sentences. The sentences go through the word_tokenize function from the NLTK library, generating a list of words to which the lemmatize function is applied.
With all the words in their standard form, count their occurrences in the data set to remove the ones that occur too often (usually words such as a, which have no useful meaning in sentiment analysis) and the ones with few occurrences, which make no difference to the model. In this example, we use the Counter function from the collections library, which generates a dictionary with each word as a key and its respective count as a value. Then, we save a list with one occurrence of each word as our dictionary of words into the word_dict.pickle file.
Preparing the sentences
The last phase of the process is to read each sentence and transform it into a numeric vector. In this process, each sentence becomes a vector of zeros with the same size as the dictionary of words. The vector holds the count of each dictionary word in the sentence, stored in the same index position that the word occupies in the dictionary of words vector.
For example, assume that the data set contains two sentences:
I want to drink water with you.
They like to drink coffee with me at home.
From the data set, we create a dictionary of words (for simplicity, no processing is applied and no words are excluded). The result can be the following list, assuming that you start from the first sentence:
[I, want, to, drink, water, with, you, they, like, coffee, me, at, home]
Next, transform the first sentence into a hot vector. Create a list (or vector) with the same size as the word dictionary, in this case a list of size 13 filled with zeros:
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
Then, iterate through the sentence word by word and check whether it exists in the dictionary of words. If so, add 1 to the new sentence vector in the same index position it is in the dictionary list.
The word I exists in the dictionary at index 0, so add 1 to the sentence vector in position 0:
[1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
Continue repeating the process until you reach the end of the sentence, and then process the second sentence. By the end of the process, you have the following two vectors:
Sentence 1: [1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
Sentence 2: [0, 0, 1, 1, 0, 1, 0, 1, 1, 1, 1, 1, 1]
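The same walkthrough can be written as a short sketch. The words are lower-cased here so that I matches its dictionary entry; the dictionary and the two sentences are the ones listed above.

import numpy as np

word_dict = ['i', 'want', 'to', 'drink', 'water', 'with', 'you',
             'they', 'like', 'coffee', 'me', 'at', 'home']

def to_hot_vector(sentence):
    vector = np.zeros(len(word_dict))
    for word in sentence.lower().rstrip('.').split():
        if word in word_dict:
            # Add 1 in the same index position the word has in the dictionary.
            vector[word_dict.index(word)] += 1
    return vector

print(to_hot_vector('I want to drink water with you.'))
print(to_hot_vector('They like to drink coffee with me at home.'))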
The code reads the sentence training data set and the word dictionary file as input, takes the sentences and passes them through the word tokenization function, lemmatizes the words, and starts creating the vector. The code also gets sentiment labels and transforms them into a list of size two because there are positive and negative sentiments. The sentence vector and sentiment vector are put into a list and saved into a binary file to be read by the model in the following step.
Example 5-3 shows the sentence_to_vector function.
Example 5-3 Code that is used to transform sentences into vectors
import pickle
import re

import numpy as np
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

lemm = WordNetLemmatizer()  # assumed module-level lemmatizer, as in Example 5-2


def sentence_to_vector(word_dict_file, cleaned_ds, output_file):
    with open(cleaned_ds, 'r', 30000, 'latin-1') as ds:
        with open(word_dict_file, 'rb') as wd:
            word_dict = pickle.load(wd)
        num_lines = 0
        with open(output_file, 'wb') as hv:
            for line in ds:
                hot_vector = np.zeros(len(word_dict))
                if line.count('µ') == 1:
                    sentiment, text = line.split('µ')
                    words = word_tokenize(text.lower())
                    lemm_words = [lemm.lemmatize(w) for w in words]
                    for word in lemm_words:
                        if word in word_dict:
                            hot_vector[word_dict.index(word)] += 1
                    hot_vector = list(hot_vector)
                    clean_sentiment = re.search(r'.*(\d).*', sentiment)

                    # Polarity 0 (negative) becomes [1, 0];
                    # anything else (positive) becomes [0, 1].
                    if int(clean_sentiment.group(1)) == 0:
                        sentiment = [1, 0]
                    else:
                        sentiment = [0, 1]

                    num_lines += 1
                    pickle.dump([hot_vector, sentiment], hv)
    print('Hot vectors file generated with {} lines'.format(num_lines))
    return num_lines
Some considerations
After finishing the training data set process, the test data set goes through the same process so that it has the same format and can be used to check the model’s accuracy. As shown in Example 5-3 on page 128, the test data set must also use the same dictionary of words that is generated by the training data set.
Regarding the data set size, this scenario uses Python and some libraries to do the data handling job and some strategies, such as using files to manage a larger data set in a more efficient way. If you must scale and handle larger data sets, think about using a distributed framework over a distributed architecture to work on all the data to get it ready for a neural network. You can implement a distributed solution by using IBM Power Systems servers with IBM Spectrum Scale to deliver a reliable and scalable environment. This scenario does not go into these details because they are beyond the scope of this book. However, this book describes how TensorFlow can take advantage of distributed processing by using several graphical processing units (GPUs).
5.3.4 Model creation
Now, we must build the example TensorFlow-based model. In the data preparation phase, we saved some information: the size of the dictionary of words and how many lines there are in the generated vector files. That information is used in this phase. Example 5-4 shows the function that is responsible for creating the neural network architecture.
Example 5-4 Creating a neural network architecture
import tensorflow as tf

# line_sizes is populated during the data preparation phase;
# line_sizes['dict'] holds the size of the dictionary of words
# (the length of each sentence vector).

def ff_neural_net(input_data):
    neurons_hl1 = 1500
    neurons_hl2 = 1500
    neurons_hl3 = 1500

    output_neurons = 2

    l1_weight = tf.Variable(tf.random_normal([line_sizes['dict'], neurons_hl1]), name='w1')
    l1_bias = tf.Variable(tf.random_normal([neurons_hl1]), name='b1')

    l2_weight = tf.Variable(tf.random_normal([neurons_hl1, neurons_hl2]), name='w2')
    l2_bias = tf.Variable(tf.random_normal([neurons_hl2]), name='b2')

    l3_weight = tf.Variable(tf.random_normal([neurons_hl2, neurons_hl3]), name='w3')
    l3_bias = tf.Variable(tf.random_normal([neurons_hl3]), name='b3')

    output_weight = tf.Variable(tf.random_normal([neurons_hl3, output_neurons]), name='wo')
    output_bias = tf.Variable(tf.random_normal([output_neurons]), name='bo')

    l1 = tf.add(tf.matmul(input_data, l1_weight), l1_bias)
    l1 = tf.nn.relu(l1)

    l2 = tf.add(tf.matmul(l1, l2_weight), l2_bias)
    l2 = tf.nn.relu(l2)

    l3 = tf.add(tf.matmul(l2, l3_weight), l3_bias)
    l3 = tf.nn.relu(l3)

    output = tf.matmul(l3, output_weight) + output_bias

    return output
This section does not provide details about TensorFlow because this framework was introduced in “TensorFlow” on page 19.
The ff_neural_net function receives the data as input (a TensorFlow placeholder) and then starts defining some characteristics of the neural network: the number of neurons and layers for the architecture.
When talking about how many hidden layers and how many neurons each layer must have, the answer is never an exact one, and most of the time it is defined empirically through testing with different options. This case uses an FFNN with three hidden layers of 1500 neurons each and an output layer with two neurons because there are two values to classify (positive and negative).
Figure 5-3 shows the neural network for this scenario.
Figure 5-3 Architecture of the feed forward neural network being used
Define the weights and biases for each of the layers by creating TensorFlow variables and specifying their sizes. Each variable is a matrix. The first matrix has the shape of the sentence vector size by the number of neurons in the first layer. The second layer has a matrix with the shape of the first layer size by the second layer size, and the same pattern applies to the third layer; the output layer matrix has the shape of the third layer size by the output layer size. The column size of the matrix in one layer must be the same as the row size of the matrix in the next layer because the neural network performs several matrix multiplication operations.
This is all that you must do to create the feed-forward architecture. TensorFlow knows to adjust the variables that are created in its graph through the training iterations.
Example 5-5 shows the function that performs all the training of the model. The training function starts by calling the neural network model and creating a saver object so that the graph and its variables can be saved later.
Example 5-5 Code that is used to optimize and train the model
import pickle

import tensorflow as tf

# Defined elsewhere in the module: y (the label placeholder), num_epochs,
# batch_size, and line_sizes (the row counts and dictionary size that are
# saved during data preparation).

def training(in_placeholder):
    nn_output = ff_neural_net(in_placeholder)
    saver = tf.train.Saver()
    # We are using cross entropy to calculate the cost
    cost = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(logits=nn_output, labels=y))

    optimizer = tf.train.AdamOptimizer(learning_rate=0.001).minimize(cost)

    with tf.Session() as sess:
        sess.run(tf.global_variables_initializer())
        for epoch in range(num_epochs):
            epoch_loss = 0
            buffer_train = []
            buffer_label = []
            with open('train_hot_vectors.pickle', 'rb') as train_hot_vec:
                for i in range(line_sizes['train']):
                    hot_vector_line = pickle.load(train_hot_vec)
                    buffer_train.append(hot_vector_line[0])
                    buffer_label.append(hot_vector_line[1])

                    if len(buffer_train) >= batch_size:
                        _, cost_iter = sess.run([optimizer, cost],
                                                feed_dict={in_placeholder: buffer_train,
                                                           y: buffer_label})
                        epoch_loss += cost_iter
                        buffer_train = []
                        buffer_label = []
            print('Epoch {} completed. Total loss: {}'.format(epoch + 1, epoch_loss))

        correct = tf.equal(tf.argmax(nn_output, 1), tf.argmax(y, 1))
        accuracy = tf.reduce_mean(tf.cast(correct, 'float'))

        with open('test_hot_vectors.pickle', 'rb') as test_hot_vec:
            buffer_test = []
            buffer_test_label = []
            for i in range(line_sizes['test']):
                test_hot_vector_line = pickle.load(test_hot_vec)
                buffer_test.append(test_hot_vector_line[0])
                buffer_test_label.append(test_hot_vector_line[1])

        # the accuracy is the percentage of hits
        print('Accuracy using test data set: {}'.format(
            accuracy.eval({in_placeholder: buffer_test, y: buffer_test_label})))
        saver.save(sess, "model.ckpt")
Calculate the cost or error that is generated between the model output and the correct labels. To reduce this cost (error), an optimizer called Adam is applied, which is a stochastic gradient descent (SGD) approach that updates all the weights and biases in the neural network by using a back-propagation algorithm.
So far, this process just builds the TensorFlow objects and links them to each other. Now, the process creates a TensorFlow session where the graph is run and data is imported. Before the session.run function is called, the process reads the data from the vectorized sentences, and every time the temporary buffer reaches the batch size (which is also defined by us), the code calls TensorFlow and runs the optimizer, which then runs the whole model.
We also have a buffer list for the label because we must provide the sentence with its respective label so that the network is trained correctly. We also run the cost function to know the total cost that is generated for that specific epoch.
The training phase repeats for the number of times that is defined in the epochs variable. The code uses 10, although you can change it and check whether you achieve better results. Basically, the training data set is submitted in batches to the TensorFlow model, the error is calculated, and then the weights are adjusted by the Adam optimizer. The process repeats once per epoch.
After the training loop is done, add the correct tensor to the graph. It uses the tf.argmax function to get the index position with the maximum value from both the network output and the correct label; the tf.equal function yields 1 if both are the same or 0 if they do not match. Using the correct tensor, the accuracy is calculated as the mean of the correct tensor.
Now, we read the vectorized test data set file through the same batch process, this time without the epochs. We submit the testing data to be evaluated against our model by using the eval function and passing the data set and its labels to the placeholders. When completed, the model is saved by the saver.save function.
5.3.5 Using the model
With the trained model saved and ready to use, a function that is called get_sentiment is created, which receives a sentence as input and prints whether it is positive or negative (Example 5-6).
Example 5-6 Function that is used to use the model
import pickle

import numpy as np
import tensorflow as tf
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

lemm = WordNetLemmatizer()  # assumed module-level lemmatizer, as in Example 5-2


def get_sentiment(input_data):
    tf.reset_default_graph()
    pl = tf.placeholder('float')
    nn_output = ff_neural_net(pl)
    saver = tf.train.Saver()
    with open('word_dict.pickle', 'rb') as f:
        word_dict = pickle.load(f)

    with tf.Session() as sess:
        saver.restore(sess, "model.ckpt")
        words = word_tokenize(input_data.lower())
        lemm_words = [lemm.lemmatize(w) for w in words]
        hot_vector = np.zeros(len(word_dict))

        for word in lemm_words:
            if word.lower() in word_dict:
                index_value = word_dict.index(word.lower())
                hot_vector[index_value] += 1

        hot_vector = np.array(list(hot_vector))
        result = (sess.run(tf.argmax(nn_output.eval(
            feed_dict={pl: [hot_vector]}), 1)))
        if result[0] == 0:
            print('Negative:', input_data)
        elif result[0] == 1:
            print('Positive:', input_data)
The function starts by defining the placeholder that receives our sentence, building the model graph with ff_neural_net, and creating a saver object to load our trained model later. We also open the dictionary of words file.
We then create a TensorFlow session and restore the saved model. Restoring is basically reading the values from the variables that are saved in the model.ckpt file and assigning them to the variables that are created when running the model with the ff_neural_net function.
Using sess.run, we run our model by passing hot_vector and then applying the tf.argmax function to get the neural network answer. We know that [1, 0] is negative and [0, 1] is positive, so if argmax returns 0, the index position 0 is the largest and the sentence is negative; if it returns 1, the sentence is positive. An if statement makes the result human-readable.
5.3.6 Running the code
To run the code by using IBM PowerAI, source the framework that you want to use, which loads the framework environment with the required libraries. To do so, go to /opt/DL, where the frameworks are installed, and select the one that you want to use, in our case tensorflow. From the /opt/DL/tensorflow/bin directory, run source tensorflow-activate.
To train your model, you must run the training function and pass a TensorFlow placeholder to it. To use your model, you must run the get_sentiment function and provide the sentence that you want to check.
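A minimal sketch of that invocation follows, assuming the functions shown earlier in this section are in the same module, that the data preparation step already ran, and that the module-level objects (y, num_epochs, batch_size, and line_sizes) are defined. The complete code on GitHub ties these pieces together.

import tensorflow as tf

# Train: build the input placeholder and run the training function,
# which saves the trained graph to model.ckpt.
in_placeholder = tf.placeholder('float')
training(in_placeholder)

# Use: classify a new sentence with the saved model (example sentence).
get_sentiment('I really enjoyed this movie')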
While you train your model, you can look at how your GPUs are being used by running the nvidia-smi command on your terminal window.
If you want to do some testing or want to make changes to the model, you can experiment with a smaller data set. For example, we created a function that is called smaller_dataset_gen that can be used to randomly extract rows out of the large data set and create a smaller one. All that you need to do is run the function and provide the original data set, how many rows it has (1600000 rows if you are using the Sentiment140 one), the name of the target data set, and how many rows you want it to have.
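The actual helper is part of the code on GitHub; a rough sketch of how such a function could work (the buffer sizes and the sampling approach here are assumptions) follows:

import random

def smaller_dataset_gen(source_ds, source_rows, target_ds, target_rows):
    # Randomly choose which row numbers to keep, then copy only those rows.
    keep = set(random.sample(range(source_rows), target_rows))
    with open(source_ds, 'r', 30000, 'latin-1') as src:
        with open(target_ds, 'w', 30000) as dst:
            for i, line in enumerate(src):
                if i in keep:
                    dst.write(line)

# For example, with the Sentiment140 training file:
# smaller_dataset_gen('training.csv', 1600000, 'small_training.csv', 100000)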
The complete code is available at GitHub. For more information, see Appendix A, “Sentiment analysis code” on page 241.
5.4 Word suggestions by using long and short term memory on TensorFlow
Making word suggestions to speed up a user's typing, or even generating text that is based on some existing text, is important in many applications where cognitive systems are becoming the rule and not the exception. This section describes a word suggestion implementation by using a long and short term memory (LSTM) recurrent neural network (RNN) so that you can understand how it works with TensorFlow and take advantage of IBM PowerAI.
This is a straightforward implementation focusing on clarity so that you can understand the details and build from it. Think of this implementation as a head start.
Our example is all about training our model into a data set of phrases; when using it, we can start typing words to create a phrase and receive predictive suggestions from our model on what the next word can be so we do not have to type it all.
To do so, we use an RNN because the order or sequence that the words are in matters for making a good suggestion. We implement a specific kind of RNN called LSTM because a simple RNN cannot keep data from old iterations; it remembers only recent data. An LSTM remembers what is important and forgets what is not. For more information, see 1.2.3, “Types of neural networks and their usage” on page 8.
5.4.1 Our data set
We are using a simple and small data set with about 100 created questions that simulate users asking how to do something within their company system, such as how to reset their password, how to get access to the internet, and where to find some information.
The data is already in the correct format, but in a production environment you must obtain this data from databases and apply transformations to convert them to the required format.
5.4.2 Overall structure of the code
The overall structure of the code is about the same as the one that is used by the sentiment analysis in 5.3, “Sentiment analysis by using TensorFlow on IBM PowerAI” on page 123, so the same diagram is used (Figure 5-1 on page 124). The diagram represents how our code is structured.
There is a data preparation phase to create our dictionaries and put the phrases into the correct Python objects so that they are ready to be used by the neural network. Then, in the second phase we define our LSTM neural network model, train it, and save it so that we can get to the third phase where the model can be used.
5.4.3 Data preparation
In this phase, we create our dictionary of words so that we know each different word that occurs in our data set. We are not excluding any words because every word is important for the system.
In this phase, we transform each phrase from the data set into a list of words and save it into a pickle file so that we can process even larger data sets without having to keep all of them in memory.
Example 5-7 gives you a sample of the code.
Example 5-7 The create_word_list function
import pickle

from nltk.tokenize import word_tokenize


def create_word_list(source_ds):
    words_list = []
    num_lines = 0
    count = 0
    with open(source_ds, 'r', 20000, 'latin-1') as ds:
        with open('ds_to_list.pickle', 'wb', 10000) as ds_list:
            for line in ds:
                words = word_tokenize(line.lower())
                words_list += list(words)
                pickle.dump(words, ds_list)
                num_lines += 1

    word_count = set(words_list)
    word_list_final = {}
    for i in word_count:
        word_list_final[i] = count
        count += 1
    # Reverse dictionary: number -> word, used to map the network output
    # back to words.
    rev_word_dict = dict(zip(word_list_final.values(),
                             word_list_final.keys()))

    list_size = len(word_list_final)

    print("Word dictionary size: {}".format(list_size))
    with open('word_dict.pickle', 'wb') as wd:
        pickle.dump(word_list_final, wd)

    with open('rev_word_dict.pickle', 'wb') as wd:
        pickle.dump(rev_word_dict, wd)

    print("Word dictionary generated and saved")
    return list_size, num_lines
The create_word_list function receives the data set and goes through it line by line. For each line, the function transforms it into a list of words by using the word_tokenize function from the NLTK library, saves it into a file by using pickle, and adds it to a new list that is called words_list, whose unique words are then kept by using a set.
We create a dictionary and give each word a number from 0 to the size of the word set minus 1. Those numbers are used to feed the neural network. We also create a reverse word dictionary to make it easier to map back from a word's number to the word itself. Basically, a new dictionary is created by using the first dictionary's values as keys and its keys as values.
After this process, we write both dictionaries to files so they can be used in different modules. The word dictionary size and the original data set size are returned by the function so that they can be used for loop control in different modules.
5.4.4 Model creation
After the dictionary of words is created with its respective number for each word, and the data set is in the format of one list per phrase, with each item of the list being one word of the phrase, we create our model by using the LSTM neural network. The lstm_rnn function is where this definition takes place, as shown in Example 5-8.
Example 5-8 The lstm_rnn function
import tensorflow as tf
from tensorflow.contrib import rnn

# Defined elsewhere in the module: num_neurons, interval_size, and
# files_details (files_details['list'] is the size of the word dictionary).

def lstm_rnn(tf_placeholder):

    output_weight = tf.Variable(tf.random_normal([num_neurons,
                                                  files_details['list']]))
    output_biases = tf.Variable(tf.random_normal([files_details['list']]))

    tf_placeholder = tf.reshape(tf_placeholder, [-1, interval_size])

    tf_placeholder = tf.split(tf_placeholder, interval_size, 1)

    # Two stacked hidden layers of LSTM cells.
    rnn_layers = rnn.MultiRNNCell([rnn.BasicLSTMCell(num_neurons),
                                   rnn.BasicLSTMCell(num_neurons)])

    outputs, states = rnn.static_rnn(rnn_layers, tf_placeholder, dtype=tf.float32)

    return tf.matmul(outputs[-1], output_weight) + output_biases
Complete the following steps:
1. Define the weights and biases for the output layer of the neural network. The weights are a matrix with all its weights randomly initialized, and the biases are an array that also is randomly initialized. They both are TensorFlow variables, so the framework knows that they must be updated and adjusted during the training phase.
2. Create a TensorFlow placeholder to receive the data set (in number format) and reshape it to conform to the format that is expected by the rnn.static_rnn function.
3. TensorFlow sets up the LSTM neural network by using ready to use functions.
4. Use the rnn.MultiRNNCell function to create an RNN with more than one hidden layer. This function receives the neurons for each layer, which in this case are LSTM neurons. The BasicLSTMCell function with the number of neurons must be used. The static_rnn function binds the whole model together with the placeholder to create the neural network. TensorFlow helps define a complete structure of the neural network.
5. Perform the output layer operations, which multiply the second hidden layer's output by the output weights and add the biases. The last position of the outputs variable is used because we are interested in only the last output from the RNN. An RNN generates one output after each iteration, so it generates an array of outputs.
5.4.5 Training
The training phase is similar to the phase that is applied to the sentiment analysis in 5.3, “Sentiment analysis by using TensorFlow on IBM PowerAI” on page 123. The difference is how and what is provided to the neural network. Example 5-9 goes through the code and shows its details.
Example 5-9 The training function
import pickle

import numpy as np
import tensorflow as tf

# Defined elsewhere in the module: y (the label placeholder), epochs,
# interval_size, and files_details (the data set row count and the word
# dictionary size that are returned by the data preparation phase).

def training(in_placeholder):
    rnn_output = lstm_rnn(in_placeholder)
    saver = tf.train.Saver()

    cost = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(
        logits=rnn_output, labels=y))
    optimizer = tf.train.RMSPropOptimizer(learning_rate=0.001).minimize(cost)

    correct = tf.equal(tf.argmax(rnn_output, 1), tf.argmax(y, 1))
    accuracy = tf.reduce_mean(tf.cast(correct, tf.float32))

    init = tf.global_variables_initializer()

    with tf.Session() as session:
        session.run(init)
        step = 0
        start = 0
        end = interval_size
        acc_total = 0
        loss_total = 0
        counter = 0

        with open('word_dict.pickle', 'rb') as wd:
            word_dict = pickle.load(wd)
        while step < epochs:
            with open('ds_to_list.pickle', 'rb', 20000) as ds:
                for _ in range(files_details['data set']):
                    line_list = pickle.load(ds)
                    line_size = len(line_list)
                    while end < line_size:
                        # The words in positions start..end-1 are the input
                        # sequence; the word in position end is the label.
                        sequence = [[word_dict[i]] for i in
                                    line_list[start:end]]
                        sequence = np.array(sequence)
                        sequence = np.reshape(sequence,
                                              [-1, interval_size, 1])
                        label = line_list[end]
                        label_hot_vector = np.zeros([files_details['list']])
                        label_hot_vector[word_dict[label]] = 1.0
                        label_hot_vector = np.reshape(label_hot_vector,
                                                      [1, -1])

                        start += 1
                        end += 1

                        _, acc, loss, rnn_predictions = session.run(
                            [optimizer, accuracy, cost, rnn_output],
                            feed_dict={in_placeholder: sequence,
                                       y: label_hot_vector})
                        counter += 1
                        acc_total += acc
                        loss_total += loss

                    # Reset the sliding window before the next phrase.
                    start = 0
                    end = interval_size

            print('{}. Loss: {:.4f} and Accuracy: {:.2f}%'.format(
                step + 1, loss_total / counter,
                (100 * acc_total) / counter))
            acc_total = 0
            loss_total = 0
            counter = 0

            step += 1
        saver.save(session, 'model.ckpt')
    print('Training completed')
1. The training function begins by concluding the TensorFlow graph creation. It first calculates the error, which is the difference of the correct label and what the model outputs. The error is calculated by using the softmax_cross_entropy_with_logits function. This generated error is then optimized by the RMSPropOptimizer function. RMSProp is a gradient descent optimization algorithm that is used to reduce the error and adjust the neural network’s weights and biases. You can exchange this algorithm for any other to try it out. In our case, using RMSProp helped us reach high accuracy rates.
2. Still building the TensorFlow graph, we define the correct variable that receives an array (in this case, size 1) with value 1 if the network output is the same as the one that is provided by the label (y placeholder) or 0 if they are different. As with the previous step, the mean is calculated and assigned to the accuracy variable to be used to calculate the general accuracy later during the graph execution phase.
3. We then create a TensorFlow session to start the training process. The dictionary of words file is opened and we start looping through the epochs. The epochs value is how many times we want to iterate through the whole data set to train our model. There are different ways to define that value, from establishing a specific number of times to checking the accuracy and running until the accuracy reaches a minimal variation rate.
We are using a specific number of epochs, 60000, which can reach more than 90% accuracy. Of course, increasing the number of epochs and providing a better data set can improve the accuracy even more.
Within the epochs loop, we start going through our data set, which is already in a list format, and for each phrase the code picks the first three words as the training data, converts them to the word's respective number, and reshapes it to conform to the neural network input. We also provide the label, which is the fourth word.
4. For the label, we transform the word into a hot vector with the same size as the dictionary of words. We use a hot vector to check which position has the biggest value, which is the strongest suggestion, and the rest of the successive suggestions.
The process to create the hot vector is straightforward. After we have the array filled with zeros, it is necessary to add only one to the index position referring to the respective number of the word that is found in the dictionary of words (word_dict).
5. With the training data and label prepared, they are fed into the neural network for training. After all the epoch iterations are processed, the trained model is saved by using the save method from tf.train.Saver.
5.4.6 Using the model
To use the trained network model, create the get_word function, as shown in Example 5-10. This function provides a CLI experience that prompts the user to type at least three words and, based on that input, calls the neural network model, gets its return, and presents the three best word options so that the user can auto-complete the phrase being written.
Example 5-10 The get_word function
import pickle

import numpy as np
import tensorflow as tf
from nltk.tokenize import word_tokenize

# Defined elsewhere in the module: interval_size and the get_top helper,
# which returns the index positions of the n largest prediction values.

def get_word():
    with open('word_dict.pickle', 'rb') as wd:
        word_dict = pickle.load(wd)
    with open('rev_word_dict.pickle', 'rb') as rwd:
        rev_word_dict = pickle.load(rwd)
    in_placeholder = tf.placeholder('float', [None, interval_size, 1])
    rnn_output = lstm_rnn(in_placeholder)
    saver = tf.train.Saver()
    with tf.Session() as sess:
        sess.run(tf.global_variables_initializer())
        saver.restore(sess, 'model.ckpt')
        phrase_check = 'invalid'
        while phrase_check == 'invalid':
            prompt = 'Type at least {} words: '.format(interval_size)
            phrase = input(prompt)
            phrase = phrase.strip()
            words = word_tokenize(phrase.lower())
            if len(words) < interval_size:
                print('We need at least {} words!'.format(interval_size))
            else:
                # Keep only the last interval_size words as the input window.
                words = words[-interval_size:]
                phrase_check = 'valid'
        next_word = '-'
        word_dict_size = len(word_dict)
        phrase_to_num = []
        for i in words:
            if i in word_dict.keys():
                phrase_to_num.append(word_dict[i])
            else:
                # Unknown words get a new number outside the dictionary.
                phrase_to_num.append(word_dict_size)
                word_dict_size += 1

        while next_word != '4':
            phrase_reshape = np.reshape(np.array(phrase_to_num),
                                        [-1, interval_size, 1])
            rnn_predictions = sess.run(rnn_output,
                                       feed_dict={in_placeholder:
                                                  phrase_reshape})
            answers = get_top(3, list(rnn_predictions[0]))
            print('Suggestions are:')
            num = 1
            for j in answers:
                print('{}: {}'.format(num, rev_word_dict[j]))
                num += 1
            print('{}: Finish phrase '.format(num))
            prompt = 'Select the number or type a word: '
            next_word = input(prompt)

            if next_word in ['1', '2', '3']:
                phrase = phrase + ' ' + rev_word_dict[answers[int(next_word) - 1]]
                print('Your phrase so far: {}'.format(phrase))
                phrase_to_num = phrase_to_num[1:]
                phrase_to_num.append(answers[int(next_word) - 1])
            elif next_word == '4':
                print('Final phrase: {}'.format(phrase))
            else:
                phrase = phrase + ' ' + next_word
                print('Your phrase so far: {}'.format(phrase))
                phrase_to_num = phrase_to_num[1:]
                if next_word in word_dict.keys():
                    phrase_to_num.append(word_dict[next_word])
                else:
                    phrase_to_num.append(word_dict_size)
                    word_dict_size += 1
The function begins by loading the dictionary of words files, building the model graph, starting a TensorFlow session, and restoring the trained model (weights and biases).
Considering that most of the code here is related to the user experience and rules validating the typed information, this section does not go through the details, although it is important to point out how the network receives the input.
We chose to use three words to check against the network, so even if there are more than three words in the phrase, only the last three are fed into the network. We get a vector of prediction values as the return, order it by using the get_top function, pick the three highest values and their index positions, match those positions against the reverse word dictionary to get the actual words, and send the output to the user. This process continues until the user finishes the phrase by selecting option 4.
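The get_top helper is not shown in Example 5-10. A minimal sketch of what it might look like, assuming that it returns the index positions of the n largest prediction values ordered from highest to lowest, follows:

import numpy as np

def get_top(n, predictions):
    # Index positions of the n largest values, highest first.
    return list(np.argsort(predictions)[::-1][:n])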
5.4.7 Running the code
The steps to run this code are the same as the ones that are used for the sentiment analysis code. With IBM PowerAI, you can load the TensorFlow environment by running source tensorflow-activate from the /opt/DL/tensorflow/bin directory, and then run your code as usual by calling python <your .py file>.
For our case, get_word.py is the starting point. To train the model, uncomment the lines that are shown in Example 5-11 and run python get_word.py.
Example 5-11 Lines of code to be uncommented for training
x = tf.placeholder('float', [None, interval_size, 1])
training(x)
After the model is trained, comment those two lines again and use only the get_word function.
The complete code is found on GitHub. For more information, see Appendix A, “Sentiment analysis code” on page 241.
5.4.8 Final considerations
The word suggestion code can be improved and implemented in different ways. In this section, we transformed each word into a single number, although if more word context is required, each word can be represented as a Word2Vec embedding, which is a richer representation. However, it requires a model to convert the words and more processing capacity.
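For example, a minimal sketch of training word vectors with the gensim library (an assumption; any Word2Vec implementation can be used, and gensim 4 or later is assumed because earlier versions use size instead of vector_size) could look like this:

from gensim.models import Word2Vec

# token_lists is the data set as one list of words per phrase, the same
# structure that is saved to ds_to_list.pickle (two hypothetical phrases here).
token_lists = [['how', 'do', 'i', 'reset', 'my', 'password'],
               ['how', 'do', 'i', 'get', 'access', 'to', 'the', 'internet']]

model = Word2Vec(sentences=token_lists, vector_size=50, window=5,
                 min_count=1, workers=1)
vector = model.wv['password']   # dense vector that replaces the word's number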
Although the sample scenario does not implement retraining, it can be an interesting exercise. If a user types in a phrase where more than 70% of the words are not in the dictionary of words, the phrase can be inserted into the data set and the model retrained.
The hot vector representation of the label works, but as the dictionary of words grows, the vector grows with it and requires more processing.
Other parameters can be changed to increase the model accuracy, and a test data set is a good way to check the accuracy without using only the data in the training set.