This section is included to help you complete the activities in the book. It provides the detailed steps you need to perform in order to achieve each activity's objectives.
Solution
Let's perform various pre-processing tasks on the Bank Marketing Subscription dataset. We'll also be splitting the dataset into training and testing data. Follow these steps to complete this activity:
import pandas as pd
# pandas needs the raw CSV file, not the GitHub HTML page, so we use the raw URL
Link = 'https://raw.githubusercontent.com/TrainingByPackt/Data-Science-with-Python/master/Chapter01/Data/Banking_Marketing.csv'
# Read the data into the DataFrame df
df = pd.read_csv(Link, header=0)
#Finding number of rows and columns
print("Number of rows and columns : ",df.shape)
The preceding code generates the following output:
#Printing all the columns
print(list(df.columns))
The preceding code generates the following output:
#Basic Statistics of each column
df.describe().transpose()
The preceding code generates the following output:
#Basic Information of each column
print(df.info())
The preceding code generates the following output:
In the preceding output, you can see the data type of each column along with the number of non-null values it contains.
#finding the data types of each column and checking for null
null_ = df.isna().any()
dtypes = df.dtypes
sum_na_ = df.isna().sum()
info = pd.concat([null_,sum_na_,dtypes],axis = 1,keys = ['isNullExist','NullSum','type'])
info
Have a look at the output for this in the following figure:
#removing Null values
df = df.dropna()
#Total number of null in each column
print(df.isna().sum())# No NA
Have a look at the output for this in the following figure:
df.education.value_counts()
Have a look at the output for this in the following figure:
df.education.unique()
The output is as follows:
df.education.replace({"basic.9y":"Basic","basic.6y":"Basic","basic.4y":"Basic"},inplace=True)
df.education.unique()
In the preceding figure, you can see that basic.9y, basic.6y, and basic.4y are grouped together as Basic.
#Select all the non-numeric columns using the select_dtypes function
import numpy as np
data_column_category = df.select_dtypes(exclude=[np.number]).columns
The preceding code generates the following output:
cat_vars=data_column_category
for var in cat_vars:
    cat_list = pd.get_dummies(df[var], prefix=var)
    data1 = df.join(cat_list)
    df = data1
df.columns
The preceding code generates the following output:
#Categorical features
cat_vars=data_column_category
#All features
data_vars=df.columns.values.tolist()
#neglecting the categorical column for which we have done encoding
to_keep = []
for i in data_vars:
    if i not in cat_vars:
        to_keep.append(i)
#Selecting only the numerical and encoded categorical columns
data_final=df[to_keep]
data_final.columns
The preceding code generates the following output:
#Segregating Independent and Target variable
X=data_final.drop(columns='y')
y=data_final['y']
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
print("FULL Dateset X Shape: ", X.shape )
print("Train Dateset X Shape: ", X_train.shape )
print("Test Dateset X Shape: ", X_test.shape )
The output is as follows:
Solution:
import matplotlib.pyplot as plt
x = ['January','February','March','April','May','June']
y = [1000, 1200, 1400, 1600, 1800, 2000]
plt.plot(x, y, '*:b')
plt.xlabel('Month')
plt.ylabel('Items Sold')
plt.title('Items Sold has been Increasing Linearly')
plt.show()
Check out the following screenshot for the resultant output:
Solution:
x = ['Boston Celtics','Los Angeles Lakers', 'Chicago Bulls', 'Golden State Warriors', 'San Antonio Spurs']
y = [17, 16, 6, 6, 5]
import pandas as pd
df = pd.DataFrame({'Team': x,
'Titles': y})
df_sorted = df.sort_values(by='Titles', ascending=False)
If we sort with ascending=True, the plot will have larger values to the right. Since we want the larger values on the left, we will be using ascending=False.
team_with_most_titles = df_sorted['Team'].iloc[0]
most_titles = df_sorted['Titles'].iloc[0]
title = 'The {} have the most titles with {}'.format(team_with_most_titles, most_titles)
import matplotlib.pyplot as plt
plt.bar(df_sorted['Team'], df_sorted['Titles'], color='red')
plt.xlabel('Team')
plt.ylabel('Number of Championships')
plt.xticks(rotation=45)
plt.title(title)
plt.savefig('Titles_by_Team')
When we print the plot to the console using plt.show(), it appears as intended; however, when we open the file we created titled 'Titles_by_Team.png', we see that it crops the x tick labels.
The following figure displays the bar plot with the cropped x tick labels.
plt.savefig('Titles_by_Team', bbox_inches='tight')
Check out the following output for the final result:
Solution:
import pandas as pd
import numpy as np
Items_by_Week = pd.read_csv('Items_Sold_by_Week.csv')
Weight_by_Height = pd.read_csv('Weight_by_Height.csv')
y = np.random.normal(loc=0, scale=0.1, size=100)
import matplotlib.pyplot as plt
fig, axes = plt.subplots(nrows=3, ncols=2)
plt.tight_layout()
axes[0,0].set_title('Line')
axes[0,1].set_title('Bar')
axes[1,0].set_title('Horizontal Bar')
axes[1,1].set_title('Histogram')
axes[2,0].set_title('Scatter')
axes[2,1].set_title('Box-and-Whisker')
axes[0,0].plot(Items_by_Week['Week'], Items_by_Week['Items_Sold'])
axes[0,1].bar(Items_by_Week['Week'], Items_by_Week['Items_Sold'])
axes[1,0].barh(Items_by_Week['Week'], Items_by_Week['Items_Sold'])
See the resultant output in the following figure:
axes[1,1].hist(y, bins=20)
axes[2,1].boxplot(y)
The resultant output is displayed here:
axes[2,0].scatter(Weight_by_Height['Height'], Weight_by_Height['Weight'])
See the figure here for the resultant output:
See the figure here for the resultant output:
fig, axes = plt.subplots(nrows=3, ncols=2, figsize=(8,8))
fig.savefig('Six_Subplots')
The following figure displays the 'Six_Subplots.png' file:
Solution:
predictions = model.predict(X_test)
2. Plot the predicted versus actual values on a scatterplot using the following code:
import matplotlib.pyplot as plt
from scipy.stats import pearsonr
plt.scatter(y_test, predictions)
plt.xlabel('Y Test (True Values)')
plt.ylabel('Predicted Values')
plt.title('Predicted vs. Actual Values (r = {0:0.2f})'.format(pearsonr(y_test, predictions)[0]))
plt.show()
Refer to the resultant output here:
There is a much stronger linear correlation between the predicted and actual values in the multiple linear regression model (r = 0.93) relative to the simple linear regression model (r = 0.62).
import seaborn as sns
from scipy.stats import shapiro
sns.distplot((y_test - predictions), bins = 50)
plt.xlabel('Residuals')
plt.ylabel('Density')
plt.title('Histogram of Residuals (Shapiro W p-value = {0:0.3f})'.format(shapiro(y_test - predictions)[1]))
plt.show()
Refer to the resultant output here:
Our residuals are negatively skewed and non-normal, but the skew is less pronounced than in the simple linear regression model.
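If you want to put a number on that skew, a quick check (not part of the original solution) is to compute the sample skewness of the residuals with scipy:
from scipy.stats import skew
# Negative values indicate a left (negative) skew of the residuals
print(skew(y_test - predictions))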
from sklearn import metrics
import numpy as np
metrics_df = pd.DataFrame({'Metric': ['MAE',
'MSE',
'RMSE',
'R-Squared'],
'Value': [metrics.mean_absolute_error(y_test, predictions),
metrics.mean_squared_error(y_test, predictions),
np.sqrt(metrics.mean_squared_error(y_test, predictions)),
metrics.explained_variance_score(y_test, predictions)]}).round(3)
print(metrics_df)
Please refer to the resultant output:
The multiple linear regression model performed better on every metric relative to the simple linear regression model.
Solution:
predicted_prob = model.predict_proba(X_test)[:,1]
# Class predictions are needed for the confusion matrix
predicted_class = model.predict(X_test)
from sklearn.metrics import confusion_matrix
import numpy as np
cm = pd.DataFrame(confusion_matrix(y_test, predicted_class))
cm['Total'] = np.sum(cm, axis=1)
cm = cm.append(np.sum(cm, axis=0), ignore_index=True)
cm.columns = ['Predicted No', 'Predicted Yes', 'Total']
cm = cm.set_index([['Actual No', 'Actual Yes', 'Total']])
print(cm)
Nice! We have decreased our number of false positives from 6 to 2. Additionally, our false negatives were lowered from 10 to 4 (see Exercise 26). Be aware that results may vary slightly.
from sklearn.metrics import classification_report
print(classification_report(y_test, predicted_class))
By tuning the hyperparameters of the logistic regression model, we were able to improve upon a logistic regression model that was already performing very well.
Solution:
predicted_class = model.predict(X_test)
from sklearn.metrics import confusion_matrix
import numpy as np
cm = pd.DataFrame(confusion_matrix(y_test, predicted_class))
cm['Total'] = np.sum(cm, axis=1)
cm = cm.append(np.sum(cm, axis=0), ignore_index=True)
cm.columns = ['Predicted No', 'Predicted Yes', 'Total']
cm = cm.set_index([['Actual No', 'Actual Yes', 'Total']])
print(cm)
See the resultant output here:
from sklearn.metrics import classification_report
print(classification_report(y_test, predicted_class))
See the resultant output here:
Here, we demonstrated how to tune the hyperparameters of an SVC model using grid search.
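The grid search itself was set up earlier in the activity; as a reminder, a minimal sketch of what it might look like is shown here (the hyperparameter grid below is an illustrative assumption, not necessarily the exact grid used in the activity):
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC
# Hypothetical grid; substitute the grid used in the activity
grid = {'C': [0.1, 1, 10], 'gamma': ['scale', 0.1, 1], 'kernel': ['linear', 'rbf']}
model = GridSearchCV(SVC(), grid, scoring='f1', cv=5)
model.fit(X_train, y_train)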
Solution:
import pandas as pd
df = pd.read_csv('weather.csv')
import pandas as pd
df_dummies = pd.get_dummies(df, drop_first=True)
from sklearn.utils import shuffle
df_shuffled = shuffle(df_dummies, random_state=42)
DV = 'Rain'
X = df_shuffled.drop(DV, axis=1)
y = df_shuffled[DV]
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
from sklearn.preprocessing import StandardScaler
model = StandardScaler()
X_train_scaled = model.fit_transform(X_train)
X_test_scaled = model.transform(X_test)
Solution:
predicted_prob = model.predict_proba(X_test_scaled)[:,1]
predicted_class = model.predict(X_test_scaled)
from sklearn.metrics import confusion_matrix
import numpy as np
cm = pd.DataFrame(confusion_matrix(y_test, predicted_class))
cm['Total'] = np.sum(cm, axis=1)
cm = cm.append(np.sum(cm, axis=0), ignore_index=True)
cm.columns = ['Predicted No', 'Predicted Yes', 'Total']
cm = cm.set_index([['Actual No', 'Actual Yes', 'Total']])
print(cm)
Refer to the resultant output here:
from sklearn.metrics import classification_report
print(classification_report(y_test, predicted_class))
Refer to the resultant output here:
There was only one misclassified observation. Thus, by tuning a decision tree classifier model on our weather.csv dataset, we were able to predict rain (or snow) with great accuracy. We can see that the sole driving feature was temperature in Celsius. This makes sense due to the way in which decision trees use recursive partitioning to make predictions.
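If you want to verify that claim yourself, you can inspect the fitted estimator's feature importances; a minimal sketch, assuming model is the fitted GridSearchCV object from the activity (if model is the tree itself, use model.feature_importances_ directly):
import pandas as pd
# best_estimator_ is the refitted decision tree found by the grid search
importances = pd.Series(model.best_estimator_.feature_importances_, index=X_train.columns)
print(importances.sort_values(ascending=False))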
Solution:
import numpy as np
grid = {'criterion': ['mse','mae'],
'max_features': ['auto', 'sqrt', 'log2', None],
'min_impurity_decrease': np.linspace(0.0, 1.0, 10),
'bootstrap': [True, False],
'warm_start': [True, False]}
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestRegressor
model = GridSearchCV(RandomForestRegressor(), grid, scoring='explained_variance', cv=5)
model.fit(X_train_scaled, y_train)
See the output here:
best_parameters = model.best_params_
print(best_parameters)
See the resultant output below:
Solution:
predictions = model.predict(X_test_scaled)
import matplotlib.pyplot as plt
from scipy.stats import pearsonr
plt.scatter(y_test, predictions)
plt.xlabel('Y Test (True Values)')
plt.ylabel('Predicted Values')
plt.title('Predicted vs. Actual Values (r = {0:0.2f})'.format(pearsonr(y_test, predictions)[0]))
plt.show()
Refer to the resultant output here:
import seaborn as sns
from scipy.stats import shapiro
sns.distplot((y_test - predictions), bins = 50)
plt.xlabel('Residuals')
plt.ylabel('Density')
plt.title('Histogram of Residuals (Shapiro W p-value = {0:0.3f})'.format(shapiro(y_test - predictions)[1]))
plt.show()
Refer to the resultant output here:
from sklearn import metrics
import numpy as np
metrics_df = pd.DataFrame({'Metric': ['MAE',
'MSE',
'RMSE',
'R-Squared'],
'Value': [metrics.mean_absolute_error(y_test, predictions),
metrics.mean_squared_error(y_test, predictions),
np.sqrt(metrics.mean_squared_error(y_test, predictions)),
metrics.explained_variance_score(y_test, predictions)]}).round(3)
print(metrics_df)
Find the resultant output here:
The random forest regressor model seems to underperform compared to the multiple linear regression, as evidenced by greater MAE, MSE, and RMSE values, as well as less explained variance. Additionally, there was a weaker correlation between the predicted and actual values, and the residuals were further from being normally distributed. Nevertheless, by leveraging ensemble methods using a random forest regressor, we constructed a model that explains 75.8% of the variance in temperature and predicts temperature in Celsius to within roughly ±3.781 degrees.
Solution:
After the glass dataset has been imported, shuffled, and standardized (see Exercise 58):
import pandas as pd
labels_df = pd.DataFrame()
from sklearn.cluster import KMeans
for i in range(0, 100):
    model = KMeans(n_clusters=2)
    model.fit(scaled_features)
    labels = model.labels_
    labels_df['Model_{}_Labels'.format(i+1)] = labels
row_mode = labels_df.mode(axis=1)
labels_df['row_mode'] = row_mode
print(labels_df.head(5))
We have drastically increased the confidence in our predictions by iterating through numerous models, saving the predictions at each iteration, and assigning the final predictions as the mode of these predictions. However, these predictions were generated by models using a predetermined number of clusters. Unless we know the number of clusters a priori, we will want to discover the optimal number of clusters to segment our observations.
Solution:
from sklearn.decomposition import PCA
model = PCA(n_components=best_n_components)
df_pca = model.fit_transform(scaled_features)
from sklearn.cluster import KMeans
import numpy as np
mean_inertia_list_PCA = []
# Outer loop: try 1 to 10 clusters
for x in range(1, 11):
    inertia_list = []
    # Inner loop: fit 100 models for the current number of clusters, x
    for i in range(100):
        model = KMeans(n_clusters=x)
        model.fit(df_pca)
        inertia = model.inertia_
        inertia_list.append(inertia)
    # Average the inertia over the 100 models for this value of x
    mean_inertia = np.mean(inertia_list)
    mean_inertia_list_PCA.append(mean_inertia)
print(mean_inertia_list_PCA)
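To pick the number of clusters from these values, it helps to plot the mean inertia against the number of clusters and look for the elbow; a small sketch (not part of the original solution):
import matplotlib.pyplot as plt
plt.plot(range(1, 11), mean_inertia_list_PCA, marker='o')
plt.xlabel('Number of Clusters')
plt.ylabel('Mean Inertia (100 models)')
plt.title('Mean Inertia by Number of Clusters (PCA-Transformed Features)')
plt.show()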
Solution:
import pandas as pd
import xgboost as xgb
import numpy as np
from sklearn.metrics import accuracy_score
data = pd.read_csv("../data/adult-data.csv", names=['age', 'workclass', 'education-num', 'occupation', 'capital-gain', 'capital-loss', 'hours-per-week', 'income'])
We pass the column names explicitly because the file does not contain a header row; naming the columns up front makes the data easier to work with.
from sklearn.preprocessing import LabelEncoder
data['workclass'] = LabelEncoder().fit_transform(data['workclass'])
data['occupation'] = LabelEncoder().fit_transform(data['occupation'])
data['income'] = LabelEncoder().fit_transform(data['income'])
Here, we encode all the categorical string data that we have. There is another method we can use to prevent writing the same piece of code again and again. See if you can find it.
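If you get stuck, one possibility (a sketch, and only one of several alternatives) is to apply a LabelEncoder to every object-typed column in a single statement:
from sklearn.preprocessing import LabelEncoder
# Hypothetical alternative: encode every remaining object-typed column in one pass
categorical_cols = data.select_dtypes(include='object').columns
data[categorical_cols] = data[categorical_cols].apply(lambda col: LabelEncoder().fit_transform(col))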
X = data.copy()
X.drop("income", inplace = True, axis = 1)
Y = data.income
X_train, X_test = X[:int(X.shape[0]*0.8)].values, X[int(X.shape[0]*0.8):].values
Y_train, Y_test = Y[:int(Y.shape[0]*0.8)].values, Y[int(Y.shape[0]*0.8):].values
train = xgb.DMatrix(X_train, label=Y_train)
test = xgb.DMatrix(X_test, label=Y_test)
param = {'max_depth':7, 'eta':0.1, 'silent':1, 'objective':'binary:hinge'}
num_round = 50
model = xgb.train(param, train, num_round)
preds = model.predict(test)
accuracy = accuracy_score(Y[int(Y.shape[0]*0.8):].values, preds)
print("Accuracy: %.2f%%" % (accuracy * 100.0))
The output is as follows:
Solution:
import pandas as pd
import numpy as np
data = data = pd.read_csv("data/telco-churn.csv")
data.drop('customerID', axis = 1, inplace = True)
from sklearn.preprocessing import LabelEncoder
data['gender'] = LabelEncoder().fit_transform(data['gender'])
data.dtypes
The data types of the variables will be shown as follows:
data.TotalCharges = pd.to_numeric(data.TotalCharges, errors='coerce')
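The remaining object-typed columns (including the Churn target) also need to be numeric before they can be placed in a DMatrix; a minimal sketch of one way to do this, assuming a LabelEncoder per column is acceptable:
from sklearn.preprocessing import LabelEncoder
# Encode every remaining object-typed column, including the Churn target
for col in data.select_dtypes(include='object').columns:
    data[col] = LabelEncoder().fit_transform(data[col])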
import xgboost as xgb
import matplotlib.pyplot as plt
X = data.copy()
X.drop("Churn", inplace = True, axis = 1)
Y = data.Churn
X_train, X_test = X[:int(X.shape[0]*0.8)].values, X[int(X.shape[0]*0.8):].values
Y_train, Y_test = Y[:int(Y.shape[0]*0.8)].values, Y[int(Y.shape[0]*0.8):].values
train = xgb.DMatrix(X_train, label=Y_train)
test = xgb.DMatrix(X_test, label=Y_test)
test_error = {}
for i in range(20):
    param = {'max_depth':i, 'eta':0.1, 'silent':1, 'objective':'binary:hinge'}
    num_round = 50
    model_metrics = xgb.cv(param, train, num_round, nfold = 10)
    test_error[i] = model_metrics.iloc[-1]['test-error-mean']
plt.scatter(test_error.keys(),test_error.values())
plt.xlabel('Max Depth')
plt.ylabel('Test Error')
plt.show()
Check out the output in the following screenshot:
From the graph, it is clear that a max depth of 4 gives the least error. So, we will be using that to train our model.
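You can also confirm this programmatically from the test_error dictionary built above (a small addition, not part of the original solution):
# Pick the max depth with the lowest cross-validated test error
best_depth = min(test_error, key=test_error.get)
print(best_depth, test_error[best_depth])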
param = {'max_depth':4, 'eta':0.1, 'silent':1, 'objective':'binary:hinge'}
num_round = 100
model = xgb.train(param, train, num_round)
preds = model.predict(test)
from sklearn.metrics import accuracy_score
accuracy = accuracy_score(Y[int(Y.shape[0]*0.8):].values, preds)
print("Accuracy: %.2f%%" % (accuracy * 100.0))
The output is as follows:
model.save_model('churn-model.model')
Solution:
import pandas as pd
import numpy as np
data = data = pd.read_csv("data/BlackFriday.csv")
data.isnull().sum()
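Expressed as a proportion of all rows, the missingness is easier to judge; a quick extra check (not part of the original steps):
# Fraction of missing values per column
print(data.isnull().mean().sort_values(ascending=False))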
data.drop(['User_ID', 'Product_Category_2', 'Product_Category_3'], axis = 1, inplace = True)
The User_ID column is just an identifier, and the product category variables contain a large proportion of null values, so we drop all three.
from collections import defaultdict
from sklearn.preprocessing import LabelEncoder, MinMaxScaler
label_dict = defaultdict(LabelEncoder)
data[['Product_ID', 'Gender', 'Age', 'Occupation', 'City_Category', 'Stay_In_Current_City_Years', 'Marital_Status', 'Product_Category_1']] = data[['Product_ID', 'Gender', 'Age', 'Occupation', 'City_Category', 'Stay_In_Current_City_Years', 'Marital_Status', 'Product_Category_1']].apply(lambda x: label_dict[x.name].fit_transform(x))
from sklearn.model_selection import train_test_split
X = data
y = X.pop('Purchase')
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=9)
cat_cols_dict = {col: list(data[col].unique()) for col in ['Product_ID', 'Gender', 'Age', 'Occupation', 'City_Category', 'Stay_In_Current_City_Years', 'Marital_Status', 'Product_Category_1']}
train_input_list = []
test_input_list = []
for col in cat_cols_dict.keys():
    raw_values = np.unique(data[col])
    value_map = {}
    for i in range(len(raw_values)):
        value_map[raw_values[i]] = i
    train_input_list.append(X_train[col].map(value_map).values)
    test_input_list.append(X_test[col].map(value_map).fillna(0).values)
from keras.models import Model
from keras.layers import Input, Dense, Concatenate, Reshape, Dropout
from keras.layers.embeddings import Embedding
cols_out_dict = {
'Product_ID': 20,
'Gender': 1,
'Age': 2,
'Occupation': 6,
'City_Category': 1,
'Stay_In_Current_City_Years': 2,
'Marital_Status': 1,
'Product_Category_1': 9
}
inputs = []
embeddings = []
for col in cat_cols_dict.keys():
    inp = Input(shape=(1,), name = 'input_' + col)
    embedding = Embedding(len(cat_cols_dict[col]), cols_out_dict[col], input_length=1, name = 'embedding_' + col)(inp)
    embedding = Reshape(target_shape=(cols_out_dict[col],))(embedding)
    inputs.append(inp)
    embeddings.append(embedding)
x = Concatenate()(embeddings)
x = Dense(4, activation='relu')(x)
x = Dense(2, activation='relu')(x)
output = Dense(1, activation='relu')(x)
model = Model(inputs, output)
model.compile(loss='mae', optimizer='adam')
model.fit(train_input_list, y_train, validation_data = (test_input_list, y_test), epochs=20, batch_size=128)
from sklearn.metrics import mean_squared_error
y_pred = model.predict(test_input_list)
np.sqrt(mean_squared_error(y_test, y_pred))
The RMSE is:
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
embedding_Product_ID = model.get_layer('embedding_Product_ID').get_weights()[0]
pca = PCA(n_components=2)
Y = pca.fit_transform(embedding_Product_ID[:40])
plt.figure(figsize=(8,8))
plt.scatter(-Y[:, 0], -Y[:, 1])
for i, txt in enumerate(label_dict['Product_ID'].inverse_transform(cat_cols_dict['Product_ID'])[:40]):
    plt.annotate(txt, (-Y[i, 0], -Y[i, 1]), xytext = (-20, 8), textcoords = 'offset points')
plt.show()
The plot is as follows:
From the plot, you can see that similar products have been clustered together by the model.
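If you want to quantify this rather than eyeball the plot, you could compare two product vectors directly; a hypothetical check (not part of the original solution):
from numpy.linalg import norm
# Cosine similarity between the first two product embeddings; values close to 1 indicate similar products
v0, v1 = embedding_Product_ID[0], embedding_Product_ID[1]
print(np.dot(v0, v1) / (norm(v0) * norm(v1)))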
model.save('black-friday.model')
Solution:
def get_label(file):
    class_label = file.split('.')[0]
    if class_label == 'dog': label_vector = [1,0]
    elif class_label == 'cat': label_vector = [0,1]
    return label_vector
Then, create a function to read, resize, and preprocess the images:
import os
import numpy as np
from PIL import Image
from tqdm import tqdm
from random import shuffle
SIZE = 50
# PATH should point to the folder containing the training images, for example 'data/train/'
def get_data():
    data = []
    files = os.listdir(PATH)
    for image in tqdm(files):
        label_vector = get_label(image)
        img = Image.open(PATH + image).convert('L')
        img = img.resize((SIZE,SIZE))
        data.append([np.asarray(img), np.array(label_vector)])
    shuffle(data)
    return data
SIZE here refers to the dimension of the final square image we will input to the model. We resize the image to have the length and breadth equal to SIZE.
When running os.listdir(PATH), you will find that all the images of cats come first, followed by images of dogs.
data = get_data()
train = data[:7000]
test = data[7000:]
x_train = [data[0] for data in train]
y_train = [data[1] for data in train]
x_test = [data[0] for data in test]
y_test = [data[1] for data in test]
y_train = np.array(y_train)
y_test = np.array(y_test)
x_train = np.array(x_train).reshape(-1, SIZE, SIZE, 1)
x_test = np.array(x_test).reshape(-1, SIZE, SIZE, 1)
from keras.models import Sequential
from keras.layers import Dense, Dropout, Conv2D, MaxPool2D, Flatten, BatchNormalization
model = Sequential()
Add the convolutional layers:
model.add(Conv2D(48, (3, 3), activation='relu', padding='same', input_shape=(50,50,1)))
model.add(Conv2D(48, (3, 3), activation='relu'))
Add the pooling layer:
model.add(MaxPool2D(pool_size=(2, 2)))
model.add(BatchNormalization())
model.add(Dropout(0.10))
model.add(Flatten())
model.add(Dense(512, activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(2, activation='softmax'))
model.compile(loss='categorical_crossentropy',
optimizer='adam',
metrics = ['accuracy'])
Define the number of epochs you want to train the model for:
EPOCHS = 10
model_details = model.fit(x_train, y_train,
batch_size = 128,
epochs = EPOCHS,
validation_data= (x_test, y_test),
verbose=1)
score = model.evaluate(x_test, y_test)
print("Accuracy: {0:.2f}%".format(score[1]*100))
score = model.evaluate(x_train, y_train)
print("Accuracy: {0:.2f}%".format(score[1]*100))
The test set accuracy for this model is 70.4%. The training set accuracy is really high, at 96%. This means that the model has started to overfit. Improving the model to get the best possible accuracy is left for you as an exercise. You can plot the incorrectly predicted images using the code from previous exercises to get a sense of how well the model performs:
import matplotlib.pyplot as plt
y_pred = model.predict(x_test)
incorrect_indices = np.nonzero(np.argmax(y_pred,axis=1) != np.argmax(y_test,axis=1))[0]
labels = ['dog', 'cat']
image = 5
plt.imshow(x_test[incorrect_indices[image]].reshape(50,50), cmap=plt.get_cmap('gray'))
plt.show()
print("Prediction: {0}".format(labels[np.argmax(y_pred[incorrect_indices[image]])]))
Solution:
from PIL import Image
def get_input(file):
    return Image.open(PATH+file)
def get_output(file):
    class_label = file.split('.')[0]
    if class_label == 'dog': label_vector = [1,0]
    elif class_label == 'cat': label_vector = [0,1]
    return label_vector
SIZE = 50
def preprocess_input(image):
    # Data preprocessing
    image = image.convert('L')
    image = image.resize((SIZE,SIZE))
    # Data augmentation: each helper returns the transformed image, so reassign it
    image = random_vertical_shift(image, shift=0.2)
    image = random_horizontal_shift(image, shift=0.2)
    image = random_rotate(image, rot_range=45)
    image = random_horizontal_flip(image)
    return np.array(image).reshape(SIZE,SIZE,1)
This is for horizontal flip:
import random
def random_horizontal_flip(image):
    toss = random.randint(1, 2)
    if toss == 1:
        return image.transpose(Image.FLIP_LEFT_RIGHT)
    else:
        return image
This is for rotation:
def random_rotate(image, rot_range):
    value = random.randint(-rot_range, rot_range)
    return image.rotate(value)
This is for image shift:
import PIL.ImageChops
def random_horizontal_shift(image, shift):
    width, height = image.size
    rand_shift = random.randint(0, int(shift*width))
    image = PIL.ImageChops.offset(image, rand_shift, 0)
    image.paste((0), (0, 0, rand_shift, height))
    return image
def random_vertical_shift(image, shift):
    width, height = image.size
    rand_shift = random.randint(0, int(shift*height))
    image = PIL.ImageChops.offset(image, 0, rand_shift)
    image.paste((0), (0, 0, width, rand_shift))
    return image
import numpy as np
def custom_image_generator(images, batch_size = 128):
    while True:
        # Randomly select images for the batch
        batch_images = np.random.choice(images, size = batch_size)
        batch_input = []
        batch_output = []
        # Read image, perform preprocessing and get labels
        for file in batch_images:
            # Function that reads and returns the image
            input_image = get_input(file)
            # Function that gets the label of the image
            label = get_output(file)
            # Function that pre-processes and augments the image
            image = preprocess_input(input_image)
            batch_input.append(image)
            batch_output.append(label)
        batch_x = np.array(batch_input)
        batch_y = np.array(batch_output)
        # Return a tuple of (images, labels) to feed the network
        yield(batch_x, batch_y)
def get_label(file):
    class_label = file.split('.')[0]
    if class_label == 'dog': label_vector = [1,0]
    elif class_label == 'cat': label_vector = [0,1]
    return label_vector
This get_data function is similar to the one we used in Activity 21. The modification here is that we get the list of images to be read as an input parameter, and we return a tuple of images and their labels:
from tqdm import tqdm
def get_data(files):
    data_image = []
    labels = []
    for image in tqdm(files):
        label_vector = get_label(image)
        img = Image.open(PATH + image).convert('L')
        img = img.resize((SIZE,SIZE))
        labels.append(label_vector)
        data_image.append(np.asarray(img).reshape(SIZE,SIZE,1))
    data_x = np.array(data_image)
    data_y = np.array(labels)
    return (data_x, data_y)
import os
files = os.listdir(PATH)
random.shuffle(files)
train = files[:7000]
test = files[7000:]
validation_data = get_data(test)
from keras.models import Sequential
model = Sequential()
Add the convolutional layers
from keras.layers import Input, Dense, Dropout, Conv2D, MaxPool2D, Flatten, BatchNormalization
model.add(Conv2D(32, (3, 3), activation='relu', padding='same', input_shape=(50,50,1)))
model.add(Conv2D(32, (3, 3), activation='relu'))
Add the pooling layer:
model.add(MaxPool2D(pool_size=(2, 2)))
model.add(BatchNormalization())
model.add(Dropout(0.10))
model.add(Flatten())
model.add(Dense(512, activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(2, activation='softmax'))
EPOCHS = 10
BATCH_SIZE = 128
model.compile(loss='categorical_crossentropy',
optimizer='adam',
metrics = ['accuracy'])
model_details = model.fit_generator(custom_image_generator(train, batch_size = BATCH_SIZE),
steps_per_epoch = len(train) // BATCH_SIZE,
epochs = EPOCHS,
validation_data= validation_data,
verbose=1)
The test set accuracy for this model is 72.6%, which is an improvement on the model in Activity 21. You will observe that the training accuracy is really high, at 98%. This means that this model has started to overfit, much like the one in Activity 21. This could be due to a lack of data augmentation. Try changing the data augmentation parameters to see if there is any change in accuracy. Alternatively, you can modify the architecture of the neural network to get better results. You can plot the incorrectly predicted images to get a sense of how well the model performs.
import matplotlib.pyplot as plt
y_pred = model.predict(validation_data[0])
incorrect_indices = np.nonzero(np.argmax(y_pred,axis=1) != np.argmax(validation_data[1],axis=1))[0]
labels = ['dog', 'cat']
image = 7
plt.imshow(validation_data[0][incorrect_indices[image]].reshape(50,50), cmap=plt.get_cmap('gray'))
plt.show()
print("Prediction: {0}".format(labels[np.argmax(y_pred[incorrect_indices[image]])]))
Solution:
import pandas as pd
data = pd.read_csv('../../chapter 7/data/movie_reviews.csv', encoding='latin-1')
data.SentimentText = data.SentimentText.str.lower()
Keep in mind that "Hello" and "hello" are not the same to a computer.
import re
def clean_str(string):
    string = re.sub(r"https?://\S+", '', string)
    string = re.sub(r'<a href', ' ', string)
    string = re.sub(r'&', '', string)
    string = re.sub(r'<br />', ' ', string)
    string = re.sub(r'[_"\-;%()|+&=*%.,!?:#$@\[\]/]', ' ', string)
    string = re.sub(r'\d', '', string)
    string = re.sub(r"can't", "cannot", string)
    string = re.sub(r"it's", "it is", string)
    return string
data.SentimentText = data.SentimentText.apply(lambda x: clean_str(str(x)))
To see how we found these words, refer to Exercise 51.
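One way to surface such corpus-specific, high-frequency words is simply to count tokens across the cleaned reviews; a quick sketch (the exact approach used in Exercise 51 may differ):
from collections import Counter
# Count every token in the cleaned reviews and show the most common ones
word_counts = Counter(word for review in data.SentimentText.str.split() for word in review)
print(word_counts.most_common(20))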
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize,sent_tokenize
stop_words = stopwords.words('english') + ['movie', 'film', 'time']
stop_words = set(stop_words)
remove_stop_words = lambda r: [[word for word in word_tokenize(sente) if word not in stop_words] for sente in sent_tokenize(r)]
data['SentimentText'] = data['SentimentText'].apply(remove_stop_words)
from gensim.models import Word2Vec
model = Word2Vec(
data['SentimentText'].apply(lambda x: x[0]),
iter=10,
size=16,
window=5,
min_count=5,
workers=10)
model.wv.save_word2vec_format('movie_embedding.txt', binary=False)
import numpy as np
def combine_text(text):
    try:
        return ' '.join(text[0])
    except:
        return np.nan
data.SentimentText = data.SentimentText.apply(lambda x: combine_text(x))
data = data.dropna(how='any')
from keras.preprocessing.text import Tokenizer
tokenizer = Tokenizer(num_words=5000)
tokenizer.fit_on_texts(list(data['SentimentText']))
sequences = tokenizer.texts_to_sequences(data['SentimentText'])
word_index = tokenizer.word_index
from keras.preprocessing.sequence import pad_sequences
reviews = pad_sequences(sequences, maxlen=100)
import numpy as np
def load_embedding(filename, word_index, num_words, embedding_dim):
    embeddings_index = {}
    file = open(filename, encoding="utf-8")
    for line in file:
        values = line.split()
        word = values[0]
        coef = np.asarray(values[1:], dtype='float32')
        embeddings_index[word] = coef
    file.close()
    embedding_matrix = np.zeros((num_words, embedding_dim))
    for word, pos in word_index.items():
        if pos >= num_words:
            continue
        embedding_vector = embeddings_index.get(word)
        if embedding_vector is not None:
            embedding_matrix[pos] = embedding_vector
    return embedding_matrix
embedding_matrix = load_embedding('movie_embedding.txt', word_index, len(word_index), 16)
from sklearn.model_selection import train_test_split
labels = pd.get_dummies(data.Sentiment)
X_train, X_test, y_train, y_test = train_test_split(reviews,labels, test_size=0.2, random_state=9)
from keras.layers import Input, Dense, Dropout, BatchNormalization, Embedding, Flatten
from keras.models import Model
inp = Input((100,))
embedding_layer = Embedding(len(word_index),
16,
weights=[embedding_matrix],
input_length=100,
trainable=False)(inp)
model = Flatten()(embedding_layer)
model = BatchNormalization()(model)
model = Dropout(0.10)(model)
model = Dense(units=1024, activation='relu')(model)
model = Dense(units=256, activation='relu')(model)
model = Dropout(0.5)(model)
predictions = Dense(units=2, activation='softmax')(model)
model = Model(inputs = inp, outputs = predictions)
model.compile(loss='binary_crossentropy', optimizer='sgd', metrics = ['acc'])
model.fit(X_train, y_train, validation_data = (X_test, y_test), epochs=10, batch_size=256)
from sklearn.metrics import accuracy_score
preds = model.predict(X_test)
accuracy_score(np.argmax(preds, 1), np.argmax(y_test.values, 1))
The accuracy of the model is:
y_actual = pd.Series(np.argmax(y_test.values, axis=1), name='Actual')
y_pred = pd.Series(np.argmax(preds, axis=1), name='Predicted')
pd.crosstab(y_actual, y_pred, margins=True)
Check the following output:
review_num = 111
print("Review: "+tokenizer.sequences_to_texts([X_test[review_num]])[0])
sentiment = "Positive" if np.argmax(preds[review_num]) else "Negative"
print(" Predicted sentiment = "+ sentiment)
sentiment = "Positive" if np.argmax(y_test.values[review_num]) else "Negative"
print(" Actual sentiment = "+ sentiment)
Check that you receive the following output:
Solution:
import pandas as pd
data = pd.read_csv('tweet-data.csv', encoding='latin-1', header=None)
data.columns = ['sentiment', 'id', 'date', 'q', 'user', 'text']
data = data.drop(['id', 'date', 'q', 'user'], axis=1)
data = data.sample(400000).reset_index(drop=True)
data.text = data.text.str.lower()
import re
def clean_str(string):
    string = re.sub(r"https?://\S+", '', string)
    string = re.sub(r"@\w*\s", '', string)
    string = re.sub(r'<a href', ' ', string)
    string = re.sub(r'&', '', string)
    string = re.sub(r'<br />', ' ', string)
    string = re.sub(r'[_"\-;%()|+&=*%.,!?:#$@\[\]/]', ' ', string)
    string = re.sub(r'\d', '', string)
    return string
data.text = data.text.apply(lambda x: clean_str(str(x)))
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize,sent_tokenize
stop_words = stopwords.words('english')
stop_words = set(stop_words)
remove_stop_words = lambda r: [[word for word in word_tokenize(sente) if word not in stop_words] for sente in sent_tokenize(r)]
data['text'] = data['text'].apply(remove_stop_words)
import numpy as np
def combine_text(text):
    try:
        return ' '.join(text[0])
    except:
        return np.nan
data.text = data.text.apply(lambda x: combine_text(x))
data = data.dropna(how='any')
from keras.preprocessing.text import Tokenizer
tokenizer = Tokenizer(num_words=5000)
tokenizer.fit_on_texts(list(data['text']))
sequences = tokenizer.texts_to_sequences(data['text'])
word_index = tokenizer.word_index
from keras.preprocessing.sequence import pad_sequences
tweets = pad_sequences(sequences, maxlen=50)
import numpy as np
def load_embedding(filename, word_index, num_words, embedding_dim):
    embeddings_index = {}
    file = open(filename, encoding="utf-8")
    for line in file:
        values = line.split()
        word = values[0]
        coef = np.asarray(values[1:], dtype='float32')
        embeddings_index[word] = coef
    file.close()
    embedding_matrix = np.zeros((num_words, embedding_dim))
    for word, pos in word_index.items():
        if pos >= num_words:
            continue
        embedding_vector = embeddings_index.get(word)
        if embedding_vector is not None:
            embedding_matrix[pos] = embedding_vector
    return embedding_matrix
embedding_matrix = load_embedding('../../embedding/glove.twitter.27B.50d.txt', word_index, len(word_index), 50)
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(tweets, pd.get_dummies(data.sentiment), test_size=0.2, random_state=9)
from keras.models import Sequential
from keras.layers import Dense, Dropout, BatchNormalization, Embedding, Flatten, LSTM
embedding_layer = Embedding(len(word_index),
50,
weights=[embedding_matrix],
input_length=50,
trainable=False)
model = Sequential()
model.add(embedding_layer)
model.add(Dropout(0.5))
model.add(LSTM(100, dropout=0.2))
model.add(Dense(2, activation='softmax'))
model.compile(loss='binary_crossentropy', optimizer='sgd', metrics = ['acc'])
model.fit(X_train, y_train, validation_data = (X_test, y_test), epochs=10, batch_size=256)
preds = model.predict(X_test)
review_num = 1
print("Tweet: "+tokenizer.sequences_to_texts([X_test[review_num]])[0])
sentiment = "Positive" if np.argmax(preds[review_num]) else "Negative"
print(" Predicted sentiment = "+ sentiment)
sentiment = "Positive" if np.argmax(y_test.values[review_num]) else "Negative"
print(" Actual sentiment = "+ sentiment)
The output is as follows:
Solution:
from PIL import Image
def get_input(file):
    return Image.open(PATH+file)
def get_output(file):
    class_label = file.split('.')[0]
    if class_label == 'dog': label_vector = [1,0]
    elif class_label == 'cat': label_vector = [0,1]
    return label_vector
SIZE = 200
CHANNELS = 3
def preprocess_input(image):
    # Data preprocessing
    image = image.resize((SIZE,SIZE))
    image = np.array(image).reshape(SIZE,SIZE,CHANNELS)
    # Normalize the image to the range [0, 1]
    image = image/255.0
    return image
import numpy as np
def custom_image_generator(images, batch_size = 128):
    while True:
        # Randomly select images for the batch
        batch_images = np.random.choice(images, size = batch_size)
        batch_input = []
        batch_output = []
        # Read image, perform preprocessing and get labels
        for file in batch_images:
            # Function that reads and returns the image
            input_image = get_input(file)
            # Function that gets the label of the image
            label = get_output(file)
            # Function that pre-processes and augments the image
            image = preprocess_input(input_image)
            batch_input.append(image)
            batch_output.append(label)
        batch_x = np.array(batch_input)
        batch_y = np.array(batch_output)
        # Return a tuple of (images, labels) to feed the network
        yield(batch_x, batch_y)
from tqdm import tqdm
def get_data(files):
    data_image = []
    labels = []
    for image in tqdm(files):
        label_vector = get_output(image)
        img = Image.open(PATH + image)
        img = img.resize((SIZE,SIZE))
        labels.append(label_vector)
        img = np.asarray(img).reshape(SIZE,SIZE,CHANNELS)
        img = img/255.0
        data_image.append(img)
    data_x = np.array(data_image)
    data_y = np.array(labels)
    return (data_x, data_y)
import os
import random
files = os.listdir(PATH)
random.shuffle(files)
train = files[:7000]
test = files[7000:]
validation_data = get_data(test)
7. Plot a few images from the dataset to see whether you loaded the files correctly:
import matplotlib.pyplot as plt
plt.figure(figsize=(20,10))
columns = 5
for i in range(columns):
    plt.subplot(5 // columns + 1, columns, i + 1)
    plt.imshow(validation_data[0][i])
A random sample of the images is shown here:
from keras.applications.inception_v3 import InceptionV3
base_model = InceptionV3(weights='imagenet', include_top=False, input_shape=(SIZE,SIZE,CHANNELS))
from keras.layers import GlobalAveragePooling2D, Dense, Dropout
from keras.models import Model
x = base_model.output
x = GlobalAveragePooling2D()(x)
x = Dense(256, activation='relu')(x)
x = Dropout(0.5)(x)
predictions = Dense(2, activation='softmax')(x)
model = Model(inputs=base_model.input, outputs=predictions)
model.compile(loss='categorical_crossentropy',
optimizer='adam',
metrics = ['accuracy'])
And then perform the training of the model:
EPOCHS = 50
BATCH_SIZE = 128
model_details = model.fit_generator(custom_image_generator(train, batch_size = BATCH_SIZE),
steps_per_epoch = len(train) // BATCH_SIZE,
epochs = EPOCHS,
validation_data= validation_data,
verbose=1)
score = model.evaluate(validation_data[0], validation_data[1])
print("Accuracy: {0:.2f}%".format(score[1]*100))
The accuracy is as follows:
Solution:
from numpy.random import seed
seed(1)
from tensorflow import set_random_seed
set_random_seed(1)
SIZE is the dimension of the square image input. CHANNELS is the number of channels in the training data images; there are 3 channels in an RGB image.
SIZE = 200
CHANNELS = 3
from PIL import Image
def get_input(file):
    return Image.open(PATH+file)
def get_output(file):
    class_label = file.split('.')[0]
    if class_label == 'dog': label_vector = [1,0]
    elif class_label == 'cat': label_vector = [0,1]
    return label_vector
def preprocess_input(image):
    # Data preprocessing
    image = image.resize((SIZE,SIZE))
    image = np.array(image).reshape(SIZE,SIZE,CHANNELS)
    # Normalize the image to the range [0, 1]
    image = image/255.0
    return image
import numpy as np
def custom_image_generator(images, batch_size = 128):
    while True:
        # Randomly select images for the batch
        batch_images = np.random.choice(images, size = batch_size)
        batch_input = []
        batch_output = []
        # Read image, perform preprocessing and get labels
        for file in batch_images:
            # Function that reads and returns the image
            input_image = get_input(file)
            # Function that gets the label of the image
            label = get_output(file)
            # Function that pre-processes and augments the image
            image = preprocess_input(input_image)
            batch_input.append(image)
            batch_output.append(label)
        batch_x = np.array(batch_input)
        batch_y = np.array(batch_output)
        # Return a tuple of (images, labels) to feed the network
        yield(batch_x, batch_y)
from tqdm import tqdm
def get_data(files):
    data_image = []
    labels = []
    for image in tqdm(files):
        label_vector = get_output(image)
        img = Image.open(PATH + image)
        img = img.resize((SIZE,SIZE))
        labels.append(label_vector)
        img = np.asarray(img).reshape(SIZE,SIZE,CHANNELS)
        img = img/255.0
        data_image.append(img)
    data_x = np.array(data_image)
    data_y = np.array(labels)
    return (data_x, data_y)
import os
import random
files = os.listdir(PATH)
random.shuffle(files)
train = files[:7000]
development = files[7000:8500]
test = files[8500:]
development_data = get_data(development)
test_data = get_data(test)
import matplotlib.pyplot as plt
plt.figure(figsize=(20,10))
columns = 5
for i in range(columns):
    plt.subplot(5 // columns + 1, columns, i + 1)
    plt.imshow(development_data[0][i])
Check the output in the following screenshot:
from keras.applications.inception_v3 import InceptionV3
base_model = InceptionV3(weights='imagenet', include_top=False, input_shape=(200,200,3))
10. Add the output dense layer according to our problem:
from keras.models import Model
from keras.layers import GlobalAveragePooling2D, Dense, Dropout
x = base_model.output
x = GlobalAveragePooling2D()(x)
x = Dense(256, activation='relu')(x)
keep_prob = 0.5
x = Dropout(rate = 1 - keep_prob)(x)
predictions = Dense(2, activation='softmax')(x)
model = Model(inputs=base_model.input, outputs=predictions)
for layer in base_model.layers[:5]:
layer.trainable = False
model.compile(loss='categorical_crossentropy',
optimizer='adam',
metrics = ['accuracy'])
from keras.callbacks import ModelCheckpoint, ReduceLROnPlateau, EarlyStopping, TensorBoard
callbacks = [
TensorBoard(log_dir='./logs',
update_freq='epoch'),
EarlyStopping(monitor = "val_loss",
patience = 18,
verbose = 1,
min_delta = 0.001,
mode = "min"),
ReduceLROnPlateau(monitor = "val_loss",
factor = 0.2,
patience = 8,
verbose = 1,
mode = "min"),
ModelCheckpoint(monitor = "val_loss",
filepath = "Dogs-vs-Cats-InceptionV3-{epoch:02d}-{val_loss:.2f}.hdf5",
save_best_only=True,
period = 1)]
Here, we are making use of four callbacks: TensorBoard, EarlyStopping, ReduceLROnPlateau, and ModelCheckpoint.
Perform training on the model. Here we train our model for 50 epochs only and with a batch size of 128:
EPOCHS = 50
BATCH_SIZE = 128
model_details = model.fit_generator(custom_image_generator(train, batch_size = BATCH_SIZE),
steps_per_epoch = len(train) // BATCH_SIZE,
epochs = EPOCHS,
callbacks = callbacks,
validation_data= development_data,
verbose=1)
The training logs on TensorBoard are shown here:
The logs of the development set from the TensorBoard tool are shown here:
The learning rate decrease can be observed from the following plot:
score = model.evaluate(test_data[0], test_data[1])
print("Accuracy: {0:.2f}%".format(score[1]*100))
To understand fully, refer to the following output screenshot:
As you can see, the model gets an accuracy of 93.6% on the test set, which is different from the accuracy of the development set (93.3% from the TensorBoard training logs). The early stopping callback stopped training when there wasn't a significant improvement in the loss of the development set; this helped us save some time. The learning rate was reduced after nine epochs, which helped training, as can be seen here:
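Because ModelCheckpoint was configured with save_best_only=True, the best weights seen during training were written to disk; a small sketch of how you might reload and re-evaluate them afterwards (the exact filename depends on the epoch and validation loss at which the checkpoint was saved):
import glob
from keras.models import load_model
# With save_best_only=True, the most recently written checkpoint is the best one so far
checkpoints = sorted(glob.glob('Dogs-vs-Cats-InceptionV3-*.hdf5'))
best_model = load_model(checkpoints[-1])
score = best_model.evaluate(test_data[0], test_data[1])
print("Accuracy: {0:.2f}%".format(score[1]*100))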