Create input data

The gensim.models.doc2vec class processes documents in the TaggedDocument format that contains the tokenized documents alongside a unique tag that permits accessing the document vectors after training:

sentences = []
for i, (_, text) in enumerate(sample.values):
sentences.append(TaggedDocument(words=text.split(), tags=[i]))

The training interface works similar to word2vec with additional parameters to specify the Doc2vec algorithm:

model = Doc2vec(documents=sentences,
dm=1, # algorithm: use distributed memory
dm_concat=0, # 1: concat, not sum/avg context vectors
dbow_words=0, # 1: train word vectors, 0: only doc
vectors
alpha=0.025, # initial learning rate
size=300,
window=5,
min_count=10,
epochs=5,
negative=5)
model.save('test.model')

You can also use the train() method to continue the learning process and, for example, iteratively reduce the learning rate:

for _ in range(10):
alpha *= .9
model.train(sentences,
total_examples=model.corpus_count,
epochs=model.epochs,
alpha=alpha)

As a result, we can access the document vectors as features to train a sentiment classifier:

X = np.zeros(shape=(len(sample), size))
y = sample.stars.sub(1) # model needs [0, 5) labels
for i in range(len(sample)):
X[i] = model[i]

We will train a lightgbm gradient boosting machine as follows:

  1. Create lightgbm Dataset objects from the train and test sets:
train_data = lgb.Dataset(data=X_train, label=y_train)
test_data = train_data.create_valid(X_test, label=y_test)
  1. Define the training parameters for a multiclass model with five classes (using defaults otherwise):
params = {'objective'  : 'multiclass',
'num_classes': 5}
  1. Train the model for 250 iterations and monitor the validation set error:
lgb_model = lgb.train(params=params,
train_set=train_data,
num_boost_round=250,
valid_sets=[train_data, test_data],
verbose_eval=25)
  1. Lightgbm predicts probabilities for all five classes. We obtain class predictions using np.argmax() to obtain the column index with the highest predicted probability:
y_pred = np.argmax(lgb_model.predict(X_test), axis=1)
  1. We compute the accuracy score to evaluate the result and see an improvement of more than 100% over the baseline of 20% for five balanced classes:
accuracy_score(y_true=y_test, y_pred=y_pred)
0.44955063467061984
  1. Finally, we take a closer look at predictions for each class using the confusion matrix:
cm = confusion_matrix(y_true=y_test, y_pred=y_pred)
cm = pd.DataFrame(cm / np.sum(cm), index=stars, columns=stars)
  1. And visualize the result as a seaborn heatmap:
sns.heatmap(cm, annot=True, cmap='Blues', fmt='.1%')

In sum, the doc2vec method allowed us to achieve a very substantial improvement in test accuracy over a naive benchmark without much tuning. If we only select top and bottom reviews (with five and one stars, respectively) and train a binary classifier, the AUC score achieves over 0.86 using 250,000 samples from each class.

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset
3.144.104.29