As in the previous chapter, we need to prepare the training and validation data. Here, we'll reuse Spark's randomSplit API to split the data:
// Passing a seed (for example, randomSplit(Array(0.8, 0.2), 42)) makes the split reproducible
val trainValidSplits = inputData.randomSplit(Array(0.8, 0.2))
val (trainData, validData) = (trainValidSplits(0), trainValidSplits(1))
Now, let's perform a grid search using a simple decision tree and a few hyperparameters:
import org.apache.spark.ml.classification.DecisionTreeClassifier
import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator

val gridSearch =
  for (hpImpurity <- Array("entropy", "gini");
       hpDepth <- Array(5, 20);
       hpBins <- Array(10, 50))
  yield {
    println(s"Building model with: impurity=${hpImpurity}, depth=${hpDepth}, bins=${hpBins}")
    // Train a decision tree on the current hyperparameter combination
    val model = new DecisionTreeClassifier()
      .setFeaturesCol("reviewVector")
      .setLabelCol("label")
      .setImpurity(hpImpurity)
      .setMaxDepth(hpDepth)
      .setMaxBins(hpBins)
      .fit(trainData)
    // Score the validation data and compute the AUC
    val preds = model.transform(validData)
    val auc = new BinaryClassificationEvaluator()
      .setLabelCol("label")
      .evaluate(preds)
    (hpImpurity, hpDepth, hpBins, auc)
  }
We can now inspect the result and show the best model AUC:
import com.packtpub.mmlwspark.utils.Tabulizer.table

println(table(Seq("Impurity", "Depth", "Bins", "AUC"),
              gridSearch.sortBy(_._4).reverse,
              Map.empty[Int, String]))
The output is as follows:
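The table is handy for eyeballing the search, but if you just need the winning combination programmatically, it can be pulled straight out of the gridSearch sequence built above (a minimal sketch; the variable names are ours):

// Pick the hyperparameter tuple with the highest validation AUC
val (bestImpurity, bestDepth, bestBins, bestAuc) = gridSearch.maxBy(_._4)
println(s"Best model: impurity=$bestImpurity, depth=$bestDepth, bins=$bestBins, AUC=$bestAuc")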
Using this simple grid search on a decision tree, we can see that our poor man's doc2vec produces an AUC of 0.7054. Let's also publish our exact training and validation data to H2O and try a deep learning algorithm using the Flow UI:
import org.apache.spark.h2o._

// Start (or reuse) the H2OContext and publish the Spark DataFrames as named H2O frames
val hc = H2OContext.getOrCreate(sc)
val trainHf = hc.asH2OFrame(trainData, "trainData")
val validHf = hc.asH2OFrame(validData, "validData")
Now that we have successfully published our dataset as H2O frames, let's open the Flow UI and run a deep learning algorithm:
hc.openFlow()
First, note that if we run the getFrames command, we will see the two frames that we seamlessly passed from Spark to H2O:
We need to change the type of the label column from numeric to categorical by clicking on Convert to enum for both frames:
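If you would rather script this conversion than click through Flow, the same change can be made in code; the following is a sketch assuming H2O's core Java frame API (replace/toCategoricalVec), whose exact method names can vary slightly between H2O versions:

// Convert the label column from numeric to categorical on both frames
// (assumes H2O's core Java API; verify against your H2O version)
trainHf.replace(trainHf.find("label"), trainHf.vec("label").toCategoricalVec).remove()
trainHf.update()
validHf.replace(validHf.find("label"), validHf.vec("label").toCategoricalVec).remove()
validHf.update()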
Next, we will run a deep learning model with all of the hyperparameters set to their default value and the first column set to be our label:
If you did not explicitly create a train/validation split, you can instead perform n-fold cross-validation by setting the nfolds hyperparameter:
After running the model training, we can view the model output by clicking View to see the AUC on both the training and validation datasets:
We see a higher AUC of ~0.8289 for our simple deep learning model, and this is without any tuning or hyperparameter searching.
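If you prefer to stay in code rather than clicking through Flow, roughly the same model can be trained via H2O's Scala/Java API. The following is a minimal sketch; the DeepLearningParameters fields come from H2O's Java API, and its defaults may not match Flow's settings exactly:

import _root_.hex.deeplearning.DeepLearning
import _root_.hex.deeplearning.DeepLearningModel.DeepLearningParameters

// Configure a deep learning model with default hyperparameters
val dlParams = new DeepLearningParameters()
dlParams._train = trainHf._key
dlParams._valid = validHf._key
dlParams._response_column = "label"
// No explicit validation frame? Drop _valid and set dlParams._nfolds = 5 instead

// Train the model; validation metrics (including AUC) live on the model's output
val dlModel = new DeepLearning(dlParams).trainModel.get
println(dlModel._output._validation_metrics)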
What are some other steps that we can perform to improve the AUC even more? We could certainly try a new algorithm with grid searching for hyperparameters, but more interestingly, can we tune the document vectors? The answer is yes and no! It's a partial no because, as you will recall, word2vec is an unsupervised learning task at heart; however, we can get an idea of the strength of our vectors by observing some of the similar words returned. For example, let's take the word drama:
w2vModel.findSynonyms("drama", 5).show()
The output is as follows:
Intuitively, we can look at the results and ask whether these five words are really the best synonyms (that is, the highest cosine similarities) of the word drama. Let's now try rerunning our word2vec model with modified input parameters:
import org.apache.spark.ml.feature.Word2Vec

// Retrain word2vec with many more iterations, a slightly smaller learning rate (step size),
// and a lower minimum word count to keep rarer terms in the vocabulary
val newW2VModel = new Word2Vec()
  .setInputCol("reviewTokens")
  .setOutputCol("reviewVector")
  .setMinCount(3)
  .setMaxIter(250)
  .setStepSize(0.02)
  .fit(movieReviews)
newW2VModel.findSynonyms("drama", 5).show()
The output is as follows:
You should immediately notice that the synonyms are better in terms of similarity to the word in question, and also that the cosine similarities for the terms are significantly higher. Recall that the default number of iterations for word2vec is 1; we have now set it to 250, allowing our network to really triangulate on some quality word vectors. These can be improved further with more preprocessing steps and more tweaking of word2vec's hyperparameters, which should produce document vectors of better quality.
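To verify that the retrained vectors actually help the downstream classifier, one option is to re-vectorize the reviews and repeat the earlier evaluation. A minimal sketch, assuming movieReviews carries the reviewTokens and label columns used before, and reusing the best settings found by our grid search:

// Re-vectorize the reviews with the retrained word2vec model
val newInputData = newW2VModel.transform(movieReviews)
val Array(newTrain, newValid) = newInputData.randomSplit(Array(0.8, 0.2))

// Retrain the decision tree with the best grid-search settings and re-check the AUC
val newTreeModel = new DecisionTreeClassifier()
  .setFeaturesCol("reviewVector")
  .setLabelCol("label")
  .setImpurity(bestImpurity)
  .setMaxDepth(bestDepth)
  .setMaxBins(bestBins)
  .fit(newTrain)
val newAuc = new BinaryClassificationEvaluator()
  .setLabelCol("label")
  .evaluate(newTreeModel.transform(newValid))
println(s"AUC with retrained vectors: $newAuc")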