HashingTF

HashingTF is a Transformer that takes a set of terms and converts them into fixed-length feature vectors. Each term is passed through a hash function to generate an index, and the term frequencies are then counted against those indices.

In Spark, HashingTF uses the MurmurHash3 algorithm to hash terms.
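To make the idea concrete, the following is a minimal, hypothetical sketch of hashing a term into a bucket index using Scala's MurmurHash3 followed by a modulo over the number of features. Spark's internal hashing differs in detail (it hashes the term's bytes with a fixed seed), so the termIndex helper and the indices it produces are illustrative assumptions only and will not necessarily match Spark's output:

import scala.util.hashing.MurmurHash3

// Hypothetical helper for illustration only: map a term to a bucket index
// in [0, numFeatures). Spark's own hashing differs in detail.
val numFeatures = 100

def termIndex(term: String): Int = {
  val hash = MurmurHash3.stringHash(term)    // 32-bit MurmurHash3 of the term
  val mod = hash % numFeatures               // may be negative if hash is negative
  if (mod < 0) mod + numFeatures else mod    // shift into [0, numFeatures)
}

Seq("book", "coffee", "store").foreach(term => println(s"$term -> ${termIndex(term)}"))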

In order to use HashingTF, you need to import the following class:

import org.apache.spark.ml.feature.HashingTF

First, you need to initialize a HashingTF, specifying the input column and the output column. Here, we choose the filteredWords column created by the StopWordsRemover Transformer as the input and generate an output column named rawFeatures. We also set the number of features to 100:

scala> val hashingTF = new HashingTF().setInputCol("filteredWords").setOutputCol("rawFeatures").setNumFeatures(100)
hashingTF: org.apache.spark.ml.feature.HashingTF = hashingTF_b05954cb9375

Next, invoking the transform() function on the input dataset yields an output dataset:

scala> val rawFeaturesDF = hashingTF.transform(noStopWordsDF)
rawFeaturesDF: org.apache.spark.sql.DataFrame = [id: int, sentence: string ... 3 more fields]

The following is the output dataset showing the input columns id and sentence, along with the output column rawFeatures, which contains the features represented by a sparse vector:

scala> rawFeaturesDF.show(false)
|id |sentence |words |filteredWords |rawFeatures |
|1 |Hello there, how do you like the book so far? |[hello, there,, how, do, you, like, the, book, so, far?] |[hello, there,, like, book, far?] |(100,[30,48,70,93],[2.0,1.0,1.0,1.0]) |
|2 |I am new to Machine Learning |[i, am, new, to, machine, learning] |[new, machine, learning] |(100,[25,52,72],[1.0,1.0,1.0]) |
|3 |Maybe i should get some coffee before starting |[maybe, i, should, get, some, coffee, before, starting] |[maybe, get, coffee, starting] |(100,[16,51,59,99],[1.0,1.0,1.0,1.0]) |
|4 |Coffee is best when you drink it hot |[coffee, is, best, when, you, drink, it, hot] |[coffee, best, drink, hot] |(100,[31,51,63,72],[1.0,1.0,1.0,1.0]) |
|5 |Book stores have coffee too so i should go to a book store|[book, stores, have, coffee, too, so, i, should, go, to, a, book, store]|[book, stores, coffee, go, book, store]|(100,[43,48,51,77,93],[1.0,1.0,1.0,1.0,2.0])|

Let's look at the preceding output to understand it better. If you look at the filteredWords and rawFeatures columns alone, you can see that:

  1. The array of words [hello, there, like, book, far] is transformed to the raw feature vector (100,[30,48,70,93],[2.0,1.0,1.0,1.0]).
  2. The array of words [book, stores, coffee, go, book, store] is transformed to the raw feature vector (100,[43,48,51,77,93],[1.0,1.0,1.0,1.0,2.0]).

So, what does the vector represent? The underlying logic is that each word is hashed to an integer index, and the number of its occurrences in the word array is counted.

Internally, Spark uses a mutable hash map for this, mutable.HashMap.empty[Int, Double], which stores the hash value of each word as the Integer key and the number of occurrences as the Double value. Double is used so that the result can be used in conjunction with IDF (we'll talk about it in the next section). Using this map, the array [book, stores, coffee, go, book, store] can be seen as [hashFunc(book), hashFunc(stores), hashFunc(coffee), hashFunc(go), hashFunc(book), hashFunc(store)], which corresponds to the distinct indices [43, 48, 51, 77, 93]. Counting the number of occurrences as well gives book-2, stores-1, coffee-1, go-1, store-1.
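As a rough sketch of that counting step (not Spark's actual implementation), the following accumulates term frequencies in a mutable.HashMap[Int, Double]; hashFunc is a placeholder for whatever term-hashing function is used, for example the termIndex sketch shown earlier:

import scala.collection.mutable

// Sketch of the counting step: accumulate the number of occurrences of each
// hashed word as a Double, keyed by the word's bucket index.
def termFrequencies(words: Seq[String], hashFunc: String => Int): mutable.HashMap[Int, Double] = {
  val counts = mutable.HashMap.empty[Int, Double]
  words.foreach { word =>
    val index = hashFunc(word)                          // hash the word to a bucket index
    counts(index) = counts.getOrElse(index, 0.0) + 1.0  // count occurrences as Double
  }
  counts
}

For [book, stores, coffee, go, book, store], this yields one entry per distinct bucket, with the bucket for book holding 2.0.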

Combining the preceding information, we can generate a vector of the form (numFeatures, hashValues, frequencies), which in this case is (100,[43,48,51,77,93],[1.0,1.0,1.0,1.0,2.0]).
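If you want to reproduce that vector by hand, the sparse representation can be built directly with Vectors.sparse; the size, indices, and values below are simply copied from the example above:

import org.apache.spark.ml.linalg.Vectors

// Build the same sparse vector manually: size, indices, and the corresponding values.
val tf = Vectors.sparse(100, Array(43, 48, 51, 77, 93), Array(1.0, 1.0, 1.0, 1.0, 2.0))
println(tf)  // (100,[43,48,51,77,93],[1.0,1.0,1.0,1.0,2.0])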
