Feature extractors

When the data in a raw dataframe is not in the form an ML algorithm expects, we use feature extractors to derive the required features from it. Common feature extractors are:

  • CountVectorizer: A CountVectorizer converts a collection of text documents into vectors of token counts. It works in two different ways, depending on how the vocabulary (the dictionary) gets populated. Let's first assume that the user has no prior knowledge of the terms that will appear in the text dataset; in that scenario the dictionary is built from the terms found across the dataset, ordered by term frequency. vocabSize defines the maximum number of words the dictionary can hold, while the optional parameter minDF keeps a word out of the dictionary if the number of documents it appears in is less than minDF:
List<Row> data = Arrays.asList(
  RowFactory.create(Arrays.asList("w1", "w2", "w3")),
  RowFactory.create(Arrays.asList("w1", "w2", "w4", "w5", "w2"))
);
StructType schema = new StructType(new StructField[]{
  new StructField("text", new ArrayType(DataTypes.StringType, true), false, Metadata.empty())
});
Dataset<Row> df = spark.createDataFrame(data, schema);

// Fit a CountVectorizer: the vocabulary is learned from the dataset itself
CountVectorizerModel cvModel = new CountVectorizer()
  .setInputCol("text")
  .setOutputCol("feature")
  .setVocabSize(5)
  .setMinDF(2)
  .fit(df);

System.out.println("The words in the Vocabulary are :: " + Arrays.toString(cvModel.vocabulary()));

cvModel.transform(df).show(false);
 
The output of the above code is:
The words in the Vocabulary are :: [w2, w1] 
+--------------------+-------------------+ 
|text                |feature            | 
+--------------------+-------------------+ 
|[w1, w2, w3]        |(2,[0,1],[1.0,1.0])|
|[w1, w2, w4, w5, w2]|(2,[0,1],[2.0,1.0])| 
+--------------------+-------------------+ 
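
The feature column uses Spark's sparse vector notation, (size, [indices], [values]); for example, (2,[0,1],[2.0,1.0]) denotes a vector of length 2 holding 2.0 at index 0 and 1.0 at index 1. The following is a minimal sketch that builds such a vector directly with the Vectors factory from org.apache.spark.ml.linalg:

// Build the sparse vector (2,[0,1],[2.0,1.0]) directly:
// length 2, with value 2.0 at index 0 and 1.0 at index 1
Vector v = Vectors.sparse(2, new int[]{0, 1}, new double[]{2.0, 1.0});
System.out.println(v); // prints (2,[0,1],[2.0,1.0])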

When no dictionary is supplied, CountVectorizer scans the dataset to build one based on term frequency and the configured size, which costs an extra pass over the data. If the words that should constitute the dictionary are known in advance, this pass can be avoided by supplying an a-priori dictionary at the time of instantiating CountVectorizerModel. Running the following transformation on the same dataset as in the previous example:

CountVectorizerModel cvMod = new CountVectorizerModel(new String[]{"w1", "w2", "w3"})
  .setInputCol("text")
  .setOutputCol("feature");

System.out.println("The words in the Vocabulary are :: " + Arrays.toString(cvMod.vocabulary()));

cvMod.transform(df).show(false);

We get the following output:

The words in the Vocabulary are :: [w1, w2, w3]

+--------------------+-------------------------+
|text                |feature                  |
+--------------------+-------------------------+
|[w1, w2, w3]        |(3,[0,1,2],[1.0,1.0,1.0])|
|[w1, w2, w4, w5, w2]|(3,[0,1],[1.0,2.0])      |
+--------------------+-------------------------+
  • Term Frequency-Inverse Document Frequency (TF-IDF): TF-IDF is a feature extraction technique that computes the weight of individual words in a document rather than across entire datasets. The weight here is a statistical measure used to gauge the importance of a word within a document in a collection or dataset. The importance of a word increases with its frequency in the document, but in TF-IDF it is offset by the frequency of the word in the entire collection or dataset. TF-IDF comprises two parts, namely:
  • Term Frequency (TF): It is a statistical measure of how frequently a term occurs in a document. For example, if the word spark occurs five times in a document and the total number of words in that document is 20, then:

TF = Number of occurrences of the word in the document / Total number of words in the document

TF = 5/20 = 0.25

In the Spark ML library, one can use HashingTF or CountVectorizer to calculate TF, as shown in the example further below.

  • Inverse Document Frequency (IDF): It measures the importance of a term across the collection or dataset as a whole. It accounts for the fact that certain terms, such as is, are, the, and so on, occur very frequently yet carry little importance as far as the dataset is concerned. So if the word spark occurs in 100 documents out of a dataset of 10,000 documents, then:
IDF = loge(Total number of documents / Number of documents containing the term)

IDF = loge(10,000/100) = 4.6

Hence the value of TF-IDF = TF * IDF = 0.25 * 4.6 = 1.15
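
To make the arithmetic concrete, the following is a minimal plain-Java sketch (independent of the Spark API; the class name is only illustrative) that reproduces the calculation above:

public class TfIdfArithmetic {
  public static void main(String[] args) {
    double tf = 5.0 / 20.0;                  // the word occurs 5 times in a 20-word document
    double idf = Math.log(10000.0 / 100.0);  // natural log of (total documents / documents containing the word)
    double tfIdf = tf * idf;
    System.out.printf("TF = %.2f, IDF = %.2f, TF-IDF = %.2f%n", tf, idf, tfIdf);
    // Prints: TF = 0.25, IDF = 4.61, TF-IDF = 1.15
  }
}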

In Spark, IDF is an estimator that generates an IDFModel when its fit() method is called on a dataset. The feature vectors created by HashingTF or CountVectorizer are then passed to the model's transform() method, which rescales the original vector values:

List<Row> data = Arrays.asList(
  RowFactory.create(0.0, "Spark"),
  RowFactory.create(0.0, "Spark"),
  RowFactory.create(0.5, "MLIB"),
  RowFactory.create(0.6, "MLIB"),
  RowFactory.create(0.7, "MLIB"),
  RowFactory.create(1.0, "ML")
);
StructType schema = new StructType(new StructField[]{
  new StructField("label", DataTypes.DoubleType, false, Metadata.empty()),
  new StructField("words", DataTypes.StringType, false, Metadata.empty())
});
Dataset<Row> rowData = spark.createDataFrame(data, schema);

// Split each sentence into individual tokens
Tokenizer tokenizer = new Tokenizer().setInputCol("words").setOutputCol("tokenizedWords");
Dataset<Row> tokenizedData = tokenizer.transform(rowData);

// Map the tokens to term-frequency vectors of a fixed size
int numFeatures = 20;
HashingTF hashingTF = new HashingTF()
  .setInputCol("tokenizedWords")
  .setOutputCol("rawFeatures")
  .setNumFeatures(numFeatures);
Dataset<Row> featurizedData = hashingTF.transform(tokenizedData);

// Fit the IDF estimator and rescale the raw term-frequency vectors
IDF idf = new IDF().setInputCol("rawFeatures").setOutputCol("features");
IDFModel idfModel = idf.fit(featurizedData);
Dataset<Row> rescaledData = idfModel.transform(featurizedData);

rescaledData.select("label", "features").show(false);
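
Because the Tokenizer, HashingTF, and IDF stages above always run one after another, they can equally be chained with Spark ML's Pipeline API. The following is a minimal sketch that reuses the stage variables from the example above:

// Compose the three stages into a single Pipeline and fit it in one call
Pipeline pipeline = new Pipeline()
  .setStages(new PipelineStage[]{tokenizer, hashingTF, idf});

PipelineModel pipelineModel = pipeline.fit(rowData);
pipelineModel.transform(rowData).select("label", "features").show(false);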

Word2Vec is an estimator that produces a Word2VecModel from a dataset whose rows contain sentences, that is, sequences of words, each representing a document. The model transforms each document into a fixed-size vector by averaging the vectors of all words in the document. Word2Vec is popularly used for prediction, document similarity calculation, and so on:

List<Row> data = Arrays.asList(
  RowFactory.create(Arrays.asList("Learning Apache Spark".split(" "))),
  RowFactory.create(Arrays.asList("Spark has API for Java, Scala, and Python".split(" "))),
  RowFactory.create(Arrays.asList("API in above language are very richly developed".split(" ")))
);
StructType schema = new StructType(new StructField[]{
  new StructField("text", new ArrayType(DataTypes.StringType, true), false, Metadata.empty())
});
Dataset<Row> documentDF = spark.createDataFrame(data, schema);

// Learn 10-dimensional word vectors from the documents
Word2Vec word2Vec = new Word2Vec()
  .setInputCol("text")
  .setOutputCol("result")
  .setVectorSize(10)
  .setMinCount(0);

Word2VecModel model = word2Vec.fit(documentDF);
Dataset<Row> result = model.transform(documentDF);

result.show(false);
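
The fitted Word2VecModel can also be queried for the words closest to a given word in the learned vector space; the following is a minimal sketch using the model from the example above (the queried word and the number of neighbours are arbitrary choices):

// Find the two words nearest to "Spark" in the learned vector space
model.findSynonyms("Spark", 2).show(false);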