StopWordsRemover

Stop words are words that should be excluded from the input, typically because they appear frequently and carry little meaning. Spark's StopWordsRemover takes as input a sequence of strings, such as the output of a Tokenizer or RegexTokenizer, and removes all the stop words from the input sequences. The list of stop words is specified by the stopWords parameter. The current implementation of the StopWordsRemover API provides default stop word lists for the Danish, Dutch, English, Finnish, French, German, Hungarian, Italian, Norwegian, Portuguese, Russian, Spanish, Swedish, and Turkish languages. As an example, we can simply extend the Tokenizer example from the previous section, since the sentences there are already tokenized. For this example, however, we will use the output of the RegexTokenizer API.
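The default stop word list is English; to work with one of the other supported languages, the corresponding list can be loaded and set explicitly. The following is a minimal sketch (the choice of "german" is purely illustrative):

import org.apache.spark.ml.feature.StopWordsRemover

// Load the built-in stop word list for a supported language
val germanStopWords = StopWordsRemover.loadDefaultStopWords("german")

val germanRemover = new StopWordsRemover()
  .setStopWords(germanStopWords) // override the default English list
  .setInputCol("words")
  .setOutputCol("filtered")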

First, create a stop word remover instance using the StopWordsRemover() API, as follows:

import org.apache.spark.ml.feature.StopWordsRemover

val remover = new StopWordsRemover()
  .setInputCol("words")
  .setOutputCol("filtered")
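Note that matching against the stop word list is case-insensitive by default. If your corpus requires case-sensitive matching, this behavior can be toggled; a minimal sketch:

val caseSensitiveRemover = new StopWordsRemover()
  .setCaseSensitive(true) // defaults to false
  .setInputCol("words")
  .setOutputCol("filtered")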

Now, let's remove all the stop words and print the results as follows:

val newDF = remover.transform(regexTokenized)
newDF.select("id", "filtered").show(false)

The preceding code prints a snapshot of the filtered DataFrame with the stop words excluded:

Figure 11: Filtered (that is, without stop words) tokens
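The transform call above assumes the regexTokenized DataFrame produced in the previous section. For readers who want to run this example standalone, here is a minimal, self-contained sketch; the SparkSession name (spark) and the sample sentences are illustrative assumptions, not the original data:

import org.apache.spark.ml.feature.RegexTokenizer

// Illustrative input sentences (not the book's data)
val sentenceDF = spark.createDataFrame(Seq(
  (0, "Hi I heard about Spark"),
  (1, "I wish Java could use case classes")
)).toDF("id", "sentence")

// Tokenize by splitting on non-word characters
val regexTokenizer = new RegexTokenizer()
  .setInputCol("sentence")
  .setOutputCol("words")
  .setPattern("\\W")

val regexTokenized = regexTokenizer.transform(sentenceDF)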