Tokenization

The Tokenizer converts the input string to lowercase and then splits it on whitespace into individual tokens. A given sentence is split into words either using the default whitespace delimiter or using a custom regular expression-based tokenizer. In either case, the input column is transformed into an output column; in particular, the input column is typically a String and the output column is a sequence of words.
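
The examples in this section operate on an input dataset called sentenceDF, containing an id column and a sentence column. As a minimal sketch (the construction itself is not shown here), such a dataset could be created in the Spark shell as follows, using the same five sentences that appear in the outputs below:

import spark.implicits._

// Sample input dataset with an id and a sentence per row.
val sentenceDF = Seq(
  (1, "Hello there, how do you like the book so far?"),
  (2, "I am new to Machine Learning"),
  (3, "Maybe i should get some coffee before starting"),
  (4, "Coffee is best when you drink it hot"),
  (5, "Book stores have coffee too so i should go to a book store")
).toDF("id", "sentence")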

Tokenizers are made available by importing the two classes shown next, Tokenizer and RegexTokenizer:

import org.apache.spark.ml.feature.Tokenizer
import org.apache.spark.ml.feature.RegexTokenizer

First, you need to initialize a Tokenizer specifying the input column and the output column:

scala> val tokenizer = new Tokenizer().setInputCol("sentence").setOutputCol("words")
tokenizer: org.apache.spark.ml.feature.Tokenizer = tok_942c8332b9d8

Next, invoking the transform() function on the input dataset yields an output dataset:

scala> val wordsDF = tokenizer.transform(sentenceDF)
wordsDF: org.apache.spark.sql.DataFrame = [id: int, sentence: string ... 1 more field]
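
Before inspecting the rows, you can verify that the new words column is indeed a sequence of strings by printing the schema of the transformed dataset:

// The words column added by the Tokenizer is an array (sequence) of strings.
wordsDF.printSchema()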

The following is the output dataset showing the input columns id and sentence, and the output column words, which contains the sequence of words:

scala> wordsDF.show(false)
|id|sentence |words |
|1 |Hello there, how do you like the book so far? |[hello, there,, how, do, you, like, the, book, so, far?] |
|2 |I am new to Machine Learning |[i, am, new, to, machine, learning] |
|3 |Maybe i should get some coffee before starting |[maybe, i, should, get, some, coffee, before, starting] |
|4 |Coffee is best when you drink it hot |[coffee, is, best, when, you, drink, it, hot] |
|5 |Book stores have coffee too so i should go to a book store|[book, stores, have, coffee, too, so, i, should, go, to, a, book, store]|
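
Once the words are available as a sequence, they can be processed with ordinary DataFrame operations. As a small sketch (the countTokens helper below is illustrative and not part of the original example), a user-defined function can count how many tokens each sentence produced:

import org.apache.spark.sql.functions.{col, udf}

// Hypothetical helper that counts the tokens produced for each sentence.
val countTokens = udf { (words: Seq[String]) => words.length }

wordsDF
  .withColumn("tokenCount", countTokens(col("words")))
  .select("id", "words", "tokenCount")
  .show(false)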

On the other hand, if you want to set up a regular expression-based tokenizer, you have to use the RegexTokenizer instead of the Tokenizer. For this, you need to initialize a RegexTokenizer, specifying the input column and the output column along with the regex pattern to be used:

scala> val regexTokenizer = new RegexTokenizer().setInputCol("sentence").setOutputCol("regexWords").setPattern("\\W")
regexTokenizer: org.apache.spark.ml.feature.RegexTokenizer = regexTok_15045df8ce41

Next, invoking the transform() function on the input dataset yields an output dataset:

scala> val regexWordsDF = regexTokenizer.transform(sentenceDF)
regexWordsDF: org.apache.spark.sql.DataFrame = [id: int, sentence: string ... 1 more field]

The following is the output dataset showing the input columns id and sentence, and the output column regexWords, which contains the sequence of words:

scala> regexWordsDF.show(false)
|id|sentence |regexWords |
|1 |Hello there, how do you like the book so far? |[hello, there, how, do, you, like, the, book, so, far] |
|2 |I am new to Machine Learning |[i, am, new, to, machine, learning] |
|3 |Maybe i should get some coffee before starting |[maybe, i, should, get, some, coffee, before, starting] |
|4 |Coffee is best when you drink it hot |[coffee, is, best, when, you, drink, it, hot] |
|5 |Book stores have coffee too so i should go to a book store|[book, stores, have, coffee, too, so, i, should, go, to, a, book, store]|
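
Besides splitting on a delimiter pattern, the RegexTokenizer can also be configured to match the tokens themselves. As a hedged sketch (the parameter values here are assumptions, not taken from the original example), setting gaps to false makes the pattern describe the tokens rather than the gaps between them, and a minimum token length can be enforced:

// Match runs of word characters as tokens instead of splitting on the pattern,
// and drop tokens shorter than two characters (for example, "i" and "a").
val matchingTokenizer = new RegexTokenizer()
  .setInputCol("sentence")
  .setOutputCol("regexWords")
  .setPattern("\\w+")
  .setGaps(false)
  .setMinTokenLength(2)

matchingTokenizer.transform(sentenceDF).show(false)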

The diagram of a Tokenizer is as follows, wherein the sentence from the input text is split into words using the space delimiter:
