Tokenizer

Tokenization is the process of extracting important components, such as words and sentences, from raw text and breaking the text into individual terms (also called tokens). If you want more advanced tokenization based on regular expression matching, RegexTokenizer is a good option. By default, its pattern parameter (regex, default: "\s+") is used as a delimiter to split the input text. Alternatively, you can set the gaps parameter to false, indicating that the regex denotes the tokens themselves rather than the gaps between them; in that case, all matching occurrences become the tokenization result.

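The snippets in this section assume that an active SparkSession named spark is available and that the feature transformers have been imported. A minimal setup might look like the following sketch; the application name and the local master are placeholders, not part of the original example:

import org.apache.spark.sql.SparkSession
import org.apache.spark.ml.feature.{Tokenizer, RegexTokenizer}

// Minimal SparkSession assumed by the snippets below
val spark = SparkSession.builder()
  .appName("TokenizerExample")   // placeholder application name
  .master("local[*]")            // run locally using all available cores
  .getOrCreate()
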
Suppose you have the following sentences:

  • Tokenization,is the process of extracting words,from the raw text.
  • If you want,to have more advanced tokenization,RegexTokenizer,is a good option.
  • Here,will provide a sample example on how to tokenize sentences.
  • This way,you can find all matching occurrences.

Now, you want to tokenize each meaningful word from the preceding four sentences. Let's create a DataFrame from the earlier sentences, as follows:

val sentence = spark.createDataFrame(Seq(
  (0, "Tokenization,is the process of extracting words,from the raw text"),
  (1, "If you want,to have more advanced tokenization,RegexTokenizer,is a good option"),
  (2, "Here,will provide a sample example on how to tokenize sentences"),
  (3, "This way,you can find all matching occurrences")
)).toDF("id", "sentence")

Now let's create a tokenizer by instantiating the Tokenizer() API, as follows:

val tokenizer = new Tokenizer().setInputCol("sentence").setOutputCol("words") 

Now, count the number of tokens in each sentence using a UDF. First, import the SQL functions, as follows:

import org.apache.spark.sql.functions._

// UDF that returns the number of tokens in a row's words array
val countTokens = udf { (words: Seq[String]) => words.length }

Now tokenize the words from each sentence, as follows:

val tokenized = tokenizer.transform(sentence) 

Finally, show each token against each raw sentence, as follows:

tokenized.select("sentence", "words")
.withColumn("tokens", countTokens(col("words")))
.show(false)

The preceding lines of code print a snapshot of the tokenized DataFrame containing the raw sentence, the bag of words, and the number of tokens:

Figure 9: Tokenized words from the raw texts

However, the plain Tokenizer only splits on whitespace, so words joined by commas remain fused into single tokens. Using the RegexTokenizer API gives better results, because it can split on any non-word character. This goes as follows:
Create a regex tokenizer by instantiating the RegexTokenizer() API:

val regexTokenizer = new RegexTokenizer()
  .setInputCol("sentence")
  .setOutputCol("words")
  .setPattern("\\W+")   // split on one or more non-word characters
  .setGaps(true)        // the pattern matches the gaps between tokens

Now tokenize words from each sentence, as follows:

val regexTokenized = regexTokenizer.transform(sentence) 
regexTokenized.select("sentence", "words")
.withColumn("tokens", countTokens(col("words")))
.show(false)

The preceding lines of code print a snapshot of the DataFrame tokenized using RegexTokenizer, containing the raw sentence, the bag of words, and the number of tokens:

Figure 10: Better tokenization using RegexTokenizer
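
As mentioned at the beginning of this section, you can also set the gaps parameter to false so that the regex describes the tokens themselves rather than the delimiters between them. The following sketch reuses the sentence DataFrame and the countTokens UDF from earlier; the matchTokenizer and matchTokenized names are only illustrative:

// gaps = false: the pattern matches the tokens themselves,
// so every run of word characters becomes one token
val matchTokenizer = new RegexTokenizer()
  .setInputCol("sentence")
  .setOutputCol("words")
  .setPattern("\\w+")
  .setGaps(false)

val matchTokenized = matchTokenizer.transform(sentence)
matchTokenized.select("sentence", "words")
  .withColumn("tokens", countTokens(col("words")))
  .show(false)

With these settings, the output should match the previous RegexTokenizer result, since matching runs of word characters and splitting on runs of non-word characters are two views of the same segmentation.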