There are several Java classes that support simple tokenization; some of them are as follows:
Scanner
String
BreakIterator
StreamTokenizer
StringTokenizer
Although these classes provide only limited support, it is useful to understand how they can be used. For some tasks, these classes will suffice. Why use a more difficult-to-understand and less efficient approach when a core Java class can do the job? We will cover each of these classes in terms of how they support the tokenization process.
The StreamTokenizer and StringTokenizer classes should not be used for new development. Instead, the String class' split method is usually a better choice. They have been included here in case you run across them and wonder whether they should be used.
The Scanner class is used to read data from a text source. This might be standard input, or it could be a file. It provides a simple-to-use technique to support tokenization.
The Scanner class uses whitespace as the default delimiter. An instance of the Scanner class can be created using a number of different constructors. The constructor in the following sequence uses a simple string. The next method retrieves the next token from the input stream. The tokens are isolated from the string, stored in a list of strings, and then displayed:
Scanner scanner = new Scanner("Let's pause, and then " + "reflect.");
List<String> list = new ArrayList<>();
while (scanner.hasNext()) {
    String token = scanner.next();
    list.add(token);
}
for (String token : list) {
    System.out.println(token);
}
When executed, we get the following output:
Let's
pause,
and
then
reflect.
This simple implementation has several shortcomings. If we needed our contractions to be identified and possibly split, as demonstrated with the first token, then this implementation fails to do it. Also, the last word of the sentence was returned with a period attached to it.
If we are not happy with the default delimiter, there are several methods we can use to change its behavior. Several of these methods are summarized in the following table. This list is provided to give you an idea of what is possible.
Method | Effect
---|---
useLocale | Uses the locale to set the default delimiter matching
useDelimiter | Sets the delimiters based on a string or a pattern
useRadix | Specifies the radix to use when working with numbers
skip | Skips input matching a pattern, ignoring the delimiters
findInLine | Finds the next occurrence of a pattern, ignoring delimiters
Here, we will demonstrate the use of the useDelimiter method. If we use the following statement immediately before the while statement in the previous section's example, the only delimiters used will be the blank space, comma, and period:
scanner.useDelimiter("[ ,.]");
When executed, the following will be displayed. The blank line reflects the use of the comma delimiter. It has the undesirable effect of returning an empty string as a token in this example:
Let's
pause

and
then
reflect
This method uses a pattern as defined in a string. The open and close brackets create a character class; this is a regular expression that matches any one of those three characters. An explanation of Java patterns can be found at http://docs.oracle.com/javase/8/docs/api/. The delimiter list can be reset to whitespace using the reset method.
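The following short sketch illustrates the reset method's effect. Note that the class name ScannerResetDemo is our own for illustration; only Scanner, useDelimiter, next, and reset are from the core API:

```java
import java.util.Scanner;

public class ScannerResetDemo {
    public static void main(String[] args) {
        Scanner scanner = new Scanner("Let's pause, and then reflect.");
        scanner.useDelimiter("[ ,.]");       // space, comma, and period
        System.out.println(scanner.next());  // Let's
        scanner.reset();                     // whitespace delimiting is restored
        System.out.println(scanner.next());  // pause, (the comma is no longer a delimiter)
        scanner.close();
    }
}
```

After the reset call, the comma is carried along with its word, just as in the first Scanner example.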
We demonstrated the String class' split method in Chapter 1, Introduction to NLP. It is duplicated here for convenience:
String text = "Mr. Smith went to 123 Washington avenue.";
String[] tokens = text.split("\\s+");
for (String token : tokens) {
    System.out.println(token);
}
The output is as follows:
Mr.
Smith
went
to
123
Washington
avenue.
The split method also uses a regular expression. If we replace the text with the same string we used in the previous section, "Let's pause, and then reflect.", we will get the same output.
The split method has an overloaded version that uses an integer to specify how many times the regular expression pattern is applied to the target text. Using this parameter stops the operation after the specified number of matches has been made.
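A brief sketch of the overloaded version follows. With a limit of 3, the pattern is applied at most twice, so everything after the second match is kept as a single final token:

```java
public class SplitLimitDemo {
    public static void main(String[] args) {
        String text = "Mr. Smith went to 123 Washington avenue.";
        // A limit of 3 applies the pattern at most twice, yielding at most 3 tokens
        String[] tokens = text.split("\\s+", 3);
        for (String token : tokens) {
            System.out.println(token);   // the last token holds the rest of the string
        }
    }
}
```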
The Pattern class also has a split method. It will split its argument based on the pattern used to create the Pattern object.
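This approach is useful when the same pattern is applied repeatedly, since the pattern is compiled once. A minimal sketch (the class name PatternSplitDemo is ours):

```java
import java.util.regex.Pattern;

public class PatternSplitDemo {
    public static void main(String[] args) {
        // Compile the whitespace pattern once and reuse it
        Pattern pattern = Pattern.compile("\\s+");
        for (String token : pattern.split("Let's pause, and then reflect.")) {
            System.out.println(token);
        }
    }
}
```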
Another approach for tokenization involves the use of the BreakIterator class. This class supports locating integer boundaries for different units of text. In this section, we will illustrate how it can be used to find words.
The class has a single default constructor, which is protected. We will use the static getWordInstance method to get an instance of the class. This method is overloaded, with one version taking a Locale object. The class possesses several methods to access boundaries, as listed in the following table. It has one field, DONE, that is used to indicate that the last boundary has been found.
Method | Usage
---|---
first | Returns the first boundary of the text
next | Returns the next boundary following the current one
previous | Returns the boundary preceding the current one
setText | Associates a string with the BreakIterator instance
To demonstrate this class, we declare an instance of the BreakIterator class and a string to use with it:
BreakIterator wordIterator = BreakIterator.getWordInstance();
String text = "Let's pause, and then reflect.";
The text is then assigned to the instance and the first boundary is determined:
wordIterator.setText(text);
int boundary = wordIterator.first();
The loop that follows stores the beginning and ending boundary indexes for word breaks using the begin and end variables. The boundary values are integers. Each boundary pair and its associated text are displayed. When the last boundary is found, the loop terminates:
while (boundary != BreakIterator.DONE) {
    int begin = boundary;
    System.out.print(boundary + "-");
    boundary = wordIterator.next();
    int end = boundary;
    if (end == BreakIterator.DONE) break;
    System.out.println(boundary + " [" + text.substring(begin, end) + "]");
}
The output follows where the brackets are used to clearly delineate the text:
0-5 [Let's]
5-6 [ ]
6-11 [pause]
11-12 [,]
12-13 [ ]
13-16 [and]
16-17 [ ]
17-21 [then]
21-22 [ ]
22-29 [reflect]
29-30 [.]
This technique does a fairly good job of identifying the basic tokens.
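Since the boundary pairs include the whitespace and punctuation between words, a common refinement is to keep only tokens that start with a letter or digit. The following sketch wraps the loop above in a helper method (the class and method names are ours):

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;

public class WordExtractor {
    // Return only the word tokens, discarding whitespace and punctuation pairs
    public static List<String> words(String text) {
        BreakIterator iterator = BreakIterator.getWordInstance();
        iterator.setText(text);
        List<String> words = new ArrayList<>();
        int start = iterator.first();
        for (int end = iterator.next(); end != BreakIterator.DONE;
                start = end, end = iterator.next()) {
            String token = text.substring(start, end);
            if (Character.isLetterOrDigit(token.charAt(0))) {
                words.add(token);
            }
        }
        return words;
    }

    public static void main(String[] args) {
        System.out.println(words("Let's pause, and then reflect."));
    }
}
```

Note that the contraction "Let's" survives intact, since the word iterator treats an embedded apostrophe as part of the word.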
The StreamTokenizer class, found in the java.io package, is designed to tokenize an input stream. It is an older class and is not as flexible as the StringTokenizer class discussed in the next section. An instance of the class is normally created based on a file and will tokenize the text found in the file. It can also be constructed from a string by wrapping the string in a StringReader.
The class uses a nextToken method to return the next token in the stream. The token returned is an integer whose value reflects the type of token returned. Based on the token type, the token can be handled in different ways.
The StreamTokenizer class fields are shown in the following table:
Field | Data Type | Meaning
---|---|---
nval | double | Contains a number if the current token is a number
sval | String | Contains the token if the current token is a word token
TT_EOF | static int | A constant for the end of the stream
TT_EOL | static int | A constant for the end of the line
TT_NUMBER | static int | A constant indicating a number token
TT_WORD | static int | A constant indicating a word token
ttype | int | The type of token read
In this example, a tokenizer is created, followed by the declaration of the isEOF variable, which is used to terminate the loop. The nextToken method returns the token type. Based on the token type, numeric and string tokens are displayed:
try {
    StreamTokenizer tokenizer = new StreamTokenizer(
        new StringReader("Let's pause, and then reflect."));
    boolean isEOF = false;
    while (!isEOF) {
        int token = tokenizer.nextToken();
        switch (token) {
            case StreamTokenizer.TT_EOF:
                isEOF = true;
                break;
            case StreamTokenizer.TT_EOL:
                break;
            case StreamTokenizer.TT_WORD:
                System.out.println(tokenizer.sval);
                break;
            case StreamTokenizer.TT_NUMBER:
                System.out.println(tokenizer.nval);
                break;
            default:
                System.out.println((char) token);
        }
    }
} catch (IOException ex) {
    // Handle the exception
}
When executed, we get the following output:
Let
'
This is not what we would normally expect. The problem is that the tokenizer uses apostrophes (the single quote character) and double quotes to denote quoted text. Since there is no matching closing quote, it consumes the rest of the string.
We can use the ordinaryChar method to specify which characters should be treated as ordinary characters. The single quote and comma characters are designated as ordinary characters here:
tokenizer.ordinaryChar('\'');
tokenizer.ordinaryChar(',');
When these statements are added to the previous code and executed, we get the following output:
Let
'
s
pause
,
and
then
reflect.
The apostrophe is no longer a problem. These two characters are now treated as delimiters and returned as tokens. There is also a whitespaceChars method available that specifies which characters are to be treated as whitespace.
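The sketch below illustrates whitespaceChars: if the comma is declared a whitespace character instead of an ordinary one, it is skipped rather than returned as a token. The class and method names here are ours; note also that, with the default settings, the trailing period attaches to "reflect" because parseNumbers treats '.' as a character that can continue a word:

```java
import java.io.IOException;
import java.io.StreamTokenizer;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

public class WhitespaceCharsDemo {
    public static List<String> tokens(String text) {
        List<String> tokens = new ArrayList<>();
        StreamTokenizer tokenizer = new StreamTokenizer(new StringReader(text));
        tokenizer.ordinaryChar('\'');        // stop treating ' as a quote character
        tokenizer.whitespaceChars(',', ','); // commas are skipped like spaces
        try {
            int token;
            while ((token = tokenizer.nextToken()) != StreamTokenizer.TT_EOF) {
                switch (token) {
                    case StreamTokenizer.TT_WORD:
                        tokens.add(tokenizer.sval);
                        break;
                    case StreamTokenizer.TT_NUMBER:
                        tokens.add(String.valueOf(tokenizer.nval));
                        break;
                    default:
                        tokens.add(String.valueOf((char) token));
                }
            }
        } catch (IOException ex) {
            // A StringReader will not throw here
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokens("Let's pause, and then reflect."));
    }
}
```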
The StringTokenizer class is found in the java.util package. It provides more flexibility than the StreamTokenizer class and is designed to handle strings from any source. The class' constructor accepts the string to be tokenized as its parameter. The nextToken method returns the next token, and the hasMoreTokens method returns true if more tokens exist in the input stream. This is illustrated in the following sequence:
StringTokenizer st = new StringTokenizer("Let's pause, and " + "then reflect.");
while (st.hasMoreTokens()) {
    System.out.println(st.nextToken());
}
When executed, we get the following output:
Let's
pause,
and
then
reflect.
The constructor is overloaded, allowing the delimiters to be specified, along with whether the delimiters should be returned as tokens.
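A sketch of the three-argument constructor follows; the class and method names are ours. Passing true as the third argument causes each delimiter character to be returned as a token of its own:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

public class DelimiterTokens {
    // Tokenize with explicit delimiters; true means return delimiters as tokens
    public static List<String> tokens(String text, String delimiters) {
        List<String> tokens = new ArrayList<>();
        StringTokenizer st = new StringTokenizer(text, delimiters, true);
        while (st.hasMoreTokens()) {
            tokens.add(st.nextToken());
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokens("Let's pause, and then reflect.", " ,."));
    }
}
```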
When using these core Java tokenization approaches, it is worthwhile to briefly discuss how well they perform. Measuring performance can be tricky at times due to the various factors that can impact code execution. With that said, an interesting comparison of the performance of several Java core tokenization techniques is found at http://stackoverflow.com/questions/5965767/performance-of-stringtokenizer-class-vs-split-method-in-java. For the problem they were addressing, the indexOf method was the fastest.
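To give a rough idea of that approach (this is our own sketch, not the code from the cited discussion), scanning with indexOf and slicing with substring avoids regular expressions entirely, assuming a single-character delimiter:

```java
import java.util.ArrayList;
import java.util.List;

public class IndexOfTokenizer {
    // Scan for a single-character delimiter with indexOf;
    // empty tokens between consecutive delimiters are skipped
    public static List<String> tokenize(String text, char delimiter) {
        List<String> tokens = new ArrayList<>();
        int start = 0;
        int index;
        while ((index = text.indexOf(delimiter, start)) != -1) {
            if (index > start) {
                tokens.add(text.substring(start, index));
            }
            start = index + 1;
        }
        if (start < text.length()) {
            tokens.add(text.substring(start));   // trailing token, if any
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("Let's pause, and then reflect.", ' '));
    }
}
```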