Simple Java tokenizers

There are several Java classes that support simple tokenization; some of them are as follows:

  • Scanner
  • String
  • BreakIterator
  • StreamTokenizer
  • StringTokenizer

Although these classes provide only limited support, it is useful to understand how they can be used. For some tasks, these classes will suffice. Why use a harder-to-understand and less efficient approach when a core Java class can do the job? We will cover each of these classes as they support the tokenization process.

The StreamTokenizer and StringTokenizer classes should not be used for new development. Instead, the String class' split method is usually a better choice. They have been included here in case you run across them and wonder whether they should be used or not.

Using the Scanner class

The Scanner class is used to read data from a text source. This might be standard input or it could be from a file. It provides a simple-to-use technique to support tokenization.

The Scanner class uses whitespace as the default delimiter. An instance of the Scanner class can be created using a number of different constructors. The constructor in the following sequence uses a simple string. The next method retrieves the next token from the input stream. The tokens are isolated from the string, stored into a list of strings, and then displayed:

Scanner scanner = new Scanner("Let's pause, and then " + "reflect.");
List<String> list = new ArrayList<>();
while(scanner.hasNext()) {
    String token = scanner.next();
    list.add(token);
}
for(String token : list) {
    System.out.println(token);
}

When executed, we get the following output:

Let's
pause,
and
then
reflect.

This simple implementation has several shortcomings. If we need contractions, such as the first token, to be identified and possibly split, this implementation fails to do so. Also, the last word of the sentence was returned with the period attached to it.

Specifying the delimiter

If we are not happy with the default delimiter, there are several methods we can use to change its behavior. Several of these methods are summarized in the following table to give you an idea of what is possible.

Method        Effect
------        ------
useLocale     Uses the locale to set the default delimiter matching
useDelimiter  Sets the delimiters based on a string or a pattern
useRadix      Specifies the radix to use when working with numbers
skip          Skips input matching a pattern, ignoring the delimiters
findInLine    Finds the next occurrence of a pattern, ignoring the delimiters

Here, we will demonstrate the use of the useDelimiter method. If we use the following statement immediately before the while statement in the previous section's example, the only delimiters used will be the blank space, comma, and period:

scanner.useDelimiter("[ ,.]");

When executed, the following will be displayed. The blank line reflects the use of the comma delimiter. It has the undesirable effect of returning an empty string as a token in this example:

Let's
pause

and
then
reflect

This method uses a pattern as defined in a string. The open and close brackets are used to create a class of characters. This is a regular expression that matches those three characters. An explanation of Java patterns can be found at http://docs.oracle.com/javase/8/docs/api/. The delimiter list can be reset to whitespaces using the reset method.
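One way to avoid the empty token is to add the + quantifier to the character class, so that a run of delimiter characters, such as the comma followed by a space, is treated as a single delimiter. The following sketch is a variation of the earlier example illustrating this idea:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

Scanner scanner = new Scanner("Let's pause, and then reflect.");
// The + quantifier collapses a run of delimiter characters, such as ", ",
// into a single delimiter, so no empty token is produced
scanner.useDelimiter("[ ,.]+");
List<String> list = new ArrayList<>();
while (scanner.hasNext()) {
    list.add(scanner.next());
}
for (String token : list) {
    System.out.println(token);
}
```

This time, the five words are returned without the intervening empty string.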

Using the split method

We demonstrated the String class' split method in Chapter 1, Introduction to NLP. It is duplicated here for convenience:

String text = "Mr. Smith went to 123 Washington avenue.";
String tokens[] = text.split("\\s+");
for (String token : tokens) {
    System.out.println(token);
}

The output is as follows:

Mr.
Smith
went
to
123
Washington
avenue.

The split method also uses a regular expression. If we replace the text with the same string we used in the previous section, "Let's pause, and then reflect.", we will get the same output.

The split method has an overloaded version that uses an integer to specify how many times the regular expression pattern is applied to the target text. Using this parameter can stop the operation after the specified number of matches has been made.
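As a sketch of this overloaded version, the following applies the pattern with a limit of 2, so the string is split at the first whitespace only and the remainder is left intact in the last element:

```java
String text = "Mr. Smith went to 123 Washington avenue.";
// With a limit of 2, the pattern is applied at most once,
// producing at most two elements
String tokens[] = text.split("\\s+", 2);
for (String token : tokens) {
    System.out.println(token);
}
```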

The Pattern class also has a split method. It will split its argument based on the pattern used to create the Pattern object.
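The following sketch shows this approach; compiling the pattern once is useful when the same split is performed repeatedly:

```java
import java.util.regex.Pattern;

// The compiled pattern can be reused across many strings
Pattern pattern = Pattern.compile("\\s+");
String tokens[] = pattern.split("Let's pause, and then reflect.");
for (String token : tokens) {
    System.out.println(token);
}
```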

Using the BreakIterator class

Another approach for tokenization involves the use of the BreakIterator class. This class locates boundaries between different units of text, returning them as integer indexes. In this section, we will illustrate how it can be used to find words.

The class has a single default constructor which is protected. We will use the static getWordInstance method to get an instance of the class. This method is overloaded with one version using a Locale object. The class possesses several methods to access boundaries as listed in the following table. It has one field, DONE, that is used to indicate that the last boundary has been found.

Method    Usage
------    -----
first     Returns the first boundary of the text
next      Returns the boundary following the current one
previous  Returns the boundary preceding the current one
setText   Associates a string with the BreakIterator instance

To demonstrate this class, we declare an instance of the BreakIterator class and a string to use with it:

BreakIterator wordIterator = BreakIterator.getWordInstance();
String text = "Let's pause, and then reflect.";

The text is then assigned to the instance and the first boundary is determined:

wordIterator.setText(text);
int boundary = wordIterator.first();

The loop that follows will store the beginning and ending boundary indexes for word breaks using the begin and end variables. The boundary values are integers. Each boundary pair and its associated text are displayed.

When the last boundary is found, the loop terminates:

while (boundary != BreakIterator.DONE) {
    int begin = boundary;
    System.out.print(boundary + "-");
    boundary = wordIterator.next();
    int end = boundary;
    if(end == BreakIterator.DONE) break;
    System.out.println(boundary + " ["
    + text.substring(begin, end) + "]");
}

The output follows where the brackets are used to clearly delineate the text:

0-5 [Let's]
5-6 [ ]
6-11 [pause]
11-12 [,]
12-13 [ ]
13-16 [and]
16-17 [ ]
17-21 [then]
21-22 [ ]
22-29 [reflect]
29-30 [.]

This technique does a fairly good job of identifying the basic tokens.

Using the StreamTokenizer class

The StreamTokenizer class, found in the java.io package, is designed to tokenize an input stream. It is an older class and is not as flexible as the StringTokenizer class discussed in the next section. An instance of the class is normally created from a Reader, such as a FileReader for a file, and will tokenize the text found in that source. It can also be constructed from a StringReader wrapping a string, as shown next.

The class uses a nextToken method to return the next token in the stream. The token returned is an integer. The value of the integer reflects the type of token returned. Based on the token type, the token can be handled in different ways.

The StreamTokenizer class fields are shown in the following table:

Field      Data Type   Meaning
-----      ---------   -------
nval       double      Contains the number if the current token is a number
sval       String      Contains the token if the current token is a word
TT_EOF     static int  A constant for the end of the stream
TT_EOL     static int  A constant for the end of the line
TT_NUMBER  static int  A constant indicating a number token
TT_WORD    static int  A constant indicating a word token
ttype      int         The type of the token read

In this example, a tokenizer is created followed by the declaration of the isEOF variable, which is used to terminate the loop. The nextToken method returns the token type. Based on the token type, numeric and string tokens are displayed:

try {
    StreamTokenizer tokenizer = new StreamTokenizer(
          new StringReader("Let's pause, and then reflect."));
    boolean isEOF = false;
    while (!isEOF) {
        int token = tokenizer.nextToken();
        switch (token) {
            case StreamTokenizer.TT_EOF:
                isEOF = true;
                break;
            case StreamTokenizer.TT_EOL:
                break;
            case StreamTokenizer.TT_WORD:
                System.out.println(tokenizer.sval);
                break;
            case StreamTokenizer.TT_NUMBER:
                System.out.println(tokenizer.nval);
                break;
            default:
                System.out.println((char) token);
        }
    }
} catch (IOException ex) {
    // Handle the exception
}

When executed, we get the following output:

Let
'

This is not what we would normally expect. The problem is that the tokenizer uses apostrophes (the single quote character) and double quotes to denote quoted text. Since there is no matching closing quote, it consumes the rest of the string as quoted text.

We can use the ordinaryChar method to specify which characters should be treated as common characters. The single quote and comma characters are designated as ordinary characters here:

tokenizer.ordinaryChar('\'');
tokenizer.ordinaryChar(',');

When these statements are added to the previous code and executed, we get the following output:

Let
'
s
pause
,
and
then
reflect.

The apostrophe is not a problem now. These two characters are treated as ordinary characters and are returned as single-character tokens. There is also a whitespaceChars method available that specifies which characters are to be treated as whitespace.
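For example, the following variation treats the comma as whitespace so that it is skipped rather than returned as a token. This is a sketch based on the earlier example, not code from the original text:

```java
import java.io.IOException;
import java.io.StreamTokenizer;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

List<String> tokens = new ArrayList<>();
try {
    StreamTokenizer tokenizer = new StreamTokenizer(
            new StringReader("Let's pause, and then reflect."));
    tokenizer.ordinaryChar('\'');
    // Characters in the range ',' to ',' are now skipped as whitespace
    tokenizer.whitespaceChars(',', ',');
    int token;
    while ((token = tokenizer.nextToken()) != StreamTokenizer.TT_EOF) {
        if (token == StreamTokenizer.TT_WORD) {
            tokens.add(tokenizer.sval);
        } else if (token != StreamTokenizer.TT_EOL) {
            tokens.add(String.valueOf((char) token));
        }
    }
} catch (IOException ex) {
    // Handle the exception
}
for (String token : tokens) {
    System.out.println(token);
}
```

The comma no longer appears in the output, while the apostrophe is still returned as a single-character token.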

Using the StringTokenizer class

The StringTokenizer class is found in the java.util package. It provides more flexibility than the StreamTokenizer class and is designed to handle strings from any source. The class' constructor accepts the string to be tokenized as its parameter. The nextToken method returns the next token, and the hasMoreTokens method returns true if more tokens exist in the input. This is illustrated in the following sequence:

StringTokenizer st = new StringTokenizer("Let's pause, and " + "then reflect.");
while (st.hasMoreTokens()) {
    System.out.println(st.nextToken());
}

When executed, we get the following output:

Let's
pause,
and
then
reflect.

The constructor is overloaded, allowing the delimiters to be specified, along with whether the delimiters themselves should be returned as tokens.
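A sketch of the three-argument constructor follows. Here, the delimiters are the space, comma, and period, and the true argument requests that each delimiter character be returned as a token:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

// The third argument requests that the delimiter characters themselves
// be returned as tokens
StringTokenizer st = new StringTokenizer(
        "Let's pause, and then reflect.", " ,.", true);
List<String> tokens = new ArrayList<>();
while (st.hasMoreTokens()) {
    tokens.add(st.nextToken());
}
for (String token : tokens) {
    System.out.println(token);
}
```

The output now includes the individual spaces, the comma, and the period, interleaved with the words.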

Performance considerations with core Java tokenization

When using these core Java tokenization approaches, it is worthwhile to briefly discuss how well they perform. Measuring performance can be tricky at times due to the various factors that can impact code execution. With that said, an interesting comparison of the performance of several Java core tokenization techniques is found at http://stackoverflow.com/questions/5965767/performance-of-stringtokenizer-class-vs-split-method-in-java. For the problem they were addressing, the indexOf method was fastest.
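To give a feel for the indexOf approach mentioned in that discussion, the following is a rough sketch that splits on a single space character. It is a simple illustration of the technique, not the benchmark code from the linked question:

```java
import java.util.ArrayList;
import java.util.List;

String text = "Let's pause and then reflect";
List<String> tokens = new ArrayList<>();
int start = 0;
int pos;
// Slice out the substring between each pair of successive spaces
while ((pos = text.indexOf(' ', start)) != -1) {
    tokens.add(text.substring(start, pos));
    start = pos + 1;
}
// The final token runs from the last space to the end of the string
tokens.add(text.substring(start));
for (String token : tokens) {
    System.out.println(token);
}
```

The speed comes from avoiding regular expression machinery entirely, at the cost of supporting only a fixed delimiter.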
