Scanning a File

Problem

You need to scan a file with more fine-grained resolution than the readLine( ) method of the BufferedReader class and its subclasses (discussed in Section 9.12).

Solution

Use a StreamTokenizer, readLine( ) and a StringTokenizer, regular expressions (Chapter 4), or one of several scanning tools such as JavaCC.

Discussion

While you could, in theory, read the file a character at a time and analyze each character, that is a pretty low-level approach. The read( ) method in the Reader class is defined to return int, so that it can use the time-honored value -1 (defined as EOF in Unix <stdio.h> for years) to indicate that you have read to the end of the file.

void doFile(Reader is) throws IOException {
    int c;
    while ((c = is.read()) != -1) {
        System.out.print((char) c);
    }
}

The cast to char is interesting. The program will compile fine without it, but it won't print correctly: without the cast, the compiler selects System.out.print(int), which prints each character's numeric value (65 for 'A', for example) rather than the character itself.
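To see the difference concretely, here is a small self-contained sketch (the class and method names are my own) that collects the loop's output from a StringReader instead of printing it:

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

public class CastDemo {
    /** What the loop produces WITH the cast: the characters themselves. */
    public static String withCast(String input) throws IOException {
        StringBuilder sb = new StringBuilder();
        Reader is = new StringReader(input);
        int c;
        while ((c = is.read()) != -1) {
            sb.append((char) c);    // appends the character
        }
        return sb.toString();
    }

    /** What the loop produces WITHOUT the cast: the numeric codes. */
    public static String withoutCast(String input) throws IOException {
        StringBuilder sb = new StringBuilder();
        Reader is = new StringReader(input);
        int c;
        while ((c = is.read()) != -1) {
            sb.append(c);           // appends the int value, e.g. 65 for 'A'
        }
        return sb.toString();
    }
}
```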

We discussed the StringTokenizer class extensively in Section 3.3. The combination of readLine( ) and StringTokenizer provides a simple means of scanning a file. Suppose you need to read a file in which each line consists of a name like "user@host.domain", and you want to split the lines into the user part and the host address part. You could use this:

// ScanStringTok.java
protected void process(LineNumberReader is) {
    String s = null;
    try {
        while ((s = is.readLine()) != null) {
            StringTokenizer st = new StringTokenizer(s, "@", true);
            String user = (String) st.nextElement();
            st.nextElement();    // consume the "@" delimiter itself
            String host = (String) st.nextElement();
            System.out.println("User name: " + user +
                "; host part: " + host);

            // Presumably you would now do something
            // with the user and host parts...

        }
    } catch (NoSuchElementException ix) {
        System.err.println("Line " + is.getLineNumber() +
            ": Invalid input " + s);
    } catch (IOException e) {
        System.err.println(e);
    }
}
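Since the Solution also mentions regular expressions, note that the same split can be done more directly with String.split( ), which takes a regular expression as its delimiter. A minimal sketch, with a made-up address:

```java
public class SplitDemo {
    /** Split "user@host" into its two parts at the @ sign. */
    public static String[] splitAddress(String line) {
        return line.split("@", 2);  // at most two pieces
    }

    public static void main(String[] args) {
        String[] parts = splitAddress("someuser@somehost.example.com");
        System.out.println("User name: " + parts[0] +
            "; host part: " + parts[1]);
    }
}
```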

The StreamTokenizer class in package java.io provides slightly more capabilities for scanning a file. It will read characters and assemble them into words, or tokens. It will return these tokens to you along with a "type code" describing the kind of token it found. This will either be one of four predefined types (StreamTokenizer.TT_WORD, TT_NUMBER, TT_EOF, or TT_EOL for the end of line), or the value of an ordinary character (such as 42 for the asterisk). Methods such as ordinaryChar( ) allow you to specify how to categorize characters, while others such as slashSlashComments( ) allow you to enable or disable features.
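As a quick illustration of the type codes (the class name and sample input here are my own), this small program labels each token it reads:

```java
import java.io.IOException;
import java.io.StreamTokenizer;
import java.io.StringReader;

public class TokenDemo {
    /** Describe each token in the input, one per line. */
    public static String describe(String input) throws IOException {
        StreamTokenizer st = new StreamTokenizer(new StringReader(input));
        StringBuilder sb = new StringBuilder();
        int type;
        while ((type = st.nextToken()) != StreamTokenizer.TT_EOF) {
            switch (type) {
            case StreamTokenizer.TT_NUMBER:
                sb.append("NUMBER ").append(st.nval).append('\n');
                break;
            case StreamTokenizer.TT_WORD:
                sb.append("WORD ").append(st.sval).append('\n');
                break;
            default:    // an ordinary character; type is its value
                sb.append("CHAR ").append((char) type).append('\n');
            }
        }
        return sb.toString();
    }
}
```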

The following example uses a StreamTokenizer to implement a simple immediate-mode stack-based calculator:

2 2 + =
4
22 7 / =
3.142857142857143

I read tokens as they arrive from the StreamTokenizer. Numbers get put on the stack. The four operators (+, -, *, and /) are immediately performed on the two elements at the top of the stack, and the result is put back on the top of the stack. The = operator causes the top element to be printed; the value is left on the stack so that you can say:

4 5 * = 2 / =
20.0
10.0

Here is the relevant code from SimpleCalc :

public class SimpleCalc {
    /** The StreamTokenizer */
    protected StreamTokenizer tf;

    /** The variable name (not used in this version) */
    protected String variable;
    /** The operand stack */
    protected Stack s;

    /** Construct a SimpleCalc from an existing Reader */
    public SimpleCalc(Reader rdr) throws IOException {
        tf = new StreamTokenizer(rdr);
        // Control the input character set:
        tf.slashSlashComments(true);    // treat "//" as comments
        tf.ordinaryChar('-');           // used for subtraction
        tf.ordinaryChar('/');           // used for division

        s = new Stack();
    }

    protected void doCalc() throws IOException {
        int iType;
        double tmp;

        while ((iType = tf.nextToken()) != StreamTokenizer.TT_EOF) {
            switch (iType) {
            case StreamTokenizer.TT_NUMBER:
                // Found a number, push value to stack
                push(tf.nval);
                break;
            case StreamTokenizer.TT_WORD:
                // Found a variable, save its name. Not used here.
                variable = tf.sval;
                break;
            case '+':
                // Found + operator, perform it immediately.
                push(pop() + pop());
                break;
            case '-':
                // Found - operator, perform it (order matters).
                tmp = pop();
                push(pop() - tmp);
                break;
            case '*':
                // Multiplication is commutative, so order doesn't matter.
                push(pop() * pop());
                break;
            case '/':
                // Handle division carefully: order matters!
                tmp = pop();
                push(pop() / tmp);
                break;
            case '=':
                System.out.println(peek());
                break;
            default:
                System.out.println("What's this? iType = " + iType);
            }
        }
    }
}
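The listing above omits the push( ), pop( ), and peek( ) helper methods. One plausible sketch of them, assuming the operands are stored on the Stack wrapped as Double objects:

```java
import java.util.Stack;

public class CalcStack {
    /** The operand stack; Stack holds objects, so doubles are wrapped. */
    protected Stack s = new Stack();

    /** Wrap a double and push it. */
    void push(double val) {
        s.push(Double.valueOf(val));
    }

    /** Pop the top element and unwrap it back to a double. */
    double pop() {
        return ((Double) s.pop()).doubleValue();
    }

    /** Look at the top element without removing it. */
    double peek() {
        return ((Double) s.peek()).doubleValue();
    }
}
```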

While StreamTokenizer is useful, it is limited in the number of different tokens it knows about and has no way of specifying that the tokens must appear in a particular order. To do more advanced scanning, you need to use some special-purpose scanning tools. Such tools have been known and used for a long time in the Unix realm. The best-known examples are yacc and lex (discussed in the O'Reilly text lex & yacc). These tools let you specify the lexical structure of your input using regular expressions (see Chapter 4). For example, you might say that an email address consists of a series of alphanumerics, followed by an at sign (@), followed by a series of alphanumerics with periods embedded, as:

name:    [A-Za-z0-9]+@[A-Za-z0-9.]+

The tool will then write code that recognizes the characters you have described. There is also the grammatical specification, which says, for example, that the keyword ADDRESS must appear, followed by a colon, followed by a “name” token as previously defined.
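If you don't need a full scanner generator, the lexical pattern above can also be checked from Java itself with the java.util.regex package (the class name here is my own invention):

```java
import java.util.regex.Pattern;

public class NamePattern {
    /** The "name" token pattern from the text, precompiled. */
    static final Pattern NAME =
        Pattern.compile("[A-Za-z0-9]+@[A-Za-z0-9.]+");

    /** True if the entire string is a valid "name" token. */
    public static boolean isName(String s) {
        return NAME.matcher(s).matches();
    }
}
```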

One widely used scanning tool is JavaCC. Though still owned by Sun, it is being distributed and supported by WebGain (http://www.webgain.com/products/metamata/java_doc.html). JavaCC can be used to write grammars for a wide variety of programs, from simple calculators such as the one earlier in this recipe, through HTML and CORBA/IDL, up to full Java and C/C++ compilers. Examples of these are included with the JavaCC distribution. Unfortunately, the learning curve for parsers in general precludes providing a simple and comprehensive example here. Please refer to the documentation and the numerous examples provided with the JavaCC distribution.

That’s all I have to say on scanning: simple line-at-a-time scanners using StringTokenizer, fancier token-based scanners using StreamTokenizer, and grammar-based scanners based on JavaCC and similar tools. Scan well and prosper!
