You need to
scan a file with more fine-grained
resolution than the readLine( )
method of the
BufferedReader
class and its subclasses (discussed
in Section 9.12).
Use a StreamTokenizer, readLine( ) and a StringTokenizer, regular
expressions (Chapter 4), or one of several
scanning tools such as JavaCC.
While you could, in theory, read the file a character at a time and
analyze each character, that is a pretty low-level approach. The
read( ) method in the Reader class is defined to return int, so that
it can use the time-honored value -1 (defined as EOF in Unix <stdio.h>
for years) to indicate that you have read to the end of the file.
void doFile(Reader is) throws IOException {
    int c;
    while ((c = is.read( )) != -1) {
        System.out.print((char)c);
    }
}
The cast to char is interesting. The program will compile fine
without it, but it will not print correctly: print( ) would then
receive an int and print each character's numeric code instead of
the character itself.
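To make the effect of the cast concrete, here is a minimal sketch. The class name CastDemo and its copy( ) method are invented for this illustration, and it reads from a StringReader rather than a file so that it is self-contained. StringBuilder.append( ) is overloaded for int and char in the same way print( ) is, so it shows the same difference.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

class CastDemo {
    /** Read text a character at a time, appending each value with or without the cast. */
    static String copy(String text, boolean castToChar) {
        StringBuilder out = new StringBuilder();
        try (Reader in = new StringReader(text)) {
            int c;
            while ((c = in.read()) != -1) {      // -1 marks end of input
                if (castToChar) {
                    out.append((char) c);        // appends the character itself
                } else {
                    out.append(c);               // appends the int as decimal digits
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);   // not expected for a StringReader
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(copy("Hi", true));    // Hi
        System.out.println(copy("Hi", false));   // 72105
    }
}
```

Without the cast, 'H' (72) and 'i' (105) come out as the digits 72105.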
We discussed the StringTokenizer
class extensively
in Section 3.3. The combination of
readLine( )
and StringTokenizer
provides a simple means of scanning a file. Suppose you need to read
a file in which each line consists of a name like
“[email protected]”, and you want to split the lines into
the user part and the host address part. You could use this:
// ScanStringTok.java
// (needs: import java.io.*; import java.util.*;)
protected void process(LineNumberReader is) {
    String s = null;
    try {
        while ((s = is.readLine( )) != null) {
            StringTokenizer st = new StringTokenizer(s, "@", true);
            String user = (String)st.nextElement( );
            st.nextElement( );      // skip the "@" delimiter token
            String host = (String)st.nextElement( );
            System.out.println("User name: " + user +
                "; host part: " + host);

            // Presumably you would now do something
            // with the user and host parts...
        }
    } catch (NoSuchElementException ix) {
        System.err.println("Line " + is.getLineNumber( ) +
            ": Invalid input " + s);
    } catch (IOException e) {
        System.err.println(e);
    }
}
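The tokenizing step can be tried on its own without a file. The following is a small sketch under stated assumptions: the class name SplitAddressDemo, the split( ) helper, and the sample addresses are made up for illustration. It passes true as the third constructor argument, as the listing does, so the "@" delimiter is returned as a token and must be skipped.

```java
import java.util.NoSuchElementException;
import java.util.StringTokenizer;

class SplitAddressDemo {
    /** Split "user@host" into {user, host}; throws NoSuchElementException if a part is missing. */
    static String[] split(String line) {
        StringTokenizer st = new StringTokenizer(line, "@", true);
        String user = st.nextToken();
        st.nextToken();                 // skip the "@" delimiter token
        String host = st.nextToken();
        return new String[] { user, host };
    }

    public static void main(String[] args) {
        String[] lines = { "alice@example.com", "no-at-sign-here" };
        for (String line : lines) {
            try {
                String[] parts = split(line);
                System.out.println("User name: " + parts[0] + "; host part: " + parts[1]);
            } catch (NoSuchElementException e) {
                System.out.println("Invalid input " + line);
            }
        }
    }
}
```

A line with no at sign yields only one token, so the second nextToken( ) call throws NoSuchElementException. That is the condition the catch clause in the recipe reports.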
The StreamTokenizer class in package java.io provides slightly more
capabilities for scanning a file. It reads characters and assembles
them into words, or tokens. It returns these tokens to you
along with a “type code” describing the kind of token it
found. This will either be one of four predefined types
(StreamTokenizer.TT_EOF, TT_NUMBER, TT_WORD, or TT_EOL for the end
of line), or the ASCII value of an ordinary character (such as 32 for
the space character). Methods such as ordinaryChar( )
allow you to specify how to categorize characters, while others such as
slashSlashComments( ) allow you to enable or disable features.
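To see the type codes in action, here is a minimal sketch that labels each token in a short string. The class name TokenTypesDemo, the describe( ) helper, and the sample input are assumptions made for this illustration. It calls eolIsSignificant(true), because TT_EOL is reported only when that is turned on.

```java
import java.io.IOException;
import java.io.StreamTokenizer;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

class TokenTypesDemo {
    /** Return one description per token found in the text. */
    static List<String> describe(String text) {
        List<String> out = new ArrayList<>();
        StreamTokenizer tok = new StreamTokenizer(new StringReader(text));
        tok.eolIsSignificant(true);     // report line ends as TT_EOL
        try {
            int type;
            while ((type = tok.nextToken()) != StreamTokenizer.TT_EOF) {
                switch (type) {
                    case StreamTokenizer.TT_WORD:
                        out.add("WORD " + tok.sval);     // word text is in sval
                        break;
                    case StreamTokenizer.TT_NUMBER:
                        out.add("NUMBER " + tok.nval);   // numeric value is in nval
                        break;
                    case StreamTokenizer.TT_EOL:
                        out.add("EOL");
                        break;
                    default:
                        // An ordinary character: the type code is the character itself
                        out.add("CHAR " + (char) type + " (" + type + ")");
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);   // not expected for a StringReader
        }
        return out;
    }

    public static void main(String[] args) {
        for (String s : describe("total = 42\n")) {
            System.out.println(s);
        }
        // WORD total
        // CHAR = (61)
        // NUMBER 42.0
        // EOL
    }
}
```

Note that numbers always arrive as a double in nval, which is why 42 is reported as 42.0.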
The example shows a StreamTokenizer
used to
implement a simple immediate-mode
stack-based calculator:
2 2 + =
4.0
22 7 / =
3.142857142857143
I read tokens as they arrive from the StreamTokenizer. Numbers get
put on the stack. The four operators (+, -, *, and /) are immediately
performed on the two elements at the top of the stack, and the result
is put back on the top of the stack. The = operator causes the top
element to be printed, but the element is left on the stack so that
you can say:
4 5 * = 2 / =
20.0
10.0
Here is the relevant code from
SimpleCalc
:
import java.io.*;
import java.util.Stack;

public class SimpleCalc {
    /** The StreamTokenizer */
    protected StreamTokenizer tf;
    /** The variable name (not used in this version) */
    protected String variable;
    /** The operand stack */
    protected Stack s;

    /** Construct a SimpleCalc from an existing Reader */
    public SimpleCalc(Reader rdr) throws IOException {
        tf = new StreamTokenizer(rdr);
        // Control the input character set:
        tf.slashSlashComments(true);    // treat "//" as comments
        tf.ordinaryChar('-');           // used for subtraction
        tf.ordinaryChar('/');           // used for division

        s = new Stack( );
    }

    protected void doCalc( ) throws IOException {
        int iType;
        double tmp;

        while ((iType = tf.nextToken( )) != StreamTokenizer.TT_EOF) {
            switch(iType) {
            case StreamTokenizer.TT_NUMBER:
                // Found a number, push value to stack
                push(tf.nval);
                break;
            case StreamTokenizer.TT_WORD:
                // Found a variable, save its name. Not used here.
                variable = tf.sval;
                break;
            case '+':
                // Found + operator, perform it immediately.
                push(pop( ) + pop( ));
                break;
            case '-':
                // Found - operator, perform it (order matters).
                tmp = pop( );
                push(pop( ) - tmp);
                break;
            case '*':
                // Multiply works OK
                push(pop( ) * pop( ));
                break;
            case '/':
                // Handle division carefully: order matters!
                tmp = pop( );
                push(pop( ) / tmp);
                break;
            case '=':
                System.out.println(peek( ));
                break;
            default:
                System.out.println("What's this? iType = " + iType);
            }
        }
    }

    // The push( ), pop( ), and peek( ) helpers that wrap the Stack
    // are not shown in this excerpt.
}
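The listing calls push( ), pop( ), and peek( ) without showing them, so it cannot be run as printed. The following is a self-contained sketch of the same idea, not the SimpleCalc source itself. The class name RpnSketch and the evaluate( ) method are invented here. It swaps Stack for an ArrayDeque<Double>, reads from a String, and collects the values that = would print into a list instead of printing them. It does no checking for stack underflow.

```java
import java.io.IOException;
import java.io.StreamTokenizer;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

class RpnSketch {
    /** Evaluate an RPN session; each '=' appends the current top of stack to the result. */
    static List<Double> evaluate(String input) {
        StreamTokenizer tf = new StreamTokenizer(new StringReader(input));
        tf.slashSlashComments(true);   // treat "//" as a comment
        tf.ordinaryChar('-');          // subtraction, not part of a number
        tf.ordinaryChar('/');          // division, not a comment character
        Deque<Double> stack = new ArrayDeque<>();
        List<Double> printed = new ArrayList<>();
        try {
            int type;
            double tmp;
            while ((type = tf.nextToken()) != StreamTokenizer.TT_EOF) {
                switch (type) {
                    case StreamTokenizer.TT_NUMBER:
                        stack.push(tf.nval);
                        break;
                    case '+':
                        stack.push(stack.pop() + stack.pop());
                        break;
                    case '-':                       // order matters
                        tmp = stack.pop();
                        stack.push(stack.pop() - tmp);
                        break;
                    case '*':
                        stack.push(stack.pop() * stack.pop());
                        break;
                    case '/':                       // order matters
                        tmp = stack.pop();
                        stack.push(stack.pop() / tmp);
                        break;
                    case '=':                       // report the top, leave it on the stack
                        Double top = stack.peek();
                        if (top != null) {
                            printed.add(top);
                        }
                        break;
                    default:
                        break;                      // ignore words and unknown characters
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);      // not expected for a StringReader
        }
        return printed;
    }

    public static void main(String[] args) {
        System.out.println(evaluate("2 2 + ="));          // [4.0]
        System.out.println(evaluate("4 5 * = 2 / ="));    // [20.0, 10.0]
    }
}
```

Because '-' is made an ordinary character, this sketch cannot read negative numbers directly. That is the price of using the same character for subtraction.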
While StreamTokenizer
is useful, it is limited in
the number of different tokens that it knows and has no way of
specifying that the tokens must appear in a particular order. To do
more advanced scanning, you need to use some special-purpose
scanning tools. Such tools have
been known and used for a long time in the Unix realm. The best-known
examples are yacc and lex (discussed in the O’Reilly text
lex & yacc). These
tools let you specify the lexical structure of your
input using
regular expressions (see Chapter 4). For example, you might say that an email
address consists of a series of alphanumerics, followed by an at sign
(@), followed by a series of alphanumerics with periods embedded, as:
name: [A-Za-z0-9]+@[A-Za-z0-9.]+
The tool will then write code that recognizes the characters you have
described. There is also the grammatical specification, which says,
for example, that the keyword ADDRESS
must appear,
followed by a colon, followed by a “name” token as
previously defined.
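A scanner generator keeps the lexical and grammatical specifications separate and produces code from them. For a rule this small, you can approximate both with java.util.regex, as in the following sketch. The class name AddressLineDemo, the parse( ) helper, and the choice to allow optional whitespace around the colon are assumptions made for illustration. A single regular expression stands in for what a real tool would express as a token definition plus a grammar rule.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class AddressLineDemo {
    // Lexical rule: alphanumerics, an at sign, then alphanumerics with periods.
    private static final String NAME = "[A-Za-z0-9]+@[A-Za-z0-9.]+";

    // Grammatical rule: the keyword ADDRESS, a colon, then a name token.
    // Allowing whitespace around the colon is an assumption of this sketch.
    private static final Pattern LINE =
        Pattern.compile("ADDRESS\\s*:\\s*(" + NAME + ")");

    /** Return the name token if the whole line matches the rule, else null. */
    static String parse(String line) {
        Matcher m = LINE.matcher(line);
        return m.matches() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        System.out.println(parse("ADDRESS: ian@example.com"));   // ian@example.com
        System.out.println(parse("ian@example.com"));            // null
    }
}
```

This approach stops scaling once tokens must appear in nested or recursive structures, which is where a grammar-based tool earns its keep.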
One widely used scanning tool is JavaCC. Though still owned by Sun, it is being distributed and supported by WebGain (http://www.webgain.com/products/metamata/java_doc.html). JavaCC can be used to write grammars for a wide variety of programs, from simple calculators such as the one earlier in this recipe, through HTML and CORBA/IDL, up to full Java and C/C++ compilers. Examples of these are included with the JavaCC distribution. Unfortunately, the learning curve for parsers in general precludes providing a simple and comprehensive example here. Please refer to the documentation and the numerous examples provided with the JavaCC distribution.
That’s all I have to say on scanning: simple line-at-a-time
scanners using StringTokenizer, fancier token-based scanners using
StreamTokenizer, and grammar-based scanners based on JavaCC and
similar tools. Scan well and prosper!