Chapter 1. Streams and Files

<feature><title></title> <objective>

STREAMS

</objective>
<objective>

TEXT INPUT AND OUTPUT

</objective>
<objective>

READING AND WRITING BINARY DATA

</objective>
<objective>

ZIP ARCHIVES

</objective>
<objective>

OBJECT STREAMS AND SERIALIZATION

</objective>
<objective>

FILE MANAGEMENT

</objective>
<objective>

NEW I/O

</objective>
<objective>

REGULAR EXPRESSIONS

</objective>
</feature>

In this chapter, we cover the Java application programming interfaces (APIs) for input and output. You will learn how to access files and directories and how to read and write data in binary and text format. This chapter also shows you the object serialization mechanism that lets you store objects as easily as you can store text or numeric data. Next, we turn to several improvements that were made in the “new I/O” package java.nio, introduced in Java SE 1.4. We finish the chapter with a discussion of regular expressions, even though they are not actually related to streams and files. We couldn’t find a better place to handle that topic, and apparently neither could the Java team—the regular expression API specification was attached to the specification request for the “new I/O” features of Java SE 1.4.

Streams

In the Java API, an object from which we can read a sequence of bytes is called an input stream. An object to which we can write a sequence of bytes is called an output stream. These sources and destinations of byte sequences can be—and often are—files, but they can also be network connections and even blocks of memory. The abstract classes InputStream and OutputStream form the basis for a hierarchy of input/output (I/O) classes.

Because byte-oriented streams are inconvenient for processing information stored in Unicode (recall that Unicode uses multiple bytes per character), there is a separate hierarchy of classes for processing Unicode characters that inherit from the abstract Reader and Writer classes. These classes have read and write operations that are based on two-byte Unicode code units rather than on single-byte characters.

Reading and Writing Bytes

The InputStream class has an abstract method:

abstract int read()

This method reads one byte and returns the byte that was read, or −1 if it encounters the end of the input source. The designer of a concrete input stream class overrides this method to provide useful functionality. For example, in the FileInputStream class, this method reads one byte from a file. System.in is a predefined object of a subclass of InputStream that allows you to read information from the keyboard.

The InputStream class also has nonabstract methods to read an array of bytes or to skip a number of bytes. These methods call the abstract read method, so subclasses need to override only one method.
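
As a minimal sketch of the read protocol described above, the following loop calls read until it returns −1. The class name ReadLoopDemo and the helper sumBytes are made up for this illustration, and an in-memory ByteArrayInputStream stands in for a file so that the example is self-contained.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

class ReadLoopDemo
{
   // Sums all bytes of a stream by calling read() until it returns -1.
   static int sumBytes(InputStream in)
   {
      try
      {
         int sum = 0;
         int b;
         while ((b = in.read()) != -1) // b is in the range 0..255, never negative
            sum += b;
         return sum;
      }
      catch (IOException e)
      {
         throw new RuntimeException(e);
      }
   }

   public static void main(String[] args)
   {
      byte[] data = { 1, 2, (byte) 0xFF };
      System.out.println(sumBytes(new ByteArrayInputStream(data)));
   }
}
```

Note that the byte 0xFF comes back as 255, not as −1. That is why read returns an int rather than a byte.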

Similarly, the OutputStream class defines the abstract method

abstract void write(int b)

which writes one byte to an output location.

Both the read and write methods block until the bytes are actually read or written. This means that if the stream cannot immediately be accessed (usually because of a busy network connection), the current thread blocks. This gives other threads the chance to do useful work while the method is waiting for the stream to again become available.

The available method lets you check the number of bytes that are currently available for reading. This means a fragment like the following is unlikely to block:

int bytesAvailable = in.available();
if (bytesAvailable > 0)
{
   byte[] data = new byte[bytesAvailable];
   in.read(data);
}

When you have finished reading or writing to a stream, close it by calling the close method. This call frees up operating system resources that are in limited supply. If an application opens too many streams without closing them, system resources can become depleted. Closing an output stream also flushes the buffer used for the output stream: any characters that were temporarily placed in a buffer so that they could be delivered as a larger packet are sent off. In particular, if you do not close a file, the last packet of bytes might never be delivered. You can also manually flush the output with the flush method.
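
The effect of flushing can be observed with a buffered stream. The following sketch (the class FlushDemo and its method are illustrative names, and the BufferedOutputStream is introduced here only to supply a buffer) writes three bytes and checks how many have reached the destination before and after the call to flush.

```java
import java.io.BufferedOutputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;

class FlushDemo
{
   // Returns the number of bytes that reached the destination
   // before and after the call to flush().
   static int[] sizesBeforeAndAfterFlush()
   {
      try
      {
         ByteArrayOutputStream sink = new ByteArrayOutputStream();
         BufferedOutputStream out = new BufferedOutputStream(sink, 1024);
         out.write(new byte[] { 1, 2, 3 });
         int before = sink.size(); // the bytes are still in the 1024-byte buffer
         out.flush();
         int after = sink.size();  // now they have been delivered
         out.close();
         return new int[] { before, after };
      }
      catch (IOException e)
      {
         throw new RuntimeException(e);
      }
   }

   public static void main(String[] args)
   {
      int[] sizes = sizesBeforeAndAfterFlush();
      System.out.println(sizes[0] + " " + sizes[1]);
   }
}
```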

Even if a stream class provides concrete methods to work with the raw read and write functions, application programmers rarely use them. The data that you are interested in probably contain numbers, strings, and objects, not raw bytes.

Java gives you many stream classes derived from the basic InputStream and OutputStream classes that let you work with data in the forms that you usually use rather than at the byte level.

The Complete Stream Zoo

Unlike C, which gets by just fine with a single type FILE*, Java has a whole zoo of more than 60 (!) different stream types (see Figures 1-1 and 1-2).

Figure 1-1. Input and output stream hierarchy

Figure 1-2. Reader and writer hierarchy

Let us divide the animals in the stream class zoo by how they are used. There are separate hierarchies for classes that process bytes and characters. As you saw, the InputStream and OutputStream classes let you read and write individual bytes and arrays of bytes. These classes form the basis of the hierarchy shown in Figure 1-1. To read and write strings and numbers, you need more capable subclasses. For example, DataInputStream and DataOutputStream let you read and write all the primitive Java types in binary format. Finally, there are streams that do useful stuff; for example, the ZipInputStream and ZipOutputStream that let you read and write files in the familiar ZIP compression format.

For Unicode text, on the other hand, you use subclasses of the abstract classes Reader and Writer (see Figure 1-2). The basic methods of the Reader and Writer classes are similar to the ones for InputStream and OutputStream.

abstract int read()
abstract void write(int c)

The read method returns either a Unicode code unit (as an integer between 0 and 65535) or −1 when you have reached the end of the file. The write method is called with a Unicode code unit. (See Volume I, Chapter 3 for a discussion of Unicode code units.)
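
To see that a reader delivers code units rather than characters, consider this small sketch. The class CodeUnitDemo is a hypothetical name, and a StringReader serves as the input source. A character outside the basic multilingual plane is delivered as two code units.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

class CodeUnitDemo
{
   // Counts how many times read() delivers a code unit before returning -1.
   static int countCodeUnits(String s)
   {
      try
      {
         Reader in = new StringReader(s);
         int count = 0;
         while (in.read() != -1)
            count++;
         in.close();
         return count;
      }
      catch (IOException e)
      {
         throw new RuntimeException(e);
      }
   }

   public static void main(String[] args)
   {
      // "A" followed by U+1D546, which needs a surrogate pair in UTF-16
      System.out.println(countCodeUnits("A\uD835\uDD46"));
   }
}
```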

Java SE 5.0 introduced four additional interfaces: Closeable, Flushable, Readable, and Appendable (see Figure 1-3). The first two interfaces are very simple, with methods

void close() throws IOException

and

void flush() throws IOException

respectively. The classes InputStream, OutputStream, Reader, and Writer all implement the Closeable interface. OutputStream and Writer implement the Flushable interface.

Figure 1-3. The Closeable, Flushable, Readable, and Appendable interfaces

The Readable interface has a single method

int read(CharBuffer cb)

The CharBuffer class has methods for sequential and random read/write access. It represents an in-memory buffer or a memory-mapped file. (See “The Buffer Data Structure” on page 72 for details.)

The Appendable interface has two methods for appending single characters and character sequences:

Appendable append(char c)
Appendable append(CharSequence s)

The CharSequence interface describes basic properties of a sequence of char values. It is implemented by String, CharBuffer, StringBuilder, and StringBuffer.

Of the stream zoo classes, only Writer implements Appendable.
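
The benefit of the Appendable interface is that a method can accept any target that supports appending. Here is a short sketch; the class AppendableDemo and the method greet are invented for this example.

```java
import java.io.IOException;
import java.io.StringWriter;

class AppendableDemo
{
   // Appends a greeting to any Appendable and returns the target's contents.
   static String greet(Appendable target, String name)
   {
      try
      {
         // each append call returns the Appendable, so the calls can be chained
         target.append("Hello, ").append(name).append('!');
         return target.toString();
      }
      catch (IOException e)
      {
         throw new RuntimeException(e);
      }
   }

   public static void main(String[] args)
   {
      System.out.println(greet(new StringBuilder(), "Harry"));
      System.out.println(greet(new StringWriter(), "Harry"));
   }
}
```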

Combining Stream Filters

FileInputStream and FileOutputStream give you input and output streams attached to a disk file. You give the file name or full path name of the file in the constructor. For example,

FileInputStream fin = new FileInputStream("employee.dat");

looks in the user directory for a file named "employee.dat".

Tip

Because all the classes in java.io interpret relative path names as starting with the user’s working directory, you may want to know this directory. You can get at this information by a call to System.getProperty("user.dir").

Like the abstract InputStream and OutputStream classes, these classes support only reading and writing on the byte level. That is, we can only read bytes and byte arrays from the object fin.

byte b = (byte) fin.read();

As you will see in the next section, if we just had a DataInputStream, then we could read numeric types:

DataInputStream din = . . .;
double s = din.readDouble();

But just as the FileInputStream has no methods to read numeric types, the DataInputStream has no method to get data from a file.

Java uses a clever mechanism to separate two kinds of responsibilities. Some streams (such as the FileInputStream and the input stream returned by the openStream method of the URL class) can retrieve bytes from files and other more exotic locations. Other streams (such as the DataInputStream and the PrintWriter) can assemble bytes into more useful data types. The Java programmer has to combine the two. For example, to be able to read numbers from a file, first create a FileInputStream and then pass it to the constructor of a DataInputStream.

FileInputStream fin = new FileInputStream("employee.dat");
DataInputStream din = new DataInputStream(fin);
double s = din.readDouble();

If you look at Figure 1-1 again, you can see the classes FilterInputStream and FilterOutputStream. The subclasses of these classes are used to add capabilities to raw byte streams.

You can add multiple capabilities by nesting the filters. For example, by default, streams are not buffered. That is, every call to read asks the operating system to dole out yet another byte. It is more efficient to request blocks of data instead and put them in a buffer. If you want buffering and the data input methods for a file, you need to use the following rather monstrous sequence of constructors:

DataInputStream din = new DataInputStream(
   new BufferedInputStream(
      new FileInputStream("employee.dat")));

Notice that we put the DataInputStream last in the chain of constructors because we want to use the DataInputStream methods, and we want them to use the buffered read method.
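
The same layering works in both directions. The following sketch writes a double through a data/buffered chain and reads it back. To keep it self-contained, it uses in-memory byte array streams where the text uses file streams; the class name FilterChainDemo is illustrative.

```java
import java.io.BufferedInputStream;
import java.io.BufferedOutputStream;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;

class FilterChainDemo
{
   // Writes a double through a filter chain and reads it back through another.
   static double roundTrip(double value)
   {
      try
      {
         ByteArrayOutputStream bytes = new ByteArrayOutputStream();
         DataOutputStream dout = new DataOutputStream(
            new BufferedOutputStream(bytes));
         dout.writeDouble(value);
         dout.close(); // closing the outermost stream flushes the buffer

         DataInputStream din = new DataInputStream(
            new BufferedInputStream(
               new ByteArrayInputStream(bytes.toByteArray())));
         double result = din.readDouble();
         din.close();
         return result;
      }
      catch (IOException e)
      {
         throw new RuntimeException(e);
      }
   }

   public static void main(String[] args)
   {
      System.out.println(roundTrip(75000.0));
   }
}
```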

Sometimes you’ll need to keep track of the intermediate streams when chaining them together. For example, when reading input, you often need to peek at the next byte to see if it is the value that you expect. Java provides the PushbackInputStream for this purpose.

PushbackInputStream pbin = new PushbackInputStream(
   new BufferedInputStream(
      new FileInputStream("employee.dat")));

Now you can speculatively read the next byte

int b = pbin.read();

and throw it back if it isn’t what you wanted.

if (b != '<') pbin.unread(b);

But reading and unreading are the only methods that apply to the pushback input stream. If you want to look ahead and also read numbers, then you need both a pushback input stream and a data input stream reference.

DataInputStream din = new DataInputStream(
   pbin = new PushbackInputStream(
      new BufferedInputStream(
         new FileInputStream("employee.dat"))));
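
The lookahead idiom can be tried out in isolation. This sketch skips an optional leading '<' and returns the byte that follows. The class PushbackDemo and its method are hypothetical, and the input comes from a byte array instead of a file. The extra test for −1 is an addition of this sketch: pushing back the end-of-input marker would insert a spurious byte.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.PushbackInputStream;

class PushbackDemo
{
   // Skips a leading '<' if there is one, then returns the next byte (or -1).
   static int firstByteAfterOptionalBracket(byte[] data)
   {
      try
      {
         PushbackInputStream pbin = new PushbackInputStream(
            new ByteArrayInputStream(data));
         int b = pbin.read();                        // speculatively read a byte
         if (b != '<' && b != -1) pbin.unread(b);    // not a bracket: throw it back
         int next = pbin.read();
         pbin.close();
         return next;
      }
      catch (IOException e)
      {
         throw new RuntimeException(e);
      }
   }

   public static void main(String[] args)
   {
      System.out.println((char) firstByteAfterOptionalBracket(new byte[] { '<', 'a' }));
      System.out.println((char) firstByteAfterOptionalBracket(new byte[] { 'a' }));
   }
}
```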

Of course, in the stream libraries of other programming languages, niceties such as buffering and lookahead are automatically taken care of, so it is a bit of a hassle in Java that one has to resort to combining stream filters in these cases. But the ability to mix and match filter classes to construct truly useful sequences of streams does give you an immense amount of flexibility. For example, you can read numbers from a compressed ZIP file by using the following sequence of streams (see Figure 1-4):

ZipInputStream zin = new ZipInputStream(new FileInputStream("employee.zip"));
DataInputStream din = new DataInputStream(zin);

Figure 1-4. A sequence of filtered streams

(See “ZIP Archives” on page 32 for more on Java’s ability to handle ZIP files.)

Text Input and Output

When saving data, you have the choice between binary and text format. For example, if the integer 1234 is saved in binary, it is written as the sequence of bytes 00 00 04 D2 (in hexadecimal notation). In text format, it is saved as the string "1234". Although binary I/O is fast and efficient, it is not easily readable by humans. We first discuss text I/O and cover binary I/O in the section “Reading and Writing Binary Data” on page 23.

When saving text strings, you need to consider the character encoding. In the UTF-16 encoding, the string "1234" is encoded as 00 31 00 32 00 33 00 34 (in hex). However, many programs expect that text files are encoded in a different encoding. In ISO 8859-1, the encoding most commonly used in the United States and Western Europe, the string would be written as 31 32 33 34, without the zero bytes.
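
You can verify these byte sequences with a few lines of code. The sketch below (EncodingBytesDemo and hex are illustrative names) uses the String.getBytes method with an explicit character set. It asks for "UTF-16BE" rather than "UTF-16" because Java's UTF-16 encoder puts a byte-order mark in front of the encoded characters.

```java
import java.nio.charset.Charset;

class EncodingBytesDemo
{
   // Encodes a string in the given character set and formats the bytes in hex.
   static String hex(String s, String charsetName)
   {
      byte[] bytes = s.getBytes(Charset.forName(charsetName));
      StringBuilder result = new StringBuilder();
      for (byte b : bytes)
      {
         if (result.length() > 0) result.append(' ');
         result.append(String.format("%02X", b));
      }
      return result.toString();
   }

   public static void main(String[] args)
   {
      System.out.println(hex("1234", "UTF-16BE"));   // 00 31 00 32 00 33 00 34
      System.out.println(hex("1234", "ISO-8859-1")); // 31 32 33 34
   }
}
```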

The OutputStreamWriter class turns a stream of Unicode characters into a stream of bytes, using a chosen character encoding. Conversely, the InputStreamReader class turns an input stream that contains bytes (specifying characters in some character encoding) into a reader that emits Unicode characters.

For example, here is how you make an input reader that reads keystrokes from the console and converts them to Unicode:

InputStreamReader in = new InputStreamReader(System.in);

This input stream reader assumes the default character encoding used by the host system, such as the ISO 8859-1 encoding in Western Europe. You can choose a different encoding by specifying it in the constructor for the InputStreamReader, for example,

InputStreamReader in = new InputStreamReader(new FileInputStream("kremlin.dat"), "ISO8859_5");

See “Character Sets” on page 19 for more information on character encodings.
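
The choice of encoding matters because the same bytes yield different characters under different encodings. In this sketch (ReaderEncodingDemo is a made-up name, and a byte array replaces the file), the two bytes C3 A9 are read once as UTF-8 and once as ISO 8859-1.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;

class ReaderEncodingDemo
{
   // Reads all characters from a byte array, interpreting the bytes
   // in the given character encoding.
   static String decode(byte[] data, String charsetName)
   {
      try
      {
         Reader in = new InputStreamReader(new ByteArrayInputStream(data), charsetName);
         StringBuilder result = new StringBuilder();
         int c;
         while ((c = in.read()) != -1)
            result.append((char) c);
         in.close();
         return result.toString();
      }
      catch (IOException e)
      {
         throw new RuntimeException(e);
      }
   }

   public static void main(String[] args)
   {
      byte[] data = { (byte) 0xC3, (byte) 0xA9 };
      System.out.println(decode(data, "UTF-8").length());      // one character
      System.out.println(decode(data, "ISO-8859-1").length()); // two characters
   }
}
```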

Because it is so common to attach a reader or writer to a file, a pair of convenience classes, FileReader and FileWriter, is provided for this purpose. For example, the writer definition

FileWriter out = new FileWriter("output.txt");

is equivalent to

OutputStreamWriter out = new OutputStreamWriter(new FileOutputStream("output.txt"));

How to Write Text Output

For text output, you want to use a PrintWriter. That class has methods to print strings and numbers in text format. There is even a convenience constructor to link a PrintWriter with a FileWriter. The statement

PrintWriter out = new PrintWriter("employee.txt");

is equivalent to

PrintWriter out = new PrintWriter(new FileWriter("employee.txt"));

To write to a print writer, you use the same print, println, and printf methods that you used with System.out. You can use these methods to print numbers (int, short, long, float, double), characters, boolean values, strings, and objects.

For example, consider this code:

String name = "Harry Hacker";
double salary = 75000;
out.print(name);
out.print(' ');
out.println(salary);

This writes the characters

Harry Hacker 75000.0

to the writer out. The characters are then converted to bytes and end up in the file employee.txt.

The println method adds the correct end-of-line character for the target system ("\r\n" on Windows, "\n" on UNIX) to the line. This is the string obtained by the call System.getProperty("line.separator").

If the writer is set to autoflush mode, then all characters in the buffer are sent to their destination whenever println is called. (Print writers are always buffered.) By default, autoflushing is not enabled. You can enable or disable autoflushing by using the PrintWriter(Writer out, boolean autoFlush) constructor:

PrintWriter out = new PrintWriter(new FileWriter("employee.txt"), true); // autoflush

The print methods don’t throw exceptions. You can call the checkError method to see if something went wrong with the stream.
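
Here is a sketch of how checkError can be used. The nested class FailingStream is a hypothetical destination that always fails; it stands in for a disk that is full or a connection that has dropped. All names in this example are invented for illustration.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.io.PrintWriter;

class CheckErrorDemo
{
   // A hypothetical destination whose write method always fails.
   static class FailingStream extends OutputStream
   {
      public void write(int b) throws IOException
      {
         throw new IOException("simulated failure");
      }
   }

   // Prints a line to the destination and reports whether an error occurred.
   static boolean errorAfterPrinting(OutputStream dest)
   {
      PrintWriter out = new PrintWriter(dest);
      out.println("Harry Hacker 75000.0"); // does not throw, even if dest fails
      return out.checkError();             // flushes, then reports any failure
   }

   public static void main(String[] args)
   {
      System.out.println(errorAfterPrinting(new ByteArrayOutputStream()));
      System.out.println(errorAfterPrinting(new FailingStream()));
   }
}
```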

Note

Java veterans might wonder whatever happened to the PrintStream class and to System.out. In Java 1.0, the PrintStream class simply truncated all Unicode characters to ASCII characters by dropping the top byte. Clearly, that was not a clean or portable approach, and it was fixed with the introduction of readers and writers in Java 1.1. For compatibility with existing code, System.in, System.out, and System.err are still streams, not readers and writers. But now the PrintStream class internally converts Unicode characters to the default host encoding in the same way as the PrintWriter does. Objects of type PrintStream act exactly like print writers when you use the print and println methods, but unlike print writers, they allow you to output raw bytes with the write(int) and write(byte[]) methods.

How to Read Text Input

As you know:

  • To write data in binary format, you use a DataOutputStream.

  • To write in text format, you use a PrintWriter.

Therefore, you might expect that there is an analog to the DataInputStream that lets you read data in text format. The closest analog is the Scanner class that we used extensively in Volume I. However, before Java SE 5.0, the only game in town for processing text input was the BufferedReader class—it has a method, readLine, that lets you read a line of text. You need to combine a buffered reader with an input source.

BufferedReader in = new BufferedReader(new FileReader("employee.txt"));

The readLine method returns null when no more input is available. A typical input loop, therefore, looks like this:

String line;
while ((line = in.readLine()) != null)
{
   do something with line
}

However, a BufferedReader has no methods for reading numbers. We suggest that you use a Scanner for reading text input.
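
As a brief illustration of why a scanner is convenient, the following sketch reads alternating names and salaries from a string. The class ScannerDemo and the input layout are assumptions made for this example. The locale is set explicitly because the number format that a scanner accepts depends on the locale.

```java
import java.util.Locale;
import java.util.Scanner;

class ScannerDemo
{
   // Adds up the salaries in text of the form "name salary name salary ...".
   static double totalSalary(String text)
   {
      Scanner in = new Scanner(text);
      in.useLocale(Locale.US); // expect a period as the decimal separator
      double total = 0;
      while (in.hasNext())
      {
         in.next();                // skip the name
         total += in.nextDouble(); // read the salary as a number
      }
      in.close();
      return total;
   }

   public static void main(String[] args)
   {
      System.out.println(totalSalary("Harry 35500.5\nCarl 75000\n"));
   }
}
```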

Saving Objects in Text Format

In this section, we walk you through an example program that stores an array of Employee records in a text file. Each record is stored in a separate line. Instance fields are separated from each other by delimiters. We use a vertical bar (|) as our delimiter. (A colon (:) is another popular choice. Part of the fun is that everyone uses a different delimiter.) Naturally, we punt on the issue of what might happen if a | actually occurred in one of the strings we save.

Here is a sample set of records:

Harry Hacker|35500|1989|10|1
Carl Cracker|75000|1987|12|15
Tony Tester|38000|1990|3|15

Writing records is simple. Because we write to a text file, we use the PrintWriter class. We simply write all fields, followed by either a | or, for the last field, a \n. This work is done in the following writeData method that we add to our Employee class.

public void writeData(PrintWriter out) throws IOException
{
   GregorianCalendar calendar = new GregorianCalendar();
   calendar.setTime(hireDay);
   out.println(name + "|"
      + salary + "|"
      + calendar.get(Calendar.YEAR) + "|"
      + (calendar.get(Calendar.MONTH) + 1) + "|"
      + calendar.get(Calendar.DAY_OF_MONTH));
}

To read records, we read in a line at a time and separate the fields. We use a scanner to read each line and then split the line into tokens with the String.split method.

public void readData(Scanner in)
{
   String line = in.nextLine();
   String[] tokens = line.split("\\|");
   name = tokens[0];
   salary = Double.parseDouble(tokens[1]);
   int y = Integer.parseInt(tokens[2]);
   int m = Integer.parseInt(tokens[3]);
   int d = Integer.parseInt(tokens[4]);
   GregorianCalendar calendar = new GregorianCalendar(y, m - 1, d);
   hireDay = calendar.getTime();
}

The parameter of the split method is a regular expression describing the separator. We discuss regular expressions in more detail at the end of this chapter. As it happens, the vertical bar character has a special meaning in regular expressions, so it needs to be escaped with a \ character. That character needs to be escaped by another \, yielding the "\\|" expression.
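
The following sketch shows the escaped separator at work. As an alternative that is not used in this chapter's program, the static Pattern.quote method produces a regular expression that matches its argument literally. The class and method names are illustrative.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

class SplitDemo
{
   // Splits a record at vertical bars, escaping the bar by hand.
   static String[] fields(String line)
   {
      return line.split("\\|");
   }

   // The same split, letting Pattern.quote do the escaping.
   static String[] fieldsQuoted(String line)
   {
      return line.split(Pattern.quote("|"));
   }

   public static void main(String[] args)
   {
      String line = "Harry Hacker|35500|1989|10|1";
      System.out.println(Arrays.toString(fields(line)));
      System.out.println(Arrays.toString(fieldsQuoted(line)));
   }
}
```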

The complete program is in Listing 1-1. The static method

void writeData(Employee[] e, PrintWriter out)

first writes the length of the array, then writes each record. The static method

Employee[] readData(Scanner in)

first reads in the length of the array, then reads in each record. This turns out to be a bit tricky:

int n = in.nextInt();
in.nextLine(); // consume newline
Employee[] employees = new Employee[n];
for (int i = 0; i < n; i++)
{
   employees[i] = new Employee();
   employees[i].readData(in);
}

The call to nextInt reads the array length but not the trailing newline character. We must consume the newline so that the readData method can get the next input line when it calls the nextLine method.
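
You can observe the effect of the leftover newline with a small experiment. In this sketch (NextLineDemo and firstRecord are hypothetical names), the first call to nextLine after nextInt yields an empty string unless the newline is consumed first.

```java
import java.util.Scanner;

class NextLineDemo
{
   // Reads a count, optionally consumes the rest of that line,
   // and returns whatever the next call to nextLine delivers.
   static String firstRecord(String input, boolean consumeNewline)
   {
      Scanner in = new Scanner(input);
      in.nextInt();                      // reads the count but not the newline
      if (consumeNewline) in.nextLine(); // skip the remainder of the first line
      String record = in.nextLine();
      in.close();
      return record;
   }

   public static void main(String[] args)
   {
      String input = "3\nHarry Hacker|35500|1989|10|1\n";
      System.out.println("[" + firstRecord(input, false) + "]");
      System.out.println("[" + firstRecord(input, true) + "]");
   }
}
```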

Example 1-1. TextFileTest.java

import java.io.*;
import java.util.*;

/**
 * @version 1.12 2007-06-22
 * @author Cay Horstmann
 */
public class TextFileTest
{
   public static void main(String[] args)
   {
      Employee[] staff = new Employee[3];

      staff[0] = new Employee("Carl Cracker", 75000, 1987, 12, 15);
      staff[1] = new Employee("Harry Hacker", 50000, 1989, 10, 1);
      staff[2] = new Employee("Tony Tester", 40000, 1990, 3, 15);

      try
      {
         // save all employee records to the file employee.dat
         PrintWriter out = new PrintWriter("employee.dat");
         writeData(staff, out);
         out.close();

         // retrieve all records into a new array
         Scanner in = new Scanner(new FileReader("employee.dat"));
         Employee[] newStaff = readData(in);
         in.close();

         // print the newly read employee records
         for (Employee e : newStaff)
            System.out.println(e);
      }
      catch (IOException exception)
      {
         exception.printStackTrace();
      }
   }

   /**
    * Writes all employees in an array to a print writer
    * @param employees an array of employees
    * @param out a print writer
    */
   private static void writeData(Employee[] employees, PrintWriter out) throws IOException
   {
      // write number of employees
      out.println(employees.length);

      for (Employee e : employees)
         e.writeData(out);
   }

   /**
    * Reads an array of employees from a scanner
    * @param in the scanner
    * @return the array of employees
    */
   private static Employee[] readData(Scanner in)
   {
      // retrieve the array size
      int n = in.nextInt();
      in.nextLine(); // consume newline

      Employee[] employees = new Employee[n];
      for (int i = 0; i < n; i++)
      {
         employees[i] = new Employee();
         employees[i].readData(in);
      }
      return employees;
   }
}

class Employee
{
   public Employee()
   {
   }

   public Employee(String n, double s, int year, int month, int day)
   {
      name = n;
      salary = s;
      GregorianCalendar calendar = new GregorianCalendar(year, month - 1, day);
      hireDay = calendar.getTime();
   }

   public String getName()
   {
      return name;
   }

   public double getSalary()
   {
      return salary;
   }

   public Date getHireDay()
   {
      return hireDay;
   }

   public void raiseSalary(double byPercent)
   {
      double raise = salary * byPercent / 100;
      salary += raise;
   }

   public String toString()
   {
      return getClass().getName() + "[name=" + name + ",salary=" + salary + ",hireDay="
            + hireDay + "]";
   }

   /**
    * Writes employee data to a print writer
    * @param out the print writer
    */
   public void writeData(PrintWriter out)
   {
      GregorianCalendar calendar = new GregorianCalendar();
      calendar.setTime(hireDay);
      out.println(name + "|" + salary + "|" + calendar.get(Calendar.YEAR) + "|"
           + (calendar.get(Calendar.MONTH) + 1) + "|" + calendar.get(Calendar.DAY_OF_MONTH));
   }

   /**
    * Reads employee data from a scanner
    * @param in the scanner
    */
   public void readData(Scanner in)
   {
      String line = in.nextLine();
      String[] tokens = line.split("\\|"); // the | is escaped because it is special in regular expressions
      name = tokens[0];
      salary = Double.parseDouble(tokens[1]);
      int y = Integer.parseInt(tokens[2]);
      int m = Integer.parseInt(tokens[3]);
      int d = Integer.parseInt(tokens[4]);
      GregorianCalendar calendar = new GregorianCalendar(y, m - 1, d);
      hireDay = calendar.getTime();
   }

   private String name;
   private double salary;
   private Date hireDay;
}

 

Character Sets

In the past, international character sets have been handled rather unsystematically throughout the Java library. The java.nio package—introduced in Java SE 1.4—unifies character set conversion with the introduction of the Charset class. (Note that the s is lower case.)

A character set maps between sequences of two-byte Unicode code units and byte sequences used in a local character encoding. One of the most popular character encodings is ISO-8859-1, a single-byte encoding of the first 256 Unicode characters. Gaining in importance is ISO-8859-15, which replaces some of the less useful characters of ISO-8859-1 with accented letters used in French and Finnish, and, more important, replaces the “international currency” character ¤ with the Euro symbol (€) in code point 0xA4. Other examples for character encodings are the variable-byte encodings commonly used for Japanese and Chinese.

The Charset class uses the character set names standardized in the IANA Character Set Registry (http://www.iana.org/assignments/character-sets). These names differ slightly from those used in previous versions. For example, the “official” name of ISO-8859-1 is now "ISO-8859-1" and no longer "ISO8859_1", which was the preferred name up to Java SE 1.3.

Note

An excellent reference for the “ISO 8859 alphabet soup” is http://czyborra.com/charsets/iso8859.html.

You obtain a Charset by calling the static forName method with either the official name or one of its aliases:

Charset cset = Charset.forName("ISO-8859-1");

Character set names are case insensitive.

For compatibility with other naming conventions, each character set can have a number of aliases. For example, ISO-8859-1 has aliases

ISO8859-1
ISO_8859_1
ISO8859_1
ISO_8859-1
ISO_8859-1:1987
8859_1
latin1
l1
csISOLatin1
iso-ir-100
cp819
IBM819
IBM-819
819

The aliases method returns a Set object of the aliases. Here is the code to iterate through the aliases:

Set<String> aliases = cset.aliases();
for (String alias : aliases)
   System.out.println(alias);
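
To confirm that lookups ignore case and accept aliases, you can compare the canonical names that are returned. This is a minimal sketch; CharsetNameDemo and canonicalName are names chosen for the example.

```java
import java.nio.charset.Charset;

class CharsetNameDemo
{
   // Looks up a character set by name or alias and returns its canonical name.
   static String canonicalName(String nameOrAlias)
   {
      return Charset.forName(nameOrAlias).name();
   }

   public static void main(String[] args)
   {
      System.out.println(canonicalName("iso-8859-1")); // lowercase spelling
      System.out.println(canonicalName("latin1"));     // an alias
   }
}
```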

To find out which character sets are available in a particular implementation, call the static availableCharsets method. Use this code to find out the names of all available character sets:

Map<String, Charset> charsets = Charset.availableCharsets();
for (String name : charsets.keySet())
   System.out.println(name);

Table 1-1 lists the character encodings that every Java implementation is required to have. Table 1-2 lists the encoding schemes that the Java Development Kit (JDK) installs by default. The character sets in Table 1-3 are installed only on operating systems that use non-European languages.

Table 1-1. Required Character Encodings

Charset Standard Name   Legacy Name             Description
US-ASCII                ASCII                   American Standard Code for Information Interchange
ISO-8859-1              ISO8859_1               ISO 8859-1, Latin alphabet No. 1
UTF-8                   UTF8                    Eight-bit Unicode Transformation Format
UTF-16                  UTF-16                  Sixteen-bit Unicode Transformation Format, byte order specified by an optional initial byte-order mark
UTF-16BE                UnicodeBigUnmarked      Sixteen-bit Unicode Transformation Format, big-endian byte order
UTF-16LE                UnicodeLittleUnmarked   Sixteen-bit Unicode Transformation Format, little-endian byte order

Table 1-2. Basic Character Encodings

Charset Standard Name   Legacy Name   Description
ISO8859-2               ISO8859_2     ISO 8859-2, Latin alphabet No. 2
ISO8859-4               ISO8859_4     ISO 8859-4, Latin alphabet No. 4
ISO8859-5               ISO8859_5     ISO 8859-5, Latin/Cyrillic alphabet
ISO8859-7               ISO8859_7     ISO 8859-7, Latin/Greek alphabet
ISO8859-9               ISO8859_9     ISO 8859-9, Latin alphabet No. 5
ISO8859-13              ISO8859_13    ISO 8859-13, Latin alphabet No. 7
ISO8859-15              ISO8859_15    ISO 8859-15, Latin alphabet No. 9
windows-1250            Cp1250        Windows Eastern European
windows-1251            Cp1251        Windows Cyrillic
windows-1252            Cp1252        Windows Latin-1
windows-1253            Cp1253        Windows Greek
windows-1254            Cp1254        Windows Turkish
windows-1257            Cp1257        Windows Baltic

Table 1-3. Extended Character Encodings

Charset Standard Name   Legacy Name    Description
Big5                    Big5           Big5, Traditional Chinese
Big5-HKSCS              Big5_HKSCS     Big5 with Hong Kong extensions, Traditional Chinese
EUC-JP                  EUC_JP         JIS X 0201, 0208, 0212, EUC encoding, Japanese
EUC-KR                  EUC_KR         KS C 5601, EUC encoding, Korean
GB18030                 GB18030        Simplified Chinese, PRC Standard
GBK                     GBK            GBK, Simplified Chinese
ISCII91                 ISCII91        ISCII91 encoding of Indic scripts
ISO-2022-JP             ISO2022JP      JIS X 0201, 0208 in ISO 2022 form, Japanese
ISO-2022-KR             ISO2022KR      ISO 2022 KR, Korean
ISO8859-3               ISO8859_3      ISO 8859-3, Latin alphabet No. 3
ISO8859-6               ISO8859_6      ISO 8859-6, Latin/Arabic alphabet
ISO8859-8               ISO8859_8      ISO 8859-8, Latin/Hebrew alphabet
Shift_JIS               SJIS           Shift-JIS, Japanese
TIS-620                 TIS620         TIS620, Thai
windows-1255            Cp1255         Windows Hebrew
windows-1256            Cp1256         Windows Arabic
windows-1258            Cp1258         Windows Vietnamese
windows-31j             MS932          Windows Japanese
x-EUC-CN                EUC_CN         GB2312, EUC encoding, Simplified Chinese
x-EUC-JP-LINUX          EUC_JP_LINUX   JIS X 0201, 0208, EUC encoding, Japanese
x-EUC-TW                EUC_TW         CNS11643 (Plane 1-3), EUC encoding, Traditional Chinese
x-MS950-HKSCS           MS950_HKSCS    Windows Traditional Chinese with Hong Kong extensions
x-mswin-936             MS936          Windows Simplified Chinese
x-windows-949           MS949          Windows Korean
x-windows-950           MS950          Windows Traditional Chinese

Local encoding schemes cannot represent all Unicode characters. If a character cannot be represented, it is transformed to a ?.
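
For example, ISO-8859-1 has no code for the Euro symbol. The following sketch uses String.getBytes with an explicit character set, which, like the encode method discussed next, substitutes the replacement byte for characters that cannot be encoded. UnmappableDemo is an illustrative name.

```java
import java.nio.charset.Charset;

class UnmappableDemo
{
   // Encodes a string in ISO-8859-1; characters outside that
   // character set come out as the byte for '?'.
   static byte[] encodeLatin1(String s)
   {
      return s.getBytes(Charset.forName("ISO-8859-1"));
   }

   public static void main(String[] args)
   {
      byte[] bytes = encodeLatin1("\u20AC"); // the Euro symbol
      System.out.println((char) bytes[0]);
   }
}
```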

Once you have a character set, you can use it to convert between Unicode strings and encoded byte sequences. Here is how you encode a Unicode string:

String str = . . .;
ByteBuffer buffer = cset.encode(str);
byte[] bytes = buffer.array();

Conversely, to decode a byte sequence, you need a byte buffer. Use the static wrap method of the ByteBuffer class to turn a byte array into a byte buffer. The result of the decode method is a CharBuffer. Call its toString method to get a string.

byte[] bytes = . . .;
ByteBuffer bbuf = ByteBuffer.wrap(bytes, offset, length);
CharBuffer cbuf = cset.decode(bbuf);
String str = cbuf.toString();
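
Putting both directions together gives a round trip. In this sketch (the class and method names are made up), the encoded bytes are copied out of the buffer by using its remaining count. This is a cautious variation on calling array directly, because the backing array of the buffer can be longer than the encoded data.

```java
import java.nio.ByteBuffer;
import java.nio.charset.Charset;

class CharsetRoundTripDemo
{
   // Encodes a string and returns exactly the encoded bytes.
   static byte[] encode(String s, Charset cset)
   {
      ByteBuffer buffer = cset.encode(s);
      byte[] bytes = new byte[buffer.remaining()];
      buffer.get(bytes); // copy only the encoded bytes, not the whole backing array
      return bytes;
   }

   // Decodes a byte array back into a string.
   static String decode(byte[] bytes, Charset cset)
   {
      return cset.decode(ByteBuffer.wrap(bytes)).toString();
   }

   public static void main(String[] args)
   {
      Charset cset = Charset.forName("UTF-8");
      byte[] bytes = encode("h\u00E9llo", cset);
      System.out.println(bytes.length);
      System.out.println(decode(bytes, cset).equals("h\u00E9llo"));
   }
}
```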

Reading and Writing Binary Data

The DataOutput interface defines the following methods for writing a number, character, boolean value, or string in binary format:

writeChars
writeByte
writeInt
writeShort
writeLong
writeFloat
writeDouble
writeChar
writeBoolean
writeUTF

For example, writeInt always writes an integer as a 4-byte binary quantity regardless of the number of digits, and writeDouble always writes a double as an 8-byte binary quantity. The resulting output is not humanly readable, but the space needed will be the same for each value of a given type and reading it back in will be faster than parsing text.

Note

There are two different methods of storing integers and floating-point numbers in memory, depending on the platform you are using. Suppose, for example, you are working with a 4-byte int, say the decimal number 1234, or 4D2 in hexadecimal (1234 = 4 × 256 + 13 × 16 + 2). This can be stored in such a way that the first of the 4 bytes in memory holds the most significant byte (MSB) of the value: 00 00 04 D2. This is the so-called big-endian method. Or we can start with the least significant byte (LSB) first: D2 04 00 00. This is called, naturally enough, the little-endian method. For example, the SPARC uses big-endian; the Pentium, little-endian. This can lead to problems. When a C or C++ file is saved, the data are saved exactly as the processor stores them. That makes it challenging to move even the simplest data files from one platform to another. In Java, all values are written in the big-endian fashion, regardless of the processor. That makes Java data files platform independent.
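
The byte order is easy to check. This sketch (BigEndianDemo is a hypothetical name) writes the integer 1234 with writeInt into a byte array and returns the four bytes that were produced.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.util.Arrays;

class BigEndianDemo
{
   // Returns the bytes that writeInt produces for the given value.
   static byte[] intBytes(int value)
   {
      try
      {
         ByteArrayOutputStream bytes = new ByteArrayOutputStream();
         DataOutputStream out = new DataOutputStream(bytes);
         out.writeInt(value); // always 4 bytes, most significant byte first
         out.close();
         return bytes.toByteArray();
      }
      catch (IOException e)
      {
         throw new RuntimeException(e);
      }
   }

   public static void main(String[] args)
   {
      // 1234 is 00 00 04 D2 in hex; D2 shows up as -46 in a signed byte
      System.out.println(Arrays.toString(intBytes(1234)));
   }
}
```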

The writeUTF method writes string data by using a modified version of 8-bit Unicode Transformation Format. Instead of simply using the standard UTF-8 encoding (which is shown in Table 1-4), character strings are first represented in UTF-16 (see Table 1-5) and then the result is encoded using the UTF-8 rules. The modified encoding is different for characters with code higher than 0xFFFF. It is used for backward compatibility with virtual machines that were built when Unicode had not yet grown beyond 16 bits.

Table 1-4. UTF-8 Encoding

Character Range    Encoding
0...7F             0 a6 a5 a4 a3 a2 a1 a0
80...7FF           110 a10 a9 a8 a7 a6   10 a5 a4 a3 a2 a1 a0
800...FFFF         1110 a15 a14 a13 a12   10 a11 a10 a9 a8 a7 a6   10 a5 a4 a3 a2 a1 a0
10000...10FFFF     11110 a20 a19 a18   10 a17 a16 a15 a14 a13 a12   10 a11 a10 a9 a8 a7 a6   10 a5 a4 a3 a2 a1 a0

Table 1-5. UTF-16 Encoding

Character Range    Encoding
0...FFFF           a15 a14 a13 a12 a11 a10 a9 a8   a7 a6 a5 a4 a3 a2 a1 a0
10000...10FFFF     110110 b19 b18   b17 b16 a15 a14 a13 a12 a11 a10   110111 a9 a8   a7 a6 a5 a4 a3 a2 a1 a0
                   where b19 b18 b17 b16 = a20 a19 a18 a17 a16 - 1

Because nobody else uses this modification of UTF-8, you should only use the writeUTF method to write strings that are intended for a Java virtual machine; for example, if you write a program that generates bytecodes. Use the writeChars method for other purposes.
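
The difference between the two encodings can be observed by comparing byte counts. In the following sketch, the ModifiedUtf8Demo class and its two helper methods are hypothetical; the code point 0x1D546 is simply one arbitrary character above 0xFFFF. Note that writeUTF also writes a 2-byte length before the encoded characters.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

class ModifiedUtf8Demo
{
   // Number of bytes emitted by writeUTF: a 2-byte length followed by the encoded characters.
   static int writeUtfSize(String s)
   {
      ByteArrayOutputStream bytes = new ByteArrayOutputStream();
      try (DataOutputStream out = new DataOutputStream(bytes))
      {
         out.writeUTF(s);
      }
      catch (IOException e)
      {
         throw new UncheckedIOException(e);
      }
      return bytes.size();
   }

   // Number of bytes in the standard UTF-8 encoding of the string.
   static int standardUtf8Size(String s)
   {
      return s.getBytes(StandardCharsets.UTF_8).length;
   }

   public static void main(String[] args)
   {
      String ascii = "Harry";
      String supplementary = new String(Character.toChars(0x1D546));
      // same size for characters up to 0xFFFF (other than the zero character)
      System.out.println((writeUtfSize(ascii) - 2) + " vs. " + standardUtf8Size(ascii));
      // each of the two UTF-16 code units is encoded separately by writeUTF
      System.out.println((writeUtfSize(supplementary) - 2) + " vs. "
         + standardUtf8Size(supplementary));
   }
}
```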

Note

See RFC 2279 (http://ietf.org/rfc/rfc2279.txt) and RFC 2781 (http://ietf.org/rfc/rfc2781.txt) for definitions of UTF-8 and UTF-16.

To read the data back in, use the following methods, defined in the DataInput interface:

readInt
readShort
readLong
readFloat
readDouble
readChar
readBoolean
readUTF

The DataInputStream class implements the DataInput interface. To read binary data from a file, you combine a DataInputStream with a source of bytes such as a FileInputStream:

DataInputStream in = new DataInputStream(new FileInputStream("employee.dat"));

Similarly, to write binary data, you use the DataOutputStream class that implements the DataOutput interface:

DataOutputStream out = new DataOutputStream(new FileOutputStream("employee.dat"));
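
Putting both classes together, a round trip might look like the following sketch. It writes to a temporary file rather than employee.dat so that it can run anywhere; the DataStreamDemo class, the roundTrip method, and the choice of fields are assumptions made for illustration.

```java
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

class DataStreamDemo
{
   // Writes a name and a salary in binary format, then reads both values back.
   static String roundTrip(String name, double salary)
   {
      try
      {
         File file = File.createTempFile("employee", ".dat"); // scratch file
         file.deleteOnExit();
         try (DataOutputStream out = new DataOutputStream(new FileOutputStream(file)))
         {
            out.writeUTF(name);
            out.writeDouble(salary);
         }
         try (DataInputStream in = new DataInputStream(new FileInputStream(file)))
         {
            // values must be read in the order in which they were written
            String n = in.readUTF();
            double s = in.readDouble();
            return n + " " + s;
         }
      }
      catch (IOException e)
      {
         throw new UncheckedIOException(e);
      }
   }

   public static void main(String[] args)
   {
      System.out.println(roundTrip("Harry Hacker", 50000));
   }
}
```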

Random-Access Files

The RandomAccessFile class lets you read or write data anywhere in a file. Disk files are random access, but streams of data from a network are not. You open a random-access file either for reading only or for both reading and writing. You specify the option by using the string "r" (for read access) or "rw" (for read/write access) as the second argument in the constructor.

RandomAccessFile in = new RandomAccessFile("employee.dat", "r");
RandomAccessFile inOut = new RandomAccessFile("employee.dat", "rw");

When you open an existing file as a RandomAccessFile, it does not get deleted.

A random-access file has a file pointer that indicates the position of the next byte that will be read or written. The seek method sets the file pointer to an arbitrary byte position within the file. The argument to seek is a long integer between zero and the length of the file in bytes.

The getFilePointer method returns the current position of the file pointer.

The RandomAccessFile class implements both the DataInput and DataOutput interfaces. To read and write from a random-access file, you use methods such as readInt/writeInt and readChar/writeChar that we discussed in the preceding section.
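
Here is a small sketch of how seek, getFilePointer, and the DataInput/DataOutput methods interact. It stores a sequence of int values in a temporary file and then jumps directly to one of them. The SeekDemo class and the readIntAt method are hypothetical names.

```java
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.io.UncheckedIOException;

class SeekDemo
{
   // Writes the values 0, 10, 20, ... (count of them) and reads back the one at the given index.
   static int readIntAt(int index, int count)
   {
      try
      {
         File file = File.createTempFile("ints", ".dat"); // scratch file
         file.deleteOnExit();
         try (RandomAccessFile raf = new RandomAccessFile(file, "rw"))
         {
            for (int i = 0; i < count; i++)
               raf.writeInt(10 * i);

            raf.seek(4L * index); // each int occupies 4 bytes
            int value = raf.readInt();

            // reading advanced the file pointer past the int
            if (raf.getFilePointer() != 4L * (index + 1))
               throw new IllegalStateException("unexpected file pointer");
            return value;
         }
      }
      catch (IOException e)
      {
         throw new UncheckedIOException(e);
      }
   }

   public static void main(String[] args)
   {
      System.out.println(readIntAt(2, 5)); // the third value
   }
}
```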

We now walk through an example program that stores employee records in a random access file. Each record will have the same size. This makes it easy to read an arbitrary record. Suppose you want to position the file pointer to the third record. Simply set the file pointer to the appropriate byte position and start reading.

long n = 3;
in.seek((n - 1) * RECORD_SIZE);
Employee e = new Employee();
e.readData(in);

If you want to modify the record and then save it back into the same location, remember to set the file pointer back to the beginning of the record. The file must have been opened for read/write access, as with the inOut object above:

inOut.seek((n - 1) * RECORD_SIZE);
e.writeData(inOut);

To determine the total number of bytes in a file, use the length method. The total number of records is the length divided by the size of each record.

long nbytes = in.length(); // length in bytes
int nrecords = (int) (nbytes / RECORD_SIZE);

Integers and floating-point values have a fixed size in binary format, but we have to work harder for strings. We provide two helper methods to write and read strings of a fixed size.

The writeFixedString method writes the specified number of code units, starting at the beginning of the string. (If there are too few code units, the method pads the string, using zero values.)

public static void writeFixedString(String s, int size, DataOutput out)
   throws IOException
{
   for (int i = 0; i < size; i++)
   {
      char ch = 0;
      if (i < s.length()) ch = s.charAt(i);
      out.writeChar(ch);
   }
}

The readFixedString method reads characters from the input stream until it has consumed size code units or until it encounters a character with a zero value. Then, it skips past the remaining zero values in the input field. For added efficiency, this method uses the StringBuilder class to read in a string.

public static String readFixedString(int size, DataInput in)
   throws IOException
{
   StringBuilder b = new StringBuilder(size);
   int i = 0;
   boolean more = true;
   while (more && i < size)
   {
      char ch = in.readChar();
      i++;
      if (ch == 0) more = false;
      else b.append(ch);
   }
   in.skipBytes(2 * (size - i));
   return b.toString();
}

We placed the writeFixedString and readFixedString methods inside the DataIO helper class.

To write a fixed-size record, we simply write all fields in binary.

public void writeData(DataOutput out) throws IOException
{
   DataIO.writeFixedString(name, NAME_SIZE, out);
   out.writeDouble(salary);

   GregorianCalendar calendar = new GregorianCalendar();
   calendar.setTime(hireDay);
   out.writeInt(calendar.get(Calendar.YEAR));
   out.writeInt(calendar.get(Calendar.MONTH) + 1);
   out.writeInt(calendar.get(Calendar.DAY_OF_MONTH));
}

Reading the data back is just as simple.

public void readData(DataInput in) throws IOException
{
   name = DataIO.readFixedString(NAME_SIZE, in);
   salary = in.readDouble();
   int y = in.readInt();
   int m = in.readInt();
   int d = in.readInt();
   GregorianCalendar calendar = new GregorianCalendar(y, m - 1, d);
   hireDay = calendar.getTime();
}

Let us compute the size of each record. We will use 40 characters for the name strings. Therefore, each record contains 100 bytes:

  • 40 characters = 80 bytes for the name

  • 1 double = 8 bytes for the salary

  • 3 int = 12 bytes for the date
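
This arithmetic can be checked by writing one record in the same layout to an in-memory stream and counting the bytes. The following sketch uses a hypothetical RecordSizeDemo class and dummy zero values; only the field sizes matter.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

class RecordSizeDemo
{
   static final int NAME_SIZE = 40;
   static final int RECORD_SIZE = 2 * NAME_SIZE + 8 + 3 * 4;

   // Writes one dummy record (name, salary, year/month/day) and returns the byte count.
   static int writtenSize()
   {
      ByteArrayOutputStream bytes = new ByteArrayOutputStream();
      try (DataOutputStream out = new DataOutputStream(bytes))
      {
         for (int i = 0; i < NAME_SIZE; i++)
            out.writeChar(0);   // 2 bytes per code unit of the name
         out.writeDouble(0);    // 8 bytes for the salary
         for (int i = 0; i < 3; i++)
            out.writeInt(0);    // 4 bytes each for year, month, and day
      }
      catch (IOException e)
      {
         throw new UncheckedIOException(e);
      }
      return bytes.size();
   }

   public static void main(String[] args)
   {
      System.out.println(RECORD_SIZE + " expected, " + writtenSize() + " written");
   }
}
```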

The program shown in Listing 1-2 writes three records into a data file and then reads them from the file in reverse order. To do this efficiently requires random access—we need to get at the third record first.

Example 1-2. RandomFileTest.java

import java.io.*;
import java.util.*;

/**
 * @version 1.11 2004-05-11
 * @author Cay Horstmann
 */

public class RandomFileTest
{
   public static void main(String[] args)
   {
      Employee[] staff = new Employee[3];

      staff[0] = new Employee("Carl Cracker", 75000, 1987, 12, 15);
      staff[1] = new Employee("Harry Hacker", 50000, 1989, 10, 1);
      staff[2] = new Employee("Tony Tester", 40000, 1990, 3, 15);

      try
      {
         // save all employee records to the file employee.dat
         DataOutputStream out = new DataOutputStream(new FileOutputStream("employee.dat"));
         for (Employee e : staff)
            e.writeData(out);
         out.close();

         // retrieve all records into a new array
         RandomAccessFile in = new RandomAccessFile("employee.dat", "r");
         // compute the array size
         int n = (int)(in.length() / Employee.RECORD_SIZE);
         Employee[] newStaff = new Employee[n];

         // read employees in reverse order
         for (int i = n - 1; i >= 0; i--)
         {
            newStaff[i] = new Employee();
            in.seek(i * Employee.RECORD_SIZE);
            newStaff[i].readData(in);
         }
         in.close();

         // print the newly read employee records
         for (Employee e : newStaff)
            System.out.println(e);
      }
      catch (IOException e)
      {
         e.printStackTrace();
      }
   }
}

class Employee
{
   public Employee() {}

   public Employee(String n, double s, int year, int month, int day)
   {
      name = n;
      salary = s;
      GregorianCalendar calendar = new GregorianCalendar(year, month - 1, day);
      hireDay = calendar.getTime();
   }

   public String getName()
   {
      return name;
   }

   public double getSalary()
   {
      return salary;
   }

   public Date getHireDay()
   {
      return hireDay;
   }

   /**
      Raises the salary of this employee.
      @param byPercent the percentage of the raise
   */
   public void raiseSalary(double byPercent)
   {
      double raise = salary * byPercent / 100;
      salary += raise;
   }

   public String toString()
   {
      return getClass().getName()
         + "[name=" + name
         + ",salary=" + salary
         + ",hireDay=" + hireDay
         + "]";
   }

   /**
      Writes employee data to a data output
      @param out the data output
   */
   public void writeData(DataOutput out) throws IOException
   {
      DataIO.writeFixedString(name, NAME_SIZE, out);
      out.writeDouble(salary);

      GregorianCalendar calendar = new GregorianCalendar();
      calendar.setTime(hireDay);
      out.writeInt(calendar.get(Calendar.YEAR));
      out.writeInt(calendar.get(Calendar.MONTH) + 1);
      out.writeInt(calendar.get(Calendar.DAY_OF_MONTH));
   }

   /**
      Reads employee data from a data input
      @param in the data input
   */
   public void readData(DataInput in) throws IOException
   {
      name = DataIO.readFixedString(NAME_SIZE, in);
      salary = in.readDouble();
      int y = in.readInt();
      int m = in.readInt();
      int d = in.readInt();
      GregorianCalendar calendar = new GregorianCalendar(y, m - 1, d);
      hireDay = calendar.getTime();
   }

   public static final int NAME_SIZE = 40;
   public static final int RECORD_SIZE = 2 * NAME_SIZE + 8 + 4 + 4 + 4;

   private String name;
   private double salary;
   private Date hireDay;
}

class DataIO
{
   public static String readFixedString(int size, DataInput in)
      throws IOException
   {
      StringBuilder b = new StringBuilder(size);
      int i = 0;
      boolean more = true;
      while (more && i < size)
      {
         char ch = in.readChar();
         i++;
         if (ch == 0) more = false;
         else b.append(ch);
      }
      in.skipBytes(2 * (size - i));
      return b.toString();
   }

   public static void writeFixedString(String s, int size, DataOutput out)
      throws IOException
   {
      for (int i = 0; i < size; i++)
      {
         char ch = 0;
         if (i < s.length()) ch = s.charAt(i);
         out.writeChar(ch);
      }
   }
}

ZIP Archives

ZIP archives store one or more files in (usually) compressed format. Each entry in a ZIP archive has a header with information such as the name of the file and the compression method that was used. In Java, you use a ZipInputStream to read a ZIP archive. You need to look at the individual entries in the archive. The getNextEntry method returns an object of type ZipEntry that describes the entry. The read method of the ZipInputStream is modified to return −1 at the end of the current entry (instead of just at the end of the ZIP file). You must then call closeEntry to read the next entry. Here is a typical code sequence to read through a ZIP file:

ZipInputStream zin = new ZipInputStream(new FileInputStream(zipname));
ZipEntry entry;
while ((entry = zin.getNextEntry()) != null)
{
   // analyze entry
   // read the contents of zin
   zin.closeEntry();
}
zin.close();

To read the contents of a ZIP entry, you will probably not want to use the raw read method; usually, you will use the methods of a more competent stream filter. For example, to read a text file inside a ZIP file, you can use the following loop:

Scanner in = new Scanner(zin);
while (in.hasNextLine())
{
   String line = in.nextLine();
   // do something with line
}

Note

The ZIP input stream throws a ZipException when there is an error in reading a ZIP file. Normally this error occurs when the ZIP file has been corrupted.

To write a ZIP file, you use a ZipOutputStream. For each entry that you want to place into the ZIP file, you create a ZipEntry object. You pass the file name to the ZipEntry constructor; it sets the other parameters such as file date and compression method. You can override these settings if you like. Then, you call the putNextEntry method of the ZipOutputStream to begin writing a new file. Send the file data to the ZIP stream. When you are done, call closeEntry. Repeat for all the files you want to store. Here is a code skeleton:

FileOutputStream fout = new FileOutputStream("test.zip");
ZipOutputStream zout = new ZipOutputStream(fout);
// filenames is a placeholder for the collection of all files to be stored
for (String filename : filenames)
{
   ZipEntry ze = new ZipEntry(filename);
   zout.putNextEntry(ze);
   // send data to zout
   zout.closeEntry();
}
zout.close();
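
To try both directions without touching the file system, the following sketch builds a ZIP archive in a byte array and then reads it back. The ZipRoundTripDemo class, its zip and unzip methods, and the use of a map from entry names to text contents are assumptions made for illustration; a real program would typically copy file data instead.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

class ZipRoundTripDemo
{
   // Stores each map entry (name, text) as one ZIP entry and returns the archive bytes.
   static byte[] zip(Map<String, String> files)
   {
      ByteArrayOutputStream bytes = new ByteArrayOutputStream();
      try (ZipOutputStream zout = new ZipOutputStream(bytes))
      {
         for (Map.Entry<String, String> file : files.entrySet())
         {
            zout.putNextEntry(new ZipEntry(file.getKey()));
            zout.write(file.getValue().getBytes(StandardCharsets.UTF_8));
            zout.closeEntry();
         }
      }
      catch (IOException e)
      {
         throw new UncheckedIOException(e);
      }
      return bytes.toByteArray();
   }

   // Reads all entries of an in-memory archive back into a map.
   static Map<String, String> unzip(byte[] archive)
   {
      Map<String, String> files = new LinkedHashMap<>();
      try (ZipInputStream zin = new ZipInputStream(new ByteArrayInputStream(archive)))
      {
         ZipEntry entry;
         while ((entry = zin.getNextEntry()) != null)
         {
            ByteArrayOutputStream contents = new ByteArrayOutputStream();
            int b;
            while ((b = zin.read()) != -1) // -1 marks the end of the current entry
               contents.write(b);
            files.put(entry.getName(),
               new String(contents.toByteArray(), StandardCharsets.UTF_8));
            zin.closeEntry();
         }
      }
      catch (IOException e)
      {
         throw new UncheckedIOException(e);
      }
      return files;
   }

   public static void main(String[] args)
   {
      Map<String, String> files = new LinkedHashMap<>();
      files.put("a.txt", "alpha");
      files.put("b.txt", "beta");
      System.out.println(unzip(zip(files)).equals(files));
   }
}
```

Reading one byte at a time keeps the sketch short; for larger entries, you would read into a byte array or wrap the stream as described above.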

Note

JAR files (which were discussed in Volume I, Chapter 10) are simply ZIP files with another entry, the so-called manifest. You use the JarInputStream and JarOutputStream classes to read and write the manifest entry.

ZIP streams are a good example of the power of the stream abstraction. When you read the data that are stored in compressed form, you don’t worry that the data are being decompressed as they are being requested. And the source of the bytes in ZIP formats need not be a file—the ZIP data can come from a network connection. In fact, whenever the class loader of an applet reads a JAR file, it reads and decompresses data from the network.

Note

The article at http://www.javaworld.com/javaworld/jw-10-2000/jw-1027-toolbox.html shows you how to modify a ZIP archive.

The program shown in Listing 1-3 lets you open a ZIP file. It then displays the files stored in the ZIP archive in the combo box at the bottom of the screen. If you select one of the files, the contents of the file are displayed in the text area, as shown in Figure 1-5.

Figure 1-5. The ZipTest program

Example 1-3. ZipTest.java

import java.awt.*;
import java.awt.event.*;
import java.io.*;
import java.util.*;
import java.util.List;
import java.util.zip.*;
import javax.swing.*;

/**
 * @version 1.32 2007-06-22
 * @author Cay Horstmann
 */
public class ZipTest
{
   public static void main(String[] args)
   {
      EventQueue.invokeLater(new Runnable()
         {
            public void run()
            {
               ZipTestFrame frame = new ZipTestFrame();
               frame.setDefaultCloseOperation(JFrame.EXIT_ON_CLOSE);
               frame.setVisible(true);
            }
         });
   }
}

/**
 * A frame with a text area to show the contents of a file inside a ZIP archive, a combo
 * box to select different files in the archive, and a menu to load a new archive.
 */
class ZipTestFrame extends JFrame
{
   public ZipTestFrame()
   {
      setTitle("ZipTest");
      setSize(DEFAULT_WIDTH, DEFAULT_HEIGHT);

      // add the menu and the Open and Exit menu items
      JMenuBar menuBar = new JMenuBar();
      JMenu menu = new JMenu("File");

      JMenuItem openItem = new JMenuItem("Open");
      menu.add(openItem);
      openItem.addActionListener(new ActionListener()
         {
            public void actionPerformed(ActionEvent event)
            {
               JFileChooser chooser = new JFileChooser();
               chooser.setCurrentDirectory(new File("."));
               int r = chooser.showOpenDialog(ZipTestFrame.this);
               if (r == JFileChooser.APPROVE_OPTION)
               {
                  zipname = chooser.getSelectedFile().getPath();
                  fileCombo.removeAllItems();
                  scanZipFile();
               }
            }
         });

      JMenuItem exitItem = new JMenuItem("Exit");
      menu.add(exitItem);
      exitItem.addActionListener(new ActionListener()
         {
            public void actionPerformed(ActionEvent event)
            {
               System.exit(0);
            }
         });

      menuBar.add(menu);
      setJMenuBar(menuBar);

      // add the text area and combo box
      fileText = new JTextArea();
      fileCombo = new JComboBox();
      fileCombo.addActionListener(new ActionListener()
         {
            public void actionPerformed(ActionEvent event)
            {
               loadZipFile((String) fileCombo.getSelectedItem());
            }
         });

      add(fileCombo, BorderLayout.SOUTH);
      add(new JScrollPane(fileText), BorderLayout.CENTER);
   }

   /**
    * Scans the contents of the ZIP archive and populates the combo box.
    */
   public void scanZipFile()
   {
      new SwingWorker<Void, String>()
         {
            protected Void doInBackground() throws Exception
            {
               ZipInputStream zin = new ZipInputStream(new FileInputStream(zipname));
               ZipEntry entry;
               while ((entry = zin.getNextEntry()) != null)
               {
                  publish(entry.getName());
                  zin.closeEntry();
               }
               zin.close();
               return null;
            }

            protected void process(List<String> names)
            {
               for (String name : names)
                  fileCombo.addItem(name);

            }
         }.execute();
   }

   /**
    * Loads a file from the ZIP archive into the text area
    * @param name the name of the file in the archive
    */
   public void loadZipFile(final String name)
   {
      fileCombo.setEnabled(false);
      fileText.setText("");
      new SwingWorker<Void, Void>()
         {
            protected Void doInBackground() throws Exception
            {
               try
               {
                  ZipInputStream zin = new ZipInputStream(new FileInputStream(zipname));
                  ZipEntry entry;

                  // find entry with matching name in archive
                  while ((entry = zin.getNextEntry()) != null)
                  {
                     if (entry.getName().equals(name))
                     {
                        // read entry into text area
                        Scanner in = new Scanner(zin);
                        while (in.hasNextLine())
                        {
                           fileText.append(in.nextLine());
                           fileText.append("\n");
                        }
                     }
                     zin.closeEntry();
                  }
                  zin.close();
               }
               catch (IOException e)
               {
                  e.printStackTrace();
               }
               return null;
            }

            protected void done()
            {
               fileCombo.setEnabled(true);
            }
         }.execute();
   }

   public static final int DEFAULT_WIDTH = 400;
   public static final int DEFAULT_HEIGHT = 300;
   private JComboBox fileCombo;
   private JTextArea fileText;
   private String zipname;
}

Object Streams and Serialization

Using a fixed-length record format is a good choice if you need to store data of the same type. However, objects that you create in an object-oriented program are rarely all of the same type. For example, you might have an array called staff that is nominally an array of Employee records but contains objects that are actually instances of a subclass such as Manager.

It is certainly possible to come up with a data format that allows you to store such polymorphic collections, but fortunately, we don’t have to. The Java language supports a very general mechanism, called object serialization, that makes it possible to write any object to a stream and read it again later. (You will see later in this chapter where the term “serialization” comes from.)

To save object data, you first need to open an ObjectOutputStream object:

ObjectOutputStream out = new ObjectOutputStream(new FileOutputStream("employee.dat"));

Now, to save an object, you simply use the writeObject method of the ObjectOutputStream class as in the following fragment:

Employee harry = new Employee("Harry Hacker", 50000, 1989, 10, 1);
Manager boss = new Manager("Carl Cracker", 80000, 1987, 12, 15);
out.writeObject(harry);
out.writeObject(boss);

To read the objects back in, first get an ObjectInputStream object:

ObjectInputStream in = new ObjectInputStream(new FileInputStream("employee.dat"));

Then, retrieve the objects in the same order in which they were written, using the readObject method.

Employee e1 = (Employee) in.readObject();
Employee e2 = (Employee) in.readObject();

There is, however, one change you need to make to any class that you want to save and restore in an object stream. The class must implement the Serializable interface:

class Employee implements Serializable { . . . }

The Serializable interface has no methods, so you don’t need to change your classes in any way. In this regard, it is similar to the Cloneable interface that we discussed in Volume I, Chapter 6. However, to make a class cloneable, you still had to override the clone method of the Object class. To make a class serializable, you do not need to do anything else.
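
The complete sequence of saving and restoring an object can be sketched in a few lines. The following example serializes to a byte array instead of a file so that it is self-contained; the SerializationDemo class, its nested Person class, and the roundTrip helper are hypothetical names.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

class SerializationDemo
{
   // Implementing Serializable is all that is required; the interface has no methods.
   static class Person implements Serializable
   {
      String name;
      double salary;

      Person(String name, double salary)
      {
         this.name = name;
         this.salary = salary;
      }
   }

   // Writes the object to an in-memory stream and reads a copy back.
   static Object roundTrip(Object obj)
   {
      try
      {
         ByteArrayOutputStream bytes = new ByteArrayOutputStream();
         try (ObjectOutputStream out = new ObjectOutputStream(bytes))
         {
            out.writeObject(obj);
         }
         try (ObjectInputStream in = new ObjectInputStream(
               new ByteArrayInputStream(bytes.toByteArray())))
         {
            return in.readObject();
         }
      }
      catch (IOException | ClassNotFoundException e)
      {
         throw new IllegalStateException(e);
      }
   }

   public static void main(String[] args)
   {
      Person copy = (Person) roundTrip(new Person("Harry Hacker", 50000));
      System.out.println(copy.name + " " + copy.salary);
   }
}
```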

Note

You can write and read only objects with the writeObject/readObject methods. For primitive type values, you use methods such as writeInt/readInt or writeDouble/readDouble. (The object stream classes implement the DataInput/DataOutput interfaces.)

Behind the scenes, an ObjectOutputStream looks at all fields of the objects and saves their contents. For example, when writing an Employee object, the name, date, and salary fields are written to the output stream.

However, there is one important situation that we need to consider: What happens when one object is shared by several objects as part of its state?

To illustrate the problem, let us make a slight modification to the Manager class. Let’s assume that each manager has a secretary:

class Manager extends Employee
{
   . . .
   private Employee secretary;
}

Each Manager object now contains a reference to the Employee object that describes the secretary. Of course, two managers can share the same secretary, as is the case in Figure 1-6 and the following code:

harry = new Employee("Harry Hacker", . . .);
Manager carl = new Manager("Carl Cracker", . . .);
carl.setSecretary(harry);
Manager tony = new Manager("Tony Tester", . . .);
tony.setSecretary(harry);

Figure 1-6. Two managers can share a mutual employee

Saving such a network of objects is a challenge. Of course, we cannot save and restore the memory addresses for the secretary objects. When an object is reloaded, it will likely occupy a completely different memory address than it originally did.

Instead, each object is saved with a serial number, hence the name object serialization for this mechanism. Here is the algorithm:

  • Associate a serial number with each object reference that you encounter (as shown in Figure 1-7).

    Figure 1-7. An example of object serialization

  • When encountering an object reference for the first time, save the object data to the stream.

  • If it has been saved previously, just write “same as previously saved object with serial number x.”

When reading back the objects, the procedure is reversed.

  • When an object is specified in the stream for the first time, construct it, initialize it with the stream data, and remember the association between the serial number and the object reference.

  • When the tag “same as previously saved object with serial number x” is encountered, retrieve the object reference for that serial number.
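
The effect of this algorithm can be observed directly: after a round trip through an object stream, two managers that shared a secretary still refer to one and the same (new) object. In this sketch, the SharedReferenceDemo class and its stripped-down nested Employee and Manager classes are simplified stand-ins, not the classes of the example program.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

class SharedReferenceDemo
{
   static class Employee implements Serializable
   {
      String name;
      Employee(String name) { this.name = name; }
   }

   static class Manager extends Employee
   {
      Employee secretary;
      Manager(String name, Employee secretary)
      {
         super(name);
         this.secretary = secretary;
      }
   }

   // Saves the array to a single object stream and reads it back.
   static Manager[] roundTrip(Manager[] managers)
   {
      try
      {
         ByteArrayOutputStream bytes = new ByteArrayOutputStream();
         try (ObjectOutputStream out = new ObjectOutputStream(bytes))
         {
            out.writeObject(managers);
         }
         try (ObjectInputStream in = new ObjectInputStream(
               new ByteArrayInputStream(bytes.toByteArray())))
         {
            return (Manager[]) in.readObject();
         }
      }
      catch (IOException | ClassNotFoundException e)
      {
         throw new IllegalStateException(e);
      }
   }

   // Returns true if both reloaded managers refer to one shared secretary copy.
   static boolean secretaryStillShared()
   {
      Employee harry = new Employee("Harry Hacker");
      Manager[] staff = { new Manager("Carl Cracker", harry), new Manager("Tony Tester", harry) };
      Manager[] copy = roundTrip(staff);
      return copy[0].secretary == copy[1].secretary // one object, not two
         && copy[0].secretary != harry;              // but a new object, not the original
   }

   public static void main(String[] args)
   {
      System.out.println(secretaryStillShared());
   }
}
```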

Note

In this chapter, we use serialization to save a collection of objects to a disk file and retrieve it exactly as we stored it. Another very important application is the transmittal of a collection of objects across a network connection to another computer. Just as raw memory addresses are meaningless in a file, they are also meaningless when communicating with a different processor. Because serialization replaces memory addresses with serial numbers, it permits the transport of object collections from one machine to another. We study that use of serialization when discussing remote method invocation in Chapter 5.

Listing 1-4 is a program that saves and reloads a network of Employee and Manager objects (some of which share the same employee as a secretary). Note that the secretary object is unique after reloading—when newStaff[1] gets a raise, that is reflected in the secretary fields of the managers.

Example 1-4. ObjectStreamTest.java

import java.io.*;
import java.util.*;

/**
 * @version 1.10 17 Aug 1998
 * @author Cay Horstmann
 */
class ObjectStreamTest
{
   public static void main(String[] args)
   {
      Employee harry = new Employee("Harry Hacker", 50000, 1989, 10, 1);
      Manager carl = new Manager("Carl Cracker", 80000, 1987, 12, 15);
      carl.setSecretary(harry);
      Manager tony = new Manager("Tony Tester", 40000, 1990, 3, 15);
      tony.setSecretary(harry);

      Employee[] staff = new Employee[3];

      staff[0] = carl;
      staff[1] = harry;
      staff[2] = tony;

      try
      {
         // save all employee records to the file employee.dat
         ObjectOutputStream out = new ObjectOutputStream(new FileOutputStream("employee.dat"));
         out.writeObject(staff);
         out.close();

         // retrieve all records into a new array
         ObjectInputStream in = new ObjectInputStream(new FileInputStream("employee.dat"));
         Employee[] newStaff = (Employee[]) in.readObject();
         in.close();

         // raise secretary's salary
         newStaff[1].raiseSalary(10);

         // print the newly read employee records
         for (Employee e : newStaff)
            System.out.println(e);
      }
      catch (Exception e)
      {
         e.printStackTrace();
      }
   }
}

class Employee implements Serializable
{
   public Employee()
   {
   }

   public Employee(String n, double s, int year, int month, int day)
   {
      name = n;
      salary = s;
      GregorianCalendar calendar = new GregorianCalendar(year, month - 1, day);
      hireDay = calendar.getTime();
   }

   public String getName()
   {
      return name;
   }

   public double getSalary()
   {
      return salary;
   }

   public Date getHireDay()
   {
      return hireDay;
   }

   public void raiseSalary(double byPercent)
   {
      double raise = salary * byPercent / 100;
      salary += raise;
   }

   public String toString()
   {
      return getClass().getName() + "[name=" + name + ",salary=" + salary + ",hireDay="
            + hireDay + "]";
   }

   private String name;
   private double salary;
   private Date hireDay;
}

class Manager extends Employee
{
   /**
    * Constructs a Manager without a secretary
    * @param n the employee's name
    * @param s the salary
    * @param year the hire year
    * @param month the hire month
    * @param day the hire day
    */
   public Manager(String n, double s, int year, int month, int day)
   {
      super(n, s, year, month, day);
      secretary = null;
   }

   /**
    * Assigns a secretary to the manager
    * @param s the secretary
    */
   public void setSecretary(Employee s)
   {
      secretary = s;
   }

   public String toString()
   {
      return super.toString() + "[secretary=" + secretary + "]";
   }

   private Employee secretary;
}

Understanding the Object Serialization File Format

Object serialization saves object data in a particular file format. Of course, you can use the writeObject/readObject methods without having to know the exact sequence of bytes that represents objects in a file. Nonetheless, we found studying the data format to be extremely helpful for gaining insight into the object streaming process. Because the details are somewhat technical, feel free to skip this section if you are not interested in the implementation.

Every file begins with the two-byte “magic number”

AC ED

followed by the version number of the object serialization format, which is currently

00 05

(We use hexadecimal numbers throughout this section to denote bytes.) Then, it contains a sequence of objects, in the order that they were saved.

String objects are saved as

74   two-byte length   characters

For example, the string “Harry” is saved as

74 00 05 Harry

The Unicode characters of the string are saved in “modified UTF-8” format.
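
You can inspect these bytes yourself by serializing a string to a byte array and printing the result in hexadecimal. The StreamHeaderDemo class and its hexDump method in the following sketch are hypothetical names.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.io.UncheckedIOException;

class StreamHeaderDemo
{
   // Serializes the object and returns the stream contents as hexadecimal text.
   static String hexDump(Object obj)
   {
      ByteArrayOutputStream bytes = new ByteArrayOutputStream();
      try (ObjectOutputStream out = new ObjectOutputStream(bytes))
      {
         out.writeObject(obj);
      }
      catch (IOException e)
      {
         throw new UncheckedIOException(e);
      }
      StringBuilder b = new StringBuilder();
      for (byte x : bytes.toByteArray())
      {
         if (b.length() > 0) b.append(' ');
         b.append(String.format("%02X", x & 0xFF));
      }
      return b.toString();
   }

   public static void main(String[] args)
   {
      // magic number, version, string code 74, length 00 05, then the characters of "Harry"
      System.out.println(hexDump("Harry")); // AC ED 00 05 74 00 05 48 61 72 72 79
   }
}
```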

When an object is saved, the class of that object must be saved as well. The class description contains

  • The name of the class.

  • The serial version unique ID, which is a fingerprint of the data field types and method signatures.

  • A set of flags describing the serialization method.

  • A description of the data fields.

The fingerprint is obtained by ordering descriptions of the class, superclass, interfaces, field types, and method signatures in a canonical way, and then applying the so-called Secure Hash Algorithm (SHA) to that data.

SHA is a fast algorithm that gives a “fingerprint” to a larger block of information. This fingerprint is always a 20-byte data packet, regardless of the size of the original data. It is created by a clever sequence of bit operations on the data that makes it essentially 100 percent certain that the fingerprint will change if the information is altered in any way. (For more details on SHA, see, for example, Cryptography and Network Security: Principles and Practice, by William Stallings [Prentice Hall, 2002].) However, the serialization mechanism uses only the first 8 bytes of the SHA code as a class fingerprint. It is still very likely that the class fingerprint will change if the data fields or methods change.

When reading an object, its fingerprint is compared against the current fingerprint of the class. If they don’t match, then the class definition has changed after the object was written, and an exception is generated. Of course, in practice, classes do evolve, and it might be necessary for a program to read in older versions of objects. We discuss this later in the section entitled “Versioning” on page 54.

Here is how a class identifier is stored:

  • 72

  • 2-byte length of class name

  • class name

  • 8-byte fingerprint

  • 1-byte flag

  • 2-byte count of data field descriptors

  • data field descriptors

  • 78 (end marker)

  • superclass type (70 if none)

The flag byte is composed of three bit masks, defined in java.io.ObjectStreamConstants:

static final byte SC_WRITE_METHOD = 1;
   // class has writeObject method that writes additional data
static final byte SC_SERIALIZABLE = 2;
   // class implements Serializable interface
static final byte SC_EXTERNALIZABLE = 4;
   // class implements Externalizable interface

We discuss the Externalizable interface later in this chapter. Externalizable classes supply custom read and write methods that take over the output of their instance fields. The classes that we write implement the Serializable interface and will have a flag value of 02. The serializable java.util.Date class defines its own readObject/writeObject methods and has a flag of 03.
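If you want to see the flag byte for yourself, you can serialize an object and walk through the start of its class descriptor. The sketch below assumes the layout described above (73, 72, class name, fingerprint, flag); the classes FlagByteDemo, Plain, and Custom are hypothetical and exist only for this illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.io.Serializable;

// all class names in this sketch are hypothetical
class FlagByteDemo
{
   /** Relies on the default serialization mechanism. */
   static class Plain implements Serializable
   {
      private static final long serialVersionUID = 1L;
      int x = 1;
   }

   /** Supplies its own writeObject method. */
   static class Custom implements Serializable
   {
      private static final long serialVersionUID = 1L;
      int x = 1;

      private void writeObject(ObjectOutputStream out) throws IOException
      {
         out.defaultWriteObject();
      }
   }

   /** Serializes obj and returns the flag byte of its class descriptor. */
   static int flagsOf(Object obj) throws IOException
   {
      ByteArrayOutputStream bout = new ByteArrayOutputStream();
      ObjectOutputStream out = new ObjectOutputStream(bout);
      out.writeObject(obj);
      out.close();

      DataInputStream in = new DataInputStream(
         new ByteArrayInputStream(bout.toByteArray()));
      in.skipBytes(4);       // magic number and version
      in.readByte();         // 73: new object
      in.readByte();         // 72: new class descriptor
      in.readUTF();          // 2-byte length and class name
      in.readLong();         // 8-byte fingerprint
      return in.readByte();  // 1-byte flag
   }

   public static void main(String[] args) throws IOException
   {
      System.out.println(flagsOf(new Plain()));   // 2: SC_SERIALIZABLE
      System.out.println(flagsOf(new Custom()));  // 3: SC_SERIALIZABLE | SC_WRITE_METHOD
   }
}
```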

Each data field descriptor has the format:

  • 1-byte type code

  • 2-byte length of field name

  • field name

  • class name (if field is an object)

The type code is one of the following:

B    byte
C    char
D    double
F    float
I    int
J    long
L    object
S    short
Z    boolean
[    array

When the type code is L, the field name is followed by the field type. Class and field name strings do not start with the string code 74, but field types do. Field types use a slightly different encoding of their names, namely, the format used by native methods.

For example, the salary field of the Employee class is encoded as:

D 00 06 salary
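You do not have to read a hex dump to find the type codes. The java.io.ObjectStreamClass and ObjectStreamField classes expose the same information. The following sketch uses a hypothetical Staff class as a stand-in for Employee; the names FieldDescriptorDemo and describe are also made up for this illustration.

```java
import java.io.ObjectStreamClass;
import java.io.ObjectStreamField;
import java.io.Serializable;
import java.util.Date;

// FieldDescriptorDemo and Staff are hypothetical names used only for this sketch
class FieldDescriptorDemo
{
   /** A stand-in for the Employee class of this chapter. */
   static class Staff implements Serializable
   {
      private static final long serialVersionUID = 1L;
      private String name;
      private double salary;
      private Date hireDay;
   }

   /** Describes one serializable field as "type code, name, field type". */
   static String describe(Class<?> cl, String fieldName)
   {
      ObjectStreamField f = ObjectStreamClass.lookup(cl).getField(fieldName);
      String type = f.getTypeString(); // null for primitive fields
      return f.getTypeCode() + " " + f.getName() + (type == null ? "" : " " + type);
   }

   public static void main(String[] args)
   {
      for (ObjectStreamField f : ObjectStreamClass.lookup(Staff.class).getFields())
         System.out.println(describe(Staff.class, f.getName()));
   }
}
```

For the salary field, describe yields D salary; for hireDay, it yields L hireDay Ljava/util/Date;, matching the type code and field type in the descriptor format.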

Here is the complete class descriptor of the Employee class:

72 00 08 Employee
   E6 D2 86 7D AE AC 18 1B 02     Fingerprint and flags
   00 03                          Number of instance fields
   D 00 06 salary                 Instance field type and name
   L 00 07 hireDay                Instance field type and name
   74 00 10 Ljava/util/Date;      Instance field class name: Date
   L 00 04 name                   Instance field type and name
   74 00 12 Ljava/lang/String;    Instance field class name: String
   78                             End marker
   70                             No superclass

These descriptors are fairly long. If the same class descriptor is needed again in the file, an abbreviated form is used:

71   4-byte serial number

The serial number refers to the previous explicit class descriptor. We discuss the numbering scheme later.

An object is stored as

73   class descriptor   object data

For example, here is how an Employee object is stored:

40 E8 6A 00 00 00 00 00                salary field value: double
73                                     hireDay field value: new object
   71 00 7E 00 08                      Existing class java.util.Date
   77 08 00 00 00 91 1B 4E B1 80 78    External storage (details later)
74 00 0C Harry Hacker                  name field value: String

As you can see, the data file contains enough information to restore the Employee object.

Arrays are saved in the following format:

75   class descriptor   4-byte number of entries   entries

The array class name in the class descriptor is in the same format as that used by native methods (which is slightly different from the format used for class names in other class descriptors). In this format, class names start with an L and end with a semicolon.

For example, an array of three Employee objects starts out like this:

75                                  Array
   72 00 0B [LEmployee;             New class, string length, class name Employee[]
      FC BF 36 11 C5 91 11 C7 02    Fingerprint and flags
      00 00                         Number of instance fields
      78                            End marker
      70                            No superclass
      00 00 00 03                   Number of array entries

Note that the fingerprint for an array of Employee objects is different from a fingerprint of the Employee class itself.

All objects (including arrays and strings) and all class descriptors are given serial numbers as they are saved in the output file. The numbers start at 00 7E 00 00.

We already saw that a full class descriptor for any given class occurs only once. Subsequent descriptors refer to it. For example, in our previous example, a repeated reference to the Date class was coded as

71 00 7E 00 08

The same mechanism is used for objects. If a reference to a previously saved object is written, it is saved in exactly the same way; that is, 71 followed by the serial number. It is always clear from the context whether the particular serial reference denotes a class descriptor or an object.

Finally, a null reference is stored as

70
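The reference and null encodings are easy to observe. In the following sketch (the class name BackReferenceDemo is made up for this illustration), the same string is written twice, followed by null. The first string receives the serial number 00 7E 00 00, so the second occurrence should appear as 71 00 7E 00 00, and the null reference as 70.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;

// BackReferenceDemo is a hypothetical name used only for this sketch
class BackReferenceDemo
{
   /** Writes the given objects, in order, to a single object stream. */
   static byte[] serialize(Object... objects) throws IOException
   {
      ByteArrayOutputStream bout = new ByteArrayOutputStream();
      ObjectOutputStream out = new ObjectOutputStream(bout);
      for (Object obj : objects) out.writeObject(obj);
      out.close();
      return bout.toByteArray();
   }

   public static void main(String[] args) throws IOException
   {
      String s = "Harry";
      byte[] bytes = serialize(s, s, null);

      // skip the 4-byte header and the 8 bytes of the first string (74 00 05 Harry)
      StringBuilder sb = new StringBuilder();
      for (int i = 12; i < bytes.length; i++)
         sb.append(String.format("%02X ", bytes[i] & 0xFF));
      System.out.println(sb.toString().trim());
      // prints 71 00 7E 00 00 70
   }
}
```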

Here is the commented output of the ObjectRefTest program of the preceding section. If you like, run the program, look at a hex dump of its data file employee.dat, and compare it with the commented listing. The important lines toward the end of the output show the reference to a previously saved object.

AC ED 00 05                             File header
75                                      Array staff (serial #1)
   72 00 0B [LEmployee;                 New class, string length, class name Employee[] (serial #0)
      FC BF 36 11 C5 91 11 C7 02        Fingerprint and flags
      00 00                             Number of instance fields
      78                                End marker
      70                                No superclass
      00 00 00 03                       Number of array entries
73                                      staff[0]: new object (serial #7)
   72 00 07 Manager                     New class, string length, class name (serial #2)
      36 06 AE 13 63 8F 59 B7 02        Fingerprint and flags
      00 01                             Number of data fields
      L 00 09 secretary                 Instance field type and name
      74 00 0A LEmployee;               Instance field class name, stored as a String (serial #3)
      78                                End marker
      72 00 08 Employee                 Superclass: new class, string length, class name (serial #4)
         E6 D2 86 7D AE AC 18 1B 02     Fingerprint and flags
         00 03                          Number of instance fields
         D 00 06 salary                 Instance field type and name
         L 00 07 hireDay                Instance field type and name
         74 00 10 Ljava/util/Date;      Instance field class name, stored as a String (serial #5)
         L 00 04 name                   Instance field type and name
         74 00 12 Ljava/lang/String;    Instance field class name, stored as a String (serial #6)
         78                             End marker
         70                             No superclass
   40 F3 88 00 00 00 00 00              salary field value: double
   73                                   hireDay field value: new object (serial #9)
      72 00 0E java.util.Date           New class, string length, class name (serial #8)
         68 6A 81 01 4B 59 74 19 03     Fingerprint and flags
         00 00                          No instance variables
         78                             End marker
         70                             No superclass
      77 08                             External storage, number of bytes
      00 00 00 83 E9 39 E0 00           Date
      78                                End marker
   74 00 0C Carl Cracker                name field value: String (serial #10)
   73                                   secretary field value: new object (serial #11)
      71 00 7E 00 04                    Existing class (use serial #4)
      40 E8 6A 00 00 00 00 00           salary field value: double
      73                                hireDay field value: new object (serial #12)
         71 00 7E 00 08                 Existing class (use serial #8)
         77 08                          External storage, number of bytes
         00 00 00 91 1B 4E B1 80        Date
         78                             End marker
         74 00 0C Harry Hacker          name field value: String (serial #13)
   71 00 7E 00 0B                       staff[1]: existing object (use serial #11)
   73                                   staff[2]: new object (serial #14)
      71 00 7E 00 02                    Existing class (use serial #2)
      40 E3 88 00 00 00 00 00           salary field value: double
      73                                hireDay field value: new object (serial #15)
         71 00 7E 00 08                 Existing class (use serial #8)
         77 08                          External storage, number of bytes
         00 00 00 94 6D 3E EC 00        Date
         78                             End marker
      74 00 0B Tony Tester              name field value: String (serial #16)
      71 00 7E 00 0B                    secretary field value: existing object (use serial #11)

Of course, studying these codes can be about as exciting as reading the average phone book. It is not important to know the exact file format (unless you are trying to create an evil effect by modifying the data), but it is still instructive to know that the object stream contains a detailed description of all the objects that it contains, with sufficient detail to allow reconstruction of both objects and arrays of objects.

What you should remember is this:

  • The object stream output contains the types and data fields of all objects.

  • Each object is assigned a serial number.

  • Repeated occurrences of the same object are stored as references to that serial number.

Modifying the Default Serialization Mechanism

Certain data fields should never be serialized, for example, integer values that store file handles or handles of windows that are only meaningful to native methods. Such information is guaranteed to be useless when you reload an object at a later time or transport it to a different machine. In fact, improper values for such fields can actually cause native methods to crash. Java has an easy mechanism to prevent such fields from ever being serialized. Mark them with the keyword transient. You also need to tag fields as transient if they belong to nonserializable classes. Transient fields are always skipped when objects are serialized.
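Here is a minimal sketch of the effect of the transient keyword. The Session class and its fields are hypothetical, invented for this illustration. After a round trip through an object stream, the transient field comes back with its default value.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

// TransientDemo and Session are hypothetical names used only for this sketch
class TransientDemo
{
   static class Session implements Serializable
   {
      private static final long serialVersionUID = 1L;
      String user;
      transient int fileHandle; // only meaningful inside the current process

      Session(String user, int fileHandle)
      {
         this.user = user;
         this.fileHandle = fileHandle;
      }
   }

   /** Serializes a session to a byte array and reads it back. */
   static Session roundTrip(Session s) throws IOException, ClassNotFoundException
   {
      ByteArrayOutputStream bout = new ByteArrayOutputStream();
      ObjectOutputStream out = new ObjectOutputStream(bout);
      out.writeObject(s);
      out.close();

      ObjectInputStream in = new ObjectInputStream(
         new ByteArrayInputStream(bout.toByteArray()));
      Session copy = (Session) in.readObject();
      in.close();
      return copy;
   }

   public static void main(String[] args) throws Exception
   {
      Session copy = roundTrip(new Session("harry", 42));
      System.out.println(copy.user);        // harry
      System.out.println(copy.fileHandle);  // 0, because the field was skipped
   }
}
```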

The serialization mechanism provides a way for individual classes to add validation or any other desired action to the default read and write behavior. A serializable class can define methods with the signature

private void readObject(ObjectInputStream in)
   throws IOException, ClassNotFoundException;
private void writeObject(ObjectOutputStream out)
   throws IOException;

Then, the data fields are no longer automatically serialized, and these methods are called instead.

Here is a typical example. Before Java SE 6, a number of classes in the java.awt.geom package, such as Point2D.Double, were not serializable. Now suppose you want to serialize a class LabeledPoint that stores a String and a Point2D.Double on such a release. First, you need to mark the Point2D.Double field as transient to avoid a NotSerializableException.

public class LabeledPoint implements Serializable
{
   . . .
   private String label;
   private transient Point2D.Double point;
}

In the writeObject method, we first write the object descriptor and the String field, label, by calling the defaultWriteObject method. This is a special method of the ObjectOutputStream class that can only be called from within a writeObject method of a serializable class. Then we write the point coordinates, using the standard DataOutput calls.

private void writeObject(ObjectOutputStream out)
   throws IOException
{
   out.defaultWriteObject();
   out.writeDouble(point.getX());
   out.writeDouble(point.getY());
}

In the readObject method, we reverse the process:

private void readObject(ObjectInputStream in)
   throws IOException, ClassNotFoundException
{
   in.defaultReadObject();
   double x = in.readDouble();
   double y = in.readDouble();
   point = new Point2D.Double(x, y);
}
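Putting the pieces together, a complete version of the class might look like the following sketch. The constructor, the accessor methods, and the LabeledPointDemo wrapper with its roundTrip helper are assumptions added for this illustration; only the two fields and the writeObject/readObject methods come from the fragments above.

```java
import java.awt.geom.Point2D;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

// LabeledPointDemo, the constructor, and the accessors are hypothetical additions
class LabeledPointDemo
{
   static class LabeledPoint implements Serializable
   {
      private static final long serialVersionUID = 1L;
      private String label;
      private transient Point2D.Double point;

      LabeledPoint(String label, double x, double y)
      {
         this.label = label;
         this.point = new Point2D.Double(x, y);
      }

      String getLabel() { return label; }
      Point2D.Double getPoint() { return point; }

      private void writeObject(ObjectOutputStream out) throws IOException
      {
         out.defaultWriteObject();        // writes the label field
         out.writeDouble(point.getX());   // then the coordinates by hand
         out.writeDouble(point.getY());
      }

      private void readObject(ObjectInputStream in)
         throws IOException, ClassNotFoundException
      {
         in.defaultReadObject();          // restores the label field
         double x = in.readDouble();
         double y = in.readDouble();
         point = new Point2D.Double(x, y);
      }
   }

   /** Serializes a labeled point to a byte array and reads it back. */
   static LabeledPoint roundTrip(LabeledPoint p) throws IOException, ClassNotFoundException
   {
      ByteArrayOutputStream bout = new ByteArrayOutputStream();
      ObjectOutputStream out = new ObjectOutputStream(bout);
      out.writeObject(p);
      out.close();
      ObjectInputStream in = new ObjectInputStream(
         new ByteArrayInputStream(bout.toByteArray()));
      LabeledPoint copy = (LabeledPoint) in.readObject();
      in.close();
      return copy;
   }

   public static void main(String[] args) throws Exception
   {
      LabeledPoint copy = roundTrip(new LabeledPoint("corner", 3, 4));
      System.out.println(copy.getLabel() + " " + copy.getPoint().getX()
         + " " + copy.getPoint().getY());
   }
}
```

Note that the values must be read in exactly the order in which they were written.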

Another example is the java.util.Date class that supplies its own readObject and writeObject methods. These methods write the date as a number of milliseconds from the epoch (January 1, 1970, midnight UTC). The Date class has a complex internal representation that stores both a Calendar object and a millisecond count to optimize lookups. The state of the Calendar is redundant and does not have to be saved.

The readObject and writeObject methods only need to save and load their data fields. They should not concern themselves with superclass data or any other class information.

Rather than letting the serialization mechanism save and restore object data, a class can define its own mechanism. To do this, a class must implement the Externalizable interface. This in turn requires it to define two methods:

public void readExternal(ObjectInput in)
  throws IOException, ClassNotFoundException;
public void writeExternal(ObjectOutput out)
  throws IOException;

Unlike the readObject and writeObject methods that were described in the preceding section, these methods are fully responsible for saving and restoring the entire object, including the superclass data. The serialization mechanism merely records the class of the object in the stream. When reading an externalizable object, the object stream creates an object with the default constructor and then calls the readExternal method. Here is how you can implement these methods for the Employee class:

public void readExternal(ObjectInput s)
   throws IOException
{
   name = s.readUTF();
   salary = s.readDouble();
   hireDay = new Date(s.readLong());
}

public void writeExternal(ObjectOutput s)
   throws IOException
{
  s.writeUTF(name);
  s.writeDouble(salary);
  s.writeLong(hireDay.getTime());
}
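The following sketch shows these two methods in a complete class, together with the public no-argument constructor that the object stream needs when it reads the object back. The ExternalizableDemo wrapper, the second constructor, the describe method, and the roundTrip helper are assumptions added for this illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.Externalizable;
import java.io.IOException;
import java.io.ObjectInput;
import java.io.ObjectInputStream;
import java.io.ObjectOutput;
import java.io.ObjectOutputStream;
import java.util.Date;

// ExternalizableDemo, describe, and roundTrip are hypothetical names for this sketch
class ExternalizableDemo
{
   static class Employee implements Externalizable
   {
      private String name;
      private double salary;
      private Date hireDay;

      /** Called by the object stream before readExternal. */
      public Employee() {}

      Employee(String n, double s, Date d)
      {
         name = n;
         salary = s;
         hireDay = d;
      }

      public void readExternal(ObjectInput s) throws IOException
      {
         name = s.readUTF();
         salary = s.readDouble();
         hireDay = new Date(s.readLong());
      }

      public void writeExternal(ObjectOutput s) throws IOException
      {
         s.writeUTF(name);
         s.writeDouble(salary);
         s.writeLong(hireDay.getTime());
      }

      String describe()
      {
         return name + "," + salary + "," + hireDay.getTime();
      }
   }

   /** Serializes an employee to a byte array and reads it back. */
   static Employee roundTrip(Employee e) throws IOException, ClassNotFoundException
   {
      ByteArrayOutputStream bout = new ByteArrayOutputStream();
      ObjectOutputStream out = new ObjectOutputStream(bout);
      out.writeObject(e);
      out.close();
      ObjectInputStream in = new ObjectInputStream(
         new ByteArrayInputStream(bout.toByteArray()));
      Employee copy = (Employee) in.readObject();
      in.close();
      return copy;
   }

   public static void main(String[] args) throws Exception
   {
      Employee harry = new Employee("Harry Hacker", 50000, new Date(0L));
      System.out.println(roundTrip(harry).describe());
   }
}
```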

Tip

Serialization is somewhat slow because the virtual machine must discover the structure of each object. If you are concerned about performance and if you read and write a large number of objects of a particular class, you should investigate the use of the Externalizable interface. The tech tip http://java.sun.com/developer/TechTips/2000/tt0425.html demonstrates that in the case of an employee class, using external reading and writing was about 35 to 40 percent faster than the default serialization.

Caution

Unlike the readObject and writeObject methods, which are private and can only be called by the serialization mechanism, the readExternal and writeExternal methods are public. In particular, readExternal potentially permits modification of the state of an existing object.

Serializing Singletons and Typesafe Enumerations

You have to pay particular attention when serializing and deserializing objects that are assumed to be unique. This commonly happens when you are implementing singletons and typesafe enumerations.

If you use the enum construct of Java SE 5.0, then you need not worry about serialization—it just works. However, suppose you maintain legacy code that contains an enumerated type such as

public class Orientation
{
   public static final Orientation HORIZONTAL = new Orientation(1);
   public static final Orientation VERTICAL  = new Orientation(2);
   private Orientation(int v) { value = v; }
   private int value;
}

This idiom was common before enumerations were added to the Java language. Note that the constructor is private. Thus, no objects can be created beyond Orientation.HORIZONTAL and Orientation.VERTICAL. In particular, you can use the == operator to test for object equality:

if (orientation == Orientation.HORIZONTAL) . . .

There is an important twist that you need to remember when a typesafe enumeration implements the Serializable interface. The default serialization mechanism is not appropriate. Suppose we write a value of type Orientation and read it in again:

Orientation original = Orientation.HORIZONTAL;
ObjectOutputStream out = . . .;
out.writeObject(original);
out.close();
ObjectInputStream in = . . .;
Orientation saved = (Orientation) in.readObject();

Now the test

if (saved == Orientation.HORIZONTAL) . . .

will fail. In fact, the saved value is a completely new object of the Orientation type and not equal to any of the predefined constants. Even though the constructor is private, the serialization mechanism can create new objects!

To solve this problem, you need to define another special serialization method, called readResolve. If the readResolve method is defined, it is called after the object is deserialized. It must return an object that then becomes the return value of the readObject method. In our case, the readResolve method will inspect the value field and return the appropriate enumerated constant:

protected Object readResolve() throws ObjectStreamException
{
   if (value == 1) return Orientation.HORIZONTAL;
   if (value == 2) return Orientation.VERTICAL;
   return null; // this shouldn't happen
}

Remember to add a readResolve method to all typesafe enumerations in your legacy code and to all classes that follow the singleton design pattern.
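The following sketch puts the Orientation class and its readResolve method together and checks that a deserialized value is identical to the predefined constant. The ReadResolveDemo wrapper and the roundTrip helper are made up for this illustration, and the sketch throws an InvalidObjectException instead of returning null for an unexpected value, which is a design choice of this example rather than a requirement.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InvalidObjectException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.ObjectStreamException;
import java.io.Serializable;

// ReadResolveDemo and roundTrip are hypothetical names used only for this sketch
class ReadResolveDemo
{
   static class Orientation implements Serializable
   {
      private static final long serialVersionUID = 1L;
      public static final Orientation HORIZONTAL = new Orientation(1);
      public static final Orientation VERTICAL = new Orientation(2);

      private int value;

      private Orientation(int v) { value = v; }

      /** Replaces the freshly deserialized object with the matching constant. */
      protected Object readResolve() throws ObjectStreamException
      {
         if (value == 1) return Orientation.HORIZONTAL;
         if (value == 2) return Orientation.VERTICAL;
         throw new InvalidObjectException("unknown orientation value " + value);
      }
   }

   /** Serializes an object to a byte array and reads it back. */
   static Object roundTrip(Object obj) throws IOException, ClassNotFoundException
   {
      ByteArrayOutputStream bout = new ByteArrayOutputStream();
      ObjectOutputStream out = new ObjectOutputStream(bout);
      out.writeObject(obj);
      out.close();
      ObjectInputStream in = new ObjectInputStream(
         new ByteArrayInputStream(bout.toByteArray()));
      Object result = in.readObject();
      in.close();
      return result;
   }

   public static void main(String[] args) throws Exception
   {
      Object saved = roundTrip(Orientation.HORIZONTAL);
      System.out.println(saved == Orientation.HORIZONTAL); // true
   }
}
```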

Versioning

If you use serialization to save objects, you will need to consider what happens when your program evolves. Can version 1.1 read the old files? Can the users who still use 1.0 read the files that the new version is now producing? Clearly, it would be desirable if object files could cope with the evolution of classes.

At first glance it seems that this would not be possible. When a class definition changes in any way, then its SHA fingerprint also changes, and you know that object streams will refuse to read in objects with different fingerprints. However, a class can indicate that it is compatible with an earlier version of itself. To do this, you must first obtain the fingerprint of the earlier version of the class. You use the stand-alone serialver program that is part of the JDK to obtain this number. For example, running

serialver Employee

prints

Employee: static final long serialVersionUID = -1814239825517340645L;

If you start the serialver program with the -show option, then the program brings up a graphical dialog box (see Figure 1-8).

Figure 1-8. The graphical version of the serialver program

All later versions of the class must define the serialVersionUID constant to the same fingerprint as the original.

class Employee implements Serializable // version 1.1
{
   . . .
   public static final long serialVersionUID = -1814239825517340645L;
}

When a class has a static data member named serialVersionUID, the serialization mechanism does not compute the fingerprint itself but instead uses that value.

Once that static data member has been placed inside a class, the serialization system is now willing to read in different versions of objects of that class.
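You can also query the fingerprint from within a program, using the ObjectStreamClass class. The sketch below is an illustration under stated assumptions: the nested classes Employee and Unversioned and the fingerprintOf helper are hypothetical. It shows that a declared serialVersionUID is reported as is, whereas a class without the constant gets a computed value.

```java
import java.io.ObjectStreamClass;
import java.io.Serializable;

// SerialVersionDemo and its nested classes are hypothetical names for this sketch
class SerialVersionDemo
{
   /** Declares its fingerprint explicitly, as in the text. */
   static class Employee implements Serializable
   {
      public static final long serialVersionUID = -1814239825517340645L;
      private String name;
      private double salary;
   }

   /** Leaves the fingerprint to the serialization mechanism. */
   static class Unversioned implements Serializable
   {
      private String name;
   }

   /** Returns the fingerprint that object streams use for the given class. */
   static long fingerprintOf(Class<?> cl)
   {
      return ObjectStreamClass.lookup(cl).getSerialVersionUID();
   }

   public static void main(String[] args)
   {
      System.out.println(fingerprintOf(Employee.class));     // the declared constant
      System.out.println(fingerprintOf(Unversioned.class));  // a computed fingerprint
   }
}
```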

If only the methods of the class change, there is no problem with reading the new object data. However, if data fields change, then you may have problems. For example, the old file object may have more or fewer data fields than the one in the program, or the types of the data fields may be different. In that case, the object stream makes an effort to convert the stream object to the current version of the class.

The object stream compares the data fields of the current version of the class with the data fields of the version in the stream. Of course, the object stream considers only the nontransient and nonstatic data fields. If two fields have matching names but different types, then the object stream makes no effort to convert one type to the other—the objects are incompatible. If the object in the stream has data fields that are not present in the current version, then the object stream ignores the additional data. If the current version has data fields that are not present in the streamed object, the added fields are set to their default (null for objects, zero for numbers, and false for boolean values).

Here is an example. Suppose we have saved a number of employee records on disk, using the original version (1.0) of the class. Now we change the Employee class to version 2.0 by adding a data field called department. Figure 1-9 shows what happens when a 1.0 object is read into a program that uses 2.0 objects. The department field is set to null. Figure 1-10 shows the opposite scenario: A program using 1.0 objects reads a 2.0 object. The additional department field is ignored.

Figure 1-9. Reading an object with fewer data fields

Figure 1-10. Reading an object with more data fields

Is this process safe? It depends. Dropping a data field seems harmless—the recipient still has all the data that it knew how to manipulate. Setting a data field to null might not be so safe. Many classes work hard to initialize all data fields in all constructors to non-null values, so that the methods don’t have to be prepared to handle null data. It is up to the class designer to implement additional code in the readObject method to fix version incompatibilities or to make sure the methods are robust enough to handle null data.

Using Serialization for Cloning

There is an amusing use for the serialization mechanism: It gives you an easy way to clone an object provided the class is serializable. Simply serialize it to an output stream and then read it back in. The result is a new object that is a deep copy of the existing object. You don’t have to write the object to a file—you can use a ByteArrayOutputStream to save the data into a byte array.

As Listing 1-5 shows, to get clone for free, simply extend the SerialCloneable class, and you are done.

You should be aware that this method, although clever, will usually be much slower than a clone method that explicitly constructs a new object and copies or clones the data fields.

Example 1-5. SerialCloneTest.java

import java.io.*;
import java.util.*;

public class SerialCloneTest
{
   public static void main(String[] args)
   {
      Employee harry = new Employee("Harry Hacker", 35000, 1989, 10, 1);
      // clone harry
      Employee harry2 = (Employee) harry.clone();

      // mutate harry
      harry.raiseSalary(10);

      // now harry and the clone are different
      System.out.println(harry);
      System.out.println(harry2);
   }
}

/**
   A class whose clone method uses serialization.
*/
class SerialCloneable implements Cloneable, Serializable
{
   public Object clone()
   {
      try
      {
         // save the object to a byte array
         ByteArrayOutputStream bout = new ByteArrayOutputStream();
         ObjectOutputStream out = new ObjectOutputStream(bout);
         out.writeObject(this);
         out.close();

         // read a clone of the object from the byte array
         ByteArrayInputStream bin = new ByteArrayInputStream(bout.toByteArray());
         ObjectInputStream in = new ObjectInputStream(bin);
         Object ret = in.readObject();
         in.close();

         return ret;
      }
      catch (Exception e)
      {
         return null;
      }
   }
}

/**
   The familiar Employee class, redefined to extend the
   SerialCloneable class.
*/
class Employee extends SerialCloneable
{
   public Employee(String n, double s, int year, int month, int day)
   {
      name = n;
      salary = s;
      GregorianCalendar calendar = new GregorianCalendar(year, month - 1, day);
      hireDay = calendar.getTime();
   }

   public String getName()
   {
      return name;
   }

   public double getSalary()
   {
      return salary;
   }

   public Date getHireDay()
   {
      return hireDay;
   }

   public void raiseSalary(double byPercent)
   {
      double raise = salary * byPercent / 100;
      salary += raise;
   }

   public String toString()
   {
      return getClass().getName()
         + "[name=" + name
         + ",salary=" + salary
         + ",hireDay=" + hireDay
         + "]";
   }

   private String name;
   private double salary;
   private Date hireDay;
}

 

File Management

You have learned how to read and write data from a file. However, there is more to file management than reading and writing. The File class encapsulates the functionality that you will need to work with the file system on the user’s machine. For example, you use the File class to find out when a file was last modified or to remove or rename the file. In other words, the stream classes are concerned with the contents of the file, whereas the File class is concerned with the storage of the file on a disk.

Note

As is so often the case in Java, the File class takes the least common denominator approach. For example, under Windows, you can find out (or set) the read-only flag for a file, but while you can find out if it is a hidden file, you can’t hide it without using a native method.

The simplest constructor for a File object takes a (full) file name. If you don’t supply a path name, then Java uses the current directory. For example,

File f = new File("test.txt");

gives you a file object with this name in the current directory. (The “current directory” is the current directory of the process that executes the virtual machine. If you launched the virtual machine from the command line, it is the directory from which you started the java executable.)

Caution

Because the backslash character is the escape character in Java strings, be sure to use \\ for Windows-style path names ("C:\\Windows\\win.ini"). In Windows, you can also use a single forward slash ("C:/Windows/win.ini") because most Windows file handling system calls will interpret forward slashes as file separators. However, this is not recommended: the behavior of the Windows system functions is subject to change, and on other operating systems, the file separator might be different. Instead, for portable programs, you should use the file separator character for the platform on which your program runs. It is stored in the constant string File.separator.

A call to this constructor does not create a file with this name if it doesn’t exist. Actually creating a file from a File object is done with one of the stream class constructors or the createNewFile method in the File class. The createNewFile method only creates a file if no file with that name exists, and it returns a boolean to tell you whether it was successful.
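The following sketch illustrates the return value of createNewFile. The class name CreateNewFileDemo, the helper tryCreateTwice, and the file name in the temporary directory are assumptions made for this illustration.

```java
import java.io.File;
import java.io.IOException;

// CreateNewFileDemo and tryCreateTwice are hypothetical names for this sketch
class CreateNewFileDemo
{
   /** Calls createNewFile twice and returns both results. */
   static boolean[] tryCreateTwice(File f) throws IOException
   {
      boolean first = f.createNewFile();   // creates the file if it is missing
      boolean second = f.createNewFile();  // the file exists now, so nothing happens
      return new boolean[] { first, second };
   }

   public static void main(String[] args) throws IOException
   {
      // a made-up file name in the system temporary directory
      File f = new File(System.getProperty("java.io.tmpdir"),
         "createnewfile-demo-" + System.nanoTime() + ".txt");

      System.out.println(f.exists());      // constructing a File creates nothing on disk
      boolean[] results = tryCreateTwice(f);
      System.out.println(results[0] + " " + results[1]);
      f.delete();                          // clean up
   }
}
```

If the file did not exist beforehand, the first call returns true and the second returns false.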

On the other hand, once you have a File object, the exists method in the File class tells you whether a file exists with that name. For example, the following trial program would almost certainly print “false” on anyone’s machine, and yet it can print out a path name to this nonexistent file.

import java.io.*;

public class Test
{
   public static void main(String args[])
   {
      File f = new File("afilethatprobablydoesntexist");
      System.out.println(f.getAbsolutePath());
      System.out.println(f.exists());
   }
}

There are two other constructors for File objects:

File(String path, String name)

which creates a File object with the given name in the directory specified by the path parameter. (If the path parameter is null, this constructor creates a File object, using the current directory.)

Finally, you can use an existing File object in the constructor:

File(File dir, String name)

where the File object represents a directory and, as before, if dir is null, the constructor creates a File object in the current directory.

Somewhat confusingly, a File object can represent either a file or a directory (perhaps because the operating system that the Java designers were most familiar with happens to implement directories as files). You use the isDirectory and isFile methods to tell whether the file object represents a file or a directory. This is surprising—in an object-oriented system, you might have expected a separate Directory class, perhaps extending the File class.

To make an object representing a directory, you simply supply the directory name in the File constructor:

File tempDir = new File(File.separator + "temp");

If this directory does not yet exist, you can create it with the mkdir method:

tempDir.mkdir();

If a file object represents a directory, use list() to get an array of the file names in that directory. The program in Listing 1-6 uses all these methods to print out the directory substructure of whatever path is entered on the command line. (It would be easy enough to change this program into a utility class that returns a list of the subdirectories for further processing.)

Tip

Always use File objects, not strings, when manipulating file or directory names. For example, the equals method of the File class knows that some file systems are not case significant and that a trailing / in a directory name doesn’t matter.

Example 1-6. FindDirectories.java

import java.io.*;

/**
 * @version 1.00 05 Sep 1997
 * @author Gary Cornell
 */
public class FindDirectories
{
   public static void main(String[] args)
   {
      // if no arguments provided, start at the parent directory
      if (args.length == 0) args = new String[] { ".." };

      try
      {
         File pathName = new File(args[0]);
         String[] fileNames = pathName.list();

         // list returns null if the path is not a directory or cannot be read
         if (fileNames == null) return;

         // enumerate all files in the directory
         for (int i = 0; i < fileNames.length; i++)
         {
            File f = new File(pathName.getPath(), fileNames[i]);

            // if the file is again a directory, call the main method recursively
            if (f.isDirectory())
            {
               System.out.println(f.getCanonicalPath());
               main(new String[] { f.getPath() });
            }
         }
      }
      catch (IOException e)
      {
         e.printStackTrace();
      }
   }
}

Rather than listing all files in a directory, you can use a FilenameFilter object as a parameter to the list method to narrow down the list. These objects are simply instances of a class that implements the FilenameFilter interface.

All a class needs to do to implement the FilenameFilter interface is define a method called accept. Here is an example of a simple FilenameFilter class that allows only files with a specified extension:

public class ExtensionFilter implements FilenameFilter
{
   public ExtensionFilter(String ext)
   {
      extension = "." + ext;
   }

   public boolean accept(File dir, String name)
   {
      return name.endsWith(extension);
   }

   private String extension;
}
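To use the filter, pass an instance to the list method of a File object that represents a directory. The sketch below repeats the filter as a nested class so that it is self-contained; the ExtensionFilterDemo wrapper and the listByExtension helper are made up for this illustration.

```java
import java.io.File;
import java.io.FilenameFilter;
import java.util.Arrays;

// ExtensionFilterDemo and listByExtension are hypothetical names for this sketch
class ExtensionFilterDemo
{
   /** The filter from the text, nested here to keep the sketch self-contained. */
   static class ExtensionFilter implements FilenameFilter
   {
      private String extension;

      ExtensionFilter(String ext)
      {
         extension = "." + ext;
      }

      public boolean accept(File dir, String name)
      {
         return name.endsWith(extension);
      }
   }

   /** Returns the sorted names in dir that end with the given extension. */
   static String[] listByExtension(File dir, String ext)
   {
      String[] names = dir.list(new ExtensionFilter(ext));
      if (names == null) return new String[0]; // not a directory, or not readable
      Arrays.sort(names);
      return names;
   }

   public static void main(String[] args)
   {
      // list the Java source files in the current directory
      for (String name : listByExtension(new File("."), "java"))
         System.out.println(name);
   }
}
```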

When writing portable programs, it is a challenge to specify file names with subdirectories. As we mentioned earlier, it turns out that you can use a forward slash (the UNIX separator) as the directory separator in Windows as well, but other operating systems might not permit this, so we don’t recommend using a forward slash.

Caution

If you do use forward slashes as directory separators in Windows when constructing a File object, the getAbsolutePath method returns a file name that contains forward slashes, which will look strange to Windows users. Instead, use the getCanonicalPath method—it replaces the forward slashes with backslashes.

It is much better to use the information about the current directory separator that the File class stores in a static field called separator. In a Windows environment, this is a backslash (\); in a UNIX environment, it is a forward slash (/). For example:

File foo = new File("Documents" + File.separator + "data.txt");

Of course, if you use the second alternate version of the File constructor

File foo = new File("Documents", "data.txt");

then the constructor will supply the correct separator.
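As a quick check, the following sketch compares the two ways of building the path. The class name SeparatorDemo and its two helper methods are made up for this illustration.

```java
import java.io.File;

// SeparatorDemo is a hypothetical name used only for this sketch
class SeparatorDemo
{
   /** Builds the path by concatenating with File.separator. */
   static String joined()
   {
      return new File("Documents" + File.separator + "data.txt").getPath();
   }

   /** Builds the path with the two-argument constructor. */
   static String twoArgument()
   {
      return new File("Documents", "data.txt").getPath();
   }

   public static void main(String[] args)
   {
      System.out.println(joined());
      System.out.println(twoArgument());
      System.out.println(joined().equals(twoArgument())); // true
   }
}
```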

The API notes that follow give you what we think are the most important remaining methods of the File class; their use should be straightforward.

New I/O

Java SE 1.4 introduced a number of features for improved input/output processing, collectively called the “new I/O,” in the java.nio package. (Of course, the “new” moniker is somewhat regrettable because, a few years down the road, the package wasn’t new any longer.)

The package includes support for the following features:

  • Character set encoders and decoders

  • Nonblocking I/O

  • Memory-mapped files

  • File locking

We already covered character encoding and decoding in the section “Character Sets” on page 19. Nonblocking I/O is discussed in Chapter 3 because it is particularly important when communicating across a network. In the following sections, we examine memory-mapped files and file locking in detail.

Memory-Mapped Files

Most operating systems can take advantage of the virtual memory implementation to “map” a file, or a region of a file, into memory. Then the file can be accessed as if it were an in-memory array, which is much faster than the traditional file operations.

At the end of this section, you can find a program that computes the CRC32 checksum of a file, using traditional file input and a memory-mapped file. On one machine, we got the timing data shown in Table 1-6 when computing the checksum of the 37-Mbyte file rt.jar in the jre/lib directory of the JDK.

Table 1-6. Timing Data for File Operations

Method                   Time
Plain Input Stream       110 seconds
Buffered Input Stream    9.9 seconds
Random Access File       162 seconds
Memory-Mapped File       7.2 seconds

As you can see, on this particular machine, memory mapping is a bit faster than using buffered sequential input and dramatically faster than using a RandomAccessFile.

Of course, the exact values will differ greatly from one machine to another, but it is obvious that the performance gain can be substantial if you need to use random access. For sequential reading of files of moderate size, on the other hand, there is no reason to use memory mapping.

The java.nio package makes memory mapping quite simple. Here is what you do.

First, get a channel from the file. A channel is an abstraction for disk files that lets you access operating system features such as memory mapping, file locking, and fast data transfers between files. You get a channel by calling the getChannel method that has been added to the FileInputStream, FileOutputStream, and RandomAccessFile classes.

FileInputStream in = new FileInputStream(. . .);
FileChannel channel = in.getChannel();

Then you get a MappedByteBuffer from the channel by calling the map method of the FileChannel class. You specify the area of the file that you want to map and a mapping mode. Three modes are supported:

  • FileChannel.MapMode.READ_ONLY: The resulting buffer is read-only. Any attempt to write to the buffer results in a ReadOnlyBufferException.

  • FileChannel.MapMode.READ_WRITE: The resulting buffer is writable, and the changes will be written back to the file at some time. Note that other programs that have mapped the same file might not see those changes immediately. The exact behavior of simultaneous file mapping by multiple programs is operating-system dependent.

  • FileChannel.MapMode.PRIVATE: The resulting buffer is writable, but any changes are private to this buffer and are not propagated to the file.

Once you have the buffer, you can read and write data, using the methods of the ByteBuffer class and the Buffer superclass.
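The steps above can be sketched as follows. This is a minimal illustration under our own assumptions: the class MapDemo, its helper methods, and the four-byte temporary file are made up for the example. The sketch maps the whole file in READ_ONLY mode and adds up its bytes.

```java
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;

class MapDemo
{
   // Maps the entire file read-only and returns the sum of its bytes.
   static long sumBytes(File file) throws IOException
   {
      try (FileInputStream in = new FileInputStream(file);
           FileChannel channel = in.getChannel())
      {
         MappedByteBuffer buffer = channel.map(FileChannel.MapMode.READ_ONLY, 0, channel.size());
         long sum = 0;
         while (buffer.hasRemaining())
            sum += buffer.get() & 0xFF; // treat each byte as unsigned
         return sum;
      }
   }

   // Creates a small temporary file so that the example is self-contained.
   static File sampleFile() throws IOException
   {
      File file = File.createTempFile("mapdemo", ".bin");
      file.deleteOnExit();
      try (FileOutputStream out = new FileOutputStream(file))
      {
         out.write(new byte[] { 1, 2, 3, (byte) 250 });
      }
      return file;
   }

   public static void main(String[] args) throws IOException
   {
      System.out.println(sumBytes(sampleFile())); // 1 + 2 + 3 + 250
   }
}
```

For a real application you would pass an existing file instead of the temporary one, and you would choose READ_WRITE or PRIVATE if you need to modify the buffer.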

Buffers support both sequential and random data access. A buffer has a position that is advanced by get and put operations. For example, you can sequentially traverse all bytes in the buffer as

while (buffer.hasRemaining())
{
   byte b = buffer.get();
   . . .
}

Alternatively, you can use random access:

for (int i = 0; i < buffer.limit(); i++)
{
   byte b = buffer.get(i);
   . . .
}

You can also read and write arrays of bytes with the methods

get(byte[] bytes)
get(byte[] bytes, int offset, int length)
put(byte[] bytes)
put(byte[] bytes, int offset, int length)

Finally, there are methods

getInt
getLong
getShort
getChar
getFloat
getDouble

to read primitive type values that are stored as binary values in the file. As we already mentioned, Java uses big-endian ordering for binary data. However, if you need to process a file containing binary numbers in little-endian order, simply call

buffer.order(ByteOrder.LITTLE_ENDIAN);

To find out the current byte order of a buffer, call

ByteOrder b = buffer.order();

Caution

This pair of methods does not use the set/get naming convention.

To write numbers to a buffer, use one of the methods

putInt
putLong
putShort
putChar
putFloat
putDouble
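To see the effect of the byte order on these methods, consider the following sketch. It uses a small heap buffer from ByteBuffer.allocate rather than a mapped file, and the class ByteOrderDemo with its encode and decode helpers is our own invention for illustration.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.util.Arrays;

class ByteOrderDemo
{
   // Writes an int with the given byte order and returns the four raw bytes.
   static byte[] encode(int value, ByteOrder order)
   {
      ByteBuffer buffer = ByteBuffer.allocate(4);
      buffer.order(order);
      buffer.putInt(value);
      return buffer.array();
   }

   // Reads an int from four raw bytes, interpreting them in the given order.
   static int decode(byte[] bytes, ByteOrder order)
   {
      return ByteBuffer.wrap(bytes).order(order).getInt();
   }

   public static void main(String[] args)
   {
      int value = 0x01020304;
      // Big-endian (the default) stores the most significant byte first
      System.out.println(Arrays.toString(encode(value, ByteOrder.BIG_ENDIAN)));
      // Little-endian stores the least significant byte first
      System.out.println(Arrays.toString(encode(value, ByteOrder.LITTLE_ENDIAN)));
      System.out.println(decode(encode(value, ByteOrder.LITTLE_ENDIAN), ByteOrder.LITTLE_ENDIAN) == value);
   }
}
```

The first two lines of output should be [1, 2, 3, 4] and [4, 3, 2, 1]; the round trip in the last line recovers the original value.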

Listing 1-7 computes the 32-bit cyclic redundancy checksum (CRC32) of a file. That quantity is a checksum that is often used to determine whether a file has been corrupted. Corruption of a file makes it very likely that the checksum has changed. The java.util.zip package contains a class CRC32 that computes the checksum of a sequence of bytes, using the following loop:

CRC32 crc = new CRC32();
while (more bytes)
   crc.update(next byte)
long checksum = crc.getValue();

Note

For a nice explanation of the CRC algorithm, see http://www.relisoft.com/Science/CrcMath.html.

The details of the CRC computation are not important. We just use it as an example of a useful file operation.
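Before turning to files, here is a small sketch of the checksum loop applied to an in-memory byte array. The class name Crc32Demo is hypothetical. The input "123456789" is the conventional test string for CRC implementations, and its CRC-32 value is cbf43926 in hexadecimal.

```java
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;

class Crc32Demo
{
   // Feeds the bytes to the checksum one at a time, as in the loop above.
   static long checksum(byte[] bytes)
   {
      CRC32 crc = new CRC32();
      for (byte b : bytes)
         crc.update(b); // update(int) uses only the low-order 8 bits
      return crc.getValue();
   }

   public static void main(String[] args)
   {
      byte[] bytes = "123456789".getBytes(StandardCharsets.US_ASCII);
      System.out.println(Long.toHexString(checksum(bytes))); // prints cbf43926
   }
}
```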

Run the program as

java NIOTest filename

Example 1-7. NIOTest.java

import java.io.*;
import java.nio.*;
import java.nio.channels.*;
import java.util.zip.*;

/**
 * This program computes the CRC checksum of a file. <br>
 * Usage: java NIOTest filename
 * @version 1.01 2004-05-11
 * @author Cay Horstmann
 */
public class NIOTest
{
   public static long checksumInputStream(String filename) throws IOException
   {
      // try-with-resources closes the stream even if reading fails
      try (InputStream in = new FileInputStream(filename))
      {
         CRC32 crc = new CRC32();

         int c;
         while ((c = in.read()) != -1)
            crc.update(c);
         return crc.getValue();
      }
   }

   public static long checksumBufferedInputStream(String filename) throws IOException
   {
      try (InputStream in = new BufferedInputStream(new FileInputStream(filename)))
      {
         CRC32 crc = new CRC32();

         int c;
         while ((c = in.read()) != -1)
            crc.update(c);
         return crc.getValue();
      }
   }

   public static long checksumRandomAccessFile(String filename) throws IOException
   {
      try (RandomAccessFile file = new RandomAccessFile(filename, "r"))
      {
         long length = file.length();
         CRC32 crc = new CRC32();

         // deliberately seek before every byte to measure random access
         for (long p = 0; p < length; p++)
         {
            file.seek(p);
            int c = file.readByte();
            crc.update(c);
         }
         return crc.getValue();
      }
   }

   public static long checksumMappedFile(String filename) throws IOException
   {
      try (FileInputStream in = new FileInputStream(filename);
           FileChannel channel = in.getChannel())
      {
         CRC32 crc = new CRC32();
         // a single mapping cannot exceed Integer.MAX_VALUE bytes
         int length = (int) channel.size();
         MappedByteBuffer buffer = channel.map(FileChannel.MapMode.READ_ONLY, 0, length);

         for (int p = 0; p < length; p++)
         {
            int c = buffer.get(p);
            crc.update(c);
         }
         return crc.getValue();
      }
   }

   public static void main(String[] args) throws IOException
   {
      System.out.println("Input Stream:");
      long start = System.currentTimeMillis();
      long crcValue = checksumInputStream(args[0]);
      long end = System.currentTimeMillis();
      System.out.println(Long.toHexString(crcValue));
      System.out.println((end - start) + " milliseconds");

      System.out.println("Buffered Input Stream:");
      start = System.currentTimeMillis();
      crcValue = checksumBufferedInputStream(args[0]);
      end = System.currentTimeMillis();
      System.out.println(Long.toHexString(crcValue));
      System.out.println((end - start) + " milliseconds");

      System.out.println("Random Access File:");
      start = System.currentTimeMillis();
      crcValue = checksumRandomAccessFile(args[0]);
      end = System.currentTimeMillis();
      System.out.println(Long.toHexString(crcValue));
      System.out.println((end - start) + " milliseconds");

      System.out.println("Mapped File:");
      start = System.currentTimeMillis();
      crcValue = checksumMappedFile(args[0]);
      end = System.currentTimeMillis();
      System.out.println(Long.toHexString(crcValue));
      System.out.println((end - start) + " milliseconds");
   }
}

The Buffer Data Structure

When you use memory mapping, you make a single buffer that spans the entire file, or the area of the file in which you are interested. You can also use buffers to read and write more modest chunks of information.

In this section, we briefly describe the basic operations on Buffer objects. A buffer is an array of values of the same type. The Buffer class is an abstract class with concrete subclasses ByteBuffer, CharBuffer, DoubleBuffer, FloatBuffer, IntBuffer, LongBuffer, and ShortBuffer.

Note

The StringBuffer class is not related to these buffers.

In practice, you will most commonly use ByteBuffer and CharBuffer. As shown in Figure 1-11, a buffer has

  • A capacity that never changes.

  • A position at which the next value is read or written.

  • A limit beyond which reading and writing is meaningless.

  • Optionally, a mark for repeating a read or write operation.

Figure 1-11. A buffer

These values fulfill the condition

0 ≤ mark ≤ position ≤ limit ≤ capacity

The principal purpose for a buffer is a “write, then read” cycle. At the outset, the buffer’s position is 0 and the limit is the capacity. Keep calling put to add values to the buffer. When you run out of data or you reach the capacity, it is time to switch to reading.

Call flip to set the limit to the current position and the position to 0. Now keep calling get while the remaining method (which returns limit - position) is positive. When you have read all values in the buffer, call clear to prepare the buffer for the next writing cycle. The clear method resets the position to 0 and the limit to the capacity.

If you want to reread the buffer, use rewind or mark/reset—see the API notes for details.
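The following sketch traces one “write, then read” cycle and records the position and limit after each step. The class BufferCycleDemo and its helper methods are hypothetical names chosen for this illustration; the buffer capacity of 8 is arbitrary.

```java
import java.nio.Buffer;
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.List;

class BufferCycleDemo
{
   static String state(Buffer buffer)
   {
      return "position=" + buffer.position() + " limit=" + buffer.limit();
   }

   // Runs one write-then-read cycle and records the buffer state after each step.
   static List<String> cycle()
   {
      List<String> states = new ArrayList<>();
      ByteBuffer buffer = ByteBuffer.allocate(8);
      states.add("new: " + state(buffer));

      buffer.put((byte) 10).put((byte) 20).put((byte) 30);
      states.add("put: " + state(buffer));

      buffer.flip(); // limit = old position, position = 0
      states.add("flip: " + state(buffer));

      int sum = 0;
      while (buffer.remaining() > 0)
         sum += buffer.get();
      states.add("get: " + state(buffer) + " sum=" + sum);

      buffer.clear(); // position = 0, limit = capacity
      states.add("clear: " + state(buffer));
      return states;
   }

   public static void main(String[] args)
   {
      for (String s : cycle())
         System.out.println(s);
   }
}
```

Note that clear does not erase the stored bytes; it only resets the position and the limit so that the next round of put calls overwrites the old data.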

File Locking

Consider a situation in which multiple simultaneously executing programs need to modify the same file. Clearly, the programs need to communicate in some way, or the file can easily become damaged.

File locks control access to a file or a range of bytes within a file. However, file locking varies greatly among operating systems, which explains why file locking capabilities were absent from prior versions of the JDK.

File locking is not all that common in application programs. Many applications use a database for data storage, and the database has mechanisms for resolving concurrent access. If you store information in flat files and are worried about concurrent access, you might find it simpler to start using a database rather than designing complex file locking schemes.

Still, there are situations in which file locking is essential. Suppose your application saves a configuration file with user preferences. If a user invokes two instances of the application, it could happen that both of them want to write the configuration file at the same time. In that situation, the first instance should lock the file. When the second instance finds the file locked, it can decide to wait until the file is unlocked or simply skip the writing process.

To lock a file, call either the lock or tryLock method of the FileChannel class:

FileLock lock = channel.lock();

or

FileLock lock = channel.tryLock();

The first call blocks until the lock becomes available. The second call returns immediately, either with the lock or null if the lock is not available. The file remains locked until the channel is closed or the release method is invoked on the lock.
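A typical use of tryLock might look like the following sketch, which writes a single byte only if the lock could be obtained. The class LockDemo, the method writeLocked, and the temporary file are assumptions made for this example. The lock is released in a finally clause so that it does not outlive the write.

```java
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.channels.FileChannel;
import java.nio.channels.FileLock;

class LockDemo
{
   // Tries to lock the whole file, writes one byte while holding the lock,
   // and reports whether the lock was obtained.
   static boolean writeLocked(File file) throws IOException
   {
      try (RandomAccessFile raf = new RandomAccessFile(file, "rw");
           FileChannel channel = raf.getChannel())
      {
         FileLock lock = channel.tryLock();
         if (lock == null) return false; // another program holds the lock
         try
         {
            raf.write(42);
            return true;
         }
         finally
         {
            lock.release();
         }
      }
   }

   public static void main(String[] args) throws IOException
   {
      File file = File.createTempFile("lockdemo", ".dat");
      file.deleteOnExit();
      System.out.println(writeLocked(file));
   }
}
```

Replacing tryLock with lock would make the method wait for the other program instead of giving up.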

You can also lock a portion of the file with the call

FileLock lock(long start, long size, boolean exclusive)

or

FileLock tryLock(long start, long size, boolean exclusive)

The exclusive flag is true to lock the file for both reading and writing. It is false for a shared lock, which allows multiple processes to read from the file, while preventing any process from acquiring an exclusive lock. Not all operating systems support shared locks. You may get an exclusive lock even if you just asked for a shared one. Call the isShared method of the FileLock class to find out which kind you have.

Note

If you lock the tail portion of a file and the file subsequently grows beyond the locked portion, the additional area is not locked. To lock all bytes, use a size of Long.MAX_VALUE.

Keep in mind that file locking is system dependent. Here are some points to watch for:

  • On some systems, file locking is merely advisory. If an application fails to get a lock, it may still write to a file that another application has currently locked.

  • On some systems, you cannot simultaneously lock a file and map it into memory.

  • File locks are held by the entire Java virtual machine. If two programs are launched by the same virtual machine (such as an applet or application launcher), then they can’t each acquire a lock on the same file. The lock and tryLock methods will throw an OverlappingFileLockException if the virtual machine already holds another overlapping lock on the same file.

  • On some systems, closing a channel releases all locks on the underlying file held by the Java virtual machine. You should therefore avoid multiple channels on the same locked file.

  • Locking files on a networked file system is highly system dependent and should probably be avoided.
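The point about locks being held by the entire virtual machine can be observed with a short experiment. The sketch below deliberately opens two channels on the same file, which you should avoid in production code, and checks that the second lock attempt is rejected. The class OverlapDemo and its method are hypothetical, and we assume the default lock checking of current JDKs.

```java
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.channels.FileLock;
import java.nio.channels.OverlappingFileLockException;

class OverlapDemo
{
   // Locks the file through one channel, then tries to lock it again
   // through a second channel in the same virtual machine.
   static boolean secondLockRejected(File file) throws IOException
   {
      try (RandomAccessFile first = new RandomAccessFile(file, "rw");
           RandomAccessFile second = new RandomAccessFile(file, "rw"))
      {
         FileLock lock = first.getChannel().lock();
         try
         {
            second.getChannel().tryLock();
            return false; // not expected: the second lock was granted
         }
         catch (OverlappingFileLockException e)
         {
            return true; // the virtual machine already holds an overlapping lock
         }
         finally
         {
            lock.release();
         }
      }
   }

   public static void main(String[] args) throws IOException
   {
      File file = File.createTempFile("overlapdemo", ".dat");
      file.deleteOnExit();
      System.out.println(secondLockRejected(file));
   }
}
```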

Regular Expressions

Regular expressions are used to specify string patterns. You can use regular expressions whenever you need to locate strings that match a particular pattern. For example, one of our sample programs locates all hyperlinks in an HTML file by looking for strings of the pattern <a href="...">.

Of course, for specifying a pattern, the ... notation is not precise enough. You need to specify precisely what sequence of characters is a legal match. You need to use a special syntax whenever you describe a pattern.

Here is a simple example. The regular expression

[Jj]ava.+

matches any string of the following form:

  • The first letter is a J or j.

  • The next three letters are ava.

  • The remainder of the string consists of one or more arbitrary characters.

For example, the string "javanese" matches this regular expression, but the string "Core Java" does not.
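You can try this out with the static Pattern.matches method of the java.util.regex package, which tests whether an entire string matches a regular expression. The class JavaPatternDemo in this sketch is a placeholder of our own.

```java
import java.util.regex.Pattern;

class JavaPatternDemo
{
   // Returns true if the entire input matches [Jj]ava.+
   static boolean matches(String input)
   {
      return Pattern.matches("[Jj]ava.+", input);
   }

   public static void main(String[] args)
   {
      System.out.println(matches("javanese"));  // true
      System.out.println(matches("Core Java")); // false: does not start with J or j
      System.out.println(matches("Java"));      // false: .+ needs at least one more character
   }
}
```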

As you can see, you need to know a bit of syntax to understand the meaning of a regular expression. Fortunately, for most purposes, a small number of straightforward constructs are sufficient.

  • A character class is a set of character alternatives, enclosed in brackets, such as [Jj], [0-9], [A-Za-z], or [^0-9]. Here the - denotes a range (all characters whose Unicode value falls between the two bounds), and ^ denotes the complement (all characters except the ones specified).

  • There are many predefined character classes such as \d (digits) or \p{Sc} (Unicode currency symbol). See Tables 1-7 and 1-8.

    Table 1-7. Regular Expression Syntax

    Syntax                            Explanation

    Characters
    c                                 The character c
    \unnnn, \xnn, \0n, \0nn, \0nnn    The code unit with the given hex or octal value
    \t, \n, \r, \f, \a, \e            The control characters tab, newline, return, form feed, alert, and escape
    \cc                               The control character corresponding to the character c

    Character Classes
    [C1C2. . .]                       Any of the characters represented by C1, C2, . . . The Ci are characters, character ranges (c1-c2), or character classes
    [^. . .]                          Complement of character class
    [ . . . && . . .]                 Intersection of two character classes

    Predefined Character Classes
    .                                 Any character except line terminators (or any character if the DOTALL flag is set)
    \d                                A digit [0-9]
    \D                                A nondigit [^0-9]
    \s                                A whitespace character [ \t\n\r\f\x0B]
    \S                                A nonwhitespace character
    \w                                A word character [a-zA-Z0-9_]
    \W                                A nonword character
    \p{name}                          A named character class; see Table 1-8
    \P{name}                          The complement of a named character class

    Boundary Matchers
    ^ $                               Beginning, end of input (or beginning, end of line in multiline mode)
    \b                                A word boundary
    \B                                A nonword boundary
    \A                                Beginning of input
    \z                                End of input
    \Z                                End of input except final line terminator
    \G                                End of previous match

    Quantifiers
    X?                                Optional X
    X*                                X, 0 or more times
    X+                                X, 1 or more times
    X{n} X{n,} X{n,m}                 X n times, at least n times, between n and m times

    Quantifier Suffixes
    ?                                 Turn default (greedy) match into reluctant match
    +                                 Turn default (greedy) match into possessive match

    Set Operations
    XY                                Any string from X, followed by any string from Y
    X|Y                               Any string from X or Y

    Grouping
    (X)                               Capture the string matching X as a group
    \n                                The match of the nth group

    Escapes
    \c                                The character c (must not be an alphabetic character)
    \Q . . . \E                       Quote . . . verbatim
    (? . . . )                        Special construct; see the API notes of the Pattern class

    Table 1-8. Predefined Character Class Names

    Character Class Name      Explanation

    Lower                     ASCII lower case [a-z]
    Upper                     ASCII upper case [A-Z]
    Alpha                     ASCII alphabetic [A-Za-z]
    Digit                     ASCII digits [0-9]
    Alnum                     ASCII alphabetic or digit [A-Za-z0-9]
    XDigit                    Hex digits [0-9A-Fa-f]
    Graph                     Visible ASCII character [\x21-\x7E]
    Print                     Printable ASCII character [\x20-\x7E], that is, Graph plus the space character
    Punct                     ASCII nonalpha or digit [\p{Graph}&&\P{Alnum}]
    ASCII                     All ASCII [\x00-\x7F]
    Cntrl                     ASCII control character [\x00-\x1F\x7F]
    Blank                     Space or tab [ \t]
    Space                     Whitespace [ \t\n\r\f\x0B]
    javaLowerCase             Lower case, as determined by Character.isLowerCase()
    javaUpperCase             Upper case, as determined by Character.isUpperCase()
    javaWhitespace            Whitespace, as determined by Character.isWhitespace()
    javaMirrored              Mirrored, as determined by Character.isMirrored()
    InBlock                   Block is the name of a Unicode character block, with spaces removed, such as BasicLatin or Mongolian. See http://www.unicode.org for a list of block names.
    Category or IsCategory    Category is the name of a Unicode character category such as L (letter) or Sc (currency symbol). See http://www.unicode.org for a list of category names.

  • Most characters match themselves, such as the ava characters in the preceding example.

  • The . symbol matches any character (except possibly line terminators, depending on flag settings).

  • Use \ as an escape character; for example, \. matches a period and \\ matches a backslash.

  • ^ and $ match the beginning and end of a line, respectively.

  • If X and Y are regular expressions, then XY means “any match for X followed by a match for Y”. X | Y means “any match for X or Y”.

  • You can apply quantifiers X+ (1 or more), X* (0 or more), and X? (0 or 1) to an expression X.

  • By default, a quantifier matches the largest possible repetition that makes the overall match succeed (a greedy match). You can modify that behavior with the suffixes ? (reluctant or stingy match: match the smallest repetition count) and + (possessive match: match the largest count even if that makes the overall match fail).

    For example, the string cab matches [a-z]*ab but not [a-z]*+ab. In the first case, the expression [a-z]* only matches the character c, so that the characters ab match the remainder of the pattern. But the possessive version [a-z]*+ matches the characters cab, leaving the remainder of the pattern unmatched.

  • You can use groups to define subexpressions. Enclose the groups in ( ); for example, ([+-]?)([0-9]+). You can then ask the pattern matcher to return the match of each group or to refer back to a group with \n, where n is the group number (starting with 1).
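The quantifier and group rules in the list above can be checked with a few one-line experiments. In this sketch, the class QuantifierDemo and the helper firstMatch are hypothetical, and the sample strings are our own. Note that every backslash of a regular expression must be doubled inside a Java string literal.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class QuantifierDemo
{
   // Returns the first substring of input that matches regex, or null if there is none.
   static String firstMatch(String regex, String input)
   {
      Matcher matcher = Pattern.compile(regex).matcher(input);
      return matcher.find() ? matcher.group() : null;
   }

   public static void main(String[] args)
   {
      // Greedy: [a-z]* gives back "ab" so that the overall match succeeds
      System.out.println("cab".matches("[a-z]*ab"));      // true
      // Possessive: [a-z]*+ keeps all of "cab", so "ab" cannot match
      System.out.println("cab".matches("[a-z]*+ab"));     // false
      // Greedy versus reluctant repetition
      System.out.println(firstMatch("<.+>", "<a><b>"));   // <a><b>
      System.out.println(firstMatch("<.+?>", "<a><b>"));  // <a>
      // \1 refers back to group 1: a doubled lowercase letter
      System.out.println(firstMatch("([a-z])\\1", "hello")); // ll
   }
}
```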

For example, here is a somewhat complex but potentially useful regular expression—it describes decimal or hexadecimal integers:

[+-]?[0-9]+|0[Xx][0-9A-Fa-f]+
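A short test shows how this expression behaves. Because | has the lowest precedence, the optional sign belongs only to the decimal alternative. The class IntegerPatternDemo and the sample inputs are assumptions made for this sketch.

```java
import java.util.regex.Pattern;

class IntegerPatternDemo
{
   private static final Pattern INTEGER = Pattern.compile("[+-]?[0-9]+|0[Xx][0-9A-Fa-f]+");

   // Returns true if the entire input is a decimal or hexadecimal integer.
   static boolean isInteger(String input)
   {
      return INTEGER.matcher(input).matches();
   }

   public static void main(String[] args)
   {
      System.out.println(isInteger("42"));     // true
      System.out.println(isInteger("-17"));    // true
      System.out.println(isInteger("0x1F"));   // true
      System.out.println(isInteger("1F"));     // false: hex digits need the 0x prefix
      System.out.println(isInteger("+0x1F"));  // false: the sign applies to decimals only
   }
}
```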

Unfortunately, the expression syntax is not completely standardized between the various programs and libraries that use regular expressions. Although there is consensus on the basic constructs, there are many maddening differences in the details. The Java regular expression classes use a syntax that is similar to, but not quite the same as, the one used in the Perl language. Table 1-7 shows all constructs of the Java syntax. For more information on the regular expression syntax, consult the API documentation for the Pattern class or the book Mastering Regular Expressions by Jeffrey E. F. Friedl (O’Reilly and Associates, 1997).

The simplest use for a regular expression is to test whether a particular string matches it. Here is how you program that test in Java. First construct a Pattern object from the string denoting the regular expression. Then get a Matcher object from the pattern, and call its matches method:

Pattern pattern = Pattern.compile(patternString);
Matcher matcher = pattern.matcher(input);
if (matcher.matches()) . . .

The input of the matcher is an object of any class that implements the CharSequence interface, such as a String, StringBuilder, or CharBuffer.

When compiling the pattern, you can set one or more flags, for example,

Pattern pattern = Pattern.compile(patternString,
   Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE);

The following six flags are supported:

  • CASE_INSENSITIVE: Match characters independently of the letter case. By default, this flag takes only US ASCII characters into account.

  • UNICODE_CASE: When used in combination with CASE_INSENSITIVE, use Unicode letter case for matching.

  • MULTILINE: ^ and $ match the beginning and end of a line, not the entire input.

  • UNIX_LINES: Only '\n' is recognized as a line terminator when matching ^ and $ in multiline mode.

  • DOTALL: When using this flag, the . symbol matches all characters, including line terminators.

  • CANON_EQ: Takes canonical equivalence of Unicode characters into account. For example, u followed by ¨ (diaeresis) matches ü.
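The effect of some of these flags can be seen in the following sketch. The class FlagsDemo, its two helper methods, and the sample strings are our own; the Unicode escapes in the string literals denote a lowercase and an uppercase a with diaeresis.

```java
import java.util.regex.Pattern;

class FlagsDemo
{
   // Tests whether the entire input matches regex under the given flags.
   static boolean matches(String regex, int flags, String input)
   {
      return Pattern.compile(regex, flags).matcher(input).matches();
   }

   // Tests whether some substring of input matches regex under the given flags.
   static boolean finds(String regex, int flags, String input)
   {
      return Pattern.compile(regex, flags).matcher(input).find();
   }

   public static void main(String[] args)
   {
      // ASCII letters: CASE_INSENSITIVE alone is enough
      System.out.println(matches("java", Pattern.CASE_INSENSITIVE, "JAVA"));
      // Non-ASCII letters need UNICODE_CASE as well
      System.out.println(matches("\u00e4pfel", Pattern.CASE_INSENSITIVE, "\u00c4PFEL"));
      System.out.println(matches("\u00e4pfel",
         Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE, "\u00c4PFEL"));
      // DOTALL lets the . symbol match a newline
      System.out.println(matches("a.b", 0, "a\nb"));
      System.out.println(matches("a.b", Pattern.DOTALL, "a\nb"));
      // MULTILINE lets ^ and $ match at line breaks inside the input
      System.out.println(finds("^b$", 0, "a\nb\nc"));
      System.out.println(finds("^b$", Pattern.MULTILINE, "a\nb\nc"));
   }
}
```

The program should print true, false, true, false, true, false, true, one value per line.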

If the regular expression contains groups, then the Matcher object can reveal the group boundaries. The methods

int start(int groupIndex)
int end(int groupIndex)

yield the starting index and the past-the-end index of a particular group.

You can simply extract the matched string by calling

String group(int groupIndex)

Group 0 is the entire match; the group index for the first actual group is 1. Call the groupCount method to get the total group count (group 0 is not included in that count).

Nested groups are ordered by the opening parentheses. For example, given the pattern

((1?[0-9]):([0-5][0-9]))[ap]m

and the input

11:59am

the matcher reports the following groups:

Group Index   Start   End   String
0             0       7     11:59am
1             0       5     11:59
2             0       2     11
3             3       5     59
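The group boundaries reported for this example can be reproduced with the start, end, and group methods. The class GroupDemo and the method describeGroups in the following sketch are hypothetical names of our own.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GroupDemo
{
   // Returns one line per group in the form "index start end string",
   // or an empty list if the input does not match.
   static List<String> describeGroups(String regex, String input)
   {
      List<String> rows = new ArrayList<>();
      Matcher matcher = Pattern.compile(regex).matcher(input);
      if (matcher.matches())
         for (int i = 0; i <= matcher.groupCount(); i++)
            rows.add(i + " " + matcher.start(i) + " " + matcher.end(i) + " " + matcher.group(i));
      return rows;
   }

   public static void main(String[] args)
   {
      for (String row : describeGroups("((1?[0-9]):([0-5][0-9]))[ap]m", "11:59am"))
         System.out.println(row);
   }
}
```

Note that the loop runs up to and including groupCount(), because group 0 is not part of that count.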

Listing 1-8 prompts for a pattern, then for strings to match. It prints out whether or not the input matches the pattern. If the input matches and the pattern contains groups, then the program prints the group boundaries as parentheses, such as

((11):(59))am

Example 1-8. RegExTest.java

import java.util.*;
import java.util.regex.*;

/**
 * This program tests regular expression matching.
 * Enter a pattern and strings to match, or enter an empty
 * line to exit. If the pattern contains groups, the group
 * boundaries are displayed in the match.
 * @version 1.01 2004-05-11
 * @author Cay Horstmann
 */
public class RegExTest
{
   public static void main(String[] args)
   {
      Scanner in = new Scanner(System.in);
      System.out.println("Enter pattern: ");
      String patternString = in.nextLine();

      Pattern pattern = null;
      try
      {
         pattern = Pattern.compile(patternString);
      }
      catch (PatternSyntaxException e)
      {
         System.out.println("Pattern syntax error");
         System.exit(1);
      }

      while (true)
      {
         System.out.println("Enter string to match: ");
         // stop at end of input or at an empty line
         if (!in.hasNextLine()) return;
         String input = in.nextLine();
         if (input.equals("")) return;
         Matcher matcher = pattern.matcher(input);
         if (matcher.matches())
         {
            System.out.println("Match");
            int g = matcher.groupCount();
            if (g > 0)
            {
               for (int i = 0; i < input.length(); i++)
               {
                  // print an opening parenthesis for every group that starts here
                  for (int j = 1; j <= g; j++)
                     if (i == matcher.start(j))
                        System.out.print('(');
                  System.out.print(input.charAt(i));
                  // print a closing parenthesis for every group that ends here
                  for (int j = 1; j <= g; j++)
                     if (i + 1 == matcher.end(j))
                        System.out.print(')');
               }
               System.out.println();
            }
         }
         else
            System.out.println("No match");
      }
   }
}

Usually, you don’t want to match the entire input against a regular expression, but you want to find one or more matching substrings in the input. Use the find method of the Matcher class to find the next match. If it returns true, use the start and end methods to find the extent of the match.

while (matcher.find())
{
   int start = matcher.start();
   int end = matcher.end();
   String match = input.substring(start, end);
   . . .
}
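As a small self-contained illustration of this loop, the following sketch collects all digit sequences from a string. The class FindDemo, the method findAll, and the sample input are made up for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class FindDemo
{
   // Collects every substring of input that matches regex, in order of occurrence.
   static List<String> findAll(String regex, String input)
   {
      List<String> matches = new ArrayList<>();
      Matcher matcher = Pattern.compile(regex).matcher(input);
      while (matcher.find())
      {
         int start = matcher.start();
         int end = matcher.end();
         matches.add(input.substring(start, end));
      }
      return matches;
   }

   public static void main(String[] args)
   {
      System.out.println(findAll("[0-9]+", "3 apples, 12 pears, 7 plums")); // [3, 12, 7]
   }
}
```

Calling matcher.group() without an argument would return the same substring as the call to substring.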

Listing 1-9 puts this mechanism to work. It locates all hypertext references in a web page and prints them. To run the program, supply a URL on the command line, such as

java HrefMatch http://www.horstmann.com

Example 1-9. HrefMatch.java

import java.io.*;
import java.net.*;
import java.util.regex.*;

/**
 * This program displays all URLs in a web page by matching a regular expression that describes
 * the <a href=...> HTML tag. Start the program as <br>
 * java HrefMatch URL
 * @version 1.01 2004-06-04
 * @author Cay Horstmann
 */
public class HrefMatch
{
   public static void main(String[] args)
   {
      try
      {
         // get URL string from command line or use default
         String urlString;
         if (args.length > 0) urlString = args[0];
         else urlString = "http://java.sun.com";

         // read contents into string builder; the reader is closed automatically
         StringBuilder input = new StringBuilder();
         try (InputStreamReader in = new InputStreamReader(new URL(urlString).openStream()))
         {
            int ch;
            while ((ch = in.read()) != -1)
               input.append((char) ch);
         }

         // search for all occurrences of pattern
         // (backslashes and quotation marks must be escaped inside the string literal)
         String patternString = "<a\\s+href\\s*=\\s*(\"[^\"]*\"|[^\\s>]*)\\s*>";
         Pattern pattern = Pattern.compile(patternString, Pattern.CASE_INSENSITIVE);
         Matcher matcher = pattern.matcher(input);

         while (matcher.find())
         {
            int start = matcher.start();
            int end = matcher.end();
            String match = input.substring(start, end);
            System.out.println(match);
         }
      }
      catch (IOException e)
      {
         e.printStackTrace();
      }
      catch (PatternSyntaxException e)
      {
         e.printStackTrace();
      }
   }
}

The replaceAll method of the Matcher class replaces all occurrences of a regular expression with a replacement string. For example, the following instructions replace all sequences of digits with a # character.

Pattern pattern = Pattern.compile("[0-9]+");
Matcher matcher = pattern.matcher(input);
String output = matcher.replaceAll("#");

The replacement string can contain references to groups in the pattern: $n is replaced with the nth group. Use \$ to include a $ character in the replacement text.

The replaceFirst method replaces only the first occurrence of the pattern.
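The following sketch exercises replaceAll and replaceFirst, including group references and an escaped $ in the replacement string. The class ReplaceDemo, its helper methods, and the sample strings are assumptions of our own.

```java
import java.util.regex.Pattern;

class ReplaceDemo
{
   static String replaceAll(String regex, String input, String replacement)
   {
      return Pattern.compile(regex).matcher(input).replaceAll(replacement);
   }

   static String replaceFirst(String regex, String input, String replacement)
   {
      return Pattern.compile(regex).matcher(input).replaceFirst(replacement);
   }

   public static void main(String[] args)
   {
      System.out.println(replaceAll("[0-9]+", "a1b22c333", "#"));    // a#b#c#
      System.out.println(replaceFirst("[0-9]+", "a1b22c333", "#"));  // a#b22c333
      // $2 and $1 refer to the second and first group of the pattern
      System.out.println(replaceAll("(\\w+), (\\w+)", "Horstmann, Cay", "$2 $1")); // Cay Horstmann
      // \$ is a literal dollar sign; $0 is the entire match
      System.out.println(replaceAll("[0-9]+", "cost 42", "\\$$0"));  // cost $42
   }
}
```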

Finally, the Pattern class has a split method that splits an input into an array of strings, using the regular expression matches as boundaries. For example, the following instructions split the input into tokens, where the delimiters are punctuation marks surrounded by optional whitespace.

Pattern pattern = Pattern.compile("\\s*\\p{Punct}\\s*");
String[] tokens = pattern.split(input);
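Here is a complete version of that fragment that you can run. The class SplitDemo and the sample input are our own choices.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

class SplitDemo
{
   // Splits the input at punctuation marks, discarding surrounding whitespace.
   static String[] tokens(String input)
   {
      Pattern pattern = Pattern.compile("\\s*\\p{Punct}\\s*");
      return pattern.split(input);
   }

   public static void main(String[] args)
   {
      System.out.println(Arrays.toString(tokens("alpha , beta;gamma : delta")));
      // prints [alpha, beta, gamma, delta]
   }
}
```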

You have now seen how to carry out input and output operations in Java, and you had an overview of the regular expression package that was a part of the “new I/O” specification. In the next chapter, we turn to the processing of XML data.
