Chapters 12 and 13 introduced you to Java’s classic I/O APIs. Chapter 12 presented classic I/O in terms of java.io’s File, RandomAccessFile, stream, and writer/reader types. Chapter 13 presented classic I/O in terms of java.net’s socket and URL types.
Modern operating systems offer powerful I/O features that Java’s classic I/O APIs don’t support. These features include memory-mapped file I/O (the ability to map part of a process’s virtual memory to some portion of a file so that writes to or reads from that part of the process’s memory space actually write or read the associated portion of the file; a process is an executing application, and virtual memory is described at http://en.wikipedia.org/wiki/Virtual_memory), readiness selection (a step above nonblocking I/O that offloads to the operating system the work involved in checking whether I/O streams are ready to perform write and read operations), and file locking (the ability of one process to prevent other processes from accessing a file or to limit that access in some way).
Java 1.4 introduced a more powerful I/O architecture that supports memory-mapped file I/O, readiness selection, file locking, and more. This architecture largely consists of buffers, channels, selectors, regular expressions, and charsets, and it is commonly known as new I/O (NIO).
Regular expressions are included as part of NIO (see JSR 51 at http://jcp.org/en/jsr/detail?id=51) because NIO is all about performance, and regular expressions are useful for scanning text (read from an I/O source) in a highly performant manner.
In this chapter, we introduce you to NIO in terms of buffers, channels, selectors, regular expressions, and charsets. We also discuss the simple printf-style formatting facility proposed in JSR 51 but not implemented until Java 5.
Working with Buffers
NIO is based on buffers. A buffer is an object that stores a fixed amount of data to be sent to or received from an I/O service (a means for performing input/output). It sits between an application and a channel that writes the buffered data to the service or reads the data from the service and deposits it into the buffer. A buffer has the following four properties:
Capacity: The total number of data items that can be stored in the buffer. The capacity is specified when the buffer is created and cannot be changed later.
Limit: The number of “live” data items in the buffer. No items starting from the zero-based limit should be written or read.
Position: The zero-based index of the next data item that can be read or the location where the data item can be written.
Mark: A zero-based position that can be recalled. The mark is initially undefined.
These four properties are related as follows:
0 <= mark <= position <= limit <= capacity
Figure 14-1’s buffer can store a maximum of seven elements. The mark is initially undefined, the position is initially set to 0, and the limit is initially set to the capacity (7), which specifies the maximum number of bytes that can be stored in the buffer. You can only access positions 0 through 6. Position 7 lies beyond the buffer.
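The relationship between these properties can be observed by filling, flipping, marking, and resetting a small buffer. The following is a minimal sketch; the class name BufferPropertiesSketch and the state() helper are assumptions made for illustration.

```java
import java.nio.ByteBuffer;
import java.util.Arrays;

final class BufferPropertiesSketch {
    // Returns {position, limit, capacity} so that the state is easy to inspect.
    static int[] state(ByteBuffer b) {
        return new int[] { b.position(), b.limit(), b.capacity() };
    }

    public static void main(String[] args) {
        ByteBuffer buffer = ByteBuffer.allocate(7);           // position 0, limit 7, capacity 7
        buffer.put((byte) 10).put((byte) 20).put((byte) 30);  // position advances to 3
        buffer.flip();                                        // limit = 3, position = 0
        buffer.get();                                         // position = 1
        buffer.mark();                                        // mark = 1
        buffer.get();                                         // position = 2
        buffer.reset();                                       // position returns to the mark (1)
        System.out.println(Arrays.toString(state(buffer)));   // prints [1, 3, 7]
    }
}
```

At every step, the mark (once defined) never exceeds the position, the position never exceeds the limit, and the limit never exceeds the capacity.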
Buffer and Its Children
Buffers are implemented by classes that are derived from the abstract java.nio.Buffer class. The API documentation describes Buffer’s methods.
The documentation also shows that all buffers can be read but not all buffers can be written—for example, a buffer backed by a memory-mapped file that’s read-only. You must not write to a read-only buffer; otherwise, ReadOnlyBufferException is thrown. Call isReadOnly() when you’re unsure that a buffer is writable before attempting to write to that buffer.
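The check-before-write advice might look as follows. This is a sketch only: the class name and the tryPut() helper are hypothetical, and asReadOnlyBuffer() is used merely as a convenient way to obtain a read-only buffer without a memory-mapped file.

```java
import java.nio.ByteBuffer;
import java.nio.ReadOnlyBufferException;

final class ReadOnlyBufferSketch {
    // Attempts a write and reports whether the buffer accepted it.
    static boolean tryPut(ByteBuffer buffer, byte value) {
        if (buffer.isReadOnly()) {
            return false;               // checking first avoids the exception
        }
        buffer.put(value);
        return true;
    }

    public static void main(String[] args) {
        ByteBuffer writable = ByteBuffer.allocate(4);
        ByteBuffer readOnly = writable.asReadOnlyBuffer(); // shares content, rejects writes
        System.out.println(tryPut(writable, (byte) 1));    // prints true
        System.out.println(tryPut(readOnly, (byte) 2));    // prints false
        try {
            readOnly.put((byte) 3);                        // writing without the check
        } catch (ReadOnlyBufferException e) {
            System.out.println("write rejected: " + e.getClass().getSimpleName());
        }
    }
}
```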
Buffers are not thread-safe. You must employ synchronization when you want to access a buffer from multiple threads.
The java.nio package includes several abstract classes that extend Buffer, one for each primitive type except for boolean: ByteBuffer, CharBuffer, DoubleBuffer, FloatBuffer, IntBuffer, LongBuffer, and ShortBuffer. Furthermore, this package includes MappedByteBuffer as an abstract ByteBuffer subclass.
Operating systems perform byte-oriented I/O, and you use ByteBuffer to create byte-oriented buffers that store the bytes to write to a destination or that are read from a source. The other primitive-type buffer classes let you create multibyte view buffers (discussed later) so that you can conceptually perform I/O in terms of characters, double-precision floating-point values, 32-bit integers, and so on. However, the I/O operation is really being carried out as a flow of bytes.
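To illustrate that a view buffer is only a typed window onto bytes, the following sketch writes two 32-bit integers through an IntBuffer view and then inspects the underlying bytes. The class and method names are assumptions; the byte layout shown relies on ByteBuffer’s default big-endian byte order.

```java
import java.nio.ByteBuffer;
import java.nio.IntBuffer;
import java.util.Arrays;

final class ViewBufferSketch {
    // Writes ints through an IntBuffer view and returns the backing bytes.
    static byte[] intsToBytes(int... values) {
        ByteBuffer bytes = ByteBuffer.allocate(values.length * Integer.BYTES);
        IntBuffer ints = bytes.asIntBuffer(); // the view shares the byte buffer's storage
        ints.put(values);                     // advances only the view's position
        return bytes.array();                 // heap buffers expose their backing array
    }

    public static void main(String[] args) {
        // 1 is 0x00000001 and 258 is 0x00000102
        System.out.println(Arrays.toString(intsToBytes(1, 258)));
        // prints [0, 0, 0, 1, 0, 0, 1, 2]
    }
}
```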
Demonstrating a Byte-Oriented Buffer
Listing 14-1’s main() method first needs to obtain a buffer. It cannot instantiate the Buffer class because that class is abstract. Instead, it uses the ByteBuffer class and its allocate() class method to allocate the 7-byte buffer shown in Figure 14-1. main() then calls assorted Buffer methods to demonstrate capacity, limit, position, and remaining elements.
The final output line reveals that the ByteBuffer instance assigned to buffer is actually an instance of the package-private java.nio.HeapByteBuffer class.
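A program along the lines just described might look like the following sketch. It is not the book’s listing: the class name BufferDemoSketch and the describe() helper are assumptions, and the concrete class name printed on the last line is an implementation detail of the Java runtime.

```java
import java.nio.Buffer;
import java.nio.ByteBuffer;

final class BufferDemoSketch {
    // Summarizes the capacity, limit, position, and remaining element count.
    static String describe(Buffer buffer) {
        return "capacity=" + buffer.capacity() + " limit=" + buffer.limit()
            + " position=" + buffer.position() + " remaining=" + buffer.remaining();
    }

    public static void main(String[] args) {
        Buffer buffer = ByteBuffer.allocate(7); // Buffer itself is abstract
        System.out.println(describe(buffer));
        buffer.limit(5);                        // shrink the live region to five elements
        buffer.position(3);                     // the next read or write happens at index 3
        System.out.println(describe(buffer));   // remaining = limit - position = 2
        System.out.println(buffer);             // toString() includes the concrete class name
    }
}
```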
The documentation tells you more about the buffers’ capabilities.
Working with Channels
Channels partner with buffers to achieve high-performance I/O. A channel is an object that represents an open connection to a hardware device, a file, a network socket, an application component, or another entity that’s capable of performing write, read, and other I/O operations. Channels efficiently transfer data between byte buffers and I/O service sources or destinations.
Channels are the gateways through which native I/O services are accessed. Channels use byte buffers as the endpoints for sending and receiving data.
There often exists a one-to-one correspondence between an operating system file handle or file descriptor and a channel. When you work with channels in a file context, the channel will often be connected to an open file descriptor. Despite channels being more abstract than file descriptors, they are still capable of modeling an operating system’s native I/O facilities.
Channel and Its Children
Java supports channels via its java.nio.channels and java.nio.channels.spi packages. Applications interact with the types located in the former package; developers who are defining new selector providers work with the latter package. (We will discuss selectors later in this chapter.)
void close(): Closes this channel. When this channel is already closed, invoking close() has no effect. When another thread has already invoked close(), a new close() invocation blocks until the first invocation finishes, after which close() returns without effect. This method throws IOException when an I/O error occurs. After the channel is closed, any further attempts to invoke I/O operations upon it result in java.nio.channels.ClosedChannelException being thrown.
boolean isOpen(): Returns this channel’s open status. This method returns true when the channel is open; otherwise, it returns false.
WritableByteChannel declares an abstract int write(ByteBuffer buffer) method that writes a sequence of bytes from buffer to the current channel. This method returns the number of bytes actually written. It throws java.nio.channels.NonWritableChannelException when the channel was not opened for writing, java.nio.channels.ClosedChannelException when the channel is closed, java.nio.channels.AsynchronousCloseException when another thread closes the channel during the write, java.nio.channels.ClosedByInterruptException when another thread interrupts the current thread while the write operation is in progress (thereby closing the channel and setting the current thread’s interrupt status), and java.io.IOException when some other I/O error occurs.
ReadableByteChannel declares an abstract int read(ByteBuffer buffer) method that reads bytes from the current channel into buffer. This method returns the number of bytes actually read (or -1 when there are no more bytes to read). It throws java.nio.channels.NonReadableChannelException when the channel was not opened for reading; ClosedChannelException when the channel is closed; AsynchronousCloseException when another thread closes the channel during the read; ClosedByInterruptException when another thread interrupts the current thread while the read operation is in progress, thereby closing the channel and setting the current thread’s interrupt status; and IOException when some other I/O error occurs.
A channel whose class implements only WritableByteChannel or ReadableByteChannel is unidirectional. Attempting to read from a writable byte channel or write to a readable byte channel results in a thrown exception.
You can use the instanceof operator to determine if a channel instance implements either interface. Because it’s somewhat awkward to test for both interfaces, Java supplies the java.nio.channels.ByteChannel interface, which is an empty marker interface that subtypes WritableByteChannel and ReadableByteChannel. When you need to learn whether or not a channel is bidirectional, it’s more convenient to specify an expression such as channel instanceof ByteChannel.
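The instanceof tests might be arranged as in this sketch, which checks for ByteChannel first and then falls back to the two unidirectional interfaces. The class name and the direction() helper are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.nio.channels.ByteChannel;
import java.nio.channels.Channel;
import java.nio.channels.Channels;
import java.nio.channels.ReadableByteChannel;
import java.nio.channels.WritableByteChannel;

final class ChannelDirectionSketch {
    // Classifies a channel by the byte-channel interfaces that it implements.
    static String direction(Channel channel) {
        if (channel instanceof ByteChannel) return "bidirectional";
        if (channel instanceof ReadableByteChannel) return "read-only";
        if (channel instanceof WritableByteChannel) return "write-only";
        return "not a byte channel";
    }

    public static void main(String[] args) {
        // Channels wrapped around in-memory streams are unidirectional.
        Channel in = Channels.newChannel(new ByteArrayInputStream(new byte[0]));
        Channel out = Channels.newChannel(new ByteArrayOutputStream());
        System.out.println(direction(in));   // prints read-only
        System.out.println(direction(out));  // prints write-only
    }
}
```

A channel whose class implements ByteChannel, such as a SocketChannel, would be reported as bidirectional.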
Channel is also extended by the java.nio.channels.InterruptibleChannel interface. InterruptibleChannel describes a channel that can be asynchronously closed and interrupted. This interface overrides its Channel superinterface’s close() method header, presenting the following additional stipulation to Channel’s contract for this method: any thread currently blocked in an I/O operation upon this channel will receive AsynchronousCloseException (an IOException descendant).
A channel that implements this interface is asynchronously closeable: when a thread is blocked in an I/O operation on an interruptible channel, another thread may invoke the channel’s close() method. This causes the blocked thread to receive a thrown AsynchronousCloseException instance.
A channel that implements this interface is also interruptible: when a thread is blocked in an I/O operation on an interruptible channel, another thread may invoke the blocked thread’s interrupt() method. Doing this causes the channel to be closed, the blocked thread to receive a thrown ClosedByInterruptException instance, and the blocked thread to have its interrupt status set. (When a thread’s interrupt status is already set and it invokes a blocking I/O operation on a channel, the channel is closed and the thread will immediately receive a thrown ClosedByInterruptException instance; its interrupt status will remain set.)
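The parenthetical case (interrupt status already set before a blocking call) is easy to reproduce in a single thread. The sketch below assumes a pipe’s source channel as the interruptible channel; the class and method names are made up for this illustration.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.ClosedByInterruptException;
import java.nio.channels.Pipe;

final class InterruptSketch {
    // Returns a short report of what happened to the channel and the thread.
    static String readWhileInterrupted() throws IOException {
        Pipe pipe = Pipe.open();
        try {
            Thread.currentThread().interrupt();           // set the interrupt status first
            pipe.source().read(ByteBuffer.allocate(16));  // blocking read on an empty pipe
            return "read returned normally";
        } catch (ClosedByInterruptException e) {
            boolean stillSet = Thread.currentThread().isInterrupted();
            return "open=" + pipe.source().isOpen() + " interrupted=" + stillSet;
        } finally {
            Thread.interrupted();                         // clear the status for the caller
            pipe.sink().close();
            pipe.source().close();
        }
    }

    public static void main(String[] args) throws IOException {
        // Per the contract described above, the channel should be reported
        // as closed and the interrupt status as still set.
        System.out.println(readWhileInterrupted());
    }
}
```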
NIO’s designers chose to shut down a channel when a blocked thread is interrupted because they couldn’t find a way to handle interrupted I/O operations reliably in the same manner across platforms. The only way to guarantee deterministic behavior was to shut down the channel.
You can determine whether or not a channel supports asynchronous closing and interruption by using the instanceof operator in an expression such as channel instanceof InterruptibleChannel.
The java.nio.channels package provides a Channels utility class that offers two methods for obtaining channels from streams. For each of the following methods, the underlying stream is closed when the channel is closed, and the channel isn’t buffered:
WritableByteChannel newChannel(OutputStream outputStream) returns a writable byte channel for the given outputStream.
ReadableByteChannel newChannel(InputStream inputStream) returns a readable byte channel for the given inputStream.
Various classic I/O classes have been retrofitted to support channel creation. For example, RandomAccessFile declares a FileChannel getChannel() method for returning a file channel instance, and java.net.Socket declares a SocketChannel getChannel() method for returning a socket channel.
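As an illustration of the retrofitted getChannel() method, the following sketch writes text through a RandomAccessFile’s channel and reads it back. The class name, the roundTrip() helper, and the temporary file are assumptions made for this example.

```java
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;

final class FileChannelSketch {
    // Writes text to a scratch file via a file channel and reads it back.
    static String roundTrip(String text) throws IOException {
        File file = File.createTempFile("channel-sketch", ".tmp"); // hypothetical scratch file
        try (RandomAccessFile raf = new RandomAccessFile(file, "rw");
             FileChannel channel = raf.getChannel()) {
            ByteBuffer out = ByteBuffer.wrap(text.getBytes(StandardCharsets.UTF_8));
            while (out.hasRemaining()) {
                channel.write(out);          // one write() may not drain the buffer
            }
            channel.position(0);             // rewind the channel before reading
            ByteBuffer in = ByteBuffer.allocate((int) channel.size());
            while (in.hasRemaining() && channel.read(in) != -1) {
                // keep reading until the buffer is full or end-of-file is reached
            }
            in.flip();
            return StandardCharsets.UTF_8.decode(in).toString();
        } finally {
            file.delete();
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println(roundTrip("hello, channel")); // prints hello, channel
    }
}
```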
Copying Bytes from an Input Channel to an Output Channel
Listing 14-2 presents two approaches to copying bytes from the standard input stream to the standard output stream. In the first approach, which is exemplified by the copy() method, the goal is to minimize native I/O calls (via the write() method calls), although more data may end up being copied as a result of the compact() method calls. In the second approach, as demonstrated by copyAlt(), the goal is to eliminate data copying, although more native I/O calls might occur.
The copy() method while loop makes a single call to write(). Because write() might not completely drain the buffer, compact() is called to compact the buffer before the next read. Compaction ensures that unwritten buffer content isn’t overwritten during the next read operation. Following the while loop, copy() flips the buffer in preparation for draining any remaining content and then works with hasRemaining() and write() to drain the buffer completely.
The copyAlt() method while loop contains a nested while loop that works with hasRemaining() and write() to continue draining the buffer until the buffer is empty. This is followed by a clear() method call that empties the buffer so that it can be filled on the next read() call.
It’s important to realize that a single write() method call may not output the entire content of a buffer. Similarly, a single read() call may not completely fill a buffer.
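The two approaches can be sketched as follows. This is an illustration consistent with the description above rather than the listing itself; the class name ChannelCopySketch and the 2,048-byte direct buffer are assumptions.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.Channels;
import java.nio.channels.ReadableByteChannel;
import java.nio.channels.WritableByteChannel;

final class ChannelCopySketch {
    // Approach 1: one write() per read(); compact() preserves unwritten bytes.
    static void copy(ReadableByteChannel src, WritableByteChannel dest) throws IOException {
        ByteBuffer buffer = ByteBuffer.allocateDirect(2048);
        while (src.read(buffer) != -1) {
            buffer.flip();       // prepare the buffer for draining
            dest.write(buffer);  // may write only part of the content
            buffer.compact();    // move leftovers to the front, ready for the next read
        }
        buffer.flip();           // drain whatever remains after end-of-stream
        while (buffer.hasRemaining()) {
            dest.write(buffer);
        }
    }

    // Approach 2: drain completely after each read(); no data is moved within the buffer.
    static void copyAlt(ReadableByteChannel src, WritableByteChannel dest) throws IOException {
        ByteBuffer buffer = ByteBuffer.allocateDirect(2048);
        while (src.read(buffer) != -1) {
            buffer.flip();
            while (buffer.hasRemaining()) {
                dest.write(buffer);
            }
            buffer.clear();      // reset position and limit for the next read
        }
    }

    public static void main(String[] args) throws IOException {
        try (ReadableByteChannel src = Channels.newChannel(System.in);
             WritableByteChannel dest = Channels.newChannel(System.out)) {
            copy(src, dest);     // swap in copyAlt(src, dest) to compare
        }
    }
}
```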
The first command line copies standard input to standard output. The second command line copies the contents of ChannelDemo.java to ChannelDemo.bak. After testing the copy() method, replace copy(src, dest); with copyAlt(src, dest); and repeat.
Working with Selectors
I/O is either block-oriented (such as file I/O) or stream-oriented (such as network I/O). Streams are often slower than block devices (such as fixed disks), and read/write operations often cause the calling thread to block until input is available or output has been fully written. To compensate, modern operating systems let streams operate in nonblocking mode, which makes it possible for a thread to read or write data without blocking. The operation fully succeeds, or it indicates that the read/write isn’t possible at that time. Either way, the thread is able to perform other useful work instead of waiting.
Nonblocking mode doesn’t let an application determine if it can perform an operation without actually performing the operation. For example, when a nonblocking read operation succeeds, the application learns that the read operation is possible but also has read some data that must be managed. This duality prevents you from separating code that checks for stream readiness from the data-processing code without making your code significantly complicated.
Nonblocking mode serves as a foundation for performing readiness selection, which offloads to the operating system the work involved in checking for I/O stream readiness to perform write, read, and other operations. The operating system is instructed to observe a group of streams and return some indication of which streams are ready to perform a specific operation (such as read) or operations (such as accept and read). This capability lets a thread multiplex a potentially huge number of active streams by using the readiness information provided by the operating system. In this way, network servers can handle large numbers of network connections, which makes them highly scalable.
Modern operating systems make readiness selection available to applications by providing system calls such as the POSIX select() call.
Selectors let you achieve readiness selection in a Java context. In this section, we first introduce you to selector fundamentals and then provide a demonstration.
Selector Fundamentals
A selector is an object created from a subclass of the abstract java.nio.channels.Selector class. It maintains a set of channels, which it examines to determine which of them are ready for reading, writing, completing a connection sequence, accepting another connection, or some combination of these tasks. The actual work is delegated to the operating system via a POSIX select() or similar system call.
The ability to check a channel without having to wait when something isn’t ready (such as bytes are not available for reading) and without also having to perform the operation while checking is the key to scalability. A single thread can manage a huge number of channels, which reduces code complexity and potential threading issues.
Selectors are used with selectable channels, which are objects whose classes ultimately inherit from the abstract SelectableChannel class, which describes a channel that can be multiplexed by a selector. Socket channels, server socket channels, datagram channels, and pipe source/sink channels are selectable channels because SocketChannel, ServerSocketChannel, DatagramChannel, Pipe.SinkChannel, and Pipe.SourceChannel are derived from SelectableChannel. In contrast, file channels are not selectable channels because FileChannel doesn’t include SelectableChannel in its ancestry.
One or more previously created selectable channels are registered with a selector. Each registration returns a key (described by a concrete instance of the abstract java.nio.channels.SelectionKey class) that’s a token signifying the relationship between one channel and the selector. This key keeps track of two sets of operations: interest set and ready set. The interest set identifies the operation categories that will be tested for readiness the next time one of the selector’s selection methods is invoked. The ready set identifies the operation categories for which the key’s channel has been found to be ready by the key’s selector. When a selection method is invoked, the selector’s associated keys are updated by checking all channels registered with that selector. The application can then obtain a set of keys whose channels were found ready, and iterate over these keys to service each channel that has become ready since a previous select method call.
A selectable channel can be registered with more than one selector. It has no knowledge of the selectors to which it’s currently registered.
SelectionKey register(Selector sel, int ops)
SelectionKey register(Selector sel, int ops, Object att)
OP_ACCEPT: Operation-set bit for socket-accept operations
OP_CONNECT: Operation-set bit for socket-connect operations
OP_READ: Operation-set bit for read operations
OP_WRITE: Operation-set bit for write operations
The second method also lets you pass an arbitrary java.lang.Object or subclass instance (or null) to att. The nonnull object is known as an attachment, and it is a convenient way of recognizing a given channel or attaching additional information to the channel. It’s stored in the SelectionKey instance returned from this method.
Upon success, each method returns a SelectionKey instance that relates the selectable channel with the selector. Upon failure, an exception is thrown. For example, ClosedChannelException is thrown when the channel is closed and IllegalBlockingModeException is thrown when the channel hasn’t been set to nonblocking mode.
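Registration, including the blocking-mode failure and an attachment, can be sketched with a pipe’s source channel, which avoids any networking. The class name, the register() helper, and the "pipe-source" attachment string are assumptions.

```java
import java.io.IOException;
import java.nio.channels.IllegalBlockingModeException;
import java.nio.channels.Pipe;
import java.nio.channels.SelectionKey;
import java.nio.channels.Selector;

final class RegistrationSketch {
    // Tries to register in blocking mode, then registers properly with an attachment.
    static String register() throws IOException {
        Pipe pipe = Pipe.open();
        try (Selector selector = Selector.open()) {
            Pipe.SourceChannel source = pipe.source();
            String firstAttempt;
            try {
                source.register(selector, SelectionKey.OP_READ); // still in blocking mode
                firstAttempt = "registered";
            } catch (IllegalBlockingModeException e) {
                firstAttempt = "rejected";
            }
            source.configureBlocking(false);                     // required before registering
            SelectionKey key = source.register(selector, SelectionKey.OP_READ, "pipe-source");
            boolean interestIsRead = key.interestOps() == SelectionKey.OP_READ;
            return firstAttempt + " " + key.attachment() + " " + interestIsRead;
        } finally {
            pipe.sink().close();
            pipe.source().close();
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println(register()); // prints rejected pipe-source true
    }
}
```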
1. Performs a selection operation
2. Obtains the selected keys followed by an iterator over the selected keys
3. Iterates over these keys and performs channel operations
A selection operation is performed by invoking one of Selector’s selection methods. For example, int select() performs a blocking selection operation. It doesn’t return until at least one channel is selected, this selector’s wakeup() method is invoked, or the current thread is interrupted, whichever comes first.
Selector also declares an int select(long timeout) method that doesn’t return until at least one channel is selected, this selector’s wakeup() method is invoked, the current thread is interrupted, or the timeout value expires, whichever comes first. Additionally, Selector declares int selectNow(), which is a nonblocking version of select().
The select() method returns the number of channels that have become ready since the last time it was called. For example, if you call select() and it returns 1 because one channel has become ready, and if you call select() again and a second channel has become ready, select() will once again return 1. If you’ve not yet serviced the first ready channel, you now have two ready channels to service. However, only one channel became ready between these select() calls.
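The counting behavior can be observed with two pipes whose source channels are registered for reads. This is a sketch under stated assumptions: the class and method names are hypothetical, and a one-second timeout is used so that the calls cannot block indefinitely.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.Pipe;
import java.nio.channels.SelectionKey;
import java.nio.channels.Selector;

final class SelectCountSketch {
    // Returns {first select() result, second select() result, selected-key count}.
    static int[] counts() throws IOException {
        Pipe first = Pipe.open();
        Pipe second = Pipe.open();
        try (Selector selector = Selector.open()) {
            for (Pipe p : new Pipe[] { first, second }) {
                p.source().configureBlocking(false);
                p.source().register(selector, SelectionKey.OP_READ);
            }
            first.sink().write(ByteBuffer.wrap(new byte[] { 1 }));
            int a = selector.select(1000);  // the first channel has become ready
            // The first channel is deliberately not serviced and its key is not removed.
            second.sink().write(ByteBuffer.wrap(new byte[] { 2 }));
            int b = selector.select(1000);  // only the second channel is newly ready
            return new int[] { a, b, selector.selectedKeys().size() };
        } finally {
            first.sink().close();
            first.source().close();
            second.sink().close();
            second.source().close();
        }
    }

    public static void main(String[] args) throws IOException {
        int[] c = counts();
        System.out.println(c[0] + " " + c[1] + " " + c[2]);
    }
}
```

Each call is expected to return 1, while the selected-key set ends up holding two keys because the first key was never removed.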
A set of the selected keys (the ready set) is now obtained by invoking Selector’s Set<SelectionKey> selectedKeys() method. Invoke the set’s iterator() method to obtain an iterator over these keys.
Finally, the application iterates over the keys. For each of the iterations, a SelectionKey instance is returned. Some combination of SelectionKey’s boolean isAcceptable(), boolean isConnectable(), boolean isReadable(), and boolean isWritable() methods are called to determine if the key indicates that a channel is ready to accept a connection, finished connecting, readable, or writable.
The aforementioned methods offer a convenient alternative to specifying expressions such as (key.readyOps() & SelectionKey.OP_READ) != 0. (The parentheses are required because != has higher precedence than & in Java.) SelectionKey’s int readyOps() method returns the key’s ready set. The returned set will only contain operation bits that are valid for this key’s channel. For example, it never returns an operation bit that indicates that a read-only channel is ready for writing. Note that every selectable channel also declares an int validOps() method, which returns a bitwise ORed set of operations that are valid for the channel.
Once the application determines that a channel is ready to perform a specific operation, it can call SelectionKey’s SelectableChannel channel() method to obtain the channel and then perform work on that channel.
SelectionKey also declares a Selector selector() method that returns the selector for which the key was created.
When you’re finished processing a channel, you must remove its key from the selected-key set (typically by calling the iterator’s remove() method); the selector doesn’t perform this task. The next time the channel becomes ready, the selector will add the key to the selected-key set again.
In addition to registering the server socket channel with the selector, each incoming client socket channel is also registered with the selector. When a client socket channel becomes ready for read or write operations, key.isReadable() or key.isWritable() for the associated socket channel returns true and the socket channel can be read or written.
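The three steps (select, obtain the selected keys, iterate and service) can be sketched in one method. A pipe stands in for a network connection here; the class name, the serviceOnce() helper, and the "ping" payload are assumptions made for illustration.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.Pipe;
import java.nio.channels.ReadableByteChannel;
import java.nio.channels.SelectionKey;
import java.nio.channels.Selector;
import java.nio.charset.StandardCharsets;
import java.util.Iterator;

final class SelectionLoopSketch {
    // Performs one select / iterate / service pass and returns the text that was read.
    static String serviceOnce(Selector selector) throws IOException {
        StringBuilder received = new StringBuilder();
        if (selector.select(1000) == 0) {
            return "";                                  // nothing became ready in time
        }
        Iterator<SelectionKey> it = selector.selectedKeys().iterator();
        while (it.hasNext()) {
            SelectionKey key = it.next();
            // Equivalent to key.isReadable(); note the parentheses around the & test.
            if ((key.readyOps() & SelectionKey.OP_READ) != 0) {
                ReadableByteChannel channel = (ReadableByteChannel) key.channel();
                ByteBuffer buffer = ByteBuffer.allocate(64);
                channel.read(buffer);
                buffer.flip();
                received.append(StandardCharsets.US_ASCII.decode(buffer));
            }
            it.remove();  // the selector never removes keys from the selected-key set
        }
        return received.toString();
    }

    public static void main(String[] args) throws IOException {
        Pipe pipe = Pipe.open();
        try (Selector selector = Selector.open()) {
            pipe.source().configureBlocking(false);
            pipe.source().register(selector, SelectionKey.OP_READ);
            pipe.sink().write(ByteBuffer.wrap("ping".getBytes(StandardCharsets.US_ASCII)));
            System.out.println(serviceOnce(selector)); // prints ping
        } finally {
            pipe.sink().close();
            pipe.source().close();
        }
    }
}
```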
A key represents a relationship between a selectable channel and a selector. This relationship can be terminated by invoking SelectionKey’s void cancel() method. Upon return, the key will be invalid and will have been added to its selector’s canceled-key set. The key will be removed from all of the selector’s key sets during the next selection operation.
When you’re finished with a selector, call Selector’s void close() method. If a thread is currently blocked in one of this selector’s selection methods, it’s interrupted as if by invoking the selector’s wakeup() method. Any uncanceled keys still associated with this selector are invalidated, their channels are deregistered, and any other resources associated with this selector are released. If this selector is already closed, invoking close() has no effect.
Selector Demonstration
Serving Time to Clients
Listing 14-3’s server application consists of a SelectorServer class. This class allocates a direct byte buffer after this class is loaded.
When the main() method is executed, it first checks for a command-line argument, which is assumed to represent a port number. If no argument is specified, a default port number is used; otherwise, main() tries to convert it to an integer representing the port by passing the argument to Integer.parseInt(). (Remember that this method throws java.lang.NumberFormatException when a noninteger argument is passed.)
After outputting a startup message that identifies the listening port, main() obtains a server socket channel followed by the underlying socket, which is bound to the specified port. The server socket channel is then configured for nonblocking mode in preparation for registering this channel with a selector.
A selector is now obtained and the server socket channel registers itself with the selector so that it can learn when the channel is ready to perform an accept operation. The returned key isn’t saved because it’s never canceled (and the selector is never closed).
main() now enters an infinite loop, first invoking the selector’s select() method. If the server socket channel isn’t ready (select() returns 0), the rest of the loop is skipped.
The selected keys (just one key) along with an iterator for iterating over them are now obtained and main() enters an inner loop to loop over these keys. Each key’s isAcceptable() method is invoked to find out if the server socket channel is ready to perform an accept operation. If this is the case, the channel is obtained and cast to ServerSocketChannel, and ServerSocketChannel’s accept() method is called to accept the new connection.
To guard against the unlikely possibility of the returned SocketChannel instance being null (accept() returns null when the server socket channel is in nonblocking mode and no connection is available to be accepted), main() tests for this scenario and continues the loop when null is detected.
A message about receiving a connection is output, and the byte buffer is cleared in preparation for storing the local time. After this long integer has been stored in the buffer, the buffer is flipped in preparation for draining. A message about writing the current time is output and the buffer is drained. The socket channel is then closed and the key is removed from the set of keys.
Receiving Time from the Server
Listing 14-4 is much simpler than Listing 14-3 because selectors aren’t used. There’s no need for a selector in this simple application. You would typically use selectors in a client context when the client interacts with several servers.
bb.get() returns an 8-bit byte, which Java promotes to a 32-bit integer before applying a bitwise operator. The promotion sign-extends byte values whose high-order bit is set (unsigned values greater than 127) because Java regards them as negative numbers. Because the resulting leading one bits would corrupt the result after bitwise ORing them with time, they are removed by bitwise ANDing the integer with 255.
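The masking step can be sketched as follows. The sketch assumes that the server sent the time as eight bytes, most significant byte first; the class name and the assemble() helper are hypothetical.

```java
import java.nio.ByteBuffer;
import java.util.Date;

final class TimeBytesSketch {
    // Rebuilds a long from the buffer's remaining bytes, most significant byte first.
    static long assemble(ByteBuffer bb) {
        long time = 0;
        while (bb.hasRemaining()) {
            time <<= 8;
            time |= bb.get() & 255;  // the mask removes the sign-extension bits
        }
        return time;
    }

    public static void main(String[] args) {
        ByteBuffer bb = ByteBuffer.allocate(8);
        bb.putLong(System.currentTimeMillis()); // stands in for bytes read from the server
        bb.flip();
        System.out.println(new Date(assemble(bb)));
    }
}
```

Without the mask, a byte such as 0x80 would be ORed in as 0xFFFFFF80 and overwrite the higher-order bits that were assembled earlier.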
This value in time is passed to the java.util.Date(long time) constructor when a new Date object is constructed. In turn, the Date object is passed to System.out.println(), which invokes Date’s toString() method to obtain a human-readable date/time string.
Working with Regular Expressions
Text-processing applications often need to match text against patterns (character strings that concisely describe sets of strings that are considered to be matches). For example, an application might need to locate all occurrences of a specific word pattern in a text file so that it can replace those occurrences with another word. NIO includes regular expressions to help text-processing applications perform pattern matching with high performance.
Although regular expressions are presented in this NIO chapter, they are a general-purpose facility: you can use them anywhere in your application, including outside any input or output processing context.
Pattern, PatternSyntaxException, and Matcher
A regular expression (also known as a regex or regexp) is a string-based pattern that represents the set of strings that match this pattern. The pattern consists of literal characters and metacharacters, which are characters with special meanings instead of literal meanings.
The Regular Expressions API provides the java.util.regex.Pattern class to represent patterns via compiled regexes. Regexes are compiled for performance reasons; pattern matching via compiled regexes is much faster than if the regexes were not compiled. The API documentation tells you all about Pattern’s methods.
A method of particular importance is the static compile(String) method. You can use it to prepare a pattern string for regular expression operations. In case you have several such operations, for example, inside a loop, preparing the pattern once and then using it often is much faster compared to building the same pattern over and over again.
Finally, the Matcher matcher(CharSequence input) method reveals that the Regular Expressions API also provides the Matcher class, whose matchers attempt to match compiled regexes against input text. You obtain a matcher by calling matcher() on a Pattern instance. The Matcher method that you will probably use most often is matches(), which attempts to match the entire region against the pattern; alternatively, you can call find() repeatedly to iterate over all matches.
A matcher finds matches in a subset of its input called the region. By default, the region contains all of the matcher’s input. The region can be modified by calling Matcher’s Matcher region(int start, int end) method (set the limits of this matcher’s region) and queried by calling Matcher’s int regionStart() and int regionEnd() methods.
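A short sketch shows a pattern being compiled once and then reused with matches(), find(), and a restricted region. The class name, the findAll() helper, and the sample input are assumptions.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class MatcherSketch {
    // Collects every match within [start, end) as "text@startIndex".
    static List<String> findAll(Pattern pattern, CharSequence input, int start, int end) {
        Matcher m = pattern.matcher(input).region(start, end);
        List<String> hits = new ArrayList<>();
        while (m.find()) {
            hits.add(m.group() + "@" + m.start());
        }
        return hits;
    }

    public static void main(String[] args) {
        Pattern ox = Pattern.compile("ox");   // compile once, reuse many times
        String input = "ox box fox";
        System.out.println(findAll(ox, input, 0, input.length())); // [ox@0, ox@4, ox@8]
        System.out.println(findAll(ox, input, 3, input.length())); // [ox@4, ox@8]
        System.out.println(ox.matcher("ox").matches());            // true: whole input matches
        System.out.println(ox.matcher("box").matches());           // false: find() would succeed
    }
}
```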
Playing with Regular Expressions
find() searches for a match by comparing regex characters with the input characters in left-to-right order and returns true because o equals o and x equals x.
find() first compares regex character b with input character o. Because these characters are not equal and because there are not enough characters in the input to continue the search, find() doesn’t output a “Located” message to indicate a match.
The ox regex consists of literal characters. More sophisticated regexes combine literal characters with metacharacters (such as the period [.]) and other regex constructs.
To specify a metacharacter as a literal character, precede the metacharacter with a backslash character (as in \.) or place the metacharacter between \Q and \E (as in \Q.\E). In either case, make sure to double the backslash character when the escaped metacharacter appears in a string literal, such as "\\." or "\\Q.\\E".
The period metacharacter matches all characters except for the line terminator. For example, each of java RegExDemo .ox box and java RegExDemo .ox fox reports a match because the period matches the b in box and the f in fox.
Pattern recognizes the following line terminators: carriage return (\r), newline (line feed) (\n), carriage return immediately followed by newline (\r\n), next line (\u0085), line separator (\u2028), and paragraph separator (\u2029). The period metacharacter can also be made to match these line terminators by specifying the Pattern.DOTALL flag when calling Pattern.compile(String, int).
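The escaping rules and the effect of Pattern.DOTALL can be checked with a few calls. In this sketch, the class name and the found() helper are assumptions; note the doubled backslashes inside the string literals.

```java
import java.util.regex.Pattern;

final class DotSketch {
    // Reports whether regex, compiled with the given flags, is found in input.
    static boolean found(String regex, int flags, String input) {
        return Pattern.compile(regex, flags).matcher(input).find();
    }

    public static void main(String[] args) {
        System.out.println(found(".ox", 0, "box"));                 // true: period matches b
        System.out.println(found("\\.ox", 0, "box"));               // false: period is literal
        System.out.println(found("\\.ox", 0, ".ox"));               // true
        System.out.println(found("\\Q.\\Eox", 0, ".ox"));           // true: quoted period
        System.out.println(found("a.b", 0, "a\nb"));                // false: newline excluded
        System.out.println(found("a.b", Pattern.DOTALL, "a\nb"));   // true: DOTALL includes it
    }
}
```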
Character Classes
A simple character class consists of literal characters placed side by side and matches only these characters. For example, [abc] consists of characters a, b, and c. Also, java RegExDemo t[aiou]ck tack reports a match because a is a member of [aiou]. It also reports a match when the input is tick, tock, or tuck because i, o, and u are members.
A negation character class consists of a circumflex metacharacter (^), followed by literal characters placed side by side, and it matches all characters except for those in the class. For example, [^abc] consists of all characters except for a, b, and c. Also, java RegExDemo "[^b]ox" box doesn’t report a match because b isn’t a member of [^b], whereas java RegExDemo "[^b]ox" fox reports a match because f is a member. (The double quotes surrounding [^b]ox are necessary on our Windows 7 platform because ^ is treated specially at the command line.)
A range character class consists of successive literal characters expressed as a starting literal character, followed by the hyphen metacharacter (-), followed by an ending literal character, and matches all characters in this range. For example, [a-z] consists of all characters from a through z. Also, java RegExDemo [h-l]ouse house reports a match because h is a member of the class, whereas java RegExDemo [h-l]ouse mouse doesn’t report a match because m lies outside of the range and is therefore not part of the class. You can combine multiple ranges within the same range character class by placing them side by side; for example, [A-Za-z] consists of all uppercase and lowercase Latin letters.
A union character class consists of multiple nested character classes and matches all characters that belong to the resulting union. For example, [abc[u-z]] consists of characters a, b, c, u, v, w, x, y, and z. Also, java RegExDemo [[0-9][A-F][a-f]] e reports a match because e is a hexadecimal character. (We could have alternatively expressed this character class as [0-9A-Fa-f] by combining multiple ranges.)
An intersection character class consists of multiple &&-separated nested character classes and matches all characters that are common to these nested character classes. For example, [a-c&&[c-f]] consists of character c, which is the only character common to [a-c] and [c-f]. Also, java RegExDemo "[aeiouy&&[y]]" y reports a match because y is common to classes [aeiouy] and [y].
A subtraction character class consists of multiple &&-separated nested character classes, where at least one nested character class is a negation character class, and it matches all characters except for those indicated by the negation character class/classes. For example, [a-z&&[^x-z]] consists of characters a through w. (The square brackets surrounding ^x-z are necessary; otherwise, ^ is ignored and the resulting class consists of only x, y, and z.) Also, java RegExDemo "[a-z&&[^aeiou]]" g reports a match because g is a consonant and only consonants belong to this class. (We’re ignoring y, which is sometimes regarded as a consonant and sometimes regarded as a vowel.)
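The character classes described above can be tried directly from Java code. The following is a minimal sketch rather than one of the chapter's listings: the CharClassDemo class and its found() helper are hypothetical names, and found() only approximates what RegExDemo reports by checking whether the pattern occurs anywhere in the input.

```java
import java.util.regex.Pattern;

// Hypothetical helper class; not one of the chapter's listings.
class CharClassDemo {
    // Returns true when regex matches somewhere in input.
    static boolean found(String regex, String input) {
        return Pattern.compile(regex).matcher(input).find();
    }

    public static void main(String[] args) {
        System.out.println(found("t[aiou]ck", "tock"));         // simple: true
        System.out.println(found("[^b]ox", "box"));              // negation: false
        System.out.println(found("[^b]ox", "fox"));              // negation: true
        System.out.println(found("[h-l]ouse", "mouse"));         // range: false
        System.out.println(found("[[0-9][A-F][a-f]]", "e"));     // union: true
        System.out.println(found("[a-c&&[c-f]]", "c"));          // intersection: true
        System.out.println(found("[a-z&&[^aeiou]]", "g"));       // subtraction: true
        System.out.println(found("[a-z&&[^aeiou]]", "a"));       // subtraction: false
    }
}
```

No shell quoting is needed here because the patterns are ordinary Java string literals.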
Predefined Character Classes
Predefined Character Class | Description |
---|---|
\d | Matches any digit character. \d is equivalent to [0-9]. |
\D | Matches any nondigit character. \D is equivalent to [^\d]. |
\s | Matches any whitespace character. \s is equivalent to [ \t\n\x0B\f\r]. |
\S | Matches any nonwhitespace character. \S is equivalent to [^\s]. |
\w | Matches any word character. \w is equivalent to [a-zA-Z_0-9]. |
\W | Matches any nonword character. \W is equivalent to [^\w]. |
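When a predefined character class appears in Java source code, its backslash must be doubled because the backslash is also the escape character in string literals. The sketch below illustrates this; the PredefinedClassDemo class and its matches() helper are hypothetical names introduced for the example.

```java
import java.util.regex.Pattern;

// Hypothetical helper class; not one of the chapter's listings.
class PredefinedClassDemo {
    // Returns true when the entire input matches regex.
    static boolean matches(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        System.out.println(matches("\\d", "7"));          // true: 7 is a digit
        System.out.println(matches("\\D", "7"));          // false: 7 is not a nondigit
        System.out.println(matches("\\s", "\t"));         // true: tab is whitespace
        System.out.println(matches("\\w", "_"));          // true: underscore is a word character
        System.out.println(matches("\\W", "_"));          // false
        System.out.println(matches("\\w+", "abc_123"));   // true
    }
}
```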
Capturing Groups
A capturing group saves a match’s characters for later recall during pattern matching and is expressed as a character sequence surrounded by parentheses metacharacters ( and ). All characters within a capturing group are treated as a unit. For example, the (Android) capturing group combines A, n, d, r, o, i, and d into a unit. It matches the Android pattern against all occurrences of Android in the input. Each match replaces the previous match’s saved Android characters with the next match’s Android characters.
Capturing groups can appear inside other capturing groups. For example, capturing groups (A) and (B(C)) appear inside capturing group ((A)(B(C))), and capturing group (C) appears inside capturing group (B(C)). Each nested or nonnested capturing group receives its own number, numbering starts at 1, and capturing groups are numbered from left to right. For example, ((A)(B(C))) is assigned 1, (A) is assigned 2, (B(C)) is assigned 3, and (C) is assigned 4.
RegExDemo reports a match because the matcher detects Android, followed by a space, followed by Android in the input.
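The numbering rules can be checked with Matcher's groupCount() and group(int) methods. The sketch below is illustrative only: GroupDemo and groups() are hypothetical names, and the (Android) \1 pattern, which uses a back reference to recall group 1, is an assumption about the kind of pattern the preceding paragraph describes.

```java
import java.util.Arrays;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class; not one of the chapter's listings.
class GroupDemo {
    // Returns the text captured by each group when regex matches all of input.
    // Index 0 holds the entire match; an empty array means no match.
    static String[] groups(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        if (!m.matches()) {
            return new String[0];
        }
        String[] result = new String[m.groupCount() + 1];
        for (int i = 0; i <= m.groupCount(); i++) {
            result[i] = m.group(i);
        }
        return result;
    }

    public static void main(String[] args) {
        // Groups are numbered by their opening parentheses, left to right.
        System.out.println(Arrays.toString(groups("((A)(B(C)))", "ABC")));
        // prints [ABC, ABC, A, BC, C]

        // \1 recalls whatever group 1 captured (assumed example pattern).
        System.out.println(groups("(Android) \\1", "Android Android").length > 0); // true
        System.out.println(groups("(Android) \\1", "Android iOS").length > 0);     // false
    }
}
```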
Boundary Matchers and Zero-Length Matches
Boundary Matchers
Boundary Matcher | Description |
---|---|
^ | Matches the beginning of the line. |
$ | Matches the end of the line. |
\b | Matches the word boundary. |
\B | Matches a nonword boundary. |
\A | Matches the beginning of text. |
\G | Matches the end of the previous match. |
\Z | Matches the end of text except for line terminator (when present). |
\z | Matches the end of text. |
This output reveals several zero-length matches. When a zero-length match occurs, the starting and ending indexes are equal, although the output shows the ending index to be one less than the starting index because we specified end() - 1 in Listing 14-5 (so that a match’s end index identifies a non-zero-length match’s last character, not the character following the non-zero-length match’s last character).
A zero-length match occurs in empty input text, at the beginning of input text, after the last character of input text, or between any two characters of that text. Zero-length matches are easy to identify because they always start and end at the same index position.
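Boundary matchers are a common source of zero-length matches because they match positions rather than characters. The following sketch, which uses the hypothetical BoundaryDemo class with its found() and zeroLengthStarts() helpers, collects the index of every match whose start() and end() values are equal.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class; not one of the chapter's listings.
class BoundaryDemo {
    // Returns true when regex matches somewhere in input.
    static boolean found(String regex, String input) {
        return Pattern.compile(regex).matcher(input).find();
    }

    // Returns the index of each zero-length match (start() equals end()).
    static List<Integer> zeroLengthStarts(String regex, String input) {
        List<Integer> starts = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            if (m.start() == m.end()) {
                starts.add(m.start());
            }
        }
        return starts;
    }

    public static void main(String[] args) {
        System.out.println(found("^The", "The end"));             // true
        System.out.println(found("end$", "The end"));             // true
        System.out.println(found("\\bcat\\b", "a cat sat"));      // true
        System.out.println(found("\\bcat\\b", "concatenate"));    // false
        // Word boundaries surround "hi" and "you".
        System.out.println(zeroLengthStarts("\\b", "hi you"));    // [0, 2, 3, 6]
    }
}
```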
Quantifiers
A greedy quantifier (?, *, or +) attempts to find the longest match. Specify X? to find one or no occurrences of X; X* to find zero or more occurrences of X; X+ to find one or more occurrences of X; X{n} to find n occurrences of X; X{n,} to find at least n (and possibly more) occurrences of X; and X{n,m} to find at least n but no more than m occurrences of X.
A reluctant quantifier (??, *?, or +?) attempts to find the shortest match. Specify X?? to find one or no occurrences of X; X*? to find zero or more occurrences of X; X+? to find one or more occurrences of X; X{n}? to find n occurrences of X; X{n,}? to find at least n (and possibly more) occurrences of X; and X{n,m}? to find at least n but no more than m occurrences of X.
A possessive quantifier (?+, *+, or ++) is similar to a greedy quantifier except that a possessive quantifier only makes one attempt to find the longest match, whereas a greedy quantifier can make multiple attempts. Specify X?+ to find one or no occurrences of X; X*+ to find zero or more occurrences of X; X++ to find one or more occurrences of X; X{n}+ to find n occurrences of X; X{n,}+ to find at least n (and possibly more) occurrences of X; and X{n,m}+ to find at least n but no more than m occurrences of X.
The greedy quantifier (.*) matches the longest sequence of characters that terminates in end. It starts by consuming all of the input text and then is forced to back off until it discovers that the input text terminates with these characters.
The reluctant quantifier (.*?) matches the shortest sequence of characters that terminates in end. It begins by consuming nothing and then slowly consumes characters until it finds a match. It then continues until it exhausts the input text.
The possessive quantifier (.*+) doesn’t detect a match because it consumes the entire input text, leaving nothing left over to match end at the end of the regex. Unlike a greedy quantifier, a possessive quantifier doesn’t back off.
The result of this greedy quantifier is that 1 is detected at locations 0, 2, 3, and 5 in the input text and that nothing is detected (a zero-length match) at locations 1, 4, and 6.
This output might look surprising, but remember that a reluctant quantifier looks for the shortest match, which (in this case) is no match at all.
This possessive quantifier only matches the locations where 1 is detected in the input text. It doesn’t perform zero-length matches.
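The three quantifier families can be compared side by side by collecting every match that find() reports. This is a sketch under stated assumptions: QuantifierDemo and allMatches() are hypothetical names, and the input text "the end of the trend" was chosen for this example because it contains end twice, which makes the greedy and reluctant results differ.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class; not one of the chapter's listings.
class QuantifierDemo {
    // Returns the text of every successive match of regex in input.
    static List<String> allMatches(String regex, String input) {
        List<String> matches = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            matches.add(m.group());
        }
        return matches;
    }

    public static void main(String[] args) {
        String text = "the end of the trend"; // assumed sample input
        // Greedy: consumes everything, then backs off to the last "end".
        System.out.println(allMatches(".*end", text));   // [the end of the trend]
        // Reluctant: stops at the first "end" it can reach.
        System.out.println(allMatches(".*?end", text));  // [the end,  of the trend]
        // Possessive: consumes everything and never backs off.
        System.out.println(allMatches(".*+end", text));  // []

        // Greedy 1? yields a zero-length match wherever no 1 is present.
        System.out.println(allMatches("1?", "101101"));  // [1, , 1, 1, , 1, ]
        // Reluctant 1?? prefers the empty match at every position.
        System.out.println(allMatches("1??", "101101").size()); // 7
    }
}
```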
Practical Regular Expressions
To learn more about regular expressions, check out “Lesson: Regular Expressions” (http://download.oracle.com/javase/tutorial/essential/regex/index.html) in The Java Tutorials.
Working with Charsets
In Chapter 12, we briefly introduced the concepts of character set and character encoding. We also referred to some of the types located in the java.nio.charset package. In this section, we expand on these topics and explore this package in more detail. We also briefly revisit the String class, discussing that part of String that’s relevant to the discussion.
A Brief Review of the Fundamentals
Character : A meaningful symbol. For example, “$” and “E” are characters. These symbols predate the computer era.
Character set : A set of characters. For example, uppercase letters A through Z could be considered to form a character set. No numeric values are assigned to the characters in the set. There is no relationship to Unicode, ASCII, EBCDIC, or any other kind of character set standard.
Coded character set : A character set where each character is assigned a unique numeric value. Standards such as US-ASCII and ISO-8859-1 define mappings from characters to numeric values.
Character-encoding scheme : An encoding of a coded character set’s numeric values to sequences of bytes that represent these values. Some encodings are one-to-one. For example, in ASCII, character A is mapped to integer 65 and encoded as integer 65. Other encodings are one-to-one for some characters and one-to-many for others. For example, UTF-8 encodes Unicode characters. Each character whose numeric value is less than 128 is encoded as a single byte to be compatible with ASCII. Other Unicode characters are encoded as 2- to 4-byte sequences. (The original UTF-8 definition in RFC 2279, at www.ietf.org/rfc/rfc2279.txt, allowed sequences of up to 6 bytes; its successor, RFC 3629, limits them to 4.)
Charset : A coded character set combined with a character-encoding scheme. Charsets are described by the abstract java.nio.charset.Charset class.
Although Unicode is widely used and increasing in popularity, other character set standards are also used. Because operating systems perform I/O at the byte level, and because files store data as byte sequences, it’s necessary to translate between byte sequences and the characters that are encoded into these sequences. Charset and the other classes located in the java.nio.charset package address this translation task.
Working with Charsets
Standard Charsets
Charset Name | Description |
---|---|
US-ASCII | The 7-bit ASCII that forms the American English character set. Also known as the basic Latin block in Unicode. |
ISO-8859-1 | The 8-bit character set used by most European languages. It’s a superset of ASCII and includes most non-English European characters. |
UTF-8 | An 8-bit byte-oriented character encoding for Unicode. Characters are encoded in 1 to 4 bytes. |
UTF-16BE | A 16-bit encoding using big-endian order for Unicode. Characters are encoded in 2 bytes (4 bytes for supplementary characters) with the high-order 8 bits of each 16-bit unit written first. |
UTF-16LE | A 16-bit encoding using little-endian order for Unicode. Characters are encoded in 2 bytes (4 bytes for supplementary characters) with the low-order 8 bits of each 16-bit unit written first. |
UTF-16 | A 16-bit encoding whose endian order is determined by an optional byte-order mark. |
Charset names are case insensitive and are maintained by the Internet Assigned Numbers Authority (IANA). The names in Table 14-3 are included in IANA’s official registry.
UTF-8 is probably the most often used character encoding.
UTF-16BE and UTF-16LE encode each character as a 2-byte sequence in big-endian or little-endian order, respectively. A decoder for a UTF-16BE- or UTF-16LE-encoded byte sequence needs to know how the byte sequence was encoded. In contrast, UTF-16 relies on a byte-order mark (BOM) that appears at the beginning of the sequence. If this mark is absent, decoding proceeds according to UTF-16BE (Java’s native byte order). If this mark equals \uFEFF, the sequence is decoded according to UTF-16BE. If this mark equals \uFFFE, the sequence is decoded according to UTF-16LE.
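The effect of the byte-order mark on decoding can be observed by decoding the same character from three differently prepared byte arrays. The sketch below is illustrative; BomDemo and decodeUtf16() are hypothetical names, and the byte arrays were constructed by hand for this example.

```java
import java.nio.charset.Charset;

// Hypothetical helper class; not one of the chapter's listings.
class BomDemo {
    // Decodes bytes with the UTF-16 charset, which honors an optional BOM.
    static String decodeUtf16(byte[] bytes) {
        return new String(bytes, Charset.forName("UTF-16"));
    }

    public static void main(String[] args) {
        // BOM bytes FE FF: the remainder is read in big-endian order.
        byte[] bigEndianWithBom = { (byte) 0xFE, (byte) 0xFF, 0x00, 0x41 };
        // BOM bytes FF FE: the remainder is read in little-endian order.
        byte[] littleEndianWithBom = { (byte) 0xFF, (byte) 0xFE, 0x41, 0x00 };
        // No BOM: big-endian order is assumed.
        byte[] noBom = { 0x00, 0x41 };

        System.out.println(decodeUtf16(bigEndianWithBom));    // A
        System.out.println(decodeUtf16(littleEndianWithBom)); // A
        System.out.println(decodeUtf16(noBom));               // A
    }
}
```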
Using Charsets to Encode Characters into Byte Sequences
Listing 14-6’s main() method first creates a message consisting of two French words and an array of names for the standard collection of charsets. Next, it invokes the encode() method to encode the message according to the default charset, which it obtains by calling Charset’s Charset defaultCharset() factory method. Continuing, main() invokes encode() for each of the standard charsets. Charset’s Charset forName(String charsetName) factory method is used to obtain the Charset instance that corresponds to charsetName.
forName() throws java.nio.charset.IllegalCharsetNameException when the specified charset name is illegal and throws java.nio.charset.UnsupportedCharsetException when the desired charset isn’t supported by the virtual machine.
The encode() method first identifies the charset and the message. It then invokes Charset’s ByteBuffer encode(String s) method to return a new ByteBuffer object containing the bytes that encode the characters from s.
main() next iterates over the bytes in the byte buffer, converting each byte to a character. It uses java.lang.Character’s isWhitespace() and isISOControl() methods to determine if the character is whitespace or a control character (neither is regarded as printable) and converts such a character to Unicode 0 (empty string). (A carriage return or newline would screw up the output, for example.)
Finally, the index of the character, its hexadecimal value, and the character itself are printed to the standard output stream. We chose to use System.out.printf() for this task. You’ll learn about this method in the next section.
In addition to providing encoding methods such as the aforementioned ByteBuffer encode(String s) method, Charset provides a complementary CharBuffer decode(ByteBuffer buffer) decoding method. The return type is CharBuffer because byte sequences are decoded into characters.
ByteBuffer encode(String s) is a convenience method for specifying CharBuffer.wrap(s) and passing the result to the ByteBuffer encode(CharBuffer buffer) method.
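The encode() and decode() methods just described can be exercised with a short message. The following is a minimal sketch and not a reproduction of Listing 14-6: EncodeDemo, toHex(), and roundTrip() are hypothetical names, and the two-character message (a c with a cedilla followed by a) is an assumption chosen to keep the byte sequences short.

```java
import java.nio.ByteBuffer;
import java.nio.charset.Charset;

// Hypothetical helper class; not a reproduction of Listing 14-6.
class EncodeDemo {
    // Encodes msg with the named charset and returns its bytes as hex text.
    static String toHex(String msg, String charsetName) {
        ByteBuffer bb = Charset.forName(charsetName).encode(msg);
        StringBuilder sb = new StringBuilder();
        while (bb.hasRemaining()) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            // AND with 255 to discard the sign extension of negative bytes.
            sb.append(String.format("%02x", bb.get() & 255));
        }
        return sb.toString();
    }

    // Encodes msg and decodes the result with the same charset.
    static String roundTrip(String msg, String charsetName) {
        Charset cs = Charset.forName(charsetName);
        return cs.decode(cs.encode(msg)).toString();
    }

    public static void main(String[] args) {
        String msg = "\u00e7a"; // assumed sample message
        System.out.println(toHex(msg, "US-ASCII"));    // 3f 61 (unmappable character replaced by ?)
        System.out.println(toHex(msg, "ISO-8859-1"));  // e7 61
        System.out.println(toHex(msg, "UTF-8"));       // c3 a7 61
        System.out.println(toHex(msg, "UTF-16BE"));    // 00 e7 00 61
        System.out.println(toHex(msg, "UTF-16LE"));    // e7 00 61 00
        System.out.println(roundTrip(msg, "UTF-8").equals(msg)); // true
    }
}
```

Note that the US-ASCII round trip is lossy: the character that can't be mapped comes back as a question mark.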
Charsets and the String Class
String(byte[] data): Constructs a new String instance by decoding the specified array of bytes using the platform’s default charset
String(byte[] data, int offset, int byteCount): Constructs a new String instance by decoding the specified subsequence of the byte array using the platform’s default charset
String(byte[] data, String charsetName): Constructs a new String instance by decoding the specified array of bytes using the named charset
byte[] getBytes(): Returns a new byte array containing the characters of this string encoded using the platform’s default charset
byte[] getBytes(String charsetName): Returns a new byte array containing the characters of this string encoded using the named charset
Note that String(byte[] data, String charsetName) and byte[] getBytes(String charsetName) throw java.io.UnsupportedEncodingException when the charset isn’t supported.
Using Charsets with String
Listing 14-7’s main() method first creates a byte array containing a UTF-8 encoded message. It then converts this array to a String object via the UTF-8 charset. After outputting the resulting String object, it extracts this object’s bytes into a new byte array and proceeds to output these bytes in hexadecimal format. As demonstrated earlier in this chapter, we bitwise AND each byte value with 255 to remove the 0xFF sign extension bytes for negative integers when the 8-bit byte integer value is converted to a 32-bit integer value. These sign extension bytes would otherwise be output.
You might be wondering why you observe e7 instead of c3 a7 (Latin small letter c with a cedilla [hook or tail]) and e9 instead of c3 a9 (Latin small letter e with an acute accent). The answer is that we invoked the no-argument getBytes() method to encode the string. This method uses the default charset, which is windows-1252 on our platform. In this charset, the single byte e7 represents the same character that UTF-8 encodes as c3 a7, and e9 represents the character that UTF-8 encodes as c3 a9. The result is a shorter encoded sequence.
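To make the result independent of the platform's default charset, you can name the charset when calling getBytes(). The sketch below is not Listing 14-7: StringCharsetDemo, reencode(), and the sample bytes are assumptions, and ISO-8859-1 stands in for windows-1252 because only the former is guaranteed to be available (both encode the cedilla character as e7).

```java
import java.io.UnsupportedEncodingException;

// Hypothetical helper class; not a reproduction of Listing 14-7.
class StringCharsetDemo {
    // Decodes UTF-8 bytes into a String, re-encodes the String with the
    // named charset, and returns the resulting bytes as hex text.
    static String reencode(byte[] utf8Bytes, String charsetName) {
        try {
            String s = new String(utf8Bytes, "UTF-8");
            byte[] bytes = s.getBytes(charsetName);
            StringBuilder sb = new StringBuilder();
            for (byte b : bytes) {
                if (sb.length() > 0) {
                    sb.append(' ');
                }
                sb.append(Integer.toHexString(b & 255)); // strip sign extension
            }
            return sb.toString();
        } catch (UnsupportedEncodingException uee) {
            // Thrown when the named charset isn't supported.
            throw new IllegalArgumentException(charsetName, uee);
        }
    }

    public static void main(String[] args) {
        // "facade" with a cedilla under the c, encoded as UTF-8 (assumed sample).
        byte[] utf8 = { 0x66, 0x61, (byte) 0xc3, (byte) 0xa7, 0x61, 0x64, 0x65 };
        System.out.println(reencode(utf8, "UTF-8"));      // 66 61 c3 a7 61 64 65
        System.out.println(reencode(utf8, "ISO-8859-1")); // 66 61 e7 61 64 65
    }
}
```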
Working with Formatter and Scanner
The description for JSR 51 (http://jcp.org/en/jsr/detail?id=51) indicates that a simple printf-style formatting facility was proposed for inclusion in NIO. If you’re familiar with the C language, you’ve probably worked with the printf() family of functions that support formatted output. You’ve also probably worked with the complementary scanf() family that support formatted input.
One feature that makes the printf() and scanf() functions useful is varargs, which lets you pass a variable number of arguments to these functions.
Working with Formatter
Java 5 introduced the java.util.Formatter class as an interpreter for printf()-style format strings. This class provides support for layout justification and alignment; common formats for numeric, string, and date/time data; and more. Commonly used Java types (such as byte and BigDecimal) are supported. Also, limited formatting customization for arbitrary user-defined types is provided through the associated java.util.Formattable interface and java.util.FormattableFlags class.
Formatter declares several constructors for creating Formatter objects. These constructors let you specify where you want formatted output to be sent. For example, Formatter() writes formatted output to an internal StringBuilder instance and Formatter(OutputStream os) writes formatted output to the specified output stream. You can access the destination by calling Formatter’s Appendable out() method.
The java.lang.Appendable interface describes an object to which char values and character sequences can be appended. Classes (such as StringBuilder) whose instances are to receive formatted output (via the Formatter class) implement Appendable. This interface declares methods such as Appendable append(char c)—append c’s character to this appendable. When an I/O error occurs, this method throws IOException.
After creating a Formatter object, you would call a format() method to format a varying number of values. For example, Formatter format(String format, Object... args) formats the args array according to the string of format specifiers passed to the format parameter, and it returns a reference to the invoking Formatter so that you can chain format() calls together (for convenience).
%[argument_index$][flags][width][.precision]conversion
%[argument_index$][flags][width]conversion
%[flags][width]conversion
The first syntax describes a format specifier for general, character, and numeric types. The second syntax describes a format specifier for types that are used to represent dates and times. The third syntax describes a format specifier that doesn’t correspond to arguments.
The optional argument_index is a decimal integer indicating the position of the argument in the argument list. The first argument is referenced by 1$, the second argument is referenced by 2$, and so on.
The optional flags are a set of characters that modify the output format. The set of valid flags depends on the conversion.
The optional width is a positive decimal integer indicating the minimum number of characters to be written to the output.
The optional precision is a nonnegative decimal integer usually used to restrict the number of characters. The specific behavior depends on the conversion.
The required conversion depends on the syntax. For the first syntax, it’s a character indicating how the argument should be formatted. The set of valid conversions for a given argument depends on the argument’s data type. For the second syntax, it’s a two-character sequence. The first character is t or T. The second character indicates the format to be used. For the third syntax, it’s a character indicating content to be inserted in the output.
%d: Formats argument as a decimal integer
%x: Formats argument as a hexadecimal integer
%c: Formats argument as a character
%f: Formats argument as a decimal number
%s: Formats argument as a string
%n: Outputs a platform-specific line separator
%10.2f: Formats argument as a decimal number with 10 as the minimum number of characters to be written (leading spaces are written when the number is smaller than the width) and 2 as the number of characters to be written after the decimal point
%05d: Formats argument as a decimal integer with 5 as the minimum number of characters to be written (leading 0s are written when the number is smaller than the width)
When you’re finished with the formatter, you might want to invoke the void flush() method to ensure that any buffered output in the destination is written to the underlying stream. You would typically invoke flush() when the destination is a file.
Continuing, invoke the formatter’s void close() method. In addition to closing the formatter, this method also closes the underlying output destination when this destination’s class implements the java.io.Closeable interface. If the formatter has been closed, this method has no effect. Attempting to format after calling close() results in java.util.FormatterClosedException.
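The format specifiers and the close() call described above can be combined in a small helper. This is a sketch rather than the chapter's listing: FormatterDemo and formatOnce() are hypothetical names, and the Formatter(Locale) constructor is used with Locale.US (an assumption made here) so that the decimal separator is a period regardless of the platform's default locale.

```java
import java.util.Formatter;
import java.util.Locale;

// Hypothetical helper class; not one of the chapter's listings.
class FormatterDemo {
    // Formats args with a fresh Formatter and closes it afterward.
    static String formatOnce(String format, Object... args) {
        Formatter formatter = new Formatter(Locale.US);
        try {
            formatter.format(format, args);
            return formatter.toString(); // must be called before close()
        } finally {
            formatter.close();
        }
    }

    public static void main(String[] args) {
        System.out.println(formatOnce("%d", 123));             // 123
        System.out.println(formatOnce("%x", 255));             // ff
        System.out.println(formatOnce("%c", 'A'));             // A
        System.out.println(formatOnce("%f", 3.5));             // 3.500000
        System.out.println(formatOnce("%s", "Hello"));         // Hello
        System.out.println(formatOnce("[%10.2f]", 3.14159));   // [      3.14]
        System.out.println(formatOnce("%05d", 42));            // 00042
        System.out.println(formatOnce("%1$d %1$d", 123));      // 123 123
    }
}
```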
Demonstrating the Formatter Class
Listing 14-8’s main() method first creates a Formatter object via the Formatter() constructor, which sends formatted output to a StringBuilder instance. It then demonstrates the aforementioned format specifiers by invoking a format() method, followed by the toString() method to obtain the formatted content, which is subsequently output.
The formatter.format("%1$d %1$d", 123); method call accesses the single data item argument to be formatted (123) twice by referencing this argument via 1$. Without this reference, which is demonstrated via formatter.format("%d %d", 123);, an exception would be thrown because there must be a separate argument for each format specifier unless you use an argument index.
Lastly, the formatter is closed.
The first thing to notice about the output is that each format() call appends formatted output to the previously formatted output. The second thing to notice is that java.util.MissingFormatArgumentException is thrown when you don’t specify a needed argument.
MissingFormatArgumentException is one of several formatter exception types. These exception types subtype the java.util.IllegalFormatException type.
To start each format() call with empty output instead of appending to earlier content, instantiate a new Formatter instance, as in formatter = new Formatter();, before calling format(). This ensures that a new default and empty string builder is created.
Alternatively, create your own StringBuilder instance and pass it to a constructor such as Formatter(Appendable a). After outputting the formatted content, invoke StringBuilder’s void setLength(int newLength) method with 0 as the argument to erase previous content.
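The StringBuilder approach, along with the exception thrown for a missing argument, might look like the following sketch. FormatterResetDemo and its two helper methods are hypothetical names introduced for this example.

```java
import java.util.Formatter;
import java.util.MissingFormatArgumentException;

// Hypothetical helper class; not one of the chapter's listings.
class FormatterResetDemo {
    // Formats two values with one Formatter, erasing the destination
    // between calls so that the second result doesn't include the first.
    static String[] formatSeparately(int first, int second) {
        StringBuilder sb = new StringBuilder();
        Formatter formatter = new Formatter(sb);
        try {
            formatter.format("%d", first);
            String a = sb.toString();
            sb.setLength(0); // erase previously formatted content
            formatter.format("%d", second);
            String b = sb.toString();
            return new String[] { a, b };
        } finally {
            formatter.close();
        }
    }

    // Returns true when formatting two specifiers with one argument fails.
    static boolean missingArgumentThrows() {
        Formatter formatter = new Formatter();
        try {
            formatter.format("%d %d", 123);
            return false;
        } catch (MissingFormatArgumentException mfae) {
            return true;
        } finally {
            formatter.close();
        }
    }

    public static void main(String[] args) {
        String[] results = formatSeparately(123, 456);
        System.out.println(results[0]);              // 123
        System.out.println(results[1]);              // 456
        System.out.println(missingArgumentThrows()); // true
    }
}
```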
It’s cumbersome to have to create and manage a Formatter object when all you want to do is to achieve something equivalent to the C language’s printf() function. Java addresses this situation by adding formatter support to the java.io.PrintStream class.
Of the various formatter-oriented methods added to PrintStream, you’ll often invoke PrintStream printf(String format, Object... args). After sending its formatted content to the print stream, this method returns a reference to this stream so that you can chain method calls together.
Formatting via printf()
Listing 14-9’s main() method invokes System.out.printf() twice. The first invocation formats 32-bit integer 478 into a four-digit hexadecimal string with a leading zero and uppercase hexadecimal digits. The second invocation formats the current millisecond value returned from System.currentTimeMillis() into a date. The tb conversion specifies an abbreviated month name (such as Jan), the te conversion specifies the day of the month (such as 1 through 31), and the tY conversion specifies the year (formatted with at least four digits, with leading 0s as necessary).
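The two invocations just described can be approximated as follows. This sketch is not Listing 14-9: PrintfDemo, hex(), and date() are hypothetical names, the exact format strings are assumptions, and String.format() (which accepts the same format specifiers as printf()) is used in the helpers so that the results can be inspected. Locale.US is also an assumption, made so that the month abbreviation is in English.

```java
import java.util.Locale;

// Hypothetical helper class; not a reproduction of Listing 14-9.
class PrintfDemo {
    // Four hexadecimal digits, zero padded, uppercase.
    static String hex(int value) {
        return String.format("%04X", value);
    }

    // Abbreviated month, day of month, and year of the given instant,
    // interpreted in the platform's default time zone.
    static String date(long millis) {
        return String.format(Locale.US, "%1$tb %1$te, %1$tY", millis);
    }

    public static void main(String[] args) {
        System.out.printf("%04X%n", 478); // 01DE
        System.out.printf(Locale.US, "Current date: %1$tb %1$te, %1$tY%n",
                          System.currentTimeMillis());
    }
}
```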
For more information on Formatter and its supported format specifiers, we refer you to Formatter’s Java documentation. You might also want to check out the documentation on the Formattable interface and FormattableFlags class to learn about customizing Formatter.
Working with Scanner
Java 5 introduced the java.util.Scanner class to parse input characters into primitive types, strings, and big integers/big decimals with the help of regular expressions. Scanner declares several constructors for scanning content originating from diverse sources. For example, Scanner(InputStream source) creates a scanner for scanning the specified input stream, whereas Scanner(String source) creates a scanner for scanning the specified string.
A Scanner instance uses a delimiter pattern, which matches whitespace by default, to break its input into discrete values. After creating this instance, you can call one of the “hasNext” methods to verify that an anticipated character sequence is present for scanning. For example, you could call boolean hasNextDouble() to determine whether or not the next sequence of characters can be scanned into a double-precision floating-point value.
When the value is present, you would call the appropriate “next” method to scan the value. For example, you would call double nextDouble() to scan this sequence and return a double containing its value.
When you’re finished with the scanner, invoke its void close() method. Beyond closing the scanner, this method also closes the underlying input source when this source’s class implements the Closeable interface. If the scanner has been closed, this method has no effect. Any attempt to scan after calling this method will result in IllegalStateException.
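The hasNext/next pattern and the close() call can be combined as in the following sketch. ScannerDemo and sumDoubles() are hypothetical names, and the call to useLocale(Locale.US) is an assumption made so that a period is recognized as the decimal separator regardless of the platform's default locale.

```java
import java.util.Locale;
import java.util.Scanner;

// Hypothetical helper class; not one of the chapter's listings.
class ScannerDemo {
    // Sums every whitespace-delimited token that can be scanned as a double.
    static double sumDoubles(String text) {
        Scanner scanner = new Scanner(text).useLocale(Locale.US);
        try {
            double sum = 0.0;
            while (scanner.hasNext()) {
                if (scanner.hasNextDouble()) {
                    sum += scanner.nextDouble();
                } else {
                    scanner.next(); // skip a token that isn't a number
                }
            }
            return sum;
        } finally {
            scanner.close();
        }
    }

    public static void main(String[] args) {
        System.out.println(sumDoubles("1.5 apples 2.5 pears 4")); // 8.0
    }
}
```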
Exercises
1. Define new I/O.
2. What is a buffer?
3. Identify a buffer’s four properties.
4. What happens when you invoke Buffer’s array() method on a buffer backed by a read-only array?
5. What happens when you invoke Buffer’s flip() method on a buffer?
6. What happens when you invoke Buffer’s reset() method on a buffer where a mark has not been set?
7. True or false: Buffers are thread-safe.
8. How do you create a byte buffer?
9. Define view buffer.
10. How is a view buffer created?
11. How do you create a read-only view buffer?
12. Identify ByteBuffer’s methods for storing a single byte in a byte buffer and fetching a single byte from a byte buffer.
13. What causes BufferOverflowException or BufferUnderflowException to occur?
14. True or false: Calling flip() twice returns you to the original state.
15. What is the difference between Buffer’s clear() and reset() methods?
16. What does ByteBuffer’s compact() method accomplish?
17. What is the purpose of the ByteOrder class?
18. Define direct byte buffer.
19. How do you obtain a direct byte buffer?
20. What is a channel?
21. What capabilities does the Channel interface provide?
22. True or false: A channel that implements InterruptibleChannel is asynchronously closeable.
23. Define file channel.
24. Define exclusive lock and shared lock.
25. What is the fundamental difference between FileChannel’s lock() and tryLock() methods?
26. What does the FileLock lock() method do when either a lock is already held that overlaps this lock request or another thread is waiting to acquire a lock that will overlap with this request?
27. Specify the pattern that you should adopt to ensure that an acquired file lock is always released.
28. What method does FileChannel provide for mapping a region of a file into memory?
29. True or false: Socket channels are selectable and can function in nonblocking mode.
30. True or false: Datagram channels are not thread-safe.
31. Why do socket channels support nonblocking mode?
32. How would you obtain a socket channel’s associated socket?
33. How do you obtain a server socket channel?
34. Define selector.
35. Define regular expression.
36. What does the Pattern class accomplish?
37. What do Pattern’s compile() methods do when they discover illegal syntax in their regular expression arguments?
38. What does the Matcher class accomplish?
39. What is the difference between Matcher’s matches() and lookingAt() methods?
40. Define character class.
41. Identify the various kinds of character classes.
42. Define capturing group.
43. What is a zero-length match?
44. Define quantifier.
45. What is the difference between a greedy quantifier and a reluctant quantifier?
46. How do possessive and greedy quantifiers differ?
47. Identify the two main classes that contribute to the NIO printf-style formatting facility.
48. What does the %n format specifier accomplish?
49. Refactor Listing 12–7 (Chapter 12’s Copy application) to use the ByteBuffer and FileChannel classes in partnership with FileInputStream and FileOutputStream.
50. Create a ReplaceText application that takes input text, a pattern that specifies text to replace, and replacement text command-line arguments, and uses Matcher’s String replaceAll(String replacement) method to replace all matches of the pattern with the replacement text (passed to replacement). For example, java ReplaceText "too many embedded spaces" "\s+" " " should output too many embedded spaces with only a single space character between successive words.
Summary
Java 1.4 introduced a more powerful I/O architecture that supports memory-mapped file I/O, readiness selection, file locking, and more. This architecture largely consists of buffers, channels, selectors, regular expressions, and charsets, and it is commonly known as new I/O (NIO).
NIO is based on buffers. A buffer is an object that stores a fixed amount of data to be sent to or received from an I/O service (a means for performing input/output). It sits between an application and a channel that writes the buffered data to the service or reads the data from the service and deposits it into the buffer. Java supports buffers by providing the Buffer class, assorted subclasses, and the ByteOrder type-safe enumeration.
Channels partner with buffers to achieve high-performance I/O. A channel is an object that represents an open connection to a hardware device, a file, a network socket, an application component, or another entity that’s capable of performing write, read, and other I/O operations. Channels efficiently transfer data between byte buffers and I/O service sources or destinations. Java supports channels by providing the Channel interface and related types.
Selectors let you achieve readiness selection in a Java context. Readiness selection offloads to the operating system the work involved in checking for I/O stream readiness to perform write, read, and other operations. The operating system is instructed to observe a group of channels and return some indication of which channels are ready to perform a specific operation (such as read) or operations (such as accept and read). This capability lets a thread multiplex a potentially huge number of active channels by using the readiness information provided by the operating system. In this way, network servers can handle large numbers of network connections; they are vastly scalable. Java supports selectors by offering the SelectableChannel, SelectionKey, and Selector classes.
Text-processing applications often need to match text against patterns (character strings that concisely describe sets of strings that are considered to be matches). For example, an application might need to locate all occurrences of a specific word pattern in a text file so that it can replace those occurrences with another word. NIO includes regular expressions to help text-processing applications perform pattern matching with high performance. Java supports regular expressions by providing the Pattern and Matcher classes.
Charsets combine coded character sets with character-encoding schemes. They’re used to translate between byte sequences and the characters that are encoded into these sequences. Java supports charsets by providing Charset and related classes.
The description for JSR 51 (the NIO JSR) indicates that a simple printf-style formatting facility was proposed for inclusion in NIO. This facility consists of formatted output and formatted input.
Chapter 15 focuses on database access. You first encounter the Apache Derby and SQLite database products and then learn how to use the JDBC API to create/access their databases.