Chapter 19. Regular Expressions

This chapter provides an introduction to some classes in the java.util and related packages that have utilities for your code to use. We start with the regular expression support that allows you to do pattern matching of Strings. After that, we look at the Date and Calendar-related classes, which have confused and frustrated many people.

The chapter finishes with a brief look at some data structure classes that predate Collections, but are still useful. The rest of this book is devoted to explaining more Java libraries and showing examples of their use. Let's get going with regular expressions and pattern matching.

Regular Expressions And Pattern Matching

This section uses the I/O features described in the previous chapters and describes the regular expression pattern matching feature that was introduced with JDK 1.4.

If you have typed “dir *.java” to see all the Java files in a directory, you have already used a simple form of pattern matching. A regular expression is a String that can contain some special characters to help you match patterns in text. In the “dir” command, the asterisk is a wildcard meaning “any characters at all”; in a regular expression, as we will see, the asterisk means “any number of repetitions of whatever precedes it”. The name “regular expression” was coined by American mathematician Stephen Kleene, who developed the expressions as a notation for describing what he called “the algebra of regular sets”. The asterisk is also called a “Kleene star”.

JDK 1.4 introduced a package called java.util.regex that supports the use of regular expressions. Using the classes in that package lets you answer questions like “Does this kind of pattern occur anywhere in that String?”, and you can split Strings apart and create new Strings with changed contents. These sorts of operations are very useful in the following contexts:

  • Web searches (you can use regular expressions in many search engines).

  • Email filtering (discard email where the “From:” line matches well-known spammers)

  • Text-manipulation tasks. Source code editors usually have a way to search using regular expressions. If you don't know how to do pattern matching in the editor you use to edit programs, you aren't yet reaching your full potential as a programmer. Plus, it's a great way to beguile other programmers who look over your shoulder.

There's lots of good news about regular expressions in Java. First, the language of regular expressions (the way you form regular expressions, the special symbols and their meaning) is very similar to that used by Perl. There are a few obscure things supported by Java that Perl 5 doesn't support, and vice versa. Java is less forgiving about badly formed expressions. But if you already know Perl, there's less to learn about Java pattern matching. If you don't use Perl, your Java regex knowledge will get you jump-started.

Best of all, Java regular expressions are simple. There are only three classes in the package, and one of those is an exception! Well, you can't really judge the complexity of a library by the number of classes it has, but regular expressions are straightforward. Pattern matching is important, and we'll cover it in some detail.

Let's say you have a String somewhere, and you want to look for a pattern in it. A pattern will be something like “at least one letter or digit (and maybe many) followed by a colon followed by a space, followed by at least one letter or digit (and maybe many)”. The steps to look for a pattern in a String, s, include:

  1. You specify the pattern with a String, p, holding a regular expression representing the pattern.

  2. You then turn String p into a Pattern object by invoking the static method java.util.regex.Pattern.compile(p). The compile() method hands you back a Pattern object.

  3. Using the Pattern object from Step 2, you invoke the matcher() method, giving it the input String s that you want to look through. The call matcher(s) will give you back a java.util.regex.Matcher object. That Matcher object has the methods for matching and replacing parts of input Strings (splitting is done by the Pattern itself).

Therefore, two of your classes are Pattern and Matcher. The third class is the exception PatternSyntaxException. That exception is thrown if you provide a faulty regular expression to Pattern.compile().
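The three steps can be sketched as follows. This is a minimal illustration, not code from the JDK: the class name ThreeSteps, the helper firstMatch(), and the sample input are made up for the example, and the pattern spells out “letter or digit” with a range.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

public class ThreeSteps {
    // Returns the first part of s that matches regex p, or null if nothing matches.
    static String firstMatch(String p, String s) {
        Pattern pattern = Pattern.compile(p);   // Step 2: compile the expression
        Matcher matcher = pattern.matcher(s);   // Step 3: get a matcher for the input
        return matcher.find() ? matcher.group() : null;
    }

    public static void main(String[] args) {
        // Step 1: letters or digits, a colon, a space, letters or digits
        String p = "[a-zA-Z0-9]+: [a-zA-Z0-9]+";
        System.out.println(firstMatch(p, "Header is Subject: weather today"));

        try {
            Pattern.compile("[a-z");            // unclosed bracket: a faulty expression
        } catch (PatternSyntaxException e) {
            System.out.println("Bad pattern: " + e.getDescription());
        }
    }
}
```

The first println prints “Subject: weather”. The second call never produces a Pattern because compile() throws before returning.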

Under the covers, the Pattern.compile() method builds a tree to represent the regular expression. Each node in the tree represents one component of the regular expression. Each node contains the code that does a comparison on an input String and gives an answer about whether it matches that part of the pattern. It is similar to the work an ordinary compiler does to turn source code into executable code, so “compile()” is a reasonable name for the method. The programmers could have made a constructor available. The use of a static method called compile() to return an instance is a hint that there is a lot more work going on here than memory allocation and initialization. It also leaves the door open to a future release which tries to share pattern matchers or provide several alternative implementations that can be swapped in at run-time.

Matching a pattern

Just as you don't directly instantiate a pattern object, you don't directly instantiate a matcher. You get an instance of the Matcher class by calling the matcher() method of your pattern object. Then you send it input from anything that implements the CharSequence interface. String, StringBuffer, and CharBuffer implement CharSequence, so they are easy to pattern match. If you want to match patterns in a text file, this is also easy to do using the new Channel I/O feature. You get the Channel for the file, then you get a Buffer from the Channel. Character Buffers implement the CharSequence interface. If you need to pattern match from some other source, you can implement the CharSequence interface yourself (it's small and easy).

After you have a matcher object, it supports three kinds of match operations:

  • The find() method scans the input sequence looking for the next sequence that matches the pattern.

  • The matches() method tries to match the entire input sequence against the pattern.

  • The lookingAt() method tries to match some or all of the input sequence, starting at the beginning, against the pattern.
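A small sketch may help to keep the three operations apart. The class name MatchKinds, the helper describe(), and the sample inputs are assumptions made for this illustration only.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class MatchKinds {
    // Runs all three match operations on the same input and reports the results.
    static String describe(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        boolean found = m.find();        // a match anywhere in the input?
        m.reset();
        boolean whole = m.matches();     // does the entire input match?
        m.reset();
        boolean prefix = m.lookingAt();  // does the input start with a match?
        return "find=" + found + " matches=" + whole + " lookingAt=" + prefix;
    }

    public static void main(String[] args) {
        System.out.println(describe("[0-9]+", "12345"));   // all three succeed
        System.out.println(describe("[0-9]+", "123abc"));  // matches() fails
        System.out.println(describe("[0-9]+", "abc123"));  // only find() succeeds
    }
}
```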

Forming patterns

A regular expression, or pattern, is a String that describes the kind of thing you want to match. Most letters represent themselves. If you compile a literal pattern of “To: [email protected]”, it will match exactly those letters in that order. If you have a file containing old email, you could use this pattern to find all the email addressed that way.

After you have compiled a pattern and have a matcher object, you can invoke its find() method to look for the next occurrence of the pattern in the input. You can then call its group() method to get back the input sequence that was matched by the previous match operation. The code is similar to the following:

Pattern p = Pattern.compile("To: [email protected]");
Matcher m = p.matcher( someBuffer );
while (m.find())
    System.out.println("Found text: "+m.group());

The find() method returns a boolean result indicating if the next occurrence of the pattern was found in the input. The group() method returns a String containing the most recent part of the input to match. If you provide a main program and run the code on a file containing my email, it prints the matches with output similar to the following:

Found text: To: [email protected]
Found text: To: [email protected]

The following sections use this example data file of email. It contains five email messages, each of which starts with “From:” and continues until the next “From:”.

Example . Sample email data file

From: [email protected]
To: [email protected]
Subject: weather
The weather is fine today.
From: [email protected]
To: [email protected]
 Hello
From: [email protected]
To: [email protected],[email protected]
Subject: no change!
--[booo!]
Weather still fine!
   He said he was a British Subject: born in London
From: [email protected]
To: [email protected]
Subject: Help - no rain.
We are in a drought,
   Bill.
From: [email protected]
To: [email protected]
Subject: SHIFT KEY
HELP! SHIFT KEY IS STUCK _ BILL

The following sections use this program as a framework for trying different patterns. The pattern line is marked in bold. If you want to experiment with the different patterns shown, this is the line to update. This program and the data file are on the web site afu.com/jj6.

Example . Sample pattern-matching program

import java.util.regex.*;
import java.io.*;
import java.nio.*;
import java.nio.charset.*;
import java.nio.channels.*;
public class Extract {
    public static void main(String[] args) throws Exception {
        // Create a pattern to match a "To:" header line
        Pattern p = Pattern.compile("To: [email protected]");

        // Get a Channel for the source file
        FileInputStream fis = new FileInputStream("email.txt");
        FileChannel fc = fis.getChannel();

        // Map a Buffer in from the data file, and decode the bytes
        MappedByteBuffer bb =
           fc.map(FileChannel.MapMode.READ_ONLY, 0,(int)fc.size());
        Charset cs = Charset.forName("8859_1");
        CharsetDecoder cd = cs.newDecoder();
        CharBuffer cb = cd.decode(bb);

        // Run some matches
        Matcher m = p.matcher(cb);
        while (m.find())
            System.out.println("Found text: "+m.group());
    }
}

Running the program creates the following output:

java Extract
Found text: To: [email protected]
Found text: To: [email protected]

The following three lines of code may be unfamiliar:

Charset cs = Charset.forName("8859_1");
CharsetDecoder cd = cs.newDecoder();
CharBuffer cb = cd.decode(bb);

The three lines of code specify how the bytes in the file are translated into characters as they are brought into the buffer. The earlier examples of mapped I/O just took bytes from the file and put them into bytes in the buffer. In this case, we get an object that represents the ISO 8859 Latin-1 character set (shown in Appendix B). Using that object we get a decoder object. The decoder object has the ability to “decode” or translate bytes into double-byte characters. The 8859 example is a single byte character set and the translation turns each byte in the file into two bytes in the buffer; zero is the most significant byte. For more details about character sets and encodings, see the second I/O chapter.
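To see the decoding step in isolation, here is a minimal sketch that decodes a small byte array instead of a mapped file. The class name DecodeDemo, the helper decodeLatin1(), and the sample bytes are assumptions for this example; “ISO-8859-1” is the canonical name of the same Latin-1 charset.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;

public class DecodeDemo {
    // Decodes single-byte Latin-1 data into Java's two-byte chars.
    static String decodeLatin1(byte[] bytes) {
        Charset cs = Charset.forName("ISO-8859-1");
        try {
            CharBuffer cb = cs.newDecoder().decode(ByteBuffer.wrap(bytes));
            return cb.toString();
        } catch (CharacterCodingException e) {
            // Not expected for Latin-1, where every byte value is a valid character.
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) {
        byte[] bytes = { 0x63, 0x61, 0x66, (byte) 0xE9 };  // four bytes in the "file"
        String s = decodeLatin1(bytes);
        System.out.println(s.length());          // 4: one char per input byte
        System.out.println((int) s.charAt(3));   // 233: byte 0xE9 became char 0x00E9
    }
}
```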

Range

The pattern “To: [email protected]” won't find all email to bbb. Email should be delivered whether the domain part of the address is in upper or lowercase. This pattern matches only lowercase. As frequently happens with regular expressions, there are several different ways of writing a pattern. We can make the pattern ignore letter case by passing a flag when we compile the pattern, as shown in the following example:

Pattern p = Pattern.compile("To: [email protected]",
                                Pattern.CASE_INSENSITIVE);

Another way to achieve the same effect is to use a range. When the pattern object sees square brackets, it tries to match one of the characters inside the brackets. If we want to match “sun” without regard to case of the first letter, we could use this pattern:

"[Ss]un"

The pattern in one pair of square brackets matches one character. You can use a hyphen to indicate a range of characters (hence the name for this feature). Both of the following patterns will match any single digit:

"[0123456789]"
"[0-9]"

To exactly match two digits “00” to “99” we could use “[0-9][0-9]”.

A powerful feature of the range function is the ability to match “anything but” the list of characters in the range. If the first character in a range is "^” (caret), it means “match any character except the ones that follow”. In order to extract the “To:” lines for all names except those that start with “j” we could use this pattern:

"To: [^j]"

Similarly, “[^ ]” matches any character other than a space, and “[^0-9]” matches any one character that is not a digit. If you need to match against one of the special characters like square bracket or caret, you can “escape” them in the String. You escape a special character (treat it literally) by putting two backslashes before it in the String. Rather than use one backslash, use two because the rules of Java Strings take precedence over the rules of regular expression patterns. To get one backslash in a Java String, you must escape it with its own backslash.
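The range features described above can be tried out with a short sketch like the following. The class name RangeDemo, the helper found(), and the sample inputs are invented for the illustration.

```java
import java.util.regex.Pattern;

public class RangeDemo {
    // True if the pattern occurs anywhere in the input.
    static boolean found(String regex, String input) {
        return Pattern.compile(regex).matcher(input).find();
    }

    public static void main(String[] args) {
        System.out.println(found("[Ss]un", "Sunday"));          // true
        System.out.println(found("[0-9][0-9]", "room 42"));     // true: two digits
        System.out.println(found("To: [^j]", "To: jim"));       // false: name starts with j
        System.out.println(found("To: [^j]", "To: bob"));       // true
        // Two backslashes in the Java String give one backslash in the pattern,
        // which makes each square bracket a literal character.
        System.out.println(found("\\[booo!\\]", "--[booo!]"));  // true
    }
}
```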

Ranges can be used to match literal Strings. But we are often in a situation where we want to match a String that conforms to some pattern. The email subject line, for example, starts with “Subject: ”, then has some kind of text, and ends with an end-of-line character. To match a pattern that includes some arbitrary text, use metacharacters.

Single-character metacharacters

Let's say we want to match the Subject line of email. We want a pattern to match “Subject: anything” on one line. We will use the metacharacter “.” (dot) that matches any single character. There are other metacharacters besides dot that match a single character. These metacharacters are shown in Table 19-1.

Table 19-1. Pattern metacharacters

Metacharacter   Written in Java String   Single character matched                 Express with a range
.               "."                      Any character                            n/a
\d              "\\d"                    A digit                                  [0-9]
\D              "\\D"                    A non-digit                              [^0-9]
\s              "\\s"                    A whitespace character                   [ \t\n\x0B\f\r]
\S              "\\S"                    A non-whitespace character               [^\s]
\w              "\\w"                    A character that can be part of a word   [a-zA-Z_0-9]
\W              "\\W"                    A character that isn't part of a word    [^\w]

We use “.” to match any character. We also need to apply a quantifier that says how many times to do this. The quantifier “*” means “any number of times.” It applies to whatever immediately precedes it. Putting together the “match any character” dot with the “any number of times” quantifier, our pattern to match the subject line of email is the following:

"Subject: .*"

This matches up to the end of a line because, by default, the dot does not match line terminator characters. If you put that pattern into a suitable program and run it, you get output similar to the following:

Found text: Subject: weather
Found text: Subject: no change!
Found text: Subject: born in London
Found text: Subject: Help - no rain.
Found text: Subject: SHIFT KEY

If you look back at the email.txt file, you'll see that the “born in London” text is not actually a subject line. We will fix that in a later section.

Quantifiers

There are other quantifiers that can express different amounts of repetition. Table 19-2 shows some quantifiers that specify the number of times a particular character or pattern should match. In this table, “X” represents any pattern.

Table 19-2. Quantifiers

Pattern   Meaning
X?        X, zero or one time
X*        X, zero or more times
X+        X, one or more times
X{n}      X, exactly n times
X{n,}     X, at least n times
X{n,m}    X, at least n but not more than m times

You can group patterns in parentheses to indicate exactly what is being repeated. So “(\w*: \w*)*” will match any number of sequences that consist of wordcharacters-colon-space-wordcharacters. This is a pattern for email headers. Don't be fooled by the name “wordcharacter”. It only matches a single character, not an entire word. If you want it to match a word, you have to use a quantifier to repeat it, as shown in the email example.

All quantifier operators (+, *, ?, {m,n}) are greedy by default, meaning that they match as many elements of the String as possible without causing the overall match to fail. In contrast, a reluctant closure will match as few elements of the String as possible when finding matches. You can make a quantifier reluctant by appending a '?' to the quantifier. An example of a reluctant quantifier is shown later.
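The difference between greedy and reluctant matching is easiest to see side by side. The following is a minimal sketch; the class name GreedyDemo, the helper first(), and the sample text are assumptions for this illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class GreedyDemo {
    // Returns the first match of the pattern in the input, or null.
    static String first(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String text = "<b>bold</b>";
        System.out.println(first("<.+>", text));            // greedy: <b>bold</b>
        System.out.println(first("<.+?>", text));           // reluctant: <b>
        System.out.println(first("[0-9]{2,3}", "12345"));   // at most three digits: 123
    }
}
```

The greedy “.+” runs to the last “>” it can reach, while the reluctant “.+?” stops at the first one.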

Capturing groups and back references

Another use for parentheses is to represent matching subpatterns within the overall pattern. These subpatterns are called capturing groups, and you can retrieve them independently from the matcher you use in your code. You can also refer to one of these capturing groups later in the expression itself with a backslash-escaped number, called a back reference. A backslash followed by a nonzero digit is always interpreted as a back reference. The first capturing group is referred to by \1, the second by \2, and so on. Therefore, the expression “([0-9]+)==\1” would match input like “2==2” or “17==17”. Remember to double those backslashes when you want to put them in a Java String!
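Here is the back reference expression from the paragraph above written as a Java String, with the backslash doubled. The class name BackRefDemo and the helper sameOnBothSides() are made up for this sketch.

```java
import java.util.regex.Pattern;

public class BackRefDemo {
    // The regular expression ([0-9]+)==\1 written as a Java String literal.
    private static final Pattern SAME = Pattern.compile("([0-9]+)==\\1");

    // True if the whole input is "number==number" with the same digits twice.
    static boolean sameOnBothSides(String input) {
        return SAME.matcher(input).matches();
    }

    public static void main(String[] args) {
        System.out.println(sameOnBothSides("2==2"));     // true
        System.out.println(sameOnBothSides("17==17"));   // true
        System.out.println(sameOnBothSides("17==18"));   // false: \1 must repeat "17"
    }
}
```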

Back references let you match against patterns that contain Reader's Digest style junk mail. If you've never received a Reader's Digest letter, it is personalized by repeating your name and other details they have on file about you. A typical phony letter would be similar to the following:


    Dear Peter,
    Excuse the intrusion, Peter, but we just wanted to ask you who would
    look after your family at 123 Main Street, if anything should happen to you,
    Peter?
    Life insurance is not that expensive, Peter, and surely the family is worth it.
    Please contact us for more details, Peter.

The following pattern would match against this:

Pattern p = Pattern.compile(
           "^Dear (\\w+),$\\r?\\n"  // matches "Dear name," and its line terminator
           + "(^"                   //   any number of lines
           +   ".* \\1.*"           //   each line has the name we captured in group 1
           + "$\\r?\\n?)*"          //   end of line, including its terminator
         , Pattern.MULTILINE );

As you can see, patterns quickly become difficult to read. This pattern matches the greeting and then up to the first line not containing the name. The important things in the Reader's Digest example are the expression in parentheses on the first line (the parentheses make it a capturing group), and the “\1” on the third line of the pattern, which is a back reference to capturing group 1. Capturing groups are numbered according to the order in which their opening parentheses appear. Capturing groups can nest inside each other.

Whenever you use parentheses, the bracketed part of the pattern becomes a capturing group (there is a way to turn that off). The method Matcher.group(int i) returns the input sequence captured by group i during the most recent match. To extract the actual name and print it, we would add the following code:

Matcher m = p.matcher( someBuffer );
if (m.find()) { 
   System.out.println("Letter personalized for: " + m.group(1)); 
   System.out.println("line from letter: " + m.group(2)); 
}

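Group 1 is the name. Group 2 sits inside a repetition, so the call to group(2) does not return every line the group matched. It returns only the text from the most recent repetition, which is the last line of the letter that still contains the name (“Please contact us for more details, Peter.” if every line after the greeting mentions Peter).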
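The “most recent match” behavior of a repeated group can be checked with a much smaller pattern. This is only a sketch; the class name RepeatedGroupDemo, the helper lastIteration(), and the sample input are assumptions for the example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RepeatedGroupDemo {
    // Group 1 is repeated by the "+" quantifier that follows it.
    private static final Pattern ITEMS = Pattern.compile("([a-z]+;)+");

    // Returns what group 1 holds after a successful find(), or null.
    static String lastIteration(String input) {
        Matcher m = ITEMS.matcher(input);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        // The overall match is "ab;cd;ef;" but group 1 keeps only its last repetition.
        System.out.println(lastIteration("ab;cd;ef;"));   // ef;
    }
}
```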

Anchors

Returning to our email example, the pattern “Subject: .*” will also find non-subject lines, where text similar to the following exists:

How to be a British Subject: marry into the Royal Family.

That's not an email subject line, but it contains characters that match our pattern. If we really only want Subject lines, we need to be able to specify that the pattern only matches something at the beginning of a line. To do this we use a set of metacharacters called “anchors”, which attach the pattern to a particular place. Table 19-3 shows some anchor characters and how they affect matching. In this email example, anchoring the pattern to the beginning of a line is still not enough. The pattern could occur in the body of an email message at the beginning of a line. To get this exactly right, you will need to match on the whole message, and distinguish headers from the body. An exercise at the end of the chapter allows you to do that.

Table 19-3. Anchor characters

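Anchor   Effect
^        The beginning of a line (also needs the multiline flag)
$        The end of a line (also needs the multiline flag)
\b       A word boundary
\B       A non-word boundary
\A       The beginning of the input
\Z       The end of the input but for the final terminator, if any
\z       The end of the input

Some other regular expression tools also offer \< and \> for the beginning and the end of a word. The java.util.regex package does not treat these as anchors (they simply match a literal < or >); use \b instead.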

Notice that the anchor for the beginning of a line is a caret. Don't get confused by the fact that caret is also used in ranges with a different meaning. There are so many metacharacters needed that it's inevitable that a few would be reused. We can anchor our email subject search to the beginning of a line with a pattern similar to the following:

"^Subject: .*"

By default, the expressions for beginning and end of line don't do that! They only match the beginning and end of the whole input. You must set a pattern flag for multiple lines, as we did previously for letter case. The multiline flag makes “^” and “$” match at the line boundaries inside the input as well. The following example shows you how to set the pattern and a couple of flags in one statement:

Pattern p = Pattern.compile("^Subject: .*$", 
              Pattern.MULTILINE | Pattern.CASE_INSENSITIVE );

The flags are actually integer constants, and you “or” them together to combine their effect, as shown in the previous line of code.
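The effect of the flags can be seen by running the same pattern with and without them. The sketch below uses an in-memory String in place of the email file; the class name MultilineDemo, the helper findAll(), and the sample text MAIL are assumptions for this illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class MultilineDemo {
    // A made-up four-line input standing in for the email file.
    static final String MAIL =
          "From: someone\n"
        + "Subject: weather\n"
        + "He was a British Subject: born in London\n"
        + "subject: rain\n";

    // Collects every match of the pattern, compiled with the given flags.
    static List<String> findAll(String regex, int flags, CharSequence input) {
        Matcher m = Pattern.compile(regex, flags).matcher(input);
        List<String> hits = new ArrayList<String>();
        while (m.find()) {
            hits.add(m.group());
        }
        return hits;
    }

    public static void main(String[] args) {
        // No flags: ^ matches only at the start of the input, so nothing is found.
        System.out.println(findAll("^Subject: .*$", 0, MAIL));
        // Both flags: finds the two real subject lines, in either letter case.
        System.out.println(findAll("^Subject: .*$",
                Pattern.MULTILINE | Pattern.CASE_INSENSITIVE, MAIL));
    }
}
```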

Alternation

Let's make this example more realistic by writing a pattern to extract a series of entire email messages. A mail message is defined as everything between one “From:” at the start of a line, and the next one. There is a method in the Pattern class that splits an input sequence into pieces that are separated by the pattern. This is similar to what a Scanner does with a delimiter. It returns an array of Strings, and is perfect for this purpose. The code to use it looks similar to the following example:

Pattern p = Pattern.compile("^From:", Pattern.MULTILINE); 
String[] messages = p.split( someBuffer );

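This code will split our file into Strings, each of which contains one email. The delimiter pattern is not copied into the resulting messages array. Because the file begins with the delimiter itself, the first element of the array is an empty String, and the messages start at index 1.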
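A small sketch shows what split() returns for input that starts with the delimiter. The class name SplitDemo, the helper splitMessages(), and the two-message sample text are invented for the example.

```java
import java.util.regex.Pattern;

public class SplitDemo {
    private static final Pattern FROM = Pattern.compile("^From:", Pattern.MULTILINE);

    // Splits the input at every "From:" that starts a line.
    static String[] splitMessages(CharSequence input) {
        return FROM.split(input);
    }

    public static void main(String[] args) {
        String mail = "From: a\nhello\nFrom: b\nbye\n";
        String[] messages = splitMessages(mail);
        System.out.println(messages.length);         // 3
        System.out.println(messages[0].isEmpty());   // true: nothing precedes the first "From:"
        System.out.print(messages[1]);               // " a" and "hello", without the "From:"
    }
}
```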

Now let's look at a couple of other topics relating to pattern matching. The first one is alternation, which uses the “|” (vertical bar) metacharacter. The second topic is how to say “anything except this word”.

When you place a “|” in a pattern, it means “or”. The pattern will match if either the left side of the “|” or the right side matches the input. That's all there is to alternation! You can use parentheses to group the alternate things more explicitly if needed. There is no corresponding “and” feature, because you get that effect by writing two subpatterns one after the other. So “XY” means “match an X followed by a Y,” while “X|Y” means “match if you see either an X or a Y.”
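Alternation, with and without grouping parentheses, might look like this in code. The class name AlternationDemo and the sample words are assumptions for this sketch.

```java
import java.util.regex.Pattern;

public class AlternationDemo {
    // True if the entire input matches the pattern.
    static boolean matches(String regex, String input) {
        return Pattern.compile(regex).matcher(input).matches();
    }

    public static void main(String[] args) {
        System.out.println(matches("cat|dog", "dog"));     // true: the right side matches
        System.out.println(matches("cat|dog", "cow"));     // false: neither side matches
        // Parentheses limit the alternation to a single letter inside the word.
        System.out.println(matches("gr(a|e)y", "gray"));   // true
        System.out.println(matches("gr(a|e)y", "grey"));   // true
        System.out.println(matches("gr(a|e)y", "groy"));   // false
    }
}
```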

Word negation

The next topic we will cover in this section is how to match the negation of a word. Ranges provide an easy way to match the negation of a single character. There is no built-in support for negation (“everything but”) of an entire word. Most people's first guess at a pattern to exclude all lines that start with “From” is “^[^F][^r][^o][^m]”.

This doesn't do what you want! It matches only where the first character is not an “F”, and the second character is not an “r”, and the third character is not an “o”, and the fourth is not an “m”. Because these “not equal to” conditions are “anded” together, a line is rejected as soon as any one of its first four characters coincides with the letter in the same position of “From”: F..., .r.., ..o., or ...m. Lines starting with “Frob”, “grin”, “shot”, or “glum” are all rejected, even though none of them starts with “From”. Regular expressions really ought to have an operator that matches the negation of a word.
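You can confirm the problem with a few lines of code. The class name NegationDemo, the helper naive(), and the sample lines are made up for this sketch.

```java
import java.util.regex.Pattern;

public class NegationDemo {
    // The tempting but wrong "does not start with From" pattern.
    private static final Pattern NAIVE = Pattern.compile("^[^F][^r][^o][^m]");

    // True if the naive pattern accepts the start of the line.
    static boolean naive(String line) {
        return NAIVE.matcher(line).find();
    }

    public static void main(String[] args) {
        System.out.println(naive("From: someone"));      // false, as intended
        System.out.println(naive("To: someone"));        // true, as intended
        System.out.println(naive("grin and bear it"));   // false: wrongly rejected, "r" is second
        System.out.println(naive("glum today"));         // false: wrongly rejected, "m" is fourth
    }
}
```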

You have to create the idiom of “exclude this word” from other primitive operations that are available. To extract complete email messages, we want to start with a “From:” at the beginning of a line, and go up to but not including the next “From:” at the beginning of a line.

One way to express this is with a pattern in three parts: a “^From:” matched literally, then an “everything except a '^From'”, then a “^From:” or an end-of-input (using alternation). Following is the code example:

Pattern p = Pattern.compile(
"^From:.*$"                 // first "From:" line
+ "(\\r?\\n.*$)*?"          // anything, over several lines
+ "\\r?\\n?(?=From:|\\z)"   // up to the second "From:" or end of input
, Pattern.MULTILINE);

The “?” in the pattern “*?” is the “reluctant quantifier” that we mentioned earlier. It matches zero or more times reluctantly: the matcher first tries to satisfy the rest of the pattern, and repeats the group once more only if that fails. Similarly, the “?=From:” is a special construct that provides a match with lookahead. The characters it matches are not consumed as part of the match but remain in the buffer for the next find() attempt. Finally, the “|\z” causes a match on either the “From:” or the end of input.

One place where regular expressions can be used is in the accept() method of the interface java.io.FileFilter. The class java.io.File has a method File[] listFiles(FileFilter filter). You can write a class implementing FileFilter and supply the only method there, which is accept(File). You can put your file selection logic in there, and base it on desired filename patterns. (The similarly named abstract class javax.swing.filechooser.FileFilter plays the same role for a Swing JFileChooser.)
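A filter of this kind might be sketched as follows. The class name RegexFileFilter and the choice to test only the file name (not whether the file exists or is a directory) are assumptions made for this example.

```java
import java.io.File;
import java.io.FileFilter;
import java.util.regex.Pattern;

public class RegexFileFilter implements FileFilter {
    private final Pattern namePattern;

    public RegexFileFilter(String regex) {
        namePattern = Pattern.compile(regex);
    }

    // Accepts a file when its entire name (without the path) matches the pattern.
    @Override
    public boolean accept(File f) {
        return namePattern.matcher(f.getName()).matches();
    }

    public static void main(String[] args) {
        // The regex equivalent of the wildcard *.java: anything, then a literal ".java"
        File[] sources = new File(".").listFiles(new RegexFileFilter(".*\\.java"));
        if (sources != null) {               // listFiles() returns null on an I/O error
            for (File f : sources) {
                System.out.println(f.getName());
            }
        }
    }
}
```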

Metawords

There are also metawords that match entire categories, as shown in Table 19-4.

Table 19-4. POSIX character classes

Class (US-ASCII only)   Meaning
\p{Lower}               A lower-case alphabetic character: [a-z]
\p{Upper}               An upper-case alphabetic character: [A-Z]
\p{Alpha}               An alphabetic character: [\p{Lower}\p{Upper}]
\p{Digit}               A decimal digit: [0-9]
\p{Alnum}               An alphanumeric character: [\p{Alpha}\p{Digit}]
\p{Punct}               Punctuation: one of !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~
\p{Graph}               A visible character: [\p{Alnum}\p{Punct}]
\p{Print}               A printable character: [\p{Graph}\x20]
\p{Blank}               A space or a tab: [ \t]
\p{Cntrl}               A control character: [\x00-\x1F\x7F]
\p{XDigit}              A hexadecimal digit: [0-9a-fA-F]
\p{Space}               A white space character: [ \t\n\x0B\f\r]
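The POSIX classes are written in a Java String with a doubled backslash, like the other metacharacters. Here is a minimal sketch; the class name PosixDemo, the helper first(), and the sample inputs are assumptions for the illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class PosixDemo {
    // Returns the first match of the pattern in the input, or null.
    static String first(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        // A capital letter followed by one or more lower-case letters
        System.out.println(first("\\p{Upper}\\p{Lower}+", "hello World"));   // World
        System.out.println(first("\\p{Digit}+", "room 42"));                 // 42
        System.out.println(first("\\p{Punct}", "Help - no rain."));          // -
    }
}
```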

All of the state involved in performing a match is in the matcher, so many matchers can share the same pattern. But matcher objects are not thread safe, and one matcher should not be invoked from different threads at the same time.
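One way to respect this rule is to share a single compiled Pattern and create a fresh Matcher inside each call, so that no Matcher is ever visible to two threads. The sketch below assumes a made-up class name SharedPattern and helper countNumbers(), and uses Java 8 lambda syntax for the threads.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SharedPattern {
    // A compiled Pattern is immutable, so one instance can be shared by all threads.
    private static final Pattern NUMBER = Pattern.compile("[0-9]+");

    // Counts the runs of digits in the input, using a Matcher local to this call.
    static int countNumbers(CharSequence input) {
        Matcher m = NUMBER.matcher(input);   // match state lives here, not in the Pattern
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) throws InterruptedException {
        Thread t1 = new Thread(() -> System.out.println(countNumbers("1 22 333")));
        Thread t2 = new Thread(() -> System.out.println(countNumbers("no numbers here")));
        t1.start();
        t2.start();
        t1.join();
        t2.join();
    }
}
```

The two threads print 3 and 0, in whichever order they happen to run.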

Finally, the JDK comes with an example program that searches for regular expressions in files. It is modeled on the Unix utility “grep”, whose name comes from the ed editor command g/re/p, “globally search for a regular expression and print”. A much simplified version of the program follows. Please review it carefully, as it provides a non-trivial practical example of the use of regular expressions.

Java grep program

// Search a list of files for lines that match a given regular-expression
// pattern.  Demonstrates NIO mapped byte buffers, charsets, and regular
// expressions.
import java.io.*;
import java.nio.*;
import java.nio.channels.*;
import java.nio.charset.*;
import java.util.regex.*;
public class Grep {
   public static void main(String[] args) {
        if (args.length < 2) {
            System.err.println("Usage: java Grep pattern file..."); 
            return; 
        } 
        doCompile(args[0]); 
        for (int i = 1; i < args.length; i++) { 
            File f = new File(args[i]); 
            try {
               CharBuffer cb = mapInFile(f); 
               grep(f, cb); 
            } catch (IOException x) { 
               System.err.println(f + ": " + x); 
            }
        }
    }
    // Charset and decoder for ISO-8859-15
    private static Charset charset = Charset.forName("ISO-8859-15");
    private static CharsetDecoder decoder = charset.newDecoder();
    // Pattern used to separate files into lines
    private static Pattern linePattern = Pattern.compile(".*\r?\n");
    // The input pattern that we're looking for
    private static Pattern pattern;
    // Compile the pattern from the command line
    //
    private static void doCompile(String pat) {
      try {
          pattern = Pattern.compile(pat);
      } catch (PatternSyntaxException x) {
          System.err.println(x.getMessage());
          System.exit(1); 
      } 
    } 
    // Use the linePattern to break the given CharBuffer into lines,
    // applying 
    // the input pattern to each line to see if we have a match 
    private static void grep(File f, CharBuffer cb) { 
        Matcher lm = linePattern.matcher(cb); // Line matcher
        Matcher pm = null;   // Pattern matcher
        int lines = 0;
        while (lm.find()) {
            lines++;
            CharSequence cs = lm.group();  // The current line
            if (pm == null)
                pm = pattern.matcher(cs);
            else
                pm.reset(cs);
           if (pm.find())
                System.out.print(f + ":" + lines + ":" + cs);
            if (lm.end() == cb.limit())
                break;
         }
    }
    // Map the given file into memory and decode it into a CharBuffer
    private static CharBuffer mapInFile(File f) throws IOException {
        // Open the file and then get a channel from the stream
        FileInputStream fis = new FileInputStream(f);
        FileChannel fc = fis.getChannel();
        int size = (int)fc.size();
        MappedByteBuffer mbb = fc.map(FileChannel.MapMode.READ_ONLY, 0,
                                                                    size);
        // Decode the file into a char buffer 
       CharBuffer cb = decoder.decode(mbb); 
       return cb; 
    } 
}

That concludes the discussion of regular expressions. Now let's describe some more classes from package java.util.
