Chapter 13. Strings and Regular Expressions

 

What's the use of a good quotation if you can't change it?

 
 --Dr. Who, The Two Doctors

Strings are standard objects with built-in language support. You have already seen many examples of using string literals to create string objects. You've also seen the + and += operators that concatenate strings to create new strings. The String class, however, has much more functionality to offer. String objects are immutable (read-only), so you also have a StringBuilder class for mutable strings. This chapter describes String and StringBuilder and some related classes, including utilities for regular expression matching.

Character Sequences

As described in “Character Set” on page 161, the Java programming language represents text consisting of Unicode characters as sequences of char values using the UTF-16 encoding format. The String class defines objects that represent such character sequences. More generally, the java.lang.CharSequence interface is implemented by any class that represents such a character sequence—this includes the String, StringBuilder, and StringBuffer classes described in this chapter, together with the java.nio.CharBuffer class that is used for performing I/O.

The CharSequence interface is simple, defining only four methods:

  • public char charAt(int index)

    • Returns the char in this sequence at the given index. Sequences are indexed from zero to length()-1 (just as arrays are indexed). As this is a UTF-16 sequence of characters, the returned value may be an actual character or a value that is part of a surrogate pair. If the index is negative or not less than the length of the sequence, then an IndexOutOfBoundsException is thrown.

  • public int length()

    • Returns the length of this character sequence.

  • public CharSequence subSequence(int start, int end)

    • Returns a new CharSequence that contains the char values in this sequence consisting of charAt(start) through to charAt(end-1). If start is greater than end or if use of either value would try to index outside this sequence, then an IndexOutOfBoundsException is thrown. Be careful to ensure that the specified range doesn't split any surrogate pairs.

  • public String toString()

    • Overrides the contract of Object.toString to specify that it returns the character sequence represented by this CharSequence.
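Because String and StringBuilder both implement CharSequence, a method written against the interface accepts either. The following is a minimal sketch rather than a library method; the class name CommonPrefix and the method commonPrefix are hypothetical names chosen for illustration.

```java
class CommonPrefix {
    // Returns the leading run of char values shared by a and b, using
    // only the methods that CharSequence declares. This sketch compares
    // char values, so it does not guard against splitting a surrogate pair.
    static CharSequence commonPrefix(CharSequence a, CharSequence b) {
        int max = Math.min(a.length(), b.length());
        int i = 0;
        while (i < max && a.charAt(i) == b.charAt(i))
            i++;
        return a.subSequence(0, i);
    }

    public static void main(String[] args) {
        CharSequence s = "interface";
        CharSequence sb = new StringBuilder("internal");
        System.out.println(commonPrefix(s, sb));    // prints "inter"
    }
}
```

The call to toString is left to println here; code that needs a String rather than a CharSequence would invoke toString on the result explicitly.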

The String Class

Strings are immutable (read-only) character sequences: Their contents can never be changed after the string is constructed. The String class provides numerous methods for working with strings—searching, comparing, interacting with other character sequences—an overview of which is given in the following sections.

Basic String Operations

You can create strings implicitly either by using a string literal (such as "Größe") or by using + or += on two String objects to create a new one.

You can also construct String objects explicitly using new. The String class supports the following simple constructors (other constructors are shown in later sections):

  • public String()

    • Constructs a new String with the value ""—an empty string.

  • public String(String value)

    • Constructs a new String that is a copy of the specified String object value—this is a copy constructor. Because String objects are immutable, this is rarely used.

  • public String(StringBuilder value)

    • Constructs a new String with the same contents as the given StringBuilder.

  • public String(StringBuffer value)

    • Constructs a new String with the same contents as the given StringBuffer.

The most basic methods of String objects are length and charAt, as defined by the CharSequence interface. This loop counts the number of each kind of character in a string:

for (int i = 0; i < str.length(); i++)
    counts[str.charAt(i)]++;

Note that length is a method for String, while for an array it is a field. It's common for beginners to confuse the two.
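The counting loop above leaves the declaration of counts to the reader. One way to complete it, as a sketch, is to allocate one counter for every possible char value; the class name CharCounts and the method countChars are hypothetical.

```java
class CharCounts {
    // One counter per possible char value (0 through 0xFFFF).
    static int[] countChars(String str) {
        int[] counts = new int[Character.MAX_VALUE + 1];
        for (int i = 0; i < str.length(); i++)
            counts[str.charAt(i)]++;    // length() and charAt are methods
        return counts;
    }

    public static void main(String[] args) {
        int[] counts = countChars("banana");
        System.out.println("a: " + counts['a']);         // prints "a: 3"
        System.out.println("n: " + counts['n']);         // prints "n: 2"
        System.out.println("size: " + counts.length);    // length is a field here
    }
}
```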

In most String methods, a string index position less than zero or greater than length()-1 causes an IndexOutOfBoundsException to be thrown. Some implementations throw the more specific StringIndexOutOfBoundsException, which can take the illegal index as a constructor argument and then include it in a detailed message. Methods or constructors that copy values to or from an array will also throw IndexOutOfBoundsException if any attempt is made to access outside the bounds of that array.

There are also simple methods to find the first or last occurrence of a particular character or substring in a string. The following method returns the number of characters between the first and last occurrences of a given character in a string:

static int countBetween(String str, char ch) {
    int begPos = str.indexOf(ch);
    if (begPos < 0)         // not there
        return -1;
    int endPos = str.lastIndexOf(ch);
    return endPos - begPos - 1;
}

The countBetween method finds the first and last positions of the character ch in the string str. If the character does not occur twice in the string, the method returns -1. The difference between the two character positions is one more than the number of characters in between (if the two positions were 2 and 3, the number of characters in between is zero).

Several overloads of the method indexOf search forward in a string, and several overloads of lastIndexOf search backward. Each method returns the index of what it found, or –1 if the search was unsuccessful:

Method                               Returns Index Of...
indexOf(int ch)                      first position of ch
indexOf(int ch, int start)           first position of ch ≥ start
indexOf(String str)                  first position of str
indexOf(String str, int start)       first position of str ≥ start
lastIndexOf(int ch)                  last position of ch
lastIndexOf(int ch, int start)       last position of ch ≤ start
lastIndexOf(String str)              last position of str
lastIndexOf(String str, int start)   last position of str ≤ start

The indexing methods that take an int parameter for the character to look for will search for the given character if the value is no greater than 0xFFFF, or else the code point with the given value (see “Working with UTF-16” on page 336).

If you don't care about the actual index of the substring, you can use the contains method, which returns true if the current string contains a given CharSequence as a subsequence. If you want to find the index of an arbitrary CharSequence you must invoke toString on the CharSequence and pass that to indexOf instead.
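A short sketch of these search methods follows. The class name FindDemo and its helper indexOf are hypothetical; the helper only wraps the toString conversion described above.

```java
class FindDemo {
    // Index of the first occurrence of target in str, or -1 if absent.
    // Converts the CharSequence with toString before calling
    // String.indexOf(String).
    static int indexOf(String str, CharSequence target) {
        return str.indexOf(target.toString());
    }

    public static void main(String[] args) {
        String str = "one fish, two fish";
        CharSequence target = new StringBuilder("fish");

        System.out.println(str.contains(target));          // true
        System.out.println(indexOf(str, target));          // 4
        System.out.println(str.indexOf("fish", 5));        // 14
        System.out.println(str.lastIndexOf("fish"));       // 14
        System.out.println(str.lastIndexOf("fish", 13));   // 4
        System.out.println(str.indexOf('z'));              // -1
    }
}
```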

Exercise 13.1: Write a method that counts the number of occurrences of a given character in a string.

Exercise 13.2: Write a method that counts the number of occurrences of a particular string in another string.

String Comparisons

The String class supports several methods to compare strings and parts of strings. Before we describe the methods, though, you should be aware that internationalization and localization issues of full Unicode strings are not addressed with these methods. For example, when you're comparing two strings to determine which is “greater,” characters in strings are compared numerically by their Unicode values, not by their localized notion of order. To a French speaker, c and ç are the same letter, differing only by a small diacritical mark. Sorting a set of strings in French should ignore the difference between them, placing "açb" before "acz" because b comes before z. But the Unicode characters are different: c (\u0063) comes before ç (\u00e7) in the Unicode character set, so these strings will actually sort the other way around. Internationalization and localization are discussed in Chapter 24.

The first compare operation is equals, which returns true if it is passed a reference to a String object having the same contents—that is, the two strings have the same length and exactly the same Unicode characters. If the other object isn't a String or if the contents are different, String.equals returns false. As you learned on page 100, this overrides Object.equals to define equivalence instead of identity.

To compare strings while ignoring case, use the equalsIgnoreCase method. By “ignore case,” we mean that Ë and ë are considered the same but are different from E and e. Characters with no case distinctions, such as punctuation, compare equal only to themselves. Unicode has many interesting case issues, including a notion of “titlecase.” Case issues in String are handled in terms of the case-related methods of the Character class, as described in “Character” on page 192.

A String can be compared with an arbitrary CharSequence by using the contentEquals method, which returns true if both objects represent exactly the same sequence of characters.
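The three equality tests can be contrasted in a few lines. This is only an illustrative sketch; the class name EqualityDemo is hypothetical.

```java
class EqualityDemo {
    public static void main(String[] args) {
        String a = "Hello";
        StringBuilder sb = new StringBuilder("Hello");

        System.out.println(a.equals("hello"));            // false: case differs
        System.out.println(a.equalsIgnoreCase("hello"));  // true
        System.out.println(a.equals(sb));                 // false: sb is not a String
        System.out.println(a.contentEquals(sb));          // true: same characters
    }
}
```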

To sort strings, you need a way to order them, so String implements the interface Comparable<String>—the Comparable interface was described on page 118. The compareTo method returns an int that is less than, equal to, or greater than zero when the string on which it is invoked is less than, equal to, or greater than the other string. The ordering used is Unicode character ordering. The String class also defines a compareToIgnoreCase method.
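As a sketch of what this ordering means in practice, the following sorts an array first by compareTo and then ignoring case. The class and method names are hypothetical. The second sort uses String.CASE_INSENSITIVE_ORDER, a comparator the String class supplies that orders strings as compareToIgnoreCase does.

```java
import java.util.Arrays;

class SortDemo {
    // Sorts a copy using the natural ordering defined by compareTo.
    static String[] byCharValue(String[] words) {
        String[] copy = words.clone();
        Arrays.sort(copy);
        return copy;
    }

    // Sorts a copy using the ordering defined by compareToIgnoreCase.
    static String[] ignoringCase(String[] words) {
        String[] copy = words.clone();
        Arrays.sort(copy, String.CASE_INSENSITIVE_ORDER);
        return copy;
    }

    public static void main(String[] args) {
        // "a\u00e7b" is the string "açb" from the earlier discussion
        String[] words = { "acz", "a\u00e7b", "Zebra", "apple" };
        System.out.println(Arrays.toString(byCharValue(words)));
        // order: Zebra, acz, apple, açb
        System.out.println(Arrays.toString(ignoringCase(words)));
        // order: acz, apple, açb, Zebra
    }
}
```

In both results "acz" sorts before "açb", which is the Unicode ordering rather than the French one.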

The compareTo method is useful for creating an internal canonical ordering of strings. A binary search, for example, requires a sorted list of elements, but it is unimportant that the sorted order be local language order. Here is a binary search lookup method for a class that has a sorted array of strings:

private String[] table;

public int position(String key) {
    int lo = 0;
    int hi = table.length - 1;
    while (lo <= hi) {
        int mid = lo + (hi - lo) / 2;
        int cmp = key.compareTo(table[mid]);
        if (cmp == 0)       // found it!
            return mid;
        else if (cmp < 0)   // search the lower part
            hi = mid - 1;
        else                // search the upper part
            lo = mid + 1;
    }
    return -1;              // not found
}

This is the basic binary search algorithm. It first checks the midpoint of the search range to determine whether the key is greater than, equal to, or less than the element at that position. If they are the same, the element has been found and the search is over. If the key is less than the element at the position, the lower half of the range is searched; otherwise, the upper half is searched. Eventually, either the element is found or the lower end of the range becomes greater than the higher end, in which case the key is not in the list.

In addition to entire strings, regions of strings can also be compared for equality. The method for this is regionMatches, and it has two forms:

  • public boolean regionMatches(int start, String other, int ostart, int count)

    • Returns true if the given region of this String has the same Unicode characters as the given region of the string other. Checking starts in this string at the position start, and in the other string at position ostart. Only the first count characters are compared.

  • public boolean regionMatches(boolean ignoreCase, int start, String other, int ostart, int count)

    • This version of regionMatches behaves exactly like the previous one, but the boolean ignoreCase controls whether case is significant.

For example:

class RegionMatch {
    public static void main(String[] args) {
        String str = "Look, look!";
        boolean b1, b2, b3;

        b1 = str.regionMatches(6, "Look", 0, 4);
        b2 = str.regionMatches(true, 6, "Look", 0, 4);
        b3 = str.regionMatches(true, 6, "Look", 0, 5);

        System.out.println("b1 = " + b1);
        System.out.println("b2 = " + b2);
        System.out.println("b3 = " + b3);
    }
}

Here is its output:

b1 = false
b2 = true
b3 = false

The first comparison yields false because the character at position 6 of the main string is 'l' and the character at position 0 of the other string is 'L'. The second comparison yields true because case is not significant. The third comparison yields false because the comparison length is now 5 but the string "Look" has only four characters, so the two regions cannot match over five characters, even ignoring case.

In querying methods, such as regionMatches and those we mention next, any invalid indexes simply cause false to be returned rather than throwing exceptions. Passing a null argument when an object is expected generates a NullPointerException.

You can do simple tests for the beginnings and ends of strings by using startsWith and endsWith:

  • public boolean startsWith(String prefix, int start)

    • Returns true if this String starts (at start) with the given prefix.

  • public boolean startsWith(String prefix)

    • Equivalent to startsWith(prefix, 0).

  • public boolean endsWith(String suffix)

    • Returns true if this String ends with the given suffix.
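A brief sketch of these tests follows; the class name AffixDemo and the sample URL are made up for illustration. The last call shows an out-of-range start position producing false rather than an exception.

```java
class AffixDemo {
    public static void main(String[] args) {
        String url = "http://example.com/index.html";

        System.out.println(url.startsWith("http://"));       // true
        System.out.println(url.startsWith("example", 7));    // true
        System.out.println(url.endsWith(".html"));           // true
        System.out.println(url.startsWith("http://", 100));  // false
    }
}
```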

String Literals, Equivalence and Interning

In general, using == to compare strings will give you the wrong results. Consider the following code:

if (str == "¿Peña?")
    answer(str);

This does not compare the contents of the two strings. It compares one object reference (str) to another (the string object representing the literal "¿Peña?"). Even if str contains the string "¿Peña?" this == expression will almost always yield false because the two strings will be held in different objects. Using == on objects only tests whether the two references refer to the same object, not whether they are equivalent objects.

However, any two string literals with the same contents will refer to the same String object. For example, == works correctly in the following code:

String str = "¿Peña?";
// ...
if (str == "¿Peña?")
    answer(str);

Because str is initially set to a string literal, comparing with another string literal is equivalent to comparing the strings for equal contents. But be careful—this works only if you are sure that all string references involved are references to string literals. If str is changed to refer to a manufactured String object, such as the result of a user typing some input, the == operator will return false even if the user types ¿Peña? as the string.

To overcome this problem you can intern the strings that you don't know for certain refer to string literals. The intern method returns a String that has the same contents as the one it is invoked on. However, any two strings with the same contents return the same String object from intern, which enables you to compare string references to test equality, instead of the slower test of string contents. For example:

int putIn(String key) {
    String unique = key.intern();
    int i;
    // see if it's in the table already
    for (i = 0; i < tableSize; i++)
        if (table[i] == unique)
            return i;
    // it's not there--add it in
    table[i] = unique;
    tableSize++;
    return i;
}

All the strings stored in the table array are the result of an intern invocation. The table is searched for a string that was the result of an intern invocation on another string that had the same contents as the key. If this string is found, the search is finished. If not, we add the unique representative of the key at the end. Dealing with the results of intern makes comparing object references equivalent to comparing string contents, but much faster.
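The effect of intern on reference comparison can be seen in isolation with a small sketch (the class name InternDemo is hypothetical):

```java
class InternDemo {
    public static void main(String[] args) {
        String literal = "hello";
        // Built at run time, so it is a different object from the literal.
        String built = new StringBuilder("hel").append("lo").toString();

        System.out.println(literal == built);            // false
        System.out.println(literal.equals(built));       // true
        System.out.println(literal == built.intern());   // true
    }
}
```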

Any two strings with the same contents are guaranteed to have the same hash code—the String class overrides Object.hashCode—although two different strings might also have the same hash code. Hash codes are useful for hashtables, such as the HashMap class in java.util—see “HashMap” on page 590.
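Both halves of that statement can be checked with a short sketch. The class name HashDemo is hypothetical, and "Aa" and "BB" are simply one well-known pair of different strings whose hash codes coincide.

```java
class HashDemo {
    public static void main(String[] args) {
        String a = "Aa";
        String b = new String("Aa");    // a distinct object with the same contents

        System.out.println(a.hashCode() == b.hashCode());        // true
        System.out.println("Aa".hashCode() == "BB".hashCode());  // true
        System.out.println("Aa".equals("BB"));                   // false
    }
}
```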

Making Related Strings

Several String methods return new strings that are like the old one but with a specified modification. New strings are returned because String objects are immutable. You could extract delimited substrings from another string by using a method like this one:

public static String delimitedString(
    String from, char start, char end)
{
    int startPos = from.indexOf(start);
    int endPos = from.lastIndexOf(end);
    if (startPos == -1)         // no start found
        return null;
    else if (endPos == -1)      // no end found
        return from.substring(startPos);
    else if (startPos > endPos) // start after end
        return null;
    else                        // both start and end found
        return from.substring(startPos, endPos + 1);
}

The method delimitedString returns a new String object containing the string inside from that is delimited by start and end—that is, it starts with the character start and ends with the character end. If start is found but not end, the method returns a new String object containing everything from the start position to the end of the string. The method delimitedString works by using the two overloaded forms of substring. The first form takes only an initial start position and returns a new string containing everything in the original string from that point on. The second form takes both a start and an end position and returns a new string that contains all the characters in the original string from the start to the endpoint, including the character at the start but not the one at the end. This “up to but not including the end” behavior is the reason that the method adds one to endPos to include the delimiter characters in the returned string. For example, the string returned by

delimitedString("Il a dit «Bonjour!»", '«', '»'),

is

«Bonjour!»

Here are the rest of the “related string” methods:

  • public String replace(char oldChar, char newChar)

    • Returns a String with all instances of oldChar replaced with the character newChar.

  • public String replace(CharSequence oldSeq, CharSequence newSeq)

    • Returns a String with each occurrence of the subsequence oldSeq replaced by the subsequence newSeq.

  • public String trim()

    • Returns a String with leading and trailing whitespace stripped. For this method, whitespace is any character whose value is less than or equal to '\u0020' (the space character), which includes space, tab, and newline.
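A small sketch of replace and trim follows (the class name ReplaceDemo is hypothetical). Each call returns a new string and leaves the original untouched.

```java
class ReplaceDemo {
    public static void main(String[] args) {
        String s = "  a-b-c  ";
        String trimmed = s.trim();

        System.out.println("[" + trimmed + "]");            // [a-b-c]
        System.out.println(trimmed.replace('-', '+'));      // a+b+c
        System.out.println(trimmed.replace("-", " - "));    // a - b - c
        System.out.println("[" + s + "]");                  // [  a-b-c  ]
    }
}
```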

A number of methods return related strings based on a match with a given regular expression—see “Regular Expression Matching” on page 321:

  • public String replaceFirst(String regex, String repStr)

    • Returns a String with the first substring that matches the regular expression regex replaced by repStr. Invoked on str, this is equivalent to Pattern.compile(regex).matcher(str).replaceFirst(repStr).

  • public String replaceAll(String regex, String repStr)

    • Returns a String with all substrings that match the regular expression regex replaced by repStr. Invoked on str, this is equivalent to Pattern.compile(regex).matcher(str).replaceAll(repStr).

  • public String[] split(String regex)

    • Equivalent to split(regex,0) (see below).

  • public String[] split(String regex, int limit)

    • Returns an array of strings resulting from splitting up this string according to the regular expression. Each match of the regular expression will cause a split in the string, with the matched part of the string removed. The limit affects the number of times the regular expression will be applied to the string to create the array. Any positive number n limits the number of applications to n–1, with the remainder of the string returned as the last element of the array (so the array will be no larger than n). Any negative limit means that there is no limit to the number of applications and the array can have any length. A limit of zero behaves like a negative limit, but trailing empty strings will be discarded. Invoked on str, this is equivalent to Pattern.compile(regex).split(str, limit).

      This is easier to understand with an example. The following table shows the array elements returned from split("--",n) invoked on the string "w--x--y--" for n equal to –1, 0, 1, 2, 3, and 4:

       Limit:  -1       0           1       2       3       4
      Results
         [0]:   w       w   w--x--y--       w       w       w
         [1]:   x       x              x--y--       x       x
         [2]:   y       y                         y--       y
         [3]:  ""                                          ""
      

      With a negative or zero limit we remove all occurrences of "--", with the difference between the two being the trailing empty string in the negative case. With a limit of one we don't actually apply the pattern and so the whole string is returned as the zeroth element. A limit of two applies the pattern once, breaking the string into two substrings. A limit of three gives us three substrings. A limit of four gives us four substrings, with the fourth being the empty string due to the original string ending with the pattern we were splitting on. Any limit greater than four will return the same results as a limit of four.

In all the above, if the regular expression syntax is incorrect a PatternSyntaxException is thrown.

These are all convenience methods that avoid the need to work with Pattern and Matcher objects directly, but they require that the regular expression be compiled each time. If you just want to know if a given string matches a given regular expression, the matches method returns a boolean to tell you.
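The following sketch exercises the convenience methods and, for comparison, a Pattern that is compiled once and reused. The class name RegexConvenience, the field COMMA, and the method fields are hypothetical.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

class RegexConvenience {
    // Compiled once and reused, rather than recompiled on every call.
    private static final Pattern COMMA = Pattern.compile("\\s*,\\s*");

    // Splits a line on commas, discarding whitespace around each comma.
    static String[] fields(String line) {
        return COMMA.split(line);
    }

    public static void main(String[] args) {
        String line = "red , green,blue";

        // Both calls print [red, green, blue]
        System.out.println(Arrays.toString(line.split("\\s*,\\s*")));
        System.out.println(Arrays.toString(fields(line)));

        System.out.println(line.replaceFirst("\\s*,\\s*", ";"));  // red;green,blue
        System.out.println(line.replaceAll("\\s*,\\s*", ";"));    // red;green;blue
        System.out.println("2024".matches("\\d+"));               // true

        // A limit of 2 applies the pattern once: [w, x--y--]
        System.out.println(Arrays.toString("w--x--y--".split("--", 2)));
    }
}
```

When the same expression is applied to many strings, holding on to the compiled Pattern as fields does avoids the repeated compilation.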

Case issues are locale sensitive—that is, they vary from place to place and from culture to culture. The platform allows users to specify a locale, which includes language and character case issues. Locales are represented by Locale objects, which you'll learn about in more detail in Chapter 24. The methods toLowerCase and toUpperCase use the current default locale, or you can pass a specific locale as an argument:

  • public String toLowerCase()

    • Returns a String with each character converted to its lowercase equivalent if it has one according to the default locale.

  • public String toUpperCase()

    • Returns a String with each character converted to its uppercase equivalent if it has one according to the default locale.

  • public String toLowerCase(Locale loc)

    • Returns a String with each character converted to its lowercase equivalent if it has one according to the specified locale.

  • public String toUpperCase(Locale loc)

    • Returns a String with each character converted to its uppercase equivalent if it has one according to the specified locale.
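Turkish is a commonly cited case where the locale changes the result: it has both a dotted and a dotless letter i, so an uppercase I lowercases to the dotless form (\u0131). A minimal sketch, with a hypothetical class name:

```java
import java.util.Locale;

class CaseDemo {
    public static void main(String[] args) {
        String s = "TITLE";
        Locale turkish = Locale.forLanguageTag("tr");

        // Locale.ROOT gives locale-neutral behavior: "title"
        System.out.println(s.toLowerCase(Locale.ROOT));

        // With Turkish rules the I becomes dotless: "t\u0131tle"
        String lowered = s.toLowerCase(turkish);
        System.out.println(lowered.equals("t\u0131tle"));   // true
    }
}
```

Code that lowercases identifiers or protocol keywords, rather than text meant for people, usually wants a fixed locale for this reason.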

The concat method returns a new string that is equivalent to the string returned when you use + on two strings. The following two statements are equivalent:

newStr = oldStr.concat(" not");
newStr = oldStr + " not";

Exercise 13.3: As shown, the delimitedString method assumes only one such string per input string. Write a version that will pull out all the delimited strings and return an array.

Exercise 13.4: Write a program to read an input string with lines of the form “type value”, where type is one of the wrapper class names (Boolean, Character, and so on) and value is a string that the type's constructor can decode. For each such entry, create an object of that type with that value and add it to an ArrayList (see “ArrayList” on page 582). Display the final result when all the lines have been read. Assume a line is ended simply by the newline character '\n'.

String Conversions

You often need to convert strings to and from something else, such as integers or booleans. The convention is that the type being converted to has the method that does the conversion. For example, converting from a String to an int requires a static method in class Integer. This table shows all the types that you can convert, and how to convert each to and from a String:

Type      To String                 From String
boolean   String.valueOf(boolean)   Boolean.parseBoolean(String)
byte      String.valueOf(int)       Byte.parseByte(String, int base)
char      String.valueOf(char)      str.charAt(0)
short     String.valueOf(int)       Short.parseShort(String, int base)
int       String.valueOf(int)       Integer.parseInt(String, int base)
long      String.valueOf(long)      Long.parseLong(String, int base)
float     String.valueOf(float)     Float.parseFloat(String)
double    String.valueOf(double)    Double.parseDouble(String)

To convert a primitive type to a String you invoke one of the static valueOf methods of String, which for numeric types produces a base 10 representation.

The Integer and Long wrapper classes—as described in Chapter 8—also provide methods toBinaryString, toOctalString, and toHexString for other representations.

To convert, or more accurately to parse, a string into a primitive type you invoke the static parseType method of the primitive type's corresponding wrapper class. Each parsing method has its own rules about the allowed format of the string; for example, Float.parseFloat will accept a floating-point literal of the form "3.14f", whereas Long.parseLong will not accept the string "25L". The integer parsing methods (parseByte, parseShort, parseInt, and parseLong) have two overloaded forms: one that takes a numeric base between 2 and 36 in addition to the string to parse; and one that takes only the string and assumes base 10. These parsing methods do not interpret characters that indicate the base of the number: "0x12FE" is not read as a hexadecimal value, and a leading zero, as in "033", does not select octal. However, the Integer and Long wrapper classes also provide a static decode method that will parse a string that does include this base information. For the numeric types, if the string does not represent a valid value of that type, a NumberFormatException is thrown.

To convert a String to a char you simply extract the first char from the String.

Your classes can support string encoding and decoding by having an appropriate toString method and a constructor that creates a new object given the string description. The method String.valueOf(Object obj) is defined to return either "null" (if obj is null) or the result of obj.toString. The String class provides enough overloads of valueOf that you can convert any value of any type to a String by invoking valueOf.
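The conversions in both directions can be tried out with a short sketch. The class name ConvertDemo is hypothetical, and the sample values are arbitrary.

```java
class ConvertDemo {
    public static void main(String[] args) {
        // Primitive to String
        System.out.println(String.valueOf(42));           // 42
        System.out.println(String.valueOf(true));         // true
        System.out.println(Integer.toHexString(255));     // ff

        // String to primitive
        System.out.println(Integer.parseInt("255"));      // 255
        System.out.println(Integer.parseInt("ff", 16));   // 255
        System.out.println(Integer.parseInt("033"));      // 33: no octal here
        System.out.println(Integer.decode("0xff"));       // 255
        System.out.println(Integer.decode("033"));        // 27: octal

        try {
            Long.parseLong("25L");
        } catch (NumberFormatException e) {
            System.out.println("rejected: 25L");
        }

        // valueOf(Object) turns a null reference into the string "null"
        Object nothing = null;
        System.out.println(String.valueOf(nothing));      // null
    }
}
```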

Strings and char Arrays

A String maps to an array of char and vice versa. You often want to build a string in a char array and then create a String object from the contents. Assuming that the writable StringBuilder class (described later) isn't adequate, several String methods and constructors help you convert a String to an array of char, or convert an array of char to a String.

There are two constructors for creating a String from a char array:

  • public String(char[] chars, int start, int count)

    • Constructs a new String whose contents are the same as the chars array, from index start up to a maximum of count characters.

  • public String(char[] chars)

    • Equivalent to String(chars, 0, chars.length).

Both of these constructors make copies of the array, so you can change the array contents after you have created a String from it without affecting the contents of the String.

For example, the following simple algorithm squeezes out all occurrences of a character from a string:

public static String squeezeOut(String from, char toss) {
    char[] chars = from.toCharArray();
    int len = chars.length;
    int put = 0;
    for (int i = 0; i < len; i++)
        if (chars[i] != toss)
            chars[put++] = chars[i];
    return new String(chars, 0, put);
}

The method squeezeOut first converts its input string from into a character array using the method toCharArray. It then sets up put, which will be the next position into which to put a character. After that it loops, copying into the array any character that isn't a toss character. When the method is finished looping over the array, it returns a new String object that contains the squeezed string.

You can use the two static String.copyValueOf methods instead of the constructors if you prefer. For instance, squeezeOut could have been ended with

return String.copyValueOf(chars, 0, put);

There is also a single-argument form of copyValueOf that copies the entire array. For completeness, two static valueOf methods are also equivalent to the two String constructors.

The toCharArray method is simple and sufficient for most needs. When you need more control over copying pieces of a string into a character array, you can use the getChars method:

  • public void getChars(int srcBegin, int srcEnd, char[] dst, int dstBegin)

    • Copies characters from this String into the specified array. The characters of the specified substring are copied into the character array, starting at dst[dstBegin]. The specified substring is the part of the string starting at srcBegin, up to but not including srcEnd.
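As an illustration of the index arguments, this sketch copies one word out of a string into the middle of an existing array. The class name GetCharsDemo and the sample strings are made up.

```java
class GetCharsDemo {
    public static void main(String[] args) {
        String src = "Hello, world";
        char[] dst = "------------".toCharArray();    // 12 dashes

        // Copy "world" (positions 7 up to, but not including, 12)
        // into dst, starting at dst[2].
        src.getChars(7, 12, dst, 2);

        System.out.println(new String(dst));          // --world-----
    }
}
```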

Strings and byte Arrays

Strings represent characters encoded as char values with the UTF-16 encoding format. To convert those char values into raw byte values requires that another encoding format be used. Similarly, to convert individual “characters” or arrays of raw 8-bit “characters” into char values requires that the encoding format of the raw bytes is known. For example, you would convert an array of ASCII or Latin-1 bytes to Unicode characters simply by setting the high bits to zero, but that would not work for other 8-bit character set encodings such as those for Hebrew. Different character sets are discussed shortly. In the following constructors and methods, you can name a character set encoding or use the user's or platform's default encoding:

  • public String(byte[] bytes, int start, int count)

    • Constructs a new String by converting the bytes, from index start up to a maximum of count bytes, into characters using the default encoding for the default locale.

  • public String(byte[] bytes)

    • Equivalent to String(bytes, 0, bytes.length).

  • public String(byte[] bytes, int start, int count, String enc) throws UnsupportedEncodingException

    • Constructs a new String by converting the bytes, from index start up to a maximum of count bytes, into characters using the encoding named by enc.

  • public String(byte[] bytes, String enc) throws UnsupportedEncodingException

    • Equivalent to String(bytes, 0, bytes.length, enc).

  • public byte[] getBytes()

    • Returns a byte array that encodes the contents of the string using the default encoding for the default locale.

  • public byte[] getBytes(String enc) throws UnsupportedEncodingException

    • Returns a byte array that encodes the contents of the string using the encoding named by enc.

The String constructors for building from byte arrays make copies of the data, so further modifications to the arrays will not affect the contents of the String.
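The following sketch shows why the encoding must be known in both directions: the same four-character string occupies a different number of bytes in two encodings, and decoding with the wrong one does not give back the original. The class name BytesDemo is hypothetical.

```java
import java.io.UnsupportedEncodingException;

class BytesDemo {
    public static void main(String[] args)
        throws UnsupportedEncodingException
    {
        String s = "caf\u00e9";                      // "café": four chars

        byte[] utf8 = s.getBytes("UTF-8");           // é needs two bytes
        byte[] latin1 = s.getBytes("ISO-8859-1");    // é needs one byte
        System.out.println(utf8.length);             // 5
        System.out.println(latin1.length);           // 4

        // Decoding with the encoding used to encode restores the string
        System.out.println(new String(utf8, "UTF-8").equals(s));       // true
        // Decoding with a different encoding does not
        System.out.println(new String(utf8, "ISO-8859-1").equals(s));  // false
    }
}
```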

Character Set Encoding

A character set encoding specifies how to convert between raw 8-bit “characters” and their 16-bit Unicode equivalents. Character sets are named using their standard and common names. The local platform defines which character set encodings are understood, but every implementation is required to support the following:

  • US-ASCII

    • 7-bit ASCII, also known as ISO646-US, and as the Basic Latin block of the Unicode character set

  • ISO-8859-1

    • ISO Latin Alphabet No. 1, also known as ISO-LATIN-1

  • UTF-8

    • 8-bit Unicode Transformation Format

  • UTF-16BE

    • 16-bit Unicode Transformation Format, big-endian byte order

  • UTF-16LE

    • 16-bit Unicode Transformation Format, little-endian byte order

  • UTF-16

    • 16-bit Unicode Transformation Format, byte order specified by a mandatory initial byte-order mark (either order accepted on input, big-endian used on output)

Consult the release documentation for your implementation to see if any other character set encodings are supported.

Character sets and their encoding mechanisms are represented by specific classes within the java.nio.charset package:

  • Charset

    • A named mapping (such as US-ASCII or UTF-8) between sequences of 16-bit Unicode code units and sequences of bytes. This contains general information on the sequence encoding, simple mechanisms for encoding and decoding, and methods to create CharsetEncoder and CharsetDecoder objects for richer abilities.

  • CharsetEncoder

    • An object that can transform a sequence of 16-bit Unicode code units into a sequence of bytes in a specific character set. The encoder object also has methods to describe the encoding.

  • CharsetDecoder

    • An object that can transform a sequence of bytes in a specific character set into a sequence of 16-bit Unicode code units. The decoder object also has methods to describe the decoding.

You can obtain a Charset via its own static forName method, though usually you will just specify the character set name to some other method (such as the String constructor or an I/O operation) rather than working with the Charset object directly. To test whether a given character set is supported use the forName method, and if you get an UnsupportedCharsetException then it is not.
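The forName test can be sketched as follows. The class CharsetCheck and its isAvailable method are hypothetical helpers written for this illustration. The sketch also catches IllegalCharsetNameException, which forName throws when the name itself is malformed.

```java
import java.nio.charset.Charset;
import java.nio.charset.IllegalCharsetNameException;
import java.nio.charset.UnsupportedCharsetException;

class CharsetCheck {
    // Returns true if forName can produce a Charset for the given name.
    static boolean isAvailable(String name) {
        try {
            Charset.forName(name);
            return true;
        } catch (UnsupportedCharsetException e) {
            return false;               // legal name, but not supported here
        } catch (IllegalCharsetNameException e) {
            return false;               // not even a legal charset name
        }
    }

    public static void main(String[] args) {
        System.out.println(isAvailable("UTF-8"));                // true
        System.out.println(isAvailable("no-such-charset-xyz"));
    }
}
```

The static method Charset.isSupported performs a similar test without requiring you to catch UnsupportedCharsetException.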

You can find a list of available character sets from the static availableCharsets method, which returns a SortedMap of names and Charset instances, of all known character sets. For example, to print out the names of all the known character sets you can use:

for (String name : Charset.availableCharsets().keySet())
    System.out.println(name);

Every instance of the Java virtual machine has a default character set that is determined during virtual-machine startup and typically depends on the locale and encoding being used by the underlying operating system. You can obtain the default Charset using the static defaultCharset method.

Regular Expression Matching

The package java.util.regex gives you a way to test whether a string matches a general description of a category of strings called a regular expression. A regular expression describes a class of strings by using wildcards that match or exclude groups of characters, markers to require matches in particular places, and so on. The package uses a common kind of regular expression, quite similar to those used in the popular Perl programming language, which itself evolved from those used in several Unix utilities.

You can use regular expressions to ask if strings match a pattern and pick out parts of strings using a rich expression language. First you will learn what regular expressions are. Then you will learn how to compile and use them.

Regular Expressions

A full description of regular expressions is complex and many other works describe them. So we will not attempt a complete tutorial, but instead will simply give some examples of the most commonly used features. (A full reference alone would take several pages.) A list of resources for understanding regular expressions is in “Further Reading” on page 758.

Regular expressions search in character sequences, as defined by java.lang.CharSequence, implemented by String and StringBuilder. You can implement it yourself if you want to provide new sources.

A regular expression defines a pattern that can be applied to a character sequence to search for matches. The simplest form is something that is matched exactly; the pattern xyz matches the string xyzzy but not the string plugh. Wildcards make the pattern more general. For example, . (dot) matches any single character, so the pattern .op matches both hop and pop, and * matches zero or more of the thing before it, so xyz* matches xy, xyz, and xyzzy.

Other useful wildcards include simple sets (p[aeiou]p matches pop and pup but not pgp, while [a-z] matches any single lowercase letter); negations ([^aeiou] matches anything that is not a single lowercase vowel); predefined sets (\d matches any digit; \s any whitespace character); and boundaries (^twisty matches the word “twisty” only at the beginning of a line; \balike matches “alike” only after a word boundary, that is, at the beginning of a word).
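A few of these patterns can be tried out with a small sketch that uses the Pattern and Matcher classes described later in this chapter. The class WildcardDemo and its found helper are names invented for this example; found reports whether the pattern matches anywhere inside the input.

```java
import java.util.regex.Pattern;

class WildcardDemo {
    // True if the pattern matches some part of the input.
    static boolean found(String regex, String input) {
        return Pattern.compile(regex).matcher(input).find();
    }

    public static void main(String[] args) {
        System.out.println(found("xyz", "xyzzy"));              // true
        System.out.println(found("xyz", "plugh"));              // false
        System.out.println(found("p[aeiou]p", "pup"));          // true
        System.out.println(found("p[aeiou]p", "pgp"));          // false
        System.out.println(found("\\d", "room 7"));             // true
        System.out.println(found("^twisty", "a twisty maze"));  // false
    }
}
```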

Special symbols for particular characters include \t for tab; \n for newline; \a for the alert (bell) character; \e for escape; and \\ for backslash itself. Any character that would otherwise have a special meaning can be preceded by a \ to remove that meaning; in other words, for any such character c, \c always represents the character c. This is how, for example, you would match a * in an expression: by using \*.

Special symbols start with the \ character, which is also the character used to introduce an escape sequence in a string literal. This means, for example, that in the string expression "\balike", the actual pattern will consist of a backspace character followed by the word "alike", while "\s" would not be a pattern for whitespace but would cause a compile-time error because \s is not a valid escape sequence. To use the special symbols within a string expression the leading \ must itself be escaped using \\, so the example strings become "\\balike" and "\\s", respectively. To include an actual backslash in a pattern it has to be escaped twice, using four backslash characters: "\\\\". Each backslash pair becomes a single backslash within the string, resulting in a single backslash pair being included in the pattern, which is then interpreted as a single backslash character.
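The two levels of escaping can be seen in a short sketch. The class EscapeDemo and its constants are illustrative names, not part of the regular expression package.

```java
import java.util.regex.Pattern;

class EscapeDemo {
    // The literal "\\balike" is the seven-character pattern \balike:
    // a word boundary followed by the letters "alike".
    static final Pattern WORD = Pattern.compile("\\balike");

    // The literal "\\\\" is the two-character pattern \\ which
    // matches one backslash in the input.
    static final Pattern BACKSLASH = Pattern.compile("\\\\");

    public static void main(String[] args) {
        System.out.println("\\balike".length());                   // 7
        System.out.println(WORD.matcher("all alike").find());      // true
        System.out.println(WORD.matcher("unalike").find());        // false
        System.out.println(BACKSLASH.matcher("C:\\temp").find());  // true
    }
}
```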

Regular expressions can also capture parts of the string for later use, either inside the regular expression itself or as a means of picking out parts of the string. You capture parts of the expression inside parentheses. For example, the regular expression (.)-(.*)-\2-\1 matches x-yup-yup-x or ñ-å-å-ñ or any other similar string because \1 matches the group (.) and \2 matches the group (.*).[1] Groups are numbered from one, in order of the appearance of their opening parenthesis.
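As a small sketch, the back-reference pattern can be tested with the matches method of String, which compiles the expression and requires the whole string to match. The class BackrefDemo and the method mirrored are hypothetical names used only for this illustration.

```java
class BackrefDemo {
    // Group 1 is a single character and group 2 is any run of characters;
    // \2 and \1 require those same texts to appear again in reverse order.
    static final String REGEX = "(.)-(.*)-\\2-\\1";

    static boolean mirrored(String s) {
        return s.matches(REGEX);
    }

    public static void main(String[] args) {
        System.out.println(mirrored("x-yup-yup-x"));   // true
        System.out.println(mirrored("x-yup-yep-x"));   // false
    }
}
```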

Compiling and Matching with Regular Expressions

Evaluating a regular expression can be compute intensive, and in many instances a single regular expression will be used repeatedly. This can be addressed by compiling the regular expression once and using the result. In addition, a single character sequence might be checked repeatedly against the same pattern to find multiple matches, which can be done fastest by remembering some information about previous matches. To address both these opportunities for optimization, the full model of using a regular expression is this:

  1. First you turn your regular expression string into a Pattern object that is the compiled version of the pattern.

  2. Next you ask the Pattern object for a Matcher object that applies that pattern to a particular CharSequence (such as a String or StringBuilder).

  3. Finally you ask the Matcher to perform operations on the sequence using the compiled pattern.

Or, expressed in code:

Pattern pat = Pattern.compile(regularExpression);
Matcher matcher = pat.matcher(sequence);
boolean foundMatch = matcher.find();

If you are only using a pattern once, or are only matching each string against that pattern once, you need not actually deal with the intermediate objects. As you will see, there are convenience methods on Pattern for matching without a Matcher, and methods that create their own Pattern and Matcher. These are easy to use, but inefficient if you are using the same pattern multiple times, or matching against the same string with the same pattern repeatedly.

The Pattern class has the following methods:

  • public static Pattern compile(String regex) throws PatternSyntaxException

    • Compiles the given regular expression into a pattern.

  • public static Pattern compile(String regex, int flags) throws PatternSyntaxException

    • Compiles the given regular expression into a pattern with the given flags. The flags control how certain interesting cases are handled, as you will soon learn.

  • public String pattern()

    • Returns the regular expression from which this pattern was compiled.

  • public int flags()

    • Returns this pattern's match flags.

  • public Matcher matcher(CharSequence input)

    • Creates a matcher that will match the given input against this pattern.

  • public String[] split(CharSequence input, int limit)

    • A convenience method that splits the given input sequence around matches of this pattern. Useful when you do not need to reuse the matcher.

  • public String[] split(CharSequence input)

    • A convenience method that splits the given input sequence around matches of this pattern. Equivalent to split(input,0) .

  • public static boolean matches(String regex, CharSequence input)

    • A convenience method that compiles the given regular expression and attempts to match the given input against it. Useful when you do not need to reuse either the compiled pattern or the matcher. Returns true if the entire input matches the expression.

  • public static String quote(String str)

    • Returns a string that can be used to create a pattern that would match with str.

The toString method of a Pattern also returns the regular expression from which the pattern was compiled.
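The split and quote methods can be sketched as follows. The class SplitQuoteDemo, its COMMA pattern, and its two helper methods are assumptions made for this example.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

class SplitQuoteDemo {
    // A comma with optional whitespace on either side.
    static final Pattern COMMA = Pattern.compile("\\s*,\\s*");

    // Splits a line into fields around each comma.
    static String[] fields(String line) {
        return COMMA.split(line);
    }

    // Searches for literal exactly as written, even if it contains
    // characters that are special in regular expressions.
    static boolean containsLiteral(String input, String literal) {
        return Pattern.compile(Pattern.quote(literal)).matcher(input).find();
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(fields("a , b,c")));   // [a, b, c]
        System.out.println(containsLiteral("1+1=2", "1+1"));      // true
        System.out.println(Pattern.compile("1+1").matcher("1+1=2").find()); // false
    }
}
```

Without quote, the + in 1+1 means "one or more of the preceding 1", so the unquoted pattern looks for at least two consecutive 1 characters and does not match the text 1+1.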

The flags you can specify when creating the pattern object affect how the matching will be done. Some of these affect the performance of the matching, occasionally severely, but they may be functionality you need.

Flag

Meaning

CASE_INSENSITIVE

Case-insensitive matching. By default, case is ignored only for the ASCII characters.

UNICODE_CASE

Unicode-aware case folding when combined with CASE_INSENSITIVE

CANON_EQ

Canonical equivalence. If a character has multiple expressions, treat them as equivalent. For example, å is canonically equivalent to a\u030A.

DOTALL

Dot-all mode, where . matches line breaks, which it otherwise does not.

MULTILINE

Multiline mode, where ^ and $ match at lines embedded in the sequence, not just at the start and end of the entire sequence.

UNIX_LINES

Unix lines mode, where only \n is considered a line terminator.

COMMENTS

Comments and whitespace in pattern. Whitespace will be ignored, and comments starting with # are ignored up to the next end of line.

LITERAL

Enable literal parsing of the pattern
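Flags are int constants defined in Pattern and can be combined with the bitwise OR operator. The following sketch, in which the class FlagDemo and its patterns are illustrative names, compares a pattern compiled with CASE_INSENSITIVE and MULTILINE against the same expression compiled without flags.

```java
import java.util.regex.Pattern;

class FlagDemo {
    // Matches "error" in any case at the start of any line.
    static final Pattern ERROR_LINE =
        Pattern.compile("^error", Pattern.CASE_INSENSITIVE | Pattern.MULTILINE);

    // Matches lowercase "error" only at the start of the whole input.
    static final Pattern STRICT = Pattern.compile("^error");

    public static void main(String[] args) {
        String log = "ok\nERROR: disk full";
        System.out.println(ERROR_LINE.matcher(log).find());   // true
        System.out.println(STRICT.matcher(log).find());       // false
    }
}
```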

The Matcher class has methods to match against the sequence. Each of these returns a boolean indicating success or failure. If successful, the position and other state associated with the match can then be retrieved from the Matcher object via the start, end, and group methods. The matching queries are

  • public boolean matches()

    • Attempts to match the entire input sequence against the pattern.

  • public boolean lookingAt()

    • Attempts to match the input sequence, starting at the beginning, against the pattern. Like the matches method, this method always starts at the beginning of the input sequence; unlike that method, it does not require that the entire input sequence be matched.

  • public boolean find()

    • Attempts to find the next subsequence of the input sequence that matches the pattern. This method starts at the beginning of the input sequence or, if a previous invocation of find was successful and the matcher has not since been reset, at the first character not matched by the previous match.

  • public boolean find(int start)

    • Resets this matcher and then attempts to find the next subsequence of the input sequence that matches the pattern, starting at the specified index. If a match is found, subsequent invocations of the find method will start at the first character not matched by this match.
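The difference between the three kinds of query can be seen by applying one pattern to several inputs. In this sketch the class MatchKinds and its helper methods are hypothetical names.

```java
import java.util.regex.Pattern;

class MatchKinds {
    static final Pattern DIGITS = Pattern.compile("[0-9]+");

    // The entire input must consist of digits.
    static boolean whole(String s)    { return DIGITS.matcher(s).matches(); }

    // The input must begin with digits.
    static boolean prefix(String s)   { return DIGITS.matcher(s).lookingAt(); }

    // Digits may appear anywhere in the input.
    static boolean anywhere(String s) { return DIGITS.matcher(s).find(); }

    public static void main(String[] args) {
        for (String s : new String[] { "42", "42nd", "room 42" })
            System.out.println(s + ": " + whole(s) + " "
                               + prefix(s) + " " + anywhere(s));
        // 42: true true true
        // 42nd: false true true
        // room 42: false false true
    }
}
```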

Once matching has commenced, the following methods allow the state of the matcher to be modified:

  • public Matcher reset()

    • Resets this matcher. This discards all state and resets the append position (see below) to zero. The returned Matcher is the one on which the method was invoked.

  • public Matcher reset(CharSequence input)

    • Resets this matcher to use a new input sequence. The returned Matcher is the one on which the method was invoked.

  • public Matcher usePattern(Pattern pattern)

    • Changes the pattern used by this matcher to be pattern. Any group information is discarded, but the input and append positions remain the same.

Once a match has been found, the following methods return more information about the match:

  • public int start()

    • Returns the start index of the previous match.

  • public int end()

    • Returns the index of the last character matched, plus one.

  • public String group()

    • Returns the input subsequence matched by the previous match; in other words, the substring defined by start and end.

  • public int groupCount()

    • Returns the number of capturing groups in this matcher's pattern. Group zero, which denotes the entire match, is not included in this count, so valid group numbers range from zero up to and including this count.

  • public String group(int group)

    • Returns the input subsequence matched by the given group in the previous match. Group zero is the entire matched pattern, so group(0) is equivalent to group().

  • public int start(int group)

    • Returns the start index of the given group from the previous match.

  • public int end(int group)

    • Returns the index of the last character matched of the given group, plus one.

Together these methods form the MatchResult interface, which allows a match result to be queried but not modified. You can convert the current matcher state to a MatchResult instance by invoking its toMatchResult method. Any subsequent changes to the matcher state do not affect the existing MatchResult objects.
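A typical use of these query methods is a loop that calls find and then inspects each match. The sketch below assumes input made of name=number pairs; the class GroupDemo, the PAIR pattern, and the describe method are invented for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GroupDemo {
    // Group 1 is the name, group 2 is the number.
    static final Pattern PAIR = Pattern.compile("(\\w+)=(\\d+)");

    // Describes each match: its two groups and its [start,end) range.
    static List<String> describe(String input) {
        List<String> out = new ArrayList<String>();
        Matcher m = PAIR.matcher(input);
        while (m.find()) {
            out.add(m.group(1) + " -> " + m.group(2)
                    + " at [" + m.start() + "," + m.end() + ")");
        }
        return out;
    }

    public static void main(String[] args) {
        for (String line : describe("x=10, yy=200"))
            System.out.println(line);
        // x -> 10 at [0,4)
        // yy -> 200 at [6,12)
    }
}
```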

Replacing

You will often want to pair finding matches with replacing the matched characters with new ones. For example, if you want to replace all instances of sun with moon, your code might look like this:[2]

Pattern pat = Pattern.compile("sun");
Matcher matcher = pat.matcher(input);
StringBuffer result = new StringBuffer();
while (matcher.find())
    matcher.appendReplacement(result, "moon");
matcher.appendTail(result);

The loop continues as long as there are matches to sun. On each iteration through the loop, all the characters from the append position (the position after the last match; initially zero) to the start of the current match are copied into the string buffer. Then the replacement string moon is copied. When there are no more matches, appendTail copies any remaining characters into the buffer.

The replacement methods of Matcher are

  • public String replaceFirst(String replacement)

    • Replaces the first occurrence of this matcher's pattern with the replacement string, returning the result. The matcher is first reset and is not reset after the operation.

  • public String replaceAll(String replacement)

    • Replaces all occurrences of this matcher's pattern with the replacement string, returning the result. The matcher is first reset and is not reset after the operation.

  • public Matcher appendReplacement(StringBuffer buf, String replacement)

    • Adds to the string buffer the characters between the current append and match positions, followed by the replacement string, and then moves the append position to be after the match. As shown above, this can be used as part of a replacement loop. Returns this matcher.

  • public StringBuffer appendTail(StringBuffer buf)

    • Adds to the string buffer all characters from the current append position until the end of the sequence. Returns the buffer.

So the previous example can be written more simply with replaceAll:

Pattern pat = Pattern.compile("sun");
Matcher matcher = pat.matcher(input);
String result = matcher.replaceAll("moon");

As an example of a more complex usage of regular expressions, here is code that will replace every number with the next largest number:

Pattern pat = Pattern.compile("[-+]?[0-9]+");
Matcher matcher = pat.matcher(input);
StringBuffer result = new StringBuffer();
while (matcher.find()) {
    String numStr = matcher.group();
    int num = Integer.parseInt(numStr);
    String plusOne = Integer.toString(num + 1);
    matcher.appendReplacement(result, plusOne);
}
matcher.appendTail(result);

Here we decode the number found by the match, add one to it, and replace the old value with the new one.

The replacement string can contain a $g, which will be replaced with the value from the gth capturing group in the expression. The following method uses this feature to swap all instances of two adjacent words:

public static String
    swapWords(String w1, String w2, String input)
{
    String regex = "\\b(" + w1 + ")(\\W+)(" + w2 + ")\\b";
    Pattern pat = Pattern.compile(regex);
    Matcher matcher = pat.matcher(input);
    return matcher.replaceAll("$3$2$1");
}

First we build a pattern from the two words, using parentheses to capture groups of characters. A \b in a pattern matches a word boundary (otherwise the word “crow” would match part of “crown”), and \W matches any character that would not be part of a word. The original pattern matches groups one (the first word), two (the separator characters), and three (the second word), which the "$3$2$1" replacement string inverts.

For example, the invocation

swapWords("up", "down", 
          "The yo-yo goes up, down, up, down, ...");

would return the string

The yo-yo goes down, up, down, up, ...

If we only wanted to swap the first time the words were encountered we could use replaceFirst:

public static String
    swapFirstWords(String w1, String w2, String input) {

    String regex = "\\b(" + w1 + ")(\\W+)(" + w2 + ")\\b";
    Pattern pat = Pattern.compile(regex);
    Matcher matcher = pat.matcher(input);
    return matcher.replaceFirst("$3$2$1");
}

Regions

A Matcher looks for matches in the character sequence that it is given as input. By default, the entire character sequence is considered when looking for a match. You can control the region of the character sequence to be used, through the method region which takes a starting index and an ending index to define the subsequence in the input character sequence. The methods regionStart and regionEnd return, respectively, the current start index and the current end index.

You can control whether a region is considered to be the true start and end of the input, so that matching with the beginning or end of a line will work, by invoking useAnchoringBounds with an argument of true (the default). If you don't want the region to match with the line anchors then use false. The method hasAnchoringBounds will return the current setting.

Similarly, you can control whether the bounds of the region are transparent to matching methods that want to look-ahead, look-behind, or detect a boundary. By default bounds are opaque—that is, they will appear to be hard bounds on the input sequence—but you can change that with useTransparentBounds. The hasTransparentBounds method returns the current setting.
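The effect of the two settings can be sketched with a region that covers the last three characters of the word concat. The class RegionDemo and its methods are illustrative names. The first method tests the ^ anchor; the second tests the \b word boundary, which must look at the character just before the region.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RegionDemo {
    // Looks for "^cat" in input[start, end) with the given anchoring setting.
    static boolean anchoredFind(String input, int start, int end,
                                boolean anchoring) {
        Matcher m = Pattern.compile("^cat").matcher(input);
        m.region(start, end);
        m.useAnchoringBounds(anchoring);
        return m.find();
    }

    // Looks for "\bcat" in input[start, end) with the given transparency.
    static boolean boundaryFind(String input, int start, int end,
                                boolean transparent) {
        Matcher m = Pattern.compile("\\bcat").matcher(input);
        m.region(start, end);
        m.useTransparentBounds(transparent);
        return m.find();
    }

    public static void main(String[] args) {
        System.out.println(anchoredFind("concat", 3, 6, true));    // true
        System.out.println(anchoredFind("concat", 3, 6, false));   // false
        System.out.println(boundaryFind("concat", 3, 6, false));   // true
        System.out.println(boundaryFind("concat", 3, 6, true));    // false
    }
}
```

With opaque bounds the matcher cannot see the n that precedes the region, so the start of the region looks like a word boundary. With transparent bounds it can see the n, and the boundary test fails.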

Efficiency

Suppose you want to parse a string into two parts that are separated by a comma. The pattern (.*),(.*) is clear and straightforward, but it is not necessarily the most efficient way to do this. The first .* will attempt to consume the entire input. The matcher will have to then back up to the last comma and then expand the rest into the second .*. You could help this along by being clear that a comma is not part of the group: ([^,]*),([^,]*). Now it is clear that the matcher should only go so far as the first comma and stop, which needs no backing up. On the other hand, the second expression is somewhat less clear to the casual user of regular expressions.

You should avoid trading clarity for efficiency unless you are writing a performance critical part of the code. Regular expressions are by nature already cryptic. Sophisticated techniques make them even more difficult to understand, and so should be used only when needed. And when you do need to be more efficient be sure that you are doing things that are more efficient—as with all optimizations, you should test carefully what is actually faster. In the example we give, a sufficiently smart pattern compiler and matcher might make both patterns comparably quick. Then you would have traded clarity for nothing. And even if today one is more efficient than the other, a better implementation tomorrow may make that vanish. With regular expressions, as with any other part of programming, choosing optimization over clarity is a choice to be made sparingly.
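Before comparing speed, it is worth checking that two candidate patterns really give the same answer. The following sketch applies both patterns from the example to the same input. The class CommaSplit and its parts method are hypothetical, and the sketch makes no claim about which pattern is faster.

```java
import java.util.Arrays;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CommaSplit {
    static final Pattern GREEDY   = Pattern.compile("(.*),(.*)");
    static final Pattern EXPLICIT = Pattern.compile("([^,]*),([^,]*)");

    // Returns the two captured parts, or null if the whole input
    // does not match the pattern.
    static String[] parts(Pattern pat, String input) {
        Matcher m = pat.matcher(input);
        if (!m.matches())
            return null;
        return new String[] { m.group(1), m.group(2) };
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(parts(GREEDY, "left,right")));
        System.out.println(Arrays.toString(parts(EXPLICIT, "left,right")));
        // both print [left, right]
        System.out.println(Arrays.toString(parts(GREEDY, "a,b,c")));   // [a,b, c]
        System.out.println(parts(EXPLICIT, "a,b,c") == null);          // true
    }
}
```

The two patterns agree when the input contains exactly one comma. With more than one comma they differ: the first splits at the last comma, while the second cannot match the whole input at all.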

The StringBuilder Class

If immutable strings were the only kind available, you would have to create a new String object for each intermediate result in a sequence of String manipulations. Consider, for example, how the compiler would evaluate the following expression:

public static String guillemete(String quote) {
    return '«' + quote + '»';
}

If the compiler were restricted to String expressions, it would have to do the following:

quoted = String.valueOf('«').concat(quote)
            .concat(String.valueOf('»'));

Each valueOf and concat invocation creates another String object, so this operation would construct four String objects, of which only one would be used afterward. The other strings would have incurred overhead to create, to set to proper values, and to garbage collect.

The compiler is more efficient than this. It uses a StringBuilder object to build strings from expressions, creating the final String only when necessary. StringBuilder objects can be modified, so new objects are not needed to hold intermediate results. With StringBuilder, the previous string expression would be represented as

quoted = new StringBuilder().append('«')
            .append(quote).append('»').toString();

This code creates just one StringBuilder object to hold the construction, appends stuff to it, and then uses toString to create a String from the result.

To build and modify a string, you probably want to use the StringBuilder class. StringBuilder provides the following constructors:

  • public StringBuilder()

    • Constructs a StringBuilder with an initial value of "" (an empty string) and a capacity of 16.

  • public StringBuilder(int capacity)

    • Constructs a StringBuilder with an initial value of "" and the given capacity.

  • public StringBuilder(String str)

    • Constructs a StringBuilder with an initial value copied from str.

  • public StringBuilder(CharSequence seq)

    • Constructs a StringBuilder with an initial value copied from seq.

StringBuilder is similar to String, and it supports many methods that have the same names and contracts as some String methods: indexOf, lastIndexOf, replace, and substring. However, StringBuilder does not extend String nor vice versa. They are independent implementations of CharSequence.

Modifying the Buffer

There are several ways to modify the buffer of a StringBuilder object, including appending to the end and inserting in the middle. The simplest method is setCharAt, which changes the character at a specific position. The following replace method does what String.replace does, except that it uses a StringBuilder object. The replace method doesn't need to create a new object to hold the results, so successive replace calls can operate on one buffer:

public static void
    replace(StringBuilder str, char oldChar, char newChar) {

    for (int i = 0; i < str.length(); i++)
        if (str.charAt(i) == oldChar)
            str.setCharAt(i, newChar);
}

The setLength method truncates or extends the string in the buffer. If you invoke setLength with a length smaller than the length of the current string, the string is truncated to the specified length. If the length is longer than the current string, the string is extended with null characters ('\u0000').
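Both behaviors of setLength can be seen in a small sketch. The class SetLengthDemo and its resize method are names chosen for this illustration.

```java
class SetLengthDemo {
    // Returns s truncated or extended to exactly the given length.
    static String resize(String s, int length) {
        StringBuilder buf = new StringBuilder(s);
        buf.setLength(length);   // truncates, or pads with '\u0000'
        return buf.toString();
    }

    public static void main(String[] args) {
        System.out.println(resize("abcdef", 3));                      // abc
        String padded = resize("ab", 5);
        System.out.println(padded.length());                          // 5
        System.out.println(padded.charAt(4) == '\u0000');             // true
    }
}
```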

There are also append and insert methods to convert any data type to a String and then append the result to the end or insert the result at a specified position. The insert methods shift characters over to make room for inserted characters as needed. The following types are converted by these append and insert methods:

Object

String

CharSequence

char[]

boolean

char

int

long

float

double

  

There are also append and insert methods that take part of a CharSequence or char array as an argument. Here is some code that uses various append invocations to create a StringBuilder that describes the square root of an integer:

String sqrtInt(int i) {
    StringBuilder buf = new StringBuilder();

    buf.append("sqrt(").append(i).append(')');
    buf.append(" = ").append(Math.sqrt(i));
    return buf.toString();
}

The append and insert methods return the StringBuilder object itself, enabling you to append to the result of a previous append.

A few append methods together form the java.lang.Appendable interface. These methods are

public Appendable append(char c)
public Appendable append(CharSequence seq)
public Appendable append(CharSequence seq, int start, int end)

The Appendable interface is used to mark classes that can receive formatted output from a java.util.Formatter object—see “Formatter” on page 624.
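Writing a method in terms of Appendable lets it work with a StringBuilder, a StringBuffer, or any other implementation. In the interface itself the append methods are declared to throw IOException, so code written against Appendable must handle it even though StringBuilder never throws it. The class AppendableDemo and its methods below are a hypothetical sketch.

```java
import java.io.IOException;

class AppendableDemo {
    // Writes the items, separated by ", ", to any Appendable.
    static void join(Appendable out, String... items) throws IOException {
        for (int i = 0; i < items.length; i++) {
            if (i > 0)
                out.append(", ");
            out.append(items[i]);
        }
    }

    // Convenience wrapper that collects the output in a StringBuilder.
    static String joinToString(String... items) {
        StringBuilder buf = new StringBuilder();
        try {
            join(buf, items);
        } catch (IOException e) {
            // Not expected: StringBuilder does not perform I/O.
            throw new IllegalStateException(e);
        }
        return buf.toString();
    }

    public static void main(String[] args) {
        System.out.println(joinToString("a", "b", "c"));   // a, b, c
    }
}
```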

The insert methods take two parameters. The first is the index at which to insert characters into the StringBuilder. The second is the value to insert, after conversion to a String if necessary. Here is a method to put the current date at the beginning of a buffer:

public static StringBuilder addDate(StringBuilder buf) {
    String now = new java.util.Date().toString();
    buf.insert(0, now).insert(now.length(), ": ");
    return buf;
}

The addDate method first creates a string with the current time using java.util.Date, whose default constructor creates an object that represents the time it was created. Then addDate inserts the string that represents the current date, followed by a simple separator string. Finally, it returns the buffer it was passed so that invoking code can use the same kind of method concatenation that proved useful in StringBuilder's own methods.

The reverse method reverses the order of characters in the StringBuilder. For example, if the contents of the buffer are "good", the contents after reverse are "doog".

You can remove part of the buffer with delete, which takes a starting and ending index. The segment of the string up to but not including the ending index is removed from the buffer, and the buffer is shortened. You can remove a single character by using deleteCharAt.

You can also replace characters in the buffer:

  • public StringBuilder replace(int start, int end, String str)

    • Replace the characters starting at start up to but not including end with the contents of str. The buffer is grown or shrunk as the length of str is greater than or less than the range of characters replaced.
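The following sketch applies delete, replace, deleteCharAt, and reverse to small buffers. The class EditDemo and its method names are illustrative only.

```java
class EditDemo {
    static String edit() {
        StringBuilder buf = new StringBuilder("Hello, cruel world");
        buf.delete(7, 13);               // "Hello, world"
        buf.replace(0, 5, "Goodbye");    // "Goodbye, world"
        buf.deleteCharAt(7);             // "Goodbye world"
        return buf.toString();
    }

    static String reversed(String s) {
        return new StringBuilder(s).reverse().toString();
    }

    public static void main(String[] args) {
        System.out.println(edit());             // Goodbye world
        System.out.println(reversed("good"));   // doog
    }
}
```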

Getting Data Out

To get a String object from a StringBuilder object, you simply invoke the toString method. If you need a substring of the buffer, the substring methods work analogously to those of String. If you want some or all of the contents as a character array, you can use getChars, which is analogous to String.getChars.

  • public void getChars(int srcBegin, int srcEnd, char[] dst, int dstBegin)

    • Copies characters from this StringBuilder into the specified array. The characters of the specified substring are copied into the character array, starting at dst[dstBegin]. The specified substring is the part of the string buffer from srcBegin up to but not including srcEnd.

Here is a method that uses getChars to remove part of a buffer:

public static StringBuilder
    remove(StringBuilder buf, int pos, int cnt) {

    if (pos < 0 || cnt < 0 || pos + cnt > buf.length())
        throw new IndexOutOfBoundsException();

    int leftover = buf.length() - (pos + cnt);
    if (leftover == 0) {    // a simple truncation
        buf.setLength(pos);
        return buf;
    }

    char[] chrs = new char[leftover];
    buf.getChars(pos + cnt, buf.length(), chrs, 0);
    buf.setLength(pos);
    buf.append(chrs);
    return buf;
}

First remove ensures that the array references will stay in bounds. You could handle the actual exception later, but checking now gives you more control. Then remove calculates how many characters follow the removed portion. If there are none, it truncates and returns. Otherwise, remove retrieves them using getChars and then truncates the buffer and appends the leftover characters before returning.

Capacity Management

The buffer of a StringBuilder object has a capacity, which is the length of the string it can store before it must allocate more space. The buffer grows automatically as characters are added, but it is more efficient to specify the size of the buffer only once.

You set the initial size of a StringBuilder object by using the constructor that takes a single int:

  • public StringBuilder(int capacity)

    • Constructs a StringBuilder with the given initial capacity and an initial value of "".

  • public void ensureCapacity(int minimum)

    • Ensures that the capacity of the buffer is at least the specified minimum.

  • public int capacity()

    • Returns the current capacity of the buffer.

  • public void trimToSize()

    • Attempts to reduce the capacity of the buffer to accommodate the current sequence of characters. There is no guarantee that this will actually reduce the capacity of the buffer, but this gives a hint to the system that it may be a good time to try and reclaim some storage space.

You can use these methods to avoid repeatedly growing the buffer. Here, for example, is a rewrite of the sqrtInt method from page 332 that ensures that you allocate new space for the buffer at most once:

String sqrtIntFaster(int i) {
    StringBuilder buf = new StringBuilder(50);
    buf.append("sqrt(").append(i).append(')');
    buf.append(" = ").append(Math.sqrt(i));
    return buf.toString();
}

The only change is to use a constructor that creates a StringBuilder object large enough to contain the result string. The value 50 is somewhat larger than required; therefore, the buffer will never have to grow.
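The capacity methods can be observed directly. In this sketch the class CapacityDemo and its capacities method are hypothetical. Only the initial value of 16 is a documented number; the other results depend on the implementation, so the comments state only what is guaranteed.

```java
class CapacityDemo {
    // Returns the capacity at three points in a buffer's life.
    static int[] capacities() {
        StringBuilder buf = new StringBuilder();
        int initial = buf.capacity();    // 16 for the no-arg constructor

        buf.ensureCapacity(100);
        int grown = buf.capacity();      // at least 100

        buf.append("abc");
        buf.trimToSize();
        int trimmed = buf.capacity();    // may shrink, never below length()

        return new int[] { initial, grown, trimmed };
    }

    public static void main(String[] args) {
        int[] c = capacities();
        System.out.println(c[0] + " " + c[1] + " " + c[2]);
    }
}
```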

The StringBuffer Class

The StringBuffer class is essentially identical to the StringBuilder class except for one thing: It provides a thread-safe implementation of an appendable character sequence—see Chapter 14 for more on thread safety. This difference would normally relegate discussion of StringBuffer to a discussion on thread-safe data structures, were it not for one mitigating factor: The StringBuffer class is older, and previously filled the role that StringBuilder does now as the standard class for mutable character sequences. For this reason, you will often find methods that take or return StringBuffer rather than StringBuilder, CharSequence, or Appendable. These historical uses of StringBuffer are likely to be enshrined in the existing APIs for many years to come.

Exercise 13.5: Write a method to convert strings containing decimal numbers into comma-punctuated numbers, with a comma every third digit from the right. For example, given the string "1543729", the method should return the string "1,543,729".

Exercise 13.6: Modify the method to accept parameters specifying the separator character to use and the number of digits between separator characters.

Working with UTF-16

In “Working with UTF-16” on page 196, we described a number of utility methods provided by the Character class to ease working with the supplementary Unicode characters (those greater in value than 0xFFFF that require encoding as a pair of char values in a CharSequence). Each of the String, StringBuilder, and StringBuffer classes provides these methods:

  • public int codePointAt(int index)

    • Returns the code point defined at the given index in this, taking into account that it may be a supplementary character represented by the pair this.charAt(index) and this.charAt(index+1).

  • public int codePointBefore(int index)

    • Returns the code point that precedes the given index in this, taking into account that it may be a supplementary character represented by the pair this.charAt(index-2) and this.charAt(index-1).

  • public int codePointCount(int start, int end)

    • Returns the number of code points defined in this.charAt(start) to this.charAt(end-1), taking into account surrogate pairs. Any unpaired surrogate values count as one code point each.

  • public int offsetByCodePoints(int index, int numberOfCodePoints)

    • Returns the index into this that is numberOfCodePoints away from index, taking into account surrogate pairs.

In addition, the StringBuilder and StringBuffer classes define the appendCodePoint method that takes an int representing an arbitrary Unicode character, encodes it as a surrogate pair if needed, and appends it to the end of the buffer. Curiously, there is no corresponding insertCodePoint method.

Finally, the String class also provides the following constructor:

  • public String(int[] codePoints, int start, int count)

    • Constructs a new String with the contents from codePoints[start] up to a maximum of count code points, with supplementary characters encoded as surrogate pairs as needed. If any value in the array is not a valid Unicode code point, then IllegalArgumentException is thrown.
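These methods can be exercised with a string that contains one supplementary character. The sketch below uses U+1D11E (MUSICAL SYMBOL G CLEF) as its example; the class CodePointDemo and its members are names invented for the illustration.

```java
class CodePointDemo {
    // A supplementary character: it needs two char values in UTF-16.
    static final int G_CLEF = 0x1D11E;

    // Builds the three-character text a, G clef, b.
    static String build() {
        StringBuilder buf = new StringBuilder("a");
        buf.appendCodePoint(G_CLEF);     // appended as a surrogate pair
        buf.append('b');
        return buf.toString();
    }

    public static void main(String[] args) {
        String s = build();
        System.out.println(s.length());                        // 4
        System.out.println(s.codePointCount(0, s.length()));   // 3
        System.out.println(s.codePointAt(1) == G_CLEF);        // true
        System.out.println(s.codePointBefore(3) == G_CLEF);    // true
        System.out.println(s.offsetByCodePoints(0, 2));        // 3

        // The same text built from an array of code points.
        String t = new String(new int[] { 'a', G_CLEF, 'b' }, 0, 3);
        System.out.println(s.equals(t));                       // true
    }
}
```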

 

When ideas fail, words come in very handy.

 
 --Johann Wolfgang von Goethe


[1] The .* means “zero or more characters,” because . means “any character” and * means “zero or more of the thing I follow,” so together they mean “zero or more of any character.”

[2] The StringBuffer class (see page 335) is an appendable character sequence (you can modify its contents). The Matcher class should have been updated in the 5.0 release to work with any appendable character sequence, such as StringBuilder, but this was overlooked.
