Chapter 3. Strings and Things

3.0 Introduction

Character strings are an inevitable part of just about any programming task. We use them for printing messages for the user; for referring to files on disk or other external media; and for people’s names, addresses, and affiliations. The uses of strings are many, almost without number (actually, if you need numbers, we’ll get to them in Chapter 5).

If you’re coming from a programming language like C, you’ll need to remember that String is a defined type (class) in Java—that is, a string is an object and therefore has methods. It is not an array of characters (though it contains one) and should not be thought of as an array. Operations like fileName.endsWith(".gif") and extension.equals(".gif") (and the equivalent ".gif".equals(extension)) are commonplace.1

Java old-timers should note that Java 11 and 12 added several new methods, including indent(int n), stripLeading() and stripTrailing(), Stream<String> lines(), isBlank(), and transform(). Most of these provide obvious functionality; the last one allows applying an instance of a “functional interface” (see Recipe 9.1) to a string and returning the result of that operation.
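As a rough illustration of a few of these methods working together, here is a minimal sketch. The class name NewStringMethodsDemo and its two helper methods are made up for this example, and it assumes a Java 12 or later runtime because it uses transform().

```java
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical demo class; not part of the book's source code.
public class NewStringMethodsDemo {

    /** Returns the non-blank lines of text with surrounding whitespace removed. */
    public static List<String> cleanLines(String text) {
        return text.lines()                               // Java 11: one element per line
            .filter(line -> !line.isBlank())              // Java 11: skip empty/whitespace-only lines
            .map(line -> line.stripLeading().stripTrailing())
            .collect(Collectors.toList());
    }

    /** Uses transform() (Java 12) to apply a function to the string itself. */
    public static int wordCount(String s) {
        return s.transform(str ->
            str.isBlank() ? 0 : str.strip().split("\\s+").length);
    }

    public static void main(String[] args) {
        System.out.println(cleanLines("  one  \n\n   \ntwo\n"));
        System.out.println(wordCount("  a quick  fox "));
    }
}
```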

Although we haven’t discussed the details of the java.io package yet (we will, in Chapter 10), you need to be able to read text files for some of these programs. Even if you’re not familiar with java.io, you can probably see from the examples of reading text files that a BufferedReader allows you to read “chunks” of data, and that this class has a very convenient readLine() method.

Going the other way, System.out.println() is normally used to print strings or other values to the terminal or “standard output.” String concatenation is commonly used here, as in:

System.out.println("The answer is " + result);

One caveat with string concatenation is that if you are appending a bunch of things, and a number and a character come first, they are added together numerically before any concatenation happens, because the + operator is evaluated from left to right. So don’t do as I did in this contrived example:

int result = ...;
System.out.println(result + '=' + " the answer.");

Given that result is an integer, then result + '=' (result added to the equals sign, which is of the numeric type char) is a valid numeric expression, which will result in a single value of type int. If the variable result has the value 42, and given that the character = in a Unicode (or ASCII) code chart has the value 61, this prints:

103 the answer.

The wrong value and no equals sign! Safer approaches include using parentheses, using double quotes around the equals sign, using a StringBuilder (see Recipe 3.2) or using String.format() (see Recipe 10.4). Of course in this simple example you could just move the = to be part of the string literal, but the example was chosen to illustrate the problem of arithmetic on char values being confused with string concatenation. I won’t show you how to sort an array of strings here; the more general notion of sorting a collection of objects will be taken up in Recipe 7.11.
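The safer approaches just listed might look like the following sketch. The class name ConcatDemo and its method names are invented for illustration; each method is meant to produce the text the contrived example was aiming for.

```java
// Hypothetical demo class; not part of the book's source code.
public class ConcatDemo {

    /** Parentheses make the char join the String before the int is involved. */
    public static String withParens(int result) {
        return result + ('=' + " the answer.");
    }

    /** A String literal "=" instead of the char '=' avoids int arithmetic. */
    public static String withStringLiteral(int result) {
        return result + "=" + " the answer.";
    }

    /** StringBuilder.append() is overloaded for int, char, and String. */
    public static String withBuilder(int result) {
        return new StringBuilder()
            .append(result)
            .append('=')
            .append(" the answer.")
            .toString();
    }

    /** String.format() keeps the number and the punctuation apart. */
    public static String withFormat(int result) {
        return String.format("%d= the answer.", result);
    }

    public static void main(String[] args) {
        int result = 42;
        System.out.println(result + '=' + " the answer."); // the pitfall: int + char
        System.out.println(withParens(result));
        System.out.println(withStringLiteral(result));
        System.out.println(withBuilder(result));
        System.out.println(withFormat(result));
    }
}
```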

Java 14 enables “text blocks” (a preview feature in Java 13 and 14 that became a standard feature in Java 15), also known as multiline text strings. These are delimited with a set of three double quotes; the opening set must be followed by a newline (which doesn’t become part of the string; the following newlines do):

String longText = """
This is a long
text String.""";

3.1 Taking Strings Apart with Substrings or Tokenizing

Problem

You want to break a string apart, either by indexing positions or by fixed token characters (e.g., break on spaces to get words).

Solution

For substrings, use the String object’s substring() method. For tokenizing, construct a StringTokenizer around your string and call its methods hasMoreTokens() and nextToken().

Or, use regular expressions (see Chapter 4).

Discussion

Substrings

The substring() method constructs a new String object made up of a run of characters contained somewhere in the original string, the one whose substring() you called. The substring method is overloaded: both forms require a starting index (which is always zero-based). The one-argument form returns from startIndex to the end. The two-argument form takes an ending index (not a length, as in some languages), so that both indexes can come directly from the String methods indexOf() or lastIndexOf().

Warning

Note that the end index is one beyond the last character! Java adopts this “half open interval” (or inclusive start, exclusive end) policy fairly consistently; there are good practical reasons for adopting this approach, and some other languages do likewise.

public class SubStringDemo {
    public static void main(String[] av) {
        String a = "Java is great.";
        System.out.println(a);
        String b = a.substring(5);    // b is the String "is great."
        System.out.println(b);
        String c = a.substring(5,7);// c is the String "is"
        System.out.println(c);
        String d = a.substring(5,a.length());// d is "is great."
        System.out.println(d);
    }
}

When run, this prints the following:

C:> java strings.SubStringDemo
Java is great.
is great.
is
is great.
C:>
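Because the end index is exclusive, the values returned by indexOf() and lastIndexOf() can usually be passed to substring() without any adjustment. The following is a small sketch of that idea; the class SubstringIndexDemo and its helper methods are hypothetical names, not part of the book's code.

```java
// Hypothetical demo class; not part of the book's source code.
public class SubstringIndexDemo {

    /** Returns the extension including the dot, or "" if there is no dot. */
    public static String extension(String fileName) {
        int dot = fileName.lastIndexOf('.');
        return dot < 0 ? "" : fileName.substring(dot);
    }

    /** Returns the text before the first sep, or the whole string if sep is absent. */
    public static String before(String s, char sep) {
        int end = s.indexOf(sep);   // index of sep itself, which is excluded
        return end < 0 ? s : s.substring(0, end);
    }

    public static void main(String[] args) {
        System.out.println(extension("photo.gif"));
        System.out.println(before("key=value", '='));
    }
}
```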

Tokenizing

The easiest way is to use a regular expression; we’ll discuss these in Chapter 4, but for now, a string containing a space is a valid regular expression to match space characters, so you can most easily split a string into words like this:

for (String word : some_input_string.split(" ")) {
    System.out.println(word);
}

If you need to match multiple spaces, or spaces and tabs, use the string "\\s+".

If you want to split a line from a comma-separated file, you can try the string "," or use one of several third-party libraries for CSV files.
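To make the difference between these patterns concrete, here is a brief sketch. SplitDemo and its methods are illustrative names only; the comments describe the standard behavior of String.split().

```java
// Hypothetical demo class; not part of the book's source code.
public class SplitDemo {

    /** Splits on runs of whitespace (spaces, tabs, and so on). */
    public static String[] words(String input) {
        // trim() first, so leading whitespace does not produce an empty first element
        return input.trim().split("\\s+");
    }

    /** Naive comma split; does not understand quoted fields as a CSV library would. */
    public static String[] commaFields(String line) {
        return line.split(",");
    }

    public static void main(String[] args) {
        // A single-space pattern yields an empty string between adjacent spaces
        System.out.println("a  b".split(" ").length);
        System.out.println(String.join("|", words(" a  b\tc ")));
        // Interior empty fields are kept; trailing empty fields are dropped
        System.out.println(String.join("|", commaFields("a,b,,d")));
    }
}
```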

Another method is to use StringTokenizer. Its hasMoreTokens() and nextToken() methods follow the Iterator design pattern (see Recipe 7.6), although the class does not implement the java.util.Iterator interface itself:

StrTokDemo.java

StringTokenizer st = new StringTokenizer("Hello World of Java");

while (st.hasMoreTokens( ))
    System.out.println("Token: " + st.nextToken( ));

StringTokenizer also implements the Enumeration interface (see Recipe 7.6), but if you use the methods thereof you need to cast the results to String.

A StringTokenizer normally breaks the String into tokens at what we would think of as “word boundaries” in European languages. Sometimes you want to break at some other character. No problem. When you construct your StringTokenizer, in addition to passing in the string to be tokenized, pass in a second string that lists the “break characters.” For example:

StrTokDemo2.java

StringTokenizer st = new StringTokenizer("Hello, World|of|Java", ", |");

while (st.hasMoreElements( ))
    System.out.println("Token: " + st.nextElement( ));

It outputs the four words, each on a line by itself, with no punctuation.

But wait, there’s more! What if you are reading lines like:

FirstName|LastName|Company|PhoneNumber

and your dear old Aunt Begonia hasn’t been employed for the last 38 years? Her “Company” field will in all probability be blank.3 If you look very closely at the previous code example, you’ll see that it has two delimiters together (the comma and the space), but if you run it, there are no “extra” tokens; that is, the StringTokenizer normally discards adjacent consecutive delimiters. For cases like the phone list, where you need to preserve null fields, there is good news and bad news. The good news is that you can do it: you simply add a third argument of true when constructing the StringTokenizer, meaning that you wish to see the delimiters as tokens. The bad news is that you now get to see the delimiters as tokens, so you have to do the arithmetic yourself. Want to see it? Run this program:

StrTokDemo3.java

StringTokenizer st =
    new StringTokenizer("Hello, World|of|Java", ", |", true);

while (st.hasMoreElements( ))
    System.out.println("Token: " + st.nextElement( ));

and you get this output:

C:>java strings.StrTokDemo3
Token: Hello
Token: ,
Token:
Token: World
Token: |
Token: of
Token: |
Token: Java
C:>

This isn’t how you’d like StringTokenizer to behave, ideally, but it is serviceable enough most of the time. Example 3-1 processes the tokens and skips over the delimiters, leaving a null in each empty field and returning the results as an array of Strings.

Example 3-1. main/src/main/java/strings/StrTokDemo4.java (StringTokenizer)
public class StrTokDemo4 {
    public final static int MAXFIELDS = 5;
    public final static String DELIM = "|";

    /** Processes one String, returns it as an array of Strings */
    public static String[] process(String line) {
        String[] results = new String[MAXFIELDS];

        // Unless you ask StringTokenizer to give you the tokens,
        // it silently discards multiple null tokens.
        StringTokenizer st = new StringTokenizer(line, DELIM, true);

        int i = 0;
        // stuff each token into the current slot in the array.
        while (st.hasMoreTokens()) {
            String s = st.nextToken();
            if (s.equals(DELIM)) {
                if (++i >= MAXFIELDS)
                    // This is messy: See StrTokDemo4b which uses
                    // a List to allow any number of fields.
                    throw new IllegalArgumentException("Input line " +
                        line + " has too many fields");
                continue;
            }
            results[i] = s;
        }
        return results;
    }

    public static void printResults(String input, String[] outputs) {
        System.out.println("Input: " + input);
        for (int i = 0; i < outputs.length; i++)
            System.out.println("Output " + i + " was: " + outputs[i]);
    }

    public static void main(String[] a) {
        printResults("A|B|C|D", process("A|B|C|D"));
        printResults("A||C|D", process("A||C|D"));
        printResults("A|||D|E", process("A|||D|E"));
    }
}

When you run this, you will see that A is always in Field 1, B (if present) is in Field 2, and so on. In other words, the null fields are being handled properly:

Input: A|B|C|D
Output 0 was: A
Output 1 was: B
Output 2 was: C
Output 3 was: D
Output 4 was: null
Input: A||C|D
Output 0 was: A
Output 1 was: null
Output 2 was: C
Output 3 was: D
Output 4 was: null
Input: A|||D|E
Output 0 was: A
Output 1 was: null
Output 2 was: null
Output 3 was: D
Output 4 was: E
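For comparison, String.split() can also preserve empty fields, which may be simpler when a fixed-size array is not required. This is only a sketch of an alternative, not the approach taken in Example 3-1; the class name SplitFieldsDemo is hypothetical, and note that empty fields come back as empty strings rather than null.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; not part of the book's source code.
public class SplitFieldsDemo {

    /** Splits a line on '|' and keeps empty fields, including trailing ones. */
    public static String[] fields(String line) {
        // Pattern.quote() is needed because | is a regex metacharacter;
        // a negative limit tells split() not to discard trailing empty strings.
        return line.split(Pattern.quote("|"), -1);
    }

    public static void main(String[] args) {
        for (String input : new String[] { "A|B|C|D", "A||C|D", "A|||D|E" }) {
            String[] out = fields(input);
            System.out.println(input + " -> " + out.length + " fields: "
                + String.join(",", out));
        }
    }
}
```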

See Also

Many occurrences of StringTokenizer may be replaced with regular expressions (see Chapter 4) with considerably more flexibility. For example, to extract all the numbers from a String, you can use this code:

Matcher tokenizer = Pattern.compile("\\d+").matcher(inputString);
while (tokenizer.find( )) {
        String courseString = tokenizer.group(0);
        int courseNumber = Integer.parseInt(courseString);
        ...

This allows user input to be more flexible than you could easily handle with a StringTokenizer. Assuming that the numbers represent course numbers at some educational institution, the inputs “471,472,570” or “Courses 471 and 472, 570” or just “471 472 570” should all give the same results.
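A self-contained version of that idea might look like the sketch below. The class name CourseNumbers and the extract() method are assumptions made for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; not part of the book's source code.
public class CourseNumbers {

    // One or more consecutive digits
    private static final Pattern DIGITS = Pattern.compile("\\d+");

    /** Returns every run of digits in the input, in order, as integers. */
    public static List<Integer> extract(String input) {
        List<Integer> numbers = new ArrayList<>();
        Matcher m = DIGITS.matcher(input);
        while (m.find()) {
            numbers.add(Integer.parseInt(m.group()));
        }
        return numbers;
    }

    public static void main(String[] args) {
        System.out.println(extract("471,472,570"));
        System.out.println(extract("Courses 471 and 472, 570"));
        System.out.println(extract("471 472 570"));
    }
}
```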

3.2 Putting Strings Together with StringBuilder

Problem

You need to put some String pieces (back) together.

Solution

Use string concatenation: the + operator. The compiler implicitly constructs a StringBuilder for you and uses its append() methods (unless all the string parts are known at compile time).

Better yet, construct and use a StringBuilder yourself.

Discussion

An object of one of the StringBuilder classes basically represents a collection of characters. It is similar to a String object (String and StringBuilder have several methods that are forced to be identical by their implementation of the CharSequence interface). However, as mentioned, Strings are immutable; StringBuilders are mutable and designed for, well, building Strings. You typically construct a StringBuilder, invoke the methods needed to get the character sequence just the way you want it, and then call toString() to generate a String representing the same character sequence for use in most of the Java API, which deals in Strings.

StringBuffer is historical; it’s been around since the beginning of time. Some of its methods are synchronized (see Recipe 16.5), which involves unneeded overhead in a single-threaded context. In Java 5, this class was “split” into StringBuffer (which is synchronized) and StringBuilder (which is not synchronized); StringBuilder is thus faster and preferable for single-threaded use. Another new class, AbstractStringBuilder, is the parent of both. In the following discussion, I’ll use “the StringBuilder classes” to refer to all three because they mostly have the same methods.

The book’s example code provides a StringBuilderDemo and a StringBufferDemo. Except for the fact that StringBuilder is not threadsafe, these API classes are identical and can be used interchangeably, so my two demo programs are almost identical except that each one uses the appropriate builder class.

The StringBuilder classes have a variety of methods for inserting, replacing, and otherwise modifying a given StringBuilder. Conveniently, the append() methods return a reference to the StringBuilder itself, so “stacked” statements like .append(…).append(…) are fairly common. This style of coding is referred to as a “fluent API” because it reads smoothly, like prose from a native speaker of a human language. You might even see this style of coding in a toString() method, for example. Example 3-2 shows three ways of concatenating strings.

Example 3-2. main/src/main/java/strings/StringBuilderDemo.java
public class StringBuilderDemo {

    public static void main(String[] argv) {

        String s1 = "Hello" + ", " + "World";
        System.out.println(s1);

        // Build a StringBuilder, and append some things to it.
        StringBuilder sb2 = new StringBuilder();
        sb2.append("Hello");
        sb2.append(',');
        sb2.append(' ');
        sb2.append("World");

        // Get the StringBuilder's value as a String, and print it.
        String s2 = sb2.toString();
        System.out.println(s2);

        // Now do the above all over again, but in a more
        // concise (and typical "real-world" Java) fashion.

        System.out.println(
          new StringBuilder()
            .append("Hello")
            .append(',')
            .append(' ')
            .append("World"));
    }
}

In fact, all the methods that modify more than one character of a StringBuilder’s contents (i.e., append(), delete(), deleteCharAt(), insert(), replace(), and reverse()) return a reference to the builder object to facilitate this “fluent API” style of coding.
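As a quick sketch of chaining the other modifying methods, consider the following. FluentBuilderDemo is a made-up name, and the comments show the builder's contents after each call.

```java
// Hypothetical demo class; not part of the book's source code.
public class FluentBuilderDemo {

    /** Chains several modifying methods, each of which returns the same builder. */
    public static String build() {
        return new StringBuilder("Hello World")
            .insert(5, ',')            // "Hello, World"
            .replace(7, 12, "Java")    // "Hello, Java" (end index is exclusive)
            .append('!')               // "Hello, Java!"
            .reverse()                 // "!avaJ ,olleH"
            .toString();
    }

    public static void main(String[] args) {
        System.out.println(build());
    }
}
```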

As another example of using a StringBuilder, consider the need to convert a list of items into a comma-separated list, while avoiding getting an extra comma after the last element of the list. This can be done using a StringBuilder, although in Java 8+ there is a static String method to do the same. Code for both approaches is shown in Example 3-3.

Example 3-3. main/src/main/java/strings/StringBuilderCommaList.java
        System.out.println(
            "Split using String.split; joined using 1.8 String join");
        System.out.println(String.join(", ", SAMPLE_STRING.split(" ")));

        System.out.println(
            "Split using String.split; joined using StringBuilder");
        StringBuilder sb1 = new StringBuilder();
        for (String word : SAMPLE_STRING.split(" ")) {
            if (sb1.length() > 0) {
                sb1.append(", ");
            }
            sb1.append(word);
        }
        System.out.println(sb1);

        System.out.println(
            "Split using StringTokenizer; joined using StringBuilder");
        StringTokenizer st = new StringTokenizer(SAMPLE_STRING);
        StringBuilder sb2 = new StringBuilder();
        while (st.hasMoreElements()) {
            sb2.append(st.nextToken());
            if (st.hasMoreElements()) {
                sb2.append(", ");
            }
        }
        System.out.println(sb2);

The first method is clearly the most compact; the static String.join() makes short work of this task. The next method uses the StringBuilder.length() method, so it will only work correctly when you are starting with an empty StringBuilder. The third method relies on calling the informational method hasMoreElements() in the Enumeration (or hasNext() in an Iterator, as discussed in Recipe 7.6) more than once on each element. An alternative method, particularly when you aren’t starting with an empty builder, would be to use a boolean flag variable to track whether you’re at the beginning of the list.
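The boolean-flag alternative could be sketched as follows. The class CommaListFlagDemo and its appendList() method are hypothetical; the point is only that the flag, rather than the builder's length, decides whether a separator is needed.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical demo class; not part of the book's source code.
public class CommaListFlagDemo {

    /** Appends items to sb, separated by ", ", even if sb already has content. */
    public static StringBuilder appendList(StringBuilder sb, List<String> items) {
        boolean first = true;          // true until the first item has been appended
        for (String item : items) {
            if (!first) {
                sb.append(", ");
            }
            sb.append(item);
            first = false;
        }
        return sb;
    }

    public static void main(String[] args) {
        StringBuilder sb = new StringBuilder("Items: ");
        System.out.println(appendList(sb, Arrays.asList("a", "b", "c")));
    }
}
```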

3.3 Processing a String One Character at a Time

Problem

You want to process the contents of a string, one character at a time.

Solution

Use a for loop and the String’s charAt() or codePointAt() method. Or use a “for each” loop over the array returned by the String’s toCharArray() method.

Discussion

A string’s charAt() method retrieves a given character by index number (starting at zero) from within the String object. Since Unicode has had to expand beyond 16 bits, not all Unicode characters can fit into a Java char variable. There is thus an analogous codePointAt() method, whose return type is int. To process all the characters in a String, one after another, use a for loop ranging from zero to String.length()-1. Here we process all the characters in a String:

main/src/main/java/strings/StrCharAt.java

public class StrCharAt {
    public static void main(String[] av) {
        String a = "A quick bronze fox";
        for (int i=0; i < a.length(); i++) { // no forEach, need the index
            String message = String.format(
                "charAt is '%c', codePointAt is %3d, casted it's '%c'",
                     a.charAt(i),
                     a.codePointAt(i),
                     (char)a.codePointAt(i));
            System.out.println(message);
        }
    }
}

Given that the “for each” loop has been in the language for ages, you might be excused for expecting to be able to write something like for (char ch : myString) {…}. Unfortunately, this does not work. But you can use myString.toCharArray() as in the following:

public class ForEachChar {
    public static void main(String[] args) {
        String mesg = "Hello world";

        // Does not compile, Strings are not iterable
        // for (char ch : mesg) {
        //        System.out.println(ch);
        // }

        System.out.println("Using toCharArray:");
        for (char ch : mesg.toCharArray()) {
            System.out.println(ch);
        }

        System.out.println("Using Streams:");
        mesg.chars().forEach(c -> System.out.println((char)c));
    }
}
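The loops above step through the string one char at a time, so a character outside the 16-bit range shows up as two separate surrogate values. If that matters, one option is to advance by code point instead. The sketch below shows this under the assumption that whole code points are wanted; CodePointLoop and its methods are illustrative names.

```java
// Hypothetical demo class; not part of the book's source code.
public class CodePointLoop {

    /** Lists each code point in s as U+XXXX, stepping over surrogate pairs. */
    public static String describe(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            sb.append(String.format("U+%04X ", cp));
            i += Character.charCount(cp);   // 1 for most characters, 2 for supplementary ones
        }
        return sb.toString().trim();
    }

    /** Counts code points using the IntStream returned by codePoints(). */
    public static long count(String s) {
        return s.codePoints().count();
    }

    public static void main(String[] args) {
        String s = "A\uD83D\uDE00";   // 'A' followed by one supplementary character
        System.out.println(s.length() + " chars, " + count(s) + " code points");
        System.out.println(describe(s));
    }
}
```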

A “checksum” is a numeric quantity representing and confirming the contents of a file. If you transmit the checksum of a file separately from the contents, a recipient can checksum the file—assuming the algorithm is known—and verify that the file was received intact. Example 3-4 shows the simplest possible checksum, computed just by adding the numeric values of each character. Note that on files, it does not include the values of the newline characters; in order to fix this, retrieve System.getProperty("line.separator"); and add its character value(s) into the sum at the end of each line. Or give up on line mode and read the file a character at a time.

Example 3-4. main/src/main/java/strings/CheckSum.java
    /** CheckSum one text file, given an open BufferedReader.
     * Checksum does not include line endings, so will give the
     * same value for given text on any platform. Do not use
     * on binary files!
     */
    public static int process(BufferedReader is) {
        int sum = 0;
        try {
            String inputLine;

            while ((inputLine = is.readLine()) != null) {
                for (char c : inputLine.toCharArray()) {
                    sum += c;
                }
            }
        } catch (IOException e) {
            throw new RuntimeException("IOException: " + e);
        }
        return sum;
    }
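The character-at-a-time alternative mentioned above, which does include line endings in the sum, might be sketched like this. CharChecksum is a hypothetical class, and ofText() exists only to make the example easy to try without a file.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

// Hypothetical demo class; not part of the book's source code.
public class CharChecksum {

    /** Sums every character read, including any line-ending characters. */
    public static int process(Reader in) {
        int sum = 0;
        try {
            int c;
            while ((c = in.read()) != -1) {   // read() returns -1 at end of input
                sum += c;
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sum;
    }

    /** Convenience wrapper for checksumming text already held in memory. */
    public static int ofText(String text) {
        return process(new StringReader(text));
    }

    public static void main(String[] args) {
        System.out.println(ofText("AB\n"));     // Unix-style line ending
        System.out.println(ofText("AB\r\n"));   // Windows-style line ending
    }
}
```

Unlike Example 3-4, this variant gives different values for the same text with different line endings, which may or may not be what you want.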

3.4 Aligning, Indenting, and Un-indenting Strings

Problem

You want to align strings to the left, right, or center.

Solution

Do the math yourself, and use substring (see Recipe 3.1) and a StringBuilder (see Recipe 3.2). Or, use my StringAlign class, which is based on the java.text.Format class. For left or right alignment, use String.format().

Discussion

Centering and aligning text comes up fairly often. Suppose you want to print a simple report with centered page numbers. There doesn’t seem to be anything in the standard API that will do the job fully for you. But I have written a class called StringAlign that will. Here’s how you might use it:

public class StringAlignSimple {

    public static void main(String[] args) {
        // Construct a "formatter" to center strings.
        StringAlign formatter = new StringAlign(70, StringAlign.Justify.CENTER);
        // Try it out, for page "i"
        System.out.println(formatter.format("- i -"));
        // Try it out, for page 4. Since this formatter is
        // optimized for Strings, not specifically for page numbers,
        // we have to convert the number to a String
        System.out.println(formatter.format(Integer.toString(4)));
    }
}

If you compile and run this class, it prints the two demonstration page numbers centered, as shown:

> javac -d . StringAlignSimple.java
> java strings.StringAlignSimple
                                - i -
                                  4
>

Example 3-5 is the code for the StringAlign class. Note that this class extends the class Format in the package java.text. There is a series of Format classes that all have at least one method called format(). It is thus in a family with numerous other formatters, such as DateFormat, NumberFormat, and others, that we’ll take a look at in upcoming chapters.

Example 3-5. main/src/main/java/strings/StringAlign.java
public class StringAlign extends Format {

    private static final long serialVersionUID = 1L;

    public enum Justify {
        /* Constant for left justification. */
        LEFT,
        /* Constant for centering. */
        CENTER,
        /** Constant for right-justified Strings. */
        RIGHT,
    }

    /** Current justification */
    private Justify just;
    /** Current max length */
    private int maxChars;

    /** Construct a StringAlign formatter; length and alignment are
     * passed to the Constructor instead of each format() call as the
     * expected common use is in repetitive formatting e.g., page numbers.
     * @param maxChars - the maximum length of the output
     * @param just - one of the enum values LEFT, CENTER or RIGHT
     */
    public StringAlign(int maxChars, Justify just) {
        switch(just) {
        case LEFT:
        case CENTER:
        case RIGHT:
            this.just = just;
            break;
        default:
            throw new IllegalArgumentException("invalid justification arg.");
        }
        if (maxChars < 0) {
            throw new IllegalArgumentException("maxChars must be positive.");
        }
        this.maxChars = maxChars;
    }

    /** Format a String.
     * @param input - the string to be aligned.
     * @param where - the StringBuffer to append it to.
     * @param ignore - a FieldPosition (may be null, not used but
     * specified by the general contract of Format).
     */
    @Override
    public StringBuffer format(
        Object input, StringBuffer where, FieldPosition ignore)  {

        String s = input.toString();
        String wanted = s.substring(0, Math.min(s.length(), maxChars));

        // Get the spaces in the right place.
        switch (just) {
            case RIGHT:
                pad(where, maxChars - wanted.length());
                where.append(wanted);
                break;
            case CENTER:
                int toAdd = maxChars - wanted.length();
                pad(where, toAdd/2);
                where.append(wanted);
                pad(where, toAdd - toAdd/2);
                break;
            case LEFT:
                where.append(wanted);
                pad(where, maxChars - wanted.length());
                break;
            }
        return where;
    }

    protected final void pad(StringBuffer to, int howMany) {
        for (int i=0; i<howMany; i++)
            to.append(' ');
    }

    /** Convenience Routine */
    String format(String s) {
        return format(s, new StringBuffer(), null).toString();
    }

    /** ParseObject is required, but not useful here. */
    public Object parseObject (String source, ParsePosition pos)  {
        return source;
    }
}

Java 12 introduced a new method public String indent(int n) which prepends n spaces to the string, which is treated as a sequence of lines with line separators. This works well in conjunction with the Java 11 Stream<String> lines() method e.g., for the case where a series of lines, conveniently already stored in a single string, needs the same indent (Streams, and the “::” notation, are explained in Recipe 9.1).

jshell> "abc\ndef".indent(30).lines().forEach(System.out::println);
                              abc
                              def

jshell> "abc\ndef".indent(30).indent(-10).lines().forEach(System.out::println);
                    abc
                    def

jshell>
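For plain left or right alignment, the String.format() route mentioned in the Solution can be sketched as follows. FormatAlignDemo and its methods are hypothetical names; the width is assumed to be at least 1, since a zero width is not a valid format specifier.

```java
// Hypothetical demo class; not part of the book's source code.
public class FormatAlignDemo {

    /** Left-aligns s in a field of the given width, padding on the right. */
    public static String left(String s, int width) {
        return String.format("%-" + width + "s", s);
    }

    /** Right-aligns s in a field of the given width, padding on the left. */
    public static String right(String s, int width) {
        return String.format("%" + width + "s", s);
    }

    /** Left-aligns and also truncates to width, using a precision of the same size. */
    public static String leftTruncated(String s, int width) {
        return String.format("%-" + width + "." + width + "s", s);
    }

    public static void main(String[] args) {
        System.out.println("[" + left("abc", 6) + "]");
        System.out.println("[" + right("abc", 6) + "]");
        System.out.println("[" + leftTruncated("abcdefgh", 6) + "]");
    }
}
```

Note that without a precision, String.format() pads short strings but never truncates long ones, whereas StringAlign always cuts the input down to maxChars.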

See Also

The alignment of numeric columns is considered in Chapter 5.

3.5 Converting Between Unicode Characters and Strings

Problem

You want to convert between Unicode characters and Strings.

Solution

Use Java char or String datatypes to deal with characters; these intrinsically support Unicode. Print characters as integers to display their raw value if needed.

Discussion

Unicode is an international standard that aims to represent all known characters used by people in their various languages. Though the original ASCII character set is a subset, Unicode is huge. At the time Java was created, Unicode was a 16-bit character set, so it seemed natural to make Java char values be 16 bits in width, and for years a char could hold any Unicode character. However, over time, Unicode has grown, to the point that it now includes over a million “code points” or characters, more than the 65,536 that could be represented in 16 bits.4 Not all possible 16-bit values were defined as characters in UCS-2, the 16-bit version of Unicode originally used in Java. A few were reserved as “escape characters,” which allows for multicharacter-length mappings to less common characters. Fortunately, there is a go-between standard, called UTF-16 (16-bit Unicode Transformation Format). As the String class documentation puts it:

A String represents a string in the UTF-16 format in which supplementary characters are represented by surrogate pairs (see the section Unicode Character Representations in the Character class for more information). Index values refer to char code units, so a supplementary character uses two positions in a String.

The String class provides methods for dealing with Unicode code points (i.e., characters), in addition to those for dealing with Unicode code units (i.e., char values).

The charAt() method of String returns the char value for the character at the specified offset. The StringBuilder append() method has a form that accepts a char. Because char is an integer type, you can even do arithmetic on chars, though this is not needed as frequently as in, say, C. Nor is it often recommended, because the Character class provides the methods for which these operations were normally used in languages such as C. Here is a program that uses arithmetic on chars to control a loop, and also appends the characters into a StringBuilder (see Recipe 3.2):

        // UnicodeChars.java
        StringBuilder b = new StringBuilder();
        for (char c = 'a'; c<'d'; c++) {
            b.append(c);
        }
        b.append('\u00a5');    // Japanese Yen symbol
        b.append('\u01FC');    // Roman AE with acute accent
        b.append('\u0391');    // GREEK Capital Alpha
        b.append('\u03A9');    // GREEK Capital Omega

        for (int i=0; i<b.length(); i++) {
            System.out.printf(
                "Character #%d (%04x) is %c%n",
                i, (int)b.charAt(i), b.charAt(i));
        }
        System.out.println("Accumulated characters are " + b);

When you run it, the expected results are printed for the ASCII characters. On Unix and Mac systems, the default fonts don’t include all the additional characters, so they are either omitted or mapped to irregular characters:

$ java -cp target/classes strings.UnicodeChars
Character #0 (0061) is a
Character #1 (0062) is b
Character #2 (0063) is c
Character #3 (00a5) is ¥
Character #4 (01fc) is Ǽ
Character #5 (0391) is Α
Character #6 (03a9) is Ω
Accumulated characters are abc¥ǼΑΩ
$

The Windows system used to try this doesn’t have most of those characters either, but at least it prints the ones it knows are lacking as question marks (Windows system fonts are more homogenous than those of the various Unix systems, so it is easier to know what won’t work). On the other hand, it tries to print the Yen sign as a Spanish capital Enye (N with a ~ over it).

Character #0 is a
Character #1 is b
Character #2 is c
Character #3 is ¥
Character #4 is ?
Character #5 is ?
Character #6 is ?
Accumulated characters are abc¥___

where the “_” characters are unprintable characters, which may appear as a question mark (“?”).
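Going between a numeric code point and a String, in either direction, can be sketched as follows. The class CodePointConvert and its method names are assumptions for this example; Character.toChars() and codePointAt() are the standard library calls doing the work.

```java
// Hypothetical demo class; not part of the book's source code.
public class CodePointConvert {

    /** Builds a String from one code point; supplementary ones become two chars. */
    public static String fromCodePoint(int codePoint) {
        return new String(Character.toChars(codePoint));
    }

    /** Returns the numeric value of the first code point in a non-empty string. */
    public static int firstCodePoint(String s) {
        return s.codePointAt(0);
    }

    public static void main(String[] args) {
        String omega = fromCodePoint(0x03A9);      // fits in a single char
        String emoji = fromCodePoint(0x1F600);     // needs a surrogate pair
        System.out.printf("%s has length %d%n", omega, omega.length());
        System.out.printf("%s has length %d%n", emoji, emoji.length());
        System.out.printf("First code point of the Yen sign is %04x%n",
            firstCodePoint("\u00a5"));
    }
}
```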

See Also

The Unicode program in this book’s online source displays any 256-character section of the Unicode character set. You can download documentation listing every character in the Unicode character set from the Unicode Consortium.

3.6 Reversing a String by Word or by Character

Problem

You wish to reverse a string, either a character at a time or a word at a time.

Solution

You can reverse a string by character easily, using a StringBuilder. There are several ways to reverse a string a word at a time. One natural way is to use a StringTokenizer and a stack. Stack is a class (defined in java.util; see Recipe 7.16) that implements an easy-to-use last-in, first-out (LIFO) stack of objects.

Discussion

To reverse the characters in a string, use the StringBuilder reverse() method:

StringRevChar.java

String sh = "FCGDAEB";
System.out.println(sh + " -> " + new StringBuilder(sh).reverse( ));

The letters in this example list the order of the sharps in the key signatures of Western music; in reverse, it lists the order of flats. Alternatively, of course, you could reverse the characters yourself, using character-at-a-time mode (see Recipe 3.3).

A popular mnemonic, or memory aid, to help music students remember the order of sharps and flats consists of one word for each sharp instead of just one letter. Let’s reverse this one word at a time. Example 3-6 adds each one to a Stack (see Recipe 7.16), then processes the whole lot in LIFO order, which reverses the order.

Example 3-6. main/src/main/java/strings/StringReverse.java
        String s = "Father Charles Goes Down And Ends Battle";

        // Put it in the stack frontwards
        Stack<String> myStack = new Stack<>();
        StringTokenizer st = new StringTokenizer(s);
        while (st.hasMoreTokens()) {
            myStack.push(st.nextToken());
        }

        // Print the stack backwards
        System.out.print('"' + s + '"' + " backwards by word is:\n\t\"");
        while (!myStack.empty()) {
            System.out.print(myStack.pop());
            System.out.print(' ');    // inter-word spacing
        }
        System.out.println('"');
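As one of the other possible ways, the same reversal can be written with String.split() and an ArrayDeque used as a stack. This is a sketch of an alternative rather than the book's own code, and ReverseWordsDemo is a made-up name.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical demo class; not part of the book's source code.
public class ReverseWordsDemo {

    /** Returns the words of s in reverse order, separated by single spaces. */
    public static String reverseWords(String s) {
        Deque<String> stack = new ArrayDeque<>();
        for (String word : s.trim().split("\\s+")) {
            stack.push(word);          // push() adds at the head, so the last word ends up first
        }
        // Iterating a Deque goes from head to tail, that is, in LIFO order here
        return String.join(" ", stack);
    }

    public static void main(String[] args) {
        String s = "Father Charles Goes Down And Ends Battle";
        System.out.println('"' + s + '"' + " backwards by word is:");
        System.out.println("\t\"" + reverseWords(s) + '"');
    }
}
```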

3.7 Expanding and Compressing Tabs

Problem

You need to convert space characters to tab characters in a file, or vice versa. You might want to replace spaces with tabs to save space on disk, or go the other way to deal with a device or program that can’t handle tabs.

Solution

Use my Tabs class or its subclass EnTab.

Discussion

Because programs that deal with tabbed text or data expect tab stops to be at fixed positions, you cannot use a typical text editor to replace tabs with spaces or vice versa. Example 3-7 is a listing of EnTab, complete with a sample main program. The program works a line at a time. For each character on the line, if the character is a space, we see if we can coalesce it with previous spaces to output a single tab character. This program depends on the Tabs class, which we’ll come to shortly. The Tabs class is used to decide which column positions represent tab stops and which do not.

Example 3-7. main/src/main/java/strings/EnTab.java
public class EnTab {

    private static Logger logger = Logger.getLogger(EnTab.class.getSimpleName());

    /** The Tabs (tab logic handler) */
    protected Tabs tabs;

    /**
     * Delegate tab spacing information to tabs.
     */
    public int getTabSpacing() {
        return tabs.getTabSpacing();
    }

    /**
     * Main program: just create an EnTab object, and pass the standard input
     * or the named file(s) through it.
     */
    public static void main(String[] argv) throws IOException {
        EnTab et = new EnTab(8);
        if (argv.length == 0) // do standard input
            et.entab(
                new BufferedReader(new InputStreamReader(System.in)),
                System.out);
        else
            for (String fileName : argv) { // do each file
                et.entab(
                    new BufferedReader(new FileReader(fileName)),
                    System.out);
            }
    }

    /**
     * Constructor: just save the tab values.
     * @param n The number of spaces each tab is to replace.
     */
    public EnTab(int n) {
        tabs = new Tabs(n);
    }

    public EnTab() {
        tabs = new Tabs();
    }

    /**
     * entab: process one file, replacing blanks with tabs.
     * @param is A BufferedReader opened to the file to be read.
     * @param out a PrintWriter to send the output to.
     */
    public void entab(BufferedReader is, PrintWriter out) throws IOException {

        // main loop: process entire file one line at a time.
        is.lines().forEach(line -> {
            out.println(entabLine(line));
        });
        // Flush, or output sent via the PrintStream overload may never appear.
        out.flush();
    }

    /**
     * entab: process one file, replacing blanks with tabs.
     *
     * @param is A BufferedReader opened to the file to be read.
     * @param out A PrintStream to write the output to.
     */
    public void entab(BufferedReader is, PrintStream out) throws IOException {
        entab(is, new PrintWriter(out));
    }

    /**
     * entabLine: process one line, replacing blanks with tabs.
     * @param line the string to be processed
     */
    public String entabLine(String line) {
        int N = line.length(), outCol = 0;
        StringBuilder sb = new StringBuilder();
        char ch;
        int consumedSpaces = 0;

        for (int inCol = 0; inCol < N; inCol++) { // Cannot use foreach here
            ch = line.charAt(inCol);
            // If we get a space, consume it, don't output it.
            // If this takes us to a tab stop, output a tab character.
            if (ch == ' ') {
                logger.info("Got space at " + inCol);
                if (tabs.isTabStop(inCol)) {
                    logger.info("Got a Tab Stop " + inCol);
                    sb.append('\t');
                    // The tab stands in for every space consumed so far,
                    // plus this one.
                    outCol = inCol + 1;
                    consumedSpaces = 0;
                } else {
                    consumedSpaces++;
                }
                continue;
            }

            // We're at a non-space; any spaces consumed since the last
            // output character did not reach a tab stop, so we need to
            // put them back out.
            while (outCol < inCol) {
                logger.info("Padding space at " + inCol);
                sb.append(' ');
                outCol++;
            }
            consumedSpaces = 0;

            // Now we have a plain character to output.
            sb.append(ch);
            outCol++;

        }
        // If line ended with trailing (or only!) spaces, preserve them.
        for (int i = 0; i < consumedSpaces; i++) {
            logger.info("Padding space at end # " + i);
            sb.append(' ');
        }
        return sb.toString();
    }
}

This code was patterned after a program in Kernighan and Plauger’s classic work, Software Tools. While their version was in a language called RatFor (Rational Fortran), my version has since been through several translations. Their version actually worked one character at a time, and for a long time I tried to preserve this overall structure. Eventually, I rewrote it to be a line-at-a-time program.

The program that goes in the opposite direction, taking tabs out rather than putting them in, is the DeTab class shown in Example 3-8; only the core methods are shown.

Example 3-8. main/src/main/java/strings/DeTab.java
public class DeTab {
    Tabs ts;

    public static void main(String[] argv) throws IOException {
        DeTab dt = new DeTab(8);
        dt.detab(new BufferedReader(new InputStreamReader(System.in)),
                new PrintWriter(System.out));
    }

    public DeTab(int n) {
        ts = new Tabs(n);
    }
    public DeTab() {
        ts = new Tabs();
    }

    /** detab one file (replace tabs with spaces)
     * @param is - the file to be processed
     * @param out - the updated file
     */
    public void detab(BufferedReader is, PrintWriter out) throws IOException {
        is.lines().forEach(line -> {
            out.println(detabLine(line));
        });
        // Flush, since main() wraps System.out in a buffered PrintWriter.
        out.flush();
    }

    /** detab one line (replace tabs with spaces)
     * @param line - the line to be processed
     * @return the updated line
     */
    public String detabLine(String line) {
        char c;
        int col;
        StringBuilder sb = new StringBuilder();
        col = 0;
        for (int i = 0; i < line.length(); i++) {
            // Either ordinary character or tab.
            if ((c = line.charAt(i)) != '\t') {
                sb.append(c); // Ordinary
                ++col;
                continue;
            }
            do { // Tab, expand it, must put >=1 space
                sb.append(' ');
                // Stop once the column just filled is the last one
                // before the next tab stop.
            } while (!ts.isTabStop(col++));
        }
        return sb.toString();
    }
}

The Tabs class provides two methods: getTabSpacing() and isTabStop(). Example 3-9 is the source for the Tabs class.

Example 3-9. main/src/main/java/strings/Tabs.java
public class Tabs {
    /** tabs every so often */
    public final static int DEFTABSPACE =   8;
    /** the current tab stop setting. */
    protected int tabSpace = DEFTABSPACE;
    /** The longest line that we initially set tabs for. */
    public final static int MAXLINE  = 255;

    /** Construct a Tabs object with a given tab stop setting */
    public Tabs(int n) {
        if (n <= 0) {
            n = 1;
        }
        tabSpace = n;
    }

    /** Construct a Tabs object with the default tab stop setting */
    public Tabs() {
        this(DEFTABSPACE);
    }

    /**
     * @return Returns the tabSpace.
     */
    public int getTabSpacing() {
        return tabSpace;
    }

    /** isTabStop - returns true if given column is a tab stop.
     * @param col - the current column number
     */
    public boolean isTabStop(int col) {
        if (col <= 0)
            return false;
        return (col+1) % tabSpace == 0;
    }
}
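To see the column arithmetic in isolation, here is a minimal sketch of tab expansion that does not use the Tabs class at all. It assumes tab stops fall every tabWidth columns (columns 0, 8, 16, and so on for a width of 8) and computes the padding directly; the names TabExpandSketch and expand are mine, invented for this illustration.

```java
class TabExpandSketch {

    /**
     * Replace each tab in line with enough spaces to reach the next
     * tab stop, assuming stops every tabWidth columns.
     */
    static String expand(String line, int tabWidth) {
        if (tabWidth <= 0) {
            throw new IllegalArgumentException("tabWidth must be positive");
        }
        StringBuilder sb = new StringBuilder();
        int col = 0; // current output column, zero-based
        for (int i = 0; i < line.length(); i++) {
            char c = line.charAt(i);
            if (c == '\t') {
                // Distance from the current column to the next stop (1..tabWidth).
                int pad = tabWidth - (col % tabWidth);
                for (int k = 0; k < pad; k++) {
                    sb.append(' ');
                }
                col += pad;
            } else {
                sb.append(c);
                col++;
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // The bars make the resulting widths visible.
        System.out.println("|" + expand("\tx", 8) + "|");
        System.out.println("|" + expand("abc\td", 8) + "|");
    }
}
```

A tab at column 0 becomes eight spaces and a tab after "abc" becomes five, so the following character always lands on a multiple of the tab width.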

3.8 Controlling Case

Problem

You need to convert strings to uppercase or lowercase, or to compare strings without regard for case.

Solution

The String class has a number of methods for dealing with strings in a particular case. toUpperCase() and toLowerCase() each return a new string that is a copy of the current string, but converted as the name implies. Each can be called either with no arguments or with a Locale argument specifying the conversion rules; this is necessary because of internationalization. Java’s API provides significant internationalization and localization features, as covered in “Ian’s Basic Steps: Internationalization”. Whereas the equals() method tells you if another string is exactly the same, equalsIgnoreCase() tells you if all characters are the same regardless of case. Here, you can’t specify a locale; the comparison is made character by character and does not take the locale into account:

        String name = "Java Cookbook";
        System.out.println("Normal:\t" + name);
        System.out.println("Upper:\t" + name.toUpperCase());
        System.out.println("Lower:\t" + name.toLowerCase());
        String javaName = "java cookBook"; // If it were Java identifiers :-)
        if (!name.equals(javaName))
            System.err.println("equals() correctly reports false");
        else
            System.err.println("equals() incorrectly reports true");
        if (name.equalsIgnoreCase(javaName))
            System.err.println("equalsIgnoreCase() correctly reports true");
        else
            System.err.println("equalsIgnoreCase() incorrectly reports false");

If you run this, it prints the first name changed to uppercase and lowercase, then it reports that both methods work as expected:

C:\javasrc\strings>java strings.Case
Normal: Java Cookbook
Upper:  JAVA COOKBOOK
Lower:  java cookbook
equals() correctly reports false
equalsIgnoreCase() correctly reports true
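Why would you ever pass a Locale to toUpperCase()? The following sketch shows one well-known case, Turkish, where the lowercase letter i maps to a dotted capital I (U+0130) rather than to the ASCII letter. The class and method names (CaseLocaleSketch, upperFor) are hypothetical, chosen only for this illustration.

```java
import java.util.Locale;

class CaseLocaleSketch {

    /** Uppercase s using the rules of the locale named by an IETF language tag. */
    static String upperFor(String s, String languageTag) {
        return s.toUpperCase(Locale.forLanguageTag(languageTag));
    }

    public static void main(String[] args) {
        String word = "title";
        // Locale.ROOT gives locale-neutral rules: plain "TITLE".
        System.out.println(word.toUpperCase(Locale.ROOT).equals("TITLE"));  // true
        // Turkish rules turn 'i' into U+0130, so the result is not "TITLE".
        System.out.println(upperFor(word, "tr").equals("T\u0130TLE"));      // true
        // equalsIgnoreCase() takes no Locale argument at all.
        System.out.println("TITLE".equalsIgnoreCase(word));                 // true
    }
}
```

For strings that are identifiers or protocol keywords rather than text for people to read, passing Locale.ROOT is a common way to keep the result independent of where the program happens to run.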

See Also

Regular expressions make it simpler to ignore case in string searching (see Chapter 4).

3.9 Entering Nonprintable Characters

Problem

You need to put nonprintable characters into strings.

Solution

Use the backslash character and one of the Java string escapes.

Discussion

The Java string escapes are listed in Table 3-1.

Table 3-1. String escapes
To get:                   Use:      Notes
Tab                       \t
Linefeed (Unix newline)   \n        The call System.getProperty("line.separator") will give you the platform’s line end.
Carriage return           \r
Form feed                 \f
Backspace                 \b
Single quote              \'
Double quote              \"
Unicode character         \uNNNN    Four hexadecimal digits (no \x as in C/C++). See http://www.unicode.org for codes.
Octal(!) character        \NNN      Who uses octal (base 8) these days?
Backslash                 \\

Here is a code example that shows most of these in action:

public class StringEscapes {
    public static void main(String[] argv) {
        System.out.println("Java Strings in action:");
        // System.out.println("An alarm or alert: \a");    // not supported
        System.out.println("An alarm entered in Octal: \007");
        System.out.println("A tab key: \t(what comes after)");
        System.out.println("A newline: \n(what comes after)");
        System.out.println("A UniCode character: \u0207");
        System.out.println("A backslash character: \\");
    }
}
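Because most of these characters are invisible when printed, it can help to look at their numeric values instead. This small sketch (the class name EscapeCodesSketch and the helper code() are my own, not from the book's source) prints the character code that each escape produces.

```java
class EscapeCodesSketch {

    /** Numeric value of the first character of s. */
    static int code(String s) {
        return s.charAt(0);
    }

    public static void main(String[] args) {
        System.out.println("tab        " + code("\t"));      // 9
        System.out.println("linefeed   " + code("\n"));      // 10
        System.out.println("return     " + code("\r"));      // 13
        System.out.println("form feed  " + code("\f"));      // 12
        System.out.println("backspace  " + code("\b"));      // 8
        System.out.println("octal 007  " + code("\007"));    // 7
        System.out.println("unicode    " + code("\u0041"));  // 65, the letter A
        System.out.println("backslash  " + code("\\"));      // 92
    }
}
```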

If you have a lot of non-ASCII characters to enter, you may wish to consider using Java’s input methods, discussed briefly in the online documentation.

3.10 Trimming Blanks from the End of a String

Problem

You need to work on a string without regard for extra leading or trailing spaces a user may have typed.

Solution

Use the String class strip() or trim() methods.

Discussion

There are four methods in the String class for this:

strip()

Returns a string with all leading and trailing whitespace removed.

stripLeading()

Returns a string whose value is this string, with all leading white space removed.

stripTrailing()

Returns the string with all trailing whitespace removed.

trim()

Returns the string with all leading and trailing spaces removed.

For the strip() methods, “whitespace” is as defined by Character.isWhitespace(). For the trim() method, “space” means any character whose code point is less than or equal to U+0020 (decimal 32, the space character).
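The difference between the two definitions shows up as soon as a string contains Unicode whitespace beyond ASCII. In this sketch (which needs Java 11 or later for the strip methods) the sample string is padded with an EM SPACE, U+2003, plus an ordinary space on each side; the class name and the SAMPLE constant are invented for the illustration.

```java
class StripVsTrimSketch {

    /** EM SPACE + space, then "hello", then space + EM SPACE. */
    static final String SAMPLE = "\u2003 hello \u2003";

    public static void main(String[] args) {
        // strip() removes both kinds of whitespace.
        System.out.println("[" + SAMPLE.strip() + "]");          // [hello]
        // trim() stops at the EM SPACE, which is above U+0020,
        // so nothing at all is removed.
        System.out.println(SAMPLE.trim().length());              // 9
        // The one-sided variants leave the other end alone.
        System.out.println(SAMPLE.stripLeading().length());      // 7
        System.out.println(SAMPLE.stripTrailing().length());     // 7
    }
}
```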

Example 3-10 uses trim() to strip an arbitrary number of leading spaces and/or tabs from lines of Java source code in order to look for the characters //+ and //-. These strings are special Java comments I previously used to mark the parts of the programs in this book that I want to include in the printed copy.

Example 3-10. main/src/main/java/strings/GetMark.java (trimming and comparing strings)
public class GetMark {
    /** the default starting mark. */
    public final String START_MARK = "//+";
    /** the default ending mark. */
    public final String END_MARK = "//-";
    /** Set this to TRUE for running in "exclude" mode (e.g., for
     * building exercises from solutions) and to FALSE for running
     * in "extract" mode (e.g., writing a book and omitting the
     * imports and "public class" stuff).
     */
    public final static boolean START = true;
    /** True if we are currently inside marks. */
    protected boolean printing = START;
    /** True if you want line numbers */
    protected final boolean number = false;

    /** Get Marked parts of one file, given an open LineNumberReader.
     * This is the main operation of this class, and can be used
     * inside other programs or from the main() wrapper.
     */
    public void process(String fileName,
        LineNumberReader is,
        PrintStream out) {
        int nLines = 0;
        try {
            String inputLine;

            while ((inputLine = is.readLine()) != null) {
                if (inputLine.trim().equals(START_MARK)) {
                    if (printing)
                        // These go to stderr, so you can redirect the output
                        System.err.println("ERROR: START INSIDE START, " +
                            fileName + ':' + is.getLineNumber());
                    printing = true;
                } else if (inputLine.trim().equals(END_MARK)) {
                    if (!printing)
                        System.err.println("ERROR: STOP WHILE STOPPED, " +
                            fileName + ':' + is.getLineNumber());
                    printing = false;
                } else if (printing) {
                    if (number) {
                        out.print(nLines);
                        out.print(": ");
                    }
                    out.println(inputLine);
                    ++nLines;
                }
            }
            is.close();
            out.flush(); // Must not close - caller may still need it.
            if (nLines == 0)
                System.err.println("ERROR: No marks in " + fileName +
                    "; no output generated!");
        } catch (IOException e) {
            System.out.println("IOException: " + e);
        }
    }
}

3.11 Creating a Message with I18N Resources

Problem

You want your program to take “sensitivity lessons” so that it can communicate well internationally.

Solution

Your program must obtain all control and message strings via the internationalization software. Here’s how:

  1. Get a ResourceBundle.

    ResourceBundle rb = ResourceBundle.getBundle("Menus");

    I’ll talk about ResourceBundle in Recipe 3.13, but briefly, a ResourceBundle represents a collection of name-value pairs (resources). The names are names you assign to each GUI control or other user interface text, and the values are the text to assign to each control in a given language.

  2. Use this ResourceBundle to fetch the localized version of each control name.

    Old way:

    String label = "Exit";
    // Create the control, e.g., new JButton(label);

    New way:

    String label;
    try { label = rb.getString("exit.label"); }
    catch (MissingResourceException e) { label="Exit"; } // fallback
    // Create the control, e.g., new JButton(label);

This may seem quite a bit of code for one control, but you can write a convenience routine to simplify it, e.g.,

JButton exitButton = I18NUtil.getButton("exit.label", "Exit");

The file I18NUtil.java is included in the book’s code distribution.
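I can't reproduce I18NUtil here, but the core of any such convenience routine is a lookup with a fallback. The following is a minimal sketch of that idea, not the book's actual I18NUtil code: the class LabelLookupSketch and its methods are hypothetical, and a tiny in-memory ListResourceBundle stands in for the bundle that ResourceBundle.getBundle("Menus") would normally load.

```java
import java.util.ListResourceBundle;
import java.util.MissingResourceException;
import java.util.ResourceBundle;

class LabelLookupSketch {

    /** Look up key in rb; return fallback if the bundle has no such key. */
    static String getLabel(ResourceBundle rb, String key, String fallback) {
        try {
            return rb.getString(key);
        } catch (MissingResourceException e) {
            return fallback;
        }
    }

    /** Stand-in bundle with one French entry, so the sketch needs no files. */
    static ResourceBundle demoBundle() {
        return new ListResourceBundle() {
            @Override
            protected Object[][] getContents() {
                return new Object[][] { { "exit.label", "Quitter" } };
            }
        };
    }

    public static void main(String[] args) {
        ResourceBundle rb = demoBundle();
        System.out.println(getLabel(rb, "exit.label", "Exit"));  // Quitter
        System.out.println(getLabel(rb, "help.label", "Help"));  // Help
    }
}
```

A routine that builds a JButton or other control can then call getLabel() and pass the result to the control's constructor.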

While the example is a Swing JButton, the same approach goes with other UIs, such as the web tier. In JSF, for example, you might place your strings in a properties file called resources.properties and store it in src/main/resources. You’d load this in faces-config.xml:

  <application>
    <locale-config>
        <default-locale>en</default-locale>
        <supported-locale>en</supported-locale>
        <supported-locale>es</supported-locale>
        <supported-locale>fr</supported-locale>
    </locale-config>
    <resource-bundle>
        <base-name>resources</base-name>
        <var>msg</var>
    </resource-bundle>
  </application>

Then in each web page that needs these strings, refer to the resource using the msg variable in an expression:

// In signup.xhtml:
<h:outputText value="#{msg.prompt_firstname}"/>
<h:inputText required="true" id="firstName" value="#{person.firstName}" />

What happens at runtime?

The default locale is used, because we didn’t specify one. The default locale is platform-dependent:

Unix/POSIX

LANG environment variable (per user)

Windows

Control Panel→Regional Settings

Mac OS X

System Preferences→Language & Text

Others

See platform documentation

ResourceBundle.getBundle() locates a file whose name is the resource bundle name (Menus, in the previous example), plus an underscore and the locale name (if a non-default locale is set), plus another underscore and the locale variation (if any variation is set), plus the extension .properties. If a variation is set but the file can’t be found, it falls back to just the language code. If that can’t be found, it falls back to the original default. Table 3-2 shows some examples for various locales.

Note that Android apps, usually written in Java or Kotlin, use a similar mechanism, but with the files in XML format instead of Java Properties, and with some small changes in the name of the directory in which the resource files are found.

Table 3-2. Property filenames for different locales
Locale             Filename
Default locale     Menus.properties
Swedish            Menus_sv.properties
Spanish            Menus_es.properties
French             Menus_fr.properties
French-Canadian    Menus_fr_CA.properties

Locale names are two-letter ISO-639 language codes (lowercase), and normally abbreviate the language’s endonym (the name its speakers use for it); thus Swedish is sv for svenska, Spanish is es for español, etc. Locale variations are two-letter ISO country codes (uppercase), e.g., CA for Canada, US for the United States, SE for Sweden, ES for Spain, etc.
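You can ask the API itself which filenames it would consider for a given locale. This sketch uses two public methods of ResourceBundle.Control, getCandidateLocales() and toBundleName(), to list the candidates from most to least specific; the class and method names (BundleNamesSketch, candidateFiles) are my own. It shows only the naming scheme: the real getBundle() lookup may also try the platform's default locale before settling on the base file.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.ResourceBundle;

class BundleNamesSketch {

    /** Properties filenames considered for baseName in the given locale. */
    static List<String> candidateFiles(String baseName, Locale locale) {
        ResourceBundle.Control control = ResourceBundle.Control.getControl(
            ResourceBundle.Control.FORMAT_PROPERTIES);
        List<String> names = new ArrayList<>();
        for (Locale candidate : control.getCandidateLocales(baseName, locale)) {
            // toBundleName() appends _language and _COUNTRY as needed;
            // for the root locale it returns baseName unchanged.
            names.add(control.toBundleName(baseName, candidate) + ".properties");
        }
        return names;
    }

    public static void main(String[] args) {
        // prints: [Menus_fr_CA.properties, Menus_fr.properties, Menus.properties]
        System.out.println(candidateFiles("Menus", Locale.CANADA_FRENCH));
    }
}
```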

Setting the locale

On Windows, go into Regional Settings in the Control Panel. Changing this setting may entail a reboot, so exit any editor windows.

On Unix, set your LANG environment variable. For example, a Korn shell user in Mexico might have this line in her .profile:

export LANG=es_MX

On either system, for testing a different locale, you need only define the locale in the System Properties at runtime using the command-line option -D, as in:

java -Duser.language=es i18n.Browser

to run the Java program named Browser in package i18n in the Spanish locale.

You can get a list of the available locales with a call to Locale.getAvailableLocales().
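As a quick check of what your own Java runtime offers, this sketch prints the current default locale, counts the available ones, and tests for a particular language. The class name ListLocalesSketch and the helper hasLanguage() are invented for the example; the counts will vary from one Java installation to another.

```java
import java.util.Arrays;
import java.util.Locale;

class ListLocalesSketch {

    /** True if any installed locale uses the given two-letter language code. */
    static boolean hasLanguage(String lang) {
        return Arrays.stream(Locale.getAvailableLocales())
            .anyMatch(loc -> loc.getLanguage().equals(lang));
    }

    public static void main(String[] args) {
        System.out.println("Default locale: " + Locale.getDefault());
        System.out.println("Available locales: " + Locale.getAvailableLocales().length);
        System.out.println("Spanish available? " + hasLanguage("es"));
    }
}
```

Run it once normally and once with -Duser.language=es to see the default change.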

3.12 Using a Particular Locale

Problem

You want to use a locale other than the default in a particular operation.

Solution

Obtain a Locale by using a predefined instance or the Locale constructor. Optionally make it global to your application by using Locale.setDefault(newLocale).

Discussion

Classes that provide formatting services, such as DateTimeFormatter and NumberFormat, provide overloads so they can be called either with or without a Locale-related argument.

To obtain a Locale object, you can employ one of the predefined locale variables provided by the Locale class, or you can construct your own Locale object giving a language code and a country code:

Locale frLocale = Locale.FRANCE;    // predefined
Locale ukLocale = new Locale("en", "UK");    // English, UK version

These can then be used in the various formatting operations.

DateFormat frDateFormatter = DateFormat.getDateInstance(
		DateFormat.MEDIUM, frLocale);
DateFormat ukDateFormatter = DateFormat.getDateInstance(
		DateFormat.MEDIUM, ukLocale);

Either of these can be used to format a date or a number, as shown in class UseLocales:

package i18n;

import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;
import java.time.format.FormatStyle;
import java.util.Locale;

/** Use some locales; based on user's OS "settings"
 * choices or -Duser.language= or -Duser.region=.
 */
 */
public class UseLocales {
    public static void main(String[] args) {

        Locale frLocale = Locale.FRANCE;    // predefined
        Locale ukLocale = new Locale("en", "UK");    // English, UK version

        DateTimeFormatter defaultDateFormatter =
            DateTimeFormatter.ofLocalizedDateTime(
                FormatStyle.MEDIUM);
        DateTimeFormatter frDateFormatter =
            DateTimeFormatter.ofLocalizedDateTime(
                FormatStyle.MEDIUM).localizedBy(frLocale);
        DateTimeFormatter ukDateFormatter =
            DateTimeFormatter.ofLocalizedDateTime(
                FormatStyle.MEDIUM).localizedBy(ukLocale);

        LocalDateTime now = LocalDateTime.now();
        System.out.println("Default: " + ' ' +
            now.format(defaultDateFormatter));
        System.out.println(frLocale.getDisplayName() + ' ' +
            now.format(frDateFormatter));
        System.out.println(ukLocale.getDisplayName() + ' ' +
            now.format(ukDateFormatter));
    }
}

The program prints the locale name and formats the date in each of the locales:

$ java i18n.UseLocales
Default:  Oct 16, 2019, 4:41:45 PM
French (France) 16 oct. 2019 à 16:41:45
English (UK) Oct 16, 2019, 4:41:45 PM
$
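Numbers work the same way as dates. As a small sketch of the NumberFormat side of this (the class name NumberLocaleSketch and the helper format() are hypothetical), here is one value formatted for two of the predefined locales, which swap the roles of the comma and the period.

```java
import java.text.NumberFormat;
import java.util.Locale;

class NumberLocaleSketch {

    /** Format value using the number conventions of the given locale. */
    static String format(double value, Locale locale) {
        return NumberFormat.getNumberInstance(locale).format(value);
    }

    public static void main(String[] args) {
        System.out.println(format(1234.5, Locale.US));       // 1,234.5
        System.out.println(format(1234.5, Locale.GERMANY));  // 1.234,5
    }
}
```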

3.13 Creating a Resource Bundle

Problem

You need to create a resource bundle for use with I18N.

Solution

A resource bundle is simply a collection of names and values. You could write a java.util.ResourceBundle subclass, but it is easier to create textual Properties files (see Recipe 7.10) that you then load with ResourceBundle.getBundle(). The files can be created using any plain text editor. Keeping them in a text file format also allows user customization in desktop applications; a user whose language is not provided for, or who wishes to change the wording somewhat due to local variations in dialect, should be able to edit the file.

Note that the resource bundle text file should not have the same name as any of your Java classes. The reason is that the ResourceBundle constructs a class dynamically with the same name as the resource files.

Discussion

Here is a sample properties file for a few menu items:

# Default Menu properties
# The File Menu
file.label=File Menu
file.new.label=New File
file.new.key=N
file.save.label=Save
file.save.key=S

Creating the default properties file is usually not a problem, but creating properties files for other languages might be. Unless you are a large multinational corporation, you will probably not have the resources (pardon the pun) to create resource files in-house. If you are shipping commercial software, or using the web for global reach, you need to identify your target markets and understand which of these are most sensitive to wanting menus and the like in their own languages. Then, hire a professional translation service that has expertise in the required languages to prepare the files. Test them well before you ship, as you would any other part of your software.

If you need special characters, multiline text, or other complex entry, remember that a ResourceBundle is also a Properties file, so see the documentation for java.util.Properties.
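To experiment with the file format without setting up a classpath, you can feed properties text straight to a PropertyResourceBundle. This sketch is only an illustration: the class name BundleFromTextSketch is invented, the text is held in a string rather than a real Menus.properties file, and the file.open.label entry is a made-up example of a value continued onto a second line with a trailing backslash.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.PropertyResourceBundle;
import java.util.ResourceBundle;

class BundleFromTextSketch {

    /** Properties-format text, as it might appear in Menus.properties. */
    static final String TEXT = String.join("\n",
        "# Default Menu properties",
        "file.label=File Menu",
        "file.save.label=Save",
        "file.save.key=S",
        "file.open.label=Open \\",
        "    Existing File");

    /** Parse TEXT using the same rules as java.util.Properties. */
    static ResourceBundle load() {
        try {
            return new PropertyResourceBundle(new StringReader(TEXT));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        ResourceBundle rb = load();
        System.out.println(rb.getString("file.save.label"));  // Save
        // Leading blanks on the continuation line are dropped.
        System.out.println(rb.getString("file.open.label"));  // Open Existing File
    }
}
```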

3.14 Program: A Simple Text Formatter

This program is a very primitive text formatter, representative of what people used on most computing platforms before the rise of standalone graphics-based word processors, laser printers, and, eventually, desktop publishing and desktop office suites. It simply reads words from a file—previously created with a text editor—and outputs them until it reaches the right margin, when it calls println() to append a line ending. For example, here is an input file:

It's a nice
day, isn't it, Mr. Mxyzzptllxy?
I think we should
go for a walk.

Given the preceding as its input, the Fmt program prints the lines formatted neatly:

It's a nice day, isn't it, Mr. Mxyzzptllxy? I think we should go for a
walk.

As you can see, it fits the text we gave it to the margin and discards all the line breaks present in the original. Here’s the code:

public class Fmt {
    /** The maximum column width */
    public static final int COLWIDTH=72;
    /** The file that we read and format */
    final BufferedReader in;
    /** Where the output goes */
    PrintWriter out;

    /** If files present, format each one, else format the standard input. */
    public static void main(String[] av) throws IOException {
        if (av.length == 0)
            new Fmt(System.in).format();
        else for (String name : av) {
            new Fmt(name).format();
        }
    }

    public Fmt(BufferedReader inFile, PrintWriter outFile) {
        this.in = inFile;
        this.out = outFile;
    }

    public Fmt(PrintWriter out) {
        this(new BufferedReader(new InputStreamReader(System.in)), out);
    }

    /** Construct a Formatter given an open Reader */
    public Fmt(BufferedReader file) throws IOException {
        this(file, new PrintWriter(System.out));
    }

    /** Construct a Formatter given a filename */
    public Fmt(String fname) throws IOException {
        this(new BufferedReader(new FileReader(fname)));
    }

    /** Construct a Formatter given an open Stream */
    public Fmt(InputStream file) throws IOException {
        this(new BufferedReader(new InputStreamReader(file)));
    }

    /** Format the File contained in a constructed Fmt object */
    public void format() throws IOException {
        format(in.lines(), out);
    }

    /** Format a Stream of lines, e.g., bufReader.lines() */
    public static void format(Stream<String> s, PrintWriter out) {
        StringBuilder outBuf = new StringBuilder();
        s.forEachOrdered((line -> {
            if (line.length() == 0) {    // null line
                out.println(outBuf);    // end current line
                out.println();    // output blank line
                outBuf.setLength(0);
            } else {
                // otherwise it's text, so format it.
                StringTokenizer st = new StringTokenizer(line);
                while (st.hasMoreTokens()) {
                    String word = st.nextToken();

                    // If this word would go past the margin,
                    // first dump out anything previous.
                    if (outBuf.length() + word.length() > COLWIDTH) {
                        out.println(outBuf);
                        outBuf.setLength(0);
                    }
                    outBuf.append(word).append(' ');
                }
            }
        }));
        if (outBuf.length() > 0) {
            out.println(outBuf);
        } else {
            out.println();
        }
        // Flush, since the constructors wrap System.out in a buffered PrintWriter.
        out.flush();
    }


}
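The heart of the formatter is the greedy rule: keep adding words to the current line until the next one would cross the margin. Here is that rule on its own as a minimal sketch that returns the lines in a list instead of printing them. The names WrapSketch and wrap are mine, and unlike Fmt this version counts the separating space when checking the width and leaves no trailing space on each line.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

class WrapSketch {

    /** Greedily pack the words of text into lines of at most width columns. */
    static List<String> wrap(String text, int width) {
        List<String> lines = new ArrayList<>();
        StringBuilder cur = new StringBuilder();
        StringTokenizer st = new StringTokenizer(text);
        while (st.hasMoreTokens()) {
            String word = st.nextToken();
            // Would this word (plus one separating space) pass the margin?
            if (cur.length() > 0 && cur.length() + 1 + word.length() > width) {
                lines.add(cur.toString());
                cur.setLength(0);
            }
            if (cur.length() > 0) {
                cur.append(' ');
            }
            cur.append(word); // an over-long word gets a line to itself
        }
        if (cur.length() > 0) {
            lines.add(cur.toString());
        }
        return lines;
    }

    public static void main(String[] args) {
        // prints: [It's a, nice day,, isn't it]
        System.out.println(wrap("It's a nice\nday, isn't it", 10));
    }
}
```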

A slightly fancier version of this program, Fmt2, is in the online source for this book (the full javasrc repository). It uses “dot commands” (lines beginning with periods) to give limited control over the formatting. The family of “dot command” formatters includes Unix’s roff, nroff, troff, and groff, along with the programs called runoff on Digital Equipment systems. The original of them all is J. Saltzer’s runoff, which first appeared on Multics and from there made its way into various OSes. To save trees, I did not include Fmt2 here; it subclasses Fmt and overrides the format() method to include additional functionality.

3.15 Program: Soundex Name Comparisons

The difficulties in comparing (American-style) names inspired the U.S. Census Bureau to develop the Soundex algorithm in the early 1900s. Each of a given set of consonants maps to a particular number, the effect being to map similar-sounding names together, on the grounds that in those days many people were illiterate and could not spell their family names consistently. But it is still useful today—for example, in a company-wide telephone book application. The names Darwin and Derwin, for example, map to D650, and Darwent maps to D653, which puts it adjacent to D650. All of these are believed to be historical variants of the same name. Suppose we needed to sort lines containing these names together: if we could output the Soundex numbers at the beginning of each line, this would be easy. Here is a simple demonstration of the Soundex class:

public class SoundexSimple {

    /** main */
    public static void main(String[] args) {
        String[] names = {
            "Darwin, Ian",
            "Davidson, Greg",
            "Darwent, William",
            "Derwin, Daemon"
        };
        for (String name : names) {
            System.out.println(Soundex.soundex(name) + ' ' + name);
        }
    }
}

Let’s run it:

> javac -d . SoundexSimple.java
> java strings.SoundexSimple | sort
D132 Davidson, Greg
D650 Darwin, Ian
D650 Derwin, Daemon
D653 Darwent, William
>

As you can see, the Darwin-variant names (including Daemon Derwin5) all sort together and are distinct from the Davidson (and Davis, Davies, etc.) names that normally appear between Darwin and Derwin when using a simple alphabetic sort. The Soundex algorithm has done its work.

Here is the Soundex class itself—it uses Strings and StringBuilders to convert names into Soundex codes:

main/src/main/java/strings/Soundex.java

public class Soundex {

    static boolean debug = false;

    /* Implements the mapping
     * from: AEHIOUWYBFPVCGJKQSXZDTLMNR
     * to:   00000000111122222222334556
     */
    public static final char[] MAP = {
        //A  B   C   D   E   F   G   H   I   J   K   L   M
        '0','1','2','3','0','1','2','0','0','2','2','4','5',
        //N  O   P   Q   R   S   T   U   V   W   X   Y   Z
        '5','0','1','2','6','2','3','0','1','0','2','0','2'
    };

    /** Convert the given String to its Soundex code.
     * @return null If the given string can't be mapped to Soundex.
     */
    public static String soundex(String s) {

        // Algorithm works on uppercase (mainframe era).
        String t = s.toUpperCase();

        StringBuilder res = new StringBuilder();
        char c, prev = '?', prevOutput = '?';

        // Main loop: find up to 4 chars that map.
        for (int i=0; i<t.length() && res.length() < 4 &&
            (c = t.charAt(i)) != ','; i++) {

            // Check to see if the given character is alphabetic.
            // Text is already converted to uppercase. Algorithm
            // only handles ASCII letters, do NOT use Character.isLetter()!
            // Also, skip double letters.
            if (c>='A' && c<='Z' && c != prev) {
                prev = c;

                // First char is installed unchanged, for sorting.
                if (i==0) {
                    res.append(c);
                } else {
                    char m = MAP[c-'A'];
                    if (debug) {
                        System.out.println(c + " --> " + m);
                    }
                    if (m != '0' && m != prevOutput) {
                        res.append(m);
                        prevOutput = m;
                    }
                }
            }
        }
        if (res.length() == 0)
            return null;
        for (int i=res.length(); i<4; i++)
            res.append('0');
        return res.toString();
    }
}

There are apparently some nuances of the full Soundex algorithm that are not implemented by this application. A more complete test using JUnit (see Recipe 1.10) is also online as SoundexTest.java, in the src/tests/java/strings directory. The dedicated reader may use this to provoke failures of such nuances, and send a pull request with updated versions of the test and the code.

See Also

The Levenshtein string edit distance algorithm can be used for doing approximate string comparisons in a different fashion. You can find this in Apache Commons StringUtils. I show a non-Java (Perl) implementation of this algorithm in Recipe 18.5.
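For readers who want to see the idea without pulling in a library, here is a minimal sketch of the standard dynamic-programming form of Levenshtein distance, which counts the single-character insertions, deletions, and substitutions needed to turn one string into another. This is my own illustrative version (the names LevenshteinSketch and distance are made up) and not the Apache Commons implementation.

```java
class LevenshteinSketch {

    /** Minimum number of single-character edits that turn a into b. */
    static int distance(String a, String b) {
        // prev[j] = distance between the first i-1 chars of a and first j of b.
        int[] prev = new int[b.length() + 1];
        int[] cur = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // building b's prefix from nothing takes j insertions
        }
        for (int i = 1; i <= a.length(); i++) {
            cur[0] = i; // deleting i characters of a yields the empty string
            for (int j = 1; j <= b.length(); j++) {
                int cost = (a.charAt(i - 1) == b.charAt(j - 1)) ? 0 : 1;
                cur[j] = Math.min(
                    Math.min(cur[j - 1] + 1,   // insertion
                             prev[j] + 1),     // deletion
                    prev[j - 1] + cost);       // match or substitution
            }
            int[] tmp = prev; // reuse the two rows alternately
            prev = cur;
            cur = tmp;
        }
        return prev[b.length()];
    }

    public static void main(String[] args) {
        System.out.println(distance("Darwin", "Derwin"));   // 1
        System.out.println(distance("Darwin", "Darwent"));  // 2
        System.out.println(distance("kitten", "sitting"));  // 3
    }
}
```

Where Soundex gives each name a fixed code that you can sort on, an edit distance compares two specific strings, so it suits "did you mean" matching better than sorting.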

1 The two .equals() calls are “equivalent” with the exception that the first can throw a NullPointerException while the second cannot.

2 StringBuilder was added in Java 5. It is functionally equivalent to the older StringBuffer. We will delve into the details in Recipe 3.2.

3 Unless, perhaps, you’re as slow at updating personal records as I am.

4 Indeed, there are so many characters in Unicode that a fad has emerged of displaying your name upside down using characters that approximate upside-down versions of the Latin alphabet. Do a web search for “upside down unicode.”

5 In Unix terminology, a “daemon” is a server. The old English word has nothing to do with satanic “demons” but refers to a helper or assistant. Derwin Daemon was actually a character in Susannah Coleman’s “Source Wars” online comic strip, which long ago was online at a now-departed site called darby.daemonnews.org.
