Character strings are an inevitable part of just about any programming task. We use them for printing messages for the user; for referring to files on disk or other external media; and for people’s names, addresses, and affiliations. The uses of strings are many, almost without number (actually, if you need numbers, we’ll get to them in Chapter 5).
If you’re coming from a programming language like C, you’ll need to remember that String is a defined type (class) in Java; that is, a string is an object and therefore has methods. It is not an array of characters (though it contains one) and should not be thought of as an array. Operations like fileName.endsWith(".gif") and extension.equals(".gif") (and the equivalent ".gif".equals(extension)) are commonplace.1
Java old-timers should note that Java 11 and 12 added several new methods, including indent(int n), stripLeading() and stripTrailing(), Stream<String> lines(), isBlank(), and transform(). Most of these provide obvious functionality; the last one allows applying an instance of a “functional interface” (see Recipe 9.1) to a string and returning the result of that operation.
Although we haven’t discussed the details of the java.io package yet (we will, in Chapter 10), you need to be able to read text files for some of these programs. Even if you’re not familiar with java.io, you can probably see from the examples of reading text files that a BufferedReader allows you to read “chunks” of data, and that this class has a very convenient readLine() method.
Going the other way, System.out.println() is normally used to print strings or other values to the terminal or “standard output.” String concatenation is commonly used here, as in:
System.out.println("The answer is " + result);
One caveat with string concatenation is that the + operator is evaluated left to right, so if a number and a character come first in a chain of things being appended, they are added numerically before any string concatenation takes place. So don’t do as I did in this contrived example:
int result = ...;
System.out.println(result + '=' + " the answer.");
Given that result is an integer, then result + '=' (result added to the equals sign, which is of the numeric type char) is a valid numeric expression, which will result in a single value of type int. If the variable result has the value 42, and given that the character = in a Unicode (or ASCII) code chart has the value 61, this prints:
103 the answer.
The wrong value and no equals sign! Safer approaches include using parentheses, using double quotes around the equals sign, using a StringBuilder (see Recipe 3.2), or using String.format() (see Recipe 10.4). Of course in this simple example you could just move the = to be part of the string literal, but the example was chosen to illustrate the problem of arithmetic on char values being confused with string concatenation.
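The pitfall and the safer alternatives listed above can be compared side by side. The following is only an illustrative sketch; the class name ConcatPitfall and its method names are made up for this example and do not come from the book's source code.

```java
public class ConcatPitfall {

    /** The pitfall: int + char is numeric addition, performed before the String is appended. */
    static String pitfall(int result) {
        return result + '=' + " the answer.";
    }

    /** Parentheses force the char to be joined to the String first. */
    static String withParentheses(int result) {
        return result + ('=' + " the answer.");
    }

    /** A String literal "=" instead of a char literal '=' avoids the arithmetic. */
    static String withStringLiteral(int result) {
        return result + "=" + " the answer.";
    }

    /** String.format() never does arithmetic on its arguments. */
    static String withFormat(int result) {
        return String.format("%d= the answer.", result);
    }

    public static void main(String[] args) {
        System.out.println(pitfall(42));           // 42 + 61 is added first
        System.out.println(withParentheses(42));
        System.out.println(withStringLiteral(42));
        System.out.println(withFormat(42));
    }
}
```

Only the first method produces the surprising numeric result; the other three keep the value and the equals sign intact.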
I won’t show you how to sort an array of strings here; the more general notion of sorting a collection of objects will be taken up in Recipe 7.11.
Java 14 enables “text blocks” (as a preview feature; they became standard in Java 15), also known as multiline text strings. These are delimited with a set of three double quotes, the opening of which must have a newline after the quotes (which doesn’t become part of the string; the following newlines do):
String longText = """
    This is a long text String.""";
You want to break a string apart, either by indexing positions or by fixed token characters (e.g., break on spaces to get words).
For substrings, use the String object’s substring() method.
For tokenizing, construct a StringTokenizer around your string and call its methods hasMoreTokens() and nextToken().
Or, use regular expressions (see Chapter 4).
The substring() method constructs a new String object made up of a run of characters contained somewhere in the original string, the one whose substring() you called. The substring method is overloaded: both forms require a starting index (which is always zero-based). The one-argument form returns from startIndex to the end. The two-argument form takes an ending index (not a length, as in some languages), so that an index can be generated by the String methods indexOf() or lastIndexOf().
Note that the end index is one beyond the last character! Java adopts this “half open interval” (or inclusive start, exclusive end) policy fairly consistently; there are good practical reasons for adopting this approach, and some other languages do likewise.
public class SubStringDemo {
    public static void main(String[] av) {
        String a = "Java is great.";
        System.out.println(a);
        String b = a.substring(5);              // b is the String "is great."
        System.out.println(b);
        String c = a.substring(5, 7);           // c is the String "is"
        System.out.println(c);
        String d = a.substring(5, a.length());  // d is "is great."
        System.out.println(d);
    }
}
When run, this prints the following:
C:> java strings.SubStringDemo
Java is great.
is great.
is
is great.
C:>
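Because the end index is exclusive, a position returned by indexOf() or lastIndexOf() can be passed straight to substring() without any off-by-one adjustment. Here is a small sketch of that idea; the class SubstringByIndex and its helper methods are hypothetical names chosen for illustration.

```java
public class SubstringByIndex {

    /** Returns the text before the first occurrence of sep, or the whole string if sep is absent. */
    static String before(String s, char sep) {
        int end = s.indexOf(sep);       // index of the separator itself
        return end < 0 ? s : s.substring(0, end);   // exclusive end: separator is left out
    }

    /** Returns the text after the last dot, or an empty string if there is no dot. */
    static String extension(String fileName) {
        int dot = fileName.lastIndexOf('.');
        return dot < 0 ? "" : fileName.substring(dot + 1);
    }

    public static void main(String[] args) {
        System.out.println(before("Java is great.", ' '));
        System.out.println(extension("photo.final.gif"));
    }
}
```

Note that both helpers check for a negative index first, since indexOf() and lastIndexOf() return -1 when nothing is found and substring() would throw an exception for a negative argument.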
The easiest way is to use a regular expression; we’ll discuss these in Chapter 4, but for now, a string containing a space is a valid regular expression to match space characters, so you can most easily split a string into words like this:
for (String word : some_input_string.split(" ")) {
    System.out.println(word);
}
If you need to match multiple spaces, or spaces and tabs, use the string "\\s+".
If you want to split the lines of a comma-separated file, you can try the string "," or use one of several third-party libraries for CSV files.
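The two split() patterns just mentioned behave a little differently at the edges, which a short sketch can make concrete. The class and method names below (SplitDemo, words, fields) are invented for this example. Passing a negative limit to split() is a standard way to keep trailing empty fields, which the one-argument form silently drops.

```java
import java.util.Arrays;

public class SplitDemo {

    /** Splits on runs of whitespace (spaces, tabs, and so on), ignoring leading and trailing blanks. */
    static String[] words(String s) {
        return s.trim().split("\\s+");
    }

    /** Splits on single commas, keeping empty fields, including trailing ones. */
    static String[] fields(String s) {
        return s.split(",", -1);
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(words("  Hello \t World  of   Java ")));
        System.out.println(Arrays.toString(fields("a,,c,")));
        // The one-argument form drops the trailing empty field:
        System.out.println(Arrays.toString("a,,c,".split(",")));
    }
}
```

This simple comma split does not understand quoted fields, which is one reason to reach for a CSV library when the data is anything but trivial.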
Another method is to use StringTokenizer. The StringTokenizer methods hasMoreTokens() and nextToken() follow the Iterator design pattern (see Recipe 7.6), although the class does not implement the Iterator interface itself:
StrTokDemo.java
StringTokenizer st = new StringTokenizer("Hello World of Java");

while (st.hasMoreTokens())
    System.out.println("Token: " + st.nextToken());
StringTokenizer also implements the Enumeration interface (see Recipe 7.6), but if you use the methods thereof you need to cast the results to String.
A StringTokenizer normally breaks the String into tokens at what we would think of as “word boundaries” in European languages (by default, at whitespace characters: space, tab, newline, carriage return, and form feed). Sometimes you want to break at some other character. No problem. When you construct your StringTokenizer, in addition to passing in the string to be tokenized, pass in a second string that lists the “break characters.” For example:
StrTokDemo2.java
StringTokenizer st = new StringTokenizer("Hello, World|of|Java", ", |");

while (st.hasMoreElements())
    System.out.println("Token: " + st.nextElement());
It outputs the four words, each on a line by itself, with no punctuation.
But wait, there’s more! What if you are reading lines like:
FirstName|LastName|Company|PhoneNumber
and your dear old Aunt Begonia hasn’t been employed for the last 38 years? Her “Company” field will in all probability be blank.3 If you look very closely at the previous code example, you’ll see that it has two delimiters together (the comma and the space), but if you run it, there are no “extra” tokens; that is, the StringTokenizer normally discards adjacent consecutive delimiters. For cases like the phone list, where you need to preserve null fields, there is good news and bad news. The good news is that you can do it: you simply add a third argument of true when constructing the StringTokenizer, meaning that you wish to see the delimiters as tokens. The bad news is that you now get to see the delimiters as tokens, so you have to do the arithmetic yourself. Want to see it? Run this program:
StrTokDemo3.java
StringTokenizer st =
    new StringTokenizer("Hello, World|of|Java", ", |", true);

while (st.hasMoreElements())
    System.out.println("Token: " + st.nextElement());
and you get this output:
C:>java strings.StrTokDemo3
Token: Hello
Token: ,
Token:  
Token: World
Token: |
Token: of
Token: |
Token: Java
C:>
This isn’t how you’d like StringTokenizer to behave, ideally, but it is serviceable enough most of the time. Example 3-1 processes the delimiter tokens itself, so that consecutive delimiters yield null fields, and returns the results as an array of Strings.
import java.util.StringTokenizer;

public class StrTokDemo4 {
    public final static int MAXFIELDS = 5;
    public final static String DELIM = "|";

    /** Processes one String, returns it as an array of Strings */
    public static String[] process(String line) {
        String[] results = new String[MAXFIELDS];

        // Unless you ask StringTokenizer to give you the tokens,
        // it silently discards multiple null tokens.
        StringTokenizer st = new StringTokenizer(line, DELIM, true);

        int i = 0;
        // stuff each token into the current slot in the array.
        while (st.hasMoreTokens()) {
            String s = st.nextToken();
            if (s.equals(DELIM)) {
                if (i++ >= MAXFIELDS)
                    // This is messy: See StrTokDemo4b which uses
                    // a List to allow any number of fields.
                    throw new IllegalArgumentException("Input line " +
                        line + " has too many fields");
                continue;
            }
            results[i] = s;
        }
        return results;
    }

    public static void printResults(String input, String[] outputs) {
        System.out.println("Input: " + input);
        // Use an index so the field number is printed, not the value twice.
        for (int i = 0; i < outputs.length; i++)
            System.out.println("Output " + i + " was: " + outputs[i]);
    }

    public static void main(String[] a) {
        printResults("A|B|C|D", process("A|B|C|D"));
        printResults("A||C|D", process("A||C|D"));
        printResults("A|||D|E", process("A|||D|E"));
    }
}
When you run this, you will see that A is always in Field 1, B (if present) is in Field 2, and so on. In other words, the null fields are being handled properly:
Input: A|B|C|D
Output 0 was: A
Output 1 was: B
Output 2 was: C
Output 3 was: D
Output 4 was: null
Input: A||C|D
Output 0 was: A
Output 1 was: null
Output 2 was: C
Output 3 was: D
Output 4 was: null
Input: A|||D|E
Output 0 was: A
Output 1 was: null
Output 2 was: null
Output 3 was: D
Output 4 was: E
Many occurrences of StringTokenizer may be replaced with regular expressions (see Chapter 4) with considerably more flexibility. For example, to extract all the numbers from a String, you can use this code:
Matcher tokenizer = Pattern.compile("\\d+").matcher(inputString);
while (tokenizer.find()) {
    String courseString = tokenizer.group(0);
    int courseNumber = Integer.parseInt(courseString);
    ...
This allows user input to be more flexible than you could easily handle with a StringTokenizer. Assuming that the numbers represent course numbers at some educational institution, the inputs “471,472,570” or “Courses 471 and 472, 570” or just “471 472 570” should all give the same results.
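To check the claim that all three input styles give the same results, the fragment can be wrapped in a complete class. This is a sketch only: the class name CourseNumbers and the method extract() are hypothetical, and collecting the numbers into a List is an assumption about what the elided part of the loop might do.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CourseNumbers {

    // One or more consecutive digits; compiled once and reused.
    private static final Pattern DIGITS = Pattern.compile("\\d+");

    /** Returns every run of digits found in the input, in order, as integers. */
    static List<Integer> extract(String inputString) {
        List<Integer> results = new ArrayList<>();
        Matcher tokenizer = DIGITS.matcher(inputString);
        while (tokenizer.find()) {
            results.add(Integer.parseInt(tokenizer.group(0)));
        }
        return results;
    }

    public static void main(String[] args) {
        System.out.println(extract("471,472,570"));
        System.out.println(extract("Courses 471 and 472, 570"));
        System.out.println(extract("471 472 570"));
    }
}
```

Everything that is not a digit acts as a separator here, so punctuation and filler words are ignored without having to list them as delimiters.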
You need to put some String pieces (back) together.
Use string concatenation: the + operator. The compiler implicitly constructs a StringBuilder for you and uses its append() methods (unless all the string parts are known at compile time).
Better yet, construct and use a StringBuilder yourself.
An object of one of the StringBuilder classes basically represents a collection of characters. It is similar to a String object (String and StringBuilder have several methods that are forced to be identical by their implementation of the CharSequence interface). However, as mentioned, Strings are immutable; StringBuilders are mutable and designed for, well, building Strings. You typically construct a StringBuilder, invoke the methods needed to get the character sequence just the way you want it, and then call toString() to generate a String representing the same character sequence for use in most of the Java API, which deals in Strings.
StringBuffer is historical; it’s been around since the beginning of time. Some of its methods are synchronized (see Recipe 16.5), which involves unneeded overhead in a single-threaded context. In Java 5, this class was “split” into StringBuffer (which is synchronized) and StringBuilder (which is not synchronized); StringBuilder is thus faster and preferable for single-threaded use. Another new class, AbstractStringBuilder, is the parent of both. In the following discussion, I’ll use “the StringBuilder classes” to refer to all three because they mostly have the same methods.
The book’s example code provides a StringBuilderDemo and a StringBufferDemo. Except for the fact that StringBuilder is not threadsafe, these API classes are identical and can be used interchangeably, so my two demo programs are almost identical except that each one uses the appropriate builder class.
The StringBuilder classes have a variety of methods for inserting, replacing, and otherwise modifying a given StringBuilder. Conveniently, the append() methods return a reference to the StringBuilder itself, so “stacked” statements like .append(…).append(…) are fairly common. This style of coding is referred to as a “fluent API” because it reads smoothly, like prose from a native speaker of a human language. You might even see this style of coding in a toString() method, for example.
Example 3-2 shows three ways of concatenating strings.
public class StringBuilderDemo {

    public static void main(String[] argv) {

        String s1 = "Hello" + ", " + "World";
        System.out.println(s1);

        // Build a StringBuilder, and append some things to it.
        StringBuilder sb2 = new StringBuilder();
        sb2.append("Hello");
        sb2.append(',');
        sb2.append(' ');
        sb2.append("World");

        // Get the StringBuilder's value as a String, and print it.
        String s2 = sb2.toString();
        System.out.println(s2);

        // Now do the above all over again, but in a more
        // concise (and typical "real-world" Java) fashion.

        System.out.println(
            new StringBuilder()
                .append("Hello")
                .append(',')
                .append(' ')
                .append("World"));
    }
}
In fact, all the methods that modify more than one character of a StringBuilder’s contents (i.e., append(), delete(), deleteCharAt(), insert(), replace(), and reverse()) return a reference to the builder object to facilitate this “fluent API” style of coding.
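As a sketch of how this chaining looks in practice, here is a small class whose toString() is written in the fluent style, plus a helper that mixes insert() and append() in one chain. The class Point3D and the method greet() are hypothetical and exist only for this illustration.

```java
public class Point3D {
    private final int x, y, z;

    public Point3D(int x, int y, int z) {
        this.x = x;
        this.y = y;
        this.z = z;
    }

    /** A toString() written as one fluent chain of append() calls. */
    @Override
    public String toString() {
        return new StringBuilder("Point3D[")
            .append(x).append(", ")
            .append(y).append(", ")
            .append(z).append(']')
            .toString();
    }

    /** insert() and append() both return the builder, so they can be mixed in one chain. */
    static String greet(String name) {
        return new StringBuilder(name)
            .insert(0, "Hello, ")
            .append('!')
            .toString();
    }

    public static void main(String[] args) {
        System.out.println(new Point3D(1, 2, 3));
        System.out.println(greet("World"));
    }
}
```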
As another example of using a StringBuilder, consider the need to convert a list of items into a comma-separated list, while avoiding getting an extra comma after the last element of the list. This can be done using a StringBuilder, although in Java 8+ there is a static String method to do the same. Code for both is shown in Example 3-3.
System.out.println("Split using String.split; joined using 1.8 String join");
System.out.println(String.join(", ", SAMPLE_STRING.split(" ")));

System.out.println("Split using String.split; joined using StringBuilder");
StringBuilder sb1 = new StringBuilder();
for (String word : SAMPLE_STRING.split(" ")) {
    if (sb1.length() > 0) {
        sb1.append(", ");
    }
    sb1.append(word);
}
System.out.println(sb1);

System.out.println("Split using StringTokenizer; joined using StringBuilder");
StringTokenizer st = new StringTokenizer(SAMPLE_STRING);
StringBuilder sb2 = new StringBuilder();
while (st.hasMoreElements()) {
    sb2.append(st.nextToken());
    if (st.hasMoreElements()) {
        sb2.append(", ");
    }
}
System.out.println(sb2);
The first method is clearly the most compact; the static String.join() makes short work of this task. The next method uses the StringBuilder.length() method, so it will only work correctly when you are starting with an empty StringBuilder. The third method relies on calling the informational method hasMoreElements() in the Enumeration (or hasNext() in an Iterator, as discussed in Recipe 7.6) more than once on each element. An alternative method, particularly when you aren’t starting with an empty builder, would be to use a boolean flag variable to track whether you’re at the beginning of the list.
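The boolean-flag alternative might look like the following. This is a minimal sketch, not code from the book's repository; the names JoinWithFlag and appendJoined are assumptions made for the example.

```java
import java.util.Arrays;

public class JoinWithFlag {

    /**
     * Appends the items to sb, separated by sep.
     * Works even if sb already has content, because it tracks
     * "first item" with a flag rather than with sb.length().
     */
    static StringBuilder appendJoined(StringBuilder sb, Iterable<String> items, String sep) {
        boolean first = true;
        for (String item : items) {
            if (!first) {
                sb.append(sep);     // separator goes before every item except the first
            }
            sb.append(item);
            first = false;
        }
        return sb;
    }

    public static void main(String[] args) {
        StringBuilder sb = new StringBuilder("Words: ");
        appendJoined(sb, Arrays.asList("Hello", "World", "of", "Java"), ", ");
        System.out.println(sb);
    }
}
```

Because the separator is written before each item rather than after it, there is never a trailing separator to remove.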
You want to process the contents of a string, one character at a time.
Use a for loop and the String’s charAt() or codePointAt() method. Or a “for each” loop and the String’s toCharArray method.
A string’s charAt() method retrieves a given character by index number (starting at zero) from within the String object. Since Unicode has had to expand beyond 16 bits, not all Unicode characters can fit into a Java char variable. There is thus an analogous codePointAt() method, whose return type is int. To process all the characters in a String, one after another, use a for loop ranging from zero to String.length()-1. Here we process all the characters in a String:
main/src/main/java/strings/strings/StrCharAt.java
public class StrCharAt {
    public static void main(String[] av) {
        String a = "A quick bronze fox";
        for (int i = 0; i < a.length(); i++) { // no forEach, need the index
            String message = String.format(
                "charAt is '%c', codePointAt is %3d, casted it's '%c'",
                a.charAt(i),
                a.codePointAt(i),
                (char) a.codePointAt(i));
            System.out.println(message);
        }
    }
}
Given that the “for each” loop has been in the language for ages, you might be excused for expecting to be able to write something like for (char ch : myString) {…}. Unfortunately, this does not work. But you can use myString.toCharArray() as in the following:
public class ForEachChar {
    public static void main(String[] args) {
        String mesg = "Hello world";

        // Does not compile, Strings are not iterable
        // for (char ch : mesg) {
        //     System.out.println(ch);
        // }

        System.out.println("Using toCharArray:");
        for (char ch : mesg.toCharArray()) {
            System.out.println(ch);
        }

        System.out.println("Using Streams:");
        mesg.chars().forEach(c -> System.out.println((char) c));
    }
}
A “checksum” is a numeric quantity representing and confirming the contents of a file. If you transmit the checksum of a file separately from the contents, a recipient can checksum the file (assuming the algorithm is known) and verify that the file was received intact. Example 3-4 shows the simplest possible checksum, computed just by adding the numeric values of each character. Note that on files, it does not include the values of the newline characters; in order to fix this, retrieve System.getProperty("line.separator") and add its character value(s) into the sum at the end of each line. Or give up on line mode and read the file a character at a time.
    /** CheckSum one text file, given an open BufferedReader.
     * Checksum does not include line endings, so will give the
     * same value for given text on any platform. Do not use
     * on binary files!
     */
    public static int process(BufferedReader is) {
        int sum = 0;
        try {
            String inputLine;

            while ((inputLine = is.readLine()) != null) {
                for (char c : inputLine.toCharArray()) {
                    sum += c;
                }
            }
        } catch (IOException e) {
            throw new RuntimeException("IOException: " + e);
        }
        return sum;
    }
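The other option mentioned earlier, giving up on line mode and reading a character at a time, could be sketched as follows. Unlike Example 3-4, this variant does include line-ending characters in the sum, so its result depends on the platform's line endings. The class name CharChecksum and the method checksum() are hypothetical.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

public class CharChecksum {

    /** Sums the numeric value of every character read, line endings included. */
    static int checksum(Reader in) {
        int sum = 0;
        try {
            int c;
            while ((c = in.read()) != -1) {   // read() returns -1 at end of input
                sum += c;
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sum;
    }

    public static void main(String[] args) {
        // 'a' (97) + 'b' (98) + '\n' (10)
        System.out.println(checksum(new StringReader("ab\n")));
    }
}
```

For real files you would normally wrap the Reader in a BufferedReader so that single-character reads are not slow.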
You want to align strings to the left, right, or center.
Do the math yourself, and use substring (see Recipe 3.1) and a StringBuilder (see Recipe 3.2). Or, use my StringAlign class, which is based on the java.text.Format class.
For left or right alignment, use String.format().
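For the String.format() route, the width goes into the format specifier, and a leading minus sign selects left alignment. A minimal sketch follows; the class FormatAlign and its two helpers are names invented for this example. Note that, unlike a truncating formatter, %s never shortens a string that is longer than the requested width.

```java
public class FormatAlign {

    /** Pads s with trailing spaces to the given width (left-aligned). */
    static String left(String s, int width) {
        return String.format("%-" + width + "s", s);
    }

    /** Pads s with leading spaces to the given width (right-aligned). */
    static String right(String s, int width) {
        return String.format("%" + width + "s", s);
    }

    public static void main(String[] args) {
        System.out.println("[" + left("abc", 6) + "]");
        System.out.println("[" + right("abc", 6) + "]");
    }
}
```

The width must be at least 1 here, because a width of zero is not a valid format specifier.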
Centering and aligning text comes up fairly often. Suppose you want to print a simple report with centered page numbers. There doesn’t seem to be anything in the standard API that will do the job fully for you. But I have written a class called StringAlign that will. Here’s how you might use it:
public class StringAlignSimple {

    public static void main(String[] args) {
        // Construct a "formatter" to center strings.
        StringAlign formatter = new StringAlign(70, StringAlign.Justify.CENTER);
        // Try it out, for page "i"
        System.out.println(formatter.format("- i -"));
        // Try it out, for page 4. Since this formatter is
        // optimized for Strings, not specifically for page numbers,
        // we have to convert the number to a String
        System.out.println(formatter.format(Integer.toString(4)));
    }
}
If you compile and run this class, it prints the two demonstration page numbers centered, as shown:
> javac -d . StringAlignSimple.java
> java strings.StringAlignSimple
                                - i -
                                  4
>
Example 3-5 is the code for the StringAlign class. Note that this class extends the class Format in the package java.text. There is a series of Format classes that all have at least one method called format(). It is thus in a family with numerous other formatters, such as DateFormat, NumberFormat, and others, that we’ll take a look at in upcoming chapters.
import java.text.FieldPosition;
import java.text.Format;
import java.text.ParsePosition;

public class StringAlign extends Format {

    private static final long serialVersionUID = 1L;

    public enum Justify {
        /* Constant for left justification. */
        LEFT,
        /* Constant for centering. */
        CENTER,
        /** Constant for right-justified Strings. */
        RIGHT,
    }

    /** Current justification */
    private Justify just;
    /** Current max length */
    private int maxChars;

    /** Construct a StringAlign formatter; length and alignment are
     * passed to the Constructor instead of each format() call as the
     * expected common use is in repetitive formatting e.g., page numbers.
     * @param maxChars - the maximum length of the output
     * @param just - one of the enum values LEFT, CENTER or RIGHT
     */
    public StringAlign(int maxChars, Justify just) {
        switch (just) {
        case LEFT:
        case CENTER:
        case RIGHT:
            this.just = just;
            break;
        default:
            throw new IllegalArgumentException("invalid justification arg.");
        }
        if (maxChars < 0) {
            throw new IllegalArgumentException("maxChars must not be negative.");
        }
        this.maxChars = maxChars;
    }

    /** Format a String.
     * @param input - the string to be aligned.
     * @param where - the StringBuffer to append it to.
     * @param ignore - a FieldPosition (may be null, not used but
     * specified by the general contract of Format).
     */
    @Override
    public StringBuffer format(
        Object input, StringBuffer where, FieldPosition ignore) {

        String s = input.toString();
        // Truncate to maxChars if the input is too long.
        String wanted = s.substring(0, Math.min(s.length(), maxChars));

        // Get the spaces in the right place.
        switch (just) {
            case RIGHT:
                pad(where, maxChars - wanted.length());
                where.append(wanted);
                break;
            case CENTER:
                int toAdd = maxChars - wanted.length();
                pad(where, toAdd / 2);
                where.append(wanted);
                pad(where, toAdd - toAdd / 2);
                break;
            case LEFT:
                where.append(wanted);
                pad(where, maxChars - wanted.length());
                break;
        }
        return where;
    }

    protected final void pad(StringBuffer to, int howMany) {
        for (int i = 0; i < howMany; i++)
            to.append(' ');
    }

    /** Convenience Routine */
    String format(String s) {
        return format(s, new StringBuffer(), null).toString();
    }

    /** ParseObject is required, but not useful here. */
    public Object parseObject(String source, ParsePosition pos) {
        return source;
    }
}
Java 12 introduced a new method public String indent(int n) which prepends n spaces to each line of the string, which is treated as a sequence of lines with line separators. This works well in conjunction with the Java 11 Stream<String> lines() method, e.g., for the case where a series of lines, conveniently already stored in a single string, needs the same indent (Streams, and the “::” notation, are explained in Recipe 9.1).
jshell> "abc\ndef".indent(30).lines().forEach(System.out::println);
                              abc
                              def

jshell> "abc\ndef".indent(30).indent(-10).lines().forEach(System.out::println);
                    abc
                    def

jshell>
The alignment of numeric columns is considered in Chapter 5.
You want to convert between Unicode characters and Strings.
Use Java char or String datatypes to deal with characters; these intrinsically support Unicode. Print characters as integers to display their raw value if needed.
Unicode is an international standard that aims to represent all known characters used by people in their various languages. Though the original ASCII character set is a subset, Unicode is huge. At the time Java was created, Unicode was a 16-bit character set, so it seemed natural to make Java char values be 16 bits in width, and for years a char could hold any Unicode character. However, over time, Unicode has grown, to the point that it now allows for over a million “code points” or characters, more than the 65,536 that could be represented in 16 bits.4 Not all possible 16-bit values were defined as characters in UCS-2, the 16-bit version of Unicode originally used in Java. A few were reserved as “escape characters,” which allows for multicharacter-length mappings to less common characters.
Fortunately, there is a go-between standard, called UTF-16 (16-bit Unicode Transformation Format). As the String class documentation puts it:
A String represents a string in the UTF-16 format in which supplementary characters are represented by surrogate pairs (see the section Unicode Character Representations in the Character class for more information). Index values refer to char code units, so a supplementary character uses two positions in a String.
The String class provides methods for dealing with Unicode code points (i.e., characters), in addition to those for dealing with Unicode code units (i.e., char values).
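The difference between code units and code points is easiest to see with a string that contains one supplementary character. The sketch below uses U+1F600, written as its surrogate pair so the source file stays plain ASCII; the class name SurrogateDemo and its helper methods are made up for this illustration.

```java
public class SurrogateDemo {

    // "A", then U+1F600 (outside the 16-bit range, stored as a surrogate pair), then "B"
    static final String SAMPLE = "A\uD83D\uDE00B";

    /** Number of char code units, which is what length() reports. */
    static int codeUnits(String s) {
        return s.length();
    }

    /** Number of actual Unicode code points. */
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    /** Lists each code point in U+XXXX notation, stepping over both halves of a surrogate pair. */
    static String describe(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("U+%04X", cp));
            i += Character.charCount(cp);   // 1 for most characters, 2 for supplementary ones
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(codeUnits(SAMPLE) + " code units, "
            + codePoints(SAMPLE) + " code points: " + describe(SAMPLE));
    }
}
```

Stepping the index by Character.charCount() rather than by one is what keeps the loop from visiting the second half of a surrogate pair as if it were a character of its own.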
The charAt() method of String returns the char value for the character at the specified offset. The StringBuilder append() method has a form that accepts a char. Because char is an integer type, you can even do arithmetic on chars, though this is not needed as frequently as in, say, C. Nor is it often recommended, because the Character class provides the methods for which these operations were normally used in languages such as C. Here is a program that uses arithmetic on chars to control a loop, and also appends the characters into a StringBuilder (see Recipe 3.2):
// UnicodeChars.java
StringBuilder b = new StringBuilder();
for (char c = 'a'; c < 'd'; c++) {
    b.append(c);
}
b.append('\u00a5');     // Japanese Yen symbol
b.append('\u01FC');     // Roman AE with acute accent
b.append('\u0391');     // GREEK Capital Alpha
b.append('\u03A9');     // GREEK Capital Omega

for (int i = 0; i < b.length(); i++) {
    System.out.printf(
        "Character #%d (%04x) is %c%n",
        i, (int) b.charAt(i), b.charAt(i));
}
System.out.println("Accumulated characters are " + b);
When you run it, the expected results are printed for the ASCII characters. On Unix and Mac systems, the default fonts don’t include all the additional characters, so they are either omitted or mapped to irregular characters:
$ java -cp target/classes strings.UnicodeChars
Character #0 (0061) is a
Character #1 (0062) is b
Character #2 (0063) is c
Character #3 (00a5) is ¥
Character #4 (01fc) is Ǽ
Character #5 (0391) is Α
Character #6 (03a9) is Ω
Accumulated characters are abc¥ǼΑΩ
$
The Windows system used to try this doesn’t have most of those characters either, but at least it prints the ones it knows are lacking as question marks (Windows system fonts are more homogeneous than those of the various Unix systems, so it is easier to know what won’t work). On the other hand, it tries to print the Yen sign as a Spanish capital Enye (N with a ~ over it).
Character #0 is a
Character #1 is b
Character #2 is c
Character #3 is ¥
Character #4 is ?
Character #5 is ?
Character #6 is ?
Accumulated characters are abc¥___
where the “_” characters are unprintable characters, which may appear as a question mark (“?”).
The Unicode program in this book’s online source displays any 256-character section of the Unicode character set. You can download documentation listing every character in the Unicode character set from the Unicode Consortium.
You wish to reverse a string, a character, or a word at a time.
You can reverse a string by character easily, using a StringBuilder. There are several ways to reverse a string a word at a time. One natural way is to use a StringTokenizer and a stack. Stack is a class (defined in java.util; see Recipe 7.16) that implements an easy-to-use last-in, first-out (LIFO) stack of objects.
To reverse the characters in a string, use the StringBuilder reverse() method:
StringRevChar.java
String sh = "FCGDAEB";
System.out.println(sh + " -> " + new StringBuilder(sh).reverse());
The letters in this example list the order of the sharps in the key signatures of Western music; in reverse, it lists the order of flats. Alternatively, of course, you could reverse the characters yourself, using character-at-a-time mode (see Recipe 3.3).
A popular mnemonic, or memory aid, to help music students remember the order of sharps and flats consists of one word for each sharp instead of just one letter. Let’s reverse this one word at a time. Example 3-6 adds each one to a Stack (see Recipe 7.16), then processes the whole lot in LIFO order, which reverses the order.
String s = "Father Charles Goes Down And Ends Battle";

// Put it in the stack frontwards
Stack<String> myStack = new Stack<>();
StringTokenizer st = new StringTokenizer(s);
while (st.hasMoreTokens()) {
    myStack.push(st.nextToken());
}

// Print the stack backwards
System.out.print('"' + s + '"' + " backwards by word is: \"");
while (!myStack.empty()) {
    System.out.print(myStack.pop());
    System.out.print(' ');    // inter-word spacing
}
System.out.println('"');
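Since there are several ways to reverse a string a word at a time, here is one more for comparison, using String.split(), Collections.reverse(), and String.join() in place of the StringTokenizer and Stack. It is a sketch under the assumption that words are separated by whitespace; the class name ReverseWords is hypothetical.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

public class ReverseWords {

    /** Returns the words of s in reverse order, separated by single spaces. */
    static String reverseWords(String s) {
        // Arrays.asList gives a fixed-size List view of the array,
        // which is enough for Collections.reverse to swap elements in place.
        List<String> words = Arrays.asList(s.trim().split("\\s+"));
        Collections.reverse(words);
        return String.join(" ", words);
    }

    public static void main(String[] args) {
        System.out.println(reverseWords("Father Charles Goes Down And Ends Battle"));
    }
}
```

Unlike the stack version, this one produces no trailing space after the last word, because String.join() only puts the separator between elements.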
You need to convert space characters to tab characters in a file, or vice versa. You might want to replace spaces with tabs to save space on disk, or go the other way to deal with a device or program that can’t handle tabs.
Use my Tabs class or its subclass EnTab.
Because programs that deal with tabbed text or data expect tab stops to be at fixed positions, you cannot use a typical text editor to replace tabs with spaces or vice versa.
Example 3-7 is a listing of EnTab, complete with a sample main program. The program works a line at a time. For each character on the line, if the character is a space, we see if we can coalesce it with previous spaces to output a single tab character. This program depends on the Tabs class, which we’ll come to shortly. The Tabs class is used to decide which column positions represent tab stops and which do not.
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.PrintStream;
import java.io.PrintWriter;
import java.util.logging.Logger; // assumed: logger.info() matches java.util.logging

public class EnTab {

    private static Logger logger = Logger.getLogger(EnTab.class.getSimpleName());

    /** The Tabs (tab logic handler) */
    protected Tabs tabs;

    /**
     * Delegate tab spacing information to tabs.
     */
    public int getTabSpacing() {
        return tabs.getTabSpacing();
    }

    /**
     * Main program: just create an EnTab object, and pass the standard input
     * or the named file(s) through it.
     */
    public static void main(String[] argv) throws IOException {
        EnTab et = new EnTab(8);
        if (argv.length == 0) // do standard input
            et.entab(
                new BufferedReader(new InputStreamReader(System.in)),
                System.out);
        else
            for (String fileName : argv) { // do each file
                et.entab(
                    new BufferedReader(new FileReader(fileName)),
                    System.out);
            }
    }

    /**
     * Constructor: just save the tab values.
     * @param n The number of spaces each tab is to replace.
     */
    public EnTab(int n) {
        tabs = new Tabs(n);
    }

    public EnTab() {
        tabs = new Tabs();
    }

    /**
     * entab: process one file, replacing blanks with tabs.
     * @param is A BufferedReader opened to the file to be read.
     * @param out a PrintWriter to send the output to.
     */
    public void entab(BufferedReader is, PrintWriter out) throws IOException {

        // main loop: process entire file one line at a time.
        is.lines().forEach(line -> {
            out.println(entabLine(line));
        });
        // The PrintWriter may be buffering; make sure the output appears.
        out.flush();
    }

    /**
     * entab: process one file, replacing blanks with tabs.
     *
     * @param is A BufferedReader opened to the file to be read.
     * @param out A PrintStream to write the output to.
     */
    public void entab(BufferedReader is, PrintStream out) throws IOException {
        entab(is, new PrintWriter(out));
    }

    /**
     * entabLine: process one line, replacing blanks with tabs.
     * @param line the string to be processed
     */
    public String entabLine(String line) {
        int N = line.length(), outCol = 0;
        StringBuilder sb = new StringBuilder();
        char ch;
        int consumedSpaces = 0;

        for (int inCol = 0; inCol < N; inCol++) { // Cannot use foreach here
            ch = line.charAt(inCol);
            // If we get a space, consume it, don't output it.
            // If this takes us to a tab stop, output a tab character.
            if (ch == ' ') {
                logger.info("Got space at " + inCol);
                if (tabs.isTabStop(inCol)) {
                    logger.info("Got a Tab Stop " + inCol);
                    sb.append('\t');
                    outCol += consumedSpaces;
                    consumedSpaces = 0;
                } else {
                    consumedSpaces++;
                }
                continue;
            }

            // We're at a non-space; if we're just past a tab stop, we need
            // to put the "leftover" spaces back out, since we consumed
            // them above.
            while (inCol - 1 > outCol) {
                logger.info("Padding space at " + inCol);
                sb.append(' ');
                outCol++;
            }

            // Now we have a plain character to output.
            sb.append(ch);
            outCol++;
        }
        // If line ended with trailing (or only!) spaces, preserve them.
        for (int i = 0; i < consumedSpaces; i++) {
            logger.info("Padding space at end # " + i);
            sb.append(' ');
        }
        return sb.toString();
    }
}
This code was patterned after a program in Kernighan and Plauger’s classic work, Software Tools. While their version was in a language called RatFor (Rational Fortran), my version has since been through several translations. Their version actually worked one character at a time, and for a long time I tried to preserve this overall structure. Eventually, I rewrote it to be a line-at-a-time program.
The program that goes in the opposite direction, taking tabs out rather than putting them in, is the DeTab class shown in Example 3-8; only the core methods are shown.
public class DeTab {
    Tabs ts;

    public static void main(String[] argv) throws IOException {
        DeTab dt = new DeTab(8);
        PrintWriter out = new PrintWriter(System.out);
        dt.detab(new BufferedReader(new InputStreamReader(System.in)), out);
        out.flush(); // PrintWriter buffers its output; flush so nothing is lost
    }

    public DeTab(int n) {
        ts = new Tabs(n);
    }

    public DeTab() {
        ts = new Tabs();
    }

    /** detab one file (replace tabs with spaces)
     * @param is - the file to be processed
     * @param out - the updated file
     */
    public void detab(BufferedReader is, PrintWriter out) throws IOException {
        is.lines().forEach(line -> {
            out.println(detabLine(line));
        });
    }

    /** detab one line (replace tabs with spaces)
     * @param line - the line to be processed
     * @return the updated line
     */
    public String detabLine(String line) {
        char c;
        int col;
        StringBuilder sb = new StringBuilder();
        col = 0;
        for (int i = 0; i < line.length(); i++) {
            // Either ordinary character or tab.
            if ((c = line.charAt(i)) != '\t') {
                sb.append(c); // Ordinary
                ++col;
                continue;
            }
            do { // Tab, expand it, must put >=1 space
                sb.append(' ');
            } while (!ts.isTabStop(++col));
        }
        return sb.toString();
    }
}
The Tabs class provides two methods: getTabSpacing() and isTabStop(). Example 3-9 is the source for the Tabs class.
public class Tabs {
    /** tabs every so often */
    public final static int DEFTABSPACE = 8;
    /** the current tab stop setting. */
    protected int tabSpace = DEFTABSPACE;
    /** The longest line that we initially set tabs for. */
    public final static int MAXLINE = 255;

    /** Construct a Tabs object with a given tab stop setting */
    public Tabs(int n) {
        if (n <= 0) {
            n = 1;
        }
        tabSpace = n;
    }

    /** Construct a Tabs object with the default tab stop setting */
    public Tabs() {
        this(DEFTABSPACE);
    }

    /**
     * @return Returns the tabSpace.
     */
    public int getTabSpacing() {
        return tabSpace;
    }

    /** isTabStop - returns true if given column is a tab stop.
     * @param col - the current column number
     */
    public boolean isTabStop(int col) {
        if (col <= 0)
            return false;
        return (col + 1) % tabSpace == 0;
    }
}
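As a quick check of the arithmetic in isTabStop(), the following sketch lists the tab-stop columns for two spacings. It copies the predicate into a standalone class so that it runs on its own; TabStopDemo and tabStops() are names invented for this illustration and are not part of the book's code.

```java
import java.util.ArrayList;
import java.util.List;

class TabStopDemo {
    /** Same test as Tabs.isTabStop(), with the spacing passed in. */
    static boolean isTabStop(int col, int tabSpace) {
        if (col <= 0)
            return false;
        return (col + 1) % tabSpace == 0;
    }

    /** Collect every tab-stop column below the given limit (hypothetical helper). */
    static List<Integer> tabStops(int tabSpace, int limit) {
        List<Integer> stops = new ArrayList<>();
        for (int col = 0; col < limit; col++) {
            if (isTabStop(col, tabSpace)) {
                stops.add(col);
            }
        }
        return stops;
    }

    public static void main(String[] args) {
        System.out.println(tabStops(8, 32)); // [7, 15, 23, 31]
        System.out.println(tabStops(4, 16)); // [3, 7, 11, 15]
    }
}
```

Columns are counted from zero here, so with the default spacing of 8 the stops fall on the last column of each eight-column field.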
You need to convert strings to uppercase or lowercase, or to compare strings without regard for case.
The String class has a number of methods for dealing with documents in a particular case. toUpperCase() and toLowerCase() each return a new string that is a copy of the current string, but converted as the name implies. Each can be called either with no arguments or with a Locale argument specifying the conversion rules; this is necessary because of internationalization. Java’s API provides significant internationalization and localization features, as covered in “Ian’s Basic Steps: Internationalization”.
Whereas the equals() method tells you if another string is exactly the same, equalsIgnoreCase() tells you if all characters are the same regardless of case. Here, you can’t specify a locale; the comparison is made character by character and does not take any locale into account:
String name = "Java Cookbook";
System.out.println("Normal: " + name);
System.out.println("Upper: " + name.toUpperCase());
System.out.println("Lower: " + name.toLowerCase());
String javaName = "java cookBook"; // If it were Java identifiers :-)
if (!name.equals(javaName))
    System.err.println("equals() correctly reports false");
else
    System.err.println("equals() incorrectly reports true");
if (name.equalsIgnoreCase(javaName))
    System.err.println("equalsIgnoreCase() correctly reports true");
else
    System.err.println("equalsIgnoreCase() incorrectly reports false");
If you run this, it prints the first name changed to uppercase and lowercase, then it reports that both methods work as expected:
C:\javasrc\strings>java strings.Case
Normal: Java Cookbook
Upper: JAVA COOKBOOK
Lower: java cookbook
equals() correctly reports false
equalsIgnoreCase() correctly reports true
Regular expressions make it simpler to ignore case in string searching (see Chapter 4).
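To see why the Locale overloads of toUpperCase() and toLowerCase() matter, here is a minimal sketch using Turkish, where a lowercase i uppercases to a dotted capital I (U+0130) rather than to the ASCII I. The class name LocaleCase and the helper upper() are invented for this illustration.

```java
import java.util.Locale;

class LocaleCase {
    /** Thin wrapper so the two conversions below read side by side (hypothetical helper). */
    static String upper(String s, Locale locale) {
        return s.toUpperCase(locale);
    }

    public static void main(String[] args) {
        String word = "title";
        Locale turkish = Locale.forLanguageTag("tr");

        // Locale.ROOT applies locale-neutral rules: i becomes I
        String neutral = upper(word, Locale.ROOT);
        // Turkish rules map i to the dotted capital I, U+0130
        String tr = upper(word, turkish);

        System.out.println(neutral.equals("TITLE"));      // true
        System.out.println(tr.equals("T\u0130TLE"));      // true
        System.out.println(neutral.equals(tr));           // false

        // equalsIgnoreCase() works per character and ignores the locale entirely
        System.out.println("TITLE".equalsIgnoreCase(word)); // true
    }
}
```

When the result is used for protocol keywords, filenames, or other machine-readable text, passing Locale.ROOT explicitly avoids surprises on machines whose default locale has special casing rules.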
You need to put nonprintable characters into strings.
Use the backslash character and one of the Java string escapes.
The Java string escapes are listed in Table 3-1.
To get: | Use: | Notes |
---|---|---|
Tab | \t | |
Linefeed (Unix newline) | \n | The call System.getProperty("line.separator") will give you the platform’s line end. |
Carriage return | \r | |
Form feed | \f | |
Backspace | \b | |
Single quote | \' | |
Double quote | \" | |
Unicode character | \uNNNN | Four hexadecimal digits (no \x as in C/C++). |
Octal(!) character | \NNN | Who uses octal (base 8) these days? |
Backslash | \\ | |
Here is a code example that shows most of these in action:
public class StringEscapes {
    public static void main(String[] argv) {
        System.out.println("Java Strings in action:");
        // System.out.println("An alarm or alert: \a"); // not supported
        System.out.println("An alarm entered in Octal: \007");
        System.out.println("A tab key: \t(what comes after)");
        System.out.println("A newline: \n(what comes after)");
        System.out.println("A UniCode character: \u0207");
        System.out.println("A backslash character: \\");
    }
}
If you have a lot of non-ASCII characters to enter, you may wish to consider using Java’s input methods, discussed briefly in the online documentation.
You need to work on a string without regard for extra leading or trailing spaces a user may have typed.
Use the String class strip() or trim() methods.
There are four methods in the String class for this:
strip(): Returns a string with all leading and trailing whitespace removed.
stripLeading(): Returns a string whose value is this string, with all leading whitespace removed.
stripTrailing(): Returns the string with all trailing whitespace removed.
trim(): Returns the string with all leading and trailing spaces removed.
For the strip() methods, “whitespace” is as defined by Character.isWhitespace().
For the trim() method, “space” means any character whose numeric value is less than or equal to 32 (U+0020, the space character).
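The difference between the two definitions shows up with Unicode spaces. This small sketch (class and method names are invented for the illustration; strip() needs Java 11 or later) pads a word with U+2003 EM SPACE, which counts as whitespace for strip() but has a code point well above 32 and so is left alone by trim().

```java
class StripVsTrim {
    /** Length after strip(), which uses Character.isWhitespace(). */
    static int strippedLength(String s) {
        return s.strip().length();
    }

    /** Length after trim(), which only removes characters <= U+0020. */
    static int trimmedLength(String s) {
        return s.trim().length();
    }

    public static void main(String[] args) {
        String padded = "\u2003hello\u2003"; // EM SPACE on both sides
        System.out.println(strippedLength(padded)); // 5
        System.out.println(trimmedLength(padded));  // 7

        // With plain ASCII blanks, tabs, and newlines the two agree
        String ascii = " \t hello \n";
        System.out.println(ascii.strip().equals(ascii.trim())); // true

        // The one-sided variants leave the other end untouched
        System.out.println("[" + "  x  ".stripLeading() + "]");  // [x  ]
        System.out.println("[" + "  x  ".stripTrailing() + "]"); // [  x]
    }
}
```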
Example 3-10 uses trim() to strip an arbitrary number of leading spaces and/or tabs from lines of Java source code in order to look for the characters //+ and //-. These strings are special Java comments I previously used to mark the parts of the programs in this book that I want to include in the printed copy.
public class GetMark {
    /** the default starting mark. */
    public final String START_MARK = "//+";
    /** the default ending mark. */
    public final String END_MARK = "//-";
    /** Set this to TRUE for running in "exclude" mode (e.g., for
     * building exercises from solutions) and to FALSE for running
     * in "extract" mode (e.g., writing a book and omitting the
     * imports and "public class" stuff).
     */
    public final static boolean START = true;
    /** True if we are currently inside marks. */
    protected boolean printing = START;
    /** True if you want line numbers */
    protected final boolean number = false;

    /** Get Marked parts of one file, given an open LineNumberReader.
     * This is the main operation of this class, and can be used
     * inside other programs or from the main() wrapper.
     */
    public void process(String fileName, LineNumberReader is, PrintStream out) {
        int nLines = 0;
        try {
            String inputLine;
            while ((inputLine = is.readLine()) != null) {
                if (inputLine.trim().equals(START_MARK)) {
                    if (printing)
                        // These go to stderr, so you can redirect the output
                        System.err.println("ERROR: START INSIDE START, " +
                            fileName + ':' + is.getLineNumber());
                    printing = true;
                } else if (inputLine.trim().equals(END_MARK)) {
                    if (!printing)
                        System.err.println("ERROR: STOP WHILE STOPPED, " +
                            fileName + ':' + is.getLineNumber());
                    printing = false;
                } else if (printing) {
                    if (number) {
                        out.print(nLines);
                        out.print(": ");
                    }
                    out.println(inputLine);
                    ++nLines;
                }
            }
            is.close();
            out.flush(); // Must not close - caller may still need it.
            if (nLines == 0)
                System.err.println("ERROR: No marks in " + fileName +
                    "; no output generated!");
        } catch (IOException e) {
            System.out.println("IOException: " + e);
        }
    }
}
You want your program to take “sensitivity lessons” so that it can communicate well internationally.
Your program must obtain all control and message strings via the internationalization software. Here’s how:
Get a ResourceBundle.
ResourceBundle rb = ResourceBundle.getBundle("Menus");
I’ll talk about ResourceBundle in Recipe 3.13, but briefly, a ResourceBundle represents a collection of name-value pairs (resources). The names are names you assign to each GUI control or other user interface text, and the values are the text to assign to each control in a given language.
Use this ResourceBundle to fetch the localized version of each control name.
Old way:
String label = "Exit";
// Create the control, e.g., new JButton(label);
New way:
try {
    label = rb.getString("exit.label");
} catch (MissingResourceException e) {
    label = "Exit"; // fallback
}
// Create the control, e.g., new JButton(label);
This may seem quite a bit of code for one control, but you can write a convenience routine to simplify it, e.g.,
JButton exitButton = I18NUtil.getButton("exit.label", "Exit");
The file I18NUtil.java is included in the book’s code distribution.
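The general shape of such a convenience routine can be sketched without any GUI code. The following is only an illustration, not the book's I18NUtil: the class LabelLookup and the methods labelOrDefault() and bundleOf() are invented here, and the bundle is built from an in-memory string (standing in for a Spanish properties file) so that the example runs without any files on disk.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.MissingResourceException;
import java.util.PropertyResourceBundle;
import java.util.ResourceBundle;

class LabelLookup {
    /** Return the localized text for key, or fallback if the bundle lacks it. */
    static String labelOrDefault(ResourceBundle rb, String key, String fallback) {
        try {
            return rb.getString(key);
        } catch (MissingResourceException e) {
            return fallback;
        }
    }

    /** Build a bundle from properties-format text (demo stand-in for a real file). */
    static ResourceBundle bundleOf(String propertiesText) {
        try {
            return new PropertyResourceBundle(new StringReader(propertiesText));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // The key and its Spanish text are made up for this demonstration
        ResourceBundle rb = bundleOf("exit.label=Salir\n");
        System.out.println(labelOrDefault(rb, "exit.label", "Exit")); // Salir
        System.out.println(labelOrDefault(rb, "help.label", "Help")); // Help
    }
}
```

A GUI-specific wrapper would then pass the returned string to the control's constructor, which keeps the try/catch in one place instead of repeating it for every control.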
While the example is a Swing JButton, the same approach works with other UIs, such as the web tier.
In JSF, for example, you might place your strings in a properties file called resources.properties and store it in src/main/resources. You’d load this in faces-config.xml:
<application>
    <locale-config>
        <default-locale>en</default-locale>
        <supported-locale>en</supported-locale>
        <supported-locale>es</supported-locale>
        <supported-locale>fr</supported-locale>
    </locale-config>
    <resource-bundle>
        <base-name>resources</base-name>
        <var>msg</var>
    </resource-bundle>
</application>
Then in each web page that needs these strings, refer to the resource using the msg variable in an expression:
<!-- In signup.xhtml: -->
<h:outputText value="#{msg.prompt_firstname}"/>
<h:inputText required="true" id="firstName" value="#{person.firstName}"/>
The default locale is used, because we didn’t specify one. The default locale is platform-dependent:
Unix: LANG environment variable (per user)
Windows: Control Panel→Regional Settings
macOS: System Preferences→Language & Text
Others: See platform documentation
ResourceBundle.getBundle() locates a file with the named resource bundle name (Menus, in the previous example), plus an underscore and the locale name (if a non-default locale is set), plus another underscore and the locale variation (if any variation is set), plus the extension .properties. If a variation is set but the file can’t be found, it falls back to just the language code. If that can’t be found, it falls back to the original default. Table 3-2 shows some examples for various locales.
Note that Android apps—usually written in Java or Kotlin—use a similar mechanism, but with the files in XML format instead of Java Properties, and with some small changes in the name of the file in which the properties files are found.
Locale | Filename |
---|---|
Default locale | Menus.properties |
Swedish | Menus_sv.properties |
Spanish | Menus_es.properties |
French | Menus_fr.properties |
French-Canadian | Menus_fr_CA.properties |
Locale names are two-letter ISO 639 language codes (lowercase), and normally abbreviate the language’s endonym (the name its speakers use for it); thus Swedish is sv for svenska, Spanish is es for español, etc. Locale variations are two-letter ISO 3166 country codes (uppercase), e.g., CA for Canada, US for the United States, SE for Sweden, ES for Spain, etc.
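The naming scheme in Table 3-2 can be inspected from code. As a rough sketch (BundleNames and candidateFiles() are names made up for this example), the standard ResourceBundle.Control class will report the candidate locales for a requested locale and the bundle name for each, most specific first. The real getBundle() search is more involved, since it may also consult the default locale before settling for the base file.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.ResourceBundle;

class BundleNames {
    /** List the properties filenames considered for one locale, most specific first. */
    static List<String> candidateFiles(String baseName, Locale locale) {
        ResourceBundle.Control control =
            ResourceBundle.Control.getControl(ResourceBundle.Control.FORMAT_PROPERTIES);
        List<String> names = new ArrayList<>();
        for (Locale candidate : control.getCandidateLocales(baseName, locale)) {
            // toBundleName() appends _language and _COUNTRY as needed
            names.add(control.toBundleName(baseName, candidate) + ".properties");
        }
        return names;
    }

    public static void main(String[] args) {
        System.out.println(candidateFiles("Menus", Locale.CANADA_FRENCH));
        // [Menus_fr_CA.properties, Menus_fr.properties, Menus.properties]
    }
}
```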
On Windows, go into Regional Settings in the Control Panel. Changing this setting may entail a reboot, so exit any editor windows.
On Unix, set your LANG environment variable. For example, a Korn shell user in Mexico might have this line in her .profile:
export LANG=es_MX
On either system, for testing a different locale, you need only define the locale in the System Properties at runtime using the command-line option -D, as in:
java -Duser.language=es i18n.Browser
to run the Java program named Browser in package i18n in the Spanish locale.
You can get a list of the available locales with a call to Locale.getAvailableLocales().
You want to use a locale other than the default in a particular operation.
Obtain a Locale by using a predefined instance or the Locale constructor.
Optionally make it global to your application by using Locale.setDefault(newLocale).
Classes that provide formatting services, such as DateTimeFormatter and NumberFormat, provide overloads so they can be called either with or without a Locale-related argument.
To obtain a Locale object, you can employ one of the predefined locale variables provided by the Locale class, or you can construct your own Locale object giving a language code and a country code:
Locale frLocale = Locale.FRANCE; // predefined
Locale ukLocale = new Locale("en", "UK"); // English, UK version
These can then be used in the various formatting operations.
DateFormat frDateFormatter = DateFormat.getDateInstance(DateFormat.MEDIUM, frLocale);
DateFormat ukDateFormatter = DateFormat.getDateInstance(DateFormat.MEDIUM, ukLocale);
Either of these can be used to format a date or a number, as shown in class UseLocales:
package i18n;

import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;
import java.time.format.FormatStyle;
import java.util.Locale;

/** Use some locales; based on user's OS "settings"
 * choices or -Duser.language= or -Duser.region=.
 */
public class UseLocales {
    public static void main(String[] args) {

        Locale frLocale = Locale.FRANCE; // predefined
        Locale ukLocale = new Locale("en", "UK"); // English, UK version

        DateTimeFormatter defaultDateFormatter =
            DateTimeFormatter.ofLocalizedDateTime(FormatStyle.MEDIUM);
        DateTimeFormatter frDateFormatter =
            DateTimeFormatter.ofLocalizedDateTime(FormatStyle.MEDIUM)
                .localizedBy(frLocale);
        DateTimeFormatter ukDateFormatter =
            DateTimeFormatter.ofLocalizedDateTime(FormatStyle.MEDIUM)
                .localizedBy(ukLocale);

        LocalDateTime now = LocalDateTime.now();
        System.out.println("Default: " + ' ' +
            now.format(defaultDateFormatter));
        System.out.println(frLocale.getDisplayName() + ' ' +
            now.format(frDateFormatter));
        System.out.println(ukLocale.getDisplayName() + ' ' +
            now.format(ukDateFormatter));
    }
}
The program prints the locale name and formats the date in each of the locales:
$ java i18n.UseLocales
Default: Oct 16, 2019, 4:41:45 PM
French (France) 16 oct. 2019 à 16:41:45
English (UK) Oct 16, 2019, 4:41:45 PM
$
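One detail in this output is worth a second look: the “English (UK)” line is formatted exactly like the default line. The ISO 3166 country code for the United Kingdom is GB rather than UK, so a hand-built en_UK locale may simply not select British conventions. The short sketch below (the class name and helper are invented for this illustration) shows what the predefined Locale.UK constant actually contains; it deliberately prints no formatted dates, because those vary with the JDK's locale data.

```java
import java.util.Locale;

class UkLocaleCheck {
    /** Country code carried by the predefined UK locale. */
    static String ukCountryCode() {
        return Locale.UK.getCountry();
    }

    public static void main(String[] args) {
        Locale predefined = Locale.UK;
        Locale fromTag = Locale.forLanguageTag("en-GB");

        System.out.println(predefined);                 // en_GB
        System.out.println(ukCountryCode());            // GB
        System.out.println(predefined.equals(fromTag)); // true
    }
}
```

Using Locale.UK, or a locale built with the GB country code, in place of the "UK" string is the safer way to ask for British formatting.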
You need to create a resource bundle for use with I18N.
A resource bundle is simply a collection of names and values. You could write a java.util.ResourceBundle subclass, but it is easier to create textual Properties files (see Recipe 7.10) that you then load with ResourceBundle.getBundle(). The files can be created using any plain text editor. Leaving it in a text file format also allows user customization in desktop applications; a user whose language is not provided for, or who wishes to change the wording somewhat due to local variations in dialect, should be able to edit the file.
Note that the resource bundle text file should not have the same name as any of your Java classes. The reason is that the ResourceBundle constructs a class dynamically with the same name as the resource files.
Here is a sample properties file for a few menu items:
# Default Menu properties
# The File Menu
file.label=File Menu
file.new.label=New File
file.new.key=N
file.save.label=Save
file.save.key=S
Creating the default properties file is usually not a problem, but creating properties files for other languages might be. Unless you are a large multinational corporation, you will probably not have the resources (pardon the pun) to create resource files in-house. If you are shipping commercial software, or using the web for global reach, you need to identify your target markets and understand which of these are most sensitive to wanting menus and the like in their own languages. Then, hire a professional translation service that has expertise in the required languages to prepare the files. Test them well before you ship, as you would any other part of your software.
If you need special characters, multiline text, or other complex entry, remember that a properties-based ResourceBundle is read using the Properties file format, so see the documentation for java.util.Properties.
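Two features of the Properties format that are handy for such entries are line continuation (a backslash at the end of a line) and Unicode escapes. Here is a minimal sketch that parses properties text from a string; the class name PropsSyntax, the helper parse(), and the keys and values are all invented for this demonstration.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

class PropsSyntax {
    /** Parse properties-format text held in a string (hypothetical helper). */
    static Properties parse(String text) {
        Properties p = new Properties();
        try {
            p.load(new StringReader(text));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return p;
    }

    public static void main(String[] args) {
        String text =
            "# keys and values are invented for this demo\n" +
            "about.text=Line one, \\\n" +            // trailing backslash continues the value
            "    continued on the next line\n" +     // leading blanks here are dropped
            "lang.name=Espa\\u00f1ol\n";             // Unicode escape, decoded on load
        Properties p = parse(text);

        System.out.println(p.getProperty("about.text"));
        // Line one, continued on the next line
        System.out.println(p.getProperty("lang.name").equals("Espa\u00f1ol")); // true
    }
}
```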
This program is a very primitive text formatter, representative of what people used on most computing platforms before the rise of standalone graphics-based word processors, laser printers, and, eventually, desktop publishing and desktop office suites. It simply reads words from a file (previously created with a text editor) and outputs them until it reaches the right margin, when it calls println() to append a line ending. For example, here is an input file:
It's a nice day, isn't it, Mr. Mxyzzptllxy? I think we should go for a walk.
Given the preceding as its input, the Fmt program prints the lines formatted neatly:
It's a nice day, isn't it, Mr. Mxyzzptllxy? I think we should go for a walk.
As you can see, it fits the text we gave it to the margin and discards all the line breaks present in the original. Here’s the code:
public class Fmt {
    /** The maximum column width */
    public static final int COLWIDTH = 72;
    /** The file that we read and format */
    final BufferedReader in;
    /** Where the output goes */
    PrintWriter out;

    /** If files present, format each one, else format the standard input. */
    public static void main(String[] av) throws IOException {
        if (av.length == 0)
            new Fmt(System.in).format();
        else
            for (String name : av) {
                new Fmt(name).format();
            }
    }

    public Fmt(BufferedReader inFile, PrintWriter outFile) {
        this.in = inFile;
        this.out = outFile;
    }

    public Fmt(PrintWriter out) {
        this(new BufferedReader(new InputStreamReader(System.in)), out);
    }

    /** Construct a Formatter given an open Reader */
    public Fmt(BufferedReader file) throws IOException {
        this(file, new PrintWriter(System.out));
    }

    /** Construct a Formatter given a filename */
    public Fmt(String fname) throws IOException {
        this(new BufferedReader(new FileReader(fname)));
    }

    /** Construct a Formatter given an open Stream */
    public Fmt(InputStream file) throws IOException {
        this(new BufferedReader(new InputStreamReader(file)));
    }

    /** Format the File contained in a constructed Fmt object */
    public void format() throws IOException {
        format(in.lines(), out);
    }

    /** Format a Stream of lines, e.g., bufReader.lines() */
    public static void format(Stream<String> s, PrintWriter out) {
        StringBuilder outBuf = new StringBuilder();
        s.forEachOrdered((line -> {
            if (line.length() == 0) { // null line
                out.println(outBuf); // end current line
                out.println(); // output blank line
                outBuf.setLength(0);
            } else {
                // otherwise it's text, so format it.
                StringTokenizer st = new StringTokenizer(line);
                while (st.hasMoreTokens()) {
                    String word = st.nextToken();

                    // If this word would go past the margin,
                    // first dump out anything previous.
                    if (outBuf.length() + word.length() > COLWIDTH) {
                        out.println(outBuf);
                        outBuf.setLength(0);
                    }
                    outBuf.append(word).append(' ');
                }
            }
        }));
        if (outBuf.length() > 0) {
            out.println(outBuf);
        } else {
            out.println();
        }
        out.flush(); // PrintWriter buffers its output; flush so nothing is lost
    }
}
A slightly fancier version of this program, Fmt2, is in the online source for this book. It uses “dot commands” (lines beginning with periods) to give limited control over the formatting. A family of “dot command” formatters includes Unix’s roff, nroff, troff, and groff, which are in the same family with programs called runoff on Digital Equipment systems. The original for this is J. Saltzer’s runoff, which first appeared on Multics and from there made its way into various OSes. To save trees, I did not include Fmt2 here; it subclasses Fmt and overrides the format() method to include additional functionality (the source code is in the full javasrc repository for the book).
The difficulties in comparing (American-style) names inspired the U.S. Census Bureau to develop the Soundex algorithm in the early 1900s. Each of a given set of consonants maps to a particular number, the effect being to map similar-sounding names together, on the grounds that in those days many people were illiterate and could not spell their family names consistently. But it is still useful today, for example in a company-wide telephone book application. The names Darwin and Derwin, for example, map to D650, and Darwent maps to D653, which puts it adjacent to D650. All of these are believed to be historical variants of the same name. Suppose we needed to sort lines containing these names together: if we could output the Soundex numbers at the beginning of each line, this would be easy. Here is a simple demonstration of the Soundex class:
public class SoundexSimple {

    /** main */
    public static void main(String[] args) {
        String[] names = {
            "Darwin, Ian",
            "Davidson, Greg",
            "Darwent, William",
            "Derwin, Daemon"
        };
        for (String name : names) {
            System.out.println(Soundex.soundex(name) + ' ' + name);
        }
    }
}
Let’s run it:
> javac -d . SoundexSimple.java
> java strings.SoundexSimple | sort
D132 Davidson, Greg
D650 Darwin, Ian
D650 Derwin, Daemon
D653 Darwent, William
>
As you can see, the Darwin-variant names (including Daemon Derwin5) all sort together and are distinct from the Davidson (and Davis, Davies, etc.) names that normally appear between Darwin and Derwin when using a simple alphabetic sort. The Soundex algorithm has done its work.
Here is the Soundex class itself; it uses Strings and StringBuilders to convert names into Soundex codes:
main/src/main/java/strings/Soundex.java
public class Soundex {

    static boolean debug = false;

    /* Implements the mapping
     * from: AEHIOUWYBFPVCGJKQSXZDTLMNR
     * to:   00000000111122222222334556
     */
    public static final char[] MAP = {
        //A    B    C    D    E    F    G    H    I    J    K    L    M
        '0', '1', '2', '3', '0', '1', '2', '0', '0', '2', '2', '4', '5',
        //N    O    P    Q    R    S    T    U    V    W    X    Y    Z
        '5', '0', '1', '2', '6', '2', '3', '0', '1', '0', '2', '0', '2'
    };

    /** Convert the given String to its Soundex code.
     * @return null If the given string can't be mapped to Soundex.
     */
    public static String soundex(String s) {

        // Algorithm works on uppercase (mainframe era).
        String t = s.toUpperCase();

        StringBuilder res = new StringBuilder();
        char c, prev = '?', prevOutput = '?';

        // Main loop: find up to 4 chars that map.
        for (int i = 0; i < t.length() && res.length() < 4 &&
                (c = t.charAt(i)) != ','; i++) {

            // Check to see if the given character is alphabetic.
            // Text is already converted to uppercase. Algorithm
            // only handles ASCII letters, do NOT use Character.isLetter()!
            // Also, skip double letters.
            if (c >= 'A' && c <= 'Z' && c != prev) {
                prev = c;

                // First char is installed unchanged, for sorting.
                if (i == 0) {
                    res.append(c);
                } else {
                    char m = MAP[c - 'A'];
                    if (debug) {
                        System.out.println(c + " --> " + m);
                    }
                    if (m != '0' && m != prevOutput) {
                        res.append(m);
                        prevOutput = m;
                    }
                }
            }
        }
        if (res.length() == 0)
            return null;
        for (int i = res.length(); i < 4; i++)
            res.append('0');
        return res.toString();
    }
}
There are apparently some nuances of the full Soundex algorithm that are not implemented by this application. A more complete test using JUnit (see Recipe 1.10) is also online as SoundexTest.java, in the src/tests/java/strings directory. The dedicated reader may use this to provoke failures of such nuances, and send a pull request with updated versions of the test and the code.
The Levenshtein string edit distance algorithm can be used for doing approximate string comparisons in a different fashion. You can find this in Apache Commons StringUtils. I show a non-Java (Perl) implementation of this algorithm in Recipe 18.5.
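For readers who want to see the idea in Java without adding a dependency, here is a compact sketch of the textbook dynamic-programming formulation of Levenshtein distance. It is not the Apache Commons implementation, and the class and method names are invented for this illustration; the result is the minimum number of single-character insertions, deletions, and substitutions needed to turn one string into the other.

```java
class EditDistance {
    /** Levenshtein distance, keeping only two rows of the DP table. */
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // turning "" into the first j chars of b takes j insertions
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // turning the first i chars of a into "" takes i deletions
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(
                    Math.min(curr[j - 1] + 1,  // insertion
                             prev[j] + 1),     // deletion
                    prev[j - 1] + cost);       // substitution or match
            }
            int[] tmp = prev; // swap rows for the next iteration
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    public static void main(String[] args) {
        System.out.println(levenshtein("Darwin", "Derwin"));  // 1
        System.out.println(levenshtein("Darwin", "Darwent")); // 2
        System.out.println(levenshtein("kitten", "sitting")); // 3
    }
}
```

Unlike Soundex, which buckets names by sound, this gives a numeric measure of spelling difference, so a threshold (say, a distance of 2 or less) has to be chosen by the application.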
1 The two .equals() calls are “equivalent” with the exception that the first can throw a NullPointerException while the second cannot.
2 StringBuilder was added in Java 5. It is functionally equivalent to the older StringBuffer. We will delve into the details in Recipe 3.2.
3 Unless, perhaps, you’re as slow at updating personal records as I am.
4 Indeed, there are so many characters in Unicode that a fad has emerged of displaying your name upside down using characters that approximate upside-down versions of the Latin alphabet. Do a web search for “upside down unicode.”
5 In Unix terminology, a “daemon” is a server. The old English word has nothing to do with satanic “demons” but refers to a helper or assistant. Derwin Daemon was actually a character in Susannah Coleman’s “Source Wars” online comic strip, which long ago was online at a now-departed site called darby.daemonnews.org.