String
comparison
performance is highly dependent on both the string data and the
comparison algorithm (this is really a truism about collections in
general). The methods that come with the String
class have a performance advantage in being able to directly access
the underlying char
collection. So if you need to
make String
comparisons, String
methods usually provide better performance than your own methods,
provided that you can make your desired comparison fit in with one of
the String
methods. Another necessary
consideration is whether comparisons are case-sensitive or
-insensitive, and I will consider this in more detail shortly.
To optimize for string
comparisons, you need to look at the source of the comparison methods
so you know exactly how they work. As an example, consider the
String.equals( )
and
String.equalsIgnoreCase( )
methods from the Java 2
distribution.
String.equals(Object)
runs in a fairly straightforward way:
it first checks for object identity, then for
null
, then for String
type,
then for same-size strings, and then character by character, running
from the first characters to the last. Efficient and complete.
String.equalsIgnoreCase(String)
is a little more complex. It checks
for null
, and then for strings being the same size
(the String
type check is not needed, since this
method accepts only String
objects). Then, using a
case-insensitive comparison,
regionMatches( )
is
applied. regionMatches( )
runs a
character-by-character test from the first character to the last,
converting characters to uppercase before comparing them.
Immediately, you see that the more
differences there are between the two strings, the faster these
methods return. This behavior is common for collection comparisons,
and the order of the comparison is crucial. In these two cases, the
strings are compared starting with the first character, so the
earlier the difference occurs, the faster the methods return.
However, equals( )
returns faster if the two
String
objects are identical. It is unusual to
check String
s by identity, but there are a number
of situations where it is useful, for example, when you are using a
set of canonical String
s (see Chapter 4). Another example is when an application has
enough time during string input to intern( )
[38]
the strings, so that later comparisons
by identity are possible.
In any case, equals( )
returns immediately if the
two strings are identical, but equalsIgnoreCase( )
does not even check for identity (which may be reasonable given what
it does). This results in equals( )
running an
order of magnitude faster than equalsIgnoreCase( )
if the two strings are identical; identical strings is the fastest
test case resolvable for equals( )
, but the
slowest case for equalsIgnoreCase( )
.
On the other hand, if the two strings are different in size,
equalsIgnoreCase( )
has only two tests to make
before it returns, whereas equals( )
makes four
tests before it returns. This can make equalsIgnoreCase()
run 20% faster than equals( )
for what
may be the most common difference between strings.
There are more differences between these two methods. In almost every
possible case of string data, equals( )
runs
faster (often several times faster) than equalsIgnoreCase()
. However, in a test against the words from a particular
dictionary, I found that over 90% of the words were different in size
from a randomly chosen word. When comparing the performance of these
two methods for a comparison of a randomly chosen word against the
entire dictionary, the total comparison time taken by each of the two
methods was about the same. The many cases in which strings had
different lengths compensated almost exactly for the slower
comparison of equalsIgnoreCase( )
when the strings
were similar or equal. This illustrates how the data and the
algorithm interplay with each other to affect
performance.
Even though String
methods have access to the
internal char
s, it can be faster to use your own
methods if there are no String
methods appropriate
for your test. You can build methods that are tailored to the data
you have. One way to optimize an equality test is to look for ways to
make the strings identical. An alternative that can actually be
better for performance is to change the search strategy to reduce
search time. For example, a linear search through a large array of
String
s is slower than a binary search through the
same size array if the array is sorted. This, in turn, is slower than
a straight access to a hashed table. Note that when you are able and
willing to deploy changes to JDK classes (e.g., for servlets), you
can add methods directly to the String
class.
However, altering JDK classes can lead to maintenance
problems.[39]
When case-insensitive
searches are required, one standard
optimization is to use a second
collection containing all the strings uppercased. This second
collection is used for comparisons, thus avoiding the need to
repeatedly uppercase each character in the search methods. For
example, if you have a hash table containing
String
keys, you need to iterate over all the keys
to match keys case-insensitively. But, if you have a second hash
table with all the keys uppercased, retrieving the key simply
requires you to uppercase the element being searched for:
//The slow version, iterating through all the keys ignoring case //until the key matches. (hash is a Hashtable) public Object slowlyGet(String key) { Enumeration e = hash.keys( ); String hkey; while(e.hasMoreElements( )) { if (key.equalsIgnoreCase(hkey = (String) e.getNext( )) return hash.get(hkey); } return null; } //The fast version assumes that a second hashtable was created //with all the keys uppercased. Access is straightforward. public Object quicklyGet(String key) { return uppercasedHash.get(key.toUppercase( )); }
However, note that String.toUppercase()
(and
String.toLowercase()
) creates a complete copy of the String
object with a new char
array. Unlike
String.substring( )
,
String.toUppercase( )
has a processing time that
is linearly dependent on the size of the string and also creates an
extra object (a new char
array). This means that
repeatedly using String.toUppercase( )
(and
String.toLowercase( )
) can impose a heavy overhead
on an application. For each particular problem, you need to ensure
that the extra temporary objects created and the extra processing
overheads still provide a performance benefit rather than causing a
new bottleneck in the application.
[38]
String.intern( )
returns the
String
object that is being stored in the internal
VM string pool. If two String
s are equal, then
their intern( )
results are identical; for
example, if s1.equals(s2)
is
true
, then s1.intern( ) == s2.intern()
is also true
.
[39] Several of my colleagues have emphasized their view that changes to the JDK sources lead to severe maintenance problems.
3.17.150.163