5.5 String Comparisons and Searches

String comparison performance is highly dependent on both the string data and the comparison algorithm (this is really a truism about collections in general). The methods that come with the String class have a performance advantage in being able to directly access the underlying char collection. So if you need to make String comparisons, String methods usually provide better performance than your own methods, provided that you can make your desired comparison fit in with one of the String methods. Another necessary consideration is whether comparisons are case-sensitive or -insensitive, and I will consider this in more detail shortly.

To optimize for string comparisons, you need to look at the source of the comparison methods so you know exactly how they work. As an example, consider the String.equals( ) and String.equalsIgnoreCase( ) methods from the Java 2 distribution.

String.equals(Object) runs in a fairly straightforward way: it first checks for object identity, then for null, then for String type, then for same-size strings, and then character by character, running from the first character to the last. Efficient and complete.

String.equalsIgnoreCase(String) is a little more complex. It checks for null, and then for strings being the same size (the String type check is not needed, since this method accepts only String objects). Then, using a case-insensitive comparison, regionMatches( ) is applied. regionMatches( ) runs a character-by-character test from the first character to the last, converting each character to uppercase before comparing.

Immediately, you see that the more differences there are between the two strings, the faster these methods return. This behavior is common for collection comparisons, and the order of the comparison is crucial. In these two cases, the strings are compared starting with the first character, so the earlier the difference occurs, the faster the methods return. However, equals( ) returns faster if the two String objects are identical. It is unusual to check Strings by identity, but there are a number of situations where it is useful (for example, when you are using a set of canonical Strings; see Chapter 4). Another example is when an application has enough time during string input to intern( ) ^[5] the strings, so that later comparisons by identity are possible.

^[5] String.intern( ) returns the String object that is being stored in the internal VM string pool. If two Strings are equal, then their intern( ) results are identical; for example, if s1.equals(s2) is true, then s1.intern( ) = = s2.intern( ) is also true.

In any case, equals( ) returns immediately if the two strings are identical, but equalsIgnoreCase( ) does not even check for identity (which may be reasonable given what it does). This results in equals( ) running an order of magnitude faster than equalsIgnoreCase( ) if the two strings are identical; identical strings is the fastest test case resolvable for equals( ), but the slowest case for equalsIgnoreCase( ).

On the other hand, if the two strings are different in size, equalsIgnoreCase( ) has only two tests to make before it returns, whereas equals( ) makes four tests before it returns. This can make equalsIgnoreCase( ) run 20% faster than equals( ) for what may be the most common difference between strings.

There are more differences between these two methods. In almost every possible case of string data, equals( ) runs faster (often several times faster) than equalsIgnoreCase( ). However, in a test against the words from a particular dictionary, I found that over 90% of the words were different in size from a randomly chosen word. When comparing the performance of these two methods for a comparison of a randomly chosen word against the entire dictionary, the total comparison time taken by each of the two methods was about the same. The many cases in which strings had different lengths compensated almost exactly for the slower comparison of equalsIgnoreCase( ) when the strings were similar or equal. This illustrates how the data and the algorithm interplay with each other to affect performance.

Even though String methods have access to the internal chars, it can be faster to use your own methods if there are no String methods appropriate for your test. You can build methods that are tailored to the data you have. One way to optimize an equality test is to look for ways to make the strings identical. An alternative that can actually be better for performance is to change the search strategy to reduce search time. For example, a linear search through a large array of Strings is slower than a binary search through the same size array if the array is sorted. This, in turn, is slower than a straight access to a hashed table. Note that when you are able and willing to deploy changes to JDK classes (e.g., for servlets), you can add methods directly to the String class. However, altering JDK classes can lead to maintenance problems.^[6]

^[6] Several of my colleagues have emphasized their view that changes to the JDK sources lead to severe maintenance problems.

When case-insensitive searches are required, one standard optimization is to use a second collection containing all the strings uppercased. This second collection is used for comparisons, obviating the need to repeatedly uppercase each character in the search methods. For example, if you have a hash table containing String keys, you need to iterate over all the keys to match keys case-insensitively. But, if you have a second hash table with all the keys uppercased, retrieving the key simply requires you to uppercase the element being searched for:

//The slow version, iterating through all the keys ignoring case
//until the key matches. (hash is a Hashtable)
public Object slowlyGet(String key)
{
  Enumeration e = hash.keys(  );
  String hkey;
  while(e.hasMoreElements(  ))
  {
    if (key.equalsIgnoreCase(hkey = (String) e.getNext(  ))
      return hash.get(hkey);
  }
  return null;
}
  
//The fast version assumes that a second hashtable was created
//with all the keys uppercased. Access is straightforward.
public Object quicklyGet(String key)
{
  return uppercasedHash.get(key.toUppercase(  ));
}

However, note that String.toUppercase( ) (and String.toLowercase( )) creates a complete copy of the String object with a new char array. Unlike String.substring( ) , String.toUppercase( ) has a processing time that is linearly dependent on the size of the string and also creates an extra object (a new char array). This means that repeatedly using String.toUppercase( ) (and String.toLowercase( )) can impose a heavy overhead on an application. For each particular problem, you need to ensure that the extra temporary objects created and the extra processing overhead still provide a performance benefit rather than causing a new bottleneck in the application.