[ Team LiB ] |
5.6 Sorting Internationalized StringsOne big advantage with Strings is that they are built (almost) from the ground up to support internationalization. This means that the Unicode character set is the lingua franca in Java. Unfortunately, because Unicode uses two-byte characters, many string libraries based on one-byte characters that can be ported into Java do not work so well. Most string-search optimizations use tables to assist string searches, but the table size is related to the size of the character set. For example, a traditional Boyer-Moore string search takes a great deal of memory and a long initialization phase to use with Unicode.
Furthermore, sorting international Strings requires the ability to handle many kinds of localization issues, such as the sorted location for accented characters, characters that can be treated as character pairs, and so on. In these cases, it is difficult (and usually impossible) to handle the general case yourself. It is almost always easier to use the String helper classes Java provides, for example, the java.text.Collator class.[7]
Using the java.text.CollationKey object to represent each string is a standard optimization for repeated comparisons of internationalized Strings. You can use this when sorting an array of Strings, for example. CollationKeys perform more than twice as fast as using java.text.Collator.compare( ) . It is probably easiest to see how to use collation keys with a particular example. So let's look at tuning an internationalized String sort. For this, I use a standard quicksort algorithm (the quicksort implementation can be found in Section 11.9). The only modification to the standard quicksort is that for each optimization, the quicksort needs to be adjusted to use the appropriate comparison method and the appropriate data type. For example, the generic quicksort that sorts an array of Comparable objects has the signature: public static void quicksort(Comparable[ ] arr, int lo, int hi) and uses the Comparable.compareTo(Object) method when comparing two Comparable objects. On the other hand, a generic quicksort that sorts objects based on a java.util.Comparator has the signature: public static void quicksort(Object[ ] arr, int lo, int hi, Comparator c) and uses the java.util.Comparator.compare(Object, Object) method when comparing any two objects. (See java.util.Arrays.sort( ) for a specific example.) In each case the underlying algorithm is the same. Only the comparison method changes (and in general the data type too, though not in these examples where the data type was Object). The obvious first test, to get a performance baseline, is the straightforward internationalized sort: public runsort( ) { quicksort(stringArray,0,stringArray.length-1, Collator.getInstance( )); } public static void quicksort(String[ ] arr, int lo, int hi, java.text.Collator c) { ... int mid = ( lo + hi ) / 2; String middle = arr[ mid ]; //String data type ... //uses Collator.compare(String, String) if( c.compare(arr[ lo ], middle) > 0 ) ... } I use a large dictionary of words for the array of strings, inserted in random order, and I use the same random order for each of the tests. The first test took longer than expected. Looking at the Collator class, I can see that it does a huge amount of work, and I cannot possibly bypass its internationalized support if I want to support internationalized strings.[8]
However, as previously mentioned, the java.util.CollationKey class is specifically designed to provide for this type of speedup. It is simple to convert the sort in order to use this. You still need the Collator to generate the CollationKeys, so add a conversion method. The sort now looks like: public runsort( ) { quicksort(stringArray,0,stringArray.length-1, Collator.getInstance( )); } public static void quicksort(String[ ] arr, int lo, int hi, Collator c) { //convert to an array of CollationKeys CollationKey keys[ ] = new CollationKey[arr.length]; for (int i = arr.length-1; i >= 0; i--) keys[i] = c.getCollationKey(arr[i]); //Run the sort on the collation keys quicksort_collationKey(keys, 0, arr.length-1); //and unwrap so that we get our Strings in sorted order for (int i = arr.length-1; i >= 0; i--) arr[i] = keys[i].getSourceString( ); } public static void quicksort_collationKey(CollationKey[ ] arr, int lo, int hi) { ... int mid = ( lo + hi ) / 2; CollationKey middle = arr[ mid ]; //CollationKey data type ... //uses CollationKey.compareTo(CollationKey) if( arr[ lo ].compareTo(middle) > 0 ) ... } Normalizing the time for the first test to 100%, this test is much faster and takes less than half the time (see Table 5-10). This is despite the extra cost imposed by a whole new populated array of CollationKey objects, one for each string. Can it do better? Well, there is nothing further in the java.text package that suggests so. Instead look at the String class, and consider its implementation of the String.compareTo( ) method. This is a simple lexicographic ordering, basically treating the char array as a sequence of numbers and ordering sequence pairs as if there is no meaning to the object being Strings. Obviously, this is useless for internationalized support, but it is much faster. A quick test shows that sorting the test String array using the String.compareTo( ) method takes just 2% of time of the first test, which seems much more reasonable. But is this test incompatible with the desired internationalized sort? Well, maybe not. Sort algorithms usually execute faster if they operate on a partially sorted array. Perhaps using the String.compareTo( ) sort first might bring the array considerably closer to the final ordering of the internationalized sort, and at a fairly low cost. Testing this is straightforward: public runsort( ) { quicksort(stringArray,0,stringArray.length-1, Collator.getInstance( )); } public static void quicksort(String[ ] arr, int lo, int hi, Collator c) { //simple sort using String.compareTo( ) simple_quicksort(arr, lo, hi); //Full international sort on a hopefully partially sorted array intl_quicksort(arr, lo, hi, c); } public static void simple_quicksort(String[ ] arr, int lo, int hi) { ... int mid = ( lo + hi ) / 2; String middle = arr[ mid ]; //uses String data type ... //uses String.compareTo(String) if( arr[ lo ].compareTo(middle) > 0 ) ... } public static void intl_quicksort(String[ ] arr, int lo, int hi, Collator c) { //convert to an array of CollationKeys CollationKey keys[ ] = new CollationKey[arr.length]; for (int i = arr.length-1; i >= 0; i--) keys[i] = c.getCollationKey(arr[i]); //Run the sort on the collation keys quicksort_collationKey(keys, 0, arr.length-1); //and unwrap so that we get our Strings in sorted order for (int i = arr.length-1; i >= 0; i--) arr[i] = keys[i].getSourceString( ); } public static void quicksort_collationKey(CollationKey[ ] arr, int lo, int hi) { ... int mid = ( lo + hi ) / 2; CollationKey middle = arr[ mid ]; //CollationKey data type ... //uses CollationKey.compareTo(CollationKey) if( arr[ lo ].compareTo(middle) > 0 ) ... } This double-sorting implementation reduces the international sort time to a quarter of the original test time (see Table 5-10). Partially sorting the list first using a much simpler (and quicker) comparison test has doubled the speed of the total sort as compared to using only the CollationKeys optimization.
Of course, these optimizations have improved the situation only for the particular locale I have tested (my default locale is set for US English). However, running the test in a sampling of other locales (European and Asian locales), I find similar relative speedups. Without using locale-specific dictionaries, this locale variation test may not be fully valid, but the speedup will likely hold across all Latinized alphabets. You can also create a simple partial-ordering class-specific sort to some locales, which provides a similar speedup. For example, by duplicating the effect of using String.compareTo( ), you can provide the basis for a customized partial sorter: public class PartialSorter { String source; char[ ] stringArray; public Sorting(String s) { //retain the original string source = s; //and get the array of characters for our customized comparison stringArray = new char[s.length( )]; s.getChars(0, stringArray.length, stringArray, 0); } /* This compare method should be customized for different locales */ public static int compare(char[ ] arr1, char[ ] arr2) { //basically the String.compareTo( ) algorithm int n = Math.min(arr1.length, arr2.length); for (int i = 0; i < n; i++) { if (arr1[i] != arr2[i]) return arr1[i] - arr2[i]; } return arr1.length - arr2.length; } public static void quicksort(String[ ] arr, int lo, int hi) { //convert to an array of PartialSorters PartialSorter keys[ ] = new PartialSorter[arr.length]; for (int i = arr.length-1; i >= 0; i--) keys[i] = new PartialSorter(arr[i]); quicksort_mysorter(keys, 0, arr.length-1); //and unwrap so that we get our Strings in sorted order for (int i = arr.length-1; i >= 0; i--) arr[i] = keys[i].source; } public static void quicksort_mysorter(PartialSorter[ ] arr, int lo, int hi) { ... int mid = ( lo + hi ) / 2; PartialSorter middle = arr[ mid ]; //PartialSorter data type ... //Use the PartialSorter.compare( ) method to compare the char arrays if( compare(arr[ lo ].stringArray, middle.stringArray) > 0 ) ... } } This PartialSorter class works similarly to the CollationKey class, wrapping a string and providing its own comparison method. The particular comparison method shown here is just an implementation of the String.compareTo( ) method. It is pointless to use it exactly as defined here because object-creation overhead means that using the PartialSorter is twice as slow as using the String.compareTo( ) directly. But customizing the PartialSorter.compare( ) method for any particular locale is a reasonable task: remember, we are interested only in a simple algorithm that handles a partial sort, not the full intricacies of completely accurate locale-specific comparison. Generally, you cannot expect to support internationalized strings and retain the performance of simple one-byte-per-character strings. But, as shown here, you can certainly improve the performance. |