11.9 Finding the Index for Partially Matched Strings
The problem
considered here concerns a large number of string keys that need to
be accessed by full or partial match. Each string is unique, so the
full-match access can easily be handled by a standard hash-table
structure (e.g., java.util.HashMap). The
partial-match access needs to collect all objects that have string
keys starting with a particular substring.
Consider this hash table consisting of keys and values:
"hello" 1
"bye" 2
"hi" 3
The full match for key "hi" retrieves
3, and the partial match against strings starting
with "h" retrieves the collection
{1,3}. Using a hash-table structure for the
partial-match access is expensive because it requires that all keys
be iterated over, and then each key matching the corresponding object
needs to be collated.
Of course, I am considering here a large collection of strings.
Alternatives are not usually necessary for a few (or even a few
thousand) strings. But for large collections, performance-tuning
techniques become necessary.
To tune, look for data structures that quickly match any partial
string. The task is somewhat simpler than the most generic version of
this type of problem because you need to match only the first few
consecutive characters. This means that some sort of tree structure
is probably ideal. Of the structures available from the JDK,
TreeMap looks like it can provide exactly the
required functionality; it gives a minimal baseline and, if the
performance is adequate, there is no more tuning to do. But
TreeMap is 5 to 10 times slower than
HashMap for access and update. The target is to
obtain HashMap access speed for single-key access.
Don't get carried away searching for the perfect
data structure. Thinking laterally, you can consider other
possibilities. If you have the strings in a sorted collection, you
can apply a binary search to find the index of the string that is
greater than or less than the partial string, and then obtain all the
strings (and hence corresponding objects) in between.
More specifically, you can construct a sorted array of keys from the
hash table. Then, if you want to find all strings starting with
"h", you can run a binary search for the strings
"h" and "h\uFFFF". This gives
all the indexes of the band for all the keys that start with
"h". Note that a binary search can return the
index where the string would be even if it is not actually in the
array. (The correct solution actually goes from
"h" inclusive to "i" exclusive,
but this solution will do for strings that don't
include character \uFFFF.)
Having parallel collections can lead to all sorts of problems in
making sure both collections contain the same elements. Solutions
that involve parallel collections should hide all accesses and
updates to the parallel collections through a separate object to
ensure that all
accesses and updates are consistent.
The solution here is suitable mainly when the collections are updated
infrequently, e.g., when they are built once or periodically and read
from often. Here is a class implementing this solution:
package tuning.struct;
import java.util.Hashtable;
import java.util.Enumeration;
public class PartialSearcher
{
Hashtable hash;
String[ ] sortedArray;
public static void main(String args[ ])
{
//Populate a Hashtable with ten strings
Hashtable h = new Hashtable( );
h.put("hello", new Integer(1));
h.put("hell", new Integer(2));
h.put("alpha", new Integer(3));
h.put("bye", new Integer(4));
h.put("hello2", new Integer(5));
h.put("solly", new Integer(6));
h.put("sally", new Integer(7));
h.put("silly", new Integer(8));
h.put("zorro", new Integer(9));
h.put("hi", new Integer(10));
//Create the searching object
PartialSearcher p = new PartialSearcher(h);
//Match against all string keys given by
//the first command line argument
Object[ ] objs = p.match(args[0]);
//And print the matches out
for(int i = 0; i<objs.length; i++)
System.out.println(objs[i]);
}
public PartialSearcher(Hashtable h)
{
hash = h;
createSortedArray( );
}
public Object[ ] match(String s)
{
//find the start and end positions of strings that match the key
int startIdx = binarySearch(sortedArray, s,
0, sortedArray.length-1);
int endIdx = binarySearch(sortedArray, s+ '\uFFFF',
0, sortedArray.length-1);
//and return an array of the matched keys
Object[ ] objs = new Object[endIdx-startIdx];
for (int i = startIdx ; i < endIdx; i++)
objs[i-startIdx] = sortedArray[i];
return objs;
}
public void createSortedArray( )
{
//Create a sorted array of the keys of the hash table
sortedArray = new String[hash.size( )];
Enumeration e = hash.keys( );
for (int i = 0; e.hasMoreElements( ); i++)
sortedArray[i] = (String) e.nextElement( );
quicksort(sortedArray, 0, sortedArray.length-1);
}
/**
* Semi-standard binary search returning index of match location or
* where the location would match if it is not present.
*/
public static int binarySearch(String[ ] arr, String elem,
int fromIndex, int toIndex)
{
int mid,cmp;
while (fromIndex <= toIndex)
{
mid =(fromIndex + toIndex)/2;
if ( (cmp = arr[mid].compareTo(elem)) < 0)
fromIndex = mid + 1;
else if (cmp > 0)
toIndex = mid - 1;
else
return mid;
}
return fromIndex;
}
/**
* Standard quicksort
*/
public void quicksort(String[ ] arr, int lo, int hi)
{
if( lo >= hi )
return;
int mid = ( lo + hi ) / 2;
String tmp;
String middle = arr[ mid ];
if( arr[ lo ].compareTo(middle) > 0 )
{
arr[ mid ] = arr[ lo ];
arr[ lo ] = middle;
middle = arr[ mid ];
}
if( middle.compareTo(arr[ hi ]) > 0)
{
arr[ mid ] = arr[ hi ];
arr[ hi ] = middle;
middle = arr[ mid ];
if( arr[ lo ].compareTo(middle) > 0)
{
arr[ mid ] = arr[ lo ];
arr[ lo ] = middle;
middle = arr[ mid ];
}
}
int left = lo + 1;
int right = hi - 1;
if( left >= right )
return;
for( ;; )
{
while( arr[ right ].compareTo(middle ) > 0)
{
right--;
}
while( left < right && arr[ left ].compareTo(middle ) <= 0)
{
left++;
}
if( left < right )
{
tmp = arr[ left ];
arr[ left ] = arr[ right ];
arr[ right ] = tmp;
right--;
}
else
{
break;
}
}
quicksort(arr, lo, left);
quicksort(arr, left + 1, hi);
}
}
Note that
this solution has a wider application than only string keys. Any type
of object can be used as a key as long as you can create a
methodology to compare the order of the keys. This is a reasonable
solution for several types of indexing.
|