11.9 Finding the Index for Partially Matched Strings

The problem considered here concerns a large number of string keys that need to be accessed by full or partial match. Each string is unique, so the full-match access can easily be handled by a standard hash-table structure (e.g., java.util.HashMap). The partial-match access needs to collect all objects that have string keys starting with a particular substring.

Consider this hash table consisting of keys and values:

"hello"          1
"bye"            2
"hi"             3

The full match for key "hi" retrieves 3, and the partial match against strings starting with "h" retrieves the collection {1,3}. Using a hash-table structure for the partial-match access is expensive because it requires that all keys be iterated over, and then each key matching the corresponding object needs to be collated.

Of course, I am considering here a large collection of strings. Alternatives are not usually necessary for a few (or even a few thousand) strings. But for large collections, performance-tuning techniques become necessary.

To tune, look for data structures that quickly match any partial string. The task is somewhat simpler than the most generic version of this type of problem because you need to match only the first few consecutive characters. This means that some sort of tree structure is probably ideal. Of the structures available from the JDK, TreeMap looks like it can provide exactly the required functionality; it gives a minimal baseline and, if the performance is adequate, there is no more tuning to do. But TreeMap is 5 to 10 times slower than HashMap for access and update. The target is to obtain HashMap access speed for single-key access.

Don't get carried away searching for the perfect data structure. Thinking laterally, you can consider other possibilities. If you have the strings in a sorted collection, you can apply a binary search to find the index of the string that is greater than or less than the partial string, and then obtain all the strings (and hence corresponding objects) in between.

More specifically, you can construct a sorted array of keys from the hash table. Then, if you want to find all strings starting with "h", you can run a binary search for the strings "h" and "h\uFFFF". This gives all the indexes of the band for all the keys that start with "h". Note that a binary search can return the index where the string would be even if it is not actually in the array. (The correct solution actually goes from "h" inclusive to "i" exclusive, but this solution will do for strings that don't include character \uFFFF.)

Having parallel collections can lead to all sorts of problems in making sure both collections contain the same elements. Solutions that involve parallel collections should hide all accesses and updates to the parallel collections through a separate object to ensure that all accesses and updates are consistent. The solution here is suitable mainly when the collections are updated infrequently, e.g., when they are built once or periodically and read from often. Here is a class implementing this solution:

package tuning.struct;
   
import java.util.Hashtable;
import java.util.Enumeration;
   
public class PartialSearcher
{
  Hashtable hash;
  String[  ] sortedArray;
   
  public static void main(String args[  ])
  {
    //Populate a Hashtable with ten strings
    Hashtable h = new Hashtable(  );
    h.put("hello", new Integer(1));
    h.put("hell", new Integer(2));
    h.put("alpha", new Integer(3));
    h.put("bye", new Integer(4));
    h.put("hello2", new Integer(5));
    h.put("solly", new Integer(6));
    h.put("sally", new Integer(7));
    h.put("silly", new Integer(8));
    h.put("zorro", new Integer(9));
    h.put("hi", new Integer(10));
   
    //Create the searching object
    PartialSearcher p = new PartialSearcher(h);
    //Match against all string keys given by
    //the first command line argument
    Object[  ] objs = p.match(args[0]);
    //And print the matches out
    for(int i = 0; i<objs.length; i++)
      System.out.println(objs[i]);
   
  }
   
  public PartialSearcher(Hashtable h)
  {
    hash = h;
    createSortedArray(  );
  }
   
  public Object[  ] match(String s)
  {
    //find the start and end positions of strings that match the key
    int startIdx = binarySearch(sortedArray, s,
                                0, sortedArray.length-1);
    int endIdx = binarySearch(sortedArray, s+ '\uFFFF',
                              0, sortedArray.length-1);
   
    //and return an array of the matched keys
    Object[  ] objs = new Object[endIdx-startIdx];
    for (int i = startIdx ; i < endIdx; i++)
      objs[i-startIdx] = sortedArray[i];
    return objs;
  }
   
  public void createSortedArray(  )
  {
    //Create a sorted array of the keys of the hash table
    sortedArray = new String[hash.size(  )];
    Enumeration e = hash.keys(  );
    for (int i = 0; e.hasMoreElements(  ); i++)
      sortedArray[i] = (String) e.nextElement(  );
    quicksort(sortedArray, 0, sortedArray.length-1);
  }
   
  /**
   * Semi-standard binary search returning index of match location or
   * where the location would match if it is not present.
   */
  public static int binarySearch(String[  ] arr, String elem,
                                 int fromIndex, int toIndex)
  {
    int mid,cmp;
    while (fromIndex <= toIndex)
    {
      mid =(fromIndex + toIndex)/2;
      if ( (cmp = arr[mid].compareTo(elem)) < 0)
        fromIndex = mid + 1;
      else if (cmp > 0)
        toIndex = mid - 1;
      else
        return mid;
    }
    return fromIndex;
  }
   
  /**
   * Standard quicksort
   */
  public void quicksort(String[  ] arr, int lo, int hi)
  {
    if( lo >= hi ) 
      return;
   
    int mid = ( lo + hi ) / 2;
    String tmp;
    String middle = arr[ mid ];
   
    if( arr[ lo ].compareTo(middle) > 0 )
    {
      arr[ mid ] = arr[ lo ];
      arr[ lo ] = middle;
      middle = arr[ mid ];
    }
  
    if( middle.compareTo(arr[ hi ]) > 0)
    {
      arr[ mid ] = arr[ hi ];
      arr[ hi ] = middle;
      middle = arr[ mid ];
   
      if( arr[ lo ].compareTo(middle) > 0)
      {
        arr[ mid ] = arr[ lo ];
        arr[ lo ] = middle;
        middle = arr[ mid ];
      }
    }
   
    int left = lo + 1;
    int right = hi - 1;
   
    if( left >= right ) 
      return;
   
    for( ;; ) 
    {
      while( arr[ right ].compareTo(middle ) > 0)
      {
        right--;
      }
   
      while( left < right && arr[ left ].compareTo(middle ) <= 0)
      {
        left++;
      }
   
      if( left < right )
      {
        tmp = arr[ left ];
        arr[ left ] = arr[ right ];
        arr[ right ] = tmp;
        right--;
      }
      else
      {
        break;
      }
    }
   
    quicksort(arr, lo, left);
    quicksort(arr, left + 1, hi);
  }
}

Note that this solution has a wider application than only string keys. Any type of object can be used as a key as long as you can create a methodology to compare the order of the keys. This is a reasonable solution for several types of indexing.