I l@ve RuBoard

2.6 Looking for Items in a Sorted Sequence Using Binary Search

Credit: Noah Spurrier, Alex Martelli

2.6.1 Problem

You need to look for a lot of items in a sequence.

2.6.2 Solution

Binary search is a basic algorithm provided by bisect in Python. How to perform a binary search to check if a value is present in a list can be summarized in three lines of code:

thelist.sort(  )
item_insert_point = bisect.bisect(thelist, theitem)
is_present = thelist[item_insert_point-1:item_insert_point] == [theitem]

2.6.3 Discussion

A friend of mine was searching a large file for missing lines. A linear search was taking forever, so a binary search was needed. The name of the bisect module misled him into believing that Python did not have a binary search algorithm. I created an example to show how bisect is used to perform a binary search.

The third line in the recipe may not be immediately obvious. If we know that thelist is not empty, this is much simpler and more obvious:

is_present = thelist[item_insert_point-1] == theitem

However, this simpler approach raises an exception for an empty thelist, since indexing must check for a valid index (the only way to return an indicator of "no such index" is by raising IndexError). Slicing is more relaxed than indexing, since it results in an empty slice for invalid slice-boundary indexes. Therefore, in general:

somelist[x:x+1]

yields the same one-item list as:

[somelist[x]]

when x is a valid index in somelist. However, it results in an empty list ([]) when the indexing would raise an IndexError. The third line in the recipe uses this idea to avoid having to deal with exceptions, thereby handling empty and nonempty cases for thelist in a perfectly uniform way. Another approach is:

is_present = thelist and thelist[item_insert_point-1] == theitem

This exploits and's short-circuiting behavior to guard the indexing, instead of using slicing. Both approaches are worth keeping in mind, since their areas of applicability overlap but do not coincide.

In my friend's case, of course, an auxiliary dictionary was also a possibility, because strings are hashable. However, the approach based on a sorted list may be the only way out in a few other cases, with items that are not hashable (but only if they are comparable, of course, since otherwise the list cannot be sorted). Also, if the list is already sorted, and the number of items you need to look up in it is not extremely large, it may be faster to use bisect rather than to build an auxiliary dictionary, since the investment of time in the latter operation might not be fully amortized.

2.6.4 See Also

Documentation for the bisect module in the Library Reference.

I l@ve RuBoard