2.6 Looking for Items in a Sorted Sequence Using Binary Search
Credit: Noah Spurrier, Alex Martelli
2.6.1 Problem
You need to look for a lot of
items in a sequence.
2.6.2 Solution
Binary search is a basic algorithm provided by
bisect in Python. How to perform a binary search
to check if a value is present in a list can be summarized in three
lines of code:
thelist.sort( )
item_insert_point = bisect.bisect(thelist, theitem)
is_present = thelist[item_insert_point-1:item_insert_point] == [theitem]
2.6.3 Discussion
A friend of mine was searching a large file for missing lines. A
linear search was taking forever, so a binary search was needed. The
name of the bisect module misled him into believing that
Python did not have a binary search algorithm. I created an example
to show how bisect is used to perform a binary
search.
The third line in the recipe may not be immediately obvious. If we
know that thelist is not empty, this is much
simpler and more obvious:
is_present = thelist[item_insert_point-1] == theitem
However, this simpler approach raises an exception for an empty
thelist, since indexing must check for a valid
index (the only way to return an indicator of "no
such index" is by raising
IndexError). Slicing is more relaxed than
indexing, since it results in an empty slice for invalid
slice-boundary indexes. Therefore, in general:
somelist[x:x+1]
yields the same one-item list as:
[somelist[x]]
when x is a valid index in
somelist. However, it results in an empty list
([]) when the indexing would raise an
IndexError. The third line in the recipe uses this
idea to avoid having to deal with exceptions, thereby handling empty
and nonempty cases for thelist in a perfectly
uniform way. Another approach is:
is_present = thelist and thelist[item_insert_point-1] == theitem
This exploits and's
short-circuiting behavior to guard the indexing, instead of using
slicing. Both approaches are worth keeping in mind, since their areas
of applicability overlap but do not coincide.
In my friend's case, of course, an auxiliary
dictionary was also a possibility, because strings are hashable.
However, the
approach based on a sorted list may be the only way out in a few
other cases, with items that are not hashable (but only if they are
comparable, of course, since otherwise the list cannot be sorted).
Also, if the list is already sorted, and the number of items you need
to look up in it is not extremely large, it may be faster to use
bisect rather than to build an auxiliary
dictionary, since the investment of time in the latter operation
might not be fully amortized.
2.6.4 See Also
Documentation for the bisect module in the
Library Reference.
|