17.7 Counting Items and Sorting by Incidence (Histograms)
Credit: John Jensen, Fred Bremmer
17.7.1 Problem
You
need to produce ascending- or descending-count histograms, such as
the most or least common words in a file, popular pages on a web
site, etc.
17.7.2 Solution
Histogramming is basically an issue of counting item occurrences (a
Python
dictionary makes this quite easy)
and sorting by the counts. In Python, the two actions, and the
dictionary that holds the counts, are easily wrapped into a class:
class Counter:
def _ _init_ _(self):
self.dict = {}
def add(self, item):
count = self.dict.get(item, 0)
self.dict[item] = count + 1
def counts(self, desc=None):
""" Returns list of keys sorted by values.
Pass desc as 1 if you want a descending sort. """
result = map(None, self.dict.values(), self.dict.keys( ))
result.sort( )
if desc: result.reverse( )
return result
17.7.3 Discussion
The
add
method shows the normal Python idiom for counting occurrences of
arbitrary (but hashable) items, using a dictionary to hold the
counts. The counts method is where all the action is. It
takes the dictionary and produces an ascending or descending sort of
keys by values, returning a list of pairs representing the desired
histogram. The
map
call takes advantage of an interesting but little-known tidbit of
documented Python behavior. While the values and
keys methods of a dictionary return their results
in an arbitrary order, the ordering is compatible when the two
methods are called without any intervening modification to the
dictionary object. In other words, d[d.keys( )[x]]
is d.values(x) for any valid index
x. This lets us elegantly zip values and keys with
the value as the first item and the key as the second item in each
pair, so the
sort
method will work right (by using map with a first
argument of None rather than
zip, we keep compatibility with 1.5.2).
Here is an example:
sentence = "Hello there this is a test. Hello there this was a test, " \
"but now it is not."
words = sentence.split( )
c = Counter( )
for word in words:
c.add(word)
print "Ascending count:"
print c.counts( )
print "Descending count:"
print c.counts(1)
This produces:
Ascending count:
[(1, 'but'), (1, 'it'), (1, 'not.'), (1, 'now'), (1, 'test,'), (1, 'test.'),
(1, 'was'), (2, 'Hello'), (2, 'a'), (2, 'is'), (2, 'there'), (2, 'this')]
Descending count:
[(2, 'this'), (2, 'there'), (2, 'is'), (2, 'a'), (2, 'Hello'), (1, 'was'),
(1, 'test.'), (1, 'test,'), (1, 'now'), (1, 'not.'), (1, 'it'), (1, 'but')]
If you give up on 1.5.2 compatibility and use a list comprehension
instead of the map call, the code arguably becomes
a little easier to read:
def counts(self, desc=None):
result = [(val, key) for key, val in self.dict.items( )]
result.sort( )
if desc: result.reverse( )
return result
However, if this issue ever arises in a spot of your program that is
a critical speed bottleneck, you should measure performance
accurately for each version of counts. Often (but
not always), map displays surprisingly good
performance characteristics when compared to list comprehensions (at
least when no lambda is involved in the use of
map).
|