I l@ve RuBoard Previous Section Next Section

17.7 Counting Items and Sorting by Incidence (Histograms)

Credit: John Jensen, Fred Bremmer

17.7.1 Problem

You need to produce ascending- or descending-count histograms, such as the most or least common words in a file, popular pages on a web site, etc.

17.7.2 Solution

Histogramming is basically an issue of counting item occurrences (a Python dictionary makes this quite easy) and sorting by the counts. In Python, the two actions, and the dictionary that holds the counts, are easily wrapped into a class:

class Counter:
    def _ _init_ _(self):
        self.dict = {}
    def add(self, item):
        count = self.dict.get(item, 0)
        self.dict[item] = count + 1
    def counts(self, desc=None):
        """ Returns list of keys sorted by values.
        Pass desc as 1 if you want a descending sort. """
        result = map(None, self.dict.values(), self.dict.keys(  ))
        result.sort(  )
        if desc: result.reverse(  )
        return result

17.7.3 Discussion

The add method shows the normal Python idiom for counting occurrences of arbitrary (but hashable) items, using a dictionary to hold the counts. The counts method is where all the action is. It takes the dictionary and produces an ascending or descending sort of keys by values, returning a list of pairs representing the desired histogram. The map call takes advantage of an interesting but little-known tidbit of documented Python behavior. While the values and keys methods of a dictionary return their results in an arbitrary order, the ordering is compatible when the two methods are called without any intervening modification to the dictionary object. In other words, d[d.keys( )[x]] is d.values(x) for any valid index x. This lets us elegantly zip values and keys with the value as the first item and the key as the second item in each pair, so the sort method will work right (by using map with a first argument of None rather than zip, we keep compatibility with 1.5.2).

Here is an example:

sentence = "Hello there this is a test.  Hello there this was a test, " \
           "but now it is not."
words = sentence.split(  )
c = Counter(  )
for word in words:
    c.add(word)
print "Ascending count:"
print c.counts(  )
print "Descending count:"
print c.counts(1)

This produces:

Ascending count:
[(1, 'but'), (1, 'it'), (1, 'not.'), (1, 'now'), (1, 'test,'), (1, 'test.'),
(1, 'was'), (2, 'Hello'), (2, 'a'), (2, 'is'), (2, 'there'), (2, 'this')]
Descending count:
[(2, 'this'), (2, 'there'), (2, 'is'), (2, 'a'), (2, 'Hello'), (1, 'was'),
(1, 'test.'), (1, 'test,'), (1, 'now'), (1, 'not.'), (1, 'it'), (1, 'but')]

If you give up on 1.5.2 compatibility and use a list comprehension instead of the map call, the code arguably becomes a little easier to read:

def counts(self, desc=None):
    result = [(val, key) for key, val in self.dict.items(  )]
    result.sort(  )
    if desc: result.reverse(  )
    return result

However, if this issue ever arises in a spot of your program that is a critical speed bottleneck, you should measure performance accurately for each version of counts. Often (but not always), map displays surprisingly good performance characteristics when compared to list comprehensions (at least when no lambda is involved in the use of map).

    I l@ve RuBoard Previous Section Next Section