
17.5 Removing Duplicates from a Sequence While Maintaining Sequence Order

Credit: Alex Martelli

17.5.1 Problem

You have a sequence that may include duplicates, and you need to remove the duplicates in the fastest possible way. Also, the output sequence must respect the item ordering of the input sequence.

17.5.2 Solution

The need to respect the item ordering of the input sequence means that picking unique items is a very different problem from the one explored in Recipe 17.4. This need often arises together with a function f that defines an equivalence relation among items (i.e., x is equivalent to y if and only if f(x)==f(y)), in which case removing duplicates is better described as picking the first representative of each occurring equivalence class:

# f defines an equivalence relation among items of sequence seq, and
# f(x) must be hashable for each item x of seq (e.g., cPickle.dumps)
def uniquer(seq, f=None):
    """ Keeps earliest occurring item of each f-defined equivalence class """
    if f is None:    # f's default is the identity function
        def f(x): return x
    already_seen = {}
    result = []
    for item in seq:
        marker = f(item)
        # Python 2.2-ism; in older Pythons, use not already_seen.get(marker, 0)
        if marker not in already_seen:
            already_seen[marker] = 1
            result.append(item)
    return result
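uniquer runs unchanged on modern Pythons (marker not in already_seen is still the current idiom). Repeated here with a quick demonstration, using str.lower as a hypothetical equivalence function so that words differing only in case fall in the same class:

```python
def uniquer(seq, f=None):
    """ Keeps earliest occurring item of each f-defined equivalence class """
    if f is None:    # f's default is the identity function
        def f(x): return x
    already_seen = {}
    result = []
    for item in seq:
        marker = f(item)
        if marker not in already_seen:
            already_seen[marker] = 1
            result.append(item)
    return result

print(uniquer([1, 2, 1, 3, 2]))                     # [1, 2, 3]
print(uniquer(["Ape", "apple", "APE"], str.lower))  # ['Ape', 'apple']
```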

Picking the most recent (last occurring) representative of each equivalence class is a bit harder:

def uniquest(seq, f=None):
    """ Keeps last occurring item of each f-defined equivalence class.
    However, it's O(N+N1*log(N1)), in which N1 is the count of "unique" items. """
    import sys
    if f is None:
        def f(x): return x
    already_seen = {}
    for item, index in zip(seq, xrange(sys.maxint)):
        marker = f(item)
        already_seen[marker] = index, item
    auxlist = already_seen.values()
    auxlist.sort()    # the O(N1*log(N1)) step
    return [item for index, item in auxlist]
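On Pythons with enumerate (which appeared in 2.3, after this recipe was written), the same approach can replace the zip(seq, xrange(sys.maxint)) idiom; a minimal modern sketch:

```python
def uniquest(seq, f=None):
    """ Keeps last occurring item of each f-defined equivalence class;
    O(N+N1*log(N1)), where N1 is the count of "unique" items. """
    if f is None:
        def f(x): return x
    already_seen = {}
    for index, item in enumerate(seq):
        # later occurrences overwrite earlier ones, so the last index survives
        already_seen[f(item)] = index, item
    # sorting the surviving (index, item) pairs restores the input ordering
    return [item for index, item in sorted(already_seen.values())]

print(uniquest([1, 2, 1, 3, 2]))  # [1, 3, 2]
```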

def uniquique(seq, f=None):
    """ Keeps last occurring item of each f-defined equivalence class.
    O(N), but slower than uniquest in many practical cases. """
    if f is None:
        def f(x): return x
    already_seen = {}
    result = []
    seq = list(seq)
    seq.reverse()
    for item in seq:
        marker = f(item)
        # Python 2.2-ism; in older Pythons, use not already_seen.get(marker, 0)
        if marker not in already_seen:
            already_seen[marker] = 1
            result.append(item)
    result.reverse()
    return result

def uniquoque(seq, f=None):
    """ Keeps last occurring item of each f-defined equivalence class.
    Also O(N). """
    import sys
    if f is None:
        def f(x): return x
    where_seen = {}
    output_this_item = [0]*len(seq)
    for item, index in zip(seq, xrange(sys.maxint)):
        marker = f(item)
        previously_seen = where_seen.get(marker)
        if previously_seen is not None:
            output_this_item[previously_seen] = 0
        output_this_item[index] = 1
        where_seen[marker] = index
    return [item for item, output_this in zip(seq, output_this_item)
            if output_this]
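The parallel-flags idiom of uniquoque translates directly to modern Python with enumerate; a sketch:

```python
def uniquoque(seq, f=None):
    """ Keeps last occurring item of each f-defined equivalence class, O(N),
    using a list of flags parallel to seq. """
    if f is None:
        def f(x): return x
    where_seen = {}
    output_this_item = [False] * len(seq)
    for index, item in enumerate(seq):
        marker = f(item)
        previously_seen = where_seen.get(marker)
        if previously_seen is not None:
            # an earlier occurrence of this class is superseded
            output_this_item[previously_seen] = False
        output_this_item[index] = True
        where_seen[marker] = index
    return [item for item, keep in zip(seq, output_this_item) if keep]

print(uniquoque([1, 2, 1, 3, 2]))  # [1, 3, 2]
```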

These functions can be made more general (without adding substantial complication) by adding another argument p, which is a function that picks the most suitable item of each equivalence class, either when presented with a pair of candidates (index and item) or with a list of indexes and items for each whole equivalence class:

def fancy_unique(seq, f, p):
    """ Keeps "most-appropriate" item of each f-defined equivalence class,
    with precedence function p doing pairwise choice of (index, item) """
    import sys
    already_seen = {}
    for item, index in zip(seq, xrange(sys.maxint)):
        marker = f(item)
        if already_seen.has_key(marker):  # or, "if marker in already_seen"
            # It's NOT a problem to rebind index and item within the
            # for loop: the next leg of the loop does not use their bindings
            index, item = p((index, item), already_seen[marker])
        already_seen[marker] = index, item
    auxlist = already_seen.values()
    auxlist.sort()
    return [item for index, item in auxlist]

def fancier_uniquer(seq, f, p):
    """ Keeps "most-appropriate" item of each f-defined equivalence class,
    with precedence function p choosing the appropriate (index, item) for each
    equivalence class from the list of candidates passed to it """
    import sys
    already_seen = {}
    for item, index in zip(seq, xrange(sys.maxint)):
        marker = f(item)
        already_seen.setdefault(marker, []).append((index, item))
    auxlist = [p(candidates) for candidates in already_seen.values()]
    auxlist.sort()
    return [item for index, item in auxlist]
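A modern sketch of fancier_uniquer, with a hypothetical whole-class precedence function that keeps the shortest word beginning with each letter (the names words and shortest are illustrative, not from the recipe):

```python
def fancier_uniquer(seq, f, p):
    """ Keeps p's choice among all (index, item) candidates of each
    f-defined equivalence class. """
    already_seen = {}
    for index, item in enumerate(seq):
        already_seen.setdefault(f(item), []).append((index, item))
    return [item for index, item in sorted(p(c) for c in already_seen.values())]

def shortest(candidates):
    # pick the (index, item) pair whose item is shortest
    return min(candidates, key=lambda pair: len(pair[1]))

words = ["apple", "ax", "banana", "bo", "avocado"]
print(fancier_uniquer(words, lambda w: w[0], shortest))  # ['ax', 'bo']
```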

17.5.3 Discussion

Recipe 17.4 is applicable only if you do not care about item ordering or, in other words, if the sequences involved are meaningful only as sets of items, which is often the case. When sequential order is significant, a different approach is needed.

If the items are hashable, it's not hard to maintain sequence order, keeping only the first occurrence of each value. If the items are not hashable, but are of types supported by cPickle.dumps, it may be worth applying cPickle.dumps to each item to get a hashable proxy, at least for long-enough sequences. Another possibility suggested by this approach is to handle uniqueness within equivalence classes: have the uniqueness function accept an argument f that must return hashable objects, such that f(x)==f(y) if and only if items x and y are equivalent. Identity (in the mathematical sense, not in the Python sense) is used as the default when no argument f is supplied, but the caller can pass cPickle.dumps or whatever other equivalence-defining function is appropriate. This approach is shown in the uniquer function in the solution.
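For instance, lists are unhashable, but pickle.dumps (the Python 3 spelling of cPickle.dumps) yields a hashable byte string, so it can serve as the equivalence function for uniquer. A minimal sketch (note that pickle equality is only a proxy for item equality, and can diverge for more complex types):

```python
import pickle

def uniquer(seq, f=None):
    """ Keeps earliest occurring item of each f-defined equivalence class """
    if f is None:
        def f(x): return x
    already_seen = {}
    result = []
    for item in seq:
        marker = f(item)
        if marker not in already_seen:
            already_seen[marker] = 1
            result.append(item)
    return result

data = [[1, 2], [3], [1, 2], [3, 4]]
print(uniquer(data, pickle.dumps))  # [[1, 2], [3], [3, 4]]
```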

If you need to keep the last occurrence, rather than the earliest, of each item in an equivalence class, a different approach may be appropriate, as shown in the uniquest function in the solution. In this case, we make one pass through the input sequence, associating with each equivalence class the latest index at which it occurs, then sort those indexes to reconstruct the ordering of the output sequence.

However, the sort degrades performance to O(N1*log(N1)), in which N1 is the number of unique items. To keep the last occurrence of each item with O(N) performance, it's simplest to reverse the input sequence (or a copy of it in a local list, since the input sequence might be immutable), pick first occurrences, and reverse the result, as shown in uniquique. An alternative approach, shown in uniquoque, is to build and maintain a list of flags parallel to seq, in which each flag is true if and only if the corresponding item must be part of the output sequence; a list comprehension (or a loop) then builds the output in a separate second pass. Each of these general idioms has many uses and is worth keeping in mind as a worthwhile sequence-processing technique.

But coming back to uniquest, it's interesting to notice that it easily generalizes to cases in which the choice among multiple items in the same equivalence class depends on an arbitrary precedence function p that considers both the actual items and their indexes of occurrence. As long as function p can operate pairwise, you only need to replace the simple assignment used in uniquest:

already_seen[marker] = index, item

with a call to the precedence function, which returns the (index, item) pair for the chosen occurrence among the two. Precedence functions that need to examine the whole set of equivalent items to make their choice can also be accommodated, of course, but you need to build the set in one pass and perform only the selections when that pass is finished. These fancy approaches are clearly only useful for substantial equivalence functions (not for identity, nor for functions meant to act as proxies for identity, such as cPickle.dumps), so f defaulting to the identity function has been removed from the fancy_unique and fancier_uniquer functions, which show these (perhaps overgeneralized) approaches.

An example of fancy_unique may help. Say we're given a list of words, and we need to get a sublist from it, respecting order, such that no two words on the sublist begin with the same letter. Out of all the words in the original list that begin with each given letter, we need to keep the longest word and, in case of equal lengths, the word appearing later on the list. This sounds complicated, but with fancy_unique to help us, it's really not that bad:

def complicated_choice(words):
    def first_letter(aword): return aword[0].lower()
    def prefer((indx1, word1), (indx2, word2)):
        if len(word2) > len(word1): return indx2, word2
        else: return indx1, word1
    return fancy_unique(words, first_letter, prefer)

The prefer function is simplified because it knows fancy_unique always calls it with indx2<indx1. So the older indx2, word2 pair must be returned only when word2 is longer than word1; otherwise, indx1, word1 is always the proper result. The automatic tuple unpacking in prefer's signature is debatable, style-wise, but I personally like it (it reminds me of Haskell).
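Tuple parameters were later removed from the language (in Python 3), so a modern sketch of the same example unpacks explicitly inside prefer:

```python
def fancy_unique(seq, f, p):
    """ Keeps "most-appropriate" item of each f-defined equivalence class """
    already_seen = {}
    for index, item in enumerate(seq):
        marker = f(item)
        if marker in already_seen:
            index, item = p((index, item), already_seen[marker])
        already_seen[marker] = index, item
    return [item for index, item in sorted(already_seen.values())]

def complicated_choice(words):
    def first_letter(aword): return aword[0].lower()
    def prefer(new, old):
        # explicit unpacking replaces Python 2's tuple parameters
        (indx1, word1), (indx2, word2) = new, old
        if len(word2) > len(word1): return indx2, word2
        else: return indx1, word1
    return fancy_unique(words, first_letter, prefer)

print(complicated_choice(["and", "arm", "bind", "all", "bell"]))
# ['all', 'bell']  -- ties on length resolve to the later word
```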

Out of all the general programming techniques presented in the various functions of this recipe, the most valuable is easily that of writing higher-order functions, which organize a computation and call back to functions they receive as arguments. It is well worth keeping in mind in many circumstances, and not just for old Haskell-heads, as it often works great in Python.

17.5.4 See Also

Recipe 17.4.
