Recipe 4.8 Finding Elements in One Array but Not Another

4.8.1 Problem

You want to find elements that are in one array but not another.

4.8.2 Solution

You want to find elements in @A that aren't in @B. Build a hash of the keys of @B to use as a lookup table. Then check each element in @A to see whether it is in @B.

4.8.2.1 Straightforward implementation

# assume @A and @B are already loaded
%seen = ( );                    # lookup table to test membership of B
@aonly = ( );                   # answer

# build lookup table
foreach $item (@B) { $seen{$item} = 1 }

# find only elements in @A and not in @B
foreach $item (@A) {
    unless ($seen{$item}) {
        # it's not in %seen, so add to @aonly
        push(@aonly, $item);
    }
}

4.8.2.2 More idiomatic version

my %seen;     # lookup table
my @aonly;    # answer

# build lookup table
@seen{@B} = ( );

foreach $item (@A) {
    push(@aonly, $item) unless exists $seen{$item};
}

4.8.2.3 Loopless version

my @A = ...;
my @B = ...;

my %seen;
@seen {@A} = ( );
delete @seen {@B};

my @aonly = keys %seen;

4.8.3 Discussion

As with nearly any problem in Perl that asks whether a scalar is in one list or another, this one uses a hash. First, process @B so that the %seen hash records each element from @B by setting its value to 1. Then process @A one element at a time, checking whether that particular element had been in @B by consulting the %seen hash.

The given code retains duplicate elements in @A. This can be fixed easily by adding the elements of @A to %seen as they are processed:

foreach $item (@A) {
    push(@aonly, $item) unless $seen{$item};
    $seen{$item} = 1;                       # mark as seen
}

The first two solutions differ mainly in how they build the hash. The first iterates through @B. The second uses a hash slice to initialize the hash. A hash slice is easiest illustrated by this example:

$hash{"key1"} = 1;
$hash{"key2"} = 2;

which is equivalent to:

@hash{"key1", "key2"} = (1,2);

The list in the curly braces holds the keys; the list on the right holds the values. We initialize %seen in the first solution by looping over each element in @B and setting the appropriate value of %seen to 1. In the second, we simply say:

@seen{@B} = ( );

This uses items in @B as keys for %seen, setting each corresponding value to undef, because there are fewer values on the right than places to put them. This works out here because we check for existence of the key, not logical truth or definedness of the value. If we needed true values, a slice could still shorten our code:

@seen{@B} = (1) x @B;

In the third solution, we make use of this property even further and avoid explicit loops altogether. (Not that avoiding loops should be construed as being particularly virtuous; we're just showing you that there's more than one way to do it.) The slice assignment makes any element that was in @A a key, and the slice deletion removes from the hash any keys that were elements of @B, leaving those that were only in @A.

A fairly common situation where this might arise is when you have two files and would like to know which lines from the second file either were or weren't in the first. Here's a simple solution based on this recipe:

open(OLD, $path1)        || die "can't open $path1: $!";
@seen{ <OLD> } = ( );
open(NEW, $path2)        || die "can't open $path2: $!";
while (<NEW>) {
    print if exists $seen{$_};  
}

This shows the lines in the second file that were already seen in the first one. Use unless instead of if to show the lines in the second file that were not in the first.

Imagine two files, the first containing the lines:

red
yellow
green
blue

and the second containing:

green
orange
purple
black
yellow

The output using if would be:

green 
yellow

and the output using unless would be:

orange 
purple
black

You could even do this from the command line; given a suitable cat(1) program, it's easy:

% perl -e '@s{`cat OLD`}=( ); exists $s{$_} && print for `cat NEW`'
% perl -e '@s{`cat OLD`}=( ); exists $s{$_} || print for `cat NEW`'

You'd find that you just emulated these calls to the Unix fgrep(1) program:

% fgrep -Ff  OLD NEW
% fgrep -vFf OLD NEW

4.8.4 See Also

Hash slices are explained in perldata(1) and the "Variables" section of Chapter 2 of Programming Perl; Chapter 5; we use hashes in a similar fashion in Recipe 4.7 and Recipe 4.9

[ Team LiB ]