
Recipe 6.8 Extracting a Range of Lines

6.8.1 Problem

You want to extract all lines from a starting pattern through an ending pattern or from a starting line number up to an ending line number.

A common example of this is extracting the first 10 lines of a file (line numbers 1 to 10) or just the body of a mail message (everything past the blank line).

6.8.2 Solution

Use the operators .. or ... with patterns or line numbers.

The .. operator will test the right operand on the same iteration that the left operand flips the operator into the true state.

while (<>) {
    if (/BEGIN PATTERN/ .. /END PATTERN/) {
        # line falls between BEGIN and END in the
        # text, inclusive.
    }
}

while (<>) {
    if (FIRST_LINE_NUM .. LAST_LINE_NUM) {
        # operate only between first and last line, inclusive.
    }
}

But the ... operator waits until the next iteration to check the right operand.

while (<>) {
    if (/BEGIN PATTERN/ ... /END PATTERN/) {
        # line falls between BEGIN and END, inclusive; END is
        # looked for only on lines after the one matching BEGIN
    }
}

while (<>) {
    if (FIRST_LINE_NUM ... LAST_LINE_NUM) {
        # operate only between first and last line, inclusive;
        # differs from .. only when both numbers name the same line
    }
}

6.8.3 Discussion

The range operators, .. and ..., are probably the least understood of Perl's myriad operators. They were designed to allow easy extraction of ranges of lines without forcing the programmer to retain explicit state information. Used in scalar context, such as in the test of if and while statements, these operators return a true or false value that's partially dependent on what they last returned. The expression left_operand .. right_operand returns false until left_operand is true, but once that test has been met, it stops evaluating left_operand and keeps returning true until right_operand becomes true, after which it restarts the cycle. Put another way, the first operand turns on the construct as soon as it returns a true value, whereas the second one turns it off as soon as it returns true.
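To make the hidden state concrete, here is a minimal sketch that runs the same extraction twice: once with the range operator and once with a hand-maintained flag. The sample lines and the names @with_operator, @with_flag, and $inside are made up for illustration; they do not come from the recipe.

```perl
use strict;
use warnings;

# Hypothetical sample data, not from the recipe.
my @lines = ("intro\n", "BEGIN\n", "payload\n", "END\n", "outro\n");

# Version 1: the range operator remembers whether we are inside the region.
my @with_operator;
for (@lines) {
    push @with_operator, $_ if /BEGIN/ .. /END/;
}

# Version 2: the same bookkeeping done by hand with an explicit flag.
my @with_flag;
my $inside = 0;
for (@lines) {
    if (!$inside) {
        next unless /BEGIN/;    # left operand: switch on
        $inside = 1;
    }
    push @with_flag, $_;
    $inside = 0 if /END/;       # right operand: switch off (also tested on the opening line, as .. does)
}

print @with_operator;           # BEGIN, payload, END
print "same result\n" if "@with_operator" eq "@with_flag";
```

Both versions select the same three lines; the operator simply hides the flag.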

The two operands are completely arbitrary. You could write mytestfunc1() .. mytestfunc2(), although this is rarely seen. Instead, the range operators are usually used with line numbers as operands (the first example below), patterns as operands (the second example below), or both.

# command-line to print lines 15 through 17 inclusive (see below)
perl -ne 'print if 15 .. 17' datafile

# print all <XMP> .. </XMP> displays from HTML doc
while (<>) {
    print if m#<XMP>#i .. m#</XMP>#i;
}

# same, but as shell command
% perl -ne 'print if m#<XMP>#i .. m#</XMP>#i' document.html

If either operand is a numeric literal, the range operators implicitly compare against the $. variable ($NR or $INPUT_LINE_NUMBER if you use English). Be careful with implicit line number comparisons here. You must specify literal numbers in your code, not variables containing line numbers. That means you simply say 3 .. 5 in a conditional, but not $n .. $m where $n and $m are 3 and 5 respectively. For that, be more explicit by testing the $. variable directly.

perl -ne 'BEGIN { $top=3; $bottom=5 }  print if $top .. $bottom' /etc/passwd    # WRONG
perl -ne 'BEGIN { $top=3; $bottom=5 }
    print if $. == $top .. $. == $bottom' /etc/passwd    # RIGHT
perl -ne 'print if 3 .. 5' /etc/passwd   # also RIGHT
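Why the variable form goes wrong is easy to see in a small test. The sketch below reads six made-up lines from an in-memory filehandle (an assumption for the demo; the recipe reads /etc/passwd) and collects what each form selects. Because $top and $bottom are plain true values rather than literals, the range switches on and off again on every line, so every line is selected.

```perl
use strict;
use warnings;

# Hypothetical input: six lines held in memory instead of a real file.
my $text = "line 1\nline 2\nline 3\nline 4\nline 5\nline 6\n";
my ($top, $bottom) = (3, 5);

open my $fh, '<', \$text or die "cannot open in-memory file: $!";
my (@wrong, @right);
while (<$fh>) {
    # Variables are not compared against $. ; both are simply true.
    push @wrong, $_ if $top .. $bottom;
    # Explicit comparison against the current line number.
    push @right, $_ if $. == $top .. $. == $bottom;
}
close $fh;

print "wrong form selected ", scalar(@wrong), " lines\n";   # 6
print "right form selected ", scalar(@right), " lines\n";   # 3
print @right;                                               # lines 3 through 5
```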

The difference between .. and ... is their behavior when both operands become true on the same iteration. Consider these two cases:

print if /begin/ ..  /end/;
print if /begin/ ... /end/;

Given the line "You may not end ere you begin", both versions of the previous range operator return true. But the code using .. won't print any further lines. That's because .. tests both conditions on the same line once the first test matches, and the second test tells it that it's reached the end of its region. On the other hand, ... continues until the next line that matches /end/ because it never tries to test both operands on the same line.
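A short sketch makes the difference visible. The four sample lines are invented for the demonstration; only the first one is taken from the text above, and it matches both /begin/ and /end/.

```perl
use strict;
use warnings;

# Hypothetical sample text; the first line matches both patterns.
my @lines = (
    "You may not end ere you begin\n",
    "a middle line\n",
    "this is the end\n",
    "a trailing line\n",
);

my (@two_dot, @three_dot);
for (@lines) {
    # .. tests /end/ on the same line that matched /begin/.
    push @two_dot, $_ if /begin/ .. /end/;
}
for (@lines) {
    # ... waits for a later line before testing /end/.
    push @three_dot, $_ if /begin/ ... /end/;
}

print "two dots:   ", scalar(@two_dot),   " line(s)\n";   # 1
print "three dots: ", scalar(@three_dot), " line(s)\n";   # 3
```

With two dots the region opens and closes on the first line. With three dots it stays open through "this is the end".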

You may mix and match conditions of different sorts, as in:

while (<>) {
    $in_header =   1  .. /^$/;
    $in_body   = /^$/ .. eof( );
}

The first assignment sets $in_header to be true from the first input line until after the blank line separating the header, such as from a mail message, a USENET news posting, or even an HTTP header. (Technically, an HTTP header should have linefeeds and carriage returns as network line terminators, but in practice, servers are liberal in what they accept.) The second assignment sets $in_body to true as soon as the first blank line is encountered, up through end-of-file. Because range operators do not retest their initial condition, any further blank lines, like those between paragraphs, won't be noticed.
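As an illustration, the following sketch applies those two assignments to a made-up message held in memory. The message text and the names @header and @body are assumptions for the demo, and eof($fh) on a lexical filehandle stands in for the recipe's eof() on the ARGV files. Note that the separating blank line belongs to both ranges, and that the blank line inside the body does not reopen the header.

```perl
use strict;
use warnings;

# Hypothetical message; the body contains a blank line of its own.
my $message = <<'END_MESSAGE';
From: alice@example.com
Subject: test

First paragraph.

Second paragraph.
END_MESSAGE

open my $fh, '<', \$message or die "cannot open in-memory file: $!";
my (@header, @body);
while (<$fh>) {
    my $in_header =   1  .. /^$/;        # line 1 through the first blank line
    my $in_body   = /^$/ .. eof($fh);    # first blank line through end-of-file
    push @header, $_ if $in_header;
    push @body,   $_ if $in_body;
}
close $fh;

print "header lines: ", scalar(@header), "\n";   # 3 (two fields plus the blank line)
print "body lines:   ", scalar(@body),   "\n";   # 4 (blank, paragraph, blank, paragraph)
```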

Here's an example. It reads files containing mail messages and prints the addresses it finds in headers. Each address is printed only once. The header extends from a line beginning with "From" (with or without a colon) through the first blank line. Lines outside that range are skipped. The pattern isn't an RFC-822 notion of an address, but it is easy to write.

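%seen = ();                               # addresses already printed
while (<>) {
    # skip lines outside a From ... blank-line header block
    next unless /^From:?\s/i .. /^$/;
    # grab every user@host token on the line
    while (/([^<>(),;\s]+\@[^<>(),;\s]+)/g) {
        print "$1\n" unless $seen{$1}++;
    }
}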

6.8.4 See Also

The .. and ... operators in the "Range Operator" sections of perlop(1) and Chapter 3 of Programming Perl; the entry for $NR in perlvar(1) and the "Per-Filehandle Variables" section of Chapter 28 of Programming Perl
