Recipe 6.8 Extracting a Range of Lines
6.8.1 Problem
You want to extract all lines from a
starting pattern through an ending pattern or from a starting line
number up to an ending line number.
A common example is extracting the first 10 lines of a file
(line numbers 1 to 10) or just the body of a mail message (everything
after the first blank line).
6.8.2 Solution
Use the operators
.. or ... with patterns or line
numbers.
The .. operator will test the right operand on the
same iteration that the left operand flips the operator into the true
state.
while (<>) {
    if (/BEGIN PATTERN/ .. /END PATTERN/) {
        # line falls between BEGIN and END in the
        # text, inclusive.
    }
}
while (<>) {
    if (FIRST_LINE_NUM .. LAST_LINE_NUM) {
        # operate only between first and last line, inclusive.
    }
}
But the ... operator waits until the next
iteration to check the right operand.
while (<>) {
    if (/BEGIN PATTERN/ ... /END PATTERN/) {
        # line is between BEGIN and END on different lines
    }
}
while (<>) {
    if (FIRST_LINE_NUM ... LAST_LINE_NUM) {
        # operate between first and last line, inclusive; the
        # end test is never made on the line that starts the range
    }
}
6.8.3 Discussion
The range operators, .. and
..., are probably the least understood of Perl's
myriad operators. They were designed to allow easy extraction of
ranges of lines without forcing the programmer to retain explicit
state information. Used in scalar context, such as in the test of
if and while statements, these
operators return a true or false value that's partially dependent on
what they last returned. The expression
left_operand ..
right_operand returns false until
left_operand is true, but once that test has been
met, it stops evaluating left_operand and keeps
returning true until right_operand becomes true,
after which it restarts the cycle. Put another way, the first operand
turns on the construct as soon as it returns a true value, whereas
the second one turns it off as soon as it
returns true.
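To make the hidden state concrete, here is a minimal sketch that selects the same lines twice: once with a hand-maintained flag and once with the .. operator. The sample lines and the variable names ($in_range, @manual, @flipflop) are made up for illustration; this is only one way to emulate the behavior described above, not a description of how Perl implements it.

```perl
use strict;
use warnings;

# Hypothetical sample input; the BEGIN and END markers are made up for illustration.
my @lines = ("alpha\n", "BEGIN\n", "beta\n", "END\n", "gamma\n");

# Hand-rolled equivalent of: /BEGIN/ .. /END/
my $in_range = 0;                   # the state that .. keeps for you
my @manual;
for (@lines) {
    if (!$in_range) {
        next unless /BEGIN/;        # left operand is tested only while "off"
        $in_range = 1;
    }
    push @manual, $_;               # this line is inside the range
    $in_range = 0 if /END/;         # right operand is tested only while "on"
}

# The same selection using the operator's built-in state.
my @flipflop;
for (@lines) {
    push @flipflop, $_ if /BEGIN/ .. /END/;
}

print "manual:   ", scalar(@manual), " lines\n";
print "operator: ", scalar(@flipflop), " lines\n";
```

Both loops should select the BEGIN line, the line after it, and the END line; the operator version simply has no flag to manage.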
The two operands are completely arbitrary. You could write
mytestfunc1( ) ..
mytestfunc2( ), although this is rarely seen.
Instead, the range operators are usually used with line
numbers as operands (the first example below), patterns as operands
(the second example), or both.
# command-line to print lines 15 through 17 inclusive (see below)
perl -ne 'print if 15 .. 17' datafile
# print all <XMP> .. </XMP> displays from HTML doc
while (<>) {
    print if m#<XMP>#i .. m#</XMP>#i;
}
# same, but as shell command
% perl -ne 'print if m#<XMP>#i .. m#</XMP>#i' document.html
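Since the operands can be arbitrary expressions, subroutine calls work as well. The following is a small sketch under that assumption; the predicate names block_starts and block_ends, the START/STOP markers, and the sample data are all hypothetical.

```perl
use strict;
use warnings;

# Hypothetical predicates; any expression returning true or false would do.
sub block_starts { my ($line) = @_; return $line =~ /^START\b/ ? 1 : 0 }
sub block_ends   { my ($line) = @_; return $line =~ /^STOP\b/  ? 1 : 0 }

# Made-up sample input.
my @lines = ("intro\n", "START here\n", "payload\n", "STOP here\n", "outro\n");

my @block;
for my $line (@lines) {
    # The left call turns the range on, the right call turns it off.
    push @block, $line if block_starts($line) .. block_ends($line);
}

print @block;
```

With this input the loop should keep the START line, the payload line, and the STOP line.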
If either operand is a numeric literal, the range operators
implicitly compare it against the $. variable
($NR or $INPUT_LINE_NUMBER if
you use English). Be careful
with implicit line number comparisons here. You must specify literal
numbers in your code, not variables containing line numbers. That
means you can simply say 3 ..
5 in a conditional, but not $n
.. $m where
$n and $m are 3 and 5
respectively: a variable operand is tested for truth rather than
compared with $., so the range would turn on and off again on every
line. For that, be more explicit by testing the
$. variable directly.
perl -ne 'BEGIN { $top=3; $bottom=5 } print if $top .. $bottom' /etc/passwd    # WRONG
perl -ne 'BEGIN { $top=3; $bottom=5 }
    print if $. == $top .. $. == $bottom' /etc/passwd    # RIGHT
perl -ne 'print if 3 .. 5' /etc/passwd    # also RIGHT
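The same contrast can be seen in a standalone script. This is a sketch that assumes an in-memory file of eight made-up lines in place of /etc/passwd; @picked and @wrong are illustrative names.

```perl
use strict;
use warnings;

# Made-up input: eight numbered lines read through an in-memory filehandle.
my $text = join '', map { "line $_\n" } 1 .. 8;
open my $fh, '<', \$text or die "cannot open in-memory file: $!";

my ($top, $bottom) = (3, 5);
my (@picked, @wrong);
while (my $line = <$fh>) {
    # Explicit comparison against $. works with variables.
    push @picked, $line if $. == $top .. $. == $bottom;

    # Bare variables are tested for truth, not compared with $.,
    # so this range opens and closes on every single line.
    push @wrong, $line if $top .. $bottom;
}
close $fh;

print "explicit test kept ", scalar(@picked), " lines\n";
print "bare variables kept ", scalar(@wrong), " lines\n";
```

The explicit version should keep lines 3 through 5, while the bare-variable version keeps all eight, because 3 and 5 are both simply true.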
The difference between .. and
... is their behavior when both operands become
true on the same iteration. Consider these two cases:
print if /begin/ .. /end/;
print if /begin/ ... /end/;
Given the line "You may
not end ere
you begin", both versions of
the previous range operator return true. But the code using
.. won't print any further lines. That's because
.. tests both conditions on the same line once the
first test matches, and the second test tells it that it's reached
the end of its region. On the other hand, ...
continues until the next line that matches
/end/ because it never tries to test both operands
on the same line.
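A short sketch makes the difference visible. The surrounding sample lines are invented; only the "You may not end ere you begin" line comes from the text above.

```perl
use strict;
use warnings;

# Made-up input built around the example line from the text.
my @lines = (
    "prologue\n",
    "You may not end ere you begin\n",
    "middle\n",
    "the end\n",
    "epilogue\n",
);

my (@two_dot, @three_dot);
for (@lines) {
    push @two_dot,   $_ if /begin/ ..  /end/;   # end tested on the starting line
    push @three_dot, $_ if /begin/ ... /end/;   # end first tested on the next line
}

print "..  kept ", scalar(@two_dot),   " line(s)\n";
print "... kept ", scalar(@three_dot), " line(s)\n";
```

Here the .. version should keep only the line containing both words, while the ... version keeps that line, "middle", and "the end".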
You may mix and match conditions of different sorts, as in:
while (<>) {
    $in_header = 1 .. /^$/;
    $in_body = /^$/ .. eof( );
}
The first assignment sets $in_header to be true
from the first input line until after the blank line separating the
header, such as from a mail message, a USENET news posting, or even
an HTTP header. (Technically, an HTTP header should have linefeeds
and carriage returns as network line terminators, but in practice,
servers are liberal in what they accept.) The second assignment sets
$in_body to true as soon as the first blank line
is encountered, up through end-of-file. Because range operators do
not retest their initial condition, any further blank lines, like
those between paragraphs, won't be noticed.
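As a rough illustration, the sketch below applies the two assignments to a small made-up message held in memory. It uses eof($fh) on a lexical filehandle instead of eof( ) on the files named on the command line; that substitution, the sample message, and the @header and @body arrays are assumptions made for the example.

```perl
use strict;
use warnings;

# Made-up mail message; the address is a placeholder.
my $message = <<'END_MESSAGE';
From: alice@example.com
Subject: lunch

First paragraph.

Second paragraph.
END_MESSAGE

open my $fh, '<', \$message or die "cannot open in-memory file: $!";

my (@header, @body);
while (<$fh>) {
    my $in_header = 1    .. /^$/;        # line 1 through the first blank line
    my $in_body   = /^$/ .. eof($fh);    # first blank line through end of file
    push @header, $_ if $in_header;
    push @body,   $_ if $in_body;
}
close $fh;

print "header lines: ", scalar(@header), "\n";
print "body lines:   ", scalar(@body), "\n";
```

Note that the separating blank line satisfies both ranges, so it lands in both arrays, and the blank line between the two paragraphs stays in the body rather than restarting anything.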
Here's an example. It reads files containing mail messages and prints
addresses it finds in headers. Each address is printed only once. The
extent of the header is from a line beginning with a
"From:" up through the first
blank line. If we're not within that range, go on to the next line.
This isn't an RFC-822 notion of an address, but it is easy to write.
%seen = ( );
while (<>) {
    next unless /^From:?\s/i .. /^$/;
    while (/([^<>( ),;\s]+\@[^<>( ),;\s]+)/g) {
        print "$1\n" unless $seen{$1}++;
    }
}
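To see the range restart at each new header, here is a sketch that runs the same loop over two made-up messages held in memory. The addresses are placeholders, and collecting results in @found instead of printing them directly is a choice made only for this example.

```perl
use strict;
use warnings;

# Two made-up messages, one after the other; all addresses are placeholders.
my $mailbox = <<'END_MAILBOX';
From: alice@example.com
To: bob@example.org, Carol <carol@example.net>

Reach me at alice@example.com or dave@example.com.

From: bob@example.org
Cc: alice@example.com

Bye.
END_MAILBOX

open my $fh, '<', \$mailbox or die "cannot open in-memory file: $!";

my (%seen, @found);
while (<$fh>) {
    # Skip everything outside a header block.
    next unless /^From:?\s/i .. /^$/;
    while (/([^<>(),;\s]+\@[^<>(),;\s]+)/g) {
        push @found, $1 unless $seen{$1}++;   # keep each address once
    }
}
close $fh;

print "$_\n" for @found;
```

The address that appears only in a message body should be skipped, and addresses repeated in the second header should not be reported again.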
6.8.4 See Also
The .. and ... operators in the
"Range Operator" sections of perlop(1) and
Chapter 3 of Programming Perl; the entry for
$NR in perlvar(1) and the
"Per-Filehandle Variables" section of Chapter 28 of
Programming Perl