3.7 Objective 7: Making Use of Regular Expressions
Objective 3 describes filename globbing with wildcards, which lets us list or find files with common elements (such as filenames or file extensions) all at once. File globs use special characters such as *, which have special meanings in the context of the command line. bash understands only a handful of shell wildcard characters, enough to handle the relatively simple problem of globbing filenames. Other problems aren't so simple, and extending the glob concept to any generic text form (files, text streams, program string variables, and so on) opens up a wide new range of capability. This is done using regular expressions.
Two tools that are important for the LPIC
Level 1 exams and that make use of regular expressions are
grep and sed. These tools are useful for text
searches. There are many other tools that make use of regular
expressions, including the awk, Perl, and Python
languages and other utilities, but you don't need to be
concerned with them for the purpose of the LPIC Level 1 exams.
3.7.1 Using grep
A long time ago, as the idea
of regular expressions was catching on, the line editor ed contained a command to display
lines of a file being edited that matched a given regular
expression. The command is: g/regular expression/p
That is, "on a global basis, print the
current line when a match for regular
expression is found," or more simply, "global regular
expression print." This function was so useful that it was
made into a standalone utility named, appropriately, grep. Later, the regular expression
grammar of grep was expanded in
a new command called egrep (for
"extended grep"). You'll find both commands on
your Linux system today, and they differ slightly in the way
they handle regular expressions. For the purposes of Exam 101,
we'll stick with grep, which
can also make use of the "extended" regular expressions when
used with the -E option. You
will find some form of grep on
just about every Unix or Unix-like system available.
Syntax
grep [options] regex [files]
Description
Search files or standard input for
lines containing a match to regular expression
regex. By default,
matching lines will be displayed and nonmatching lines will
not be displayed. When multiple files are specified, grep displays the filename as a
prefix to the output lines (use the -h option to suppress filename
prefixes).
Frequently used options
- -c
Display only a count of matched lines, but not the lines themselves.
- -h
Display matched lines, but do not include filenames for multiple file input.
- -i
Ignore uppercase and lowercase distinctions, allowing abc to match both abc and ABC.
- -n
Display matched lines prefixed with their line numbers. When used with multiple files, both the filename and line number are prefixed.
- -v
Print all lines that do not match regex. This is an important and useful option. You'll want to use regular expressions not only to select information but also to eliminate information. Using -v inverts the output this way.
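The options above can be tried on a throwaway file. The following is only a sketch: the temporary directory, the file names, and the sample lines are invented for illustration and are not part of the exam material.

```shell
# Build two small sample files in a temporary directory (names are illustrative).
tmpdir=$(mktemp -d)
printf '%s\n' 'Linux is a kernel' 'I use linux daily' 'BSD is also Unix-like' > "$tmpdir/file1"
printf '%s\n' 'linux again' > "$tmpdir/file2"

count=$(grep -c -i 'linux' "$tmpdir/file1")      # -c with -i: count matching lines, ignoring case
numbered=$(grep -n 'linux' "$tmpdir/file1")      # -n: prefix each match with its line number
inverted=$(grep -v -i 'linux' "$tmpdir/file1")   # -v: print only the lines that do NOT match
nonames=$(grep -h 'linux' "$tmpdir/file1" "$tmpdir/file2")  # -h: no filename prefixes

echo "$count"
echo "$numbered"
echo "$inverted"
echo "$nonames"
rm -r "$tmpdir"
```

Options can be combined, as -c and -i are here; without -h, the last command would prefix each output line with the name of the file it came from.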
Examples
Since regular expressions can contain both metacharacters and literals, grep can be used with an entirely literal regex. For example, to find all lines in file1 that contain either Linux or linux, you could use grep like this:
$ grep -i linux file1
In this example, the regex is simply "linux." The uppercase L in "Linux" is matched because of the command-line option -i. Leaving the regex unquoted is fine for simple literal expressions like this one. However, when regex includes regular expression metacharacters that are also shell special characters (such as $ or *), the regex must be quoted to prevent shell expansion and pass the metacharacters on to grep.
As a simplistic example of this, suppose you have files in your local directory named abc, abc1, and abc2. When combined with bash's echo command, the abc* wildcard expression lists all files that begin with abc, as follows:
$ echo abc*
abc abc1 abc2
Now suppose that these files contain lines with the strings abc, abcc, abccc, and so on, and you wish to use grep to find them. You can use the shell wildcard expression abc* to expand to all the abc files as displayed with echo above, and you'd use an identical regular expression abc* to find all occurrences of lines containing abc, abcc, abccc, etc. Without using quotes to prevent shell expansion, the command would be:
$ grep abc* abc*
After shell expansion, this yields:
$ grep abc abc1 abc2 abc abc1 abc2 # no!
This is not what you intended! grep would search for the literal expression abc, because it appears as the first command argument. Instead, quote the regular expression with single or double quotes to protect it:
$ grep 'abc*' abc*
or:
$ grep "abc*" abc*
After expansion, both examples yield the same results:
$ grep abc* abc abc1 abc2
Now this is what you're after. The three
files abc, abc1, and abc2 will be searched for the regular
expression abc*. It is good to stay in the habit of
quoting regular expressions on the command line to avoid these
problems -- they won't be at all obvious because the shell
expansion is invisible to you unless you use the echo command.
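The effect of quoting can be checked with echo, as the text suggests. The sketch below recreates the abc, abc1, and abc2 files in a temporary directory; the file contents are made up for the demonstration.

```shell
# Recreate the three files from the text in a scratch directory.
tmpdir=$(mktemp -d)
oldpwd=$PWD
cd "$tmpdir" || exit 1
printf '%s\n' 'ab' 'xyz' > abc
printf '%s\n' 'abcc' > abc1
printf '%s\n' 'nothing here' > abc2

# echo reveals the argument list the shell would hand to grep in each case.
unquoted_args=$(echo grep abc* abc*)     # both words are expanded as globs
quoted_args=$(echo grep 'abc*' abc*)     # the first word reaches grep unchanged

# With the regex quoted, grep sees abc* and searches all three files.
matches=$(grep 'abc*' abc*)

echo "$unquoted_args"
echo "$quoted_args"
echo "$matches"
cd "$oldpwd" || exit 1
rm -r "$tmpdir"
```

Note that the line "ab" matches the regular expression abc* because * allows the c to occur zero times, whereas the glob abc* would never match a file named ab.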
The use of
grep and its options is common. You should be
familiar with what each option does, as well as the
concept of piping the results of other commands into
grep for matching.
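Piping into grep might look like the following sketch. Here printf stands in for any command that writes lines to standard output, and the file names in the list are invented.

```shell
# printf plays the role of a command whose output we want to filter.
listing=$(printf '%s\n' 'report.txt' 'notes.TXT' 'image.png' 'todo.txt')

txt_files=$(printf '%s\n' "$listing" | grep -i '\.txt$')   # names ending in .txt, any case
txt_count=$(printf '%s\n' "$listing" | grep -c '\.txt$')   # count of case-sensitive matches
others=$(printf '%s\n' "$listing" | grep -v -i '\.txt$')   # everything that is not a .txt name

echo "$txt_files"
echo "$txt_count"
echo "$others"
```

When no file arguments are given, grep reads standard input, so the same options apply to a pipeline as to a named file.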
3.7.2 Using sed
In Objective 2, we introduce sed, the stream editor. In that section, we
talk about how sed uses addresses to locate text upon which
it will operate. Among the addressing mechanisms mentioned is
the use of regular expressions delimited between slash
characters. Let's recap how sed
can be invoked.
Syntax
sed [options] 'command1' [files]
sed [options] -e 'command1' [-e 'command2'] [files]
sed [options] -f script [files]
Description
Note that command1 is contained within single
quotes. This is necessary for the same reasons as with grep. The text in command1 must be protected from
evaluation and expansion by the shell.
The address part of a sed command may contain regular expressions, which are enclosed in slashes. For example, to show the contents of file1 except for blank lines, the sed delete (d) command could be invoked like this:
$ sed '/^$/ d' file1
In this case, the regular expression ^$ matches blank lines and the d command removes those matching lines from sed's output.
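The three invocation forms from the syntax summary can be compared side by side. This is a minimal sketch; the sample file, the script name script.sed, and the second command (deleting lines that start with #) are assumptions added for illustration.

```shell
tmpdir=$(mktemp -d)
printf '%s\n' 'first' '' '# a comment' 'second' '' > "$tmpdir/file1"

# Form 1: a single quoted command that deletes blank lines.
no_blanks=$(sed '/^$/d' "$tmpdir/file1")

# Form 2: several commands, each introduced by -e.
no_blanks_or_comments=$(sed -e '/^$/d' -e '/^#/d' "$tmpdir/file1")

# Form 3: the same two commands read from a script file with -f.
printf '%s\n' '/^$/d' '/^#/d' > "$tmpdir/script.sed"
from_script=$(sed -f "$tmpdir/script.sed" "$tmpdir/file1")

echo "$no_blanks"
echo "$no_blanks_or_comments"
echo "$from_script"
rm -r "$tmpdir"
```

Inside a script file the commands need no shell quoting, because the shell never sees them.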
3.7.2.1 Quoting
As shown in the examples for grep and sed, it is necessary to quote regular expression
metacharacters if you wish to preserve their special meaning.
Failing to do this can lead to unexpected results when the
shell interprets the metacharacters as file globbing
characters. There are three forms of quoting you may use to preserve special characters:
- \ (an unquoted backslash character)
A backslash placed before a special character prevents the shell from interpreting it; the character is passed through unaltered to the command you're entering. For example, the * metacharacter may be used in a regular expression like this:
$ grep abc\* abc abc1 abc2
Here, files abc, abc1, and abc2 are searched for the regular expression abc*.
- Single quotes
Surrounding metacharacters with the single-quote character also protects them from interpretation by the shell. All characters inside a pair of single quotes are assumed to have their literal value.
- Double quotes
Surrounding metacharacters with the double-quote character has the same effect as single quotes, with the exception of the $, ` (backquote), and \ (backslash) characters. Both $ and ` retain their special meaning within double quotes. The backslash retains its special meaning only when followed by $, `, a double quote, another backslash, or a newline.
In general, single quotes are safest for
preserving regular expressions.
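The difference between the three quoting forms shows up when a string contains both * and $. In this sketch, printf simply reports the argument it receives; the variable name is an arbitrary choice for the demonstration.

```shell
name='world'

backslash=$(printf '%s' abc\*)         # backslash protects *: printf receives abc*
single=$(printf '%s' '$name abc*')     # single quotes: every character is literal
double=$(printf '%s' "$name abc*")     # double quotes: * is protected, $name is expanded
escaped=$(printf '%s' "\$name abc*")   # a backslash inside double quotes protects the $

echo "$backslash"
echo "$single"
echo "$double"
echo "$escaped"
```

Because $ is also the end-of-line anchor in regular expressions, a regex placed in double quotes can be altered by the shell before grep or sed ever sees it, which is one reason single quotes are the safer habit.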
Pay special attention to quoting
methods used to preserve special characters, because the
various forms don't necessarily yield the same result.
3.7.3 Regular Expressions
Linux
offers many tools for system administrators to use for
processing text. Many, such as sed and the awk and Perl languages, are capable
of automatically editing multiple files, providing you with a
wide range of text-processing capability. To harness that
capability, you need to be able to define and delineate
specific text segments from within files, text streams, and
string variables. Once the text you're after is identified,
you can use one of these tools or languages to do useful
things to it.
These tools and others understand a loosely
defined pattern language. The language and the patterns
themselves are collectively called regular expressions (often
abbreviated just regexp or regex). While regular
expressions are similar in concept to file globs, many more
special characters exist for regular expressions, extending
the utility and capability of tools that understand them.
Regular expressions are the topic of entire
books (such as Jeffrey E. F. Friedl's excellent and very
readable Mastering Regular
Expressions, published by O'Reilly & Associates).
Exam 101 requires the use of simple regular expressions and
related tools, specifically to perform searches from text
sources. This section covers only the basics of regular
expressions, but it goes without saying that their power
warrants a full understanding. Digging deeper into the regular
expression world is highly recommended when you have the
chance.
3.7.3.1 Regular expression syntax
It would not be unreasonable to assume that
some specification defines how regular expressions are
constructed. Unfortunately, no single specification guided them from the start. Regular
expressions have been incorporated as a feature in a number of
tools over the years, with varying degrees of consistency and
completeness. The result is a cart-before-the-horse scenario,
in which utilities and languages have defined their own flavor
of regular expression syntax, each with its own extensions and
idiosyncrasies. Formally defining the regular expression
syntax came later, as did efforts to make it more consistent.
Regular expressions are defined by arranging strings of text,
or patterns. Those patterns are
composed of two types of characters:
- Metacharacters
Like the special file globbing characters, regular expression metacharacters take on a special meaning in the context of the tool in which they're used. A few metacharacters are generally considered part of the "extended set," specifically those introduced into egrep after grep was created. Most of those can now also be handled by grep using the -E option. Examples of metacharacters include the ^ symbol, which means "the beginning of a line," and the $ symbol, which means "the end of a line." A complete listing of metacharacters follows in Table 3-6, Table 3-7, and Table 3-8.
- Literals
Everything that is not a metacharacter is just plain text, or literal text.
It is often helpful to consider regular
expressions as their own language, where literal text acts as
words and phrases. The "grammar" of the language is defined by
the use of metacharacters. The two are combined according to
specific rules (which, as mentioned earlier, may differ
slightly among various tools) to communicate ideas and get
real work done. When you construct regular expressions, you
use metacharacters and literals to specify three basic ideas
about your input text:
- Position anchors
A position anchor is used to specify the position of one or more character sets in relation to the entire line of text (such as the beginning of a line).
- Character sets
A character set matches text. It could be a series of literals, metacharacters that match individual or multiple characters, or combinations of these.
- Quantity modifiers
Quantity modifiers follow a character set and indicate the number of times the set should be repeated. These characters "give elasticity" to a regular expression by allowing the matches to have variable length.
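The three ideas can be seen working together in one small expression. The following sketch uses invented input lines and a regex chosen for illustration: ^ and $ are position anchors, [0-9] is a character set, and * is a quantity modifier.

```shell
# Four sample lines; only two consist entirely of digits.
input=$(printf '%s\n' '42' 'abc' '2024' '7 days')

# ^      anchor: start of line
# [0-9]  character set: one digit
# [0-9]* character set plus quantity modifier: zero or more further digits
# $      anchor: end of line
all_digits=$(printf '%s\n' "$input" | grep '^[0-9][0-9]*$')

echo "$all_digits"
```

Without the anchors, the same expression would also select "7 days", because a match anywhere on the line is enough for grep to print it.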
The next section lists commonly used
metacharacters. The examples given with the metacharacters are
very basic, intended just to demonstrate the use of the
metacharacter in question. More involved regular expressions
are covered later.
3.7.4 Regular Expression Examples
Now that the gory details are out of the way,
here are some examples of simple regular expression usage that
you may find useful.
3.7.4.1 Anchors
Anchors are used to describe position
information. Table
3-6 lists anchor characters.
Table 3-6. Regular Expression Position Anchors
^ | Match at the beginning of a line. This interpretation makes sense only when the ^ character is at the lefthand side of the regex.
$ | Match at the end of a line. This interpretation makes sense only when the $ character is at the righthand side of the regex.
Example 1
Display all lines from file1 where the string "Linux" appears at the start of the line:
$ grep '^Linux' file1
Example 2
Display lines in file1 where the last character is an "x":
$ grep 'x$' file1
Example 3
Display the number of empty lines in file1 by finding lines with nothing between the beginning and the end:
$ grep -c '^$' file1
Example 4
Display all lines from file1 containing only the word "null" by itself:
$ grep '^null$' file1
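The anchor examples can be run against a scratch copy of file1. The contents below are made up so that each command has something to find; this is a sketch, not output from a real system file.

```shell
tmpdir=$(mktemp -d)
# Seven lines, two of them empty.
printf '%s\n' 'Linux rules' 'I like Linux' 'null' '' 'null and void' 'unix' '' > "$tmpdir/file1"

starts_with=$(grep '^Linux' "$tmpdir/file1")   # "Linux" at the start of the line
ends_with_x=$(grep 'x$' "$tmpdir/file1")       # last character is x
empty_count=$(grep -c '^$' "$tmpdir/file1")    # number of empty lines
only_null=$(grep '^null$' "$tmpdir/file1")     # the whole line is exactly "null"

echo "$starts_with"
echo "$ends_with_x"
echo "$empty_count"
echo "$only_null"
rm -r "$tmpdir"
```

The line "null and void" is not selected by the last command, because the $ anchor requires the line to end immediately after "null".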
3.7.4.2 Groups and ranges
Characters can be placed into groups and
ranges to make regular expressions more efficient, as shown in
Table
3-7.
Table 3-7. Regular Expression Character Sets
[abc] [a-z] | Single-character groups and ranges. In the first form, match any single character from among the enclosed characters a, b, or c. In the second form, match any single character from among the range of characters bounded by a and z. The brackets are for grouping only and are not matched themselves.
[^abc] [^a-z] | Inverse match. Match any single character not among the enclosed characters a, b, and c or in the range a-z. Be careful not to confuse this inversion with the anchor character ^, described earlier.
\<word\> | Match words. Words are essentially defined as being character sets surrounded by whitespace or adjacent to the start of line, the end of line, or punctuation marks. The backslashes are required and enable this interpretation of < and >.
. (the single dot) | Match any single character except a newline.
\ | As mentioned in the section on quoting earlier, turn off (escape) the special meaning of the character that follows, turning metacharacters into literals.
Example 1
Display all lines from file1 containing "Linux," "linux," "TurboLinux," and so on:
$ grep '[Ll]inux' file1
Example 2
Display all lines from file1 that contain three adjacent digits:
$ grep '[0-9][0-9][0-9]' file1
Example 3
Display all lines from file1 beginning with any single character other than a digit:
$ grep '^[^0-9]' file1
Example 4
Display all lines from file1 that contain the whole word "Linux" or "linux," but not "LinuxOS" or "TurboLinux":
$ grep '\<[Ll]inux\>' file1
Example 5
Display all lines from file1 with five or more characters on a line (excluding the newline character):
$ grep '.....' file1
Example 6
Display all nonblank lines from file1 (i.e., that have at least one character):
$ grep '.' file1
Example 7
Display all lines from file1 that contain a period (normally a metacharacter) using escape:
$ grep '\.' file1
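A short script can confirm how these character-set expressions behave. The sample lines are invented, and the word-boundary operators \< and \> are assumed to be supported by the installed grep (they are in GNU grep).

```shell
tmpdir=$(mktemp -d)
# Eight lines, including one empty line.
printf '%s\n' 'Linux' 'TurboLinux 7' 'linux kernel 2.6.32' 'LinuxOS' \
    'call 555 now' '' 'x' '9 lives' > "$tmpdir/file1"

any_linux=$(grep -c '[Ll]inux' "$tmpdir/file1")          # Linux or linux anywhere on the line
three_digits=$(grep '[0-9][0-9][0-9]' "$tmpdir/file1")   # three adjacent digits
non_digit_start=$(grep -c '^[^0-9]' "$tmpdir/file1")     # first character is not a digit
whole_word=$(grep '\<[Ll]inux\>' "$tmpdir/file1")        # Linux or linux as a whole word
five_plus=$(grep -c '.....' "$tmpdir/file1")             # at least five characters
has_period=$(grep '\.' "$tmpdir/file1")                  # a literal period

echo "$any_linux $non_digit_start $five_plus"
echo "$three_digits"
echo "$whole_word"
echo "$has_period"
rm -r "$tmpdir"
```

The empty line is not counted by '^[^0-9]', since the bracket expression must still match one character; "2.6.32" does not satisfy the three-digit test because its digits are separated by periods.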
3.7.4.3 Modifiers
Modifiers change the meaning of other
characters in a regular expression. Table
3-8 lists these modifiers.
Table 3-8. Regular Expression Modifiers
* | Match an unknown number (zero or more) of the single character (or single-character regex) that precedes it.
? | Match zero or one instance of the preceding regex. This modifier is an "extended" feature and available in grep only when the -E command-line option is used.
+ | Match one or more instances of the preceding regex. This modifier is an "extended" feature and available in grep only when the -E command-line option is used.
\{n,m\} | Match a range of occurrences of the single character or regex that precedes this construct. \{n\} matches n occurrences, \{n,\} matches at least n occurrences, and \{n,m\} matches any number of occurrences between n and m, inclusively. The backslashes are required and enable this interpretation of { and }.
| | Alternation. Match either the regex specified before or after the vertical bar. This modifier is an "extended" feature and available in grep only when the -E command-line option is used.
Example 1
Display all lines from file1 that contain "ab," "abc," "abcc," "abccc," and so on:
$ grep 'abc*' file1
Example 2
Display all lines from file1 that contain "abc," "abcc," "abccc," and so on, but not "ab":
$ grep 'abcc*' file1
Example 3
Display all lines from file1 that contain two or more adjacent digits:
$ grep '[0-9][0-9][0-9]*' file1
Example 4
Display lines from file1 that contain "file" (because ? can match zero occurrences), file1, or file2:
$ grep -E 'file[12]?' file1
Example 5
Display all lines from file1 containing at least one digit:
$ grep -E '[0-9]+' file1
Example 6
Display all lines from file1 that contain "111," "1111," or "11111" on a line by itself:
$ grep '^1\{3,5\}$' file1
Example 7
Display all lines from file1 that contain any three-, four-, or five-digit number:
$ grep '\<[0-9]\{3,5\}\>' file1
Example 8
Display all lines from file1 that contain "Happy," "happy," "Sad," "sad," "Angry," or "angry":
$ grep -E '[Hh]appy|[Ss]ad|[Aa]ngry' file1
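The modifier examples can be verified the same way. The sketch below runs them against an invented file1; the \< and \> operators are again assumed to be available, as in GNU grep.

```shell
tmpdir=$(mktemp -d)
printf '%s\n' 'ab' 'abcc' '7' '42' '111' '111111' 'file' 'file2' \
    'code 12345 ok' 'I am happy' 'Sad day' > "$tmpdir/file1"

star=$(grep -c 'abc*' "$tmpdir/file1")                   # ab followed by zero or more c
star_one=$(grep 'abcc*' "$tmpdir/file1")                 # abc followed by zero or more c
two_digits=$(grep -c '[0-9][0-9][0-9]*' "$tmpdir/file1") # two or more adjacent digits
optional=$(grep -E 'file[12]?' "$tmpdir/file1")          # ? requires -E
one_plus=$(grep -c -E '[0-9]+' "$tmpdir/file1")          # + requires -E
ones=$(grep '^1\{3,5\}$' "$tmpdir/file1")                # three to five 1s, alone on the line
numbers=$(grep '\<[0-9]\{3,5\}\>' "$tmpdir/file1")       # a three- to five-digit number
moods=$(grep -c -E '[Hh]appy|[Ss]ad|[Aa]ngry' "$tmpdir/file1")  # alternation requires -E

echo "$star $two_digits $one_plus $moods"
echo "$star_one"
echo "$optional"
echo "$ones"
echo "$numbers"
rm -r "$tmpdir"
```

The line "111111" is rejected by both interval expressions: it has six digits, and neither the anchors nor the word boundaries allow a match on only part of it.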
3.7.4.4 Basic regular expression patterns
Example 1
Match any letter:
[A-Za-z]
Example 2
Match any symbol (not a letter or digit):
[^0-9A-Za-z]
Example 3
Match an uppercase letter, followed by zero or more lowercase letters:
[A-Z][a-z]*
Example 4
Match a U.S. Social Security Number (123-45-6789) by specifying groups of three, two, and four digits separated by dashes:
[0-9]\{3\}-[0-9]\{2\}-[0-9]\{4\}
Example 5
Match a dollar amount, using an escaped dollar sign, zero or more spaces or digits, an escaped period, and two more digits:
\$[ 0-9]*\.[0-9]\{2\}
Example 6
Match the month of June and its abbreviation, "Jun." The question mark matches zero or one instance of the e:
June?
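These patterns can be handed to grep to see which lines they select. In this sketch the sample lines are invented, and the June? pattern is run with -E because ? is an extended modifier.

```shell
input=$(printf '%s\n' 'SSN: 123-45-6789' 'SSN: 12-345-6789' 'Total: $ 19.99' \
    'Total: 19.99' 'Due in June' 'Due in Jun 2024' 'Due in July')

# Groups of three, two, and four digits separated by dashes.
ssn=$(printf '%s\n' "$input" | grep '[0-9]\{3\}-[0-9]\{2\}-[0-9]\{4\}')

# Escaped dollar sign, optional spaces or digits, escaped period, two digits.
dollars=$(printf '%s\n' "$input" | grep '\$[ 0-9]*\.[0-9]\{2\}')

# "Jun" followed by an optional e (extended syntax, so -E is needed).
june=$(printf '%s\n' "$input" | grep -E 'June?')

echo "$ssn"
echo "$dollars"
echo "$june"
```

Single quotes matter here: inside double quotes the shell would process the \$ sequence before grep sees it.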
3.7.4.5 Using regular expressions as addresses in sed
These examples are commands you would issue to sed. For example, the commands could take the place of command1 in this usage:
$ sed [options] 'command1' [files]
These commands could also appear in a standalone sed script.
Example 1
Delete blank lines:
/^$/d
Example 2
Delete any line that doesn't contain #keepme:
/#keepme/!d
Example 3
Delete lines containing only whitespace (spaces or tabs). In this example, tab means the single tab character and is preceded by a single space:
/^[ tab]*$/d
Example 4
Delete lines beginning with periods or pound signs:
/^[.#]/d
Example 5
Substitute a single space for any number of spaces wherever they occur on the line. The pattern is two spaces followed by *, that is, one space followed by zero or more additional spaces:
s/  */ /g
Example 6
Substitute def for abc from line 11 to 20, wherever it occurs on the line:
11,20s/abc/def/g
Example 7
Translate the characters a, b, and c to the @ character from line 11 to 20, wherever they occur on the line:
11,20y/abc/@@@/
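Several of these sed commands can be exercised on a five-line scratch file. This is only a sketch: the file contents are invented, and the line range 2,3 stands in for 11,20 so that the sample can stay short.

```shell
tmpdir=$(mktemp -d)
printf '%s\n' 'abc   abc' 'abc   abc' '.hidden' '' 'last #keepme' > "$tmpdir/file1"

kept=$(sed '/#keepme/!d' "$tmpdir/file1")                # delete lines lacking #keepme
cleaned=$(sed -e '/^$/d' -e '/^[.#]/d' "$tmpdir/file1")  # drop blank lines and lines starting with . or #
squeezed=$(sed 's/  */ /g' "$tmpdir/file1" | head -n 1)  # collapse runs of spaces (first line shown)
ranged=$(sed '2,3s/abc/def/g' "$tmpdir/file1" | head -n 2)      # substitute on lines 2-3 only
translated=$(sed '2,3y/abc/@@@/' "$tmpdir/file1" | head -n 2)   # translate a, b, c on lines 2-3 only

echo "$kept"
echo "$cleaned"
echo "$squeezed"
echo "$ranged"
echo "$translated"
rm -r "$tmpdir"
```

In the last two results, line 1 is unchanged because it falls outside the address range, which is the point of combining an address with the s and y commands.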
Make certain you are clear about the
difference between file
globbing and the use of regular expressions.