Book: LPI Linux Certification in a Nutshell
Section: Chapter 3.  GNU and Unix Commands (Topic 1.3)



3.7 Objective 7: Making Use of Regular Expressions

Objective 3 describes filename globbing with wildcards, which lets you list or find files with common elements (such as filenames or file extensions) at once. File globs make use of special characters such as *, which have special meanings in the context of the command line. bash understands only a handful of shell wildcard characters, enough to handle the relatively simple problem of globbing filenames. Other problems aren't so simple, and extending the glob concept to any generic text form (files, text streams, program string variables, etc.) opens up a wide new range of capability. This is done using regular expressions.

Two tools that are important for the LPIC Level 1 exams and that make use of regular expressions are grep and sed. These tools are useful for text searches. There are many other tools that make use of regular expressions, including the awk, Perl, and Python languages and other utilities, but you don't need to be concerned with them for the purpose of the LPIC Level 1 exams.

3.7.1 Using grep

A long time ago, as the idea of regular expressions was catching on, the line editor ed contained a command to display lines of a file being edited that matched a given regular expression. The command is:

g/regular expression/p

That is, "on a global basis, print the current line when a match for regular expression is found," or more simply, "global regular expression print." This function was so useful that it was made into a standalone utility named, appropriately, grep. Later, the regular expression grammar of grep was expanded in a new command called egrep (for "extended grep"). You'll find both commands on your Linux system today, and they differ slightly in the way they handle regular expressions. For the purposes of Exam 101, we'll stick with grep, which can also make use of the "extended" regular expressions when used with the -E option. You will find some form of grep on just about every Unix or Unix-like system available.

grep

Syntax

grep [options] regex [files]

Description

Search files or standard input for lines containing a match to regular expression regex. By default, matching lines will be displayed and nonmatching lines will not be displayed. When multiple files are specified, grep displays the filename as a prefix to the output lines (use the -h option to suppress filename prefixes).

Frequently used options

-c

Display only a count of matched lines, but not the lines themselves.

-h

Display matched lines, but do not include filenames for multiple file input.

-i

Ignore uppercase and lowercase distinctions, allowing abc to match both abc and ABC.

-n

Display matched lines prefixed with their line numbers. When used with multiple files, both the filename and line number are prefixed.

-v

Print all lines that do not match regex. This is an important and useful option. You'll want to use regular expressions, not only to select information but also to eliminate information. Using -v inverts the output this way.
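
Several of the options above can be tried on a small throwaway file. The following is a minimal sketch; the file contents and the variable names are invented for illustration.

```shell
# Create a scratch file (hypothetical sample data) in a temporary directory.
workdir=$(mktemp -d)
printf '%s\n' 'Linux is a kernel' 'linux distributions vary' \
    'BSD is not Linux' 'Solaris' > "$workdir/file1"

# -c: count matching lines instead of printing them.
count=$(grep -c 'Linux' "$workdir/file1")

# -i: ignore case, so the all-lowercase line is counted as well.
count_i=$(grep -ci 'linux' "$workdir/file1")

# -n: prefix each matching line with its line number.
numbered=$(grep -n 'BSD' "$workdir/file1")

# -v: print only the lines that do NOT match.
inverted=$(grep -vi 'linux' "$workdir/file1")

printf '%s\n' "$count" "$count_i" "$numbered" "$inverted"
rm -rf "$workdir"
```

Options can be combined, as -c and -i are here; -v with -i selects every line that mentions neither Linux nor linux.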

Examples

Since regular expressions can contain both metacharacters and literals, grep can be used with an entirely literal regex. For example, to find all lines in file1 that contain either Linux or linux, you could use grep like this:

$ grep -i linux file1

In this example, the regex is simply "linux." The uppercase L in "Linux" is matched by the command-line option -i. This is fine for literal expressions that are common. However, in situations in which regex includes regular expression metacharacters that are also shell special characters (such as $ or *), the regex must be quoted to prevent shell expansion and pass the metacharacters on to grep.

As a simplistic example of this, suppose you have files in your local directory named abc, abc1, and abc2. When combined with bash's echo command, the abc* wildcard expression lists all files that begin with abc, as follows:

$ echo abc*
abc abc1 abc2

Now suppose that these files contain lines with the strings abc, abcc, abccc, and so on, and you wish to use grep to find them. You can use the shell wildcard expression abc* to expand to all the abc files as displayed with echo above, and you'd use an identical regular expression abc* to find all occurrences of lines containing abc, abcc, abccc, etc. Without using quotes to prevent shell expansion, the command would be:

$ grep abc* abc* 

After shell expansion, this yields:

$ grep abc abc1 abc2 abc abc1 abc2  # no!

This is not what you intended! grep would search for the literal expression abc, because it appears as the first command argument. Instead, quote the regular expression with single or double quotes to protect it:[22]

[22] The difference between single quotes and double quotes on the command line is subtle and is explained later in this section.

$ grep 'abc*' abc*

or:

$ grep "abc*" abc*

After expansion, both examples yield the same results:

$ grep abc* abc abc1 abc2

Now this is what you're after. The three files abc, abc1, and abc2 will be searched for the regular expression abc*. It is good to stay in the habit of quoting regular expressions on the command line to avoid these problems -- they won't be at all obvious because the shell expansion is invisible to you unless you use the echo command.

On the Exam

The use of grep and its options is common. You should be familiar with what each option does, as well as the concept of piping the results of other commands into grep for matching.
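
Piping works because grep reads standard input when no files are named. A minimal sketch, in which printf stands in for any command that produces lines of text (the sample filenames are made up):

```shell
# printf stands in for any command that writes lines to standard output.
matches=$(printf '%s\n' 'alpha.txt' 'beta.log' 'gamma.txt' | grep 'txt$')
echo "$matches"

# Options behave the same on piped input; here -c counts the matches.
count=$(printf '%s\n' 'alpha.txt' 'beta.log' 'gamma.txt' | grep -c 'txt$')
echo "$count"
```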

3.7.2 Using sed

In Objective 2, we introduce sed, the stream editor. In that section, we talk about how sed uses addresses to locate text upon which it will operate. Among the addressing mechanisms mentioned is the use of regular expressions delimited between slash characters. Let's recap how sed can be invoked.

sed

Syntax

sed [options] 'command1' [files]
sed [options] -e 'command1' [-e 'command2'] [files]
sed [options] -f script [files]

Description

Note that command1 is contained within single quotes. This is necessary for the same reasons as with grep. The text in command1 must be protected from evaluation and expansion by the shell.

The address part of a sed command may contain regular expressions, which are enclosed in slashes. For example, to show the contents of file1 except for blank lines, the sed delete (d) command could be invoked like this:

$ sed '/^$/ d' file1

In this case, the regular expression ^$ matches blank lines and the d command removes those matching lines from sed's output.
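
As a rough illustration of the invocation forms above, the sketch below deletes blank lines from sample input and then chains two commands with -e. The input text and variable names are invented for the example.

```shell
# Sample input containing two blank lines (invented for illustration).
input=$(printf 'first\n\nsecond\n\nthird')

# Single command: delete blank lines.
no_blanks=$(printf '%s\n' "$input" | sed '/^$/d')
echo "$no_blanks"

# Two commands with -e: delete blank lines, then delete lines starting with "s".
filtered=$(printf '%s\n' "$input" | sed -e '/^$/d' -e '/^s/d')
echo "$filtered"
```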

3.7.2.1 Quoting

As shown in the examples for grep and sed, it is necessary to quote regular expression metacharacters if you wish to preserve their special meaning. Failing to do this can lead to unexpected results when the shell interprets the metacharacters as file globbing characters. There are three forms of quoting you may use to preserve special characters:

\ (an unquoted backslash character)

Placing a backslash before a special character prevents the shell from interpreting it; the character is passed through unaltered to the command you're entering. For example, the * metacharacter may be used in a regular expression like this:

$ grep abc\* abc abc1 abc2

Here, files abc, abc1, and abc2 are searched for the regular expression abc*.

Single quotes

Surrounding metacharacters with the single-quote character also protects them from interpretation by the shell. All characters inside a pair of single quotes are assumed to have their literal value.

Double quotes

Surrounding metacharacters with the double-quote character has the same effect as single quotes, with the exception of the $, ` (backquote), and \ (backslash) characters. Both $ and ` retain their special meaning within double quotes. The backslash retains its special meaning when followed by $, `, a double quote, another backslash, or a newline.

In general, single quotes are safest for preserving regular expressions.
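
The differences among the three quoting forms can be seen by letting echo show what the shell actually hands to a command. This is only a sketch; the variable name pattern_var is hypothetical.

```shell
# A shell variable used to show expansion inside double quotes.
pattern_var='abc'

# Backslash: the * reaches the command unexpanded.
escaped=$(echo abc\*)

# Single quotes: everything is literal, including the $.
single=$(echo '$pattern_var*')

# Double quotes: * stays literal, but $pattern_var is expanded by the shell.
double=$(echo "$pattern_var*")

echo "$escaped"
echo "$single"
echo "$double"
```

Because $ also means "end of line" in a regular expression, a regex such as x$ is one case where single quotes avoid any question of how the shell will treat it.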

On the Exam

Pay special attention to quoting methods used to preserve special characters, because the various forms don't necessarily yield the same result.

3.7.3 Regular Expressions

Linux offers many tools for system administrators to use for processing text. Many, such as sed and the awk and Perl languages, are capable of automatically editing multiple files, providing you with a wide range of text-processing capability. To harness that capability, you need to be able to define and delineate specific text segments from within files, text streams, and string variables. Once the text you're after is identified, you can use one of these tools or languages to do useful things to it.

These tools and others understand a loosely defined pattern language. The language and the patterns themselves are collectively called regular expressions (often abbreviated just regexp or regex). While regular expressions are similar in concept to file globs, many more special characters exist for regular expressions, extending the utility and capability of tools that understand them.

Regular expressions are the topic of entire books (such as Jeffrey E. F. Friedl's excellent and very readable Mastering Regular Expressions, published by O'Reilly & Associates). Exam 101 requires the use of simple regular expressions and related tools, specifically to perform searches from text sources. This section covers only the basics of regular expressions, but it goes without saying that their power warrants a full understanding. Digging deeper into the regular expression world is highly recommended when you have the chance.

3.7.3.1 Regular expression syntax

It would not be unreasonable to assume that some specification defines how regular expressions are constructed. Unfortunately, no single specification governed them from the start. Regular expressions have been incorporated as a feature in a number of tools over the years, with varying degrees of consistency and completeness. The result is a cart-before-the-horse scenario, in which utilities and languages have defined their own flavor of regular expression syntax, each with its own extensions and idiosyncrasies. Formally defining the regular expression syntax came later, as did efforts to make it more consistent.

Regular expressions are defined by arranging strings of text, or patterns. Those patterns are composed of two types of characters:

Metacharacters

Like the special file globbing characters, regular expression metacharacters take on a special meaning in the context of the tool in which they're used. A few metacharacters are generally considered to be among the "extended set" of metacharacters, specifically those introduced into egrep after grep was created. Now, most of those can also be handled by grep using the -E option. Examples of metacharacters include the ^ symbol, which means "the beginning of a line," and the $ symbol, which means "the end of a line." A complete listing of metacharacters follows in Table 3-6, Table 3-7, and Table 3-8.

Literals

Everything that is not a metacharacter is just plain text, or literal text.

It is often helpful to consider regular expressions as their own language, where literal text acts as words and phrases. The "grammar" of the language is defined by the use of metacharacters. The two are combined according to specific rules (which, as mentioned earlier, may differ slightly among various tools) to communicate ideas and get real work done. When you construct regular expressions, you use metacharacters and literals to specify three basic ideas about your input text:

Position anchors

A position anchor is used to specify the position of one or more character sets in relation to the entire line of text (such as the beginning of a line).

Character sets

A character set matches text. It could be a series of literals, metacharacters that match individual or multiple characters, or combinations of these.

Quantity modifiers

Quantity modifiers follow a character set and indicate the number of times the set should be repeated. These characters "give elasticity" to a regular expression by allowing the matches to have variable length.
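
To make the three ideas concrete, the following sketch uses a single expression that combines position anchors, a character set, and a quantity modifier. The input lines are invented for illustration.

```shell
# ^      position anchor: start of line
# [0-9]  character set: any one digit
# *      quantity modifier: zero or more of the preceding set
# x$     the literal x followed by the end-of-line anchor
result=$(printf '%s\n' 'x' '42x' 'a42x' '42xy' | grep '^[0-9]*x$')
echo "$result"
```

Only the lines x and 42x are selected: a42x fails the start-of-line anchor, and 42xy fails the end-of-line anchor.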

The next section lists commonly used metacharacters. The examples given with the metacharacters are very basic, intended just to demonstrate the use of the metacharacter in question. More involved regular expressions are covered later.

3.7.4 Regular Expression Examples

Now that the gory details are out of the way, here are some examples of simple regular expression usage that you may find useful.

3.7.4.1 Anchors

Anchors are used to describe position information. Table 3-6 lists anchor characters.

Table 3-6. Regular Expression Position Anchors

Regular Expression

Description

^

Match at the beginning of a line. This interpretation makes sense only when the ^ character is at the lefthand side of the regex.

$

Match at the end of a line. This interpretation makes sense only when the $ character is at the righthand side of the regex.


Example 1

Display all lines from file1 where the string "Linux" appears at the start of the line:

$ grep '^Linux' file1

Example 2

Display lines in file1 where the last character is an "x":

$ grep 'x$' file1

Example 3

Display the number of empty lines in file1 by finding lines with nothing between the beginning and the end:

$ grep -c '^$' file1

Example 4

Display all lines from file1 containing only the word "null" by itself:

$ grep '^null$' file1 
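
The anchor examples can be run against a small scratch file to see which lines each one selects. This is a sketch with invented sample data; the variable names are arbitrary.

```shell
# Scratch file with invented sample lines, including one empty line.
workdir=$(mktemp -d)
printf '%s\n' 'Linux rocks' 'I use Linux' 'unix' '' 'null' 'not null' > "$workdir/file1"

start=$(grep '^Linux' "$workdir/file1")      # "Linux" at the start of the line
end_x=$(grep 'x$' "$workdir/file1")          # lines whose last character is x
empty=$(grep -c '^$' "$workdir/file1")       # number of empty lines
only_null=$(grep '^null$' "$workdir/file1")  # lines consisting solely of "null"

printf '%s\n' "$start" "$end_x" "$empty" "$only_null"
rm -rf "$workdir"
```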

3.7.4.2 Groups and ranges

Characters can be placed into groups and ranges to make regular expressions more efficient, as shown in Table 3-7.

Table 3-7. Regular Expression Character Sets

Regular Expression

Description

[abc]
[a-z]

Single-character groups and ranges. In the first form, match any single character from among the enclosed characters a, b, or c. In the second form, match any single character from among the range of characters bounded by a and z. The brackets are for grouping only and are not matched themselves.

[^abc]
[^a-z]

Inverse match. Match any single character not among the enclosed characters a, b, and c or in the range a-z. Be careful not to confuse this inversion with the anchor character ^, described earlier.

\<word\>

Match words. A word is essentially a sequence of characters bounded on each side by whitespace, the start of line, the end of line, or punctuation marks. The backslashes are required and enable this interpretation of < and >.

. (the single dot)

Match any single character except a newline.

\

As mentioned in the section on quoting earlier, turn off (escape) the special meaning of the character that follows, turning metacharacters into literals.


Example 1

Display all lines from file1 containing either "Linux," "linux," "TurboLinux," and so on:

$ grep '[Ll]inux' file1

Example 2

Display all lines from file1 which contain three adjacent digits:

$ grep '[0-9][0-9][0-9]' file1

Example 3

Display all lines from file1 beginning with any single character other than a digit:

$ grep '^[^0-9]' file1

Example 4

Display all lines from file1 that contain the whole word "Linux" or "linux," but not "LinuxOS" or "TurboLinux":

$ grep '\<[Ll]inux\>' file1

Example 5

Display all lines from file1 with five or more characters on a line (excluding the newline character):

$ grep '.....' file1 

Example 6

Display all nonblank lines from file1 (i.e., that have at least one character):

$ grep '.' file1 

Example 7

Display all lines from file1 that contain a period (normally a metacharacter) using escape:

$ grep '\.' file1 
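
Several of the group and range examples are sketched below against invented sample lines, so the effect of each expression can be checked. The \< and \> word anchors are used as described in Table 3-7; support for them is assumed here (GNU grep provides it).

```shell
# Scratch file with invented sample lines.
workdir=$(mktemp -d)
printf '%s\n' 'Linux' 'TurboLinux 7' 'linuxconf' 'room 101' '9 lives' 'v1.2' > "$workdir/file1"

# Character group: count lines containing Linux or linux anywhere.
any_linux=$(grep -c '[Ll]inux' "$workdir/file1")

# Word match: only the standalone word qualifies.
whole_word=$(grep '\<[Ll]inux\>' "$workdir/file1")

# Three adjacent digits.
three_digits=$(grep '[0-9][0-9][0-9]' "$workdir/file1")

# Inverse match after an anchor: count lines that start with a nondigit.
nondigit_start=$(grep -c '^[^0-9]' "$workdir/file1")

# Escaped dot: a literal period rather than "any character".
literal_dot=$(grep '\.' "$workdir/file1")

printf '%s\n' "$any_linux" "$whole_word" "$three_digits" "$nondigit_start" "$literal_dot"
rm -rf "$workdir"
```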

3.7.4.3 Modifiers

Modifiers change the meaning of other characters in a regular expression. Table 3-8 lists these modifiers.

Table 3-8. Regular Expression Modifiers

Regular Expression

Description

*

Match an unknown number (zero or more) of the single character (or single-character regex) that precedes it.

?

Match zero or one instance of the preceding regex. This modifier is an "extended" feature and available in grep only when the -E command-line option is used.

+

Match one or more instances of the preceding regex. This modifier is an "extended" feature and available in grep only when the -E command-line option is used.

\{n,m\}

Match a range of occurrences of the single character or regex that precedes this construct. \{n\} matches n occurrences, \{n,\} matches at least n occurrences, and \{n,m\} matches any number of occurrences between n and m, inclusively. The backslashes are required and enable this interpretation of { and }.

|

Alternation. Match either the regex specified before or after the vertical bar. This modifier is an "extended" feature and available in grep only when the -E command-line option is used.


Example 1

Display all lines from file1 that contain "ab," "abc," "abcc," "abccc," and so on:

$ grep 'abc*' file1 

Example 2

Display all lines from file1 that contain "abc," "abcc," "abccc," and so on, but not "ab":

$ grep 'abcc*' file1 

Example 3

Display all lines from file1 that contain two or more adjacent digits:

$ grep '[0-9][0-9][0-9]*' file1 

Example 4

Display lines from file1 that contain "file" (because ? can match zero occurrences), file1, or file2:

$ grep -E 'file[12]?' file1 

Example 5

Display all lines from file1 containing at least one digit:

$ grep -E '[0-9]+' file1

Example 6

Display all lines from file1 that contain "111," "1111," or "11111" on a line by itself:

$ grep '^1\{3,5\}$' file1

Example 7

Display all lines from file1 that contain any three-, four-, or five-digit number:

$ grep '\<[0-9]\{3,5\}\>' file1

Example 8

Display all lines from file1 that contain "Happy," "happy," "Sad," "sad," "Angry," or "angry":

$ grep -E '[Hh]appy|[Ss]ad|[Aa]ngry' file1 
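
The modifier examples can be compared side by side on one scratch file. The sketch below uses invented sample lines; note which commands need -E for the extended modifiers.

```shell
# Scratch file with invented sample lines.
workdir=$(mktemp -d)
printf '%s\n' 'ab' 'abccc' '7' '42' '111' '111111' 'sad day' > "$workdir/file1"

# *: zero or more c after ab, so both "ab" and "abccc" count.
star=$(grep -c 'abc*' "$workdir/file1")

# Requiring one c before c* excludes the plain "ab" line.
one_or_more_c=$(grep 'abcc*' "$workdir/file1")

# Two or more adjacent digits, written with basic syntax only.
two_digits=$(grep -c '[0-9][0-9][0-9]*' "$workdir/file1")

# \{3,5\}: a line made up of three to five 1s; "111111" is too long.
interval=$(grep '^1\{3,5\}$' "$workdir/file1")

# + and | are extended modifiers, so -E is used.
plus=$(grep -cE '[0-9]+' "$workdir/file1")
alternation=$(grep -E '[Hh]appy|[Ss]ad' "$workdir/file1")

printf '%s\n' "$star" "$one_or_more_c" "$two_digits" "$interval" "$plus" "$alternation"
rm -rf "$workdir"
```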

3.7.4.4 Basic regular expression patterns

Example 1

Match any letter:

[A-Za-z]

Example 2

Match any symbol (not a letter or digit):

[^0-9A-Za-z]

Example 3

Match an uppercase letter, followed by zero or more lowercase letters:

[A-Z][a-z]*

Example 4

Match a U.S. Social Security Number (123-45-6789) by specifying groups of three, two, and four digits separated by dashes:

[0-9]\{3\}-[0-9]\{2\}-[0-9]\{4\}

Example 5

Match a dollar amount, using an escaped dollar sign, zero or more spaces or digits, an escaped period, and two more digits:

\$[ 0-9]*\.[0-9]\{2\}

Example 6

Match the month of June and its abbreviation, "Jun." The question mark matches zero or one instance of the e (an extended feature, so grep needs the -E option):

June?
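
A few of these patterns are tried below with grep. This is a sketch on invented input; the sample number is the one shown in Example 4, and the other lines are made up.

```shell
# Social Security Number pattern: groups of three, two, and four digits.
ssn=$(printf '%s\n' 'SSN: 123-45-6789' 'phone: 555-1234' \
    | grep '[0-9]\{3\}-[0-9]\{2\}-[0-9]\{4\}')

# Dollar amount: escaped $, spaces or digits, escaped period, two digits.
dollars=$(printf '%s\n' 'total $ 19.99' 'total 19.99' 'cost $5' \
    | grep '\$[ 0-9]*\.[0-9]\{2\}')

# June?: the ? modifier is extended syntax, so -E is needed.
months=$(printf '%s\n' 'Jun 3' 'June 4' 'July 5' | grep -E 'June?')

printf '%s\n' "$ssn" "$dollars" "$months"
```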

3.7.4.5 Using regular expressions as addresses in sed

These examples are commands you would issue to sed. For example, the commands could take the place of command1 in this usage:

$ sed [options] 'command1' [files]

These commands could also appear in a standalone sed script.


Example 1

Delete blank lines:

/^$/d

Example 2

Delete any line that doesn't contain #keepme:

/#keepme/!d

Example 3

Delete lines containing only whitespace (spaces or tabs). In this example, tab means the single tab character and is preceded by a single space:

/^[ tab]*$/d

Example 4

Delete lines beginning with periods or pound signs:

/^[.#]/d

Example 5

Substitute a single space for any number of spaces wherever they occur on the line:

s/  */ /g

Example 6

Substitute def for abc from line 11 to 20, wherever it occurs on the line:

11,20s/abc/def/g

Example 7

Translate the characters a, b, and c to the @ character from line 11 to 20, wherever they occur on the line:

11,20y/abc/@@@/
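
The sed commands above can be exercised on short sample input. In this sketch the input lines are invented, and the 11,20 line range is narrowed to 2,3 so that a three-line input is enough to show the effect. The tab variable is one way (an assumption of this example) to supply the literal tab character that Example 3 calls for.

```shell
# Example 2: keep only lines containing #keepme.
kept=$(printf '%s\n' 'drop me' 'line #keepme' | sed '/#keepme/!d')

# Example 3: delete whitespace-only lines; $tab holds a literal tab.
tab=$(printf '\t')
no_ws=$(printf 'one\n \t \ntwo\n' | sed "/^[ $tab]*\$/d")

# Example 4: delete lines beginning with a period or pound sign.
no_marks=$(printf '%s\n' '.hidden' '#comment' 'data' | sed '/^[.#]/d')

# Example 5: squeeze runs of spaces down to a single space.
squeezed=$(printf '%s\n' 'a    b  c' | sed 's/  */ /g')

# Examples 6 and 7 with a shorter line range (2,3 instead of 11,20).
ranged=$(printf '%s\n' 'abc abc' 'abc abc' 'abc abc' | sed '2,3s/abc/def/g')
translated=$(printf '%s\n' 'cab' 'cab' 'cab' | sed '2,3y/abc/@@@/')

printf '%s\n' "$kept" "$no_ws" "$no_marks" "$squeezed" "$ranged" "$translated"
```

Double quotes are used in the Example 3 command only so that the shell expands $tab; the regular expression's own $ is escaped with a backslash to keep the shell from touching it.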

On the Exam

Make certain you are clear about the difference between file globbing and the use of regular expressions.