1.11 Shell Tools

awk, sed, and egrep are a related set of Unix shell tools for text processing. awk and egrep use a DFA match engine, and sed uses an NFA engine. For an explanation of the rules behind these engines, see Section 1.2.

This reference covers GNU egrep 2.4.2, a program for searching lines of text; GNU sed 3.02, a tool for scripting editing commands; and GNU awk 3.1, a programming language for text processing.

1.11.1 Supported Metacharacters

awk, egrep, and sed support the metacharacters and metasequences listed in Table 1-46 through Table 1-50. For expanded definitions of each metacharacter, see Section 1.2.1.

Table 1-46. Character representations

Sequence        Meaning                                                              Tool
\a              Alert (bell).                                                        awk, sed
\b              Backspace; supported only in character class.                        awk
\f              Form feed.                                                           awk, sed
\n              Newline (line feed).                                                 awk, sed
\r              Carriage return.                                                     awk, sed
\t              Horizontal tab.                                                      awk, sed
\v              Vertical tab.                                                        awk, sed
\ooctal         A character specified by a one-, two-, or three-digit octal code.    sed
\octal          A character specified by a one-, two-, or three-digit octal code.    awk
\xhex           A character specified by a two-digit hexadecimal code.               awk, sed
\ddecimal       A character specified by a one-, two-, or three-digit decimal code.  awk, sed
\cchar          A named control character (e.g., \cC is Control-C).                  awk, sed
\b              Backspace.                                                           awk
\metacharacter  Escape the metacharacter so that it literally represents itself.     awk, sed, egrep

Table 1-47. Character classes and class-like constructs

Class        Meaning                                                                              Tool
[...]        Matches any single character listed or contained within a listed range.              awk, sed, egrep
[^...]       Matches any single character that is not listed or contained within a listed range.  awk, sed, egrep
.            Matches any single character, except newline.                                        awk, sed, egrep
\w           Matches an ASCII word character, [a-zA-Z0-9_].                                       egrep, sed
\W           Matches a character that is not an ASCII word character, [^a-zA-Z0-9_].              egrep, sed
[[:prop:]]   Matches any character in the POSIX character class.                                  awk, sed
[^[:prop:]]  Matches any character not in the POSIX character class.                              awk, sed

Table 1-48. Anchors and other zero-width tests

Sequence  Meaning                                                            Tool
^         Matches only start of string, even if newlines are embedded.       awk, sed, egrep
$         Matches only end of search string, even if newlines are embedded.  awk, sed, egrep
\<        Matches beginning of word boundary.                                egrep
\>        Matches end of word boundary.                                      egrep
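As a brief illustration of the anchors in Table 1-48, the sketch below contrasts the word-boundary sequences with the line anchors. It assumes GNU grep invoked as grep -E, which is equivalent to egrep; the three input lines are made up for the example.

```shell
# Hypothetical three-line input: "cat" as a word, inside a word, and alone.
sample() { printf 'the cat sat\nconcatenate\ncat\n'; }

# \< and \> match only where "cat" stands as a whole word.
sample | grep -E '\<cat\>'
# the cat sat
# cat

# ^ and $ require the whole line to be exactly "cat".
sample | grep -E '^cat$'
# cat
```

The line "concatenate" is rejected by both patterns: it contains "cat", but not at a word boundary and not as the entire line.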

Table 1-49. Comments and mode modifiers

Modifier                    Meaning                                            Tool
flag: i or I                Case-insensitive matching for ASCII characters.    sed
command-line option: -i     Case-insensitive matching for ASCII characters.    egrep
set IGNORECASE to non-zero  Case-insensitive matching for Unicode characters.  awk

Table 1-50. Grouping, capturing, conditional, and control

Sequence     Meaning                                               Tool
(PATTERN)    Grouping.                                             awk
\(PATTERN\)  Group and capture sub-matches, filling \1,\2,...,\9.  sed
\n           Contains the nth earlier submatch.                    sed
...|...      Alternation; match one or the other.                  egrep, awk, sed

Greedy quantifiers
*            Match 0 or more times.                                awk, sed, egrep
+            Match 1 or more times.                                awk, sed, egrep
?            Match 1 or 0 times.                                   awk, sed, egrep
\{n\}        Match exactly n times.                                sed, egrep
\{n,\}       Match at least n times.                               sed, egrep
\{x,y\}      Match at least x times, but no more than y times.     sed, egrep

egrep

egrep [options] pattern files

egrep searches files for occurrences of pattern and prints out each matching line.

Example

$ echo 'Spiderman Menaces City!' > dailybugle.txt
$ egrep -i 'spider[- ]?man' dailybugle.txt
Spiderman Menaces City!

sed

sed '[address1][,address2]s/pattern/replacement/[flags]' files
sed -f script files

By default, sed applies the substitution to every line in files. Each address can be either a line number or a regular expression pattern. A supplied regular expression must be enclosed in forward slash delimiters (/.../). If only address1 is supplied, the substitution is applied only to that line number or to each line matching the pattern. If address2 is also supplied, the substitution begins on the line selected by address1 and continues through the line indicated or matched by address2, or to the end of the file if address2 never matches.
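The address forms can be sketched as follows. The five-line input and the helper name lines are hypothetical, chosen only to make the affected lines easy to see; $ as an address means the last line of input.

```shell
# Hypothetical five-line input used by every command below.
lines() { printf 'alpha\nbeta\ngamma\ndelta\nepsilon\n'; }

# No address: the substitution is attempted on every line.
lines | sed 's/a/A/'          # Alpha betA gAmma deltA epsilon

# Single line-number address: only line 2 is edited.
lines | sed '2s/a/A/'         # alpha betA gamma delta epsilon

# Line-number range: lines 2 through 3 are edited.
lines | sed '2,3s/a/A/'       # alpha betA gAmma delta epsilon

# Regex start address with $ as the end: from the first line
# matching /gamma/ through the last line of input.
lines | sed '/gamma/,$s/a/A/' # alpha beta gAmma deltA epsilon
```

Each command prints one line of output per input line; the comments list those lines separated by spaces for brevity.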

Two subsequences, & and \n, will be interpreted in replacement based on the results of the match. The sequence & is replaced with the text matched by pattern. The sequence \n corresponds to a capture group (1..9) in the current match.
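A short sketch of both replacement sequences, using made-up input strings:

```shell
# & stands for the whole matched text: wrap each run of digits in brackets.
echo 'room 101, floor 3' | sed 's/[0-9][0-9]*/[&]/g'
# room [101], floor [3]

# \1 and \2 refer to the first and second \(...\) groups: swap two words.
echo 'hello world' | sed 's/\([a-z]*\) \([a-z]*\)/\2 \1/'
# world hello
```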

The available flags are:

n

Substitute the nth match in a line, where n is between 1 and 512.

g

Substitute all occurrences of pattern in a line.

p

Print lines with successful substitutions.

w file

Write lines with successful substitutions to file.
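The flags can be compared side by side on small, made-up inputs. The p example also uses the standard sed option -n, which is not described above: it suppresses the default printing of every line, so that only the lines printed by the p flag appear.

```shell
# Numeric flag: replace only the second match on the line.
echo 'a-b-c-d' | sed 's/-/+/2'
# a-b+c-d

# g: replace every match on the line.
echo 'a-b-c-d' | sed 's/-/+/g'
# a+b+c+d

# p combined with -n: print only the lines where a substitution succeeded.
printf 'one\ntwo\nthree\n' | sed -n 's/t/T/p'
# Two
# Three
```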

Example

Change date formats from MM/DD/YYYY to DD.MM.YYYY.

$ echo '12/30/1969' |
 sed 's!\([0-9][0-9]\)/\([0-9][0-9]\)/\([0-9]\{2,4\}\)!\2.\1.\3!g'

awk

awk 'instructions' files
awk -f script files

The awk script contained in either instructions or script should be a series of /pattern/ {action} pairs. The action code is applied to each line matched by pattern. awk also supplies several functions for pattern matching.
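A minimal sketch of a script made of two /pattern/ {action} pairs. The input data and the helper name scores are hypothetical.

```shell
# Hypothetical input: three lines of "name score".
scores() { printf 'ann 90\nbob 72\ncat 85\n'; }

# Each action runs only on the lines its pattern matches;
# a line that matches neither pattern produces no output.
scores | awk '
    /^a/     { print "starts with a:", $1 }
    /8[0-9]/ { print "in the 80s:", $1 }
'
# starts with a: ann
# in the 80s: cat
```

Every input line is tested against each pattern in order, so a line matching both patterns would trigger both actions.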

Functions

match( text, pattern)

If pattern matches in text, returns the position in text where the match starts. A failed match returns zero. A successful match also sets the variable RSTART to the position where the match started and the variable RLENGTH to the number of characters in the match.

gsub( pattern, replacement, text)

Substitutes each match of pattern in text with replacement and returns the number of substitutions. Defaults to $0 if text is not supplied.

sub( pattern, replacement, text)

Substitutes first match of pattern in text with replacement. A successful substitution returns 1, and an unsuccessful substitution returns 0. Defaults to $0 if text is not supplied.
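The three functions can be exercised on one made-up input line, as sketched below. The variable names first, all, and n are illustrative, and substr is a standard awk function used here only to display the matched text.

```shell
echo 'error code 404 at line 12' | awk '{
    # match() returns the start position and sets RSTART and RLENGTH.
    if (match($0, /[0-9]+/))
        print RSTART, RLENGTH, substr($0, RSTART, RLENGTH)

    # sub() replaces only the first match in its third argument.
    first = $0
    sub(/[0-9]+/, "N", first)
    print first

    # gsub() replaces every match and returns the number of substitutions.
    all = $0
    n = gsub(/[0-9]+/, "N", all)
    print n, all
}'
# 12 3 404
# error code N at line 12
# 2 error code N at line N
```

Working on copies (first, all) leaves $0 untouched, which makes it easier to compare the effect of sub and gsub on the same text.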

Example

Create an awk file and then run it from the command line.

$ cat sub.awk
{
    gsub(/https?:\/\/[a-z_.\\w\/\\#~:?+=&;%@!-]*/,
                    "<a href=\"\&\">\&</a>");
 
    print
}

$ echo "Check the website, http://www.oreilly.com/catalog/repr" | awk -f sub.awk

1.11.2 Other Resources

  • sed & awk, by Dale Dougherty and Arnold Robbins (O'Reilly), is an introduction and reference to both tools.
