Book: LPI Linux Certification in a Nutshell
Section: Chapter 3.  GNU and Unix Commands (Topic 1.3)



3.2 Objective 2: Process Text Streams Using Text-Processing Filters

Many of the commands on Linux systems are intended to be used as filters, which modify text in helpful ways. Text fed into a command's standard input or read from files is transformed and sent to standard output or to a new file. Multiple commands can be combined in a pipeline, with the text stream modified at each step. This section describes basic use and syntax for the filtering commands important for Exam 101. Refer to a Linux command reference for full details on each command and the many other available commands.

cut

Syntax

cut options [files]

Description

Cut out (that is, print) selected columns or fields from one or more files. The source file is not changed. This is useful if you need quick access to a vertical slice of a file. By default, the slices are delimited by a tab.

Frequently used options

-b list

Print bytes in list positions.

-c list

Print characters in list columns.

-d delim

Set field delimiter for -f.

-f list

Print list fields.

Examples

Show usernames (in the first colon-delimited field) from /etc/passwd:

$ cut -d: -f1 /etc/passwd

Show the first character column of /etc/passwd (that is, the first character of each line):

$ cut -c 1 /etc/passwd
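
As a further sketch, -f can select several fields at once and -c accepts a range. The two-line sample below is hypothetical data in passwd format, used here so the result does not depend on the local /etc/passwd:

```shell
# Hypothetical sample records in /etc/passwd format (not real accounts).
sample='root:x:0:0:root:/root:/bin/bash
alice:x:1000:1000:Alice:/home/alice:/bin/sh'

# Fields 1 and 7 (login name and shell); output keeps the colon delimiter.
fields=$(printf '%s\n' "$sample" | cut -d: -f1,7)

# Character columns 1 through 4 of each line.
chars=$(printf '%s\n' "$sample" | cut -c1-4)

printf '%s\n' "$fields" "$chars"
```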

expand

Syntax

expand [options] files

Description

Convert tabs to spaces. Sometimes the use of tab characters can make output that is attractive on one output device look bad on another. This command eliminates tabs and replaces them with the equivalent number of spaces. By default, tabs are assumed to be eight spaces apart.

Frequently used options

-t tabs

Specify tab stops, in place of default 8.

-i

Initial; convert only at start of lines.
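
A minimal sketch of both options, using printf to generate input that contains tabs (the sample strings are illustrative only):

```shell
# The tab after "a" is replaced by spaces up to the next tab stop (every 4 columns).
all=$(printf 'a\tb\n' | expand -t 4)

# With -i, only the leading tab is converted; the tab after "a" is kept.
initial=$(printf '\ta\tb\n' | expand -i -t 4)

printf '[%s]\n' "$all" "$initial"
```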

fmt

Syntax

fmt [options] [files]

Description

Format text to a specified width by filling lines and removing newline characters. Multiple files from the command line are concatenated.

Frequently used options

-u

Use uniform spacing: one space between words and two spaces between sentences.

-w width

Set line width to width. The default is 75 characters.
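
As a rough illustration, the following reflows one long line to a 20-character width. The sample sentence is arbitrary, and the exact break points chosen by fmt may vary, so no particular output is shown:

```shell
# One long input line, reflowed so that no output line exceeds 20 characters.
reflowed=$(printf 'The quick brown fox jumps over the lazy dog\n' | fmt -w 20)

printf '%s\n' "$reflowed"
```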

head

Syntax

head [options] [files]

Description

Print the first few lines of one or more files (the "head" of the file or files). When more than one file is specified, a header is printed at the beginning of each file, and each is listed in succession.

Frequently used options

-c n

Print the first n bytes, or if n is followed by k or m, print the first n kilobytes or megabytes, respectively.

-n n

Print the first n lines. The default is 10.
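
A short sketch of both options. The seq command, which is not covered in this section, is assumed to be available and is used only to generate numbered input:

```shell
# First three lines of a 20-line stream generated by seq.
first_lines=$(seq 1 20 | head -n 3)

# First five bytes of a single line.
first_bytes=$(printf 'abcdefghij\n' | head -c 5)

printf '%s\n' "$first_lines" "$first_bytes"
```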

join

Syntax

join [options] file1 file2

Description

Print a line for each pair of input lines, one each from file1 and file2, that have identical join fields. This function could be thought of as a very simple database table join, where the two files share a common index just as two tables in a database would.

Frequently used options

-j1 field

Join on field of file1. Current versions of join write this as -1 field.

-j2 field

Join on field of file2. Current versions of join write this as -2 field.

-j field

Join on field of both file1 and file2.

Example

Suppose file1 contains the following:

1 one
2 two
3 three

and file2 contains:

1 11
2 22
3 33

Issuing the command:

$ join -j 1 file1 file2

yields the following output:

1 one 11
2 two 22
3 three 33
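
When the join field sits in a different position in each file, the per-file options are needed. The sketch below uses the -1 and -2 spellings and two hypothetical files, names and values, created in a scratch directory; both are sorted on their join fields, as join expects:

```shell
# Work in a scratch directory; the file names are hypothetical.
tmp=$(mktemp -d)

# "names" has its key in field 2; "values" has its key in field 1.
printf '%s\n' 'one 1' 'two 2' 'three 3' > "$tmp/names"
printf '%s\n' '1 11' '2 22' '3 33' > "$tmp/values"

# Join field 2 of the first file against field 1 of the second.
joined=$(join -1 2 -2 1 "$tmp/names" "$tmp/values")

printf '%s\n' "$joined"
rm -r "$tmp"
```

The join field is printed first on each output line, followed by the remaining fields of each file, so the result matches the output of the example above.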

nl

Syntax

nl [options] [files]

Description

Number the lines of files, which are concatenated in the output. This command is used for numbering lines in the body of text, including special header and footer options normally excluded from the line numbering. The numbering is done for each logical page, which is defined as having a header, a body, and a footer. These are delimited by the special strings \:\:\:, \:\:, and \:, respectively.

Frequently used options

-b style

Set body numbering style to style, t by default.

-f style

Set footer number style to style, n by default.

-h style

Set header numbering style to style, n by default.

Styles can be in these forms:

a

Number all lines.

t

Only number non-empty lines.

n

Do not number lines.

pREGEXP

Only number lines that contain a match for regular expression REGEXP.

Example

Suppose file file1 contains the following text:

\:\:\:
header
\:\:
line1
line2
line3
\:
footer
\:\:\:
header
\:\:
line1
line2
line3
\:
footer

If the following command is given:

$ nl -h a file1

the output would yield numbered headers and body lines but no numbering on footer lines. Each new header represents the beginning of a new logical page and thus a restart of the numbering sequence:

1 header

2 line1
3 line2
4 line3

footer

1 header

2 line1
3 line2
4 line3

footer
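
The body styles can be compared with a small sketch. The three-line input is illustrative, and grep -c is used only to count how many output lines received a number:

```shell
# Three input lines, the second of which is empty.
input='alpha

beta'

# Default body style (t) skips the empty line; style a numbers every line.
numbered_default=$(printf '%s\n' "$input" | nl | grep -c '^ *[0-9]')
numbered_all=$(printf '%s\n' "$input" | nl -b a | grep -c '^ *[0-9]')

# Style pREGEXP numbers only the lines matching the expression (here, lines starting with b).
numbered_match=$(printf '%s\n' "$input" | nl -b 'p^b' | grep -c '^ *[0-9]')

echo "$numbered_default $numbered_all $numbered_match"
```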

od

Syntax

od [options] [files]

Description

Dump files in octal and other formats. This program prints a listing of a file's contents in a variety of formats. It is often used to examine the byte codes of binary files but can be used on any file or input stream. Each line of output consists of an octal byte offset from the start of the file followed by a series of tokens indicating the contents of the file. Depending on the options specified, these tokens can be ASCII, decimal, hexadecimal, or octal representations of the contents.

Frequently used options

-t type

Specify the type of output. Typical types include:

a

Named character

c

ASCII character or backslash escape

o

Octal (the default)

x

Hexadecimal

Example

If file1 contains:

a1\n
A1\n

where \n stands for the newline character. The od command specifying named characters yields the following output:

$ od -t a file1
00000000   a   1  nl   A   1  nl
00000006

A slight nuance is the ASCII character mode. This od command specifying ASCII characters yields the following output, with backslash-escaped characters rather than named characters:

$ od -t c file1
00000000   a   1  \n   A   1  \n
00000006

With numeric output formats, you can instruct od on how many bytes to use in interpreting each number in the data. To do this, follow the type specification by a decimal integer. This od command specifying single-byte hex results yields the following output:

$ od -t x1 file1
00000000  61 31 0a 41 31 0a
00000006

Doing the same thing in octal notation yields:

$ od -t o1 file1
00000000  141 061 012 101 061 012
00000006

If you examine an ASCII chart with hex and octal representations, you'll see that these results match those tables.
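
Because od reads standard input, it also works in a pipeline. The sketch below feeds it the same six bytes and uses -A n, an option not covered above, to suppress the offset column:

```shell
# Same six bytes as file1 above, fed on standard input.
# -A n suppresses the offset column; tr then strips the spaces
# and newline that od places around the values.
hex=$(printf 'a1\nA1\n' | od -A n -t x1 | tr -d ' \n')

echo "$hex"
```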

paste

Syntax

paste [options] files

Description

Paste together corresponding lines of one or more files into vertical columns.

Frequently used options

-d'n'

Separate columns with character n in place of the default tab.

-s

Merge lines from one file into a single line. When multiple files are specified, their contents are placed on individual lines of output, one per file.

For the following three examples, file1 contains:

1
2
3

and file2 contains:

A
B
C

Example 1

A simple paste creates columns from each file in standard output:

$ paste file1 file2
1    A
2    B
3    C

Example 2

The column separator option yields columns separated by the specified character:

$ paste -d'@' file1 file2
1@A
2@B
3@C

Example 3

The single-line option (-s) yields a line for each file:

$ paste -s file1 file2
1    2    3
A    B    C
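
The two options can also be combined. This sketch recreates file1 and file2 in a scratch directory (the directory is an assumption made for the example) and stores the results in variables:

```shell
tmp=$(mktemp -d)
printf '%s\n' 1 2 3 > "$tmp/file1"
printf '%s\n' A B C > "$tmp/file2"

# Comma-separated columns, one output line per input line.
columns=$(paste -d, "$tmp/file1" "$tmp/file2")

# -s with -d+ joins all lines of one file into a single line.
serial=$(paste -s -d+ "$tmp/file1")

printf '%s\n' "$columns" "$serial"
rm -r "$tmp"
```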

pr

Syntax

pr [options] [file]

Description

Convert a text file into a paginated, columnar version, with headers and page fills. This command is convenient for producing presentable output, such as for a line printer, from raw text files. The header consists of the date and time, the filename, and a page number.

Frequently used options

-d

Double space.

-h header

Use header in place of the filename in the header.

-l lines

Set page length to lines. The default is 66.

-o width

Set the left margin to width.

split

Syntax

split [option] [infile] [outfile]

Description

Split infile into a specified number of line groups, with output going into a succession of files, outfileaa, outfileab, and so on (the default is xaa, xab, etc.). The infile remains unchanged. This command is handy if you have a very long text file that needs to be reduced to a succession of smaller files. This was often done to email large files in smaller chunks, because it was at one time considered bad practice to send single large email messages.

Frequently used option

-n

Split the infile into n-line segments, where n is a number. The default is 1000. Current versions of split write this as -l n.

Example

Suppose file1 contains:

1  one
2  two
3  three
4  four
5  five
6  six

Then the command:

$ split -2 file1 splitout_

yields as output three new files, splitout_aa, splitout_ab, and splitout_ac. The file splitout_aa contains:

1  one
2  two

splitout_ab contains:

3  three
4  four

and splitout_ac contains:

5  five
6  six
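
The same split can be sketched with the -l spelling, followed by a check that concatenating the pieces restores the original. The scratch directory is an assumption made for the example:

```shell
tmp=$(mktemp -d)
printf '%s\n' '1  one' '2  two' '3  three' '4  four' '5  five' '6  six' > "$tmp/file1"

# -l 2 is the POSIX spelling of "two lines per output file".
split -l 2 "$tmp/file1" "$tmp/splitout_"

# Count the pieces, then check that concatenating them restores the input.
pieces=$(ls "$tmp" | grep -c '^splitout_')
if cat "$tmp"/splitout_* | cmp -s - "$tmp/file1"; then
    roundtrip=same
else
    roundtrip=different
fi

echo "$pieces $roundtrip"
rm -r "$tmp"
```

Because the suffixes aa, ab, ac sort alphabetically, a shell wildcard lists the pieces in their original order.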

tac

Syntax

tac [file]

Description

This command is named as an opposite for the cat command, which simply prints text files to standard output. In this case, tac prints the text files to standard output with lines in reverse order.

Example

Suppose file1 contains:

1  one
2  two
3  three

Then the command:

$ tac file1

yields as output:

3  three
2  two
1  one

tail

Syntax

tail [options] [files]

Description

Print the last few lines of one or more files (the "tail" of the file or files). When more than one file is specified, a header is printed at the beginning of each file, and each is listed in succession.

Frequently used options

-c n

This option prints the last n bytes, or if n is followed by k or m, the last n kilobytes or megabytes, respectively.

-f

Follow the file dynamically, continuing to display new lines as another process appends them. This is useful for watching log files as the system runs.

-n m

Print the last m lines. The default is 10.
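
A brief sketch of tail on its own and combined with head to extract a range of lines. The seq command is assumed to be available and serves only to generate numbered input:

```shell
# Last three lines of a 20-line stream.
last_lines=$(seq 1 20 | tail -n 3)

# Lines 5 through 7: keep the first seven lines, then the last three of those.
middle=$(seq 1 20 | head -n 7 | tail -n 3)

printf '%s\n' "$last_lines" "$middle"
```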

tr

Syntax

tr [options] [string1 [string2]]

Description

Translate characters from string1 to the corresponding characters in string2. tr does not have file arguments and therefore must use standard input and output. If string1 and string2 specify ranges (a-z or A-Z), they should represent the same number of characters.

Frequently used options

-d

Delete characters in string1 from the output.

-s

Squeeze out repeated output characters in string1.

Example 1

To change all lowercase characters in file1 to uppercase, use either of these commands:

$ cat file1 | tr a-z A-Z

or:

$ tr a-z A-Z < file1

Example 2

To suppress repeated "a" characters from file1:

$ cat file1 | tr -s a

Example 3

To remove all "a," "b," and "c" characters from file1:

$ cat file1 | tr -d abc
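
The three examples can be tried without a file by supplying input with printf. The sample strings are illustrative, and the ranges are quoted so the shell cannot expand them as filename patterns:

```shell
# Translate lowercase to uppercase.
upper=$(printf 'hello world\n' | tr 'a-z' 'A-Z')

# Squeeze runs of "a" down to a single "a".
squeezed=$(printf 'baaad\n' | tr -s 'a')

# Delete every "a", "b", and "c".
deleted=$(printf 'aabbccxyz\n' | tr -d 'abc')

printf '%s\n' "$upper" "$squeezed" "$deleted"
```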

wc

Syntax

wc [options] [files] 

Description

Print counts of characters, words, and lines for files. When multiple files are listed, statistics for each file are output on a separate line, with a cumulative total output last.

Frequently used options

-c

Print the byte count only (the same as the character count for single-byte encodings such as ASCII).

-l

Print the line count only.

-w

Print the word count only.

Example 1

Show all counts and totals for file1, file2, and file3:

$ wc file[123]

Example 2

Count the number of lines in file1:

$ wc -l file1
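
wc also reads standard input, which makes it a common last stage in a pipeline. In this sketch the two-line sample text is arbitrary, and tr removes any padding some versions of wc place before the number:

```shell
text='one two
three'

lines=$(printf '%s\n' "$text" | wc -l | tr -d ' ')
words=$(printf '%s\n' "$text" | wc -w | tr -d ' ')
bytes=$(printf '%s\n' "$text" | wc -c | tr -d ' ')

echo "$lines $words $bytes"
```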

xargs

Syntax

xargs [options] [command] [initial-arguments]

Description

Execute command followed by its optional initial-arguments and append additional arguments found on standard input. Typically, the additional arguments are filenames in quantities too large for a single command line. xargs runs command multiple times to exhaust all arguments on standard input.

Frequently used options

-n maxargs

Limit the number of additional arguments to maxargs for each invocation of command.

-p

Interactive mode. Prompt the user for each execution of command.

Example

Use grep to search a long list of files, one by one, for the word "linux":

$ find / -type f | xargs -n 1 grep linux

find searches for normal files (-type f) starting at the root directory. xargs executes grep once for each of them due to the -n 1 option.
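
The batching effect of -n is easier to see on a small scale. This sketch substitutes echo for grep and five single-letter arguments for the file list, purely for illustration:

```shell
# Five arguments on standard input, at most two per invocation of echo.
batches=$(printf '%s\n' a b c d e | xargs -n 2 echo)

printf '%s\n' "$batches"
```

Each output line corresponds to one execution of echo, so three lines mean the command ran three times.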

3.2.1 The Stream Editor, sed

Another filtering program found on nearly every Unix system is sed, the stream editor. It is called a stream editor because it is intended as a filter, with text usually flowing from standard input, through the utility, to standard output. Unlike the previously listed commands, sed is a programmable utility with a range of capabilities. During processing, sed interprets instructions from a sed script, processing the text according to those instructions. The script may be a single command or a longer list of commands. It is important to understand sed and its use for Exam 101, although detailed knowledge is not required or offered in this brief introduction.

The sed utility is usually used either to automate repetitive editing tasks or to process text in pipes of Unix commands (see Objective 4). It is invoked using one of the following methods.

sed

Syntax

sed [options] 'command1' [files]
sed [options] -e 'command1' [-e 'command2'...] [files]
sed [options] -f script [files]

Description

The first form invokes sed with a one-line command1. The second form invokes sed with two (or more) commands. Note that in this case the -e parameter is required for all commands specified. The commands are specified in quotes to prevent the shell from interpreting and expanding them. The last form instructs sed to take editing commands from file script (which does not need to be executable). In all cases, if files are not specified, input is taken from standard input. If multiple files are specified, the edited output of each successive file is concatenated.
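
The three forms can be sketched side by side. The input string and the temporary script file are assumptions made for the example, and the s (substitute) command used here is described later in this section:

```shell
input='hello world'

# Form 1: a single command.
one=$(printf '%s\n' "$input" | sed 's/hello/goodbye/')

# Form 2: two commands, each introduced by -e.
two=$(printf '%s\n' "$input" | sed -e 's/hello/goodbye/' -e 's/world/moon/')

# Form 3: the same two commands read from a script file.
script=$(mktemp)
printf '%s\n' 's/hello/goodbye/' 's/world/moon/' > "$script"
three=$(printf '%s\n' "$input" | sed -f "$script")
rm "$script"

printf '%s\n' "$one" "$two" "$three"
```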

Frequently used options

-e cmd

The next argument is a command. This is not needed for single commands but is required for all commands when multiple commands are specified.

-f file

The next argument is a script.

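-g

Treat all substitutions as global. This option is not defined by POSIX or GNU sed; with those versions, use the g flag of the s command instead.

-n

Suppress the automatic printing of each line, so that only lines printed explicitly (for example, by the p flag of the s command) appear in the output.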

The sed utility operates on text through the use of addresses and editing commands. The address is used to locate lines of text to be operated upon, and editing commands modify text. During operation, each line (that is, text separated by newline characters) of input to sed is processed individually and without regard to adjacent lines. If multiple editing commands are to be used (through the use of a script file or multiple -e options), they are all applied in order to each line before moving on to the next line.

Input to sed can come from standard input or from files. When input is received from standard input, the original versions of the input text are lost. However, when input comes from files, the files themselves are not changed by sed. The output of sed represents a modified version of the contents of the files but does not affect them.

Addressing

Addresses in sed locate lines of text to which commands will be applied. The addresses can be:

  • A line number (note that sed counts lines continuously across multiple input files).

  • A line number with an interval. The form is n~s, where n is the starting line number and s is the step, or interval, to apply. For example, to match every odd line in the input, the address specification would be 1~2 (start at line 1 and match every two lines thereafter). This feature is a GNU extension to sed.

  • The symbol $, indicating the last line of the last input file.

  • A regular expression delimited by forward slashes (/regex/). See Objective 7 for more information on using regular expressions.

Zero, one, or two such addresses can be used with a sed command. If no addresses are given, commands are applied to all input lines by default. If a single address is given, commands are applied only to a line or lines matching the address. If two comma-separated addresses are given, an inclusive range is implied. Finally, any address may be followed by the ! character, and commands are applied to lines that do not match the address.
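
Each address form can be tried against a five-line stream. This sketch pairs every address with the d (delete) command and uses a small helper function, row, which is hypothetical and only joins the surviving lines for display. seq is assumed to be available, and the 1~2 form requires GNU sed:

```shell
# Helper (hypothetical): join the remaining lines into one space-separated row.
row() { paste -s -d ' ' -; }

by_number=$(seq 1 5 | sed '2d' | row)      # delete line 2
by_range=$(seq 1 5 | sed '2,4d' | row)     # delete lines 2 through 4
last_line=$(seq 1 5 | sed '$d' | row)      # delete the last line
by_regex=$(seq 1 5 | sed '/3/d' | row)     # delete lines containing 3
negated=$(seq 1 5 | sed '2,4!d' | row)     # delete everything except lines 2-4
stepped=$(seq 1 5 | sed '1~2d' | row)      # GNU only: delete every odd line

printf '%s\n' "$by_number" "$by_range" "$last_line" "$by_regex" "$negated" "$stepped"
```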

Commands

The sed command immediately follows the address specification if present. Commands generally consist of a single letter or symbol, unless they have arguments. Following are some basic sed editing commands to get you started.

d

Delete lines.

s

Make substitutions. This is a very popular sed command. The syntax is:

s/pattern/replacement/[flags]

The following flags can be specified for the s command:

g

Replace all instances of pattern, not just the first.

n

Replace nth instance of pattern; the default is 1.

p

Print the line if a successful substitution is done. Generally used with the -n command-line option.

w file

Print the line to file if a successful substitution is done.

y

Translate characters. This command works in a fashion similar to the tr command, described earlier.
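
The effect of the s flags and of y can be compared on short sample strings, which are arbitrary and chosen only to make the differences visible:

```shell
line='a-a-a'

# No flag: only the first match on the line is replaced.
first=$(printf '%s\n' "$line" | sed 's/a/X/')

# g flag: every match is replaced.
global=$(printf '%s\n' "$line" | sed 's/a/X/g')

# Numeric flag: only the second match is replaced.
second=$(printf '%s\n' "$line" | sed 's/a/X/2')

# -n with the p flag: print only the lines on which a substitution happened.
changed=$(printf '%s\n' 'a-a-a' 'b-b-b' | sed -n 's/a/X/p')

# y: translate a to x, b to y, and c to z.
translated=$(printf 'aabbcc\n' | sed 'y/abc/xyz/')

printf '%s\n' "$first" "$global" "$second" "$changed" "$translated"
```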

Example 1

Delete lines 3 through 5 of file1:

$ sed '3,5d' file1

Example 2

Delete lines of file1 that contain a # at the beginning of the line:

$ sed '/^#/d' file1

Example 3

Translate characters:

y/abc/xyz/

Every instance of a is translated to x, b to y, and c to z.

Example 4

Write the @ symbol for all empty lines in file1 (that is, lines with only a newline character but nothing more):

$ sed 's/^$/@/' file1

Example 5

Remove all double quotation marks from all lines in file1:

$ sed 's/"//g' file1

Example 6

Using sed commands from external file sedcmds, replace the third and fourth double quotation marks with ( and ) on lines 1 through 10 in file1. Make no changes from line 11 to the end of the file. Script file sedcmds contains:

1,10{
s/"/(/3
s/"/)/4
}

The command is executed using the -f option:

$ sed -f sedcmds file1

This example employs the positional flag for the s (substitute) command. The first of the two commands substitutes ( for the third double-quote character. The next command substitutes ) for the fourth double-quote character. Note, however, that the position count is interpreted independently for each subsequent command in the script. This is important because each command operates on the results of the commands preceding it. In this example, since the third double quote has been replaced with (, it is no longer counted as a double quote by the second command. Thus, the second command will operate on the fifth double-quote character in the original file1. If the input line starts out with:

""""""

after the first command, which operates on the third double quote, the result is:

""("""

At this point, the numbering of the double-quote characters has changed, and the fourth double quote in the line is now the fifth character. Thus, after the second command executes, the output is:

""(")"

As you can see, creating scripts with sed requires that the sequential nature of the command execution be kept in mind.
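
The ordering effect can be reproduced on the command line with -e in place of a script file. The reversed variant is not part of the example above; it is one possible way to reach the original third and fourth quotes, shown here as a sketch:

```shell
line='""""""'

# Order used in sedcmds: the second command sees the line after the first edit.
sequential=$(printf '%s\n' "$line" | sed -e 's/"/(/3' -e 's/"/)/4')

# Reversed order: replacing the fourth quote first leaves the third in place.
reversed=$(printf '%s\n' "$line" | sed -e 's/"/)/4' -e 's/"/(/3')

printf '%s\n' "$sequential" "$reversed"
```

Working from the highest position down keeps earlier positions stable, since a replacement never shifts the count of the quotes that precede it.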

If you find yourself making repetitive changes to many files on a regular basis, a sed script is probably warranted. Many more commands are available in sed than are listed here.