3.2 Objective 2: Process Text Streams Using Text-Processing Filters
Many of
the commands on Linux systems are intended to be used as filters,
which modify text in helpful ways. Text fed into the command's
standard input or read from files is modified in some useful
way and sent to standard output or to a new file. Multiple
commands can be combined in a pipeline, with the text stream modified at each
step. This section describes basic use and
syntax for the filtering commands important for Exam 101.
Refer to a Linux command reference for full details on each
command and the many other available commands.
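As a minimal sketch of the pipeline idea, the following chains three of the filters described in this section. The three records are made-up sample data in /etc/passwd format, so the result does not depend on the local system.

```shell
# Made-up records in /etc/passwd format (not read from the real file).
names=$(printf '%s\n' \
  'root:x:0:0:root:/root:/bin/bash' \
  'daemon:x:1:1:daemon:/usr/sbin:/usr/sbin/nologin' \
  'alice:x:1000:1000:Alice:/home/alice:/bin/bash' |
  cut -d: -f1 |      # keep only the first colon-delimited field
  tr a-z A-Z |       # convert it to uppercase
  head -n 2)         # keep the first two lines
printf '%s\n' "$names"
# prints:
# ROOT
# DAEMON
```

Each command reads the previous command's standard output, so no intermediate files are needed.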
Syntax
cut options [files]
Description
Cut out (that is, print) selected columns or
fields from one or more files. The source file is not
changed. This is useful if you need quick access to a vertical
slice of a file. By default, the slices are delimited by a
tab.
Frequently used options
- -b list
-
Print bytes in list
positions.
- -c list
-
Print characters in list
columns.
- -d delim
-
Set field delimiter for -f.
- -f list
-
Print list fields.
Examples
Show usernames (in the first colon-delimited
field) from /etc/passwd: $ cut -d: -f1 /etc/passwd
Show first column of /etc/passwd: $ cut -c 1 /etc/passwd
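The list argument can also name several fields or a range of positions. The sketch below uses two made-up records in /etc/passwd format rather than the real file.

```shell
# Made-up sample records; the variable name "sample" is only for illustration.
sample='root:x:0:0:root:/root:/bin/bash
alice:x:1000:1000:Alice:/home/alice:/bin/zsh'

# Fields 1 and 7 (login name and shell); the delimiter is kept between them.
fields=$(printf '%s\n' "$sample" | cut -d: -f1,7)
# Characters 1 through 4 of every line.
chars=$(printf '%s\n' "$sample" | cut -c 1-4)

printf '%s\n' "$fields" "$chars"
# prints:
# root:/bin/bash
# alice:/bin/zsh
# root
# alic
```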
Syntax
expand [options] files
Description
Convert tabs to spaces. Sometimes the use of
tab characters can make output that is attractive on one
output device look bad on another. This command eliminates
tabs and replaces them with the equivalent number of spaces.
By default, tabs are assumed to be eight spaces apart.
Frequently used options
- -t tabs
-
Specify tab stops, in place of default
8.
- -i
-
Initial; convert only at start of
lines.
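A short sketch of both options, using a made-up line that contains a leading tab and an embedded tab. With tab stops every four columns, each tab here happens to expand to four spaces.

```shell
# Convert every tab, with tab stops every 4 columns.
all=$(printf '\tname\tvalue\n' | expand -t 4)
# Convert only the leading tab; the tab after "name" is left alone.
initial=$(printf '\tname\tvalue\n' | expand -t 4 -i)

printf '%s\n' "$all"       # prints "    name    value"
printf '%s\n' "$initial"   # prints "    name" followed by a tab and "value"
```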
Syntax
fmt [options] [files]
Description
Format text to a specified width by filling
lines and removing newline characters. Multiple files
from the command line are concatenated.
Frequently used options
- -u
-
Use uniform spacing: one space between
words and two spaces between sentences.
- -w width
-
Set line width to width. The default
is 75 characters.
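As a rough illustration, the following refills three short lines of made-up text to a width of 20 characters. The exact line breaks depend on the fmt implementation, so no expected output is shown.

```shell
# Three ragged input lines are joined and refilled to at most 20 columns.
wrapped=$(printf '%s\n' 'the quick brown' 'fox jumps' 'over the lazy dog' | fmt -w 20)
printf '%s\n' "$wrapped"
```

The words and their order are unchanged; only the positions of the newlines differ.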
Syntax
head [options] [files]
Description
Print the first few lines of one or more
files (the "head" of the file or files). When more than one
file is specified, a header is printed at the beginning of
each file, and each is listed in succession.
Frequently used options
- -c n
-
Print the first n bytes, or if
n is followed by k or m, print the
first n kilobytes or megabytes, respectively.
- -n n
-
Print the first n lines. The default
is 10.
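A brief sketch of limiting output by lines and by bytes, with the input generated by printf so that it is self-contained:

```shell
# First three lines of a five-line stream.
first3=$(printf '%s\n' one two three four five | head -n 3)
# First five bytes of a single line.
first5=$(printf 'abcdefghij\n' | head -c 5)

printf '%s\n' "$first3" "$first5"
# prints:
# one
# two
# three
# abcde
```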
Syntax
join [options] file1 file2
Description
Print a line for each pair of input lines,
one each from file1 and file2, that have
identical join fields. This
function could be thought of as a very simple database table
join, where the two files share a common index just as two
tables in a database would.
Frequently used options
- -1 field
-
Join on field of file1 (written as -j1 field in older versions).
- -2 field
-
Join on field of file2 (written as -j2 field in older versions).
- -j field
-
Join on field of both file1
and file2.
Example
Suppose file1 contains the
following: 1 one
2 two
3 three
and file2
contains: 1 11
2 22
3 33
Issuing the command: $ join -j 1 file1 file2
yields the following output: 1 one 11
2 two 22
3 three 33
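The join field does not have to be in the same position in both files. The sketch below uses the -1 and -2 options with two made-up files (the names people and depts are only for illustration); both files must already be sorted on their join field.

```shell
tmp=$(mktemp -d)
# people: name, department id (join field is column 2)
printf '%s\n' 'alice 10' 'bob 20' > "$tmp/people"
# depts: department id, department name (join field is column 1)
printf '%s\n' '10 sales' '20 ops' > "$tmp/depts"

# The join field is printed first, then the remaining fields of each file.
joined=$(join -1 2 -2 1 "$tmp/people" "$tmp/depts")
printf '%s\n' "$joined"
# prints:
# 10 alice sales
# 20 bob ops
rm -rf "$tmp"
```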
Syntax
nl [options] [files]
Description
Number the lines of files, which are
concatenated in the output. This command is used for numbering
lines in the body of text; the special header and footer
sections are normally excluded from the line numbering. The
numbering is done for each logical
page, which is defined as having a header, a body, and
a footer. These are delimited by the special strings
\:\:\:, \:\:, and \:, respectively.
Frequently used options
- -b style
-
Set body numbering style to style,
t by default.
- -f style
-
Set footer number style to style,
n by default.
- -h style
-
Set header numbering style to style,
n by default.
Styles can be in these forms:
- a
-
Number all lines.
- t
-
Only number non-empty lines.
- n
-
Do not number lines.
- pREGEXP
-
Only number lines that contain a match for
regular expression REGEXP.
Example
Suppose file file1 contains the following
text: \:\:\:
header
\:\:
line1
line2
line3
\:
footer
\:\:\:
header
\:\:
line1
line2
line3
\:
footer
If the following command is given: $ nl -h a file1
the output would yield numbered headers and
body lines but no numbering on footer lines. Each new header
represents the beginning of a new logical page and thus a
restart of the numbering sequence: 1 header
2 line1
3 line2
4 line3
footer
1 header
2 line1
3 line2
4 line3
footer
Syntax
od [options] [files]
Description
Dump files in octal
and other formats. This program prints a listing of a file's
contents in a variety of formats. It is often used to examine
the byte codes of binary files but can be used on any file or
input stream. Each line of output consists of an octal byte
offset from the start of the file followed by a series of
tokens indicating the contents of the file. Depending on the
options specified, these tokens can be ASCII, decimal,
hexadecimal, or octal representations of the contents.
Frequently used options
- -t type
-
Specify the type of output. Typical
types include:
- a
-
Named character
- c
-
ASCII character or backslash
escape
- o
-
Octal (the default)
- x
-
Hexadecimal
Example
If file1 contains: a1\n
A1\n
where \n
stands for the newline character, the od command specifying named
characters yields the following output: $ od -t a file1
0000000   a   1  nl   A   1  nl
0000006
A slight nuance is the ASCII character mode.
This od command specifying
ASCII characters yields the following output with
backslash-escaped characters rather than named characters:
$ od -t c file1
0000000   a   1  \n   A   1  \n
0000006
With numeric output formats, you can instruct
od on how many bytes to use in
interpreting each number in the data. To do this, follow the
type specification by a decimal integer. This od command specifying single-byte hex
results yields the following output: $ od -t x1 file1
0000000 61 31 0a 41 31 0a
0000006
Doing the same thing in octal notation
yields: $ od -t o1 file1
0000000 141 061 012 101 061 012
0000006
If you examine an ASCII chart with hex and
octal representations, you'll see that these results match
those tables.
Syntax
paste [options] files
Description
Paste together corresponding lines of one or
more files into vertical
columns.
Frequently used options
- -d'n'
-
Separate columns with character n in
place of the default tab.
- -s
-
Merge lines from one file into a single
line. When multiple files are specified, their contents are
placed on individual lines of output, one per file.
For the following three examples,
file1 contains: 1
2
3
and file2
contains: A
B
C
Example 1
A simple paste creates columns from each file
in standard output: $ paste file1 file2
1 A
2 B
3 C
Example 2
The column separator option yields columns
separated by the specified character: $ paste -d'@' file1 file2
1@A
2@B
3@C
Example 3
The single-line option (-s) yields a line for each file: $ paste -s file1 file2
1 2 3
A B C
Syntax
pr [options] [file]
Description
Convert a text file into a paginated,
columnar version, with headers and page fills. This command is
convenient for producing readable output, such as for a line
printer, from raw text files. The header will
consist of the date and time, the filename, and a page number.
Frequently used options
- -d
-
Double space.
- -h header
-
Use header in place of the filename
in the header.
- -l lines
-
Set page length to lines. The
default is 66.
- -o width
-
Set the left margin to
width.
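A small sketch combining three of these options on made-up input. The header text "Word list" is arbitrary, and the date in the header varies from run to run, so the full output is not reproduced here.

```shell
# Paginate three lines onto a 15-line page with a custom header
# and a left margin of 4 spaces.
page=$(printf '%s\n' alpha beta gamma | pr -h "Word list" -l 15 -o 4)
printf '%s\n' "$page"
```

The page begins with a header that includes "Word list" and a page number, followed by the three input lines indented by the margin.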
Syntax
split [option] [infile] [outfile]
Description
Split infile into a specified number
of line groups, with output going into a succession of files,
outfileaa, outfileab, and so on (the default is
xaa, xab, etc.). The infile remains
unchanged. This command is handy if you have a very long text
file that needs to be reduced to a succession of smaller
files. This was often done to email large files in smaller
chunks, because it was at one time considered bad practice to
send single large email messages.
Frequently used option
- -n
-
Split the infile into n-line
segments, where n is a number (for example, -2); -l n is the
equivalent form. The default is 1000.
Example
Suppose file1 contains: 1 one
2 two
3 three
4 four
5 five
6 six
Then the command: $ split -2 file1 splitout_
yields as output three new files,
splitout_aa, splitout_ab, and splitout_ac. The
file splitout_aa contains: 1 one
2 two
splitout_ab contains: 3 three
4 four
and splitout_ac contains: 5 five
6 six
Syntax
tac [file]
Description
This command is named as an opposite for the
cat command, which simply
prints text files to standard output. In this case, tac prints the text files to
standard output with lines in reverse order.
Example
Suppose file1 contains: 1 one
2 two
3 three
Then the command: $ tac file1
yields as output: 3 three
2 two
1 one
Syntax
tail [options] [files]
Description
Print the last few lines of one or more
files (the "tail" of the file or files). When more than
one file is specified, a header is printed at the beginning of
each file, and each is listed in succession.
Frequently used options
- -c n
-
This option prints the last n bytes,
or if n is followed by k or m, the last
n kilobytes or megabytes, respectively.
- -f
-
Follow the file: continuously display new lines as they are
added to the bottom of the file by another process. This is useful for
watching log files as the system runs.
- -n m
-
Prints the last m lines. The default
is 10.
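A brief sketch of the line and byte forms, with input generated by printf. The -f form is shown only as a comment because it runs until interrupted, and the log path in it is just an example.

```shell
# Last two lines of a five-line stream.
last2=$(printf '%s\n' one two three four five | tail -n 2)
# Last four bytes of a stream that has no trailing newline.
last4=$(printf 'abcdefghij' | tail -c 4)

printf '%s\n' "$last2" "$last4"
# prints:
# four
# five
# ghij

# tail -f /var/log/syslog    # follows the file until interrupted with Ctrl-C
```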
Syntax
tr [options] [string1 [string2]]
Description
Translate characters from string1 to
the corresponding characters in string2. tr does not have file arguments and therefore
must use standard input and output. If string1 and
string2 specify ranges (a-z or A-Z),
they should represent the same number of characters.
Frequently used options
- -d
-
Delete characters in string1 from
the output.
- -s
-
Squeeze out repeated output characters in
string1.
Example 1
To change all lowercase characters in file1 to uppercase, use either of
these commands: $ cat file1 | tr a-z A-Z
or: $ tr a-z A-Z < file1
Example 2
To suppress repeated "a" characters from
file1: $ cat file1 | tr -s a
Example 3
To remove all "a," "b," and "c" characters
from file1: $ cat file1 | tr -d abc
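To make the effect of each form concrete, this sketch runs the same three operations on short made-up strings instead of file1:

```shell
upper=$(printf 'hello\n' | tr a-z A-Z)       # translate lowercase to uppercase
squeezed=$(printf 'baaad\n' | tr -s a)       # collapse runs of "a" into one
deleted=$(printf 'abcabcd\n' | tr -d abc)    # remove every a, b, and c

printf '%s\n' "$upper" "$squeezed" "$deleted"
# prints:
# HELLO
# bad
# d
```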
Syntax
wc [options] [files]
Description
Print counts of characters, words, and lines
for files. When multiple files are listed, statistics
for each file are output on a separate line, with a cumulative
total output last.
Frequently used options
- -c
-
Print the byte count only (the same as the character count for single-byte text).
- -l
-
Print the line count only.
- -w
-
Print the word count only.
Example 1
Show all counts and totals for file1,
file2, and file3: $ wc file[123]
Example 2
Count the number of lines in
file1: $ wc -l file1
Syntax
xargs [options] [command] [initial-arguments]
Description
Execute command followed by its
optional initial-arguments and append additional
arguments found on standard input. Typically, the additional
arguments are filenames in quantities too large for a single
command line. xargs runs
command multiple times to exhaust all arguments on
standard input.
Frequently used options
- -n maxargs
-
Limit the number of additional arguments to
maxargs for each invocation of command.
- -p
-
Interactive mode. Prompt the user for each
execution of command.
Example
Use grep to
search a long list of files, one by one, for the word "linux":
$ find / -type f | xargs -n 1 grep linux
find searches
for normal files (-type f )
starting at the root directory. xargs executes grep once for each of them due to the
-n 1 option.
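The batching behavior is easier to see with echo as the command. In this sketch, five made-up arguments arrive on standard input and xargs passes at most two of them to each invocation; "item:" is an arbitrary initial argument.

```shell
# Three invocations of echo: two with two arguments, one with the remainder.
pairs=$(printf '%s\n' a b c d e | xargs -n 2 echo)
# Initial arguments are repeated at the start of every invocation.
tagged=$(printf '%s\n' a b c d e | xargs -n 2 echo item:)

printf '%s\n' "$pairs" "$tagged"
# prints:
# a b
# c d
# e
# item: a b
# item: c d
# item: e
```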
3.2.1 The Stream Editor, sed
Another filtering program found on nearly
every Unix system is sed, the
stream editor. It is called a
stream editor because it is intended as a filter, with text
usually flowing from standard input, through the utility, to
standard output. Unlike the previously listed commands, sed is a programmable utility with a
range of capabilities. During processing, sed interprets instructions from a
sed script, processing the text
according to those instructions. The script may be a single
command or a longer list of commands. It is important to
understand sed and its use for
Exam 101, although detailed knowledge is not required or
offered in this brief introduction.
The sed
utility is usually used either to automate repetitive editing
tasks or to process text in pipes of Unix commands (see
Objective 4). The scripts that sed executes can be single commands
or more complex lists of editing instructions. It is invoked
using one of the following methods.
Syntax
sed [options] 'command1' [files]
sed [options] -e 'command1' [-e 'command2'...] [files]
sed [options] -f script [files]
Description
The first form invokes sed with a one-line
command1. The second
form invokes sed with two (or
more) commands. Note that in this case the -e parameter is required for all
commands specified. The
commands are specified in quotes to prevent the shell from
interpreting and expanding them. The last form instructs sed to take editing commands from
file script (which does not need to be executable). In all cases, if files are
not specified, input is taken from standard input. If multiple files are
specified, the edited output of each successive file is
concatenated.
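The three invocation forms can be compared side by side. This is a minimal sketch using a temporary directory; the file names fruit and script.sed and their contents are made up for the illustration.

```shell
tmp=$(mktemp -d)
printf '%s\n' 'red apple' 'green apple' > "$tmp/fruit"
# A script file holds one editing command per line; it need not be executable.
printf '%s\n' 's/apple/pear/' 's/red/yellow/' > "$tmp/script.sed"

one=$(sed 's/apple/pear/' "$tmp/fruit")                        # single command
two=$(sed -e 's/apple/pear/' -e 's/red/yellow/' "$tmp/fruit")  # two commands
three=$(sed -f "$tmp/script.sed" "$tmp/fruit")                 # commands from a file

printf '%s\n' "$one"     # prints "red pear" then "green pear"
printf '%s\n' "$two"     # prints "yellow pear" then "green pear"
printf '%s\n' "$three"   # same output as the -e form
rm -rf "$tmp"
```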
Frequently used options
- -e cmd
-
The next argument is a command. This is not needed for single
commands but is required for all commands when multiple
commands are specified.
- -f file
-
The next argument is a script.
- -n
-
Suppress the automatic printing of each line; only lines
printed explicitly (for example, by the p flag) are output.
The sed
utility operates on text through the use of addresses and editing commands. The address is used
to locate lines of text to be operated upon, and editing
commands modify text. During
operation, each line (that is, text separated by
newline characters) of input to sed is processed individually and
without regard to adjacent lines. If multiple editing commands are to
be used (through the use of a script file or multiple -e options), they are all applied in
order to each line before moving on to the next line.
Input to sed can come from standard
input or from files.
When input is received from standard input, the original
versions of the input text are lost. However, when input comes
from files, the files themselves are not changed by sed.
The output of sed represents a
modified version of the contents of the files but does not
affect them.
Addressing
Addresses in sed locate lines of text to which
commands will be applied. The
addresses can be:
-
A line number (note that sed counts lines continuously
across multiple input files).
-
A line number with an interval. The form is n~s, where
n is the starting line number and s is the
step, or interval, to apply.
For example, to match every odd line in the input, the
address specification would be 1~2 (start at line 1
and match every two lines thereafter). This feature is a GNU extension
to sed.
-
The symbol $, indicating the last line of the
last input file.
-
A regular expression delimited by forward
slashes (/regex/ ). See Objective 7 for more
information on using regular expressions.
Zero, one, or two such addresses can be used
with a sed command. If no addresses are given, commands
are applied to all input lines by default. If a single address is given,
commands are applied only to a line or lines matching the
address. If two comma-separated
addresses are given, an inclusive range is implied. Finally, any address may be
followed by the ! character,
and commands are applied to lines that do not match the address.
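The address forms described above can be tried with the d (delete) command on a five-line made-up input. The step form is left as a comment because it works only in GNU sed.

```shell
lines=$(printf '%s\n' one two three four five)

single=$(printf '%s\n' "$lines" | sed '2d')       # line number: delete line 2
range=$(printf '%s\n' "$lines" | sed '2,4d')      # inclusive range: delete lines 2-4
last=$(printf '%s\n' "$lines" | sed '$d')         # $: delete the last line
regex=$(printf '%s\n' "$lines" | sed '/^t/d')     # regex: delete lines starting with t
negate=$(printf '%s\n' "$lines" | sed '/^t/!d')   # !: delete lines NOT starting with t
# sed '1~2d' would delete every odd-numbered line (GNU sed only).

printf '%s\n' "$range"    # prints "one" then "five"
printf '%s\n' "$negate"   # prints "two" then "three"
```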
Commands
The sed
command immediately follows the address specification if
present. Commands generally
consist of a single letter or symbol, unless they have
arguments. Following are some
basic sed editing
commands to get you started.
- d
-
Delete lines.
- s
-
Make substitutions. This is a very popular
sed command. The syntax is:
s/pattern/replacement/[flags]
The following flags can be specified
for the s command:
- g
-
Replace all instances of pattern,
not just the first.
- n
-
Replace nth
instance of pattern; the default is 1.
- p
-
Print the line if a successful substitution
is done. Generally used with
the -n command-line option.
- w
file
-
Print the line to file if a successful substitution
is done.
- y
-
Translate characters. This command works in
a fashion similar to the tr
command, described earlier.
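The flags of the s command and the y command can be compared on short made-up strings. This sketch also uses the -n command-line option of sed, which suppresses the default output so that only lines printed by the p flag appear.

```shell
text='a-b-c-d'
first=$(printf '%s\n' "$text" | sed 's/-/+/')      # no flag: first match only
all=$(printf '%s\n' "$text" | sed 's/-/+/g')       # g: every match
second=$(printf '%s\n' "$text" | sed 's/-/+/2')    # 2: second match only
# p with -n: print only the lines on which a substitution happened.
printed=$(printf '%s\n' 'keep-me' 'skip' | sed -n 's/-/+/p')
# y: translate a to x, b to y, and c to z.
translated=$(printf '%s\n' 'cab' | sed 'y/abc/xyz/')

printf '%s\n' "$first" "$all" "$second" "$printed" "$translated"
# prints:
# a+b-c-d
# a+b+c+d
# a-b+c-d
# keep+me
# zxy
```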
Example 1
Delete lines 3 through 5 of file1: $ sed '3,5d' file1
Example 2
Delete lines of file1 that contain a
# at the beginning of the line: $ sed '/^#/d' file1
Example 3
Translate characters: y/abc/xyz/
Every instance of a is translated to
x, b to y, and c to z.
Example 4
Write the @ symbol for all empty
lines in file1 (that is, lines with only a newline
character but nothing more): $ sed 's/^$/@/' file1
Example 5
Remove all double quotation marks from all
lines in file1: $ sed 's/"//g' file1
Example 6
Using sed
commands from external file sedcmds, replace the third
and fourth double quotation marks with ( and ) on lines 1 through 10 in
file1. Make no changes
from line 11 to the end of the file. Script file sedcmds
contains: 1,10{
s/"/(/3
s/"/)/4
}
The command is executed using the -f option: $ sed -f sedcmds file1
This example employs the positional flag for
the s (substitute) command. The first of the two commands
substitutes ( for the third
double-quote character. The
next command substitutes ) for
the fourth double-quote character. Note, however, that the position
count is interpreted independently for each subsequent
command in the script. This is important because each command
operates on the results of the commands preceding it. In this example, since the third
double quote has been replaced with ( , it is no longer counted as a
double quote by the second command. Thus, the second command will
operate on the fifth double
quote character in the original file1. If the input line starts out
with: """"""
after the first command, which operates on
the third double quote, the result is: ""("""
At this point, the numbering of the
double-quote characters has changed, and the fourth double
quote in the line is now the fifth character. Thus, after the second command
executes, the output is: ""(")"
As you can see, creating scripts with sed requires that the sequential
nature of the command execution be kept in mind.
If you find yourself making repetitive
changes to many files on a regular basis, a sed script is probably warranted. Many more commands are available in
sed than are listed here.