One of the simplest applications of awk (33.11) is building a name and address database. It is a good exercise for learning awk as well. It involves organizing the information as a record and then writing programs that extract information from the records for display in reports. The scripts in this article use nawk (33.12) instead of awk, but the principles are the same.
The first thing to decide is the structure of a record. At the very least we'd like to have the following fields:
Name
Street
City
State
Zip
But we may wish to have a more complex record structure:
Name
Title
Company
Division
Street
City
State
Zip
Phone
Fax
Email
Directory
Comments
It doesn't matter to our programming effort whether the record has five fields or thirteen. It does matter that the structure is decided upon before you begin programming.
The next decision we must make is how to distinguish one field from the next and how to distinguish one record from another. If your records are short, you could have one record per line and use an oddball character as a field delimiter:
Name~Street~City~State~Zip
Name1~Street1~City1~State1~Zip1
The downside of this solution is that it can be difficult to edit the records. (We are going to try to avoid writing programs for automating data entry. Instead, we will assume that you create the record with a text editor such as vi or Emacs.)
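To make the one-record-per-line layout concrete, here is a minimal sketch of how such a file could be read. The function name tilde_labels is hypothetical (it is not one of the programs in this article), and the sketch uses plain awk on the assumption that nawk may not be installed everywhere; nawk should work the same way.

```shell
# tilde_labels: hypothetical helper, not one of the article's programs.
# -F'~' makes "~" the field separator, so $1..$5 hold
# Name, Street, City, State and Zip.
tilde_labels() {
    awk -F'~' '{ printf "%s\n%s\n%s, %s %s\n\n", $1, $2, $3, $4, $5 }' "$@"
}

# Format the two sample records shown above.
printf '%s\n' 'Name~Street~City~State~Zip' \
              'Name1~Street1~City1~State1~Zip1' | tilde_labels
```

Reading the file is easy; the editing problem described above remains.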
Another solution is to put each field on a line by itself and separate the records with a blank line:
Name
Street
City
State
Zip

Name1
Street1
City1
State1
Zip1
This is a good solution. You have to be careful that the data does not itself contain blank lines. For instance, if you wanted to add a field for Company name, and not all records have a value for Company, then you must use a placeholder character to indicate an empty value.
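A rough sketch of reading this layout follows. Setting RS to the empty string puts awk into "paragraph mode," where blank lines separate records, and FS="\n" makes each line a field. The function name list_names is hypothetical, and plain awk is assumed to behave like nawk here.

```shell
# list_names: hypothetical helper that prints "Name, City" per record.
# RS=""   -> records are separated by one or more blank lines
# FS="\n" -> each line of a record is one field
list_names() {
    awk 'BEGIN { RS = ""; FS = "\n" }
         { print $1 ", " $3 }' "$@"
}

# The two sample records shown above, separated by a blank line.
printf '%s\n' Name Street City State Zip '' \
              Name1 Street1 City1 State1 Zip1 | list_names
```

Because a blank line ends the record, an empty field inside a record has to be written as a placeholder character rather than left blank.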
Another solution is to put each record in its own file and put each field on its own line. This is the record organization we will implement for our program. Two advantages of it are that it permits variable length records and it does not require the use of special delimiter characters. It is therefore pretty easy to create or edit a record. It is also very easy to select a subset of records for processing.
We will give each file a name that uniquely identifies it in the current directory. A list of records is the same as a list of files. Here is a sample record in a file named pmui:
Peter Mui
International Sales Manager
O'Reilly & Associates, Inc.
East Coast Division
90 Sherman Street
Cambridge
MA
01240
617-354-5800
617-661-1116
[email protected]
/home/peter
Any number of lines may appear as a comment.
In this record, there are thirteen fields, any of which can be blank (but the blank line must be there to save the position), and the last field can have as many lines as needed.
Our record does not contain labels that identify what each field contains. While we could put that information in the record itself, it is better to maintain the labels separately so they can be changed in a single location. (You can create a record template that contains the labels to help you identify fields when adding a new record.)
We have put the labels for these fields in a separate file named dict. We won't show this file because its contents describe the record structure as shown above.
We are going to have three programs and they share the same syntax:
command record-list
The record-list is a list of one or more filenames. You can use wildcard characters, of course, on the command line to specify multiple records.
The first program, read.base, reads the dict file to get the labels and outputs a formatted record.
% read.base record

pmui:
1.  Name:       Peter Mui
2.  Title:      International Sales Manager
3.  Company:    O'Reilly & Associates, Inc.
4.  Division:   East Coast Division
5.  Street:     90 Sherman Street
6.  City:       Cambridge
7.  State:      MA
8.  Zip:        01240
9.  Phone:      617-354-5800
10. Fax:        617-661-1116
11. Email:      [email protected]
12. Directory:  /home/peter
13. Comments:   Any number of lines may appear as a comment.
read.base first outputs the record name and then lists each field. Let's look at read.base:
nawk 'BEGIN {
    FS=":"
    # test to see that at least one record was specified
    if (ARGC < 2) {
        print "Please supply record list on command line"
        exit
    }
    # name of local file containing field labels:
    record_template = "dict"
    # loop to read the record_template
    # field_inc = the number of fields
    # fields[] = an array of labels indexed by position
    field_inc=0
    while ((getline < record_template) > 0) {
        ++field_inc
        fields[field_inc] = $1
    }
    field_tot=field_inc
}
# Now we are reading the records
# Print filename for each new record
FNR == 1 {
    field_inc=0
    print "\n" FILENAME ":"
}
{
    # Print the position, label and value of each field.
    # (No apostrophes in these comments: one would end the
    # single-quoted shell string that holds the script.)
    # The last field can have any number of lines without a label.
    if (++field_inc <= field_tot){
        # pad single-digit positions so the labels line up
        if (field_inc >= 10)
            space = ". "
        else
            space = ".  "
        print field_inc space fields[field_inc] ":\t" $NF
    }
    else
        print $NF
}' $*
Note that the program is not doing any input validation. If the record is missing a Division name (and you didn't leave the fourth line blank), whatever is on line 4 will match up with Division, even if it's really a street address. One of the uses of read.base is simply to verify that what you entered in the file is correct.
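If you want a rough check before trusting the output, something like the following sketch could flag records that have fewer lines than there are labels. The name check_base is hypothetical and not part of this article's programs; the sketch assumes dict holds one label per line and uses plain awk in place of nawk. It only catches records that are too short; it cannot tell whether line 4 really holds a Division.

```shell
# check_base DICT RECORD...
# Hypothetical helper: report records with fewer lines than DICT has labels.
check_base() (
    dict=$1; shift
    awk -v dict="$dict" '
    BEGIN {
        # Count the labels; one label per line of dict is assumed.
        while ((getline line < dict) > 0) field_tot++
    }
    # Starting a new file: check the one we just finished.
    FNR == 1 && NR > 1 { check() }
    { file = FILENAME; count = FNR }
    # Check the last file too.
    END { if (NR > 0) check() }
    function check() {
        if (count < field_tot)
            print file ": only " count " lines, expected at least " field_tot
    }' "$@"
)

# Usage (assuming dict and the record files are in the current directory):
#   check_base dict pmui lwalsh jberlin
```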
If you specify more than one record, then you will get all of those records output in the order that you specified them on the command line.
The second program is mail.base. It extracts mailing label information.
% mail.base pmui
Peter Mui
International Sales Manager
O'Reilly & Associates, Inc.
East Coast Division
90 Sherman Street
Cambridge, MA 01240
If you supply a record-list, then you will get a list of mailing labels.
Here is the mail.base program:
nawk 'BEGIN {
    FS="\n";
    # test that user supplies a record
    if (ARGC < 2) {
        print "Please supply record list on command line"
        exit
    }
}
# ignore blank lines
/^$/ { next }
# this is hard-coded to record format;
# print first 5 fields and then print
# city, state zip on one line.
# (an explicit "%s" format keeps a "%" in the data from
# being read as a printf format specifier)
{
    if (FNR < 6)
        print $0
    else if (FNR == 6)
        printf "%s, ", $0
    else if (FNR == 7)
        printf "%s", $0
    else if (FNR == 8)
        printf " %s\n\n", $0
}' $*
Variations on this very simple program can be written to extract or compile other pieces of information. You could also output formatting codes used when printing the labels.
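As one possible variation, here is a sketch of a phone list that prints each Name (line 1) next to its Phone (line 9). The function name phone_base is hypothetical; like mail.base, it is hard-coded to the thirteen-field record format, and it uses plain awk on the assumption that nawk behaves the same.

```shell
# phone_base RECORD...
# Hypothetical variation on mail.base: print name and phone number.
phone_base() {
    awk 'FNR == 1 { name = $0 }                      # line 1 is Name
         FNR == 9 { printf "%-25s %s\n", name, $0 }  # line 9 is Phone
        ' "$@"
}

# Usage (assuming the record files are in the current directory):
#   phone_base pmui lwalsh jberlin
```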
The last program is list.base. It prepares a tabular list of names and records and allows you to select a particular record.
% list.base lwalsh pmui jberlin
  # NAME & COMPANY                           FILE
 1. Linda Walsh, O'Reilly & Associates, Inc. lwalsh
 2. Peter Mui, O'Reilly & Associates, Inc.   pmui
 3. Jill Berlin, O'Reilly & Associates, Inc. jberlin
Select a record by number: 2
When you select the record number, that record is displayed by using read.base. I have not built in any paging capability, so the list will scroll continuously rather than pause after 24 lines or so as it might.
Here is the list.base program:
nawk 'BEGIN {
    # Do everything as BEGIN procedure
    # test that user supplied record-list
    if (ARGC < 2) {
        print "Please supply record list on command line"
        exit
    }
    # Define report format string in one place.
    FMTSTR = "%3s %-40s %-15s\n"
    # print report header
    printf(FMTSTR, "#","NAME & COMPANY", "FILE")
    # For each record, get Name, Title and Company and print it.
    inc=0
    for (x=1; x < ARGC; x++){
        getline NAME < ARGV[x]
        getline TITLE < ARGV[x]
        getline COMPANY < ARGV[x]
        record_list[x] = ARGV[x]
        printf(FMTSTR, ++inc ".", NAME ", " COMPANY, ARGV[x])
    }
    # Prompt user to select a record by number
    printf "Select a record by number:"
    getline answer < "-"
    # Call read.base program to display the selected record
    system("read.base " record_list[answer])
} ' $*
Different versions of this program can be written to examine individual pieces of information across a set of records.
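For instance, a sketch along the following lines could pick out the records for one state by testing the State field (line 7) of each file. The name state_base is hypothetical, and plain awk is assumed in place of nawk.

```shell
# state_base STATE RECORD...
# Hypothetical helper: print the names of record files whose
# State field (line 7) matches STATE exactly.
state_base() (
    state=$1; shift
    awk -v state="$state" 'FNR == 7 && $0 == state { print FILENAME }' "$@"
)

# Usage (assuming the record files are in the current directory):
#   state_base MA pmui lwalsh jberlin
```

Because the output is a list of filenames, it can serve as the record-list for any of the three programs.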
Article 45.22 shows how to write a shell script that creates a prompt-driven front end to collect names and addresses. (It needs to be modified to put out a blank line for empty fields and not to write the labels into the file.)