A C program consists of individual building blocks called functions, which can invoke one another. Each function performs a certain task. Ready-made functions are available in the standard library; other functions are written by the programmer as necessary. The function name main() is special: it designates the first function invoked when a program starts. All other functions are subroutines.
Figure 1-1 illustrates the structure of a C program. The program shown consists of the functions main() and showPage(), and prints the beginning of a text file to be specified on the command line when the program is started.
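A subroutine such as showPage() might be sketched as follows. This is not the code of Figure 1-1: the signature, the parameter names in and maxLines, and the return value are assumptions made for illustration only.

```c
#include <stdio.h>

/* Hypothetical sketch: copy at most maxLines lines from the stream in
   to standard output. Returns the number of complete lines written. */
int showPage(FILE *in, int maxLines)
{
    int c;
    int lines = 0;

    while (lines < maxLines && (c = getc(in)) != EOF) {
        putchar(c);
        if (c == '\n')      /* a newline character completes one line */
            ++lines;
    }
    return lines;
}
```

In a complete program, main() would typically open the file named on the command line with fopen(), pass the resulting stream to showPage(), and close it afterward.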
The statements that make up the functions, together with the necessary declarations and preprocessing directives, form the source code of a C program. For small programs, the source code is written in a single source file. Larger C programs consist of several source files, which can be edited and compiled separately. Each such source file contains functions that belong to a logical unit, such as functions for output to a terminal, for example. Information that is needed in several source files, such as declarations, is placed in header files. These can then be included in each source file via the #include directive.
Source files have names ending in .c; header files have names ending in .h. A source file together with the header files included in it is called a translation unit.
There is no prescribed order in which functions must be defined. The function showPage() in Figure 1-1 could also be placed before the function main(). A function cannot be defined within another function, however.
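The following minimal sketch illustrates the point about ordering. The function names square() and sum_of_squares() are hypothetical; a prototype makes square() known to the compiler before its definition appears.

```c
/* Prototype: declares square() so that it can be called
   before its definition appears in the file. */
static int square(int n);

/* Defined first, although it depends on square(). */
int sum_of_squares(int a, int b)
{
    return square(a) + square(b);
}

/* The definition may follow its callers; no order is prescribed. */
static int square(int n)
{
    return n * n;
}
```

Swapping the two definitions would make the prototype unnecessary, but the program's behavior would be the same.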
The compiler processes each source file in sequence and decomposes its contents into tokens, such as function names and operators. Tokens can be separated by one or more whitespace characters, such as space, tab, or newline characters. Thus only the order of tokens in the file matters. The layout of the source code—line breaks and indentation, for example—is unimportant. The preprocessing directives are an exception to this rule, however. These directives are commands to be executed by the preprocessor before the actual program is compiled, and each one occupies a line to itself, beginning with a hash mark (#).
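As an illustration, the two hypothetical functions below consist of the same sequence of tokens apart from their names and differ only in layout. The macro OFFSET is likewise an assumption of this sketch; it shows a directive occupying its own line.

```c
#define OFFSET 1   /* a preprocessing directive occupies a line of its own */

/* Compact layout: tokens are separated only where necessary. */
int add_compact(int a,int b){return a+b+OFFSET;}

/* Generous layout: the same tokens spread over several lines. */
int
add_spread ( int a ,
             int b )
{
    return a
         + b
         + OFFSET ;
}
```

Both functions compile to the same computation, since only the order of the tokens matters.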
Comments are any strings enclosed either between /* and */, or between // and the end of the line. In the preliminary phases of translation, before any object code is generated, each comment is replaced by one space. Then the preprocessing directives are executed.
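A small sketch of both comment styles follows; the function name answer() is hypothetical. Because the comment between int and answer is replaced by one space, the two tokens remain separate.

```c
/* After comment replacement, the compiler sees "int answer(void)". */
int/* return type */answer(void)
{
    return 40 + 2;   // a line comment ends at the end of the line
}
```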
ANSI C defines two character sets. The first is the source character set, which is the set of characters that may be used in a source file. The second is the execution character set, which consists of all the characters that are interpreted during the execution of the program, such as the characters in a string constant.
Each of these character sets contains a basic character set, which includes the following:
· The 52 upper- and lower-case letters of the Latin alphabet:
A B C D E F G H I J K L M N O P Q R S T U V W X Y Z
a b c d e f g h i j k l m n o p q r s t u v w x y z
· The ten decimal digits (where the value of each character after 0 is one greater than the previous digit):
0 1 2 3 4 5 6 7 8 9
· The following 29 graphic characters:
! " # % & ' ( ) * + , - . / : ;
< = > ? [ \ ] ^ _ { | } ~
· The five whitespace characters:
space, horizontal tab, vertical tab, newline, form feed
In addition, the basic execution character set contains the following:
· The null character \0, which terminates a character string
· The control characters represented by simple escape sequences, shown in Table 1-1, for controlling output devices such as terminals or printers
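Two of these guarantees are commonly relied on in practice: the consecutive values of the decimal digits and the terminating null character. The helper functions digit_value() and text_length() below are hypothetical and serve only as a sketch.

```c
#include <stddef.h>

/* Returns 0 to 9 for a decimal digit character, or -1 otherwise.
   Relies on the guarantee that '0' through '9' have consecutive values. */
int digit_value(char c)
{
    return (c >= '0' && c <= '9') ? c - '0' : -1;
}

/* Counts the characters up to, but not including,
   the null character that terminates the string. */
size_t text_length(const char *s)
{
    size_t n = 0;

    while (s[n] != '\0')
        ++n;
    return n;
}
```

Note that no comparable guarantee exists for letters: the values of 'a' through 'z' need not be consecutive in every execution character set.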
Any other characters, depending on the given compiler, can be used in comments, strings, and character constants. These may include the dollar sign or diacriticals, for example. However, the use of such characters may affect portability.
The set of all usable characters is called the extended character set, which is always a superset of the basic character set.
Certain languages use characters that require more than one byte. These multibyte characters may be included in the extended character set. Furthermore, ANSI C provides the integer type wchar_t (wide character type), which is large enough to represent any character in the extended character set. The Unicode character encoding is often used; it extends the standard ASCII code to represent well over 100,000 characters from a wide range of writing systems.
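A wide string literal is written with the prefix L and consists of elements of type wchar_t. The sketch below uses the standard function wcslen() from wchar.h; the wrapper name wide_length() is hypothetical.

```c
#include <stddef.h>
#include <wchar.h>

/* Returns the number of wide characters in ws,
   not counting the terminating null wide character. */
size_t wide_length(const wchar_t *ws)
{
    return wcslen(ws);
}
```

For example, wide_length(L"C99") yields 3, because each element of the wide string holds exactly one character, regardless of how many bytes wchar_t occupies.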
ANSI C also provides trigraph sequences. These sequences, shown in Table 1-2, can be used to input graphic characters that are not available on all keyboards. The sequence ??!, for example, can be entered to represent the "pipe" character |. Trigraphs have been part of the language since the first ANSI standard; they were removed again in C23.
Table 1-2. The trigraph sequences
Trigraph   Meaning
??=        #
??(        [
??/        \
??)        ]
??'        ^
??<        {
??!        |
??>        }
??-        ~
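Because trigraphs are replaced even inside string literals, a string containing two consecutive question marks can change unexpectedly on a compiler that has trigraph processing enabled. The standard escape sequence \? avoids this. The function name surprised() in the following sketch is hypothetical.

```c
/* With trigraph processing enabled, "What??!" would become "What|".
   Writing the second question mark as \? keeps the text intact
   whether or not the compiler replaces trigraphs. */
const char *surprised(void)
{
    return "What?\?!";
}
```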
Identifiers are the names of variables, functions, macros, types, and so on. Identifiers are subject to the following rules:
· An identifier consists of a sequence of letters (A to Z, a to z), digits (0 to 9), and underscores (_).
· The first character of an identifier must not be a digit.
· Identifiers are case-sensitive.
· There is no restriction on the length of an identifier. However, a compiler is only required to treat the first 31 characters as significant (63 in C99).
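Keywords are reserved and must not be used as identifiers. The keywords of C99 are:
auto break case char const continue default do double else
enum extern float for goto if inline int long register
restrict return short signed sizeof static struct switch typedef union
unsigned void volatile while _Bool _Complex _Imaginary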
External names (that is, identifiers of externally linked functions and variables) may be subject to other restrictions, depending on the linker. In portable C programs, external names should be chosen so that they are unique within their first eight characters, even if the linker does not distinguish between upper and lower case.
Some examples of identifiers are:
Valid: a, DM, dm, FLOAT, _var1, topOfWindow
Invalid: do, 586_cpu, zähler, nl-flag, US_$
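The valid names listed above can be declared as in the following sketch; the function identifier_sum() and the values are hypothetical. The variables are declared inside a function because names beginning with an underscore are reserved at file scope.

```c
int identifier_sum(void)
{
    int a = 0;
    int dm = 1;            /* dm and DM are two different identifiers */
    int DM = 2;
    int _var1 = 3;         /* a leading underscore is allowed */
    int topOfWindow = 4;
    double FLOAT = 5.0;    /* FLOAT is not the keyword float */

    return a + dm + DM + _var1 + topOfWindow + (int)FLOAT;
}
```

Each name from the invalid list would be rejected by the compiler: do is a keyword, 586_cpu begins with a digit, and the remaining names contain characters that are not permitted in identifiers by the rules above.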
Each identifier belongs to exactly one of the following four categories:
· Label names
· The tags of structures, unions, and enumerations. These are identifiers that follow one of the keywords struct, union, or enum (see Section 1.10).
· Names of structure or union members. Each structure or union type has a separate name space for its members.
· All other identifiers, called ordinary identifiers.
Identifiers of different categories may be identical. For example, a label name may also be used as a function name. Such re-use occurs most often with structures: the same string can be used to identify a structure type, one of its members, and a variable; for example:
struct person {char *person; /*...*/} person;
The same names can also be used for members of different structures.
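The following sketch uses the name person in all four categories at once. The structure company, the function name_of(), and the sample data are hypothetical.

```c
struct person  { const char *person; };   /* tag and member share a name */
struct company { const char *person; };   /* same member name, other structure */

const char *name_of(void)
{
    struct person person = { "Ada" };     /* ordinary identifier */

    goto person;                          /* here person is a label name */
person:
    return person.person;                 /* variable, then member */
}
```

The compiler determines the category from the context: after struct it expects a tag, after goto a label, and after the dot operator a member name. Code written this way is legal but hard to read, so such re-use is best kept to a minimum.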
Each identifier in the source code has a scope. The scope is that portion of the program in which the identifier can be used. The four possible scopes are:
Function prototype
Identifiers in the list of parameter declarations of a function prototype (not a function definition) have function prototype scope. Because these identifiers have no meaning outside the prototype itself, they are little more than comments.
Function
Only label names have function scope. Their use is limited to the function block in which the label is defined. Label names must also be unique within the function. The goto statement causes a jump to a labelled statement within the same function.
Block
Identifiers declared in a block that are not labels have block scope. The parameters in a function definition also have block scope. Block scope begins with the declaration of the identifier and ends with the closing brace (}) of the block.
File
Identifiers declared outside all blocks and parameter lists have file scope. File scope begins with the declaration of the identifier and extends to the end of the source file.
An identifier that is not a label name is not necessarily visible throughout its scope. If an identifier with the same category as an existing identifier is declared in a nested block, for example, the outer declaration is temporarily hidden. The outer declaration becomes visible again when the scope of the inner declaration ends.
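The four scopes and the hiding of an outer declaration can be seen together in the following sketch. All names in it (limit, area(), scope_demo(), done) are hypothetical.

```c
int limit = 10;                     /* file scope */

int area(int width, int height);    /* prototype scope: names serve as comments */

int area(int w, int h)              /* the parameters w and h have block scope */
{
    return w * h;
}

int scope_demo(void)
{
    int total = limit;              /* uses the file-scope limit: 10 */
    {
        int limit = 3;              /* hides the file-scope limit */
        total += limit;             /* adds 3, giving 13 */
        if (total > 0)
            goto done;              /* the label is usable in the whole function */
        total = -1;                 /* skipped in this example */
    }
done:
    total += limit;                 /* the outer limit is visible again: 23 */
    return total;
}
```

The parameter names in the prototype of area() differ from those in its definition without causing a conflict, since the prototype's names cease to exist at the end of the prototype.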