15.4 Element Grammar
The grammar of human language is rich with a
variety of sentence structures, verb tenses, and all sorts of
irregular constructs and exceptions to the rules. Nonetheless, you
mastered most of it by the age of three. Computer language grammars
typically are simple, regular, and have few exceptions. In fact,
computer grammars use only four rules to define how elements of a
language may be arranged: sequence, choice, grouping, and repetition.
15.4.1 Sequence, Choice, Grouping, and Repetition
Sequence rules define the exact order in which
elements appear in a language. For instance, if a
sequence grammar rule states that
element A is followed by B and then by C, your document must provide
elements A, B, and C in that exact order. A missing element (A and C,
but no B, for example), an extra element (A, B, E, then C), or an
element out of place (C, A, then B) violates the rule and does not
match the grammar.
In many grammars, XML included, sequences are defined by simply
listing the appropriate elements, in order and separated by commas.
Accordingly, our example sequence in the DTD would appear simply as
A, B, C.
Choice grammar rules provide flexibility by
letting the document author choose one element from among a group of
valid elements. For example, a choice rule might state that you may choose
elements D, E, or F; any one of these three elements would satisfy
the grammar. Like many other grammars, XML denotes choice rules by
listing the appropriate choices separated by a vertical bar
(|). Thus, our simple choice would be written in
the DTD as D | E | F. If you read the vertical bar as the word
or, choice rules become easy to understand.
Grouping rules collect two or more rules
into a single rule, building richer, more usable languages. For
example, a grouping rule might allow a sequence of elements, followed
by a choice, followed by a sequence. You can indicate groups within a
rule by enclosing them in parentheses in the DTD. For example:
Document ::= A, B, C, (D | E | F), G
requires that a document begin with elements A, B, and C, followed by
a choice of one element out of D, E, or F, followed by element G.
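One way to get a feel for this rule is to transcribe it into a regular expression, treating each element as a single letter. The Python sketch below is an illustration only: the names GROUPED_RULE and matches_grouped_rule are invented for this example, and a real XML processor compares element names rather than letters.

```python
import re

# Document ::= A, B, C, (D | E | F), G
# Sequence becomes concatenation, choice becomes |, grouping stays in parentheses.
GROUPED_RULE = re.compile(r"ABC(D|E|F)G")

def matches_grouped_rule(document: str) -> bool:
    """Return True if the whole string of one-letter elements obeys the rule."""
    return GROUPED_RULE.fullmatch(document) is not None

# Two documents that satisfy the rule, then three that violate it
# (missing choice, extra choice, element out of place).
for candidate in ["ABCDG", "ABCFG", "ABCG", "ABCDEG", "CABDG"]:
    print(candidate, matches_grouped_rule(candidate))
```

Using fullmatch rather than search matters here: the rule describes the entire document, so leftover elements before or after a matching stretch must count as errors.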
Repetition rules let you repeat one or
more elements some number of times. With XML, as with many other
languages, repetition is denoted by appending a special character
suffix to an element or group within a rule. Without the special
character, that element or group must appear exactly once in the
rule. Special characters include the plus sign
(+), meaning that the element may appear one or
more times in the document; the asterisk (*),
meaning that the element may appear zero or more times; and the
question mark (?), meaning that the element may
appear either zero or one time.
For example, the rule:
Document ::= A, B?, C*, (D | E | F)+, G*
creates an unlimited number of correct documents with the elements A
through G. According to the rule, each document must begin with A,
optionally followed by B, followed by zero or more occurrences of C,
followed by at least one, but perhaps more, of either D, E, or F,
followed by zero or more Gs. All of these documents (and many
others!) match this rule:
ABCDG
ACCCFFGGG
ACDFDFGG
You might want to work through these examples to prove to yourself
that they are, in fact, correct with respect to the repetition rule.
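If you would rather let a program do the checking, the repetition suffixes carry the same meaning in regular expressions, so the rule can be transcribed almost character for character. This is a minimal sketch with one-letter element names; REPETITION_RULE and matches_repetition_rule are hypothetical names chosen for the example.

```python
import re

# Document ::= A, B?, C*, (D | E | F)+, G*
# The ?, *, and + suffixes mean the same thing in a regular expression.
REPETITION_RULE = re.compile(r"AB?C*(D|E|F)+G*")

def matches_repetition_rule(document: str) -> bool:
    """Return True if the whole string of one-letter elements obeys the rule."""
    return REPETITION_RULE.fullmatch(document) is not None

# The three sample documents from the text, followed by two that break
# the rule: B appears twice, and no D, E, or F appears at all.
for candidate in ["ABCDG", "ACCCFFGGG", "ACDFDFGG", "ABBD", "ACG"]:
    print(candidate, matches_repetition_rule(candidate))
```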
15.4.2 Multiple Grammar Rules
By now you can probably imagine that specifying an entire language
grammar in a single rule is possible, but difficult. The result would
be one long, nearly unintelligible rule crowded with nested groups.
To remedy this situation, the items in a
rule may themselves be rules containing other elements and rules. In
these cases, the items in a grammar that are themselves rules are
known as
nonterminals,
while the items that are elements in the language are known as
terminals. Eventually, all the nonterminals must
reference rules that create sequences of terminals, or the grammar
would never produce a valid document.
For example, we can express our sample grammar in two rules:
Document ::= A, B?, C*, Choices+, G*
Choices ::= D | E | F
In this example, Document and Choices are nonterminals, while A, B,
C, D, E, F, and G are terminals.
There is no requirement in XML (or most other grammars) that dictates
or limits the number of nonterminals in your grammar. Most grammars
use nonterminals wherever it makes sense for clarity and ease of use.
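To see that the two-rule grammar describes the same documents as the single rule, you can substitute each nonterminal with the body of its rule until only terminals remain. The following Python sketch does this and produces a regular expression; the RULES table and the to_regex function are assumptions made for the illustration, and the approach would loop forever on a grammar whose rules refer back to themselves.

```python
import re

# Each key is a nonterminal; names that are not keys (A through G) are terminals.
RULES = {
    "Document": "A, B?, C*, Choices+, G*",
    "Choices": "D | E | F",
}

def to_regex(rule_name: str) -> str:
    """Expand a rule into a regex, substituting nonterminals recursively."""
    def expand(match: re.Match) -> str:
        name = match.group(0)
        if name in RULES:
            # A nonterminal: replace it with its own rule, kept as a group.
            return "(" + to_regex(name) + ")"
        return name  # A terminal: leave it alone.

    body = re.sub(r"[A-Za-z]+", expand, RULES[rule_name])
    # Commas and spaces only separate items in the rule notation.
    return body.replace(",", "").replace(" ", "")

DOCUMENT_REGEX = to_regex("Document")
print(DOCUMENT_REGEX)
print(re.fullmatch(DOCUMENT_REGEX, "ACDFDFGG") is not None)
```

The expanded pattern is the single-rule version of the grammar, which is why splitting a rule into nonterminals changes its readability but not the set of documents it accepts.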
15.4.3 XML Element Grammar
The rules for
defining the contents of an element match the grammar rules we just
discussed. You may use sequences, choices, groups, and repetition to
define the allowable contents of an element. The nonterminals in
rules must be names of other elements defined in your DTD.
A few examples show how this works. Consider the declaration of the
<html> tag, taken from the HTML DTD:
<!ELEMENT html (head, body)>
This defines the element named html whose content
is a head element followed by a
body element. Notice that you do not enclose the
element names in angle brackets within the DTD; that notation is used
only when the elements are actually used in a document.
Within the HTML DTD, you can find the declaration of the
<head> tag:
<!ELEMENT head (%head.misc;,
((title, %head.misc;, (base, %head.misc;)?) |
(base, %head.misc;, (title, %head.misc;))))>
Gulp. What on earth does this mean? First, notice that a parameter
entity named head.misc is used several times in
this declaration. Let's go get it:
<!ENTITY % head.misc "(script|style|meta|link|object)*">
Now things are starting to make sense: head.misc
defines a group of elements, from which you may choose one. However,
the trailing asterisk indicates that you may include zero or more of
these elements. The net result is that anywhere
%head.misc; appears, you can include zero or more
script, style,
meta, link, or
object elements, in any order. Sound familiar?
Returning to the head declaration, we see that we
are allowed to begin with any number of the head
miscellaneous elements. We must then make a choice: either a group
consisting of a title element, optional
miscellaneous items, and an optional base element
followed by miscellaneous items; or a group consisting of a
base element, miscellaneous items, a
title element, and some more miscellaneous items.
Why such a convoluted rule for the <head>
tag? Why not just write:
<!ELEMENT head (script|style|meta|link|object|base|title)*>
which allows any number of head elements to
appear, or none at all? The HTML standard requires that every
<head> tag contain exactly one
<title> tag. It also allows for only one
<base> tag, if any. Otherwise, the standard
does allow any number of the other head elements,
in any order.
Put simply, the head element declaration, while
initially confusing, forces the XML processor to ensure that exactly
one title element appears in the
head element and that, if specified, just one
base element appears as well. It then allows for
any of the other head elements, in any order.
This one example demonstrates a lot of the power of XML: the ability
to define commonly used elements using parameter entities and the use
of grammar rules to dictate document syntax. If you can work through
the head element declaration and understand it,
you are well on your way to reading any XML DTD.
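The effect of the head declaration can be tested by expanding %head.misc; and turning the content model into a regular expression over element names. This is only a sketch of the idea, not how a validating XML processor is implemented; model_to_regex and head_is_valid are hypothetical helpers, and the declaration text is copied from the listing above.

```python
import re

HEAD_MISC = "(script|style|meta|link|object)*"
HEAD_MODEL = """(%head.misc;,
    ((title, %head.misc;, (base, %head.misc;)?) |
     (base, %head.misc;, (title, %head.misc;))))"""

def model_to_regex(model: str) -> re.Pattern:
    """Turn the content model into a regex over space-terminated element names."""
    expanded = model.replace("%head.misc;", HEAD_MISC)
    # Commas and whitespace are separators only; drop them first.
    compact = re.sub(r"[,\s]", "", expanded)
    # Each element name becomes a token that must be followed by one space.
    return re.compile(re.sub(r"[a-z]+", lambda m: "(?:" + m.group(0) + " )", compact))

HEAD_REGEX = model_to_regex(HEAD_MODEL)

def head_is_valid(children: list) -> bool:
    """Check a list of child element names against the head content model."""
    text = "".join(name + " " for name in children)
    return HEAD_REGEX.fullmatch(text) is not None

print(head_is_valid(["meta", "title", "link", "base", "style"]))  # one title, one base
print(head_is_valid(["meta", "link"]))                             # no title
print(head_is_valid(["title", "title"]))                           # two titles
```

Trying the simpler declaration with the single starred group in place of HEAD_MODEL shows the difference: it would accept all three of these child lists, including the ones with no title or two titles.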
15.4.4 Mixed Element Content
Mixed element content extends the
element grammar rules to include the special
#PCDATA keyword. PCDATA stands for "parsed
character data" and signifies that the content of
the element will be parsed by the XML processor for general entity
references. After the entities are replaced, the character data is
passed to the XML application for further processing.
What this boils down to is that parsed character data is the actual
content of your XML document. Elements that accept parsed character
data may contain plain ol' text, plus whatever other
tags you allow, as defined in the DTD.
For instance:
<!ELEMENT title (#PCDATA)>
means that the title element may contain only text
with entities. No other tags are allowed, just as in the HTML
standard.
A more complex example is the <p> tag, whose
element declaration is:
<!ELEMENT p %Inline;>
Another parameter entity! The %Inline; entity is
defined in the HTML DTD as:
<!ENTITY % Inline "(#PCDATA | %inline; | %misc;)*">
which, in turn, draws on these other parameter entity declarations:
<!ENTITY % special "br | span | bdo | object | img | map">
<!ENTITY % fontstyle "tt | i | b | big | small">
<!ENTITY % phrase "em | strong | dfn | code | q | sub | sup | samp | kbd |
var | cite | abbr | acronym">
<!ENTITY % inline.forms "input | select | textarea | label | button">
<!ENTITY % misc "ins | del | script | noscript">
<!ENTITY % inline "a | %special; | %fontstyle; | %phrase; | %inline.forms;">
What do we make of all this? The %Inline; entity
defines the contents of the p element as parsed
character data, plus any of the elements defined by
%inline; and any defined by
%misc;. Notice that case does matter:
%Inline; is different from
%inline;.
The %inline; entity includes lots of stuff:
special elements, font-style elements, phrase elements, and inline
form elements. %misc; includes the
ins, del,
script, and noscript elements.
You can read the HTML DTD for the other entity declarations to see
which elements are also allowed as the contents of a
p element.
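Expanding nested parameter entities by hand is tedious, so here is a short Python sketch that does the substitution for the declarations quoted above and lists everything the p element may contain. The ENTITIES table simply restates those declarations, and the expand function is a hypothetical helper; the real HTML DTD contains more entities than the ones shown here.

```python
import re

# Parameter entities as quoted in the text; names are case-sensitive.
ENTITIES = {
    "special": "br | span | bdo | object | img | map",
    "fontstyle": "tt | i | b | big | small",
    "phrase": ("em | strong | dfn | code | q | sub | sup | samp | kbd | "
               "var | cite | abbr | acronym"),
    "inline.forms": "input | select | textarea | label | button",
    "misc": "ins | del | script | noscript",
    "inline": "a | %special; | %fontstyle; | %phrase; | %inline.forms;",
    "Inline": "(#PCDATA | %inline; | %misc;)*",
}

def expand(text: str) -> str:
    """Replace %name; references repeatedly until none remain."""
    reference = re.compile(r"%([A-Za-z.]+);")
    while reference.search(text):
        text = reference.sub(lambda m: ENTITIES[m.group(1)], text)
    return text

P_MODEL = expand("%Inline;")
# Strip the surrounding "(" and ")*", then split the choice list on the bars.
P_CHILDREN = [name.strip() for name in P_MODEL.strip("()*").split("|")]

print(P_MODEL)
print(len(P_CHILDREN), "alternatives, starting with", P_CHILDREN[0])
```

Because the dictionary keys are case-sensitive, "Inline" and "inline" stay distinct, mirroring the way the DTD treats %Inline; and %inline; as different entities.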
Why did the HTML DTD authors break up all these elements into
separate groups? If they were simply defining elements to be included
in the p element, they could have built a single
long list. However, HTML has rules that govern where inline elements
may appear in a document. The authors grouped elements that are
treated similarly into separate entities that could be referenced
several times in the DTD. This makes the DTD easier to read and
understand, as well as easier to maintain when a change is needed.
15.4.5 Empty Elements
Elements whose content is defined to be empty deserve a special
mention. XML introduced notational rules for empty elements,
different from the traditional HTML rules that govern them.
HTML authors are used to specifying an empty element as a single tag,
like <br> or <img>.
XML requires that every element have an opening and a closing tag, so
an image tag would be written as
<img></img>, with no embedded content.
Other empty elements would be written in a similar manner.
Since this format works well for non-empty tags but is a bit of
overkill for empty ones, you can use a special shorthand notation for
empty tags. To write an empty tag in XML, just place a slash
(/) immediately before the closing angle bracket
of the tag. Thus, a line break may be written as
<br/> and an image tag might be specified as
<img src="myimage.gif"/>.
Notice that the attributes of the empty element, if any, appear
before the closing slash and bracket.
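You can confirm that the two notations are equivalent with any XML parser. The sketch below uses Python's standard xml.etree.ElementTree module; the describe and is_well_formed helpers are names made up for this example.

```python
import xml.etree.ElementTree as ET

# The same empty element written with a closing tag and with the shorthand.
long_form = ET.fromstring('<img src="myimage.gif"></img>')
short_form = ET.fromstring('<img src="myimage.gif"/>')

def describe(element: ET.Element) -> tuple:
    """Summarize an element: its name, attributes, text, and number of children."""
    return (element.tag, dict(element.attrib), element.text, len(element))

def is_well_formed(markup: str) -> bool:
    """Return True if the parser accepts the markup as well-formed XML."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False

print(describe(long_form) == describe(short_form))
print(is_well_formed("<p>line one<br/>line two</p>"))  # XML shorthand
print(is_well_formed("<p>line one<br>line two</p>"))   # HTML-style single tag
```

The parser sees both forms of the img element as the same thing, while the HTML-style br with no slash and no closing tag is rejected because the p element's closing tag arrives before br has been closed.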