Free-Form XML and Well-Formed Documents (Perl and XML)

2.9. Free-Form XML and Well-Formed Documents

XML's grandfather, SGML, required that every element and attribute be documented thoroughly with a long list of declarations in the DTD. We'll describe what we mean by that thorough documentation in the next section, but for now, imagine it as a blueprint for a document. This blueprint adds considerable overhead to the processing of a document and was a serious obstacle to SGML's adoption as a markup language for the Internet. HTML, which was originally developed as an SGML application, was hobbled by this enforced structure, since any "valid" HTML document had to conform to the HTML DTD. Hence, extending the language was impossible without approval by a web committee.

XML does away with that requirement by allowing a special condition called free-form XML. In this mode, a document has to follow only minimal syntax rules to be acceptable. If it follows those rules, the document is well-formed. Working in this mode is wonderfully liberating for a developer because you don't have to consult a DTD every time you want to process a piece of XML. All a processor has to do is check that the minimal syntax rules are followed.
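
As a rough illustration of how little a processor needs in this mode, here is a minimal sketch in Python, using only its standard library. The function name is_well_formed and the sample memo markup are invented for this example. The check simply hands the text to a parser and reports whether it got through; no DTD is read at any point.

```python
import xml.etree.ElementTree as ET


def is_well_formed(text: str) -> bool:
    """Return True if `text` parses as XML. No DTD is consulted."""
    try:
        ET.fromstring(text)
    except ET.ParseError:
        # The parser gave up: some minimal syntax rule was broken.
        return False
    return True


# Hypothetical sample documents, made up for illustration.
print(is_well_formed("<memo><to>Ann</to><body>Lunch?</body></memo>"))  # True
print(is_well_formed("<memo><to>Ann</memo></to>"))                     # False
```

The second document fails only because its end tags are in the wrong order, not because its element names are unknown to anyone.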

In free-form XML, you can choose the name of any element. It doesn't have to belong to a sanctioned vocabulary, as is the case with HTML. Letting arbitrary markup into your program is a risk, but as long as you know what you're doing, it's okay. If you can't trust the markup to fit the pattern you're looking for, then you need to use element and attribute declarations, as we describe in the next section.
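
To make this concrete, the sketch below (Python standard library again; the recipe vocabulary is made up for the example) parses a document whose element names belong to no standard, then lists the names it found. The final lines show one hand-rolled way to guard against unexpected markup. That guard is only a stand-in for real declarations, not a replacement for them.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: none of these element names come from a standard.
doc = """<recipe serves="2">
  <ingredient amount="3">eggs</ingredient>
  <ingredient amount="1">pinch of salt</ingredient>
  <step>Whisk, then fry.</step>
</recipe>"""

# Parses without any DTD or declared vocabulary.
root = ET.fromstring(doc)

# Collect every element name that actually appears (root included).
names = sorted({elem.tag for elem in root.iter()})
print(names)  # ['ingredient', 'recipe', 'step']

# A hand-rolled sanity check on the names the program expects to see.
expected = {"recipe", "ingredient", "step"}
unexpected = set(names) - expected
print(unexpected or "nothing unexpected")
```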

What are these rules? Here's a short list as seen through a coarse-grained spyglass:

A document can have only one top-level element, the document element, that contains all the other elements and data. This element does not include the XML declaration and document type declaration, which must precede it.
Every element with content must have both a start tag and an end tag.
Element and attribute names are case sensitive, and only certain characters can be used (letters, digits, underscores, hyphens, and periods), with only letters and underscores eligible as the first character. Colons are allowed, but only as part of a declared namespace prefix.
All attributes must have values and all attribute values must be quoted.
Elements may never overlap; an element's start and end tags must both appear within the same parent element.
Certain characters, including angle brackets (< >) and the ampersand (&), are reserved for markup and are not allowed in parsed content. Use entity references such as &lt; and &amp; instead, or just wrap the offending content in a CDATA section.
Empty elements must use a syntax distinguishing them from nonempty element start tags. The syntax requires a slash (/) before the closing bracket (>) of the tag, as in <br/>.
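
The rules above can be tried out one at a time. The following sketch, in Python with only the standard library, feeds a parser one small document per violation and then a few repaired versions. The helper name is_well_formed, the case labels, and the sample snippets are all invented for this illustration.

```python
import xml.etree.ElementTree as ET


def is_well_formed(text):
    """Return True if the parser accepts `text`, False otherwise."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


# One deliberately broken document per rule (hypothetical samples).
broken = {
    "two top-level elements": "<a/><b/>",
    "missing end tag": "<a><b>text</a>",
    "case mismatch in tags": "<a></A>",
    "name starts with a digit": "<1a/>",
    "unquoted attribute value": "<a id=1/>",
    "attribute without a value": "<a checked/>",
    "overlapping elements": "<a><b><c></b></c></a>",
    "bare ampersand": "<a>fish & chips</a>",
    "bare left angle bracket": "<a>1 < 2</a>",
}

# Repaired counterparts for a few of the cases above.
repaired = {
    "entity reference": "<a>fish &amp; chips</a>",
    "CDATA section": "<a><![CDATA[1 < 2 & 3]]></a>",
    "quoted attribute": '<a id="1"/>',
    "empty-element tag": "<a><br/></a>",
}

for label, text in {**broken, **repaired}.items():
    verdict = "ok" if is_well_formed(text) else "rejected"
    print(f"{label:28} {verdict}")
```

Every entry in broken should be rejected and every entry in repaired accepted, which shows that the parser enforces syntax only and has no opinion about the names a, b, or br.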

You will encounter more rules, so for a more complete understanding of well-formedness, you should either read an introductory book on XML or look at the W3C's official recommendation at http://www.w3.org/XML.

If you want to be able to process your document with XML-using programs, make sure it is always well formed. (After all, there's no such thing as non-well-formed XML.) A tool often used to check this status is called a well-formedness checker, which is a type of XML parser that reports errors to the user. Often, such a tool can be detailed in its analysis and give you the exact line number in a file where the problem occurs. We'll discuss checkers and parsers in Chapter 3, "XML Basics: Reading and Writing".
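
A bare-bones checker of this kind might look like the sketch below. It is written in Python against the standard library's Expat binding, not with the Perl tools covered in Chapter 3, and the function name check and the sample document are assumptions made for the example. It returns nothing for a well-formed document and otherwise reports where the parser stopped.

```python
import xml.parsers.expat


def check(text):
    """Return None if `text` is well-formed, else (line, column, message)."""
    parser = xml.parsers.expat.ParserCreate()
    try:
        parser.Parse(text, True)  # True marks this as the final chunk
    except xml.parsers.expat.ExpatError as err:
        message = xml.parsers.expat.ErrorString(err.code)
        # err.lineno counts from 1; err.offset is a column counted from 0.
        return (err.lineno, err.offset, message)
    return None


# Hypothetical document with a misspelled end tag on its third line.
doc = "<list>\n  <item>one</item>\n  <item>two</itm>\n</list>\n"

problem = check(doc)
if problem is None:
    print("well-formed")
else:
    line, column, message = problem
    print(f"line {line}, column {column}: {message}")
```

For the sample above, the report points at line 3, where the end tag does not match its start tag. A parser stops at the first error it meets, so a checker like this one finds problems one at a time.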

