5.1 Overview
This chapter describes a number of modules that are used to parse
different file formats.
5.1.1 Markup Languages
Python comes with extensive support for the Extensible
Markup Language (XML) and Hypertext Markup
Language (HTML) file formats. Python also provides basic
support for Standard Generalized Markup
Language (SGML).
All these formats share the same basic structure because both HTML and XML are derived from SGML. Each
document contains a mix of start tags,
end tags, plain text (also called character
data), and entity references, as shown in the following:
<document name="sample.xml">
<header>This is a header</header>
<body>This is the body text. The text can contain
plain text (&quot;character data&quot;), tags, and
entities.
</body>
</document>
In the previous example, <document>,
<header>, and <body>
are start tags. For each start tag, there's a corresponding end tag
that looks similar, but has a slash before the tag name. The start
tag can also contain one or more attributes, like
the name attribute in this example.
Everything between a start tag and its matching end tag is called an
element. In the previous example, the
document element contains two other elements:
header and body.
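The building blocks described above can be made visible with a few lines of code. The following is a minimal sketch, assuming a current Python 3 interpreter and the xml.parsers.expat module; the collect_events helper and the SAMPLE string are made up for this illustration. It feeds the sample document to the parser and records every start tag (with its attributes), every end tag, and the character data in between.

```python
import xml.parsers.expat

# The sample document from the text, with the quotation marks
# written as &quot; entity references.
SAMPLE = """<document name="sample.xml">
<header>This is a header</header>
<body>This is the body text. The text can contain
plain text (&quot;character data&quot;), tags, and
entities.
</body>
</document>"""


def collect_events(source):
    """Return (start_tags, end_tags, text) for an XML string.

    Hypothetical helper: start_tags is a list of (name, attributes)
    pairs, end_tags is a list of names, and text is all character
    data joined into one string.
    """
    start_tags, end_tags, chunks = [], [], []
    parser = xml.parsers.expat.ParserCreate()
    parser.StartElementHandler = (
        lambda name, attrs: start_tags.append((name, attrs))
    )
    parser.EndElementHandler = end_tags.append
    # Character data may arrive in several pieces, so collect the
    # pieces and join them afterwards.
    parser.CharacterDataHandler = chunks.append
    parser.Parse(source, True)
    return start_tags, end_tags, "".join(chunks)


start_tags, end_tags, text = collect_events(SAMPLE)
```

Note that the end tags are reported from the inside out (header, body, document), and that the parser hands over the character data with the entity references already replaced by the characters they stand for.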
Finally, &quot; is a character entity. It is
used to represent reserved characters in the text sections. In this
case, it stands for a double quotation mark ("). Other common entities include
&amp; for the ampersand (&), which is used to
start the entity itself,
&lt; for
"less than"
(<), and &gt; for
"greater than" (>).
While XML, HTML, and SGML all share the same building blocks, there
are important differences between them. In XML, all elements must
have both start tags and end tags, and the tags must be properly
nested (if they are, the document is said to be
well-formed). In addition, XML is
case-sensitive, so <document> and
<Document> are two different element types.
HTML, in contrast, is much more flexible. The HTML parser can often
fill in missing tags; for example, if you open a new paragraph in HTML
using the <P> tag without closing the
previous paragraph, the parser automatically adds a
</P> end tag. HTML is also case-insensitive.
On the other hand, XML allows you to define your own elements, while
HTML uses a fixed element set, as defined by the HTML specifications.
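The difference in strictness can be sketched in code. The example below assumes a current Python 3 interpreter, and uses xml.parsers.expat for the XML side and html.parser as a stand-in for an HTML parser; is_well_formed_xml, TagCollector, and collect_html_tags are hypothetical names introduced for this illustration. Also note that html.parser only tolerates the missing end tag; unlike a browser, it doesn't add one for you.

```python
import xml.parsers.expat
from html.parser import HTMLParser


def is_well_formed_xml(source):
    """Return True if expat accepts the string as well-formed XML."""
    parser = xml.parsers.expat.ParserCreate()
    try:
        parser.Parse(source, True)
    except xml.parsers.expat.ExpatError:
        # Missing end tags, bad nesting, and mismatched case all end up here.
        return False
    return True


class TagCollector(HTMLParser):
    """Records tag names in the order the HTML parser reports them."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.start_tags = []
        self.end_tags = []

    def handle_starttag(self, tag, attrs):
        self.start_tags.append(tag)

    def handle_endtag(self, tag):
        self.end_tags.append(tag)


def collect_html_tags(source):
    """Return (start_tags, end_tags) found in an HTML string."""
    parser = TagCollector()
    parser.feed(source)
    parser.close()
    return parser.start_tags, parser.end_tags


# Two paragraphs, the first one never closed.
FRAGMENT = "<P>First paragraph<P>Second paragraph</P>"

xml_ok = is_well_formed_xml(FRAGMENT)
html_starts, html_ends = collect_html_tags(FRAGMENT)
```

The XML parser rejects the fragment, while the HTML parser reports two start tags and one end tag, all converted to lowercase.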
SGML is even more flexible. In its full incarnation, you can use a
custom declaration to define how to translate the
source text into an element structure, and a document type
definition (DTD) to validate the structure and fill in
missing tags. Technically, both HTML and XML are SGML
applications; they both have their own SGML declaration,
and HTML also has a standard DTD.
Python comes with parsers for all three markup flavors. While SGML is the
most flexible of the formats, Python's sgmllib parser is actually
pretty simple. It sidesteps most of SGML's complexity by understanding only
enough of the standard to be able to deal with HTML. It doesn't
handle DTDs either; instead, you can customize
the parser via subclassing.
Python's HTML support is built on the SGML parser. The
htmllib
parser delegates the actual rendering to a formatter object. The
formatter
module contains a couple of standard formatters.
Python's XML support is the most complex. In Python 1.5.2, the built-in
support was limited to the xmllib parser, which is
pretty similar to the sgmllib module (with one
important difference: xmllib actually tries to
support the entire XML standard). Python 2.0 comes with more advanced XML tools, based on the optional
expat parser.
5.1.2 Configuration Files
The ConfigParser module
reads and writes a simple configuration file format, similar to
Windows INI files.
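Here's a minimal sketch of reading and writing such a file. It assumes a current Python 3 interpreter, where the module is spelled configparser (all lowercase); the section names, option names, and values are made up for this illustration.

```python
import configparser
import io

# A made-up configuration in INI style: [section] headers followed
# by "name = value" lines.
SAMPLE = """\
[server]
host = localhost
port = 8080

[paths]
logdir = /tmp/logs
"""

config = configparser.ConfigParser()
config.read_string(SAMPLE)

sections = config.sections()
host = config.get("server", "host")
port = config.getint("server", "port")  # converted to an integer

# Change a value and write the configuration back out, here to an
# in-memory buffer instead of a real file.
config.set("server", "port", "9090")
output = io.StringIO()
config.write(output)
written = output.getvalue()
```

All values are stored as strings; methods like getint convert them on the way out.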
The netrc module
reads .netrc configuration files, and the
shlex module
can be used to read any configuration file using a shell script-like
syntax.
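As an illustration of the shell-like syntax, the following sketch uses shlex.split to break a single line into tokens, honoring quotes and dropping comments. It assumes a current Python 3 interpreter; the tokenize helper and the sample line are made up for this example.

```python
import shlex


def tokenize(line):
    """Split a line using shell-like rules (hypothetical helper).

    Quoted strings become single tokens, and everything after an
    unquoted # is treated as a comment.
    """
    return shlex.split(line, comments=True)


tokens = tokenize('greeting = "hello world"  # trailing comment')
```

Here, tokens ends up as a list of three strings: greeting, =, and hello world (without the quotes).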
5.1.3 Archive Formats
Python's standard library provides support for the popular GZIP
and ZIP (2.0 only) formats. The gzip module reads
and writes GZIP files, and the zipfile module reads and
writes ZIP files. Both modules depend on the zlib data compression
module.
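Both modules can be tried out without touching the disk. The following is a minimal sketch, assuming a current Python 3 interpreter; gzip_roundtrip and zip_roundtrip are hypothetical helpers that write data to an in-memory buffer and read it back.

```python
import gzip
import io
import zipfile


def gzip_roundtrip(data):
    """Compress bytes into an in-memory GZIP stream and read them back."""
    buffer = io.BytesIO()
    with gzip.GzipFile(fileobj=buffer, mode="wb") as f:
        f.write(data)
    buffer.seek(0)
    with gzip.GzipFile(fileobj=buffer, mode="rb") as f:
        return f.read()


def zip_roundtrip(members):
    """Store a {name: bytes} dictionary in an in-memory ZIP archive
    and read every member back."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        for name, data in members.items():
            archive.writestr(name, data)
    buffer.seek(0)
    with zipfile.ZipFile(buffer, "r") as archive:
        return {name: archive.read(name) for name in archive.namelist()}


text = b"hello, archive! " * 100
restored = gzip_roundtrip(text)
files = zip_roundtrip({"readme.txt": b"read me", "data/sample.txt": text})
```

A GZIP file holds a single compressed stream, while a ZIP archive holds any number of named members.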