Chapter 3. XML Basics: Reading and Writing

Contents:

XML Parsers
XML::Parser
Stream-Based Versus Tree-Based Processing
Putting Parsers to Work
XML::LibXML
XML::XPath
Document Validation
XML::Writer
Character Sets and Encodings

This chapter covers the two most important tasks in working with XML: reading it into memory and writing it out again. XML is a structured, predictable, and standard data storage format, and as such carries a price. Unlike the line-by-line, make-it-up-as-you-go style that typifies text hacking in Perl, XML expects you to learn the rules of its game -- the structures and protocols outlined in Chapter 2, "An XML Recap" -- before you can play with it. Fortunately, much of the hard work is already done, in the form of module-based parsers and other tools that trailblazing Perl and XML hackers already created (some of which we touched on in Chapter 1, "Perl and XML").

Knowing how to use parsers is very important. They typically drive the rest of the processing for you, or at least get the data into a state where you can work with it. Any good programmer knows that getting the data ready is half the battle. We'll look deeply into the parsing process and detail the strategies used to drive processing.

Parsers come with a bewildering array of options that let you configure the output to your needs. Which character set should you use? Should you validate the document or merely check if it's well formed? Do you need to expand entity references, or should you keep them as references? How can you set handlers for events or tell the parser to build a tree for you? We'll explain these options fully so you can get the most out of parsing.

Finally, we'll show you how to spit XML back out, which can be surprisingly tricky if you aren't aware of XML's expectations regarding text encoding. Getting this step right is vital if you ever want to be able to use your data again without painful hand fixing.

3.1. XML Parsers

File I/O is an intrinsic part of any programming language, but it has always been done at a fairly low level: reading a character or a line at a time, running it through a regular expression filter, etc. Raw text is an unruly commodity, lacking any clear rules for how to separate discrete portions, other than basic, flat concepts such as newline-separated lines and tab-separated columns. Consequently, more data packaging schemes are available than even the chroniclers of Babel could have foreseen. It's from this cacophony that XML has risen, providing clear rules for how to create boundaries between data, assign hierarchy, and link resources in a predictable, unambiguous fashion. A program that relies on these rules can read any well-formed XML document, as if someone had jammed a babelfish into its ear.[11]

[11]Readers of Douglas Adams' book The Hitchhiker's Guide to the Galaxy will recall that a babelfish is a living, universal language-translation device, about the size of an anchovy, that fits, head-first, into a sentient being's aural canal.

Where can you get this babelfish to put in your program's ear? An XML parser is a program or code library that translates XML data into either a stream of events or a data object, giving your program direct access to structured data. The XML can come from one or more files or filehandles, a character stream, or a static string. It could be peppered with entity references that may or may not need to be resolved. Some of the parts could come from outside your computer system, living in some far corner of the Internet. It could be encoded in a Latin character set, or perhaps in a Japanese set. Fortunately for you, the developer, none of these details have to be accounted for in your program because they are all taken care of by the parser, an abstract tunnel between the physical state of data and the crystallized representation seen by your subroutines.

An XML parser acts as a bridge between marked-up data (data packaged with embedded XML instructions) and some predigested form your program can work with. In Perl's case, we mean hashes, arrays, scalars, and objects made of references to these old friends. XML can be complex, residing in many files or streams, and can contain unresolved regions (entities) that may need to be patched up. Also, a parser usually tries to accept only good XML, rejecting it if it contains well-formedness errors. Its output has to reflect the structure (order, containment, associative data) while ignoring irrelevant details such as what files the data came from and what character set was used. That's a lot of work. Let's break it down into the parser's major tasks.

In XML, data and markup are mixed together, so the parser first has to sift through a character stream and tell the two apart. Certain characters delimit the instructions from data, primarily angle brackets (< and >) for elements, comments, and processing instructions, and ampersand (&) and semicolon (;) for entity references. The parser also knows when to expect a certain instruction, or if a bad instruction has occurred; for example, an element that contains data must bracket the data between a start tag and an end tag. With this knowledge, the parser can quickly chop a character stream into discrete portions as encoded by the XML markup.
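
To make the sifting step concrete, here is a sketch of our own (not how a real parser works internally): a loop that chops a short string into markup and character data. The sample document is invented for illustration.

use strict;
use warnings;

# a tiny document to chop up
my $xml = '<greeting lang="en">Hello, <b>world</b>!</greeting>';

# \G anchors each match where the previous one ended; anything
# in angle brackets is markup, everything else is character data
while( $xml =~ /\G(?:(<[^>]*>)|([^<]+))/gc ) {
    if   ( defined $1 ) { print "markup: $1\n"; }
    elsif( defined $2 ) { print "data:   $2\n"; }
}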

The next task is to fill in placeholders. Entity references may need to be resolved. Early in the process of reading XML, the processor will have encountered a list of placeholder definitions in the form of entity declarations, which associate a brief identifier with an entity. The identifier is some literal text defined in the document's DTD, and the entity itself can be defined right there or at the business end of a URL. These entities can themselves contain entity references, so the process of resolving an entity can take several iterations before the placeholders are filled in.
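
Here's a rough sketch of that iterative resolution, assuming the declarations have already been collected into a hash (the %entity table and sample text are our inventions, standing in for declarations parsed out of a DTD):

use strict;
use warnings;

# stand-in for parsed <!ENTITY ...> declarations
my %entity = (
    company => 'Sprockets, Inc.',
    footer  => 'Brought to you by &company;',   # entities can nest
);

my $text  = '&footer; Send complaints to &company;';
my $names = join( '|', map { quotemeta } keys %entity );

# keep substituting until no declared references remain; the pass
# limit keeps a circular definition from looping forever
my $passes = 0;
1 while $text =~ s/&($names);/$entity{$1}/g and ++$passes < 10;

print "$text\n";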

You may not always want entities to be resolved. If you're just spitting XML back out after some minor processing, then you may want to turn entity resolution off or substitute your own routine for handling entity references. For example, you may want to resolve external entity references (entities whose values are in locations external to the document, pointed to by URLs), but not resolve internal ones. Most parsers give you the ability to do this, but none will let you use entity references without declaring them.
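
For example, XML::Parser (introduced in the next section) has a NoExpand option; as we read its documentation, setting it to a true value alongside a Default handler makes the parser hand entity references to that handler instead of expanding them. A minimal sketch:

use XML::Parser;

# pass entity references through untouched rather than expanding
# them; NoExpand takes effect when a Default handler is installed
my $parser = XML::Parser->new(
    NoExpand => 1,
    Handlers => { Default => sub { my( $p, $string ) = @_; print $string; } },
);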

That leads to the third task. If you allow the parser to resolve external entities, it will fetch all the documents, local or remote, that contain parts of the larger XML document. In doing so, all these entities get smushed into one unbroken document. Since your program usually doesn't need to know how the document is distributed physically, information about the physical origin of any piece of data goes away once the parser knits the whole document together.

While interpreting the markup, the parser may trip over a syntactic error. XML was designed to make it very easy to spot such errors. Everything from attributes to empty element tags has rigid rules for its construction, so a parser doesn't have to think very hard about it. For example, the following piece of XML has an obvious error. The start tag for the <decree> element contains an attribute with a defective value assignment. The value "now" is missing a second quote character, and there's another error somewhere in the end tag. Can you see it?

<decree effective="now>All motorbikes 
shall be painted red.</decree<

When such an error occurs, the parser has little choice but to shut down the operation. There's no point in trying to parse the rest of the document. The point of XML is to make things unambiguous. If the parser had to guess how the document should look,[12] it would open up the data to uncertainty and you'd lose that precious level of confidence in your program. Instead, the XML framers (wisely, we feel) opted to make XML parsers choke and die on bad XML documents. If the parser likes your XML, it is said to be well formed.

[12]Most HTML browsers try to ignore well-formedness errors in HTML documents, attempting to fix them and move on. While ignoring these errors may seem to be more convenient to the reader, it actually encourages sloppy documents and results in overall degradation of the quality of information on the Web. After all, would you fix parse errors if you didn't have to?
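
In Perl terms, "choke and die" is literal: XML::Parser and its kin call die() when they hit a well-formedness error, so the customary defensive idiom is an eval block (the filename here is just a placeholder):

use XML::Parser;

my $parser = XML::Parser->new;
eval { $parser->parsefile( 'document.xml' ) };
if( $@ ) {
    print "Sorry, not well formed: $@";
} else {
    print "Parsed without complaint.\n";
}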

Syntax errors aren't the only kind a parser can report, though; there are also grammatical errors, which you will encounter only with so-called validating parsers. A document is considered to be valid if it passes a test defined in a DTD. XML-based languages and applications often have DTDs to set a minimal standard above well-formedness for how elements and data should be ordered. For example, the W3C has posted at least one DTD to describe XHTML (the XML-compliant flavor of HTML), listing all elements that can appear, where they can go, and what they can contain. It would be grammatically correct to put a <p> element inside a <body>, but putting <p> inside <head>, for example, would be incorrect. And don't even think about inserting an element <blooby> anywhere in the document, because it isn't declared anywhere in the DTD.[13] If even one error of this type is in a document, then the whole document is considered invalid. It may be well formed, but not valid against the particular DTD. Often, this level of checking is more of a burden than a help, but it's available if you need it.

[13]If you insist on authoring a <blooby>-enabled web page in XML, you can design your own extension by drafting a DTD that uses entity references to pull in the XHTML DTD, and then defines your own special elements on top of it. At this point it's not officially XHTML anymore, but a subclass thereof.
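
To give a flavor of how a DTD encodes such rules, here is a fragment in the spirit of the XHTML DTD (abbreviated and simplified by us): <p> is declared as allowable content for <body> but not for <head>, and <blooby> is never declared at all.

<!ELEMENT html  (head, body)>
<!ELEMENT head  (title)>
<!ELEMENT title (#PCDATA)>
<!ELEMENT body  (p)*>
<!ELEMENT p     (#PCDATA)>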

Rounding out our list is the requirement that a parser ship the digested data to a program or end user. You can do this in many ways, and we devote much of the rest of the book to analyzing them. We can break up the forms into a few categories:

Event stream

First, a parser can generate an event stream: the parser converts a stream of markup characters into a new kind of stream that is more abstract, with data that is partially processed and easier to handle by your program.

Object representation

Second, a parser can construct a data structure that reflects the information in the XML markup. This construction requires more resources from your system, but may be more convenient because it creates a persistent object that will wait around while you work on it.

Hybrid form

We might call the third group "hybrid" output. It includes parsers that try to be smart about processing, using some advance knowledge about the document to construct an object representing only a portion of your document.
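
To preview how different the first two styles feel in practice, here is a sketch using XML::Parser, which supports both (we cover the details later in this chapter):

use XML::Parser;

# event stream: handler subroutines fire as the parser reads
my $stream = XML::Parser->new( Handlers => {
    Start => sub { my( $expat, $elem ) = @_; print "started: $elem\n"; },
    End   => sub { my( $expat, $elem ) = @_; print "ended:   $elem\n"; },
});
$stream->parse( '<memo><to>self</to></memo>' );

# object representation: one call returns a whole data structure
# that waits around for your program to walk it
my $maker = XML::Parser->new( Style => 'Tree' );
my $tree  = $maker->parse( '<memo><to>self</to></memo>' );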

3.1.1. Example (of What Not to Do): A Well-Formedness Checker

We've described XML parsers abstractly, but now it's time to get our hands dirty. We're going to write our own parser whose sole purpose is to check whether a document is well-formed XML or fails that basic test. This is about as simple as a parser can get; it doesn't drive any further processing, but just returns a "yes" or "no."

Our mission here is twofold. First, we hope to shave some of the mystique off of XML processing -- at the end of the day, it's just pushing text around. However, we also want to emphasize that writing a proper parser in Perl (or any language) requires a lot of work, which would be better spent writing more interesting code that uses one of the many available XML-parsing Perl modules. To that end, we'll write only a fraction of a pure-Perl XML parser with a very specific goal in mind.

WARNING: Feel free to play with this program, but please don't try to use this code in a production environment! It's not a real Perl and XML solution, but an illustration of the sorts of things that parsers do. Also, it's incomplete and will not always give correct results, as we'll show later. Don't worry; the rest of this book talks about real XML parsers and Perl tools you'll want to use.

The program is a loop in which regular expressions match XML markup objects and pluck them out of the text. The loop runs until nothing is left to remove, meaning the document is well formed, or until the regular expressions can't match anything in the remaining text, in which case it's not well formed. A few other tests could abort the parsing, such as when an end tag is found that doesn't match the name of the most recently opened element. It won't be perfect, but it should give you a good idea of how a well-formedness parser might work.

Example 3-1 is a routine that parses a string of XML text, tests to see if it is well-formed, and returns a boolean value. We've added some pattern variables to make it easier to understand the regular expressions. For example, the string $ident contains regular expression code to match an XML identifier, which is used for elements, attributes, and processing instructions.

Example 3-1. A rudimentary XML parser

sub is_well_formed {
    my $text = shift;                     # XML text to check

    # match patterns
    my $ident = '[:_A-Za-z][:A-Za-z0-9\-\._]*';   # identifier
    my $optsp = '\s*';                            # optional space
    my $att1 = "$ident$optsp=$optsp\"[^\"]*\"";   # attribute
    my $att2 = "$ident$optsp=$optsp'[^']*'";      # attr. variant
    my $att = "($att1|$att2)";                    # any attribute

    my @elements = ( );                    # stack of open elems

    # loop through the string to pull out XML markup objects
    while( length($text) ) {

        # match an empty element
        if( $text =~ /^<($ident)(\s+$att)*\s*\/>/ ) {
            $text = $';

        # match an element start tag
        } elsif( $text =~ /^<($ident)(\s+$att)*\s*>/ ) {
            push( @elements, $1 );
            $text = $';

        # match an element end tag
        } elsif( $text =~ /^<\/($ident)\s*>/ ) {
            return unless( $1 eq pop( @elements ));
            $text = $';

        # match a comment
        } elsif( $text =~ /^<!--/ ) {
            $text = $';
            # bite off the rest of the comment
            if( $text =~ /-->/ ) {
                $text = $';
                return if( $` =~ /--/ );  # comments can't
                                          # contain '--'
            } else {
                return;
            }

        # match a CDATA section
        } elsif( $text =~ /^<!\[CDATA\[/ ) {
            $text = $';
            # bite off the rest of the CDATA section
            if( $text =~ /\]\]>/ ) {
                $text = $';
            } else {
                return;
            }

        # match a processing instruction
        } elsif( $text =~ m|^<\?$ident\s*[^\?]+\?>| ) {
            $text = $';

        # match extra whitespace
        # (in case there is space outside the root element)
        } elsif( $text =~ m|^\s+| ) {
            $text = $';

        # match character data
        } elsif( $text =~ /(^[^<&>]+)/ ) {
            my $data = $1;
            # make sure the data is inside an element
            return if( $data =~ /\S/ and not( @elements ));
            $text = $';

        # match an entity reference
        } elsif( $text =~ /^&$ident;/ ) {
            $text = $';

        # something unexpected
        } else {
            return;
        }
    }
    return if( @elements );     # the stack should be empty
    return 1;
}

Perl's arrays are so useful partly due to their ability to masquerade as more abstract computer science data constructs.[14] Here, we use a data structure called a stack, which is really just an array that we access with push( ) and pop( ). Items in a stack are last-in, first-out (LIFO), meaning that the last thing put into it will be the first thing to be removed from it. This arrangement is convenient for remembering the names of currently open elements because at any time, the next element to be closed was the last element pushed onto the stack. Whenever we encounter a start tag, it will be pushed onto the stack, and it will be popped from the stack when we find an end tag. To be well-formed, every end tag must match the previous start tag, which is why we need the stack.

[14]The O'Reilly book Mastering Algorithms with Perl by Jon Orwant, Jarkko Hietaniemi, and John Macdonald devotes a chapter to this topic.
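
Here is the stack at work, traced by hand (a hypothetical walk-through, but it matches what the code above does):

# parsing <memo><to>self</to></memo>
#
# see <memo>   push 'memo'               stack: ('memo')
# see <to>     push 'to'                 stack: ('memo', 'to')
# see </to>    pop eq 'to'?    yes       stack: ('memo')
# see </memo>  pop eq 'memo'?  yes       stack: ( )
#
# in <memo><to>self</memo></to>, the third step pops 'to' while
# the end tag says 'memo', so is_well_formed( ) returns false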

The stack represents all the elements along a branch of the XML tree, from the root down to the current element being processed. Elements are processed in the order in which they appear in a document; if you view the document as a tree, it looks like you're going from the root all the way down to the tip of a branch, then back up to another branch, and so on. This is called depth-first order, the canonical way all XML documents are processed.

There are a few places where we deviate from the simple looping scheme to do some extra testing. The code for matching a comment is several steps, since it ends with a three-character delimiter, and we also have to check for an illegal string of dashes "--" inside the comment. The character data matcher, which performs an extra check to see if the stack is empty, is also noteworthy; if the stack is empty, that's an error because nonwhitespace text is not allowed outside of the root element. Here is a short list of well-formedness errors that would cause the parser to return a false result:

- An end tag whose name doesn't match that of the most recently opened element
- A comment containing the illegal string "--", or one that never ends
- A CDATA section that never ends
- Nonwhitespace character data outside the root element
- Markup that matches none of the patterns (a mangled tag, for instance)
- Elements still left open when the text runs out

Try the parser out on some test cases. Probably the simplest complete, well-formed XML document you will ever see is this:

<:-/> 

The next document should cause the parser to halt with an error. (Hint: look at the <message> end tag.)

<memo>
  <to>self</to>
  <message>Don't forget to mow the car and wash the
  lawn.<message>
</memo>

Many other kinds of syntax errors could appear in a document, and our program picks up most of them. However, it does miss a few. For example, there should be exactly one root element, but our program will accept more than one:

<root>I am the one, true root!</root>
<root>No, I am!</root>
<root>Uh oh...</root>

Other problems? The parser cannot handle a document type declaration. This structure is sometimes seen at the top of a document; it specifies a DTD for validating parsers and may also declare some entities. Since the document type declaration has a specialized syntax of its own, we'd have to write another loop just to handle it.
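
For reference, here is roughly what a document type declaration looks like (a made-up example that points to an external DTD and declares one entity):

<!DOCTYPE memo SYSTEM "memo.dtd" [
  <!ENTITY sig "Fred's Fine Sprockets">
]>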

Our parser's most significant omission is the resolution of entity references. It can check basic entity reference syntax, but doesn't bother to expand the entity and insert it into the text. Why is that bad? Consider that an entity can contain more than just some character data. It can contain any amount of markup, too, from an element to a big, external file. Entities can also contain other entity references, so it might require many passes to resolve one entity reference completely. The parser doesn't even check to see if the entities are declared (it couldn't anyway, since it doesn't know how to read document type declaration syntax). Clearly, there is a lot of room for errors to creep into a document through entities, right under the nose of our parser. To fix the problems just mentioned, follow these steps:

  1. Add a parsing loop to read in a document type declaration before any other parsing occurs. Any entity declarations would be parsed and stored, so we can resolve entity references later in the document.

  2. Parse the DTD, if the document type declaration mentions one, to read any entity declarations.

  3. In the main loop, resolve all entity references when we come across them. These entities have to be parsed, and there may be entity references within them, too. The process can be rather loopy, with loops inside loops, recursion, or other complex programming stunts.

What started out as a simple parser now has grown into a complex beast. That tells us two things: that the theory of parsing XML is easy to grasp; and that, in practice, it gets complicated very quickly. This exercise was useful because it showed issues involved in parsing XML, but we don't encourage you to write code like this. On the contrary, we expect you to take advantage of the exhaustive work already put into making ready-made parsers. Let's leave the dark ages and walk into the happy land of prepackaged parsers.



