[ Team LiB ] |
Recipe 22.3 Parsing XML into SAX Events

22.3.1 Problem

You want to receive Simple API for XML (SAX) events from an XML parser because event-based parsing is faster and uses less memory than parsing that builds a DOM tree.

22.3.2 Solution

Use the XML::SAX module from CPAN:

    use XML::SAX::ParserFactory;
    use MyHandler;

    my $handler = MyHandler->new();
    my $parser  = XML::SAX::ParserFactory->parser(Handler => $handler);
    $parser->parse_uri($FILENAME);
    # or
    $parser->parse_string($XML);

Logic for handling events goes into the handler class (MyHandler in this example), which you write:

    # in MyHandler.pm
    package MyHandler;

    use base qw(XML::SAX::Base);

    sub start_element {      # method names are specified by SAX
        my ($self, $data) = @_;
        # $data is a hash reference with keys like Name and Attributes
        # ...
    }

    # other possible methods include end_element() and characters()

    1;

22.3.3 Discussion

An XML processor that uses SAX has three parts: the XML parser that generates SAX events, the handler that reacts to them, and the stub that connects the two. The XML parser can be XML::Parser, XML::LibXML, or the pure Perl XML::SAX::PurePerl that comes with XML::SAX. The XML::SAX::ParserFactory module selects a parser for you and connects it to your handler. Your handler takes the form of a class that inherits from XML::SAX::Base. The stub is the program shown in the Solution.

The XML::SAX::Base module provides stubs for the different methods that the XML parser calls on your handler. Those methods are listed in Table 22-2, and are the methods defined by the SAX1 and SAX2 standards at http://www.saxproject.org/. The Perl implementation uses more Perl-ish data structures and is described in the XML::SAX::Intro manpage.
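To make the three parts concrete, here is a minimal sketch that keeps the handler and the stub in a single file. The EventLogger class name, the log and text keys stored in the handler object, and the sample XML are all made up for illustration; only start_element, end_element, characters, and the ParserFactory calls come from the recipe. Because it inherits from XML::SAX::Base, any event the handler does not override is silently ignored.

```perl
use strict;
use warnings;

# --- the handler: reacts to the events it cares about ---
package EventLogger;                 # hypothetical class name
use base qw(XML::SAX::Base);

sub start_element {
    my ($self, $data) = @_;
    push @{ $self->{log} }, "start $data->{Name}";   # 'log' is our own key
}

sub end_element {
    my ($self, $data) = @_;
    push @{ $self->{log} }, "end $data->{Name}";
}

sub characters {
    my ($self, $data) = @_;
    # a parser may deliver text in several pieces, so accumulate it
    $self->{text} .= $data->{Data};
}

# --- the stub: connects a parser to the handler ---
package main;
use XML::SAX::ParserFactory;

my $handler = EventLogger->new;
my $parser  = XML::SAX::ParserFactory->parser(Handler => $handler);
$parser->parse_string('<book><title>Perl &amp; XML</title></book>');

print "$_\n" for @{ $handler->{log} };
print "text: $handler->{text}\n";
```

Storing state in the handler object rather than in file-scoped variables is a design choice of this sketch, not a requirement of SAX; it lets more than one handler instance exist at a time.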
The two data structures you need most often are those representing elements and attributes. The $data parameter to start_element and end_element is a hash reference. The keys of the hash are given in Table 22-3.
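As a rough illustration of the element structure, the sketch below builds a hash reference by hand and inspects it the way a start_element method might. The keys used (Name, LocalName, Prefix, NamespaceURI, Attributes) are the ones we understand the Perl SAX2 binding to supply; the describe_element helper and the sample values are hypothetical, and no parser is involved.

```perl
use strict;
use warnings;

# Hand-built stand-in for the $data a parser would pass to start_element.
my $data = {
    Name         => 'mail:message',                       # qualified name
    LocalName    => 'message',                            # name without prefix
    Prefix       => 'mail',
    NamespaceURI => 'http://example.com/dtds/mailspec/',
    Attributes   => {},                                   # no attributes here
};

# Hypothetical helper: summarize an element hash as one line of text.
sub describe_element {
    my ($el) = @_;
    my $text = "<$el->{Name}>";
    $text .= " local=$el->{LocalName}";
    $text .= " ns=$el->{NamespaceURI}" if $el->{NamespaceURI};
    my $count = keys %{ $el->{Attributes} };              # number of attributes
    $text .= " attrs=$count";
    return $text;
}

print describe_element($data), "\n";
```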
An attribute hash has a key for each attribute. The key is structured as "{namespaceURI}attrname". For example, if the current namespace URI is http://example.com/dtds/mailspec/ and the attribute is msgid, the key in the attribute hash is:

    {http://example.com/dtds/mailspec/}msgid

The attribute value is a hash; its keys are given in Table 22-4.
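The key format lends itself to a small lookup helper. The following is a sketch only: attr_value is a hypothetical function, the attribute hash is built by hand rather than produced by a parser, and the Value and LocalName keys inside each attribute are assumed from the Perl SAX2 binding.

```perl
use strict;
use warnings;

# Hand-built stand-in for $data->{Attributes} as passed to start_element.
my $attrs = {
    '{http://example.com/dtds/mailspec/}msgid' => {
        Name         => 'mail:msgid',
        LocalName    => 'msgid',
        Prefix       => 'mail',
        NamespaceURI => 'http://example.com/dtds/mailspec/',
        Value        => '42@example.com',
    },
};

# Hypothetical helper: fetch an attribute's value by namespace and local name.
sub attr_value {
    my ($attributes, $ns_uri, $local_name) = @_;
    my $key  = sprintf '{%s}%s', $ns_uri, $local_name;    # "{namespaceURI}attrname"
    my $attr = $attributes->{$key};
    return defined $attr ? $attr->{Value} : undef;
}

my $msgid = attr_value($attrs, 'http://example.com/dtds/mailspec/', 'msgid');
print defined $msgid ? "msgid is $msgid\n" : "no msgid attribute\n";
```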
Example 22-4 shows how to list the book titles using SAX events. It's more complex than the DOM solution because with SAX we must keep track of where we are in the XML document.

Example 22-4. sax-titledumper

    # in TitleDumper.pm
    # TitleDumper.pm -- SAX handler to display titles in books file
    package TitleDumper;

    use base qw(XML::SAX::Base);

    my $in_title = 0;

    # if we're entering a title, increase $in_title
    sub start_element {
        my ($self, $data) = @_;
        if ($data->{Name} eq 'title') {
            $in_title++;
        }
    }

    # if we're leaving a title, decrease $in_title and print a newline
    sub end_element {
        my ($self, $data) = @_;
        if ($data->{Name} eq 'title') {
            $in_title--;
            print "\n";
        }
    }

    # if we're in a title, print any text we get
    sub characters {
        my ($self, $data) = @_;
        if ($in_title) {
            print $data->{Data};
        }
    }

    1;

The XML::SAX::Intro manpage provides a gentle introduction to XML::SAX parsing.

22.3.4 See Also

Chapter 5 of Perl & XML; the documentation for the CPAN modules XML::SAX, XML::SAX::Base, and XML::SAX::Intro