SAX doesn't distinguish between different elements; it leaves that burden up to you. You have to sort out the element name in the start_element( ) handler, and maybe use a stack to keep track of element hierarchy. Don't you wish there were some way to abstract that stuff? Ken MacLeod has done just that with his XML::Handler::Subs module.
This module defines an object that branches handler calls to more specific handlers. If you want a handler that deals only with <title> elements, you can write that handler and it will be called. The handler dealing with a start tag must begin with s_, followed by the element's name (replace special characters with an underscore). End tag handlers are the same, but start with e_ instead of s_.
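The naming rule is easy to picture in plain Perl. The sketch below is not the module's source: handler_name( ) and dispatch( ) are hypothetical helpers, and the exact set of characters treated as "special" (anything other than letters, digits, and underscore) is an assumption made for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of XML::Handler::Subs): build the name of
# the handler that would serve an element's start ("s_") or end ("e_") tag.
sub handler_name {
    my ( $prefix, $element_name ) = @_;
    # Assumption: every character outside [a-zA-Z0-9_] becomes an underscore.
    ( my $safe = $element_name ) =~ s/[^a-zA-Z0-9_]/_/g;
    return $prefix . $safe;
}

# Hypothetical dispatcher: call the element-specific handler if the object
# defines one; otherwise do nothing and return undef.
sub dispatch {
    my ( $handler, $prefix, $element ) = @_;
    my $method = handler_name( $prefix, $element->{Name} );
    return $handler->can($method) ? $handler->$method($element) : undef;
}

# A toy handler class that only cares about <title> start tags.
{
    package Demo;
    sub new     { return bless {}, shift }
    sub s_title { return "saw start of title" }
}

print handler_name( 's_', 'title' ), "\n";          # s_title
print handler_name( 'e_', 'xsl:template' ), "\n";   # e_xsl_template
print dispatch( Demo->new, 's_', { Name => 'title' } ), "\n";
```

Elements without a matching handler are simply skipped, which is why a handler class needs to define only the subroutines it cares about.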
That's not all. The base object also has a built-in stack and provides accessor methods to check where you are in the document. The $self->{Names} variable refers to a stack of element names. Use the method in_element( $name ) to test whether the innermost open element is named $name, or within_element( $name ) to test whether the parser is inside an element named $name at any depth.
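The difference between testing the innermost element and testing any enclosing element matters as soon as elements nest. The following stand-alone sketch mimics the two kinds of test on an ordinary array; innermost_is( ) and open_count( ) are hypothetical names chosen for illustration, and the behavior they imitate (in_element( ) matching only the innermost name, within_element( ) counting matches anywhere in the stack) is taken from the module's documentation as we understand it.

```perl
use strict;
use warnings;

# Stand-in for the stack of open element names, outermost first.
# Imagine the parser is reading text inside <em>, which sits inside <h1>.
my @names = qw( html body h1 em );

# True (1) only when $name is the innermost open element.
sub innermost_is {
    my ( $name, @stack ) = @_;
    return @stack && $stack[-1] eq $name ? 1 : 0;
}

# Number of open elements called $name at any depth of the stack.
sub open_count {
    my ( $name, @stack ) = @_;
    return scalar grep { $_ eq $name } @stack;
}

print innermost_is( 'em', @names ), "\n";   # 1
print innermost_is( 'h1', @names ), "\n";   # 0
print open_count( 'h1', @names ), "\n";     # 1
```

A handler that wants all the text under a heading, including text in nested inline elements, needs the second kind of test.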
To try this out, let's write a program that does something element-specific. Given an HTML file, the program outputs everything inside an <h1> element, even inline elements used for emphasis. The code, shown in Example 5-7, is breathtakingly simple.
use XML::Parser::PerlSAX;
use XML::Handler::Subs;

#
# initialize the parser
#
my $parser = XML::Parser::PerlSAX->new( Handler => H1_grabber->new( ) );
$parser->parse( Source => {SystemId => shift @ARGV} );

##
## Handler object: H1_grabber
##
package H1_grabber;
use base( 'XML::Handler::Subs' );

sub new {
    my $type = shift;
    my $self = {@_};
    return bless( $self, $type );
}

#
# handle start of document
#
sub start_document {
    my $self = shift;
    $self->SUPER::start_document( @_ );   # let the base class set up its element stack
    print "Summary of file:\n";
}

#
# handle start of <h1>: output bracket as delineator
#
sub s_h1 {
    print "[";
}

#
# handle end of <h1>: output bracket as delineator
#
sub e_h1 {
    print "]\n";
}

#
# handle character data
#
sub characters {
    my( $self, $props ) = @_;
    my $data = $props->{Data};
    # within_element( ) is true at any depth inside <h1>;
    # in_element( ) would match only when <h1> is the innermost element
    print $data if( $self->within_element( 'h1' ));
}
Let's feed the program a test file:
<html>
  <head><title>The Life and Times of Fooby</title></head>
  <body>
    <h1>Fooby as a child</h1>
    <p>...</p>
    <h1>Fooby grows up</h1>
    <p>...</p>
    <h1>Fooby is in <em>big</em> trouble!</h1>
    <p>...</p>
  </body>
</html>
This is what we get on the other side:
Summary of file:
[Fooby as a child]
[Fooby grows up]
[Fooby is in big trouble!]
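Even the text inside the <em> element was included, because the characters( ) handler checks for an enclosing <h1> at any depth rather than for the immediate parent. In XML::Handler::Subs, within_element( ) performs that test; in_element( ) matches only the innermost open element. XML::Handler::Subs is definitely a useful module to have when doing SAX processing.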
Copyright © 2002 O'Reilly & Associates. All rights reserved.