XML::RSS (Perl and XML)

9.2.1. Introduction to RSS

RSS (short for Rich Site Summary or Really Simple Syndication, depending upon whom you ask) is one of the first XML applications whose use became rapidly popular on a global scale, thanks to the Web. While RSS itself is little more than an agreed-upon way to summarize web page content, it gives the administrators of news sites, web logs, and any other frequently updated web site a standard and sweat-free way of telling the world what's new. Programs that can parse RSS can do whatever they'd like with this document, perhaps telling their masters by mail or by web page what interesting things they have learned in their travels. A special type of RSS program is an aggregator, a program that collects RSS from various sources and then knits it together into new RSS documents combining the information, so that lazier RSS-parsing programs won't have to travel so far.

Current popular aggregators include Netscape, by way of its customizable my.netscape.com site (which was, in fact, the birthplace of the earliest RSS versions) and Dave Winer's http://www.scripting.com (whose aggregator has a public frontend at http://aggregator.userland.com/register). These aggregators, in turn, share what they pick up as RSS, turning them into one-stop RSS shops for other interested entities. Web sites that collect and present links to new stuff around the Web, such as the O'Reilly Network's Meerkat (http://meerkat.oreillynet.com), hit these aggregators often to get information on RSS-enabled web sites, and then present it to the site's user.

9.2.2. Using XML::RSS

The XML::RSS module is useful whether you're coming or going. It can parse RSS documents that you hand it, or it can help you write your own RSS documents. Naturally, you can combine these abilities to parse a document, modify it, and then write it out again; the module uses a simple and well-documented object model to represent documents in memory, just like the tree-based modules we've seen so far. You can think of this sort of XML helper module as a tricked-out version of a familiar general XML tool.

In the following examples, we'll work with a notional web log, a frequently updated and Web-readable personal column or journal. RSS lends itself to web logs, letting them quickly summarize their most recent entries within a single RSS document.

Here are a couple of web log entries (admittedly sampling from the shallow end of the concept's notional pool, but it works for short examples). First, here is how one might look in a web browser:

Oct 18, 2002 19:07:06

Today I asked lab monkey 45-X how he felt about his recent chess
victory against Dr. Baker. He responded by biting my kneecap. (The
monkey did, I mean.) I
think this could lead to a communications breakthrough. As well as
painful swelling, which is unfortunate.

Oct 27, 2002 22:56:11

On a tangential note, Dr. Xing's research of purple versus green monkey
trans-sociopolitical impact seems to be stalled, having gained no
ground for several weeks. Today she learned that her lab assistant
never mentioned on his job application that he was colorblind. Oh well.

Here it is again, as an RSS v1.0 document:

<?xml version="1.0" encoding="UTF-8"?>

<rdf:RDF
 xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
 xmlns="http://purl.org/rss/1.0/"
 xmlns:dc="http://purl.org/dc/elements/1.1/"
 xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/"
 xmlns:syn="http://purl.org/rss/1.0/modules/syndication/"
>

<channel rdf:about="http://www.jmac.org/linklog/">
<title>Link's Log</title>
<link>http://www.jmac.org/linklog/</link>
<description>Dr. Lance Link's online research journal</description>
<dc:language>en-us</dc:language>
<dc:rights>Copyright 2002 by Dr. Lance Link</dc:rights>
<dc:date>2002-10-27T23:59:15+05:00</dc:date>
<dc:publisher>[email protected]</dc:publisher>
<dc:creator>[email protected]</dc:creator>
<dc:subject>llink</dc:subject>
<syn:updatePeriod>daily</syn:updatePeriod>
<syn:updateFrequency>1</syn:updateFrequency>
<syn:updateBase>2002-03-03T00:00:00+05:00</syn:updateBase>
<items>
 <rdf:Seq>
  <rdf:li rdf:resource="http://www.jmac.org/linklog?2002-10-27#22:56:11" />
  <rdf:li rdf:resource="http://www.jmac.org/linklog?2002-10-18#19:07:06" />
 </rdf:Seq>
</items>
</channel>

<item rdf:about="http://www.jmac.org/linklog?2002-10-27#22:56:11">
<title>2002-10-27 22:56:11</title>
<link>http://www.jmac.org/linklog?2002-10-27#22:56:11</link>
<description>
On a tangential note, Dr. Xing's research of purple versus green monkey
trans-sociopolitical impact seems to be stalled, having gained no
ground for several weeks. Today she learned that her lab assistant
never mentioned on his job application that he was colorblind. Oh well.
</description>
</item>

<item rdf:about="http://www.jmac.org/linklog?2002-10-18#19:07:06">
<title>2002-10-18 19:07:06</title>
<link>http://www.jmac.org/linklog?2002-10-18#19:07:06</link>
<description>
Today I asked lab monkey 45-X how he felt about his recent chess
victory against Dr. Baker. He responded by biting my kneecap. (The
monkey did, I mean.) I
think this could lead to a communications breakthrough. As well as
painful swelling, which is unfortunate.
</description>
</item>

</rdf:RDF>

Note RSS 1.0's use of various metadata-enabling namespaces before it gets into the meat of laying out the actual content.[30] The curious may wish to point their web browsers at the URIs with which they identify themselves, since they are good little namespaces who put their documentation where their mouth is. ("dc" is the Dublin Core, a standard set of elements for describing a document's source. "syn" points to a syndication namespace -- itself a sub-project by the RSS people -- holding a handful of elements that state how often a source refreshes itself with new content.) Then the whole document is wrapped up in an RDF element.

[30]I am careful to specify the RSS version here because RSS Versions 0.9 and 0.91 documents are much simpler in structure, eschewing namespaces and RDF-encapsulated metadata in favor of a simple list of <item> elements wrapped in an <rss> element. For this reason, many people prefer to use pre-1.0 RSS, and socially astute RSS software can read from and write to all these versions. XML::RSS can do this, and as a side effect, allows easy conversion between these different versions (given a single original document).

9.2.2.1. Parsing

Using XML::RSS to read an existing document ought to look familiar if you've read the preceding chapters, and is quite simple:

use XML::RSS;

# Accept files from user arguments
my @rss_docs = @ARGV;

# For now, we'll assume they're all files on disk...
foreach my $rss_doc (@rss_docs) {

  # First, create a new RSS object that will represent the parsed doc
  my $rss = XML::RSS->new;
  
  # Now parse that puppy
  $rss->parsefile($rss_doc);
  
  # And that's all. Do whatever else we may want here.
}
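
Once a document is parsed, the interesting part is getting the data back out. The following is a minimal sketch rather than a definitive recipe: it assumes XML::RSS is installed, inlines a small made-up RSS 0.91 document so that it needs no files on disk, and relies on the channel accessor, the items array reference, and the output setting as described in the module's own documentation. It finishes by asking for the same data as RSS 1.0, which is the conversion trick mentioned earlier.

```perl
use strict;
use warnings;
use XML::RSS;

# A tiny RSS 0.91 document, inlined so the sketch needs no files on disk.
my $xml = <<'END_RSS';
<?xml version="1.0"?>
<rss version="0.91">
 <channel>
  <title>Link's Log</title>
  <link>http://www.jmac.org/linklog/</link>
  <description>Dr. Lance Link's online research journal</description>
  <language>en-us</language>
  <item>
   <title>2002-10-27 22:56:11</title>
   <link>http://www.jmac.org/linklog?2002-10-27#22:56:11</link>
  </item>
  <item>
   <title>2002-10-18 19:07:06</title>
   <link>http://www.jmac.org/linklog?2002-10-18#19:07:06</link>
  </item>
 </channel>
</rss>
END_RSS

my $rss = XML::RSS->new;
$rss->parse($xml);    # parse() takes a string; parsefile() takes a filename

# Channel-level data comes back through the channel accessor...
print "Channel: ", $rss->channel('title'), "\n";

# ...and the items live in an array reference of plain hashes.
foreach my $item (@{ $rss->{items} }) {
    print "  $item->{title}\n    $item->{link}\n";
}

# Ask for the same data as RSS 1.0 instead of 0.91.
$rss->{output} = '1.0';
my $as_rdf = $rss->as_string;
print length($as_rdf), " characters of RSS 1.0\n";
```

Swapping parse for parsefile and the inline string for a filename gives the loop shown above something to do with each document it reads.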

9.2.2.2. Inheriting from XML::Parser

If that parsefile method looked familiar, it had good reason: it's the same one used by grandpappy XML::Parser, both in word and deed.

XML::RSS takes direct advantage of XML::Parser's inheritability right off the bat, placing this module into its @ISA array before getting down to business with all that map definition.

It shouldn't surprise those familiar with object-oriented Perl programming that, while it chooses to define its own new method, it does little more than invoke SUPER::new. In doing so, it lets XML::Parser initialize itself as it sees fit. Let's look at some code from that module itself -- specifically its constructor, new, which we invoked in our example:

sub new {
    my $class = shift;
    my $self = $class->SUPER::new(Namespaces    => 1,
                                  NoExpand      => 1,
                                  ParseParamEnt => 0,
                                  Handlers      => { Char    => \&handle_char,
                                                     XMLDecl => \&handle_dec,
                                                     Start   => \&handle_start});
    bless ($self,$class);
    $self->_initialize(@_);
    return $self;
}

Note how the module calls its parent's new with very specific arguments. All are standard and well-documented setup instructions in XML::Parser's public interface, but by taking these parameters out of the user's hands and into its own, the XML::RSS module knows exactly what it's getting -- in this case, a parser object with namespace processing enabled, but not expansion or parsing of parameter entities -- and defines for itself what its handlers are.

The result of calling SUPER::new is an XML::Parser object, which this module doesn't want to hand back to its users -- doing so would diminish the point of all this abstraction! Therefore, it reblesses the object (at this point, deemed to be a new $self for this class) using the Perl-itically correct two-argument method, so that the returned object claims fealty to XML::RSS, not XML::Parser.
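
The same inherit-then-rebless pattern can be tried in isolation. This sketch uses two made-up classes, My::BaseParser and My::Feed, as stand-ins for XML::Parser and XML::RSS; neither name exists in any real module. The base class deliberately uses the one-argument form of bless so that the effect of the subclass's two-argument rebless is visible.

```perl
use strict;
use warnings;

# Hypothetical stand-in for XML::Parser.
package My::BaseParser;

sub new {
    my ($class, %args) = @_;
    my $self = { %args };
    # One-argument bless: the object always lands in My::BaseParser,
    # no matter which class name was used to call new.
    return bless $self;
}

sub describe { return 'handled by ' . ref($_[0]) }

# Hypothetical stand-in for XML::RSS.
package My::Feed;

our @ISA = ('My::BaseParser');

sub new {
    my $class = shift;
    # The subclass, not the caller, decides the parent's settings.
    my $self = $class->SUPER::new(Namespaces => 1, NoExpand => 1);
    # Two-argument bless: claim the object for the subclass.
    bless($self, $class);
    $self->{items} = [];
    return $self;
}

package main;

my $feed = My::Feed->new;
print ref($feed), "\n";                                  # My::Feed
print $feed->describe, "\n";                             # handled by My::Feed
print "still a My::BaseParser\n" if $feed->isa('My::BaseParser');
```

Without the bless($self, $class) line, ref($feed) would report My::BaseParser, and any method defined only in My::Feed would be out of reach.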

9.2.3. The Object Model

Since XML::RSS does nothing unusual in terms of parser object construction and document parsing, let's look at where it starts to cut an edge of its own: in the shape of the internal data structure it builds and to which it applies its method-based API.

XML::RSS's code is made up mostly of accessors -- methods that read and write to predefined places in the structure it's building. Using nothing more complex than a few Perl hashes, XML::RSS builds maps of what it expects to see in the document, made of nested hash references with keys named after the elements and attributes it might encounter, nested to match the way one might find them in a real RSS XML document. The module defines one of these maps for each version of RSS that it handles. Here's the simplest one, which covers RSS Version 0.9:

my %v0_9_ok_fields = (
    channel => { 
        title       => '',
        description => '',
        link        => '',
        },
    image  => { 
        title => '',
        url   => '',
        link  => '' 
        },
    textinput => { 
        title       => '',
        description => '',
        name        => '',
        link        => ''
        },
    items => [],
    num_items => 0,
    version         => '',
    encoding        => ''
);

This model is not entirely made up of hash references, of course; the top-level "items" key holds an empty array reference, and otherwise, all the end values for all the keys are scalars -- all empty strings. The exception is num_items, which isn't among RSS's elements. Instead, it exists purely for convenience, trading a little structural elegance for ease of use (presumably so the code doesn't have to keep explicitly dereferencing the items array reference and then getting its value in scalar context).

On the other hand, this example risks going out of sync with reality if what it describes changes and the programmer doesn't remember to update the number when that happens. However, this sort of thing often comes down to programming style, which is far beyond the bounds of this book.
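
To make that trade-off concrete, here is a small sketch in plain Perl. The add_entry helper is our own invention for illustration and is not taken from XML::RSS; it shows a stored counter staying honest only as long as every change goes through one gatekeeper, while the derived count needs no bookkeeping at all.

```perl
use strict;
use warnings;

my %feed = (items => [], num_items => 0);

# Hypothetical helper: the only code that is supposed to touch both fields.
sub add_entry {
    my ($feed, $entry) = @_;
    push @{ $feed->{items} }, $entry;
    $feed->{num_items}++;
    return $feed->{num_items};
}

add_entry(\%feed, { title => 'first' });
add_entry(\%feed, { title => 'second' });

# The derived count: dereference the array and read it in scalar context.
my $derived = scalar @{ $feed{items} };
print "stored: $feed{num_items}, derived: $derived\n";   # stored: 2, derived: 2

# Bypassing the helper is how the two drift apart.
push @{ $feed{items} }, { title => 'third' };
printf "stored: %d, derived: %d\n",
    $feed{num_items}, scalar @{ $feed{items} };           # stored: 2, derived: 3
```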

There's good reason for this arrangement, besides the fact that hash values have to be set to something (or undef, which is a special sort of something). Each hash doubles as a map for the module's subroutines to follow and a template for the structures themselves. With that in mind, let's see what happens when an XML::RSS object is constructed via this module's new class method.
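
The double duty of such a hash, as both template and map, can be sketched in a few lines of plain Perl. Everything here is hypothetical: %ok_fields is a cut-down imitation of the map above, and new_from_template and set_channel_field are names invented for this illustration, not routines from XML::RSS.

```perl
use strict;
use warnings;

# A cut-down map in the style of the one above.
my %ok_fields = (
    channel => { title => '', description => '', link => '' },
    items   => [],
);

# Template duty: copy the nested structure so each document gets its own.
sub new_from_template {
    my ($template) = @_;
    my %copy;
    for my $key (keys %$template) {
        my $value = $template->{$key};
        $copy{$key} = ref($value) eq 'HASH'  ? { %$value }
                    : ref($value) eq 'ARRAY' ? [ @$value ]
                    :                          $value;
    }
    return \%copy;
}

# Map duty: accept only the fields the template knows about.
sub set_channel_field {
    my ($doc, $field, $value) = @_;
    die "Unknown channel field '$field'\n"
        unless exists $ok_fields{channel}{$field};
    $doc->{channel}{$field} = $value;
    return $doc;
}

my $doc = new_from_template(\%ok_fields);
set_channel_field($doc, title => "Link's Log");
print "title: $doc->{channel}{title}\n";

# A field the map has never heard of is turned away.
eval { set_channel_field($doc, colour => 'purple') };
my $error = $@;
print "rejected: $error" if $error;
```

Because the template is copied rather than shared, filling in one document leaves the map itself, and every other document built from it, untouched.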

9.2.4. Input: User or File

After construction, an XML::RSS object is ready to chew through an RSS document, thanks to the parsing powers afforded to it by its proud parent, XML::Parser. A user only needs to call the object's parse or parsefile methods, and off it goes -- filling itself up with data.

Despite this, many of these objects will live long [31] and productive lives without sinking their teeth into an existing XML document. Often RSS users would rather have the module help build a document from scratch -- or rather, from the bits of text that programs we write will feed to it. This is when all those accessors come in handy.

[31]Well, a few hundredths of a second on a typical whizbang PC, but we mean long in the poetic sense.

Thus, let's say we have a SQL database somewhere that contains some web log entries we'd like to RSS-ify. We could write up this little script:

#!/usr/bin/perl

# Turn the last 15 entries of Dr. Link's Weblog into an RSS 1.0 document,
# which gets printed to STDOUT.

use warnings;
use strict;

use XML::RSS;
use DBIx::Abstract;

my $MAX_ENTRIES = 15;

my ($output_version) = @ARGV;
$output_version ||= '1.0';
unless ($output_version eq '1.0' or $output_version eq '0.9' 
                                 or $output_version eq '0.91') {
  die "Usage: $0 [version]\nWhere [version] is an RSS version to output: "
    . "0.9, 0.91, or 1.0\nDefault is 1.0\n";
}

my $dbh = DBIx::Abstract->connect({dbname=>'weblog',
                                   user=>'link',
                                   password=>'dirtyape'})
  or die "Couldn't connect to database.\n";

my ($date) = $dbh->select('max(date_added)',
                          'entry')->fetchrow_array;
my ($time) = $dbh->select('max(time_added)',
                          'entry')->fetchrow_array;

my $time_zone = "+05:00"; # This happens to be where I live. :)
my $rss_time = "${date}T$time$time_zone";
# base time is when I started the blog, for the syndication info
my $base_time = "2001-03-03T00:00:00$time_zone";

# I'll choose to use RSS version 1.0 here, which stuffs some meta-information into 
# 'modules' that go into their own namespaces, such as 'dc' (for Dublin Core) or 
# 'syn' (for RSS Syndication), but fortunately it doesn't make defining the document 
# any more complex, as you can see below...

my $rss = XML::RSS->new(version=>'1.0', output=>$output_version);

$rss->channel(
              title=>"Dr. Link's Weblog",
              link=>'http://www.jmac.org/linklog/',
              description=>"Dr. Link's weblog and online journal",
              dc=> {
                    date=>$rss_time,
                    creator=>'[email protected]',
                    rights=>'Copyright 2002 by Dr. Lance Link',
                    language=>'en-us',
                   },
              syn=> {
                     updatePeriod=>'daily',
                     updateFrequency=>1,
                     updateBase=>$base_time,
                    },
             );


$dbh->query("select * from entry order by id desc limit $MAX_ENTRIES");
while (my $entry = $dbh->fetchrow_hashref) {
  # Replace XML-naughty characters with entities
  $$entry{entry} =~ s/&/&amp;/g;
  $$entry{entry} =~ s/</&lt;/g;
  $$entry{entry} =~ s/'/&apos;/g;
  $$entry{entry} =~ s/"/&quot;/g;
  $rss->add_item(
         title=>"$$entry{date_added} $$entry{time_added}",
         link=>"http://www.jmac.org/linklog?$$entry{date_added}#$$entry{time_added}",
         description=>$$entry{entry},
                );
}

# Just throw the results into standard output. :)
print $rss->as_string;

Did you see any XML there? We didn't. Well, OK, we did have to give the truth of the matter a little nod by tossing in those entity-escape regexes, but other than that, we were reading from a database and then stuffing what we found into an object by way of a few method calls (or rather, a single, looped call to its add_item method). These calls accepted, as their sole argument, a hash made of some straightforward strings. While we (presumably) wrote this program to let our web log take advantage of everything RSS has to offer, no actual XML was munged in the production of this file.
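
Those entity-escape regexes only work because the ampersand is handled first; run them in another order and the ampersands introduced by the other replacements get escaped a second time. One way to sidestep the ordering question is a single-pass substitution driven by a lookup table. This is a sketch in plain Perl: escape_xml and %ENTITY_FOR are names of our own choosing, not part of XML::RSS, and unlike the script above it also covers the greater-than sign.

```perl
use strict;
use warnings;

# Hypothetical lookup table of the five predefined XML entities.
my %ENTITY_FOR = (
    '&' => '&amp;',
    '<' => '&lt;',
    '>' => '&gt;',
    '"' => '&quot;',
    "'" => '&apos;',
);

# Hypothetical helper: escape a string for use as XML character data.
sub escape_xml {
    my ($text) = @_;
    # One pass over the string, so an ampersand produced by a replacement
    # is never looked at again and cannot be escaped twice.
    $text =~ s/([&<>"'])/$ENTITY_FOR{$1}/g;
    return $text;
}

print escape_xml(q{He said "x < y" & left}), "\n";
# prints: He said &quot;x &lt; y&quot; &amp; left
```

Depending on the version of XML::RSS installed, the module may already encode these characters when it writes the document out, so it is worth checking the output for double-escaped text before relying on a helper like this.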