5.3 The xml.parsers.expat Module
(Optional) The xml.parsers.expat module is an interface to James Clark's Expat XML parser.
Example 5-3 demonstrates this full-featured and fast parser, which is an excellent choice for
production use.
Example 5-3. Using the xml.parsers.expat Module
File: xml-parsers-expat-example-1.py
from xml.parsers import expat
class Parser:
def _ _init_ _(self):
self._parser = expat.ParserCreate()
self._parser.StartElementHandler = self.start
self._parser.EndElementHandler = self.end
self._parser.CharacterDataHandler = self.data
def feed(self, data):
self._parser.Parse(data, 0)
def close(self):
self._parser.Parse("", 1) # end of data
del self._parser # get rid of circular references
def start(self, tag, attrs):
print "START", repr(tag), attrs
def end(self, tag):
print "END", repr(tag)
def data(self, data):
print "DATA", repr(data)
p = Parser()
p.feed("<tag>data</tag>")
p.close()
START u'tag' {}
DATA u'data'
END u'tag'
Note that the parser returns Unicode strings, even if you pass it
ordinary text. By default, the parser interprets the source text as
UTF-8 (as per the XML standard). To use other encodings, make sure
the XML file contains an encoding directive. Example 5-4 shows how to read ISO Latin-1 text using xml.parsers.expat.
Example 5-4. Using the xml.parsers.expat Module to Read ISO Latin-1 Text
File: xml-parsers-expat-example-2.py
from xml.parsers import expat
class Parser:
def _ _init_ _(self):
self._parser = expat.ParserCreate()
self._parser.StartElementHandler = self.start
self._parser.EndElementHandler = self.end
self._parser.CharacterDataHandler = self.data
def feed(self, data):
self._parser.Parse(data, 0)
def close(self):
self._parser.Parse("", 1) # end of data
del self._parser # get rid of circular references
def start(self, tag, attrs):
print "START", repr(tag), attrs
def end(self, tag):
print "END", repr(tag)
def data(self, data):
print "DATA", repr(data)
p = Parser()
p.feed("""\
<?xml version='1.0' encoding='iso-8859-1'?>
<author>
<name>fredrik lundh</name>
<city>linköping</city>
</author>
"""
)
p.close()
START u'author' {}
DATA u'\012'
START u'name' {}
DATA u'fredrik lundh'
END u'name'
DATA u'\012'
START u'city' {}
DATA u'link\366ping'
END u'city'
DATA u'\012'
END u'author'
|