12.7 Parsing an XML File with xml.parsers.expat

Credit: Mark Nenadov

12.7.1 Problem

The expat parser is normally used through the SAX interface, but sometimes you want to use expat directly to extract the best possible performance.

12.7.2 Solution

Python is very explicit about the lower-level mechanisms that its higher-level modules and packages use. You're normally better off accessing the higher levels, but sometimes, in the last few stages of an optimization quest, or just to gain a better understanding of exactly what is going on, you may want to access the lower levels directly from your code. For example, here is how you can use expat directly, rather than through SAX:

    import xml.parsers.expat

    class MyXML:
        Parser = None  # replaced by a real parser object in __init__

        # Prepare for parsing
        def __init__(self, xml_filename):
            assert xml_filename != ""
            self.xml_filename = xml_filename
            self.Parser = xml.parsers.expat.ParserCreate()
            self.Parser.CharacterDataHandler = self.handleCharData
            self.Parser.StartElementHandler = self.handleStartElement
            self.Parser.EndElementHandler = self.handleEndElement

        # Parse the XML file
        def parse(self):
            try:
                # Binary mode: under Python 3, ParseFile expects read() to return bytes
                xml_file = open(self.xml_filename, "rb")
            except IOError:
                print("ERROR: Can't open XML file %s" % self.xml_filename)
                raise
            else:
                try:
                    self.Parser.ParseFile(xml_file)
                finally:
                    xml_file.close()

        # To be overridden by implementation-specific methods
        def handleCharData(self, data): pass
        def handleStartElement(self, name, attrs): pass
        def handleEndElement(self, name): pass

12.7.3 Discussion

This recipe presents a reusable way to use xml.parsers.expat directly to parse an XML file. SAX is more standardized and richer in functionality, but expat is also usable, and sometimes it can be even lighter than the already lightweight SAX approach. To reuse the MyXML class, all you need to do is define a new class that inherits from MyXML. Inside your new class, override the inherited XML handler methods, and you're ready to go.
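The subclassing step described above might look like the following minimal sketch, which is not part of the original recipe. The TagCounter class and the count_tags helper are hypothetical names introduced only for illustration. The base class is repeated in condensed form so that the block runs on its own, and the file is opened in binary mode on the assumption that the code runs under Python 3.

```python
import os
import tempfile
import xml.parsers.expat


class MyXML:
    """Condensed copy of the recipe's base class, repeated so this sketch is self-contained."""

    def __init__(self, xml_filename):
        self.xml_filename = xml_filename
        self.Parser = xml.parsers.expat.ParserCreate()
        # Bound methods are looked up on self, so subclass overrides are picked up here.
        self.Parser.CharacterDataHandler = self.handleCharData
        self.Parser.StartElementHandler = self.handleStartElement
        self.Parser.EndElementHandler = self.handleEndElement

    def parse(self):
        # Assumption: Python 3, where ParseFile needs a file that yields bytes.
        with open(self.xml_filename, "rb") as xml_file:
            self.Parser.ParseFile(xml_file)

    def handleCharData(self, data): pass
    def handleStartElement(self, name, attrs): pass
    def handleEndElement(self, name): pass


class TagCounter(MyXML):
    """Hypothetical subclass: counts elements per tag name and collects non-blank text."""

    def __init__(self, xml_filename):
        MyXML.__init__(self, xml_filename)
        self.counts = {}
        self.text = []

    def handleStartElement(self, name, attrs):
        self.counts[name] = self.counts.get(name, 0) + 1

    def handleCharData(self, data):
        # expat may deliver text in several pieces, so just accumulate them.
        if data.strip():
            self.text.append(data.strip())


def count_tags(xml_bytes):
    """Hypothetical helper: write xml_bytes to a temporary file and parse it."""
    fd, path = tempfile.mkstemp(suffix=".xml")
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(xml_bytes)
        counter = TagCounter(path)
        counter.parse()
        return counter
    finally:
        os.remove(path)


if __name__ == "__main__":
    result = count_tags(b"<root><item>one</item><item>two</item></root>")
    print(result.counts)
    print("".join(result.text))
```

Only the handlers that matter to the application are overridden here; handleEndElement is inherited unchanged and remains a no-op.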
Specifically, the MyXML class creates a parser object that calls back the callables stored in its handler attributes. The StartElementHandler callable is called at the start of each element, with the tag name and the attributes as arguments. EndElementHandler is called at the end of each element, with the tag name as the only argument. Finally, CharacterDataHandler is called for the text the parser encounters, with the string as the only argument. Note that expat may deliver a single run of text in several chunks, so a handler that needs the whole string should accumulate the pieces. The MyXML class uses its handleStartElement, handleEndElement, and handleCharData methods as these callbacks. Therefore, these are the methods you should override when you subclass MyXML to perform whatever application-specific processing you require.

12.7.4 See Also

Recipe 12.2, Recipe 12.3, Recipe 12.4, and Recipe 12.6 for uses of the higher-level SAX API; while Expat was the brainchild of James Clark, Expat 2.0 is a group project, with a home page at http://expat.sourceforge.net/.