I l@ve RuBoard |
12.12 Module: XML Lexing (Shallow Parsing)Credit: Paul Prescod It's not uncommon to want to work with the form of an XML document rather than with the structural information it contains (e.g., to change a bunch of entity references or element names). The XML may be slightly incorrect, enough to choke a traditional parser. In such cases, you need an XML lexer, also known as a shallow parser. You might be tempted to hack together a regular expression or two to do some simple parsing of XML (or other structured text format), rather than using the appropriate library module. Don't�it's not a trivial task to get the regular expressions right! However, the hard work has already been done for you in Example 12-1, which contains already-debugged regular expressions and supporting functions that you can use for shallow-parsing tasks on XML data (or, more importantly, on data that is almost, but not quite, correct XML, so that a real XML parser seizes up with error diagnostics when you try to parse your data with it). A traditional XML parser does a few tasks:
The shallow parser in Example 12-1 performs only the first task. It breaks up the document and presumes that you know how to deal with the fragments yourself. That makes it efficient and forgiving of errors in the document. The lexxml function is the code's entry point. Call lexxml(data) to get back a list of tokens (strings that are bits of the document). This lexer also makes it easy to get back the exact original content of the document. Unless there is a bug in the recipe, the following code should always succeed: tokens = lexxml(data) data2 = "".join(tokens) assert data == data2 If you find any bugs that disallow this, please report them! There is a second, optional argument to lexxml that allows you to get back only markup and ignore the text of the document. This is useful as a performance optimization when you care only about tags. The walktokens function in the recipe shows how to walk over the tokens and work with them. Example 12-1. XML lexingimport re class recollector: def _ _init_ _(self): self.res={} def add(self, name, reg ): re.compile(reg) # Check that it is valid self.res[name] = reg % self.res collector = recollector( ) a = collector.add a("TextSE" , "[^<]+") a("UntilHyphen" , "[^-]*-") a("Until2Hyphens" , "%(UntilHyphen)s(?:[^-]%(UntilHyphen)s)*-") a("CommentCE" , "%(Until2Hyphens)s>?") a("UntilRSBs" , "[^\\]]*](?:[^\\]]+])*]+") a("CDATA_CE" , "%(UntilRSBs)s(?:[^\\]>]%(UntilRSBs)s)*>" ) a("S" , "[ \\n\\t\\r]+") a("NameStrt" , "[A-Za-z_:]|[^\\x00-\\x7F]") a("NameChar" , "[A-Za-z0-9_:.-]|[^\\x00-\\x7F]") a("Name" , "(?:%(NameStrt)s)(?:%(NameChar)s)*") a("QuoteSE" , "\"[^\"]*\"|'[^']*'") a("DT_IdentSE" , "%(S)s%(Name)s(?:%(S)s(?:%(Name)s|%(QuoteSE)s))*" ) a("MarkupDeclCE" , "(?:[^\\]\"'><]+|%(QuoteSE)s)*>" ) a("S1" , "[\\n\\r\\t ]") a("UntilQMs" , "[^?]*\\?+") a("PI_Tail" , "\\?>|%(S1)s%(UntilQMs)s(?:[^>?]%(UntilQMs)s)*>" ) a("DT_ItemSE" , "<(?:!(?:--%(Until2Hyphens)s>|[^-]%(MarkupDeclCE)s)|\\?%(Name)s" "(?:%(PI_Tail)s))|%%%(Name)s;|%(S)s" ) a("DocTypeCE" , "%(DT_IdentSE)s(?:%(S)s)?(?:\\[(?:%(DT_ItemSE)s)*](?:%(S)s)?)?>?" ) a("DeclCE" , "--(?:%(CommentCE)s)?|\\[CDATA\\[(?:%(CDATA_CE)s)?|DOCTYPE" "(?:%(DocTypeCE)s)?") a("PI_CE" , "%(Name)s(?:%(PI_Tail)s)?") a("EndTagCE" , "%(Name)s(?:%(S)s)?>?") a("AttValSE" , "\"[^<\"]*\"|'[^<']*'") a("ElemTagCE" , "%(Name)s(?:%(S)s%(Name)s(?:%(S)s)?=(?:%(S)s)?(?:%(AttValSE)s))*" "(?:%(S)s)?/?>?") a("MarkupSPE" , "<(?:!(?:%(DeclCE)s)?|\\?(?:%(PI_CE)s)?|/(?:%(EndTagCE)s)?|" "(?:%(ElemTagCE)s)?)") a("XML_SPE" , "%(TextSE)s|%(MarkupSPE)s") a("XML_MARKUP_ONLY_SPE" , "%(MarkupSPE)s") def lexxml(data, markuponly=0): if markuponly: reg = "XML_MARKUP_ONLY_SPE" else: reg = "XML_SPE" regex = re.compile(collector.res[reg]) return regex.findall(data) def assertlex(data, numtokens, markuponly=0): tokens = lexxml(data, markuponly) if len(tokens)!=numtokens: assert len(lexxml(data))==numtokens, \ "data = '%s', numtokens = '%s'" %(data, numtokens) if not markuponly: assert "".join(tokens)==data walktokens(tokens) def walktokens(tokens): print for token in tokens: if token.startswith("<"): if token.startswith("<!"): print "declaration:", token elif token.startswith("<?xml"): print "xml declaration:", token elif token.startswith("<?"): print "processing instruction:", token elif token.startswith("</"): print "end-tag:", token elif token.endswith("/>"): print "empty-tag:", token elif token.endswith(">"): print "start-tag:", token else: print "error:", token else: print "text:", token def testlexer( ): # This test suite could be larger! assertlex("<abc/>", 1) assertlex("<abc><def/></abc>", 3) assertlex("<abc>Blah</abc>", 3) assertlex("<abc>Blah</abc>", 2, markuponly=1) assertlex("<?xml version='1.0'?><abc>Blah</abc>", 3, markuponly=1) assertlex("<abc>Blah&foo;Blah</abc>", 3) assertlex("<abc>Blah&foo;Blah</abc>", 2, markuponly=1) assertlex("<abc><abc>", 2) assertlex("</abc></abc>", 2) assertlex("<abc></def></abc>", 3) if _ _name_ _=="_ _main_ _": testlexer( ) 12.12.1 See AlsoThis recipe is based on the following article, with regular expressions translated from Perl into Python: "REX: XML Shallow Parsing with Regular Expressions", Robert D. Cameron, Markup Languages: Theory and Applications, Summer 1999, pp. 61-88, http://www.cs.sfu.ca/~cameron/REX.html. |
I l@ve RuBoard |