
12.4 Extracting Text from an XML Document

Credit: Paul Prescod

12.4.1 Problem

You need to extract only the text from an XML document, not the tags.

12.4.2 Solution

Once again, subclassing SAX's ContentHandler makes this extremely easy:

from xml.sax.handler import ContentHandler
import xml.sax
import sys

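class textHandler(ContentHandler):
    def characters(self, ch):
        # ch is a Unicode string; encode it and write the raw bytes.
        # (sys.stdout.buffer is the binary stream under Python 3.)
        sys.stdout.buffer.write(ch.encode("Latin-1"))

parser = xml.sax.make_parser()
handler = textHandler()
parser.setContentHandler(handler)
parser.parse("test.xml")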

12.4.3 Discussion

Sometimes you want to get rid of XML tags, for example, to rekey a document or to spellcheck it. This recipe performs this task and works with any well-formed XML document. It is also quite efficient, because SAX streams through the document rather than building a tree in memory. If the document isn't well-formed, you could try a solution based on the XML lexer (shallow parser) shown in Recipe 12.12.
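To see what the well-formedness requirement means in practice, here is a minimal sketch, assuming Python 3. The names CollectingHandler and extract_text_or_none are hypothetical helpers introduced only for this illustration. The handler accumulates text instead of printing it, and the function returns None when the parser rejects the input.

```python
import xml.sax
from xml.sax.handler import ContentHandler


class CollectingHandler(ContentHandler):
    """Hypothetical helper: accumulates character data instead of printing it."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def characters(self, content):
        # The parser may deliver one run of text in several pieces.
        self.chunks.append(content)


def extract_text_or_none(xml_bytes):
    """Return the document's text, or None if it is not well-formed."""
    handler = CollectingHandler()
    try:
        xml.sax.parseString(xml_bytes, handler)
    except xml.sax.SAXParseException:
        # The default error handler raises on fatal (well-formedness) errors.
        return None
    return "".join(handler.chunks)
```

A caller could fall back to a lexer-based approach whenever this function returns None.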

In this recipe's textHandler class, we override ContentHandler's characters method, which the parser calls for each string of text in the XML document (excluding tags, XML comments, and processing instructions), passing as the only argument the piece of text as a Unicode string. We have to encode this Unicode before we can emit it to standard output. In this recipe, we're using the Latin-1 (also known as ISO-8859-1) encoding, which covers all Western-European alphabets and is supported by many popular output devices (e.g., printers and terminal-emulation windows). However, you should use whatever encoding is most appropriate for the documents you're handling and is supported by the devices you use. The configuration of your devices may depend on your operating system's concepts of locale and code page. Unfortunately, these vary too much between operating systems for me to go into further detail.
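One way to make the choice of encoding explicit is sketched below, assuming Python 3. EncodingTextHandler and encode_text are hypothetical names used only for this illustration, and passing errors="replace" is an assumption about what you want to happen when the target encoding cannot represent a character. The recipe itself does not specify this.

```python
import io
import xml.sax
from xml.sax.handler import ContentHandler


class EncodingTextHandler(ContentHandler):
    """Hypothetical variant of textHandler with a configurable encoding."""

    def __init__(self, out, encoding="latin-1"):
        super().__init__()
        self.out = out            # any binary stream, e.g. sys.stdout.buffer
        self.encoding = encoding

    def characters(self, content):
        # errors="replace" substitutes "?" for characters the target
        # encoding cannot represent, instead of raising UnicodeEncodeError.
        self.out.write(content.encode(self.encoding, errors="replace"))


def encode_text(xml_bytes, encoding="latin-1"):
    """Return the document's text as bytes in the requested encoding."""
    buf = io.BytesIO()
    xml.sax.parseString(xml_bytes, EncodingTextHandler(buf, encoding))
    return buf.getvalue()
```

To write straight to the terminal instead of collecting bytes, you could pass sys.stdout.buffer as the out argument and an encoding that matches your device's configuration.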

12.4.4 See Also

Recipe 12.2, Recipe 12.3, and Recipe 12.6 for other uses of the SAX API; see Recipe 12.12 for a very different approach to XML lexing that works on XML fragments.
