I l@ve RuBoard

12.8 Converting Ad-Hoc Text into XML Markup

Credit: Luther Blissett

12.8.1 Problem

You have plain text that follows certain common conventions (e.g., paragraphs are separated by empty lines, text meant to be highlighted is marked up _like this_), and you need to mark it up automatically as XML.

12.8.2 Solution

Producing XML markup from data that is otherwise structured, including plain text that rigorously follows certain conventions, is really quite easy:

def markup(text, paragraph_tag='paragraph', inline_tags={'_':'highlight'}):
    # First we must escape special characters, which have special meaning in XML
    text = text.replace('&', "&amp;")\
               .replace('<', "&lt;")\
               .replace(''', "&quot;")\
               .replace('>', "&gt;")

    # paragraph markup; pass any false value as the paragraph_tag argument to disable
    if paragraph_tag:
        # Get list of lines, removing leading and trailing empty lines:
        lines = text.splitlines(1)
        while lines and lines[-1].isspace(): lines.pop(  )
        while lines and lines[0].isspace(  ): lines.pop(0)

        # Insert paragraph tags on empty lines:
        marker = '</%s>\n\n<%s>' % (paragraph_tag, paragraph_tag)
        for i in range(len(lines)):
            if lines[i].isspace(  ):
                lines[i] = marker
                # remove 'empty paragraphs':
                if i!=0 and lines[i-1] == marker:
                    lines[i-1] = ''

        # Join text again
        lines.insert(0, '<%s>'%paragraph_tag)
        lines.append('</%s>\n'%paragraph_tag)
        text = ''.join(lines)

    # inline-tags markup; pass any false value as the inline_tags argument to disable
    if inline_tags:
        for ch, tag in inline_tags.items(  ):
            pieces = text.split(ch)
            # Text should have an even count of ch, so pieces should have
            # odd length. But just in case there's an extra unmatched ch:
            if len(pieces)%2 == 0: pieces.append('')
            for i in range(1, len(pieces), 2):
                pieces[i] = '<%s>%s</%s>'%(tag, pieces[i], tag)
            # Join text again
            text = ''.join(pieces)

    return text

if _ _name_ _ == '_ _main_ _':
    sample = """
Just some _important_ text,
with inlike "_markup_" by convention.

Oh, and paragraphs separated by
empty (or all-whitespace) lines.
Sometimes more than one, wantonly.


I've got _lots_ of old text like that
around -- don't you?
"""
    print markup(sample)

12.8.3 Discussion

Sometimes you have a lot of plain text that needs to be automatically marked up in a structured way�usually, these days, as XML. If the plain text you start with follows certain typical conventions, you can use them heuristically to get each text snippet into marked-up form with reasonably little effort.

In my case, the two conventions I had to work with were: paragraphs are separated by blank lines (empty or with some spaces, and sometimes several redundant blank lines for just one paragraph separation), and underlines (one before, one after) are used to indicate important text that should be highlighted. This seems to be quite far from the brave new world of:

<paragraph>blah blah</paragraph>

But in reality, it isn't as far as all that, thanks, of course, to Python! While you could use regular expressions for this task, I prefer the simplicity of the split and splitlines methods, with join to put the strings back together again.

12.8.4 See Also

StructuredText, the latest incarnation of which, ReStructuredText, is part of the docutils package (http://docutils.sourceforge.net/).

I l@ve RuBoard