15.2 Colorizing Python Source Using the Built-in Tokenizer
Credit: Jürgen Hermann

15.2.1 Problem
You need to convert Python source code into HTML markup, rendering comments, keywords, operators, and numeric and string literals in different colors.

15.2.2 Solution
tokenize.tokenize does most of the work and calls us back for each token found, so we can output it with appropriate colorization:

""" MoinMoin - Python Source Parser """
import cgi, string, sys, cStringIO
import keyword, token, tokenize

# Python Source Parser (does highlighting into HTML)

_KEYWORD = token.NT_OFFSET + 1
_TEXT    = token.NT_OFFSET + 2

_colors = {
    token.NUMBER:     '#0080C0',
    token.OP:         '#0000C0',
    token.STRING:     '#004080',
    tokenize.COMMENT: '#008000',
    token.NAME:       '#000000',
    token.ERRORTOKEN: '#FF8080',
    _KEYWORD:         '#C00000',
    _TEXT:            '#000000',
}
class Parser:
    """ Send colorized Python source as HTML to an output file (normally stdout).
    """

    def __init__(self, raw, out = sys.stdout):
        """ Store the source text. """
        self.raw = string.strip(string.expandtabs(raw))
        self.out = out

    def format(self):
        """ Parse and send the colorized source to output. """
        # Store line offsets in self.lines
        self.lines = [0, 0]
        pos = 0
        while 1:
            pos = string.find(self.raw, '\n', pos) + 1
            if not pos: break
            self.lines.append(pos)
        self.lines.append(len(self.raw))

        # Parse the source and write it
        self.pos = 0
        text = cStringIO.StringIO(self.raw)
        self.out.write('<pre><font face="Lucida,Courier New">')
        try:
            tokenize.tokenize(text.readline, self) # self as handler callable
        except tokenize.TokenError, ex:
            msg = ex[0]
            line = ex[1][0]
            self.out.write("<h3>ERROR: %s</h3>%s\n" % (
                msg, self.raw[self.lines[line]:]))
        self.out.write('</font></pre>')

    def __call__(self, toktype, toktext, (srow,scol), (erow,ecol), line):
        """ Token handler """
        if 0: # You may enable this for debugging purposes only
            print "type", toktype, token.tok_name[toktype], "text", toktext,
            print "start", srow,scol, "end", erow,ecol, "<br>"

        # Calculate new positions
        oldpos = self.pos
        newpos = self.lines[srow] + scol
        self.pos = newpos + len(toktext)

        # Handle newlines
        if toktype in [token.NEWLINE, tokenize.NL]:
            self.out.write('\n')
            return

        # Send the original whitespace, if needed
        if newpos > oldpos:
            self.out.write(self.raw[oldpos:newpos])

        # Skip indenting tokens
        if toktype in [token.INDENT, token.DEDENT]:
            self.pos = newpos
            return

        # Map token type to a color group
        if token.LPAR <= toktype <= token.OP:
            toktype = token.OP
        elif toktype == token.NAME and keyword.iskeyword(toktext):
            toktype = _KEYWORD
        color = _colors.get(toktype, _colors[_TEXT])

        style = ''
        if toktype == token.ERRORTOKEN:
            style = ' style="border: solid 1.5pt #FF0000;"'

        # Send text
        self.out.write('<font color="%s"%s>' % (color, style))
        self.out.write(cgi.escape(toktext))
        self.out.write('</font>')

if __name__ == "__main__":
    import os, sys
    print "Formatting..."

    # Open own source
    source = open('python.py').read()

    # Write colorized version to "python.html"
    Parser(source, open('python.html', 'wt')).format()

    # Load HTML page into browser
    if os.name == "nt":
        os.system("explorer python.html")
    else:
        os.system("netscape python.html &")
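
The listing above is Python 2 code: it relies on cStringIO, cgi.escape, print statements, tuple parameters, and the callback form of tokenize.tokenize, none of which exist in current Python 3 releases. As a rough sketch only (not MoinMoin's implementation), the same idea might look like this on Python 3, assuming tokenize.generate_tokens and html.escape as replacements and <span style> markup in place of <font>. The names COLORS, KEYWORD_COLOR, TEXT_COLOR, and colorize are invented for this example, and error tokens and tab expansion are left out.

```python
import html
import io
import keyword
import token
import tokenize

# Colors copied from the recipe's _colors table; the constant names are
# invented for this sketch.
COLORS = {
    token.NUMBER: '#0080C0',
    token.OP: '#0000C0',
    token.STRING: '#004080',
    tokenize.COMMENT: '#008000',
}
KEYWORD_COLOR = '#C00000'
TEXT_COLOR = '#000000'


def colorize(source):
    """Return source as an HTML <pre> block, one colored <span> per token."""
    # starts[n] is the index in source where 1-based line n begins;
    # entry 0 is a placeholder, as in the recipe's self.lines.
    starts = [0, 0]
    for line in io.StringIO(source):
        starts.append(starts[-1] + len(line))

    out = ['<pre>']
    pos = 0  # index just past the last source character already written
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        # These tokens add no visible text of their own; indentation is
        # written below as part of the gap before the next real token.
        if tok.type in (token.INDENT, token.DEDENT, token.ENDMARKER):
            continue
        start = starts[tok.start[0]] + tok.start[1]
        end = starts[tok.end[0]] + tok.end[1]
        out.append(html.escape(source[pos:start]))  # original whitespace
        text = source[start:end]
        if tok.type in (token.NEWLINE, tokenize.NL):
            out.append(text)
        else:
            if tok.type == token.NAME and keyword.iskeyword(text):
                color = KEYWORD_COLOR
            else:
                color = COLORS.get(tok.type, TEXT_COLOR)
            out.append('<span style="color:%s">%s</span>'
                       % (color, html.escape(text)))
        pos = end
    out.append(html.escape(source[pos:]))  # anything after the last token
    out.append('</pre>')
    return ''.join(out)
```

Calling colorize on a source string returns the markup as a string rather than writing to a file object, which keeps the sketch easy to test; writing the result to a file, as the recipe does, is a one-line addition.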
15.2.3 Discussion
This recipe's code is part of MoinMoin (see http://moin.sourceforge.net/). It shows how to use the built-in keyword, token, and tokenize modules to scan Python source code and re-emit it with appropriate color markup but no changes to its original formatting ("no changes" is the hard part!).

The Parser class's constructor saves two things: the multiline string holding the Python source to colorize, and the file object, open for writing, that receives the colorized output. The format method then prepares a self.lines list that holds the offset (the index into the source string, self.raw) of each line's start, and calls tokenize.tokenize, passing self as the callback. Thus, the __call__ method is invoked for each token, with arguments specifying the token type, the token text, and the starting and ending positions in the source (each expressed as a line number and an offset within the line).

The body of the __call__ method reconstructs the token's exact position within the original source string self.raw, so it can emit exactly the same whitespace that was present in the original source. It then picks a color code from the _colors dictionary (which uses HTML color coding), with help from the keyword standard module to determine whether a NAME token is actually a Python keyword (to be emitted in a different color than that used for ordinary identifiers).

The test code at the bottom of the module formats the module itself (which it expects to find saved as python.py) and launches a browser with the result. To stay compatible with stone-age versions of Python, it does not use the standard webbrowser module. If you have no such worries, you can change the last few lines of the recipe to:

    # Load HTML page into browser
    import webbrowser
    webbrowser.open("python.html", 0, 1)

and enjoy the result in your favorite browser.
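
The position bookkeeping described above (a table of line-start offsets, plus the row and column pairs reported by the tokenizer) can be tried in isolation. The following is a minimal sketch that mirrors the loop in format; the function names line_offsets and absolute_pos are hypothetical helpers, not part of the recipe or of MoinMoin.

```python
def line_offsets(raw):
    """Return a list whose entry n is the index in raw where line n starts.

    Entry 0 is a placeholder so that the tokenizer's 1-based row numbers
    can index the list directly; the final entry is len(raw).
    """
    offsets = [0, 0]
    pos = 0
    while True:
        # find() returns -1 once no newline is left, so pos becomes 0
        pos = raw.find('\n', pos) + 1
        if not pos:
            break
        offsets.append(pos)
    offsets.append(len(raw))
    return offsets


def absolute_pos(offsets, row, col):
    """Translate a tokenizer (row, col) pair into an index into the source."""
    return offsets[row] + col


raw = "x = 1\nprint(x)"
offsets = line_offsets(raw)            # [0, 0, 6, 14]
start = absolute_pos(offsets, 2, 0)    # row 2, column 0
end = absolute_pos(offsets, 2, 5)      # row 2, column 5
word = raw[start:end]                  # 'print'
```

Any text lying between the end of one token and the absolute start of the next is whitespace (or a line continuation) that the tokenizer did not report, which is why the recipe writes self.raw[oldpos:newpos] before each token.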
15.2.4 See Also
Documentation for the webbrowser, token, tokenize, and keyword modules in the Library Reference; the colorizer is available at http://purl.net/wiki/python/MoinMoinColorizer, part of MoinMoin (http://moin.sourceforge.net).