
7.9 The robotparser Module

(New in 2.0) The robotparser module reads robots.txt files, which are used to implement the Robot Exclusion Protocol (http://info.webcrawler.com/mak/projects/robots/robots.html).
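
A robots.txt file is plain text: each record names a user agent (or "*" for all robots) and lists the URL prefixes that agent is asked not to visit. A minimal file might look like this (hypothetical contents, not python.org's actual file):

User-agent: *
Disallow: /private/
Disallow: /tmp/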

If you're implementing an HTTP robot that will visit arbitrary sites on the Net (not just your own sites), it's a good idea to use this module to check that you really are welcome. Example 7-21 demonstrates the robotparser module.

Example 7-21. Using the robotparser Module
File: robotparser-example-1.py

import robotparser

r = robotparser.RobotFileParser()
r.set_url("http://www.python.org/robots.txt")
r.read() # fetch and parse the rule file

if r.can_fetch("*", "/index.html"):
    print "may fetch the home page"

if r.can_fetch("*", "/tim_one/index.html"):
    print "may fetch the tim peters archive"

When this example was run, python.org's robots.txt allowed the first fetch but not the second, so the script printed:

may fetch the home page
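
A real robot would typically check can_fetch with its own user agent name before downloading each page. The following sketch combines robotparser with urllib; the robot name "mybot" and the fetched URL are assumptions for illustration, not part of the original example.

import robotparser
import urllib

ROBOT_NAME = "mybot" # hypothetical user agent name

r = robotparser.RobotFileParser()
r.set_url("http://www.python.org/robots.txt")
r.read() # fetch and parse the rule file

# only download the page if the rules allow it
if r.can_fetch(ROBOT_NAME, "/index.html"):
    text = urllib.urlopen("http://www.python.org/index.html").read()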