10.7 Grabbing a Document from the Web
Credit: Gisle Aas
10.7.1 Problem
You need to grab a document from a URL on
the Web.
10.7.2 Solution
urllib.urlopen
returns a file-like object, and you can call read
on it:
from urllib import urlopen
doc = urlopen("http://www.python.org").read( )
print doc
10.7.3 Discussion
Once you obtain a file-like object from urlopen,
you can read it all at once into one big string by calling its
read method, as I do in this recipe.
Alternatively, you can read it as a list of lines by calling its
readlines method or, for special purposes, just
get one line at a time by calling its readline
method in a loop. In addition to these file-like operations, the
object that urlopen returns offers a few other
useful features. For example, the following snippet gives you the
headers of the document:
doc = urlopen("http://www.python.org")
print doc.info( )
The headers include, for example, the Content-Type header
(text/html in this case), which defines the MIME
type of the document. doc.info returns a
mimetools.Message instance, so you can access it
in various ways without printing it or otherwise transforming it into
a string. For example,
doc.info( ).getheader('Content-Type') returns the
'text/html' string. The
maintype attribute of the
mimetools.Message object is the
'text' string, subtype is the
'html' string, and type is also
the 'text/html' string. If you need to perform
sophisticated analysis and processing, all the tools you need are
right there. At the same time, if your needs are simpler, you can
meet them in very simple ways, as this recipe shows.
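As a rough illustration of the line-at-a-time reading and header lookup described above, here is a minimal sketch. The helper name fetch_summary is hypothetical, and the fallback import is an assumption for interpreters where urlopen lives in urllib.request rather than urllib. The sketch uses the mapping-style get on the headers object in place of getheader, on the assumption that get is available in both cases.

```python
# Sketch only: the import fallback and the helper name are assumptions,
# not part of the recipe itself.
try:
    from urllib import urlopen            # Python 2, as used in this recipe
except ImportError:
    from urllib.request import urlopen    # assumed location on Python 3

def fetch_summary(url):
    """Return (content_type, lines) for the document at url.

    Hypothetical helper, written only to illustrate the discussion.
    """
    doc = urlopen(url)
    try:
        # Mapping-style lookup on the headers object returned by info();
        # header names are matched case-insensitively.
        content_type = doc.info().get('Content-Type')
        lines = []
        while True:
            line = doc.readline()   # fetch one line at a time
            if not line:            # an empty result signals end of document
                break
            lines.append(line)
    finally:
        doc.close()
    return content_type, lines
```

A call such as fetch_summary("http://www.python.org") would then give you the Content-Type value together with the document's lines, without building one big string first.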
10.7.4 See Also
Documentation for the standard library modules
urllib and mimetools in the
Library Reference.