Hack 44 Get Purchase Circle Products with Screen Scraping

Purchase Circles provide a unique look at sales patterns. You can access them programmatically only with screen scraping.

Amazon's purchase circles are specialized bestseller lists broken down by geography or organization. If you visit the Friends & Favorites page, choose "Purchase Circles" from the drop-down list, and type the name of your city, chances are you'll find what's uniquely popular among your fellow residents. Amazon also lists what's popular at universities and large corporations. If everyone at Microsoft is reading about a certain technology, you may find it in the next version of Windows!

44.1 Finding Purchase Circle IDs

In fact, you can link directly to the Microsoft Corporation purchase circle:

http://www.amazon.com/exec/obidos/tg/cm/browse-communities/-/211569/

The six-digit code at the end of the URL is the Purchase Circle ID for Microsoft. Every purchase circle has a unique ID. You can find IDs by noting them from URLs as you browse circles. The purchase circles home page (http://www.amazon.com/exec/obidos/subst/community/community.html) is a good place to start.

Once you know an ID, you can link to the circle directly using the URL format shown above. You can also write scripts that fetch the page and extract its list of items.
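As a small illustration of the ID-in-the-URL convention described above, the following sketch pulls the ID out of a purchase circle URL and builds a direct link from an ID. The sub names circle_id_from_url and circle_url are made up for this example, and the pattern assumes the URL layout of the Microsoft link shown earlier.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: extract the Purchase Circle ID from a URL of the
# form .../browse-communities/-/<ID>/. Returns undef if no ID is found.
sub circle_id_from_url {
    my ($url) = @_;
    return $url =~ m{/browse-communities/-/(\d+)} ? $1 : undef;
}

# Hypothetical helper: build a direct link to a circle from its ID.
sub circle_url {
    my ($id) = @_;
    return "http://www.amazon.com/exec/obidos/tg/cm/browse-communities/-/$id/";
}

my $id = circle_id_from_url(
    "http://www.amazon.com/exec/obidos/tg/cm/browse-communities/-/211569/");
print "$id\n";                # the Microsoft ID from the URL above
print circle_url($id), "\n";  # rebuilds the same link
```

Keeping the two directions in separate subs makes it easy to collect IDs from URLs you note while browsing and turn them back into links later.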

44.2 The Code

This script takes a Purchase Circle ID and returns the books listed. Create a file called get_circle.pl and add the following code:

#!/usr/bin/perl
# get_circle.pl
# A script to scrape Amazon to retrieve purchase circle products
# Usage: perl get_circle.pl <circleID>

use strict;
use warnings;
use LWP::Simple;

# Take the Purchase Circle ID from the command line
my $circleID = shift @ARGV or die "Usage: perl get_circle.pl <circleID>\n";

# Assemble the URL; the trailing /t/ requests the text-only page
my $url = "http://amazon.com/o/tg/cm/browse-communities/-/" .
          $circleID . "/t/";

#Request the URL
my $content = get($url);
die "Could not retrieve $url" unless $content;

my $circle = (join '', $content);

while ($circle =~ m!<title>(.*?)</title>!mgis) {
    print $1 . "\n\n";
}

# Each match captures the ASIN, title, and author of one listed item
while ($circle =~ m!<td.*?<b><a.*?-/(.*?)[?/].*?>(.*?)</a></b>.*?by\s*(.*?)<br>.*?</td>!mgis) {
    my($asin,$title,$author) = ($1||'',$2||'',$3||'');
    #Print the results
    print $title . "\n" .
          "by " . $author . "\n" .  
          "ASIN: " . $asin .
          "\n\n";
}

One thing to note about this code is that it passes the /t/ URL argument to return a text-only version of the purchase circle page. Text-only pages have less HTML, which means that fewer bytes are flying around and it's generally easier to scrape for information.
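Because scraping depends entirely on the page's markup, it can help to separate parsing from fetching so the pattern can be checked offline. The following is a minimal sketch of that idea, not part of the original script: parse_circle is a hypothetical sub that applies the script's pattern (written on one line) to a string, and the sample HTML is invented for illustration. Real purchase circle pages may be structured differently.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: apply the pattern from get_circle.pl to a string
# of HTML and return a list of hash references instead of printing.
sub parse_circle {
    my ($html) = @_;
    my @items;
    while ($html =~ m!<td.*?<b><a.*?-/(.*?)[?/].*?>(.*?)</a></b>.*?by\s*(.*?)<br>.*?</td>!gis) {
        push @items, { asin => $1, title => $2, author => $3 };
    }
    return @items;
}

# Invented markup for illustration only; real pages may differ.
my $sample = <<'HTML';
<td><b><a href="/exec/obidos/tg/detail/-/0000000000/t/">Example Title</a></b><br>
by Example Author<br></td>
HTML

# Print each item in the same layout get_circle.pl uses
for my $item (parse_circle($sample)) {
    print "$item->{title}\nby $item->{author}\nASIN: $item->{asin}\n\n";
}
```

If the script stops returning results, feeding a saved copy of the page to a sub like this is a quick way to see whether the markup has changed.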

44.3 Running the Hack

You can run this hack from the command line, passing a Purchase Circle ID as the only argument:

perl get_circle.pl <circleID>

For example, using the Microsoft ID from earlier in this hack:

perl get_circle.pl 211569

44.4 Hacking the Hack

This script returns popular books for a given circle, but there's no reason you can't also get lists of the most popular music or movies for a circle. Add a catalog after the Purchase Circle ID to find what you're looking for. Here are the possible catalogs:

music
dvd
video
toy
ce (electronics)

So, for example, to link directly to DVDs that are popular in Sebastopol, CA, find the Purchase Circle ID, and add /dvd/ to the URL:

http://amazon.com/exec/obidos/tg/cm/browse-communities/-/216435/dvd/

If you'd like to keep it text-only as in the script, the /t/ follows the catalog:

http://amazon.com/exec/obidos/tg/cm/browse-communities/-/216435/dvd/t/
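The URL rules above (ID first, then an optional catalog, then an optional /t/) can be sketched as a small sub. The name circle_catalog_url and its argument order are assumptions made for this example; the catalog names are the ones listed above.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Catalogs listed above; with no catalog, the page shows books.
my %CATALOGS = map { $_ => 1 } qw(music dvd video toy ce);

# Hypothetical helper: build a purchase circle URL from an ID, an optional
# catalog, and a flag for the text-only (/t/) version.
sub circle_catalog_url {
    my ($id, $catalog, $text_only) = @_;
    die "Unknown catalog: $catalog\n"
        if defined $catalog && !$CATALOGS{$catalog};
    my $url = "http://amazon.com/exec/obidos/tg/cm/browse-communities/-/$id/";
    $url .= "$catalog/" if defined $catalog;
    $url .= "t/"        if $text_only;   # /t/ always follows the catalog
    return $url;
}

# Text-only DVD list for the Sebastopol ID used above
print circle_catalog_url(216435, 'dvd', 1), "\n";
```

To adapt get_circle.pl, you could accept the catalog as a second command-line argument and pass it to a sub like this one when assembling the URL.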
