[ Team LiB ] Previous Section Next Section

Hack 40 Scrape Customer Advice


Screen scraping can give you access to community features not yet implemented through the API—like customer buying advice.

Customer buying advice isn't available through Amazon's Web Services, so if you'd like to include this information on a remote site, you'll have to scrape it from Amazon's site. The first step in this hack is finding a single page that lists all of the customer advice. The following URL links directly to the advice page for a given ASIN:

http://amazon.com/o/tg/detail/-/insert ASIN/?vi=advice
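
As a minimal sketch, the URL can be assembled by a small helper. The name advice_url is hypothetical, and the check that an ASIN is ten letters or digits is an assumption added for illustration; the hack itself does not validate its input.

```perl
use strict;
use warnings;

# Hypothetical helper: build the advice-page URL for one ASIN.
# Assumption: an ASIN is ten letters or digits.
sub advice_url {
    my ($asin) = @_;
    die "ASIN should be 10 alphanumeric characters\n"
        unless defined $asin && $asin =~ /\A[A-Za-z0-9]{10}\z/;
    return "http://amazon.com/o/tg/detail/-/$asin/?vi=advice";
}

print advice_url('0596004478'), "\n";
```

Rejecting a malformed ASIN up front avoids requesting a page that cannot exist.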

40.1 The Code

This Perl script, get_advice.pl, splits the advice page into two variables based on the headings "in addition to" and "instead of." It then loops through those sections, using regular expressions to match the products' information. The script then formats and prints the information.
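
The splitting step can be tried in isolation before running the full script. The sketch below applies the same two patterns to a made-up HTML fragment; the fragment is a hypothetical stand-in, not Amazon's actual markup, and only shows how the non-greedy captures stop at the first terminator after each heading.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the fetched page; real markup is more involved.
my $content = <<'HTML';
<b>in addition to</b><table><tr><td>first list</td></tr>
<tr><td colspan=3><br></td></tr></table>
<b>recommendations instead of</b><table><tr><td>second list</td></tr></table>
HTML

# Each (.*?) is non-greedy, so it ends at the first terminator it meets.
my ($inAddition) = $content =~ m!in addition to(.*?)<tr><td colspan=3><br></td></tr>!is;
my ($instead)    = $content =~ m!recommendations instead of(.*?)</table>!is;

print "in addition: ", ($inAddition =~ /first list/  ? "found" : "missing"), "\n";
print "instead of:  ", ($instead    =~ /second list/ ? "found" : "missing"), "\n";
```

If either pattern fails to match, its variable is left undefined, so a real script may want to check for that before looping.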

#!/usr/bin/perl
# get_advice.pl
# A script to scrape Amazon to retrieve customer buying advice
# Usage: perl get_advice.pl <asin>

# Enable strictures and load LWP::Simple before any other code runs
use strict;
use LWP::Simple;

# Take the ASIN from the command line
my $asin = shift @ARGV or die "Usage: perl get_advice.pl <asin>\n";

# Assemble the URL
my $url = "http://amazon.com/o/tg/detail/-/" . $asin . 
          "/?vi=advice";

# Set up unescape-HTML rules
my %unescape = ('&quot;'=>'"', '&amp;'=>'&', '&nbsp;'=>' ');
my $unescape_re = join '|' => keys %unescape;

# Request the URL
my $content = get($url);
die "Could not retrieve $url\n" unless $content;

# Split the page into its "in addition to" and "instead of" sections
my($inAddition) = $content =~ m!in addition to(.*?)<tr><td colspan=3><br></td></tr>!mis;
my($instead) = $content =~ m!recommendations instead of(.*?)</table>!mis;

#Loop through the HTML looking for "in addition" advice
print "-- In Addition To --\n\n";
while ($inAddition =~ m!<td width=10>(.*?)</td>\n<td width=90%>.*?ASIN/(.*?)/.*?">(.*?)</a>.*?</td>.*?<td width=10% align=center>(.*?)</td>!mgis) {
    my($place,$thisAsin,$title,$number) = ($1||'',$2||'',$3||'',$4||'');
    $title =~ s/($unescape_re)/$unescape{$1}/migs; #unescape HTML 
    #Print the results
    print $place . " " . 
          $title . " (" . $thisAsin . ")\n(" . 
          "Recommendations: " . $number . ")" . 
          "\n\n";
}

#Loop through the HTML looking for "instead of" advice
print "-- Instead Of --\n\n";
while ($instead =~ m!<td width=10>(.*?)</td>\n<td width=90%>.*?ASIN/(.*?)/.*?">(.*?)</a>.*?</td>.*?<td width=10% align=center>(.*?)</td>!mgis) {
    my($place,$thisAsin,$title,$number) = ($1||'',$2||'',$3||'',$4||'');
    $title =~ s/($unescape_re)/$unescape{$1}/migs; #unescape HTML 
    #Print the results
    print $place . " " . 
          $title . " (" . $thisAsin . ")\n(" . 
          "Recommendations: " . $number . ")" . 
          "\n\n";
}
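
The two loops above differ only in the variable they scan, so the matching could be factored into one routine. The following is a sketch of that idea, not part of the original script: parse_advice is a hypothetical name, and the sample row is made up in the shape the pattern expects rather than copied from Amazon.

```perl
use strict;
use warnings;

my %unescape    = ('&quot;' => '"', '&amp;' => '&', '&nbsp;' => ' ');
my $unescape_re = join '|', map { quotemeta($_) } keys %unescape;

# Hypothetical helper: return one hash ref per advice row found in $html.
sub parse_advice {
    my ($html) = @_;
    my @rows;
    while ($html =~ m!<td width=10>(.*?)</td>\n<td width=90%>.*?ASIN/(.*?)/.*?">(.*?)</a>.*?</td>.*?<td width=10% align=center>(.*?)</td>!gis) {
        my ($place, $asin, $title, $number) = ($1, $2, $3, $4);
        $title =~ s/($unescape_re)/$unescape{$1}/g;   # unescape HTML
        push @rows, { place => $place, asin => $asin,
                      title => $title, number => $number };
    }
    return @rows;
}

# Made-up row in the shape the pattern expects; not real Amazon output.
my $sample = <<'HTML';
<tr><td width=10>1.</td>
<td width=90%><a href="/exec/obidos/ASIN/0596004478/ref=x">Sample &amp; Title</a></td>
<td width=10% align=center>12</td></tr>
HTML

for my $row (parse_advice($sample)) {
    print "$row->{place} $row->{title} ($row->{asin})\n",
          "(Recommendations: $row->{number})\n\n";
}
```

With a helper like this, the script would call parse_advice once on each section and print the results under its own heading.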

40.2 Running the Hack

Run this script from the command line, passing it any ASIN:

perl get_advice.pl ASIN

If the product has long lists of alternate recommendations, send the output to a text file. This example sends all alternate customer product recommendations for Google Hacks to a file called advice.txt:

perl get_advice.pl 0596004478 > advice.txt