
Hack 34 Scrape Product Reviews


Amazon has made some reviews available through their Web Services API, but most are available only at the Amazon.com web site, requiring a little screen scraping to grab.

Here's an even more powerful way to integrate Amazon reviews with your web site. Unlike linking to reviews [Hack #28] or monitoring reviews for changes [Hack #31], this puts the entire text of Amazon reviews on your web site.

The easiest and most reliable way to access customer reviews programmatically is through the Web Services API. Unfortunately, the API gives only a small window onto the larger number of reviews available. An API query for the book The Cluetrain Manifesto, for example, includes three user reviews. If you visit the review page [Hack #28] for that book, though, you'll find 128 reviews. To reach all of the reviews available on Amazon.com and use them on your own web site, you'll need to scrape the review pages with a script.

34.1 The Code

This Perl script, get_reviews.pl, builds a URL to the reviews page for a given ASIN, uses a regular expression to find each review, and breaks each review into its pieces: rating, title, date, reviewer, and the text of the review.

#!/usr/bin/perl
# get_reviews.pl
#
# A script to scrape Amazon, retrieve reviews, and print them to standard output
# Usage: perl get_reviews.pl <asin>
use strict;
use warnings;
use LWP::Simple;

# Take the asin from the command-line
my $asin = shift @ARGV or die "Usage: perl get_reviews.pl <asin>\n";

# Assemble the URL from the passed asin.
my $url = "http://amazon.com/o/tg/detail/-/$asin/?vi=customer-reviews";

# Set up unescape-HTML rules for the three entities handled here.
# A small lookup table avoids loading HTML::Entities.
my %unescape = ('&quot;'=>'"', '&amp;'=>'&', '&nbsp;'=>' ');
my $unescape_re = join '|' => keys %unescape;

# Request the URL.
my $content = get($url);
die "Could not retrieve $url" unless $content;

#Remove everything before the reviews
$content =~ s!.*?Number of Reviews:!!ms;

# Loop through the HTML looking for matches
while ($content =~ m!<img.*?stars-(\d)-0\.gif.*?>.*?<b>(.*?)</b>, (.*?)\n.*?Reviewer:\n<b>\n(.*?)</b>.*?</table>\n(.*?)<br>\n<br>!mgis) {

    # Copy the captures right away; the substitutions below reset $1..$5
    my($rating,$title,$date,$reviewer,$review) =
        ($1||'',$2||'',$3||'',$4||'',$5||'');
    $reviewer =~ s!<.+?>!!g;   # drop all HTML tags
    $reviewer =~ s!\(.+?\)!!g;   # remove anything in parentheses
    $reviewer =~ s!\n!!g;      # remove newlines
    $review =~ s!<.+?>!!g;     # drop all HTML tags
    $review =~ s/($unescape_re)/$unescape{$1}/migs; # unescape.

    # Print the results
    print "$title\n" . "$date\n" . "by $reviewer\n" .
          "$rating stars.\n\n" . "$review\n\n";

}
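
Because the script depends entirely on the shape of Amazon's HTML, it helps to be able to exercise the pattern without hitting the network. The following is a minimal sketch of that idea, not part of the original hack: the HTML fragment is invented for illustration and merely shaped to fit the pattern, parse_reviews is a hypothetical helper name, and the trailing-whitespace trim on the reviewer name is an extra step the original script does not perform.

```perl
#!/usr/bin/perl
# test_pattern.pl (hypothetical name)
# Run the review-matching pattern against a hard-coded fragment.
use strict;
use warnings;

# Same three-entity lookup table as get_reviews.pl.
my %unescape    = ('&quot;' => '"', '&amp;' => '&', '&nbsp;' => ' ');
my $unescape_re = join '|', map { quotemeta } keys %unescape;

# Invented sample markup. It is NOT copied from Amazon; it only follows
# the line layout that the pattern expects.
my $sample = join "\n",
    '<img src="stars-4-0.gif" border=0> <b>Worth reading</b>, January 1, 2003',
    '<table><tr><td>Reviewer:',
    '<b>',
    '<a href="/profile">A reader</a> (Boston, MA)</b></td></tr></table>',
    'Smart &amp; funny, if a bit &quot;dated&quot;.<br>',
    '<br>',
    '';

# Hypothetical helper: takes a string of HTML and returns a list of
# hash references, one per review matched.
sub parse_reviews {
    my ($html) = @_;
    my @found;
    while ($html =~ m!<img.*?stars-(\d)-0\.gif.*?>.*?<b>(.*?)</b>, (.*?)\n.*?Reviewer:\n<b>\n(.*?)</b>.*?</table>\n(.*?)<br>\n<br>!gis) {
        # Copy the captures before any other regex runs
        my ($rating, $title, $date, $reviewer, $review) = ($1, $2, $3, $4, $5);

        $reviewer =~ s!<.+?>!!g;     # drop all HTML tags
        $reviewer =~ s!\(.+?\)!!g;   # remove anything in parentheses
        $reviewer =~ s/\s+$//;       # extra step: trim trailing whitespace
        $review   =~ s!<.+?>!!g;     # drop all HTML tags
        $review   =~ s/($unescape_re)/$unescape{$1}/g;   # unescape entities

        push @found, {
            rating   => $rating,
            title    => $title,
            date     => $date,
            reviewer => $reviewer,
            review   => $review,
        };
    }
    return @found;
}

# Print each parsed review in the same layout as get_reviews.pl
for my $r (parse_reviews($sample)) {
    print "$r->{title}\n$r->{date}\nby $r->{reviewer}\n",
          "$r->{rating} stars.\n\n$r->{review}\n\n";
}
```

If Amazon changes its page layout and the main script stops printing reviews, saving a copy of the live page and feeding it to a helper like this one is a quick way to see which part of the pattern no longer matches.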

34.2 Running the Hack

This script can be run from a command line and requires an ASIN. The reviews are too long to read as they scroll past on your screen, so it helps to send the information to a text file (in this case, reviews.txt) like so:

perl get_reviews.pl asin > reviews.txt