[ Team LiB ] Previous Section Next Section

16.2 The Screen-Scraper Application

The screen scraper application has only one job, to populate the table in the database, from which the ASP.NET application will display the relative rankings.

It works by reading through a set of XML files, which contain ISBNs of books that compete on various subjects (ASP.NET, C#, VB.NET). For example, the file CSharpISBN.xml is shown in Example 16-1. These XML files are maintained by hand; if you find a new C# title you want to track, you just make a new entry for that ISBN.

Example 16-1. CSharpISBN.xml
 <isbns>
  <isbn>193183654X</isbn> 
  <isbn>0130461334</isbn> 
  <isbn>1893115593</isbn> 
  <isbn>0130622214</isbn> 
  <isbn>1861007043</isbn> 
  <isbn>1861004982</isbn> 
  <isbn>0672320711</isbn> 
  <isbn>0596001819</isbn> 
  <isbn>0735612897</isbn> 
  <isbn>0735612900</isbn> 
  <isbn>0596003099</isbn> 
  <isbn>0596003765</isbn> 
  <isbn>0072133295</isbn> 
  <isbn>0672322358</isbn> 
  <isbn>0072193794</isbn> 
  <isbn>067232122X</isbn> 
  <isbn>1588801926</isbn> 
  <isbn>0672321521</isbn> 
  <isbn>0735615683</isbn> 
  <isbn>0201729555</isbn> 
  </isbns>

As each ISBN is read, the relevant values (title, publisher, rank) are found on the Amazon web site and stored in the database table. A simple listbox is then updated, as shown in Figure 16-3. As each book is recorded, its summary information is added to the listbox. Once all the books are recorded, the system becomes dormant while a timer ticks down the remaining time between sessions. You can force a new session by clicking the Now button. This UI was intentionally created to be as simple as possible.

Figure 16-3. Running the screen-scraper
figs/pcsharp3_1603.gif

The complete listing for the screen-scraper version of the program is shown in Example 16-2. Detailed analysis follows.

Example 16-2. SalesRankDB.sln
using System;
using System.Drawing;
using System.Collections;
using System.ComponentModel;
using System.Windows.Forms;
using System.Data;
using System.Web.Services;
using System.Text;
using System.Text.RegularExpressions;
using System.Net;

namespace SalesRankDB
{
   public class Form1 : System.Windows.Forms.Form
   {
      private System.Windows.Forms.Button btnStart;

      private string connectionString;
      private System.Data.SqlClient.SqlConnection connection;
      private System.Data.SqlClient.SqlCommand command;
      private System.Windows.Forms.Timer timer1;
      private System.ComponentModel.IContainer components;
      int timeRemaining;
      private System.Windows.Forms.Button btnNow;
      private System.Windows.Forms.TextBox txtClock;
      private System.Windows.Forms.ListBox lbOutput;
      const int WaitTime =  900; // 15 min.

      public Form1( )
      {
         InitializeComponent( );
      }

      /// <summary>
      /// Clean up any resources being used.
      /// </summary>
      protected override void Dispose( bool disposing )
      {
         if( disposing )
         {
            if (components != null) 
            {
               components.Dispose( );
            }
         }
         base.Dispose( disposing );
      }

      #region Windows Form Designer generated code
      /// <summary>
      /// Required method for Designer support - do not modify
      /// the contents of this method with the code editor.
      /// </summary>
      private void InitializeComponent( )
      {
         this.components = new System.ComponentModel.Container( );
         this.btnStart = new System.Windows.Forms.Button( );
         this.timer1 = new System.Windows.Forms.Timer(this.components);
         this.lbOutput = new System.Windows.Forms.ListBox( );
         this.btnNow = new System.Windows.Forms.Button( );
         this.txtClock = new System.Windows.Forms.TextBox( );
         this.SuspendLayout( );
         // 
         // btnStart
         // 
         this.btnStart.Location = new System.Drawing.Point(232, 336);
         this.btnStart.Name = "btnStart";
         this.btnStart.TabIndex = 0;
         this.btnStart.Text = "Start";
         this.btnStart.Click += new 
                            System.EventHandler(this.btnStart_Click);
         // 
         // timer1
         // 
         this.timer1.Tick += new System.EventHandler(this.timer1_Tick);
         // 
         // lbOutput
         // 
         this.lbOutput.Location = new System.Drawing.Point(8, 8);
         this.lbOutput.Name = "lbOutput";
         this.lbOutput.Size = new System.Drawing.Size(704, 303);
         this.lbOutput.TabIndex = 1;
         // 
         // btnNow
         // 
         this.btnNow.Location = new System.Drawing.Point(432, 336);
         this.btnNow.Name = "btnNow";
         this.btnNow.Size = new System.Drawing.Size(48, 23);
         this.btnNow.TabIndex = 7;
         this.btnNow.Text = "Now";
         this.btnNow.Click += new System.EventHandler(this.btnNow_Click);
         // 
         // txtClock
         // 
         this.txtClock.BackColor = 
                        System.Drawing.SystemColors.InactiveCaptionText;
         this.txtClock.Enabled = false;
         this.txtClock.Location = new System.Drawing.Point(320, 336);
         this.txtClock.Name = "txtClock";
         this.txtClock.Size = new System.Drawing.Size(88, 20);
         this.txtClock.TabIndex = 6;
         this.txtClock.Text = "";
         // 
         // Form1
         // 
         this.AutoScaleBaseSize = new System.Drawing.Size(5, 13);
         this.ClientSize = new System.Drawing.Size(728, 398);
         this.Controls.AddRange(new System.Windows.Forms.Control[] {
                                                        this.btnNow,
                                                        this.txtClock,
                                                        this.lbOutput,
                                                        this.btnStart});
         this.Name = "Form1";
         this.Text = "--";
         this.Load += new System.EventHandler(this.Form1_Load);
         this.ResumeLayout(false);

      }
      #endregion

      /// <summary>
      /// The main entry point for the application.
      /// </summary>
      [STAThread]
      static void Main( ) 
      {
         Application.Run(new Form1( ));
      }

      private void Form1_Load(object sender, System.EventArgs e)
      {
         // connection string to connect to the Sales Rank Database
         connectionString = 
   "server=localhost;Trusted_Connection=true;" +    "database=AmazonSalesRanks";

         // Create connection object, initialize with 
         // connection string. 
         connection = 
            new System.Data.SqlClient.SqlConnection(connectionString);
       
         // Create a SqlCommand object and assign the connection
         command = 
            new System.Data.SqlClient.SqlCommand( );

         command.Connection = connection;
         timer1.Interval = 1000; // one second 
         timer1.Enabled = false;
         timeRemaining = 1;  // when you first start up, get the info.
         UpdateButton( );
      
      }

      // set the button based on whether the timer is enabled
      private void UpdateButton( )
      {
         btnStart.Text = timer1.Enabled ? "Stop" : "Start";
      }

      // toggle the button and update its text
      private void btnStart_Click(object sender, System.EventArgs e)
      {
         timer1.Enabled = timer1.Enabled ? false : true;
         UpdateButton( );
      }

      // given an isbn, get the information by scraping Amazon.com
      private void GetInfoFromISBN(string isbn, string technology)
      {

         if (isbn.Length != 10)
            return ;


         string strHTML;  // hold the html from Amazon.com

         // used to interact with internet resources
         WebClient webClient = new WebClient( );

         // assemble the url
         string strURL = "http://www.amazon.com/exec/obidos/ASIN/" + 
            isbn + "/";

         // set the encoding
         UTF8Encoding utf8encoder = new UTF8Encoding( );
         try
         {
            strHTML = 
               utf8encoder.GetString(webClient.DownloadData(strURL));
         }
         catch
         {
            lbOutput.Items.Add("Unable to access record for " + isbn);
            lbOutput.SelectedIndex = lbOutput.Items.Count -1;
            return;
         }
         int rankBeginOffset;
         int rankEndOffset;


         string title = "";
         string author = "";
         string publisher = "";
         string pubDate = "";
         int rank = 9999999;

         // search for and parse the ranking 
            rankBeginOffset = strHTML.IndexOf("Rank: </b>") + 11;
         rankEndOffset = strHTML.IndexOf("<",rankBeginOffset);
         if (rankEndOffset > rankBeginOffset)
         {
            string strRank = strHTML.Substring(
               rankBeginOffset,rankEndOffset - rankBeginOffset);
            strRank = strRank.Replace(",","");
            rank = Convert.ToInt32(strRank);
         }

         // search for and parse the book title
         int titleBeginOffset;
         int titleEndOffset;

         titleBeginOffset = strHTML.IndexOf("<b class=sans>") + 14;
         titleEndOffset = strHTML.IndexOf("<",titleBeginOffset);
         if (titleBeginOffset < titleEndOffset)
         {
             title = fixQuote(strHTML.Substring(titleBeginOffset, 
                                      titleEndOffset-titleBeginOffset));
         }

         // search for and parse the first author
         int authorBeginOffset;
         int authorEndOffset;
         int fieldAuthorBeginOffset;

         fieldAuthorBeginOffset = strHTML.IndexOf("field-author=");
         string fullAuthor = 
               strHTML.Substring(fieldAuthorBeginOffset,200);
         authorBeginOffset = fullAuthor.IndexOf("\">") + 2;
         authorEndOffset = fullAuthor.IndexOf("</a>");
         if (authorEndOffset > authorBeginOffset)
         {
             author = fixQuote(fullAuthor.Substring(authorBeginOffset,
                               authorEndOffset-authorBeginOffset));
         }
         
         // search for and parse the publisher
         int publisherBeginOffset;
         int publisherEndOffset;

         publisherBeginOffset = strHTML.IndexOf("<b>Publisher:</b>") + 18;
         publisherEndOffset = strHTML.IndexOf("; ;",titleBeginOffset);
         if (publisherEndOffset > publisherBeginOffset)
         {
             publisher = fixQuote(strHTML.Substring(publisherBeginOffset,
                          publisherEndOffset-publisherBeginOffset));


         // search for and parse the publication (release) date
         int pubDateBeginOffset;
         int pubDateEndOffset;

            string fullPublisher = 
               strHTML.Substring(publisherBeginOffset,100);
      

            pubDateBeginOffset = fullPublisher.IndexOf("(") + 1;
            pubDateEndOffset = fullPublisher.IndexOf(")");
            if ( pubDateEndOffset > pubDateBeginOffset )
            {
               pubDate = fullPublisher.Substring(pubDateBeginOffset,
                         pubDateEndOffset-pubDateBeginOffset);
            }
         }

         // assemble the results for display in the list box
         string results = title + " by " + 
            author + ": " + publisher + ", " + 
            pubDate + ". Rank: " + rank;

         // display in list box and set selected item to last
         // item entered
         lbOutput.Items.Add(results);
         lbOutput.SelectedIndex = lbOutput.Items.Count -1;

         // update the db
         string commandString = @"Update BookInfo set isbn = '" +
            isbn + "', title = '" + title + "', publisher = '" +
            publisher + "', pubDate = '" + pubDate + "', rank = " +
            rank + ", link = '" + strURL + "', lastUpdate = '" + 
            System.DateTime.Now +  "', technology = '" +
            technology +  "', author = '" +
            author + "' where isbn = '" +
            isbn + "'";

         command.CommandText = commandString;
         try 
         {
            connection.Open( );

            // if you get back no rows affected, this is a new record
            int numRowsAffected = command.ExecuteNonQuery( );

            if (numRowsAffected == 0)
            {
               // insert the record into the db
               commandString = @"Insert into BookInfo values ('" +
                  isbn + "', '" + title + "', '" + publisher + "', '" +
                  pubDate + "', " + rank + ", '" + strURL + "', '" + 
                  System.DateTime.Now + 
                  "', '" + technology + "', '" + author + "')";

               command.CommandText = commandString;
               command.ExecuteNonQuery( );
            }
         }
         catch
         {
            lbOutput.Items.Add("Unable to update database!");
            lbOutput.SelectedIndex = lbOutput.Items.Count -1;
         }
         finally
         {
            connection.Close( );      // clean up
         }
         Application.DoEvents( );
      }

      private string fixQuote(string s)
      {
         return s.Replace("'","''");
         
      }

      // each clock tick, check how much time remains
      // if it is time to do work, read the xml files
      private void timer1_Tick(object sender, System.EventArgs e)
      {
         if (timer1.Enabled)
            txtClock.Text = (--timeRemaining).ToString( ) + " seconds";
         else
            txtClock.Text = "Stopped";

         // hi ho, hi ho, it's off to work we go...
         if ( timeRemaining < 1 )
         {
            timeRemaining = WaitTime;  // reset the clock

            // create data set based on xml file
            DataSet BookData = new DataSet( );
            BookData.ReadXml("aspnetIsbn.xml");
            
            // iterate through, calling GetInfoFromISBN for 
            // each isbn found in file
            foreach(DataRow Book in BookData.Tables[0].Rows)
            {
               string isbn = Book[0].ToString( );
               GetInfoFromISBN(isbn,"ASPNET");
            }

            BookData = new DataSet( );
            BookData.ReadXml("csharpIsbn.xml");
            foreach(DataRow Book in BookData.Tables[0].Rows)
            {
               string isbn = Book[0].ToString( );
               GetInfoFromISBN(isbn,"CSHARP");
            }

            BookData = new DataSet( );
            BookData.ReadXml("VBnetIsbn.xml");
            foreach(DataRow Book in BookData.Tables[0].Rows)
            {
               string isbn = Book[0].ToString( );
               GetInfoFromISBN(isbn,"VBNET");
            }
         }
      }

      private void btnNow_Click(object sender, System.EventArgs e)
      {
         timeRemaining = 2;
      }

   }
}

16.2.1 Controlling the Application

Remember to add a reference to System.Web.Services in the Solution Explorer.

The application begins by loading the form. In the Form_Load event handler, the connection string is created and a connection to the database is instantiated.

private void Form1_Load(object sender, System.EventArgs e)
{
   connectionString = 
      "server=localhost;Trusted_Connection=true;database=AmazonSalesRanks";

   connection = 
      new System.Data.SqlClient.SqlConnection(connectionString);

   command = 
      new System.Data.SqlClient.SqlCommand( );

   command.Connection = connection;

Note that this application uses a trusted connection, and thus you will need to have the appropriate rights for the ASPNET user on your database. If you administer the SQL Server Database, be sure to add the MachineName\ASPNET user to the database, and give that user db_owner role membership.

That done, the timer interval is set to 1 second, and the timer is disabled (and the button is set to say Start).

timer1.Interval = 1000; 
timer1.Enabled = false;
timeRemaining = 2;  
UpdateButton( );

Clicking Start toggles the timer and causes the button to change its text to Stop.

Note that clicking Stop does not stop the application from gathering data from the web service, but does stop the clock so that it won't run again until you click Start.

private void btnStart_Click(object sender, System.EventArgs e)
{
   timer1.Enabled = timer1.Enabled ? false : true;
   UpdateButton( );
}

private void UpdateButton( )
{
   btnStart.Text = timer1.Enabled ? "Stop" : "Start";
}

There is a second button, Now, which brings the timer down from whatever its current count is to 2 seconds, in effect forcing an update:

private void btnNow_Click(object sender, System.EventArgs e)
{
   timeRemaining = 2;
}

Each time the timer does tick, you check to see if the time has run out (making it time to update the books):

private void timer1_Tick(object sender, System.EventArgs e)
{
   if (timer1.Enabled)
      txtClock.Text = (--timeRemaining).ToString( ) + " seconds";
   else
      txtClock.Text = "Stopped";

   if ( timeRemaining < 1 )
   {

Once timeRemaining is exhausted, you reset it to the constant WaitTime, e.g., 900 seconds (or 15 minutes), and you are ready to go. You create a data set and populate that data set by calling ReadXml, passing in the name of the appropriate XML file):

timeRemaining = WaitTime;
DataSet BookData = new DataSet( );
BookData.ReadXml("aspnetIsbn.xml");

You can now iterate through the rows in the data set, extracting each ISBN and then calling the private helper method GetInfoFromISBN, which will be responsible for scraping Amazon for that ISBN:

foreach(DataRow Book in BookData.Tables[0].Rows)
{
   string isbn = Book[0].ToString( );
   GetInfoFromISBN(isbn,"ASPNET");
}

You'll do this once each for the ASP.NET, C#, and VB.NET ISBN files. Each time, you also pass in the "technology" as a string; this will be stored with the records for easy extraction later.

16.2.2 Scraping HTML

The heart of this program is in the GetInfoFromISBN method itself. The first task is to ensure that you've been given a 10-digit ISBN:

if (isbn.Length != 10)
   return ;

All ISBN values are 10 digits, so if you get a length other than 10, you'll skip that entry. Actually getting the stream of HTML is pretty straightforward (see Chapter 21). You create an instance of WebClient, and you create a string using the Amazon URL with the ISBN appended:

string strHTML;
WebClient webClient = new WebClient( );

string strURL = "http://www.amazon.com/exec/obidos/ASIN/" + isbn + "/";

Set the UTF encoding and call GetString to get back the HTML as a string:

UTF8Encoding utf8encoder = new UTF8Encoding( );
try
{
   strHTML = utf8encoder.GetString(webClient.DownloadData(strURL));
}
catch
{
   lbOutput.Items.Add("Unable to access record for " + isbn);
   lbOutput.SelectedIndex = lbOutput.Items.Count -1;
   return;
}

Assuming you did not get an exception, you now have the HTML for that page in the strHTML variable. You can now parse that string for the various elements you require.

For example, to find the rank, you search for the offset into the string of the substring "Rank: </b>" and then you add 11 characters.

rankBeginOffset = strHTML.IndexOf("Rank: </b>") + 11;

Why Screen Scraping Is Evil

Screen-scraping works great as long as Amazon always lists the rank in exactly this way, but if it doesn't, the parsing will fail. Each time Amazon changes its pages, you must upgrade this program.

You'll solve this problem later in this chapter by replacing the screen-scraper with a web services consumer application.

To find the end of the rank, you look for the less-than (<) character:

rankEndOffset = strHTML.IndexOf("<",rankBeginOffset);

You double-check that the end offset is a value greater than the beginning offset, and you are ready to try to extract the substring that represents the relative rank of the book. Once you have it as a string, you remove any commas and convert it to an integer:

if (rankEndOffset > rankBeginOffset)
{
   string strRank = strHTML.Substring(
      rankBeginOffset,rankEndOffset - rankBeginOffset);
   strRank = strRank.Replace(",","");
   rank = Convert.ToInt32(strRank);
}

You then repeat this logic for each of the other fields you are trying to parse (e.g., title, author, etc.).

When you've gathered all the data, you are ready to add it to the listbox, as a string. You then set the selected index to the last entry in the listbox so that the new entry remains visible:

string results = title + " by " + 
   author + ": " + publisher + ", " + 
   pubDate + ". Rank: " + rank;

lbOutput.Items.Add(results);

lbOutput.SelectedIndex = lbOutput.Items.Count -1;

That done, you are ready to update your database, with an update command:

string commandString = @"Update BookInfo set isbn = '" +
   isbn + "', title = '" + title + "', publisher = '" +
   publisher + "', pubDate = '" + pubDate + "', rank = " +
   rank + ", link = '" + strURL + "', lastUpdate = '" + 
   System.DateTime.Now +  "', technology = '" +
   technology +  "', author = '" +
   author + "' where isbn = '" +
   isbn + "'";

command.CommandText = commandString;

You place the update command in a try block to catch any SQL exceptions.

try 
{
   connection.Open( );
   int numRowsAffected = command.ExecuteNonQuery( );

If the record does not exist for this ISBN, the number of rows affected will be zero, and you will have to insert the record rather than update it.

if (numRowsAffected == 0)
{

   commandString = @"Insert into BookInfo values ('" +
      isbn + "', '" + title + "', '" + publisher + "', '" +
      pubDate + "', " + rank + ", '" + strURL + "', '" + 
      System.DateTime.Now + 
      "', '" + technology + "', '" + author + "')";

   command.CommandText = commandString;
   command.ExecuteNonQuery( );
}

If you do catch an exception, you'll report that in the listbox and continue with the next ISBN. In any case, once you've updated the database, you close the connection:

catch
{
   lbOutput.Items.Add("Unable to update database!");
   lbOutput.SelectedIndex = lbOutput.Items.Count -1;
}
finally
{
   connection.Close( );
}

Before looping, you call DoEvents to give the application an opportunity to update the user interface:

Application.DoEvents( );

When this application has run its way through the three XML files, the data for all of the books has been scraped, and the table in SQL Server is fully populated, as shown in Figure 16-4.

Figure 16-4. Fully populated DB table
figs/pcsharp3_1604.gif

While this is quite useful, it is difficult to read and difficult to share with your friends, so you'll write a quick ASP.NET application to read this data and display it in a set of tables.

    [ Team LiB ] Previous Section Next Section