
URL Grabber

10 Sep 2014 · CPOL · 3 min read
The article describes an ASP.NET Web Pages project to grab URLs of pages with high matches for certain words.

GrabWords Image

Introduction

In this post, I describe a web application which is useful for collecting URLs of web pages with text matching a given set of words. The application can be run from here. (Note: The deployed application allows a maximum of five queries per user per day.)

The application is useful for finding pages that contain many links related to the topics being sought. For example, with the query "English podcast mp3", the application will find pages with many links to MP3 podcasts for learning English.

The application demonstrates some practical concepts, including web crawling, generic collections, and the use of the WebRequest (System.Net) and WebGrid (System.Web.Helpers) classes. It also serves as a good example of using Microsoft's WebMatrix (the IDE used for development) to build a Razor-based application in a single ".cshtml" file, rather than in separate files for the model, view, and controller.

How the Application Works

The application is a single-page application (SPA) developed with ASP.NET Razor. It consists of a single Razor view page that contains both the HTML and the processing logic.

C#
@{
  int StartTimer = 0;        // 1 => enable the client-side refresh timer
  string ProgressInfo = "";  // Status message shown on the page
  WebGrid grid = null;       // Grid used to render urlTable

  //  Server.ScriptTimeout = 30;  

  int MaxRecords = 20; // Default value for MaxRecords
  string  searchWords = "English podcast mp3"; // Default value for searchWords
   
  if (Request["hvar"] =="submit") 
  {   
     MaxRecords = int.Parse(Request["MaxRecords"]);
     if (MaxRecords > 60)
     {  ProgressInfo = "Maximum Records cannot exceed 60.";   
        goto finish;
     }

     StartTimer = 1;
     Grabber grabber = new Grabber();
  
     grabber.Session = Session;

     var urlTable = (HashSet<RowSchema>) Session["urlTable"];

     if (urlTable==null) 
     {  urlTable = new HashSet<RowSchema>();
        Session["urlTable"] = urlTable; 
     } 
          
     else if (Request["refresh"] =="0") 
     { urlTable.Clear(); }
       
     searchWords = Request["searchWords"];

     bool status = grabber.Search(searchWords, MaxRecords, urlTable);

     grid = new WebGrid(source:urlTable, rowsPerPage:100); 
     grid.SortDirection = SortDirection.Descending;
     grid.SortColumn  = "Count";

     int visitedCount = urlTable.Where(p => p.Visited).Count(); 
     ProgressInfo = "Visited count = " + visitedCount + "; Page will refresh after 15 seconds ...";  
     if (status)
     {  StartTimer=0; // Used to disable refresh timer on client side
        ProgressInfo = "Finished";
     }  
   }

   finish: ;  // Target of the goto above; the ";" is the required (empty) statement
}
HTML
<!-- ... head, scripts, and other markup omitted ... -->
<form action="" method="post" >
  
   <input name="hvar" type="hidden" value="submit" />
   <input id="refresh" name="refresh" type="hidden" value="0" />
   <label>Maximum Records</label><input type="text" name="MaxRecords" value="@MaxRecords" size="4" />  
   <label>Search Word(s)</label><input type="text" name="searchWords" value="@searchWords"  size="35" />  
   <input type="submit"  value="Search"  onclick="submitForm()" />  
   <input type="button" value="Stop"  onclick="DoStop()" />

</form>      
  
<div style="margin-left:10px" > 
   <p id="status" >@ProgressInfo</p> 
   <!-- render grid here -->
     @if (grid!=null) 
     { @grid.GetHtml() }
</div>

The preceding listings show the server-side code, which executes every time the page is requested, and the HTML for the form.

In the code, the line  Grabber grabber = new Grabber(); creates a "Grabber" object. The call grabber.Search(searchWords, MaxRecords, urlTable); crawls the web and fills a collection (urlTable parameter) with URLs that have high relevance to the words specified by the searchWords parameter.

The line grid = new WebGrid(source:urlTable, rowsPerPage:100); sets urlTable as the data source for a WebGrid object. In the HTML for the body of the page, the line { @grid.GetHtml() } renders the object's data as an HTML table.

urlTable is a HashSet<RowSchema> (a generic collection). Every time the page is refreshed, more URLs are added to urlTable by the call to grabber.Search(). To prevent the loss of urlTable between postbacks (page refreshes), it is saved in the page's Session object. The rows in this collection are of type RowSchema (a class defined in the Grabber.cs file in the App_Code folder). A HashSet<RowSchema> was chosen to avoid duplicate URLs; this requires overriding the GetHashCode() and Equals() methods of the element type (RowSchema in our case), as sketched below.
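For reference, a minimal sketch of what RowSchema might look like is shown below. The property names (URL, Title, Count, Visited) are taken from the listings in this article; the equality logic is only one reasonable way to satisfy the HashSet requirement and may differ from the actual class in Grabber.cs.

C#
// Sketch of RowSchema (assumed; the actual class in App_Code/Grabber.cs may differ).
// Equality is based on the URL so that HashSet<RowSchema> rejects duplicate URLs.
public class RowSchema
{
    public string URL { get; set; }
    public string Title { get; set; }
    public int Count { get; set; }      // Number of search-word matches found in the page
    public bool Visited { get; set; }   // True once the page has been fetched and scored

    public override bool Equals(object obj)
    {
        var other = obj as RowSchema;
        return other != null &&
               string.Equals(URL, other.URL, StringComparison.OrdinalIgnoreCase);
    }

    public override int GetHashCode()
    {
        return URL == null ? 0 : URL.ToLowerInvariant().GetHashCode();
    }
}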

The call to Search() runs for about 10 seconds on every page refresh. The refresh cycle is terminated when the call to Search() returns true, which happens once the number of rows reaches MaxRecords and all rows have been visited.

The Grabber Class

The following table lists some key methods defined in the Grabber class.

public bool Search(string searchWords, int MaxRecords, HashSet<RowSchema> urlTable)
The main (entry point) method of the Grabber class. It adds rows to urlTable (and removes or updates existing rows). The method returns true when the number of rows reaches MaxRecords and all rows have been visited.

string FetchURL(string url)
Fetches the HTML for a given URL. It uses the .NET WebRequest class.

string GetTitle(string htmlHead, string searchWords)
Returns the page's title. It returns an empty string if no title is found or if none of the words in searchWords appears in the title.

int CountWords(string htmlData, string searchWords)
Returns the number of matches in htmlData for words from searchWords.

HashSet<string> GrabURLs(string htmlData, string parentURL)
Returns a set of absolute URLs built from the URLs found in htmlData (relative URLs are resolved against parentURL).
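The FetchURL() method is not listed in the article. A minimal sketch of how it might be implemented with the .NET WebRequest class is shown below; the timeout, the user-agent string, and the exact error text are assumptions, but the "Error..." return value matches the check made in Search().

C#
// Hypothetical sketch of FetchURL() (requires: using System; using System.IO; using System.Net).
// Search() treats any return value starting with "Error" as a failed fetch.
string FetchURL(string url)
{
    try
    {
        var request = (HttpWebRequest)WebRequest.Create(url);
        request.Timeout = 10000;   // 10-second timeout (assumed value)
        request.UserAgent = "Mozilla/5.0 (compatible; GrabWords/1.0)";   // assumed

        using (var response = (HttpWebResponse)request.GetResponse())
        using (var reader = new StreamReader(response.GetResponseStream()))
        {
            return reader.ReadToEnd();
        }
    }
    catch (Exception ex)
    {
        return "Error: " + ex.Message;
    }
}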

The following listing shows the Search() method from the Grabber class.

C#
public bool Search(string searchWords, int MaxRecords, HashSet<RowSchema> urlTable)
{ // Uncomment next line for logging
  // logFile = new StreamWriter(Server.MapPath("") + @"\log.txt");
          
  DateTime t1 = DateTime.Now;
             
  while (true)
  { if ((DateTime.Now - t1).TotalSeconds > MaxServerTime) break;

    string SearchUrl = String.Format("http://www.bing.com/search?q={0}" , 
          HttpUtility.UrlEncode(searchWords)) + "&first=" + rand.Next(500);
    string parentURL = "";

    RowSchema row1 = null;
    if ((urlTable.Count > 5) && (rand.NextDouble() < 0.5))
    { var foundRows = urlTable.Where(p => p.Visited== false).ToList<RowSchema>(); 

      if ((foundRows.Count == 0) && (urlTable.Count == MaxRecords))
         return true; // All visited; use to disable refresh timer
                  
      if (foundRows.Count > 0)
      {  row1 = foundRows[0];
         SearchUrl = row1.URL;
         row1.Visited = true; // Optimistic that call to FetchURL() will be OK
         parentURL = SearchUrl; 
      }
    }
                 
    string searchData = FetchURL(SearchUrl);

    if (searchData.StartsWith("Error"))
    {  if (row1!= null)
       { urlTable.Remove(row1); } 
       continue;
    }
                   
    // Debugging: Response.Write(searchData); return;

    int i = searchData.IndexOf("<body", StringComparison.InvariantCultureIgnoreCase);
    if (i == -1)
    {  if (row1 != null)
       { urlTable.Remove(row1); }
       continue; 
    } 

    string htmlHead = searchData.Substring(0,i-1);
    string htmlBody = searchData.Substring(i).ToLower(); 

    if (row1 != null)
    {  string Title = GetTitle(htmlHead, searchWords);
       if (Title == "")
       {  urlTable.Remove(row1);
          continue;
       }

       int Count = CountWords(htmlBody,searchWords);
       if (Count == 0)
       {  urlTable.Remove(row1);
          continue;
       }

       row1.Title = Title;
       row1.Count = Count;  
    }

    // Extract absolute URLs from the fetched content
    HashSet<string> urlSet = GrabURLs(searchData, parentURL);

    foreach (string s in urlSet)
    { if (urlTable.Count == MaxRecords) break;

      row1 = new RowSchema();
      row1.URL = s;
      row1.Visited = false;

      // Note: HashSet collection guarantees uniqueness (no duplicate)
      // based on the override for Equals()
      // row1 won't be added if there is match in urlTable 
 
       urlTable.Add(row1);
     }
  } 
           
   if (logFile != null) logFile.Close(); 
   return false;
}  

The call FetchURL(SearchUrl) fetches content, where SearchUrl is either a Bing search query or an unvisited URL taken from urlTable. The returned content (searchData) is then processed to extract URLs with the call GrabURLs(searchData, parentURL), which returns a set of absolute URLs (urlSet). Finally, the URLs in urlSet are added to urlTable.
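GrabURLs() is likewise not listed. A sketch of one way to extract href values and resolve them to absolute URLs is shown below; the regular expression, the http/https filter, and the use of the Uri class are assumptions rather than the article's actual implementation.

C#
// Hypothetical sketch of GrabURLs() (requires: using System; using System.Collections.Generic;
// using System.Text.RegularExpressions). Relative links are resolved against parentURL.
HashSet<string> GrabURLs(string htmlData, string parentURL)
{
    var urlSet = new HashSet<string>();
    Uri baseUri = string.IsNullOrEmpty(parentURL) ? null : new Uri(parentURL);

    foreach (Match m in Regex.Matches(htmlData, @"href\s*=\s*[""']([^""'#]+)[""']",
                                      RegexOptions.IgnoreCase))
    {
        string link = m.Groups[1].Value;
        Uri absolute;

        if (Uri.TryCreate(link, UriKind.Absolute, out absolute) ||
            (baseUri != null && Uri.TryCreate(baseUri, link, out absolute)))
        {
            // Keep only http/https links
            if (absolute.Scheme == Uri.UriSchemeHttp || absolute.Scheme == Uri.UriSchemeHttps)
                urlSet.Add(absolute.AbsoluteUri);
        }
    }
    return urlSet;
}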

History

  • September 11, 2014: Version 1.0

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)


Written By

Nasir Darwish
Instructor / Trainer, KFUPM, Saudi Arabia

Nasir Darwish is an associate professor with the Department of Information and Computer Science, King Fahd University of Petroleum and Minerals (KFUPM), Saudi Arabia.


He has developed some practical tools, including COPS (Cooperative Problem Solving), PageGen (a tool for automatic generation of web pages), and an English/Arabic full-text search engine. The latter tools were used for the Global Arabic Encyclopedia and various other multimedia projects.


More recently, he developed TilerPro, a web-based tool for the construction of symmetric curves and their use in the design of aesthetic tiles. For more information, visit the Tiler website.

