
URL Grabber

10 Sep 2014 · CPOL · 3 min read
The article describes an ASP.NET Web Pages project to grab URLs of pages with high matches for certain words.

GrabWords Image

Introduction

In this post, I describe a web application which is useful for collecting URLs of web pages with text matching a given set of words. The application can be run from here. (Note: The deployed application allows a maximum of five queries per user per day.)

The application is useful for finding pages that contain many links related to the topics being sought. For example, with the query "English podcast mp3", the application will find pages with many links to MP3 podcasts for learning English.

The application demonstrates some practical concepts, including web crawling, generic collections, and the use of the WebRequest (System.Net) and WebGrid (System.Web.Helpers) classes. It also serves as a good example of using Microsoft's WebMatrix (the IDE used for development) to build a Razor-based application in a single ".cshtml" file, rather than in separate files for the model, view, and controller.

How the Application Works

The application is a single-page application (SPA) developed with ASP.NET Razor. It consists of a single Razor view page that contains both the HTML and the processing logic.

C#
@{
  int StartTimer = 0;        // 1 => enable the client-side refresh timer
  string ProgressInfo = "";  // Status message shown on the page
  WebGrid grid = null;       // Grid used to render urlTable

  //  Server.ScriptTimeout = 30;  

  int MaxRecords = 20; // Default value for MaxRecords
  string  searchWords = "English podcast mp3"; // Default value for searchWords
   
  if (Request["hvar"] =="submit") 
  {   
     MaxRecords = int.Parse(Request["MaxRecords"]);
     if (MaxRecords > 60)
     {  ProgressInfo = "Maximum Records cannot exceed 60.";   
        goto finish;
     }

     StartTimer = 1;
     Grabber grabber = new Grabber();
  
     grabber.Session = Session;

     var urlTable = (HashSet<RowSchema>) Session["urlTable"];

     if (urlTable==null) 
     {  urlTable = new HashSet<RowSchema>();
        Session["urlTable"] = urlTable; 
     } 
          
     else if (Request["refresh"] =="0") 
     { urlTable.Clear(); }
       
     searchWords = Request["searchWords"];

     bool status = grabber.Search(searchWords, MaxRecords, urlTable);

     grid = new WebGrid(source:urlTable, rowsPerPage:100); 
     grid.SortDirection = SortDirection.Descending;
     grid.SortColumn  = "Count";

     int visitedCount = urlTable.Where(p => p.Visited).Count(); 
     ProgressInfo = "Visited count = " + visitedCount + "; Page will refresh after 15 seconds ...";  
     if (status)
     {  StartTimer=0; // Used to disable refresh timer on client side
        ProgressInfo = "Finished";
     }  
   }

   finish: ;  // Target of the goto above; the ";" is the required (empty) statement
}
HTML
<!-- ... head, scripts, and other markup omitted ... -->
<form action="" method="post" >
  
   <input name="hvar" type="hidden" value="submit" />
   <input id="refresh" name="refresh" type="hidden" value="0" />
   <label>Maximum Records</label><input type="text" name="MaxRecords" value="@MaxRecords" size="4" />  
   <label>Search Word(s)</label><input type="text" name="searchWords" value="@searchWords"  size="35" />  
   <input type="submit"  value="Search"  onclick="submitForm()" />  
   <input type="button" value="Stop"  onclick="DoStop()" />

</form>      
  
<div style="margin-left:10px" > 
   <p id="status" >@ProgressInfo</p> 
   <!-- render grid here -->
     @if (grid!=null) 
     { @grid.GetHtml() }
</div>

The preceding listings show the server-side code, which executes every time the page is requested, and the HTML for the form.

In the code, the line  Grabber grabber = new Grabber(); creates a "Grabber" object. The call grabber.Search(searchWords, MaxRecords, urlTable); crawls the web and fills a collection (urlTable parameter) with URLs that have high relevance to the words specified by the searchWords parameter.

The line grid = new WebGrid(source:urlTable, rowsPerPage:100); sets urlTable as the data source for a WebGrid object. In the HTML for the body of the page, the line { @grid.GetHtml() } renders the object's data as an HTML table.

urlTable is a HashSet<RowSchema> (a generic collection). Every time the page is refreshed, more URLs are added to urlTable by the call to grabber.Search(). To prevent the loss of urlTable between postbacks (page refreshes), it is saved in the page's Session object. The rows in this collection are of type RowSchema (a class defined in the Grabber.cs file in the App_Code folder). A HashSet<RowSchema> was chosen to avoid duplicate URLs; this requires overriding the GetHashCode() and Equals() methods of the element type (RowSchema in our case), as sketched below.
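For reference, a minimal sketch of what RowSchema might look like is shown below. The property names (URL, Title, Count, Visited) are taken from the listings in this article; the equality logic is only one reasonable way to satisfy the HashSet requirement and may differ from the actual class in Grabber.cs.

C#
// Sketch of RowSchema (assumed; the actual class in App_Code/Grabber.cs may differ).
// Equality is based on the URL so that HashSet<RowSchema> rejects duplicate URLs.
public class RowSchema
{
    public string URL { get; set; }
    public string Title { get; set; }
    public int Count { get; set; }      // Number of search-word matches found in the page
    public bool Visited { get; set; }   // True once the page has been fetched and scored

    public override bool Equals(object obj)
    {
        var other = obj as RowSchema;
        return other != null &&
               string.Equals(URL, other.URL, StringComparison.OrdinalIgnoreCase);
    }

    public override int GetHashCode()
    {
        return URL == null ? 0 : URL.ToLowerInvariant().GetHashCode();
    }
}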

The call to Search() runs for about 10 seconds on every page refresh. The refresh cycle is terminated when the call to Search() returns true, which happens once the number of rows reaches MaxRecords and all rows have been visited.

The Grabber Class

The following table lists some key methods defined in the Grabber class.

public bool Search(string searchWords, int MaxRecords, HashSet<RowSchema> urlTable)
The main (entry point) method of the Grabber class. It adds rows to urlTable (and removes or updates existing rows). The method returns true when the number of rows reaches MaxRecords and all rows have been visited.

string FetchURL(string url)
Fetches the HTML for a given URL. It uses the .NET WebRequest class.

string GetTitle(string htmlHead, string searchWords)
Returns the page's title. It returns an empty string if no title is found or if none of the words in searchWords appears in the title.

int CountWords(string htmlData, string searchWords)
Returns the number of matches in htmlData for words from searchWords.

HashSet<string> GrabURLs(string htmlData, string parentURL)
Returns a set of absolute URLs built from the URLs found in htmlData (relative URLs are resolved against parentURL).
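The FetchURL() method is not listed in the article. A minimal sketch of how it might be implemented with the .NET WebRequest class is shown below; the timeout, the user-agent string, and the exact error text are assumptions, but the "Error..." return value matches the check made in Search().

C#
// Hypothetical sketch of FetchURL() (requires: using System; using System.IO; using System.Net).
// Search() treats any return value starting with "Error" as a failed fetch.
string FetchURL(string url)
{
    try
    {
        var request = (HttpWebRequest)WebRequest.Create(url);
        request.Timeout = 10000;   // 10-second timeout (assumed value)
        request.UserAgent = "Mozilla/5.0 (compatible; GrabWords/1.0)";   // assumed

        using (var response = (HttpWebResponse)request.GetResponse())
        using (var reader = new StreamReader(response.GetResponseStream()))
        {
            return reader.ReadToEnd();
        }
    }
    catch (Exception ex)
    {
        return "Error: " + ex.Message;
    }
}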

The following listing shows the Search() method from the Grabber class.

C#
public bool Search(string searchWords, int MaxRecords, HashSet<RowSchema> urlTable)
{ // Uncomment next line for logging
  // logFile = new StreamWriter(Server.MapPath("") + @"\log.txt");
          
  DateTime t1 = DateTime.Now;
             
  while (true)
  { if ((DateTime.Now - t1).TotalSeconds > MaxServerTime) break;

    string SearchUrl = String.Format("http://www.bing.com/search?q={0}" , 
          HttpUtility.UrlEncode(searchWords)) + "&first=" + rand.Next(500);
    string parentURL = "";

    RowSchema row1 = null;
    if ((urlTable.Count > 5) && (rand.NextDouble() < 0.5))
    { var foundRows = urlTable.Where(p => p.Visited== false).ToList<RowSchema>(); 

      if ((foundRows.Count == 0) && (urlTable.Count == MaxRecords))
         return true; // All visited; use to disable refresh timer
                  
      if (foundRows.Count > 0)
      {  row1 = foundRows[0];
         SearchUrl = row1.URL;
         row1.Visited = true; // Optimistic that call to FetchURL() will be OK
         parentURL = SearchUrl; 
      }
    }
                 
    string searchData = FetchURL(SearchUrl);

    if (searchData.StartsWith("Error"))
    {  if (row1!= null)
       { urlTable.Remove(row1); } 
       continue;
    }
                   
    // Debugging: Response.Write(searchData); return;

    int i = searchData.IndexOf("<body", StringComparison.InvariantCultureIgnoreCase);
    if (i == -1)
    {  if (row1 != null)
       { urlTable.Remove(row1); }
       continue; 
    } 

    string htmlHead = searchData.Substring(0,i-1);
    string htmlBody = searchData.Substring(i).ToLower(); 

    if (row1 != null)
    {  string Title = GetTitle(htmlHead, searchWords);
       if (Title == "")
       {  urlTable.Remove(row1);
          continue;
       }

       int Count = CountWords(htmlBody,searchWords);
       if (Count == 0)
       {  urlTable.Remove(row1);
          continue;
       }

       row1.Title = Title;
       row1.Count = Count;  
    }

    // Extract absolute URLs from the fetched content
    HashSet<string> urlSet = GrabURLs(searchData, parentURL);

    foreach (string s in urlSet)
    { if (urlTable.Count == MaxRecords) break;

      row1 = new RowSchema();
      row1.URL = s;
      row1.Visited = false;

      // Note: HashSet collection guarantees uniqueness (no duplicate)
      // based on the override for Equals()
      // row1 won't be added if there is match in urlTable 
 
       urlTable.Add(row1);
     }
  } 
           
   if (logFile != null) logFile.Close(); 
   return false;
}  

The call FetchURL(SearchUrl) fetches content, where SearchUrl is either a Bing search query or an unvisited URL taken from urlTable. The returned content (searchData) is then processed to extract URLs with the call GrabURLs(searchData, parentURL), which returns a set of absolute URLs (urlSet). Finally, the URLs in urlSet are added to urlTable.
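GrabURLs() is likewise not listed. A sketch of one way to extract href values and resolve them to absolute URLs is shown below; the regular expression, the http/https filter, and the use of the Uri class are assumptions rather than the article's actual implementation.

C#
// Hypothetical sketch of GrabURLs() (requires: using System; using System.Collections.Generic;
// using System.Text.RegularExpressions). Relative links are resolved against parentURL.
HashSet<string> GrabURLs(string htmlData, string parentURL)
{
    var urlSet = new HashSet<string>();
    Uri baseUri = string.IsNullOrEmpty(parentURL) ? null : new Uri(parentURL);

    foreach (Match m in Regex.Matches(htmlData, @"href\s*=\s*[""']([^""'#]+)[""']",
                                      RegexOptions.IgnoreCase))
    {
        string link = m.Groups[1].Value;
        Uri absolute;

        if (Uri.TryCreate(link, UriKind.Absolute, out absolute) ||
            (baseUri != null && Uri.TryCreate(baseUri, link, out absolute)))
        {
            // Keep only http/https links
            if (absolute.Scheme == Uri.UriSchemeHttp || absolute.Scheme == Uri.UriSchemeHttps)
                urlSet.Add(absolute.AbsoluteUri);
        }
    }
    return urlSet;
}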

History

  • September 11, 2014: Version 1.0

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)


Written By

Nasir Darwish
Instructor / Trainer, KFUPM, Saudi Arabia

Nasir Darwish is an associate professor with the Department of Information and Computer Science, King Fahd University of Petroleum and Minerals (KFUPM), Saudi Arabia.


He has developed some practical tools, including COPS (Cooperative Problem Solving), PageGen (a tool for automatic generation of web pages), and an English/Arabic full-text search engine. The latter tools were used for the Global Arabic Encyclopedia and various other multimedia projects.


More recently, he developed TilerPro, a web-based tool for the construction of symmetric curves and their use in the design of aesthetic tiles. For more information, visit the Tiler website.

