Domain Walker

22 Jan 2006, CPOL
An object that allows you to explore the topology of the internet.

What is it?

[Figure: DomainWalker in action]

DomainWalker is an object that discovers the domains reachable from a URL.  Unlike traditional crawlers and site downloaders, which identify every reachable URL on a page, DomainWalker explores a subset of the World Wide Web's topology by targeting root URLs only.  It guarantees that a walk completes in a finite amount of time by never crawling the same domain twice.
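
The termination argument can be illustrated with a small sketch. This is not DomainWalker's actual code: the class name WalkSketch, the in-memory links dictionary (a stand-in for fetching a domain's root page), and the reading of the depth limit as "the start domain is depth 0" are all assumptions made for illustration.

```csharp
using System;
using System.Collections;
using System.Collections.Generic;

// Hypothetical sketch of a duplicate-free, depth-limited domain walk.
public static class WalkSketch
{
    // links: assumed stand-in for "domains linked from this domain's root page".
    public static Hashtable Walk(string start, int maxDepth,
                                 IDictionary<string, string[]> links)
    {
        HashSet<string> seen = new HashSet<string>(StringComparer.OrdinalIgnoreCase);
        seen.Add(start);

        Hashtable root = new Hashtable();
        root[start] = Visit(start, 0, maxDepth, links, seen);
        return root;
    }

    private static Hashtable Visit(string domain, int depth, int maxDepth,
                                   IDictionary<string, string[]> links,
                                   HashSet<string> seen)
    {
        Hashtable children = new Hashtable();
        string[] found;

        // Stop at the depth limit or when nothing is known about this domain.
        if (depth >= maxDepth || !links.TryGetValue(domain, out found))
            return children;

        foreach (string child in found)
        {
            // Add() returns false for a domain that was already seen, so each
            // domain is expanded at most once and mutual links cannot loop.
            if (!seen.Add(child))
                continue;

            children[child] = Visit(child, depth + 1, maxDepth, links, seen);
        }
        return children;
    }
}
```

Because every domain enters the seen set exactly once, the number of expansions is bounded by the number of distinct domains, regardless of how the sites link to one another. Where a domain ends up in the tree depends on traversal order, which in this sketch is depth-first.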

DomainWalker is an example of a WebResourceProvider and uses my StringParser utility class, both of which are published elsewhere at this site.  As an aside, the demo application shows how to spin off a worker thread from a GUI and have it update the GUI in a safe manner.  This is done by having the app respond to events fired by the worker thread.

How do I use it?

You use DomainWalker by initializing it, calling its Walk() method, and getting its results.

  1. Initialize the DomainWalker instance
    // Initialize the DomainWalker
    DomainWalker dw = new DomainWalker();
    dw.StartUrl = "www.ravib.com";
    dw.MaxDepth = 3;
  2. Do the walk
    // Do walk
    dw.Walk();
  3. Get the results
    // Get results
    Hashtable domainTree = dw.DomainTree;
    printHashTableAsTree (domainTree);   // left as an exercise to the reader

Getting DomainWalker's results

You retrieve DomainWalker's results by accessing its DomainTree property at the end of the walk and/or responding to the OnNotifyUrlBeingTraversed event.

DomainTree property

DomainWalker's result is a tree of discovered domains obtained from the object's DomainTree property. The tree is actually a nested Hashtable, where each collection of child nodes is stored in a new Hashtable.
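
A nested Hashtable of this kind can be rendered as an indented tree with a short recursive helper. The sketch below assumes that each key is a domain name and each value is either null or a Hashtable of that domain's children; the article does not spell out the exact layout, and DomainTreePrinter is a hypothetical name.

```csharp
using System;
using System.Collections;
using System.Text;

// Hypothetical helper that formats a nested Hashtable as an indented tree.
public static class DomainTreePrinter
{
    // Assumption: key = domain name, value = Hashtable of child domains (or null).
    public static string Format(Hashtable tree)
    {
        StringBuilder sb = new StringBuilder();
        Append(tree, 0, sb);
        return sb.ToString();
    }

    private static void Append(Hashtable node, int depth, StringBuilder sb)
    {
        if (node == null)
            return;

        // Hashtable has no defined enumeration order, so sort the keys
        // to get stable, readable output.
        ArrayList keys = new ArrayList(node.Keys);
        keys.Sort();

        foreach (object key in keys)
        {
            sb.Append(' ', depth * 2);   // two spaces per level
            sb.Append(key);
            sb.Append('\n');
            Append(node[key] as Hashtable, depth + 1, sb);
        }
    }
}
```

Calling Console.Write(DomainTreePrinter.Format(dw.DomainTree)) would then print one domain per line, with children indented under their parent.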

[Figure: Domain tree retrieved by DomainWalker]

OnNotifyUrlBeingTraversed event

It may be more convenient to get at DomainWalker's results by being notified every time a new URL is discovered. This is done by subscribing to the object's OnNotifyUrlBeingTraversed event and is the approach taken by the demo app. Domain discovery notifications are received by registering an OnNotifyUrlBeingTraversed delegate, which has the following signature:

/// <summary>
/// Notifies an observer when a url is about to be traversed.
/// </summary>
/// <param name="strParentUrl">The parent url (may be null).</param>
/// <param name="strUrlBeingTraversed">The url being traversed.</param>
/// <param name="nCurrentDepth">Current traversal depth.</param>
/// <param name="nDomains">Number of domains discovered so far.</param>
/// <param name="tsElapsed">Time elapsed since start of crawl.</param>
public delegate void OnNotifyUrlBeingTraversed
  (string strParentUrl,
   string strUrlBeingTraversed,
   int nCurrentDepth,
   int nDomains,
   TimeSpan tsElapsed);

The demo app responds to the OnNotifyUrlBeingTraversed event by adding strUrlBeingTraversed to a list box. The string is indented by a number of spaces proportional to nCurrentDepth. Other useful information, such as the elapsed walk time (tsElapsed), is displayed in a label control.
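
A handler along these lines might look like the sketch below. It is not the demo app's code: TraversalLog, the two-space indent, and the status text are assumptions, and the delegate is redeclared only so the sketch compiles on its own. The list box is replaced by a plain list so the formatting logic can be seen in isolation.

```csharp
using System;
using System.Collections.Generic;

// Redeclared with the article's signature so the sketch is self-contained;
// in a real project the delegate comes from the DomainWalker source.
public delegate void OnNotifyUrlBeingTraversed(
    string strParentUrl,
    string strUrlBeingTraversed,
    int nCurrentDepth,
    int nDomains,
    TimeSpan tsElapsed);

// Hypothetical observer that collects the lines a list box would show.
public class TraversalLog
{
    private const int SpacesPerLevel = 2;   // assumption: any fixed width works

    public readonly List<string> Lines = new List<string>();
    public string Status = "";

    // Matches the OnNotifyUrlBeingTraversed signature.
    public void HandleUrlBeingTraversed(
        string strParentUrl,
        string strUrlBeingTraversed,
        int nCurrentDepth,
        int nDomains,
        TimeSpan tsElapsed)
    {
        // Indent in proportion to the traversal depth.
        string indent = new string(' ', Math.Max(nCurrentDepth, 0) * SpacesPerLevel);
        Lines.Add(indent + strUrlBeingTraversed);

        // Text a label control could display.
        Status = nDomains + " domains, " + (int)tsElapsed.TotalSeconds + " s elapsed";
    }
}
```

In a Windows Forms app the event arrives on the worker thread, so a handler that touches controls would first need to marshal the call to the GUI thread, for example with Control.Invoke or Control.BeginInvoke.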

OnNotifyWalkCompleted event

DomainWalker also fires the OnNotifyWalkCompleted event at the end of a walk. The OnNotifyWalkCompleted delegate has the following signature:

/// <summary>
/// Notifies an observer when the walk has completed.
/// </summary>
/// <param name="nDomains">Number of domains discovered.</param>
/// <param name="tsElapsed">Time taken to complete crawl.</param>
public delegate void OnNotifyWalkCompleted
  (int nDomains,
   TimeSpan tsElapsed);
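
One possible use of this event, for a caller that runs the walk on another thread and simply wants to wait for it, is sketched below. WalkCompletionWaiter is a hypothetical class and the delegate is redeclared only to keep the sketch self-contained; how the handler is attached to a DomainWalker instance is not shown because the article does not name the event member.

```csharp
using System;
using System.Threading;

// Redeclared with the article's signature so the sketch is self-contained.
public delegate void OnNotifyWalkCompleted(int nDomains, TimeSpan tsElapsed);

// Hypothetical helper: records the final figures and signals completion.
public class WalkCompletionWaiter
{
    private readonly ManualResetEvent done = new ManualResetEvent(false);
    private int domains;
    private TimeSpan elapsed;

    public int Domains { get { return domains; } }
    public TimeSpan Elapsed { get { return elapsed; } }

    // Matches the OnNotifyWalkCompleted signature.
    public void HandleWalkCompleted(int nDomains, TimeSpan tsElapsed)
    {
        domains = nDomains;
        elapsed = tsElapsed;
        done.Set();   // release any thread blocked in Wait()
    }

    // Returns true if the walk completed within the timeout.
    public bool Wait(TimeSpan timeout)
    {
        return done.WaitOne(timeout);
    }
}
```

Blocking like this suits a console program or a test; a GUI should not block its own thread and would instead update its controls from the handler.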

Revision History

  • 22 Jan 2006
    • Corrected DomainWalkerForm delegates to ensure controls are accessed from the GUI thread. (Thanks, Birgir K!)
    • Added missing .resx file to project.
    • Upgraded project to VS2005.
  • 15 Jan 2006
    Initial version.

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)


Written By
Technical Lead
Canada
Ravi Bhavnani is an ardent fan of Microsoft technologies who loves building Windows apps, especially PIMs, system utilities, and things that go bump on the Internet. During his career, Ravi has developed expert systems, desktop imaging apps, marketing automation software, EDA tools, a platform to help people find, analyze and understand information, trading software for institutional investors and advanced data visualization solutions. He currently works for a company that provides enterprise workforce management solutions to large clients.

His interests include the .NET framework, reasoning systems, financial analysis and algorithmic trading, NLP, HCI and UI design. Ravi holds a BS in Physics and Math and an MS in Computer Science and was a Microsoft MVP (C++ and C# in 2006 and 2007). He is also the co-inventor of 3 patents on software security and generating data visualization dashboards. His claim to fame is that he crafted CodeProject's "joke" forum post icon.

Ravi's biggest fear is that one day he might actually get a life, although the chances of that happening seem extremely remote.

Comments and Discussions

 
Re: Still the Same Problems
Ravi Bhavnani, 9-Apr-06 5:01
Thanks for your comments!

The missing DomainWalkerForm.resx has been added to the source .zip. I didn't tweak the article updated date since this is a fix to the package of files and not a code or content change.

/ravi


