What is it?
DomainWalker is an object that discovers domains reachable from a URL. Unlike traditional crawlers and site downloaders that identify all reachable URLs on a page, DomainWalker explores a subset of the World Wide Web's topology by targeting root URLs only. DomainWalker guarantees that its walk will complete in a finite amount of time by ensuring that duplicate domains are never crawled.
DomainWalker is an example of a WebResourceProvider and uses my StringParser utility class, both of which are published elsewhere at this site. As an aside, the demo application shows how to spin off a worker thread from a GUI and have it update the GUI in a safe manner. This is done by having the app respond to events fired by the worker thread.
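The termination guarantee mentioned earlier can be illustrated with a small sketch. This is not DomainWalker's actual code: the `WalkSketch` class, the in-memory `Links` table (a stand-in for fetching a page and extracting the domains it links to) and the exact meaning of the depth limit are all assumptions made for illustration. The idea it shows is that a walk which records every domain it has seen, and refuses to revisit one, must stop once the reachable domains are exhausted.

```csharp
using System;
using System.Collections.Generic;

public static class WalkSketch
{
    // Hypothetical link graph: domain -> domains it links to.
    // A real walker would download the page and parse its links instead.
    static readonly Dictionary<string, string[]> Links = new Dictionary<string, string[]>
    {
        { "a.example", new[] { "b.example", "c.example" } },
        { "b.example", new[] { "a.example", "c.example" } },
        { "c.example", new[] { "a.example", "d.example" } },
        { "d.example", new string[0] }
    };

    // Returns the domains in the order they were first visited.
    public static List<string> Walk(string startDomain, int maxDepth)
    {
        var visited = new HashSet<string>(StringComparer.OrdinalIgnoreCase);
        var order = new List<string>();
        Visit(startDomain, 0, maxDepth, visited, order);
        return order;
    }

    static void Visit(string domain, int depth, int maxDepth,
                      HashSet<string> visited, List<string> order)
    {
        // Stop at the depth limit, or if this domain was already crawled.
        // HashSet.Add returns false for duplicates, so cycles cannot loop forever.
        if (depth > maxDepth || !visited.Add(domain))
            return;

        order.Add(domain);

        string[] children;
        if (!Links.TryGetValue(domain, out children))
            return;

        foreach (string child in children)
            Visit(child, depth + 1, maxDepth, visited, order);
    }
}
```

Even though the sample graph contains cycles (a.example and b.example link to each other), each domain is visited at most once, so the number of calls that do any work is bounded by the number of distinct domains.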
How do I use it?
You use DomainWalker by initializing it, calling its Walk() method, and getting its results.
- Initialize the DomainWalker instance
    DomainWalker dw = new DomainWalker();
    dw.StartUrl = "www.ravib.com";
    dw.MaxDepth = 3;
- Do the walk
    dw.Walk();
- Get the results
    Hashtable domainTree = dw.DomainTree;
    printHashTableAsTree(domainTree);
Getting DomainWalker's results
You retrieve DomainWalker's results by accessing its DomainTree property at the end of the walk, by responding to the OnNotifyUrlBeingTraversed event, or both.
DomainTree property
DomainWalker's result is a tree of discovered domains, available from the object's DomainTree property. The tree is a nested Hashtable in which each collection of child nodes is stored in its own Hashtable.
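A nested Hashtable of this kind can be walked recursively. The sketch below is one possible way to do that, not the article's own printing routine. It assumes that each key is a domain name and each value is the Hashtable holding that domain's children; the article does not spell out the key and value types, so check them against the source before relying on this. The `DomainTreePrinter` class name is hypothetical.

```csharp
using System;
using System.Collections;
using System.Text;

public static class DomainTreePrinter
{
    // Renders the tree as text, one domain per line, indented by depth.
    public static string Format(Hashtable tree)
    {
        var sb = new StringBuilder();
        Append(tree, 0, sb);
        return sb.ToString();
    }

    static void Append(Hashtable node, int depth, StringBuilder sb)
    {
        foreach (DictionaryEntry entry in node)
        {
            // Two spaces of indentation per level.
            sb.Append(' ', depth * 2).Append(entry.Key).Append('\n');

            // Assumption: the value is the child collection (or null/other for a leaf).
            var children = entry.Value as Hashtable;
            if (children != null)
                Append(children, depth + 1, sb);
        }
    }
}
```

Calling `Console.Write(DomainTreePrinter.Format(dw.DomainTree))` would then print the tree. Note that Hashtable does not preserve insertion order, so siblings may appear in any order.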
OnNotifyUrlBeingTraversed event
It may be more convenient to get at DomainWalker's results by being notified every time a new URL is discovered. This is done by subscribing to the object's OnNotifyUrlBeingTraversed event and is the approach taken by the demo app. Domain discovery notifications are received by registering an OnNotifyUrlBeingTraversed delegate, which has the following signature:
public delegate void OnNotifyUrlBeingTraversed
    (string strParentUrl,
     string strUrlBeingTraversed,
     int nCurrentDepth,
     int nDomains,
     TimeSpan tsElapsed);
The demo app responds to the OnNotifyUrlBeingTraversed event by adding strUrlBeingTraversed to a list box. The string is indented by a number of spaces proportional to nCurrentDepth. Other useful information, such as the elapsed walk time (tsElapsed), is displayed in a label control.
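The handler logic described above might look roughly like the following. This is a sketch, not the demo app's code: the delegate declaration is copied from the article so the example compiles on its own, while the `TraversalLog` class, its `Lines` list (standing in for the list box), its `Status` string (standing in for the label) and the choice of two spaces per depth level are all assumptions.

```csharp
using System;
using System.Collections.Generic;

// Delegate signature as given in the article.
public delegate void OnNotifyUrlBeingTraversed(
    string strParentUrl,
    string strUrlBeingTraversed,
    int nCurrentDepth,
    int nDomains,
    TimeSpan tsElapsed);

// Hypothetical stand-in for the demo form.
public class TraversalLog
{
    public readonly List<string> Lines = new List<string>(); // plays the role of the list box
    public string Status = "";                               // plays the role of the label

    // Matches the delegate signature, so it can be registered as a handler.
    public void HandleUrlBeingTraversed(
        string strParentUrl,
        string strUrlBeingTraversed,
        int nCurrentDepth,
        int nDomains,
        TimeSpan tsElapsed)
    {
        // Indent in proportion to the depth (two spaces per level is an assumption).
        int indent = Math.Max(0, nCurrentDepth) * 2;
        Lines.Add(new string(' ', indent) + strUrlBeingTraversed);

        Status = nDomains + " domains, elapsed " + tsElapsed;
    }
}
```

A handler is created with `OnNotifyUrlBeingTraversed handler = log.HandleUrlBeingTraversed;`. How it is attached to a DomainWalker instance is not shown here; see the demo source for that. In a Windows Forms app the walk runs on a worker thread, so a real handler should marshal to the GUI thread (for example with Control.Invoke or Control.BeginInvoke) before touching any control.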
OnNotifyWalkCompleted event
DomainWalker also fires the OnNotifyWalkCompleted event at the end of a walk. The OnNotifyWalkCompleted delegate has the following signature:
public delegate void OnNotifyWalkCompleted
    (int nDomains,
     TimeSpan tsElapsed);
Revision History
- 22 Jan 2006
  - Corrected DomainWalkerForm delegates to ensure controls are accessed from the GUI thread. (Thanks, Birgir K!)
  - Added missing .resx file to project.
  - Upgraded project to VS2005.
- 15 Jan 2006
  - Initial version.
Ravi Bhavnani is an ardent fan of Microsoft technologies who loves building Windows apps, especially PIMs, system utilities, and things that go bump on the Internet. During his career, Ravi has developed expert systems, desktop imaging apps, marketing automation software, EDA tools, a platform to help people find, analyze and understand information, trading software for institutional investors and advanced data visualization solutions. He currently works for a company that provides enterprise workforce management solutions to large clients.
His interests include the .NET framework, reasoning systems, financial analysis and algorithmic trading, NLP, HCI and UI design. Ravi holds a BS in Physics and Math and an MS in Computer Science and was a Microsoft MVP (C++ and C# in 2006 and 2007). He is also the co-inventor of 3 patents on software security and generating data visualization dashboards. His claim to fame is that he crafted CodeProject's "joke" forum post icon.
Ravi's biggest fear is that one day he might actually get a life, although the chances of that happening seem extremely remote.