How Do I Make a Web Crawler Desktop Application That Separates Internal and External Links?

Question

I am building a web crawler as a C# desktop application that should separate internal links from external ones. For example, <a href="about.html"> is internal and <a href="http://www.xyz.com"> (or the same address with https://) is external. I have tried many solutions. All of them find links well, but none of them separates a website's internal links from its external links while crawling.
I'm using the following code to separate internal and external links, but it doesn't work the way I need. I have been working on it for two days without any improvement. Can you check it and guide me?

C#

List<string> inter = new List<string>();
List<string> dates = new List<string>();
int count = 0;
List<string> i2 = new List<string>();
WebClient web = new WebClient();
string html = web.DownloadString(textBox1.Text);
string n3 = "", s4 = "";
// Find every <a ... href="..."> ... </a> element in the downloaded page.
MatchCollection m0 = Regex.Matches(html, @"<a[^>]*?href[\s]?=[\s\""\']+(.*?)[\""\']+.*?>([^<]+|.*?)?<\/a>", RegexOptions.Singleline | RegexOptions.IgnoreCase | RegexOptions.IgnorePatternWhitespace);

foreach (Match m in m0)
{
    // Group 1 is the href value of the current anchor.
    string city = m.Groups[1].Value;

    Match m2 = Regex.Match(city, "\\s*(?i)href\\s*=\\s*(\"([^\"]*\")|'[^']*'|([^'\">\\s]+))", RegexOptions.Singleline | RegexOptions.IgnoreCase | RegexOptions.IgnorePatternWhitespace);
    string city2 = m2.Groups[1].Value;
    dates.Add(city2);

    // s4 is the start URL; n2, n3 and n4 drop a fixed number of leading characters
    // (11 = "http://www.", 12 = "https://www.", 7 = "http://").
    s4 = textBox1.Text;
    string n2 = s4.Remove(0, 11);
    n3 = s4.Remove(0, 12);
    string n4 = s4.Remove(0, 7);

    // Look for an absolute URL inside the href value.
    Match m3 = Regex.Match(city, @"((www\.|(http|https|ftp|news|file)+\:\/\/)[_.a-z0-9-]+\.[a-z0-9\/_:@=.+?,##%&~-]*[^.|\'|\# |!|\(|?|,| |>|<|;|\)])", RegexOptions.Singleline | RegexOptions.IgnoreCase | RegexOptions.IgnorePatternWhitespace);
    string s5 = m3.Groups[1].Value;
    // Keep the link when the matched URL differs from every form of the start URL.
    if (m3.Groups[1].Value != s4 && m3.Groups[1].Value != n2 && m3.Groups[1].Value != n3 && m3.Groups[1].Value != n4)
    {
        i2.Add(city);
        inter.Add(s5);
        count = 1;
    }
}
if (count != 0)
{
    AllLinks.Items.Add(s4);
}
Hrefs.DataSource = i2;
//AllLinks.DataSource = dates;
inter.RemoveAll(string.IsNullOrWhiteSpace);
ExternalLinks.DataSource = inter;

Posted 31-Mar-15 19:38pm

M Adeel Khalid

Updated 3-Apr-15 0:24am

RajeeshMenoth

v4

Comments

Sinisa Hajnal 1-Apr-15 3:08am

How would you separate the links? When you think of the way you would do it manually then you can write an algorithm. Until then, you cannot do anything.

M Adeel Khalid 2-Apr-15 1:14am

Thank you for your reply.

Sinisa Hajnal 2-Apr-15 2:02am

Once you have all the links, why is it a problem to separate those that start with http? Or even those that contain the home domain path?
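
One way to act on the "home domain" idea in this comment is to resolve each href against the address of the crawled page and then compare hosts. The following is only a minimal sketch, not code from the question: `LinkClassifier` and `IsInternal` are hypothetical names, and treating "same host" as "internal" is an assumption (a subdomain such as blog.xyz.com would count as external here).

```csharp
using System;

public static class LinkClassifier
{
    // Returns true when href, resolved against pageUrl, is an http/https link
    // to the same host as the page. Assumption: "internal" means "same host".
    public static bool IsInternal(string pageUrl, string href)
    {
        var baseUri = new Uri(pageUrl, UriKind.Absolute);

        // Uri.TryCreate(Uri, string, out Uri) resolves relative links such as
        // "about.html" against the base and leaves absolute links unchanged.
        Uri resolved;
        if (!Uri.TryCreate(baseUri, href, out resolved))
        {
            return false; // malformed href: treated as not internal in this sketch
        }

        // Skip schemes a crawler cannot fetch as pages (mailto:, javascript:, ...).
        bool isWeb = resolved.Scheme == Uri.UriSchemeHttp
                  || resolved.Scheme == Uri.UriSchemeHttps;
        if (!isWeb)
        {
            return false;
        }

        return string.Equals(resolved.Host, baseUri.Host,
                             StringComparison.OrdinalIgnoreCase);
    }
}
```

With this shape, an absolute link that points back to the crawled site is still classified as internal, which a test on the "http" prefix alone cannot do.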

M Adeel Khalid 2-Apr-15 6:38am

The problem is that whenever I try to fetch external links, I also get internal ones. I'm stuck and don't know how to differentiate them, or on what basis. It has really become a headache.

Sinisa Hajnal 2-Apr-15 7:27am

But you just said it - you separate them by having http:// at the start. Why couldn't you use that?

M Adeel Khalid 3-Apr-15 1:16am

Yeah! The internet has many topics about this, but no solution meets my requirement. It has become a headache. I'm confused about the basis on which I should differentiate them.

Sinisa Hajnal 3-Apr-15 2:17am

Again (and I will not repeat it again): you separate them by checking whether they start with "http://". If yes, external; otherwise, internal. If you have trouble with this, use Improve question and rephrase it. I don't know how to say it any more clearly.
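
The rule stated here can be sketched in a few lines. This is an illustration only: `PrefixRule`, `IsExternal` and `Split` are hypothetical names, and accepting "https://" alongside "http://" is an assumption based on the example in the question. A rule based purely on the prefix labels an absolute link back to the crawled site as external, so it fits best when the site links to its own pages with relative hrefs.

```csharp
using System;
using System.Collections.Generic;

public static class PrefixRule
{
    // True when the href starts with "http://" or "https://" (case-insensitive).
    // Assumes href is not null.
    public static bool IsExternal(string href)
    {
        string trimmed = href.Trim();
        return trimmed.StartsWith("http://", StringComparison.OrdinalIgnoreCase)
            || trimmed.StartsWith("https://", StringComparison.OrdinalIgnoreCase);
    }

    // Sorts each href into one of two lists as it is read, so internal links
    // never end up in the external list. Empty or null hrefs are skipped.
    public static void Split(IEnumerable<string> hrefs,
                             List<string> internalLinks,
                             List<string> externalLinks)
    {
        foreach (string href in hrefs)
        {
            if (string.IsNullOrWhiteSpace(href))
            {
                continue;
            }

            if (IsExternal(href))
            {
                externalLinks.Add(href);
            }
            else
            {
                internalLinks.Add(href);
            }
        }
    }
}
```

In the question's form, the two lists could then be bound to the list boxes in place of the single `inter` list.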

M Adeel Khalid 3-Apr-15 2:52am

Don't be angry; thanks for replying.

Sinisa Hajnal 3-Apr-15 4:25am

I'm not, I'm trying to understand which part you don't understand. If there is anything that needs clearing up, let me know.

Sinisa Hajnal 3-Apr-15 6:01am

Please be so kind as to move this up into the question (use the Improve question link above) and remove the parts that don't deal with the links; they are just clutter here. I'll take a look on my break.

And please explain the s4, n3, etc. variables and the Remove calls.

M Adeel Khalid 3-Apr-15 6:05am

s4 and n3 are strings that store the URL after removing the leading "http://" or "http://www."
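
Fixed-length Remove calls only give the expected result when the typed address has exactly the prefix they were written for. A sketch of an alternative, assuming the text box holds an absolute URL: let System.Uri parse the address and strip a leading "www." from the host. `HostName` and `Normalize` are hypothetical names, not part of the original code.

```csharp
using System;

public static class HostName
{
    // Returns the host of an absolute URL without a leading "www.",
    // whatever the scheme is. Throws UriFormatException when the input
    // is not an absolute URL.
    public static string Normalize(string url)
    {
        string host = new Uri(url, UriKind.Absolute).Host;
        return host.StartsWith("www.", StringComparison.OrdinalIgnoreCase)
            ? host.Substring(4)
            : host;
    }
}
```

The same call could be applied both to the start URL and to each absolute link found on the page, so the two sides are compared in the same form.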

Sinisa Hajnal 3-Apr-15 6:40am

There is your problem. First go through your collection and remove everything that DOES NOT contain http; then you'll be left with only the external links. THEN you can remove whatever parts of the string you want. Or check before adding to the collection, so you never have internal links in the collection in the first place.

M Adeel Khalid 7-Apr-15 9:03am

Thank you so much for telling me that it is my problem; I thought it might be someone else's problem. Hats off to you.

Sinisa Hajnal 7-Apr-15 10:08am

I also offered you a solution in that same comment :) Don't get angry...you obviously didn't know that it was your problem.

M Adeel Khalid 8-Apr-15 0:26am

:D

This content, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)