How to grab links from an HTML page

Question

Hi Everyone,
Greetings for the day!

I'm trying to grab all the links in an HTML page. I followed http://www.dotnetperls.com/scraping-html and many other tutorials but couldn't find a suitable solution. The problem is that I can't find a "LinkItem" object in my IDE. I also tried using the System.Net and System.Diagnostics namespaces.

Please suggest how I can solve this problem; I can't tell whether it is a minor or a major one.

Any help is appreciated.

Thanks,
Sunny K

Posted 29-May-12 22:58pm

Sunny_Kumar_

Comments

Sunasara Imdadhusen 30-May-12 5:16am

Please clarify your problem. Is this a Windows or a web application?

Sunny_Kumar_ 30-May-12 5:30am

I've tried it with both but couldn't find the LinkItem object in either of them.

Zoltán Zörgő 30-May-12 7:43am

I know you already have some solutions, but please clarify: what do you mean by "link"? Only the href attribute of A tags? All URLs that the user can navigate to from that page? All URLs referenced by the page?

Sunny_Kumar_ 30-May-12 7:56am

What I meant by "links" is all URLs that a page references.
Thanks :)

Zoltán Zörgő 30-May-12 9:23am

Then you have accepted a solution that does not satisfy your requirements. That regular expression finds only some absolute URLs: it misses https and relative paths, for example. Look here for a wide list of expressions you could use: http://regexlib.com/Search.aspx?k=URL

Sunny_Kumar_ 30-May-12 10:28am

Thanks for such a nice link. I'll look through it for other possible solutions :)

2 solutions

Solution 1

Use document.getElementsByTagName('a') in JavaScript.

Posted 29-May-12 23:23pm

solutions@ashish

Comments

Sunny_Kumar_ 30-May-12 5:31am

thanks Ashish :) I really appreciate your help, but this is not what I am looking for. I want to do this with C#.

Member 15627495 2-Sep-22 23:35pm

Hello!

Look at the HtmlDocument class: among its members there is a collection for retrieving the links from a page.
See the Microsoft documentation.

This content, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)

**tanweer** · Accepted Answer · 2012-05-29T23:24:00

Hello. Add a TextBox, a Label and a Button to your page, then try this in the Button click handler:

C#

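using System;
using System.Collections.Generic;
using System.Net;
using System.Text;
using System.Text.RegularExpressions;
using System.Web;

protected void Button1_Click(object sender, EventArgs e)
{
    string url = TextBox1.Text;

    // Download the page source; WebClient is disposable, so wrap it in using.
    string html;
    using (WebClient wc = new WebClient())
    {
        html = wc.DownloadString(url);
    }

    List<string> links = CollectLinks(html);

    // Build a two-column HTML table: row number and URL.
    StringBuilder sb = new StringBuilder();
    int c = 1;
    sb.Append("<table> <tr> <td style=\"padding-right:40px;\"> # </td> <td> URL </td> </tr>");
    foreach (string link in links)
    {
        // Encode each URL so characters such as & render correctly in the label.
        sb.Append("<tr> <td> " + c + " </td> <td> " + HttpUtility.HtmlEncode(link) + " </td> </tr>");
        c++;
    }
    sb.Append("</table>");

    lblResult.Text = sb.ToString();
}

// Returns every match in strSource that starts with "http://" or "www.".
public List<string> CollectLinks(string strSource)
{
    List<string> links = new List<string>();

    // The hyphen sits last in each character class so it is a literal, not a range.
    Regex r1 = new Regex("((http://|www\\.)([A-Z0-9.:-]{1,})\\.[0-9A-Z?;~:&+%#=_./-]{2,})", RegexOptions.Compiled | RegexOptions.IgnoreCase);
    foreach (Match m in r1.Matches(strSource))
    {
        links.Add(m.Value);
    }

    return links;
}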

All the links will then appear in the label.
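
If "links" means every URL the page references, including https and relative paths, one possible variation is to read the href and src attribute values and resolve each one against the page address. The following is only a rough sketch under some assumptions: the names LinkExtractor and ExtractLinks are made up for illustration, example.com is a placeholder, and only quoted attribute values are handled. A regular expression is an approximation of HTML parsing, so a real HTML parser is likely to be more reliable on messy markup.

```csharp
using System;
using System.Collections.Generic;
using System.Net;
using System.Text.RegularExpressions;

public static class LinkExtractor
{
    // Matches href="..." or src='...' (quoted values only); group 2 holds the raw value.
    private static readonly Regex AttributeUrl = new Regex(
        "\\b(?:href|src)\\s*=\\s*([\"'])(.*?)\\1",
        RegexOptions.IgnoreCase | RegexOptions.Singleline);

    // Returns the distinct absolute URLs referenced by html, resolved against baseUri.
    public static List<string> ExtractLinks(string html, Uri baseUri)
    {
        var seen = new HashSet<string>();
        var links = new List<string>();
        foreach (Match m in AttributeUrl.Matches(html))
        {
            // Attribute values may contain entities such as &amp;, so decode them first.
            string raw = WebUtility.HtmlDecode(m.Groups[2].Value).Trim();
            Uri resolved;
            // Uri.TryCreate resolves a relative reference against the base address
            // and returns an absolute reference unchanged.
            if (raw.Length > 0
                && Uri.TryCreate(baseUri, raw, out resolved)
                && seen.Add(resolved.AbsoluteUri))
            {
                links.Add(resolved.AbsoluteUri);
            }
        }
        return links;
    }

    public static void Main()
    {
        // Placeholder markup and base address, for illustration only.
        string html = "<a href=\"/about\">About</a> <img src='logo.png'> "
                    + "<a href=\"https://example.org/x\">X</a>";
        Uri pageAddress = new Uri("http://example.com/docs/index.html");
        foreach (string link in ExtractLinks(html, pageAddress))
        {
            Console.WriteLine(link);
        }
    }
}
```

In the handler above, the sketch would be called as LinkExtractor.ExtractLinks(html, new Uri(url)) in place of CollectLinks(html). Note that it keeps every scheme it finds, so mailto: and javascript: values come through as well unless you filter on the Scheme property of the resolved Uri.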
