Hi Everyone,
Greetings for the day!

I'm trying to grab all the links in an HTML page. I've tried following this: http://www.dotnetperls.com/scraping-html and many other links, but couldn't find a suitable solution. The problem is that I can't find a "LinkItem" object in my IDE. I've also tried using the System.Net and System.Diagnostics namespaces.

Please suggest how I can solve this problem. It may be minor or major, I don't know.

Any help is appreciated.

Thanks,
Sunny K
Comments
Sunasara Imdadhusen 30-May-12 5:16am    
Please clarify your problem. Is this a Windows or a web application?
Sunny_Kumar_ 30-May-12 5:30am    
I've tried doing it with both, but couldn't find the LinkItem object in either of them.
Zoltán Zörgő 30-May-12 7:43am    
I know you already have some solutions, but please clarify: what do you mean by "link"? Only the href attribute of A tags? All URLs that the user can navigate to from that page? All URLs referenced by the page?
Sunny_Kumar_ 30-May-12 7:56am    
What I meant by "links" is all URLs that a page references.
thanks :)
Zoltán Zörgő 30-May-12 9:23am    
Then you have accepted a solution that does not satisfy your requirements. That regular expression finds only some absolute URLs: no https and no relative paths, for example. Look here for a wide list of expressions you could use: http://regexlib.com/Search.aspx?k=URL
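To illustrate the point about https and relative paths, here is a minimal sketch that reads the values of href and src attributes instead of guessing at URL shapes, then resolves each one against the page address with Uri.TryCreate. The names ReferenceFinder and FindReferences and the example addresses are made up for this illustration, and a regular expression over HTML will still miss unquoted attributes and URLs inside scripts or stylesheets, so treat it as a starting point rather than a complete answer.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of any library.
public static class ReferenceFinder
{
    // Matches href="..." or src="..." with either single or double quotes.
    private static readonly Regex AttributePattern = new Regex(
        @"\b(?:href|src)\s*=\s*(?:""(?<url>[^""]*)""|'(?<url>[^']*)')",
        RegexOptions.IgnoreCase | RegexOptions.Compiled);

    // Returns every referenced URL, resolved against the address of the page.
    public static List<string> FindReferences(string html, Uri baseUri)
    {
        var result = new List<string>();
        foreach (Match m in AttributePattern.Matches(html))
        {
            Uri absolute;
            // Absolute values are kept as they are; relative ones are combined with baseUri.
            if (Uri.TryCreate(baseUri, m.Groups["url"].Value, out absolute))
            {
                result.Add(absolute.AbsoluteUri);
            }
        }
        return result;
    }
}

public static class Program
{
    public static void Main()
    {
        // Assumed sample input; in the answer's code this would be the downloaded HTML.
        string html = "<a href=\"https://example.org/a\">A</a> <img src='img/b.png'>";
        var pageAddress = new Uri("http://example.com/dir/page.html");

        foreach (string reference in ReferenceFinder.FindReferences(html, pageAddress))
        {
            Console.WriteLine(reference);
        }
    }
}
```

Passing the page address as baseUri is what lets relative paths come back as full URLs. Values such as mailto: or javascript: links are returned too, so filter on the Uri scheme if only http and https are wanted.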

Hello, add a TextBox, a Label and a Button to your page, and in the Button click handler try this:
C#
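// Namespaces needed at the top of the code-behind file.
using System.Collections.Generic;
using System.Net;
using System.Text;
using System.Text.RegularExpressions;

protected void Button1_Click(object sender, EventArgs e)
   {
       string url = TextBox1.Text;

       // Download the raw HTML of the page and dispose the client when done.
       string html;
       using (WebClient wc = new WebClient())
       {
           html = wc.DownloadString(url);
       }

       List<string> links = CollectLinks(html);

       // Build a two-column table: row number and URL.
       StringBuilder sb = new StringBuilder();
       int c = 1;
       sb.Append("<table> <tr>  <td style=\"padding-right:40px;\">   #   </td>   <td>  URL </td> </tr> ");
       foreach (string link in links)
       {
           // Encode the URL so that characters such as & render correctly.
           sb.Append(" <tr>  <td>    " + c + "   </td>   <td>  " + Server.HtmlEncode(link) + " </td> </tr>");
           c++;
       }
       sb.Append("</table>");

       lblResult.Text = sb.ToString();
   }

   public List<string> CollectLinks(string strSource)
   {
       List<string> links = new List<string>();

       // Note: this pattern only matches URLs that start with http:// or www.
       Regex r1 = new Regex("((http://|www\\.)([A-Z0-9.-:]{1,})\\.[0-9A-Z?;~:&+%#=\\-_\\./]{2,})", RegexOptions.Compiled | RegexOptions.IgnoreCase);
       foreach (Match m in r1.Matches(strSource))
       {
           // Store the matched text itself rather than the Match object.
           links.Add(m.Value);
       }

       return links;
   }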


and all the links will be shown in the label.
 
 
Comments
Sunny_Kumar_ 30-May-12 5:36am    
Thanks Tanveer for your answer, I really appreciate that. Is there any way to find the links without using Regex?
tanweer 30-May-12 5:51am    
Once you have all the HTML from this line of code:
string html = wc.DownloadString(url);
you have to add your own logic to find the links.
One idea is to split the HTML on </a> and then collect all the anchor tags with some string operations in C#.
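A rough sketch of that idea, using only string operations and no Regex. Instead of splitting on </a>, it walks the text with IndexOf, which amounts to the same scan. The names AnchorScanner and ExtractHrefs are invented for this example, and it makes simplifying assumptions: the href value is quoted, and no > character appears inside the tag's attributes.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper, not part of any library.
public static class AnchorScanner
{
    // Returns the href value of every <a ...> tag found in the HTML text.
    public static List<string> ExtractHrefs(string html)
    {
        var links = new List<string>();
        int pos = 0;
        while (true)
        {
            // Find the next candidate opening tag, ignoring case.
            int tagStart = html.IndexOf("<a", pos, StringComparison.OrdinalIgnoreCase);
            if (tagStart < 0) break;
            int tagEnd = html.IndexOf('>', tagStart);
            if (tagEnd < 0) break;
            pos = tagEnd + 1;

            // Skip tags such as <abbr> that merely begin with "<a".
            if (!char.IsWhiteSpace(html[tagStart + 2])) continue;

            string tag = html.Substring(tagStart, tagEnd - tagStart);
            int hrefAt = tag.IndexOf("href=", StringComparison.OrdinalIgnoreCase);
            if (hrefAt < 0) continue;

            int valueStart = hrefAt + "href=".Length;
            if (valueStart >= tag.Length) continue;

            // Only quoted values are handled in this sketch.
            char quote = tag[valueStart];
            if (quote != '"' && quote != '\'') continue;

            int valueEnd = tag.IndexOf(quote, valueStart + 1);
            if (valueEnd < 0) continue;

            links.Add(tag.Substring(valueStart + 1, valueEnd - valueStart - 1));
        }
        return links;
    }
}

public static class Program
{
    public static void Main()
    {
        // Assumed sample input; in the answer's code this would be the downloaded HTML.
        string html = "<a href=\"http://example.com/x\">X</a><abbr>y</abbr><A HREF='/rel'>R</A>";
        foreach (string href in AnchorScanner.ExtractHrefs(html))
        {
            Console.WriteLine(href);
        }
    }
}
```

This only collects anchor tags, so it answers "find the links" in the narrow sense. Images, scripts and stylesheets referenced by the page would need the same scan for their own tags and attributes.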
Sunny_Kumar_ 30-May-12 7:54am    
Thanks again :)
Use document.getElementsByTagName('a') in JavaScript.
 
 
Comments
Sunny_Kumar_ 30-May-12 5:31am    
Thanks Ashish :) I really appreciate your help, but this is not what I am looking for. I want to do this with C#.
Member 15627495 2-Sep-22 23:35pm    
Hello!

Look at the HtmlDocument class: among its members there is a collection for retrieving the links from a page.
See the Microsoft documentation.
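A possible sketch of that suggestion, assuming the class meant is System.Windows.Forms.HtmlDocument as exposed by the WinForms WebBrowser control (Windows only, with a reference to System.Windows.Forms). The page address, the form layout and the ListBox are placeholders chosen for this example.

```csharp
using System;
using System.Windows.Forms;

public static class Program
{
    [STAThread]
    public static void Main()
    {
        // The ListBox shows the results; the WebBrowser loads and parses the page.
        var results = new ListBox { Dock = DockStyle.Fill };
        var browser = new WebBrowser
        {
            Dock = DockStyle.Top,
            Height = 200,
            ScriptErrorsSuppressed = true
        };

        var form = new Form { Text = "Links on the page", Width = 800, Height = 600 };
        form.Controls.Add(results);
        form.Controls.Add(browser);

        browser.DocumentCompleted += (sender, e) =>
        {
            // DocumentCompleted also fires for frames; handle only the top-level document.
            if (e.Url != browser.Url) return;

            results.Items.Clear();
            // HtmlDocument.Links holds the link elements that the control found in the page.
            foreach (HtmlElement link in browser.Document.Links)
            {
                results.Items.Add(link.GetAttribute("href"));
            }
        };

        // Placeholder address; replace it with the page you want to inspect.
        browser.Navigate("http://www.example.com/");
        Application.Run(form);
    }
}
```

The trade-off is that this needs a UI thread and a message loop, so it suits a Windows desktop application better than a web page's code-behind.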

This content, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)


