I have a web form where I enter a URL, and I then extract only the "useful" text from the .html file.

Before that step, though, I want to look for .pdf (or other) files that the page links to with an a href.
Downloading is not the problem; I already know how to do that. The problem is only identifying the URL of each PDF file so that I can pass it along as a string.

So the steps should be: (1) download the original HTML file, (2) read it into a string, (3) search that string for the document URLs (the part I need help with), (4) download them, (5) ...

string strFile = File.ReadAllText(path2);
// How do I find the linked documents here (say, PDFs, searching by extension)?
...
// WebClient is disposable, so wrap it in a using block.
using (WebClient client = new WebClient())
{
    client.DownloadFile(documentURL, path);
}


Can anyone give me a hint?
Updated 24-May-11 23:38pm

1 solution

After fetching the HTML, the best way to get at all the links on the page is to use a library like HTMLAgilityPack. It lets you easily select all the a href nodes and inspect them for possible PDF files. Caveat: a URL that points to a PDF file does not necessarily contain the string .pdf. The only way to be sure you really find all the PDFs linked from a page is to take every link you find on that page and make a header request (an HTTP HEAD request) for the document. The MIME type the server returns is no absolute guarantee that the document is a PDF either, but it is still better than looking only at the URL's extension.
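
The link-extraction step could look roughly like the sketch below. It assumes the HtmlAgilityPack NuGet package is referenced; the class name PdfLinkFinder, the method FindPdfLinks and its pageUri parameter are hypothetical names chosen for illustration. It only applies the simple extension test, so it will miss PDFs served from URLs that do not end in .pdf.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack; // third-party package, assumed to be installed

public static class PdfLinkFinder
{
    // Returns the absolute URLs of all <a href> links whose path ends in ".pdf".
    // pageUri is the address the HTML was downloaded from; it is needed to
    // resolve relative links such as "docs/report.pdf".
    public static List<string> FindPdfLinks(string html, Uri pageUri)
    {
        var results = new List<string>();
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // SelectNodes returns null (not an empty collection) when nothing matches.
        HtmlNodeCollection anchors = doc.DocumentNode.SelectNodes("//a[@href]");
        if (anchors == null)
        {
            return results;
        }

        foreach (HtmlNode anchor in anchors)
        {
            // Decode entities such as &amp; that may appear in the raw attribute.
            string href = HtmlEntity.DeEntitize(anchor.GetAttributeValue("href", ""));

            Uri absolute;
            if (!Uri.TryCreate(pageUri, href, out absolute))
            {
                continue; // not a usable URL
            }
            if (absolute.Scheme != Uri.UriSchemeHttp && absolute.Scheme != Uri.UriSchemeHttps)
            {
                continue; // skip mailto:, javascript: and similar links
            }

            // Test the path only, so a query string like "?x=1" does not hide the extension.
            if (absolute.AbsolutePath.EndsWith(".pdf", StringComparison.OrdinalIgnoreCase))
            {
                results.Add(absolute.AbsoluteUri);
            }
        }
        return results;
    }
}
```

With the question's variables, this would be called as something like PdfLinkFinder.FindPdfLinks(strFile, new Uri(originalUrl)), where originalUrl stands for the URL entered in the web form, and each returned string could then be passed to DownloadFile.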
If you're writing a crawler, you'd also want to make sure to follow links to other documents linked from your page, since those might contain PDFs too.
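
The header check mentioned above might be sketched as follows, using only classes from System.Net. The names PdfProbe, IsPdfContentType and LooksLikePdf are hypothetical, and the sketch assumes the URL is an http or https address (other schemes would make the cast to HttpWebRequest fail). As noted, the reported MIME type is a hint, not a guarantee.

```csharp
using System;
using System.Net;

public static class PdfProbe
{
    // True when a Content-Type header value names the PDF MIME type,
    // ignoring case and any parameters such as "; charset=...".
    public static bool IsPdfContentType(string contentType)
    {
        if (string.IsNullOrEmpty(contentType))
        {
            return false;
        }
        string mediaType = contentType.Split(';')[0].Trim();
        return mediaType.Equals("application/pdf", StringComparison.OrdinalIgnoreCase);
    }

    // Sends a HEAD request, so only the response headers are transferred
    // and the document body is not downloaded.
    public static bool LooksLikePdf(string url)
    {
        var request = (HttpWebRequest)WebRequest.Create(url);
        request.Method = "HEAD";
        try
        {
            using (var response = (HttpWebResponse)request.GetResponse())
            {
                return IsPdfContentType(response.ContentType);
            }
        }
        catch (WebException)
        {
            // Unreachable host, an error status, or a server that rejects HEAD.
            return false;
        }
    }
}
```

A crawler could call LooksLikePdf on every link it finds rather than only on those ending in .pdf, at the cost of one extra request per link. Some servers do not answer HEAD requests properly, so treating a failed check as "not a PDF" is a simplification.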

I hope I expressed myself clearly, but if you still have doubts feel free to leave me a comment.

Cheers!

-MRB
 
 
Comments
Maxdd 7 25-May-11 5:39am    
Thanks, Manfred. I now use HTMLAgilityPack and it works perfectly for my needs. Thank you very much!
Manfred Rudolf Bihy 25-May-11 9:29am    
You're welcome! Yes HTMLAgilityPack does take the trouble out of HTML parsing :)
Prasanta_Prince 25-May-11 5:42am    
Good one.

This content, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)



CodeProject, 20 Bay Street, 11th Floor Toronto, Ontario, Canada M5J 2N8 +1 (416) 849-8900