Crawler to find and download .pdf files linked from a webpage - selecting URLs by extension

Question

.NET3.5

I have a web form where I enter a URL and then extract only the "useful" text from the .html file.

Before doing that, I want to look for .pdf (or other) uploaded files that are linked with an a href.
Downloading is not a problem; I already know how to do that. The problem is only identifying the URL of the PDF file so that I can pass it as a string.

So the steps should be: (1) download the original HTML file, (2) read it into a string, (3) search that string for the document URLs (this is what I need), (4) download them, (5)....

String strFile = File.ReadAllText(path2);
// how to find (let's say pdf to search by extension) documents?
...
WebClient Client = new WebClient ();
Client.DownloadFile(documentURL, path);

Can anyone give me a hint?

Posted 23-May-11 13:04pm

Maxdd 7

Updated 24-May-11 23:38pm

1 solution

This content, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)

Manfred Rudolf Bihy · Accepted Answer · 2011-05-23T14:57:00

After fetching the HTML, the best way to get at all the links on the page is to use a library like HTMLAgilityPack. It lets you easily select all the a href nodes and inspect them for possible PDF files.
Caveat: a URL pointing to a PDF file does not necessarily contain the string .pdf. The only way to make sure that you really get all the PDFs linked from a page is to take every link you find on that page and request only the headers of the target document (an HTTP HEAD request). The MIME type returned by the server is no absolute guarantee that the document is a PDF either, but it is still better than looking only at the URL extension.
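
A minimal sketch of both checks, assuming the HtmlAgilityPack assembly is referenced and only .NET 3.5 features are available. The class and method names (`PdfLinkFinder`, `ExtractLinks`, `HasPdfExtension`, `LooksLikePdf`) are made up for illustration and are not part of any library. The extension test is only a cheap first filter; the HEAD check covers links without a .pdf extension.

```csharp
using System;
using System.Collections.Generic;
using System.Net;
using HtmlAgilityPack;

// Hypothetical helper class; the names are illustrative only.
public static class PdfLinkFinder
{
    // Returns every http/https link found in <a href="..."> as an absolute URI.
    // pageUri must be the absolute URL the HTML was downloaded from.
    public static List<Uri> ExtractLinks(string html, Uri pageUri)
    {
        var links = new List<Uri>();
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // SelectNodes returns null (not an empty collection) when nothing matches.
        HtmlNodeCollection anchors = doc.DocumentNode.SelectNodes("//a[@href]");
        if (anchors == null) return links;

        foreach (HtmlNode anchor in anchors)
        {
            string href = anchor.GetAttributeValue("href", "");
            Uri absolute;
            // Resolve relative links against the page URL; skip malformed ones.
            if (!Uri.TryCreate(pageUri, href, out absolute)) continue;
            // Ignore mailto:, javascript: and other non-web schemes.
            if (absolute.Scheme != Uri.UriSchemeHttp &&
                absolute.Scheme != Uri.UriSchemeHttps) continue;
            links.Add(absolute);
        }
        return links;
    }

    // Cheap first filter: does the path part of the URL end in ".pdf"?
    public static bool HasPdfExtension(Uri uri)
    {
        return uri.AbsolutePath.EndsWith(".pdf", StringComparison.OrdinalIgnoreCase);
    }

    // Sends a HEAD request and inspects the Content-Type the server reports.
    // Returns false when the request fails (some servers reject HEAD).
    public static bool LooksLikePdf(Uri uri)
    {
        var request = (HttpWebRequest)WebRequest.Create(uri);
        request.Method = "HEAD";
        try
        {
            using (var response = (HttpWebResponse)request.GetResponse())
            {
                string type = response.ContentType ?? "";
                // The header may carry parameters, e.g. "application/pdf; charset=...".
                return type.StartsWith("application/pdf", StringComparison.OrdinalIgnoreCase);
            }
        }
        catch (WebException)
        {
            return false;
        }
    }
}
```

A caller could loop over `ExtractLinks(strFile, pageUri)` and pass a link's `AbsoluteUri` to `Client.DownloadFile` whenever `HasPdfExtension(link) || LooksLikePdf(link)` is true; `pageUri` here stands for the URL entered in the web form.
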
If you're writing a crawler, you'll also want to follow links to other pages linked from your page, since those might link to PDFs as well.
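
Following links can be sketched as a breadth-first walk with a set of already-seen URLs, so that no page is visited twice. This is only an illustration under a few assumptions: HtmlAgilityPack is referenced, the crawl stays on the starting host, PDFs are recognised by extension only, and the names `SiteCrawler`, `CollectPdfLinks`, `maxPages` and `fetchHtml` are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Net;
using HtmlAgilityPack;

// Hypothetical crawler sketch; names and limits are illustrative only.
public static class SiteCrawler
{
    // Walks same-host pages breadth-first and collects links ending in ".pdf".
    // fetchHtml is supplied by the caller and returns the HTML of a page.
    public static List<Uri> CollectPdfLinks(Uri start, int maxPages, Func<Uri, string> fetchHtml)
    {
        var pdfs = new List<Uri>();
        var seen = new HashSet<string>();
        var queue = new Queue<Uri>();
        seen.Add(start.GetLeftPart(UriPartial.Query));
        queue.Enqueue(start);
        int fetched = 0;

        while (queue.Count > 0 && fetched < maxPages)
        {
            Uri page = queue.Dequeue();
            string html;
            try { html = fetchHtml(page); }
            catch (WebException) { continue; } // unreachable page: skip it
            fetched++;

            var doc = new HtmlDocument();
            doc.LoadHtml(html);
            // SelectNodes returns null when the page has no links.
            HtmlNodeCollection anchors = doc.DocumentNode.SelectNodes("//a[@href]");
            if (anchors == null) continue;

            foreach (HtmlNode anchor in anchors)
            {
                Uri link;
                if (!Uri.TryCreate(page, anchor.GetAttributeValue("href", ""), out link)) continue;
                if (link.Scheme != Uri.UriSchemeHttp && link.Scheme != Uri.UriSchemeHttps) continue;

                // Drop the fragment so page.html#a and page.html#b count as one URL.
                string key = link.GetLeftPart(UriPartial.Query);
                if (!seen.Add(key)) continue; // already handled

                if (link.AbsolutePath.EndsWith(".pdf", StringComparison.OrdinalIgnoreCase))
                    pdfs.Add(link);
                else if (link.Host == start.Host)
                    queue.Enqueue(link); // follow only links on the starting host
            }
        }
        return pdfs;
    }
}
```

For a real crawl, `fetchHtml` could be something like `u => new WebClient().DownloadString(u)`. The `maxPages` cap is there so that the walk always terminates, even on large sites.
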

I hope I expressed myself clearly, but if you still have doubts feel free to leave me a comment.

Cheers!

-MRB