I have a task in which I need to write a program to crawl some PDF files. If the crawler finds any of a set of predefined keywords, it should highlight that text or show a popup, and then continue the search.


Thanks in advance.
Updated 26-Aug-15 21:37pm
Comments
Zoltán Zörgő 26-Aug-15 14:45pm    
The only reliable approach is OCR. Are you willing to take this path?
suhel_khan 26-Aug-15 14:50pm    
Anything will work, but can I integrate it with my C# code?
Sergey Alexandrovich Kryukov 26-Aug-15 14:58pm    
Sure, there is a choice of products you can use; please see Solution 1.
—SA

This is a set of references to PDF libraries you can use: http://csharp-source.net/open-source/pdf-libraries.

In particular, you can try this one: https://pdfapi.codeplex.com.

—SA
 
 
As I mentioned, the only reliable way to extract text from a PDF is OCR. There are some free/open-source libraries you could use (like Tesseract), but I recommend buying an API with proper .NET support, such as these:
http://www.abbyy.com/ocr-sdk-windows/
https://www.leadtools.com/sdk/ocr/default.htm?SrcOrigin=Google-CPC-OCR%20API&MatchType=e&AdPos=1t2&gclid=CLjXx4Gx6K8CFdA2pAodAXth1Q
http://www.aspose.com/.net/ocr-component.aspx

Another approach is using IFilter, which is actually made for full-text indexing, and there is an IFilter for PDF: http://www.adobe.com/support/downloads/detail.jsp?ftpID=5542. But I doubt you will be able to find the original position of the text with it.
 
 
Comments
suhel_khan 27-Aug-15 1:59am    
OCR is paid, Zoltán :(
Zoltán Zörgő 27-Aug-15 2:52am    
Neither Tesseract nor IFilter is paid.
You can try to do it without OCR, but you will fail if your PDFs are arbitrary PDFs from anywhere. PDF is a format for rendering, so there is no guarantee that a text you see in Acrobat Reader is actually a single object in the internal structure. I have seen many interesting examples. Not to mention the situation where your text is actually an image.
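To illustrate the fragmentation problem described in this comment: a keyword can be split over several internal text objects, so searching each object on its own misses it. The following is only a minimal sketch in Python, assuming the text objects have already been extracted, in reading order, into a plain list of strings. The `fragments` list and the function name are hypothetical; real PDF libraries return richer structures, and reading order is not guaranteed.

```python
from bisect import bisect_right


def find_keyword_across_fragments(fragments, keyword):
    """Return (first_fragment, last_fragment) index pairs for every match."""
    needle = keyword.lower()
    if not needle:
        return []

    # Lowercase each fragment first so the recorded offsets match the
    # text that is actually searched.
    lowered = [fragment.lower() for fragment in fragments]

    # Remember the character offset at which each fragment starts.
    starts = []
    position = 0
    for fragment in lowered:
        starts.append(position)
        position += len(fragment)
    text = "".join(lowered)

    hits = []
    index = text.find(needle)
    while index != -1:
        last = index + len(needle) - 1
        # bisect_right(...) - 1 is the fragment containing a given offset.
        hits.append((bisect_right(starts, index) - 1,
                     bisect_right(starts, last) - 1))
        index = text.find(needle, index + 1)
    return hits


# Hypothetical fragments: "invoice" is split between the first two objects.
fragments = ["The inv", "oice ", "total is", " due."]
print(find_keyword_across_fragments(fragments, "invoice"))
```

The returned index pairs tell you which fragments a match spans, which is the information a highlighter would need. This does not help when the text is an image; that case still requires OCR.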
suhel_khan 27-Aug-15 3:24am    
The PDF format is not fixed; do you have any idea how I can achieve this?
Zoltán Zörgő 27-Aug-15 3:30am    
I don't understand this comment. I have no other idea but OCR.
suhel_khan 27-Aug-15 3:34am    
My concern is that I don't want to use any paid DLL. Can I achieve the same through code?
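The search part of the task itself needs no paid component. As a rough sketch only, written in Python and assuming the page text has already been obtained by some free means (a text-extraction library or OCR), the following reports every predefined keyword it finds and keeps scanning. The `pages` list and the `scan_pages` name are hypothetical, and how the text gets extracted is deliberately left out.

```python
import re


def scan_pages(pages, keywords):
    """Yield (page_number, matched_text, offset) for every keyword hit."""
    # Longer keywords first, so "invoice number" wins over "invoice"
    # when both could match at the same position.
    ordered = sorted((k for k in keywords if k), key=len, reverse=True)
    if not ordered:
        return
    pattern = re.compile("|".join(re.escape(k) for k in ordered),
                         re.IGNORECASE)
    for page_number, text in enumerate(pages, start=1):
        # finditer keeps going after each hit, so the scan continues
        # through the whole page and on to the next one.
        for match in pattern.finditer(text):
            yield page_number, match.group(0), match.start()


# Hypothetical page texts standing in for extracted PDF content.
pages = ["Invoice number 42", "No match here", "Second invoice"]
for hit in scan_pages(pages, ["invoice", "invoice number"]):
    print(hit)
```

Each reported hit carries the page number and character offset, which is where a popup or highlight step would hook in. The same logic ports directly to C# with `System.Text.RegularExpressions.Regex`.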

This content, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)


