I have a large number of scanned PDF documents.

I want to read each scanned PDF and generate XML from it.

Then I want to update the content of the PDF from the modified XML file.

How can I do this?
Comments
Abhishek Pant 19-Dec-12 9:13am    

Hire people to do this for you. :)
This is a really challenging task, not one for a "quick answers" kind of forum. There are commercial applications for such tasks, but in general it can't be performed with 100% accuracy.

What you need:
- An OCR engine (assuming the image quality is good enough and there is no handwriting). Some scanners already add an OCR text layer on top of the scanned image.
- One or more patterns that map text to XML elements based on position or some metadata (assuming your documents come in a limited number of types).
- Document type recognition logic.
- Content validation logic, so you have a clue how well the automatic process performed.
- A separate approach for editing, because editing a PDF is a different problem. If the scanner did not OCR the scanned images, you cannot edit the image itself; you have to place the new text on top of the original.
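
To make the position-based mapping idea concrete, here is a minimal sketch in Python. It assumes the OCR engine has already produced a list of words with page coordinates; the `WORDS` data, the `INVOICE_ZONES` template and the `words_to_xml` helper are all hypothetical names made up for illustration, not part of any particular OCR product.

```python
import xml.etree.ElementTree as ET

# Hypothetical OCR output: one (text, x, y) tuple per recognized word,
# with coordinates in page points measured from the top-left corner.
WORDS = [
    ("Invoice", 40, 30), ("No.", 95, 30), ("12345", 130, 30),
    ("ACME", 40, 80), ("Ltd.", 85, 80),
    ("Total:", 400, 700), ("99.50", 450, 700),
]

# Hypothetical template for one document type:
# XML element name -> (left, top, right, bottom) rectangle on the page.
INVOICE_ZONES = {
    "number": (120, 20, 300, 45),
    "customer": (30, 70, 300, 95),
    "total": (440, 690, 560, 715),
}


def words_to_xml(words, zones, root_name="document"):
    """Build an XML tree with one child element per zone in the template."""
    root = ET.Element(root_name)
    for name, (left, top, right, bottom) in zones.items():
        # Keep only the words whose anchor point falls inside the zone.
        inside = [w for w in words
                  if left <= w[1] <= right and top <= w[2] <= bottom]
        # Restore reading order: top to bottom, then left to right.
        inside.sort(key=lambda w: (w[2], w[1]))
        ET.SubElement(root, name).text = " ".join(w[0] for w in inside)
    return root


xml_text = ET.tostring(words_to_xml(WORDS, INVOICE_ZONES), encoding="unicode")
print(xml_text)
```

Each document type would need its own zone template, which is why the document type recognition step has to run before the mapping.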

But these are only the basic concepts. Such a task is really hard: expect many months of full-time work, and at the end you will still have special cases where the automatic handling does not work. For those you have to add some user interaction, so you will need a user interface too.
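
As a rough illustration of how validation can decide which documents need that user interaction, here is a small Python sketch. The field names, the regular expressions in `RULES` and the `failed_fields` helper are assumptions made for the example; real rules depend entirely on your document types.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical per-field rules: element name -> pattern the value must match.
RULES = {
    "number": re.compile(r"\d{5}"),
    "total": re.compile(r"\d+\.\d{2}"),
}


def failed_fields(xml_text, rules):
    """Return the names of the fields that are missing or fail their rule."""
    root = ET.fromstring(xml_text)
    failed = []
    for name, pattern in rules.items():
        # A missing element is treated as an empty value, which fails the rule.
        value = root.findtext(name, default="")
        if not pattern.fullmatch(value.strip()):
            failed.append(name)
    return failed


good = "<document><number>12345</number><total>99.50</total></document>"
bad = "<document><number>1Z345</number><total>99.50</total></document>"

for label, doc in (("good", good), ("bad", bad)):
    problems = failed_fields(doc, RULES)
    status = "needs manual review: " + ", ".join(problems) if problems else "ok"
    print(label, "->", status)
```

Documents with an empty failure list can pass through automatically; anything else goes to a person, which is where the user interface comes in.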
 
Comments
lewax00 19-Dec-12 10:26am    
That sums it up pretty well. I work on a product with a similar feature, and I can add that even PDFs that were not scanned in (i.e. the text can be stripped from them) still aren't easy to process. PDF is only good for one thing: printable documents that don't change based on the reader. They are terrible as a data source.
Adobe Acrobat 9 Pro (v9.5.2) does a good job of making .xml out of .pdf. It has an option in the Save dialog to save as "XML 1.0", with these settings:

Encoding, bookmark generation, tag generation ...

And there are Image File Settings:

Generate images, use sub-folder, as well as output format (TIFF, JPG, PNG), even downsample ...

So as complicated as "disassembling" a .pdf can be (I know from personal experience), Adobe is the original "fonter" and "printer", and this app lets them package their proprietary knowledge both formidably and somewhat successfully.

The only downside: $$$.
 

This content, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)


