How to convert Scanned PDF to XML

Question

I have a large batch of scanned PDF documents.

I want to read each scanned PDF and generate XML from it.

I then want to write the content of the modified XML file back into the PDF.

How can I do this?

Posted 19-Dec-12 2:18am

gani7787


Comments

Abhishek Pant 19-Dec-12 9:13am

how to convert PDF to XML

2 solutions


This content, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)

**Zoltán Zörgő** · Answer 1 · 2012-12-19T03:38:00

Hire people to do this for you. :)
This is a really challenging task, not one for a "quick answers" kind of forum. There are commercial applications for such tasks, but in general it can't be performed with 100% accuracy.

What you need:
- an OCR engine (assuming the image quality is good enough and there is no handwriting); some scanners already add an OCR text layer on top of the scanned image
- one or more patterns that map text to XML elements based on position or some metadata (assuming your documents come in a limited number of types)
- document type recognition logic
- content validation logic, so you have some idea of how well the automatic process performed
- a separate approach for editing, because editing a PDF is a different problem. If the scanner did not OCR the images, you cannot edit the image itself; you have to place the new text over the original
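
To make the position-based mapping in the list above a bit more concrete, here is a minimal sketch. It assumes the OCR engine has already produced a list of words with page coordinates; the `Word` class, the `INVOICE_TEMPLATE` regions and the field names are all made up for illustration and are not tied to any particular OCR product.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass


@dataclass
class Word:
    """One OCR'd word with the top-left corner of its bounding box (page units)."""
    text: str
    x: float
    y: float


# Hypothetical template for one document type: field name -> (x0, y0, x1, y1).
INVOICE_TEMPLATE = {
    "invoice_number": (400, 40, 600, 70),
    "customer": (50, 100, 350, 130),
}


def words_to_xml(words, template, root_tag="document"):
    """Assign each word to the template region containing it and build XML."""
    root = ET.Element(root_tag)
    for field, (x0, y0, x1, y1) in template.items():
        inside = [w for w in words if x0 <= w.x < x1 and y0 <= w.y < y1]
        # Reading order: top to bottom, then left to right.
        inside.sort(key=lambda w: (w.y, w.x))
        ET.SubElement(root, field).text = " ".join(w.text for w in inside)
    return root


# Made-up OCR output, deliberately out of reading order.
words = [
    Word("Ltd", 120, 105),
    Word("INV-0042", 410, 50),
    Word("Acme", 60, 105),
    Word("Total", 60, 700),  # outside every region, so it is ignored
]
xml_text = ET.tostring(words_to_xml(words, INVOICE_TEMPLATE), encoding="unicode")
print(xml_text)
```

A real system would need one such template per document type, which is why the type recognition step comes first.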

But these are only the basic concepts. Such a task is really hard: many months of full-time work, and at the end you will still have special cases where the automatic handling does not work, so you have to add some user interaction, which means you will need a user interface too.
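
As a rough idea of how the validation step could decide which documents go to a human, here is a small sketch. The field names and regular expressions in `FIELD_PATTERNS` are hypothetical; the real rules depend entirely on your document types.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical per-field rules; real rules depend on the document type.
FIELD_PATTERNS = {
    "invoice_number": re.compile(r"INV-\d{4}"),
    "date": re.compile(r"\d{4}-\d{2}-\d{2}"),
}


def fields_needing_review(xml_text, patterns):
    """Return names of fields that are missing, empty, or fail their pattern."""
    root = ET.fromstring(xml_text)
    suspect = []
    for field, pattern in patterns.items():
        value = (root.findtext(field) or "").strip()
        if not pattern.fullmatch(value):
            suspect.append(field)
    return suspect


# The invoice number contains the letter O instead of a zero, a typical OCR slip.
sample = (
    "<document>"
    "<invoice_number>INV-0O42</invoice_number>"
    "<date>2012-12-19</date>"
    "</document>"
)
review = fields_needing_review(sample, FIELD_PATTERNS)
print(review)
```

Anything that ends up in the returned list would be shown to a user for correction instead of being accepted automatically.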

**RedDk** · Answer 2 · 2012-12-19T09:04:00

Adobe Acrobat 9 Pro (v9.5.2) does a good job of making .xml out of .pdf. The Save dialog has an option to save as "XML 1.0", with these settings:

Encoding, bookmark generation, tag generation ...

And there are Image File Settings:

Generate images, use sub-folder, output format (TIFF, JPG, PNG), even downsampling ...

So as complicated as "disassembling" a .pdf can be (I know from personal experience), Adobe is the original "fonter" and "printer", and this app lets them package that proprietary knowledge formidably and fairly successfully.

The only downside: $$$.
