How could I parse a PDF document to either Excel or XML? Which solution would be best for a large number of documents?

Question

I've been researching the best way to parse (or extract) a PDF file into Excel or XML. I've looked at iText and ByteScout, which may be the best fit for what I need to do, but I'm also considering writing the code myself in VB.NET or VBScript and need to be pointed in the right direction to get started. Any help would be greatly appreciated.

KM

What I have tried:

I have tried both ByteScout and Aspose.PDF. They may work, but I don't fully understand them. I've looked at iText also.

Posted 17-May-16 3:56am

Kim Williams

Updated 18-May-16 13:09pm

CHill60

v2

This content, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)

Garth J Lancaster · Answer 1 · 2016-05-17T04:30:00

That could be a bit of a big/broad question for the 'Quick Answers' section. I suspect there are many tools that would do the job for you, but it's what you haven't mentioned that would determine the overall approach and possibly the 'tools', language etc.

Some questions that might be asked (i.e. forming 'requirements'):

- "large amount of documents" - how many?
- how much text is in each document?
- what is the source of the documents - file system/web server/email/database (etc.)?
- why Excel or XML for output - what do you need to do with the extracted text, e.g. search it, reformat it?
- are you envisaging a batch process or a real-time/on-demand process?
- do you have a budget, i.e. can you pay for tools?
- what are the time factors for delivering your 'project', as I'll call it?
- how are you going to track/trace the documents extracted etc.?
(and probably lots more)

You say (of ByteScout & Aspose.PDF) "but I don't fully understand them" - we don't know your background or how much experience you have. If you're going to have to write and support something, you may be better off with a $$ product so you can use the product supplier for help & support - any decent SDK should also come with a number of examples/samples. This is a 'buy vs build' question.

Answers/thoughts on some of the questions above might also suggest VB.Net, for example, over VBScript - i.e. robustness, level of automation, ...

So, I'm sorry, there's no 'best way' on the information you have given - there could be lots of good ways and more bad ways - and the extraction 'tool' is only a small part of the solution.

[edit : Added]
You could also 'outsource' the extraction to a bureau/service of course - you send them the PDFs and they send you back the data in the format you require - no coding required on your part!
[/edit]

[edit 2]

OK, I would 'start' with a solution along the following lines, recognising that you may evolve some parts later on. Basically, it plays to your strengths in (for example) VB.Net and VBScript, and to what I believe are those languages' strengths, by developing a set of 'modules', each with a single simple purpose.

Input Modules
a) write a set of 'input' modules, one for each type of input you have, for example:
- extract from email -> disk folder. May be VB.Net
- copy from website folder -> disk folder. May be a VBScript module
- (manual) from mail -> scan

Each input module needs to accept various (command-line) parameters unique to how it gets its input - e.g. SMTP/email parameters - plus the directory in which to place the PDFs.
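As a rough illustration of one such input module, here is a minimal sketch in Python (used here only as a stand-in; the answer itself suggests VB.Net or VBScript). It assumes the emails have already been saved to disk as `.eml` files, which the answer does not specify, and the function name `save_pdf_attachments` and the `--mail-dir`/`--out-dir` parameters are made up for the example.

```python
import argparse
from email import policy
from email.parser import BytesParser
from pathlib import Path


def save_pdf_attachments(eml_path, out_dir):
    """Write every PDF attachment found in one .eml file into out_dir."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    with open(eml_path, "rb") as handle:
        message = BytesParser(policy=policy.default).parse(handle)

    saved = []
    for part in message.iter_attachments():
        name = part.get_filename() or ""
        is_pdf = (part.get_content_type() == "application/pdf"
                  or name.lower().endswith(".pdf"))
        if not is_pdf:
            continue
        # Use only the base name of the attachment, never any folder part.
        target = out_dir / (Path(name).name or "attachment.pdf")
        target.write_bytes(part.get_payload(decode=True))
        saved.append(target)
    return saved


def main(argv=None):
    """Command-line entry point: both folders are passed as parameters."""
    parser = argparse.ArgumentParser(
        description="Copy PDF attachments from saved .eml files to a folder")
    parser.add_argument("--mail-dir", required=True)
    parser.add_argument("--out-dir", required=True)
    args = parser.parse_args(argv)

    count = 0
    for eml_path in sorted(Path(args.mail_dir).glob("*.eml")):
        count += len(save_pdf_attachments(eml_path, args.out_dir))
    print(f"saved {count} PDF file(s) to {args.out_dir}")
    return 0
```

To run it as a console program, call `main()` from a small wrapper script. Note that two emails carrying attachments with the same file name would overwrite each other in this sketch; a real module would need a naming scheme that also supports the track/trace requirement.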

Processing Modules
b) write a 'core' PDF extractor - I'm suggesting VB.Net for this rather than VBScript, as I think you'll find its power/flexibility/expressiveness suits the task - as a console program that reads PDFs from disk, extracts the text and stores the result as an XML file on disk

The processing module needs to accept (command-line) parameters for where to read the PDFs from and where to put the output of the extraction (for example, the XML)
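A minimal sketch of what such an extractor could look like, again in Python rather than VB.Net. Everything specific here is an assumption made for illustration: the XML layout (a `<document source="...">` root with one `<page number="...">` child per page), the function names, and the use of the third-party `pypdf` package for the actual text extraction. Any of the tools mentioned in the question could be plugged in instead through the `extract_pages` parameter.

```python
import argparse
import re
import xml.etree.ElementTree as ET
from pathlib import Path

# Characters that are not allowed in XML 1.0 (PDF text often contains some).
_XML_INVALID = re.compile(
    r"[^\x09\x0A\x0D\x20-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]")


def extract_pages_with_pypdf(pdf_path):
    """Return the text of each page. Assumes pypdf is installed."""
    from pypdf import PdfReader  # lazy import: the rest works without it
    reader = PdfReader(str(pdf_path))
    return [page.extract_text() or "" for page in reader.pages]


def pages_to_xml(source_name, pages):
    """Build the (hypothetical) <document>/<page> tree for one PDF."""
    root = ET.Element("document", {"source": source_name})
    for number, text in enumerate(pages, start=1):
        page = ET.SubElement(root, "page", {"number": str(number)})
        page.text = _XML_INVALID.sub("", text)
    return ET.ElementTree(root)


def convert_folder(in_dir, out_dir, extract_pages=extract_pages_with_pypdf):
    """Write one XML file into out_dir for every *.pdf found in in_dir."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for pdf_path in sorted(Path(in_dir).glob("*.pdf")):
        tree = pages_to_xml(pdf_path.name, extract_pages(pdf_path))
        target = out_dir / (pdf_path.stem + ".xml")
        tree.write(str(target), encoding="utf-8", xml_declaration=True)
        written.append(target)
    return written


def main(argv=None):
    """Command-line entry point: input and output folders are parameters."""
    parser = argparse.ArgumentParser(description="Extract PDF text to XML")
    parser.add_argument("--in-dir", required=True)
    parser.add_argument("--out-dir", required=True)
    args = parser.parse_args(argv)
    written = convert_folder(args.in_dir, args.out_dir)
    print(f"wrote {len(written)} XML file(s)")
    return 0
```

This only produces raw page text. Turning that text into named fields (invoice number, totals and so on) depends entirely on the layout of the documents, and is usually where most of the effort goes.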

c) write a database loader module (or use SSIS or ...) that reads the XML files produced by (b) from disk and uploads them into the database.

The database loader will need to accept (command-line) parameters indicating where the XML files are and how to connect to the DB
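A loader along these lines might be sketched as follows. This is only an illustration under several assumptions: Python stands in for whatever language you choose, SQLite (via the standard `sqlite3` module) stands in for the real database, and the XML files are assumed to have a `<document source="...">` root with `<page number="...">` children. The table name `pages` and the parameter names are invented for the example.

```python
import argparse
import sqlite3
import xml.etree.ElementTree as ET
from pathlib import Path

CREATE_TABLE = """
CREATE TABLE IF NOT EXISTS pages (
    source      TEXT    NOT NULL,
    page_number INTEGER NOT NULL,
    body        TEXT,
    PRIMARY KEY (source, page_number)
)
"""


def load_xml_folder(xml_dir, db_path):
    """Load every *.xml file in xml_dir into the pages table; return row count."""
    conn = sqlite3.connect(str(db_path))
    try:
        conn.execute(CREATE_TABLE)
        loaded = 0
        for xml_path in sorted(Path(xml_dir).glob("*.xml")):
            root = ET.parse(str(xml_path)).getroot()
            source = root.get("source", xml_path.name)
            rows = [(source, int(page.get("number")), page.text or "")
                    for page in root.iter("page")]
            # One transaction per file; re-running replaces earlier rows
            # for the same (source, page_number) instead of duplicating them.
            with conn:
                conn.executemany(
                    "INSERT OR REPLACE INTO pages VALUES (?, ?, ?)", rows)
            loaded += len(rows)
        return loaded
    finally:
        conn.close()


def main(argv=None):
    """Command-line entry point: XML folder and database are parameters."""
    parser = argparse.ArgumentParser(description="Load extracted XML into a DB")
    parser.add_argument("--xml-dir", required=True)
    parser.add_argument("--db", required=True, help="path to the SQLite file")
    args = parser.parse_args(argv)
    print(f"loaded {load_xml_folder(args.xml_dir, args.db)} page row(s)")
    return 0
```

Making the load repeatable (here through the primary key plus `INSERT OR REPLACE`) means a failed batch can simply be run again, which helps with the track/trace question raised earlier.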

VBScript is used like a 'DOS batch' language - a 'glue' to bind everything together. It:
- runs each of the input modules
- for each PDF file on disk, runs the PDF extractor
- for each XML file, runs the upload-to-DB module
- runs any audit steps
- can be scheduled or run manually
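To show the shape of such a 'glue' driver, here is a small sketch in Python in place of VBScript. It runs a list of external commands in order and stops at the first one that fails. The helper names `run_pipeline` and `steps_for_files` are made up for the example, and the program names you would pass in are whatever your own modules end up being called.

```python
import subprocess
from pathlib import Path


def steps_for_files(folder, pattern, command_prefix):
    """Build one (name, command) step per matching file in folder.

    command_prefix is the program plus its fixed arguments; the file
    path is appended as the last argument of each command.
    """
    return [(path.name, [*command_prefix, str(path)])
            for path in sorted(Path(folder).glob(pattern))]


def run_pipeline(steps):
    """Run each (name, command) step in order; stop at the first failure.

    Returns a list of (name, exit_code) for the steps that were run,
    which can double as a simple audit trail.
    """
    results = []
    for name, command in steps:
        completed = subprocess.run(command)
        results.append((name, completed.returncode))
        if completed.returncode != 0:
            break
    return results
```

A driver would then build its step list from the input modules, one extractor step per PDF (via `steps_for_files`), one loader step per XML file and any audit steps, and pass the whole list to `run_pipeline`. Because it is an ordinary script, it can be started by hand or by a scheduler.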

Keeping things as separate modules means, for example, that something written in VBScript can be upgraded/replaced later on with something written in VB.Net or C# or even C++. Obviously, some inputs to the modules can come from the command line, and some you may wish to read from config-type files.

[/edit 2]

Patrice T · Answer 2 · 2016-05-18T13:09:00

Solution 3

Your process looks complicated:
Email => convert to PDF => Extract data from PDF => Feed to Excel
I would try something simpler:
Extract from Email => Feed to Excel
Since email is text, it should be simpler to extract data from it.
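If the data really does arrive in the email body, the idea could be sketched roughly like this in Python. The sketch assumes, purely for illustration, that the body contains simple `Label: value` lines; the function names are invented, and the output is CSV, which Excel can open directly.

```python
import csv
import io
from email import message_from_string, policy


def email_to_row(raw_message):
    """Turn 'Label: value' lines in a plain-text email body into a dict."""
    message = message_from_string(raw_message, policy=policy.default)
    body = message.get_body(preferencelist=("plain",))
    text = body.get_content() if body is not None else ""

    row = {"subject": str(message["subject"] or "")}
    for line in text.splitlines():
        label, separator, value = line.partition(":")
        if separator and label.strip():
            row[label.strip().lower()] = value.strip()
    return row


def rows_to_csv(rows, columns):
    """Render the rows as CSV text with the given column order."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=columns,
                            extrasaction="ignore", lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()
```

Any line containing a colon is treated as a field here (a URL or a time of day would be misread), so a real version would need rules matched to the actual messages.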

Comments

Garth J Lancaster 18-May-16 22:23pm

I don't think the poster is doing email => PDF - I think the poster is getting PDF attachments from internal/external (i.e. 3rd party) sources. I've been on the receiving end of this sort of thing a lot of times - you don't get to dictate terms sometimes, you take what you're given and get on with it ... it's surprising what some people call 'EDI' these days, I'd almost wish they used a carrier pigeon instead

Patrice T 18-May-16 23:25pm

I see what you mean. With a little luck, there is a setting allowing him to receive data in XML or real EDI format rather than 'human readable PDF'.

bulrush400 24-May-16 12:19pm

> With a little luck, there is a setting allowing him to receive data in XML or real EDI format rather than 'human readable PDF'.

That's often not true for the most valuable data. Some cases: 1) A gov't entity has the data, but only in PDF form, and doesn't have the budget to hire someone to output an Excel file. 2) A business spends $50,000 on a survey and offers its data only as a PDF to protect its property. Some of these reports can cost a user $500-100 each.

Thus we have a lot of people trying to "scrape" a PDF, which means a very accurate PDF scraper becomes very valuable.