I've been researching the best way to parse (or extract) a PDF file into Excel or XML. I've looked at iText and ByteScout, and they may be the best for what I need to do. I'm also considering coding it myself in VB.NET or VBScript, but I need to be pointed in the right direction to get started. Any help would be greatly appreciated.

KM

What I have tried:

I have tried both ByteScout and Aspose.PDF. They may work, but I don't fully understand them. I've looked at iText also.
Updated 18-May-16 13:09pm

That could be a bit of a big/broad question for the 'Quick Answers' section. I suspect there are many tools that would do the job for you, but it's what you haven't mentioned that would determine the overall approach and possibly the 'tools', language, etc.

Some questions that might be asked (ie, forming 'requirements') :-

- "large amount of documents" (how many ?)
- how much text is in each document ?
- what is the source of the documents - file system/web server/email/database (etc) ?
- why Excel or XML for output - what do you need to do with the extracted text, eg, search it, reformat it ?
- are you envisaging a batch process or a real-time/on-demand process ?
- do you have a budget ? ie, can you pay for tools ?
- what are the time factors for delivering your 'project' I'll call it ?
- how are you going to track/trace documents extracted etc
(and probably lots more)

You say (of ByteScout & Aspose.PDF) "but I don't fully understand them" - we don't know your background or how much experience you have. If you're going to have to write and support something, you may be better off with a $$ product so you can use the product supplier for help & support - any decent SDK should also come with a number of examples/samples and support. This is a 'buy vs build' question.

Answers/thoughts to/on some of those questions above might also suggest VB.Net for example over VBScript - ie, robustness, level of automation, ...

So, I'm sorry, there's no 'best way' based on the information you have shown - there could be lots of good ways and more bad ways - the extract 'tool' is only a small part of the solution

[edit : Added]
You could also 'outsource' the extraction to a bureau/service of course - you send them the PDFs and they send you back the data in the format you require - no coding required on your part !
[/edit]

[edit 2]

OK, I would 'start' with a solution that goes along the lines of the following, recognising that you may evolve some parts later on. Basically, it plays upon your strengths in (for example) VB.Net and VBScript and what I believe are their strengths, by developing a set of 'modules' - each 'module' has a simple purpose

Input Modules
a) write a set of 'input' modules - one for each type of input you have, for example
extract from email -> disk folder. May be VB.Net
copy from website folder -> disk folder. May be VBScript
(manual) from mail ? scan

Each input module needs to be able to accept various parameters (command line) unique to how it's getting its input - eg SMTP/email parameters, and the directory into which to place the PDFs
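As a rough illustration of one such input module, here is a minimal sketch that pulls PDF attachments out of saved email messages and drops them into a disk folder. It is written in Python purely as a neutral illustration; the same shape would apply in VB.Net. The function names, the `--eml-dir` and `--out-dir` parameters, and the assumption that the mail has already been saved as `.eml` files are all hypothetical, not something the poster described.

```python
import argparse
from email import message_from_bytes, policy
from pathlib import Path


def save_pdf_attachments(eml_bytes: bytes, out_dir: Path) -> list[Path]:
    """Write every PDF attachment found in one raw email message to out_dir."""
    out_dir.mkdir(parents=True, exist_ok=True)
    msg = message_from_bytes(eml_bytes, policy=policy.default)
    saved = []
    for part in msg.iter_attachments():
        name = part.get_filename() or ""
        if name.lower().endswith(".pdf"):
            # Keep only the file name, dropping any folder the sender included.
            target = out_dir / Path(name).name
            target.write_bytes(part.get_payload(decode=True))
            saved.append(target)
    return saved


def main(argv=None) -> int:
    """Command-line entry point: read parameters, process every .eml file."""
    parser = argparse.ArgumentParser(description="Email -> disk folder input module")
    parser.add_argument("--eml-dir", required=True, type=Path,
                        help="folder holding saved .eml messages (assumed input)")
    parser.add_argument("--out-dir", required=True, type=Path,
                        help="folder into which the PDFs are placed")
    args = parser.parse_args(argv)
    count = 0
    for eml in sorted(args.eml_dir.glob("*.eml")):
        count += len(save_pdf_attachments(eml.read_bytes(), args.out_dir))
    print(f"saved {count} PDF(s) to {args.out_dir}")
    return 0
```

To use it as a console program, call `main()` from a `__main__` guard. A module that reads a live mailbox would take server/credential parameters instead of `--eml-dir`, but the output side (a folder of PDFs) would stay the same.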

Processing Modules
b) write a 'core' PDF Extractor - I'm suggesting VB.Net for this rather than VBScript - I think you'll find the power/flexibility/expressiveness suits the task - a console program that reads the PDFs from disk, extracts the text and stores the XML as a file on disk

The processing module needs to be able to accept parameters (command line) saying where to read the PDFs from and where to put (for example) the XML output from the extraction
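A minimal sketch of that 'read a folder of PDFs, write one XML file per PDF' loop might look like the following (Python used only for illustration). The actual text extraction is deliberately left as a pluggable function, `extract_pages`, because it depends entirely on which PDF library you settle on; that name, the `<document>`/`<page>` XML layout and the folder arguments are assumptions made up for this example.

```python
import xml.etree.ElementTree as ET
from pathlib import Path
from typing import Callable, Iterable


def pages_to_xml(source_name: str, pages: Iterable[str]) -> ET.ElementTree:
    """Wrap the extracted page texts in a simple (hypothetical) XML layout."""
    root = ET.Element("document", {"source": source_name})
    for number, text in enumerate(pages, start=1):
        page = ET.SubElement(root, "page", {"number": str(number)})
        page.text = text  # ElementTree escapes <, > and & when writing
    return ET.ElementTree(root)


def extract_folder(in_dir: Path, out_dir: Path,
                   extract_pages: Callable[[Path], Iterable[str]]) -> list[Path]:
    """Run the supplied extractor over every *.pdf in in_dir, one XML file each."""
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for pdf in sorted(in_dir.glob("*.pdf")):
        target = out_dir / (pdf.stem + ".xml")
        tree = pages_to_xml(pdf.name, extract_pages(pdf))
        tree.write(str(target), encoding="utf-8", xml_declaration=True)
        written.append(target)
    return written
```

Keeping the library call behind one function means you can trial a couple of products against the same set of sample PDFs without touching the rest of the module.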

c) write a database loader module (or use SSIS or ...) that reads an XML file from (b) from disk and uploads into the database.

The database module/loader will need to be able to accept (command line) parameters to indicate where the XML files are, and how to connect to the DB
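For the loader, a minimal sketch could be as small as this (again Python for illustration only, with SQLite standing in for whatever database you actually use). It assumes each XML file has a `<document source="...">` root with `<page number="N">` children; that layout, the `page_text` table and the function name are all hypothetical.

```python
import sqlite3
import xml.etree.ElementTree as ET
from pathlib import Path


def load_xml_folder(xml_dir: Path, conn: sqlite3.Connection) -> int:
    """Load every *.xml file in xml_dir into the page_text table; return rows written."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS page_text ("
        " source TEXT NOT NULL, page INTEGER NOT NULL, body TEXT,"
        " PRIMARY KEY (source, page))"
    )
    rows = 0
    for xml_file in sorted(xml_dir.glob("*.xml")):
        root = ET.parse(str(xml_file)).getroot()
        source = root.get("source", xml_file.name)
        with conn:  # one transaction per file: commit on success, roll back on error
            for page in root.iter("page"):
                conn.execute(
                    "INSERT OR REPLACE INTO page_text (source, page, body)"
                    " VALUES (?, ?, ?)",
                    (source, int(page.get("number")), page.text or ""),
                )
                rows += 1
    return rows
```

Using `INSERT OR REPLACE` keyed on (source, page) means re-running the loader over the same folder does not create duplicate rows, which helps with the 'track/trace' question raised earlier.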

VBScript is used like a 'DOS batch' language - a 'glue' to bind everything together .. it :-
- runs each of the input modules
- for each PDF File on disk, runs the PDF extractor
- for each XML file runs the upload to DB module
- runs any audit steps
- can be scheduled or run manually
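The glue steps above can be sketched roughly as follows. This is shown in Python rather than VBScript purely as an illustration of the control flow; the executable names (`FetchMail.exe`, `CopyFromWeb.exe`, `PdfExtract.exe`, `XmlToDb.exe`) and their arguments are invented placeholders for whatever modules you end up writing.

```python
import subprocess
from pathlib import Path
from typing import Callable, Sequence

# Hypothetical module executables; substitute your own.
INPUT_MODULES = [["FetchMail.exe", "--out-dir"], ["CopyFromWeb.exe", "--out-dir"]]
EXTRACTOR = "PdfExtract.exe"
LOADER = "XmlToDb.exe"


def run_checked(cmd: Sequence[str]) -> None:
    """Run one module and stop the pipeline if it exits with a nonzero code."""
    subprocess.run(list(cmd), check=True)


def run_pipeline(pdf_dir: Path, xml_dir: Path,
                 run: Callable[[Sequence[str]], None] = run_checked) -> list[list[str]]:
    """Run input, extract and load steps in order; return the commands issued."""
    log: list[list[str]] = []

    def step(cmd: list[str]) -> None:
        run(cmd)
        log.append(cmd)  # doubles as a very simple audit trail

    for module in INPUT_MODULES:                      # runs each of the input modules
        step([*module, str(pdf_dir)])
    for pdf in sorted(pdf_dir.glob("*.pdf")):         # one extractor run per PDF
        step([EXTRACTOR, str(pdf), str(xml_dir / (pdf.stem + ".xml"))])
    for xml_file in sorted(xml_dir.glob("*.xml")):    # one loader run per XML file
        step([LOADER, str(xml_file)])
    return log
```

Because `run` is passed in, the sequencing can be dry-run with a stand-in function before any of the real modules exist. Scheduling is then just a matter of pointing Task Scheduler (or similar) at the script.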

Keeping things as separate modules means, for example, that something written in VBScript can be upgraded/replaced with something written in VB.Net or C# or even C++ later on. Obviously, some inputs to the modules can be command-line, some you may wish to read from config-type files

[/edit 2]
 
Your process looks complicated.
Email => convert to PDF => Extract data from PDF => Feed to Excel
I would try something simpler.
Extract from Email => Feed to Excel
Since Email is text, it should be simpler to extract data.
 
Comments
Garth J Lancaster 18-May-16 22:23pm    
I don't think the poster is doing email => PDF - I think the poster is getting PDF attachments from internal/external, ie 3rd party, sources. I've been on the receiving end of this sort of thing a lot of times - you don't get to dictate terms sometimes, you take what you're given and get on with it ... it's surprising what some people call 'EDI' these days, I'd almost wish they used a carrier pigeon instead
Patrice T 18-May-16 23:25pm    
I see what you mean. With a little luck, there is a setting allowing him to receive data in XML or real EDI format rather than 'human readable PDF'.
bulrush400 24-May-16 12:19pm    
> With a little luck, there is a setting allowing him to receive data in XML or real EDI format rather than 'human readable PDF'.

That's often not true for the most valuable data. Some cases: 1) A gov't entity has the data, but only in PDF form. They don't have the budget to hire someone to output an Excel file. 2) A business spends $50,000 to do a survey and only offers their data as a PDF to protect their property. Some of these reports can cost a user $500-100 each.

Thus we have a lot of people trying to "scrape" a PDF, which means a very accurate PDF scraper becomes very valuable.

This content, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)


