Posted 20 May 2006


Extract Text from PDF in C# (100% .NET)

20 May 2006 · CPOL
A simple class to extract plain text from PDF documents with iTextSharp

Introduction

This is a 100% .NET solution to extract text from PDF documents.

Background

Dan Letecky posted a nice article showing how to extract text from PDF documents in C# using PDFBox. His solution works well, but it has one drawback: the additional libraries it requires total almost 16 MB. With iTextSharp, the required additional libraries total only 2.3 MB.

Using the Code

To use this solution in your projects, follow these steps:

  • Add references to itextsharp.dll and SharpZiplib.dll
  • Add the PDFParser.cs class to your project

Then you can use the newly added class in the following way:

C#
// Create an instance of the PDFParser class
PDFParser pdfParser = new PDFParser();

// Extract the text (pdfFile is assumed here to identify the input PDF)
String result = pdfParser.ExtractText(pdfFile);

I also created a small console application which uses the class and shows the progress of the conversion. Please keep in mind that if you extract text from big PDF files, keeping all the resulting text in memory is not the best solution. In these cases, you should write the extracted text to a file after parsing each page.
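
The page-by-page approach for big files could look roughly like the sketch below. It is not part of PDFParser.cs: the class name PageTextWriter, the method WritePages and the extractPage callback are hypothetical names used only for illustration, and the callback stands in for whatever code returns the text of a single page.

```csharp
using System;
using System.IO;
using System.Text;

static class PageTextWriter
{
    // Writes the text of each page to outputPath as soon as it is extracted,
    // so only one page of text is held in memory at a time.
    // extractPage is a hypothetical callback: given a 1-based page number,
    // it returns the text of that page.
    public static void WritePages(int pageCount, Func<int, string> extractPage, string outputPath)
    {
        using (var writer = new StreamWriter(outputPath, false, Encoding.UTF8))
        {
            for (int page = 1; page <= pageCount; page++)
            {
                writer.WriteLine(extractPage(page));
            }
        }
    }
}
```

The using block closes the file even if the extraction of a page throws, so the text of the pages written so far is not lost.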

How Does It Work?

My code is based on the algorithm used by ExtractPDFText, which is written in C. I use iTextSharp's PdfReader class to get the deflated content of every page, and then a simple function, ExtractTextFromPDFBytes, to extract the text from that content.
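
To give an idea of what extracting text from a page's content involves, here is a much simplified sketch. It is not the ExtractTextFromPDFBytes function from the download; ContentStreamText and Extract are hypothetical names. It assumes the bytes are already decompressed and that every literal string, written as ( ... ) in the content stream, is text to be shown. It ignores text positioning, hex strings and font encodings.

```csharp
using System;
using System.Text;

static class ContentStreamText
{
    // Collects the bodies of all literal strings "( ... )" in a decompressed
    // content stream. Bytes are mapped one-to-one to characters (Latin-1 style).
    // A backslash escapes the next byte, which is copied literally; escape
    // sequences such as \n or octal codes are NOT decoded in this sketch.
    public static string Extract(byte[] content)
    {
        var text = new StringBuilder();
        int depth = 0; // nesting depth of parentheses; 0 means outside a string

        for (int i = 0; i < content.Length; i++)
        {
            char c = (char)content[i];

            if (depth == 0)
            {
                if (c == '(') depth = 1; // a literal string starts here
                continue;                // operators and operands are skipped
            }

            if (c == '\\' && i + 1 < content.Length)
            {
                text.Append((char)content[++i]);
            }
            else if (c == '(')
            {
                depth++;                 // balanced parentheses may be nested
                text.Append(c);
            }
            else if (c == ')')
            {
                depth--;
                if (depth > 0) text.Append(c);
            }
            else
            {
                text.Append(c);
            }
        }
        return text.ToString();
    }
}
```

For example, the content "BT /F1 12 Tf (Hello ) Tj [(Wor) -20 (ld)] TJ ET" yields "Hello World", because the strings shown by the Tj and TJ operators are simply concatenated.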

Further Improvements

Although the code worked well for me, I could not find in Adobe's PDF Reference how to parse special characters. If someone knows how to do this, please post it and I will update the class.
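
One kind of special character that the PDF Reference does describe is the escape sequence inside a literal string: \n, \r, \t, \b, \f, \(, \), \\ and \ddd (one to three octal digits), plus a backslash at the end of a line, which continues the string on the next line. The following is a minimal sketch of decoding these, not an update to the class; PdfLiteralString and Unescape are hypothetical names. It does not address characters that depend on the font's encoding, which may well be the harder part of the problem.

```csharp
using System;
using System.Text;

static class PdfLiteralString
{
    // Decodes the escape sequences in the body of a PDF literal string
    // (the text between the outer parentheses).
    public static string Unescape(string raw)
    {
        var result = new StringBuilder();
        for (int i = 0; i < raw.Length; i++)
        {
            char c = raw[i];
            if (c != '\\') { result.Append(c); continue; }
            if (++i >= raw.Length) break; // lone trailing backslash: ignored

            char e = raw[i];
            switch (e)
            {
                case 'n': result.Append('\n'); break;
                case 'r': result.Append('\r'); break;
                case 't': result.Append('\t'); break;
                case 'b': result.Append('\b'); break;
                case 'f': result.Append('\f'); break;
                case '\r': // backslash + end of line: line continuation
                    if (i + 1 < raw.Length && raw[i + 1] == '\n') i++;
                    break;
                case '\n':
                    break;
                default:
                    if (e >= '0' && e <= '7')
                    {
                        // Up to three octal digits give a character code
                        int code = 0, digits = 0;
                        while (digits < 3 && i < raw.Length && raw[i] >= '0' && raw[i] <= '7')
                        {
                            code = code * 8 + (raw[i] - '0');
                            i++;
                            digits++;
                        }
                        i--; // the for loop advances past the last digit
                        result.Append((char)(code & 0xFF)); // high-order overflow is ignored
                    }
                    else
                    {
                        result.Append(e); // \( \) \\ and unknown escapes: drop the backslash
                    }
                    break;
            }
        }
        return result.ToString();
    }
}
```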

History

  • 20th May, 2006: Initial post

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL).

About the Author

Zollor
Web Developer
Romania

Comments and Discussions

 
  • General: Using this in a web application (BaxterBressler, 22-May-07 2:46)
  • Question: Error (srochford@ardrua.com, 16-Apr-07 9:29)
  • Answer: Re: Error (talbot_c, 5-Oct-09 19:38)
  • General: Error with some pdf's (godsvision3527-Nov-06 3:54)
  • General: Does not extract any text with some pdf, but pdfbox can (petoulachi, 23-May-06 2:58)
  • General: Re: Does not extract any text with some pdf, but pdfbox can (rajaher14-Sep-06 0:54)
  • General: Re: Does not extract any text with some pdf, but pdfbox can (Mikael Svenson, 8-Mar-08 5:06)
  • General: Doesn't extract all text (Kevin Whitefoot, 23-May-06 2:50)
    I tried the demo command line tool. Looks good except that it didn't extract all the text from the document. I have a suspicion that part of the text was marked with a code to prevent it being extracted in addition to the global flag.
  • General: Re: Doesn't extract all text (Manuel__83, 10-Oct-06 2:31)
  • General: not supporting non-ASCII characters (Huisheng Chen, 21-May-06 16:08)
  • Answer: Re: not supporting non-ASCII characters (petoulachi, 22-May-06 3:51)
