Code to extract plain text from a PDF file

21 Jun 2004
Source code that shows how to decompress and extract text from PDF documents.

Introduction

PDF documents are commonly used and their content is usually compressed. This article presents simple C code that can be used to extract plain text from a PDF file.

Why?

Adobe does allow you to submit PDF files and will extract the text or HTML and mail it back to you. But there are times when you need to extract the text yourself or do it inside an application. You may also want to apply special formatting (e.g., add tabs) so that the text can be easily imported into Excel, for example when your PDF document mostly contains tables that you need to port to Excel (which is how this code came to be developed).

There are several projects on "The Code Project" that show how to create PDF documents, but none that provide free code that shows how to extract text without using a commercial library. In the reader comments, a need was expressed for code just like what is being supplied here.

There are several libraries out there that read or create PDF files, but you have to register them for commercial use or sign various agreements. The code supplied here is very simple and basic, but it is entirely free. It only uses the ZLIB library, which is also free.

Basics

You can download documents such as PDFReference15_v5.pdf from here that explain some of the internals of PDF files. In short, each PDF file contains a number of objects. Each object may require one or more filters to decompress it and may also provide a stream of data. Text streams are usually compressed using the FlateDecode filter and may be uncompressed using code from the ZLIB (http://www.zlib.org/) library.
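Because a stream may use a filter other than FlateDecode, one cheap check before inflating is to see whether the object's dictionary mentions the /FlateDecode name. The following is only a sketch under simplifying assumptions: the function name stream_uses_flate is hypothetical, the dictionary is assumed to sit between the nearest preceding "obj" keyword and the "stream" keyword, and filter chains or indirect references are not interpreted.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: returns 1 if the bytes between the nearest preceding
   "obj" keyword and the "stream" keyword at offset streamstart contain the
   name "/FlateDecode", and 0 otherwise. */
int stream_uses_flate(const char *buf, size_t streamstart)
{
    static const char name[] = "/FlateDecode";
    const size_t nlen = sizeof(name) - 1;
    size_t dictstart = 0;

    assert(buf != NULL);

    /* Walk backwards to the "obj" keyword that opens this object. */
    for (size_t i = streamstart; i >= 3; i--) {
        if (memcmp(buf + i - 3, "obj", 3) == 0) {
            dictstart = i;
            break;
        }
    }

    /* Look for the filter name inside the dictionary region. */
    for (size_t i = dictstart; i + nlen <= streamstart; i++) {
        if (memcmp(buf + i, name, nlen) == 0)
            return 1;
    }
    return 0;
}
```

A caller could skip the inflate step whenever this returns 0, instead of relying on the inflate call to fail on data that was never deflated.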

The data for each object can be found between "stream" and "endstream" sections. Once inflated, the data needs to be processed to extract the text. The data usually contains one or more text objects (starting with BT and ending with ET) with formatting instructions inside. You can learn a lot about the structure of a PDF file by stepping through this application.

About Code

This single source file contains very simple, very basic C code. It first reads the entire PDF file into one buffer and then repeatedly scans for "stream" and "endstream" sections. It does not check which filter should be applied and always assumes FlateDecode. (If it gets this wrong, usually no output is generated for that section of the file, so it is not a big issue.) Once the data stream is inflated (uncompressed), it is processed. During processing, the code searches for the BT and ET tokens that delimit text objects. The contents of each text object are processed to extract the text, and a guess is made as to whether tabs or new line characters are needed.
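The tab-versus-new-line guess depends on remembering the last few characters seen, so that an operator such as TD and the number in front of it can be recognised. The ProcessOutput listing under Code Snippets does this with a small window of recent characters and two helpers, seen2 and ExtractNumber, whose source is not reproduced in this article. The following is only a sketch of how such helpers might look. The names window_ends_with_op and number_before and the window length of 15 are assumptions made for illustration, not the author's code.

```c
#include <assert.h>
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

#define WINDOW 15  /* assumed number of recent characters kept */

static int is_pdf_space(char c)
{
    return c == ' ' || c == '\r' || c == '\n';
}

/* Returns 1 if the window ends with the two-character operator op,
   delimited by white space on both sides (for example " TD "). */
int window_ends_with_op(const char *op, const char win[WINDOW])
{
    return win[WINDOW - 3] == op[0]
        && win[WINDOW - 2] == op[1]
        && is_pdf_space(win[WINDOW - 1])
        && is_pdf_space(win[WINDOW - 4]);
}

/* Parses the decimal number that ends at or before win[last], skipping
   trailing spaces. A leading sign is ignored, so only the magnitude is
   returned. Returns -1 if no number is found. */
float number_before(const char win[WINDOW], int last)
{
    char tmp[WINDOW + 1];

    assert(last >= 0 && last < WINDOW);

    int end = last;
    while (end >= 0 && win[end] == ' ')
        end--;
    int begin = end;
    while (begin >= 0 && (isdigit((unsigned char)win[begin]) || win[begin] == '.'))
        begin--;

    int n = end - begin;  /* the number occupies win[begin+1 .. end] */
    if (n <= 0)
        return -1.0f;
    memcpy(tmp, win + begin + 1, (size_t)n);
    tmp[n] = '\0';
    return (float)atof(tmp);
}
```

With a window holding " BT 0 -14.4 TD ", window_ends_with_op("TD", win) is true and number_before(win, WINDOW - 5) yields the magnitude of the vertical offset, which is the kind of value the listing compares against 1.0 to choose between a new line and a tab.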

The code is far from complete and is not any sort of general utility class, but it does demonstrate how you can extract the text yourself. It is enough to show you how and get you going.

The code is however fully functional, so when it is applied to a PDF document, it generally does a fair job of extracting the text. It has been tested on several PDF files.

This code is supplied as is, no warranties. Use at your own risk.

Using The Code

The download contains one C file. To use it, create a simple Win32 console project and add the pdf.c file to the project. You also need to go here (bless them!) and download the free "zlib compiled DLL" zip file. Extract zdll.lib to your project directory and add it as a project dependency (link against it). Put zlib1.dll, zconf.h, and zlib.h in your project directory as well, and add the two header files to the project.

Now, step through the application and note that the input PDF and output text file names are hardwired at the start of the main function.
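Since the file names are hardwired, a small first change is to move the file loading into a function that takes the path as a parameter. The following is a minimal sketch of reading a whole file into one buffer using only the standard C library. The name read_whole_file is hypothetical and does not appear in the download.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper: reads the whole file at path into a malloc'd buffer.
   On success returns the buffer (with one extra zero byte appended) and
   stores the file length in *len; the caller frees the buffer.
   Returns NULL on any error. */
char *read_whole_file(const char *path, size_t *len)
{
    assert(path != NULL && len != NULL);

    FILE *f = fopen(path, "rb");  /* binary mode matters on Windows */
    if (f == NULL)
        return NULL;

    char *buf = NULL;
    long size = -1;
    if (fseek(f, 0, SEEK_END) == 0 && (size = ftell(f)) >= 0
        && fseek(f, 0, SEEK_SET) == 0) {
        buf = malloc((size_t)size + 1);
        if (buf != NULL && fread(buf, 1, (size_t)size, f) != (size_t)size) {
            free(buf);
            buf = NULL;
        }
    }
    fclose(f);

    if (buf != NULL) {
        buf[size] = '\0';
        *len = (size_t)size;
    }
    return buf;
}
```

The returned length, not strlen, should be used when scanning the buffer, because PDF files routinely contain zero bytes.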

Future Enhancements

If there is enough interest, the author may consider uploading a release version with a Windows interface. The code is quite good at extracting data from tables in a form that can be readily imported into Excel, with the columns preserved (because of the tabs that get added).

Code Snippets

Stream sections are initially located using:

C++
size_t streamstart = FindStringInBuffer (buffer, "stream", filelen);
size_t streamend = FindStringInBuffer (buffer, "endstream", filelen);
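FindStringInBuffer itself is not shown in this article. As a rough idea of what such a search has to do, here is a sketch with a hypothetical name, find_token, and a hypothetical start parameter. It compares bytes with memcmp rather than using strstr, because the buffer holds binary data that can contain zero bytes.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: returns the offset of the first occurrence of the
   zero-terminated token in buf[start..len), or len if it is not found. */
size_t find_token(const char *buf, size_t len, size_t start, const char *token)
{
    assert(buf != NULL && token != NULL);

    size_t tlen = strlen(token);
    if (tlen == 0 || tlen > len)
        return len;

    for (size_t i = start; i + tlen <= len; i++) {
        if (memcmp(buf + i, token, tlen) == 0)
            return i;
    }
    return len;
}
```

One pitfall worth noting: the word "stream" also occurs inside "endstream", so after a stream has been handled the next search should start beyond the end of that "endstream" keyword.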

And then once the data portion is identified, it is inflated as follows:

C++
z_stream zstrm;
ZeroMemory(&zstrm, sizeof(zstrm));
zstrm.avail_in = streamend - streamstart + 1;
zstrm.avail_out = outsize;
zstrm.next_in = (Bytef*)(buffer + streamstart);
zstrm.next_out = (Bytef*)output;
int rsti = inflateInit(&zstrm);
if (rsti == Z_OK)
{
  int rst2 = inflate (&zstrm, Z_FINISH);
  if (rst2 >= 0)
  {
    //Ok, got something, extract the text:
    size_t totout = zstrm.total_out;
    ProcessOutput(fileo, output, totout);
  }
  //Release the state allocated by inflateInit:
  inflateEnd(&zstrm);
}

The main work is done in the ProcessOutput function, which processes the uncompressed stream to extract the text portion of each text object. It looks as follows:

C++
void ProcessOutput(FILE* file, char* output, size_t len)
{
  //Are we currently inside a text object?
  bool intextobject = false;
  //Is the next character literal 
  //(e.g. \\ to get a \ character or \( to get ( ):
  bool nextliteral = false;

  //() Bracket nesting level. Text appears inside ()
  int rbdepth = 0;

  //Keep previous chars to extract numbers etc.:
  char oc[oldchar];
  int j=0;
  for (j=0; j<oldchar; j++) oc[j]=' ';

  for (size_t i=0; i<len; i++)
  {
    char c = output[i];
    if (intextobject)
    {
      if (rbdepth==0 && seen2("TD", oc))
      {
        //Positioning.
        //See if a new line has to start or just a tab:
        float num = ExtractNumber(oc,oldchar-5);
        if (num>1.0)
        {
          fputc(0x0d, file);
          fputc(0x0a, file);
        }
        if (num<1.0)
        {
          fputc('\t', file);
        }
      }
      if (rbdepth==0 && seen2("ET", oc))
      {
        //End of a text object, also go to a new line.
        intextobject = false;
        fputc(0x0d, file);
        fputc(0x0a, file);
      }
      else if (c=='(' && rbdepth==0 && !nextliteral) 
      {
        //Start outputting text!
        rbdepth=1;
        //See if a space or tab (>1000) is called for by looking
        //at the number in front of (
        float num = ExtractNumber(oc,oldchar-1);
        if (num>0)
        {
          if (num>1000.0)
          {
            fputc('\t', file);
          }
          else if (num>100.0)
          {
            fputc(' ', file);
          }
        }
      }
      else if (c==')' && rbdepth==1 && !nextliteral) 
      {
        //Stop outputting text
        rbdepth=0;
      }
      else if (rbdepth==1) 
      {
        //Just a normal text character:
        if (c=='\\' && !nextliteral)
        {
          //Only print out next character 
          //no matter what. Do not interpret.
          nextliteral = true;
        }
        else
        {
          nextliteral = false;
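          //Cast to unsigned char so that bytes 128..254 compare correctly when char is signed:
          if ( ((c>=' ') && (c<='~')) || (((unsigned char)c>=128) && ((unsigned char)c<255)) )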
          {
            fputc(c, file);
          }
        }
      }
    }
    //Store the recent characters for 
    //when we have to go back for a number:
    for (j=0; j<oldchar-1; j++) oc[j]=oc[j+1];
    oc[oldchar-1]=c;
    if (!intextobject)
    {
      if (seen2("BT", oc))
      {
        //Start of a text object:
        intextobject = true;
      }
    }
  }
}
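ProcessOutput treats a backslash inside ( ) as "print the next character as-is". PDF literal strings also allow escapes such as \n and octal codes such as \322, which that rule would print as the characters n or 322. The following is a sketch of a fuller decoder for the body of a literal string. The function name decode_pdf_string is hypothetical, and it assumes the caller has already isolated the bytes between the outer parentheses.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: decodes the body of a PDF literal string from
   src[0..len) into dst, which must have room for at least len bytes.
   Returns the number of bytes written. */
size_t decode_pdf_string(const char *src, size_t len, char *dst)
{
    size_t out = 0;

    assert(src != NULL && dst != NULL);

    for (size_t i = 0; i < len; i++) {
        char c = src[i];
        if (c != '\\' || i + 1 >= len) {
            dst[out++] = c;  /* ordinary byte, or a trailing backslash */
            continue;
        }
        c = src[++i];
        switch (c) {
        case 'n': dst[out++] = '\n'; break;
        case 'r': dst[out++] = '\r'; break;
        case 't': dst[out++] = '\t'; break;
        case 'b': dst[out++] = '\b'; break;
        case 'f': dst[out++] = '\f'; break;
        case '\r':  /* backslash at end of line: line continuation */
            if (i + 1 < len && src[i + 1] == '\n')
                i++;
            break;
        case '\n':
            break;
        default:
            if (c >= '0' && c <= '7') {
                /* Octal escape of one to three digits, e.g. \322. */
                unsigned value = (unsigned)(c - '0');
                int digits = 1;
                while (digits < 3 && i + 1 < len
                       && src[i + 1] >= '0' && src[i + 1] <= '7') {
                    value = value * 8 + (unsigned)(src[++i] - '0');
                    digits++;
                }
                dst[out++] = (char)(value & 0xFF);
            } else {
                dst[out++] = c;  /* \( \) \\ and unknown escapes */
            }
        }
    }
    return out;
}
```

The decoded bytes are still in whatever encoding the font uses, so mapping a byte such as octal 322 to a printable quotation mark remains a separate step.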

License

This article has no explicit license attached to it but may contain usage terms in the article text or the download files themselves. If in doubt please contact the author via the discussion board below.


Written By
Web Developer
Canada

Comments and Discussions

 
Re: Vb Equivalent
ShaneParker, 11-Dec-08 0:38
Hardly what I would call robust or efficient... but it works...

You need to reference the microsoft scripting runtime and the zlib.dll (I don't recall where I found that).


Option Explicit

Private Const oldchar = 15

Private Type StreamLimit
PtrStart As Long
PtrEnd As Long
End Type

'Declares
Private Declare Function ShellAbout Lib "shell32.dll" Alias "ShellAboutA" (ByVal hwnd As Long, ByVal szApp As String, ByVal szOtherStuff As String, ByVal hIcon As Long) As Long
Private Declare Sub CopyMemory Lib "kernel32" Alias "RtlMoveMemory" (hpvDest As Any, hpvSource As Any, ByVal cbCopy As Long)
Private Declare Function compress Lib "zlib.dll" (dest As Any, destLen As Any, src As Any, ByVal srcLen As Long) As Long
Private Declare Function uncompress Lib "zlib.dll" (dest As Any, destLen As Any, src As Any, ByVal srcLen As Long) As Long

Public Function FastExtractText(FileName As String)
Dim result As String
Dim InBuffer() As Byte
Dim BufferLength As Long
Dim Stream() As Byte
Dim Ptr As Long
Dim PDFData As String
Dim Limits As StreamLimit
Dim ThisString As String

'//Open the PDF source file:
InBuffer = OpenFileToBytes(FileName)
BufferLength = UBound(InBuffer)

'//Now search the buffer repeated for streams of data:
Ptr = 0
Limits = FastNextCleanStream(InBuffer, Ptr)
While ((Limits.PtrEnd > 0) And (Limits.PtrEnd < BufferLength))
'//Now use zlib to inflate:
PDFData = FastInflate(FileName, Limits)
ThisString = ProcessPDFData(PDFData)
result = result + ThisString
Limits = FastNextCleanStream(InBuffer, Limits.PtrEnd + 9)
DoEvents
Wend
FastExtractText = result

End Function

Private Function OpenFileToBytes(FileName As String, Optional Start As Long = 0, Optional Length As Long = 0) As Byte()
Dim FS As New Scripting.FileSystemObject
Dim InFile() As Byte
Dim FN As Long

If FS.FileExists(FileName) Then
ReDim InFile(FileLen(FileName))
FN = FreeFile
Open FileName For Binary As FN
If Start <= 0 Then
Get FN, , InFile
ElseIf Length <= 0 Then
Get FN, Start, InFile
Else
ReDim InFile(Length)
Get FN, Start, InFile
End If
OpenFileToBytes = InFile
Close FN
End If

End Function


Private Function ProcessPDFData(PDFData As String) As String
'Are we currently inside a text object?
Dim intextobject As Boolean
intextobject = False

'Is the next character literal (e.g. \\ to get a \ character or \( to get ( ):
Dim nextliteral As Boolean
nextliteral = False

'() Bracket nesting level. Text appears inside ()
Dim rbdepth As Integer
rbdepth = 0

'Keep previous chars to get extract numbers etc.:
Dim oc(oldchar) As Byte

Dim Ptr As Long
Dim c As String
Dim num As Double
Dim Length As Long

Length = Len(PDFData)

Dim NxtBT As Long
Dim NxtET As Long
Dim NxtOB As Long
Dim NxtCB As Long

Dim result As String
NxtET = 1

Dim done As Boolean
done = False

Do While Not done
NxtBT = GetNextSpaced("BT", NxtET, PDFData)
If NxtBT <= 0 Then
'Nup - nothing left
Exit Do
End If
NxtET = GetNextSpaced("ET", NxtBT, PDFData)
NxtOB = FindNextNoEscape("(", NxtBT, PDFData)
Do While (NxtOB < NxtET)
NxtCB = FindNextNoEscape(")", NxtOB, PDFData)
If NxtCB < NxtET Then
If NxtCB >= 0 Then
If result <> "" Then
result = result + "" + Mid(PDFData, NxtOB + 1, NxtCB - NxtOB - 1)
Else
result = Mid(PDFData, NxtOB + 1, NxtCB - NxtOB - 1)
End If
NxtOB = FindNextNoEscape("(", NxtCB, PDFData)
Else
NxtOB = NxtET
End If
Else
Do Until NxtET > NxtCB
NxtET = GetNextSpaced("ET", NxtET, PDFData)
If NxtET <= 0 Then
'Data error...
Exit Do
End If
Loop
End If
Loop
Loop

ProcessPDFData = ReplaceSpecialCharacters(result)

End Function

Private Function FastNextCleanStream(sBuffer() As Byte, ByRef After As Long) As StreamLimit
Dim Limits As StreamLimit
Dim sDelimit As String
Dim eDelimiter As String

sDelimit = "stream"

Limits = FastGrabStreamLimits(sBuffer, After, sDelimit, "endstream")
After = After + Limits.PtrEnd + Len("endstream")
Limits = FastCleanStream(sBuffer, Limits)
FastNextCleanStream = Limits
End Function

Private Function GetNextSpaced(Chars As String, After As Long, InData As String) As Long
Dim DataLen As Long
Dim result As Long
Dim found As Boolean
Dim CheckChar As String

DataLen = Len(InData)
found = False
result = After + 1
Do While Not found And result > 0
result = InStr(result, InData, Chars, vbTextCompare)
If result <= 0 Then
Exit Do
End If
CheckChar = Mid(InData, result - 1, 1)
If CheckChar = " " Or CheckChar = Chr(10) Or CheckChar = Chr(13) Then
If (result + Len(Chars) + 2) <= DataLen Then
CheckChar = Mid(InData, result + Len(Chars), 1)
If CheckChar = " " Or CheckChar = Chr(10) Or CheckChar = Chr(13) Then
found = True
Exit Do
Else
result = result + Len(Chars)
End If
Else
Exit Do
End If
Else
result = result + Len(Chars)
End If
Loop

If found Then
GetNextSpaced = result
Else
GetNextSpaced = -1
End If
End Function

Private Function FindNextNoEscape(Char As String, After As Long, InData As String) As Long
Dim DataLen As Long
Dim result As Long
Dim found As Boolean
Dim CheckChar As String

found = False
result = After
Do While Not found And result > 0
result = InStr(result, InData, Char, vbBinaryCompare)
If result > After Then
CheckChar = Mid(InData, result - 1, 1)
If Not (CheckChar = "\") Then
found = True
Else
result = result + 1
End If
Else
found = True
End If
Loop

If found Then
FindNextNoEscape = result
Else
FindNextNoEscape = -1
End If
End Function

Private Function ReplaceSpecialCharacters(InString As String) As String
Dim result As String

result = InString
result = Replace(result, "\\", "'")
result = Replace(result, "\322", Chr(34)) ' "
result = Replace(result, "\323", Chr(34)) ' "
result = Replace(result, "\252", " TM")
result = Replace(result, "\320", "-")
result = Replace(result, "\311", "...")
result = Replace(result, "'", "\")
result = Replace(result, "\325", "'")
result = Replace(result, "\", "")
'Result = Replace(Result, "\)", ")")
ReplaceSpecialCharacters = result
End Function

Private Function FastGrabStreamLimits(Buffer() As Byte, After As Long, StartDelimiter As String, EndDelimiter As String) As StreamLimit
Dim result As StreamLimit

result.PtrStart = FindSpacedBytesLocation(After, Buffer, StartDelimiter)
'Result.PtrStart = GetNextSpaced(StartDelimiter, After, Buffer)
If result.PtrStart > 0 Then
result.PtrStart = result.PtrStart + Len(StartDelimiter)
result.PtrEnd = FindSpacedBytesLocation(result.PtrStart, Buffer, EndDelimiter)
Else
result.PtrStart = UBound(Buffer)
End If
FastGrabStreamLimits = result
End Function

Private Function FindSpacedBytesLocation(Start As Long, Buffer() As Byte, Search As String) As Long
Dim result As Long
Dim LastResult As Long
Dim done As Boolean

result = 0
Do Until done
LastResult = result
result = FindBytesLocation(Start, Buffer, Search)
done = True
If result = LastResult Then
result = 0
done = True
End If
If result <= 0 Then
done = True
End If
Loop
FindSpacedBytesLocation = result
End Function

Private Function FindBytesLocation(Start As Long, Buffer() As Byte, Search As String) As Long
Dim result As Long
Dim Ptr As Long
Dim Max As Long
Dim Srch As Byte
Dim SrchLen As Long
Dim found As Boolean

Max = UBound(Buffer)
Srch = Asc(Mid(Search, 1, 1))
SrchLen = Len(Search)
found = False
For result = Start To Max
If Buffer(result) = Srch Then
found = True
For Ptr = 1 To SrchLen - 1
If Not Buffer(result + Ptr) = Asc(Mid(Search, Ptr + 1, 1)) Then
found = False
Exit For
End If
Next Ptr
If found Then
Exit For
Else
found = False
result = result + Ptr
End If
End If
Next result

If found Then
FindBytesLocation = result
Else
FindBytesLocation = -1
End If
End Function

Private Function FastInflate(FileName As String, ZLData As StreamLimit) As String
Dim zstrm As New ZLIBTOOLLib.ZlibTool
Dim tmpZipped As String
Dim tmpUnZipped As String
Dim ZippedData() As Byte

tmpZipped = "C:\temp\tmpin": tmpUnZipped = "C:\temp\tmpout"
ZippedData = OpenFileToBytes(FileName, ZLData.PtrStart + 1, ZLData.PtrEnd - ZLData.PtrStart - 1)
DumpStreamToDisk ZippedData, tmpZipped
zstrm.InputFile = tmpZipped: zstrm.OutputFile = tmpUnZipped
zstrm.Decompress
FastInflate = OpenFileToText(tmpUnZipped)
Kill tmpZipped: Kill tmpUnZipped
End Function

Private Function FastCleanStream(Buffer As Variant, Limits As StreamLimit) As StreamLimit
Dim BufferLen As Long
Dim i As Long
Dim done As Boolean

BufferLen = Len(Buffer)
With Limits
done = False
i = 0
Do Until done
i = i + 1
If BufferLen < .PtrStart + i Then
done = True
Else
If Not _
(Buffer(.PtrStart + i) = 10 Or _
Buffer(.PtrStart + i) = 13 Or _
Buffer(.PtrStart + i) = 32) Then
done = True
End If
End If
Loop
.PtrStart = .PtrStart + i
done = False
i = 0
Do Until done
i = i + 1
If .PtrEnd - i <= 0 Then
done = True
Else
If Not _
(Buffer(.PtrEnd - i) = 10 Or _
Buffer(.PtrEnd - i) = 13 Or _
Buffer(.PtrEnd - i) = 32) Then
done = True
End If
End If
Loop
.PtrEnd = .PtrEnd - i
End With

FastCleanStream = Limits
End Function

Private Sub DumpStreamToDisk(Stream As Variant, FileName As String)
Dim OutFileNum As Long
Dim Data() As Byte

On Error Resume Next
Kill FileName
On Error GoTo 0
OutFileNum = FreeFile
Open FileName For Binary As OutFileNum
Data = Stream
Put OutFileNum, , Data
Close OutFileNum
End Sub

Private Function OpenFileToText(FileName As String) As String
Dim FS As New Scripting.FileSystemObject
Dim InFile As TextStream

If FS.FileExists(FileName) Then
Set InFile = FS.OpenTextFile(FileName)
If Not InFile.AtEndOfStream Then
OpenFileToText = InFile.ReadAll
End If
End If
End Function

Cheers Mate!
