Code to extract plain text from a PDF file

21 Jun 2004
Source code that shows how to decompress and extract text from PDF documents.

Introduction

PDF documents are commonly used and their content is usually compressed. This article presents simple C code that extracts plain text from a PDF file.

Why?

Adobe allows you to submit PDF files and will extract the text or HTML and mail it back to you. But there are times when you need to extract the text yourself or do it inside an application. You may also want to apply special formatting (e.g., add tabs) so that the text can be easily imported into Excel, for example when your PDF document mostly contains tables that you need to port to Excel (which is how this code came to be developed).

There are several projects on "The Code Project" that show how to create PDF documents, but none that provide free code that shows how to extract text without using a commercial library. In the reader comments, a need was expressed for code just like what is being supplied here.

There are several libraries out there that read or create PDF files, but you have to register them for commercial use or sign various agreements. The code supplied here is very simple and basic, but it is entirely free. It only uses the ZLIB library, which is also free.

Basics

Documents such as PDFReference15_v5.pdf explain some of the internals of PDF files. In short, each PDF file contains a number of objects. Each object may provide a stream of data and may require one or more filters to decode it. Text streams are usually compressed using the FlateDecode filter and may be uncompressed using code from the ZLIB (http://www.zlib.org/) library.

The data for each object can be found between the "stream" and "endstream" keywords. Once inflated, the data needs to be processed to extract the text. The data usually contains one or more text objects (starting with BT and ending with ET) with formatting instructions inside. You can learn a lot about the structure of a PDF file by stepping through this application.
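To make the BT/ET structure concrete, here is a minimal sketch that counts the text objects in an already inflated stream. The sample stream, the function names and the whitespace-delimiter rule are illustrative assumptions, not part of the article's source code.

```c
#include <ctype.h>
#include <stddef.h>

/* Hypothetical sample of an inflated content stream (not from a real file). */
const char sample_stream[] =
    "BT\n/F1 12 Tf\n72 712 Td\n(Hello) Tj\nET\n"
    "BT\n72 690 Td\n(World) Tj\nET\n";

/* True if the two-character operator `op` starts at data[i] and is
   delimited by whitespace or by the edges of the buffer. */
static int operator_at(const char *data, size_t len, size_t i, const char *op)
{
    if (i + 2 > len || data[i] != op[0] || data[i + 1] != op[1]) return 0;
    if (i > 0 && !isspace((unsigned char)data[i - 1])) return 0;
    if (i + 2 < len && !isspace((unsigned char)data[i + 2])) return 0;
    return 1;
}

/* Count complete BT ... ET pairs, ignoring anything inside ( ) strings. */
size_t count_text_objects(const char *data, size_t len)
{
    size_t count = 0;
    int in_text = 0;   /* inside a BT ... ET pair? */
    int depth = 0;     /* ( ) nesting level of the current string */

    for (size_t i = 0; i < len; i++) {
        char c = data[i];
        if (depth > 0) {
            if (c == '\\' && i + 1 < len) { i++; continue; } /* skip escaped char */
            if (c == '(') depth++;
            else if (c == ')') depth--;
            continue;
        }
        if (c == '(') { depth = 1; continue; }
        if (!in_text && operator_at(data, len, i, "BT")) {
            in_text = 1;
        } else if (in_text && operator_at(data, len, i, "ET")) {
            in_text = 0;
            count++;
        }
    }
    return count;
}
```

Calling count_text_objects(sample_stream, sizeof(sample_stream) - 1) walks the two BT ... ET pairs in the sample. Text inside parentheses is skipped, so a string that happens to contain "ET" does not end the text object early.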

About Code

This single source file contains very simple, very basic C code. It first reads the entire PDF file into one buffer and then repeatedly scans for "stream" and "endstream" sections. It does not check which filter should be applied and always assumes FlateDecode. (If it gets it wrong, usually no output is generated for that section of the file, so it is not a big issue.) Once the data stream is inflated (uncompressed), it is processed. During processing, the code searches for the BT and ET tokens that mark text objects. The contents of each text object are processed to extract the text, and a guess is made as to whether tabs or newline characters are needed.
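If you do want to check the filter before inflating, one possible approach is to look at the text between the nearest preceding "obj" keyword and the "stream" keyword. The sketch below assumes the filter name appears literally in that region; the function names are hypothetical and this check is not in the article's code.

```c
#include <stddef.h>
#include <string.h>

/* Search for `needle` within haystack[0..len); returns the match or NULL. */
static const char *find_bounded(const char *haystack, size_t len,
                                const char *needle)
{
    size_t n = strlen(needle);
    if (n == 0 || n > len) return NULL;
    for (size_t i = 0; i + n <= len; i++)
        if (memcmp(haystack + i, needle, n) == 0) return haystack + i;
    return NULL;
}

/* stream_pos is the offset of a "stream" keyword inside buffer.
   Returns 1 if the text between the nearest preceding "obj" keyword
   and the stream keyword mentions /FlateDecode, otherwise 0. */
int stream_uses_flate(const char *buffer, size_t stream_pos)
{
    size_t start = stream_pos;

    /* Walk backwards until buffer[start-3..start) spells "obj". */
    while (start >= 3 && memcmp(buffer + start - 3, "obj", 3) != 0)
        start--;
    if (start < 3) start = 0;   /* no "obj" found: scan from the beginning */

    return find_bounded(buffer + start, stream_pos - start,
                        "/FlateDecode") != NULL;
}
```

This is only a heuristic. It does not follow indirect references, and a stream that lists several filters would still need the other filters applied in order.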

The code is far from complete and is not any sort of general utility class, but it does demonstrate how you can extract the text yourself. It is enough to show you how and get you going.

The code is however fully functional, so when it is applied to a PDF document, it generally does a fair job of extracting the text. It has been tested on several PDF files.

This code is supplied as is, no warranties. Use at your own risk.

Using The Code

The download contains one C file. To use it, create a simple Win32 console project and add the pdf.c file to the project. You also need to download the free "zlib compiled DLL" zip file (bless them!). Extract zdll.lib to your project directory and add it as a project dependency (link against it). Also put zlib1.dll, zconf.h and zlib.h in your project directory and add the two header files to the project.

Now, step through the application and note that the input PDF and output text file names are hardwired at the start of the main function.

Future Enhancements

If there is enough interest, the author may consider uploading a release version with a Windows interface. The code is quite good for extracting data from tables in a form that can be readily imported into Excel, with the columns preserved (because of the tabs that get added).

Code Snippets

Stream sections are initially located using:

C++
size_t streamstart = FindStringInBuffer (buffer, "stream", filelen);
size_t streamend = FindStringInBuffer (buffer, "endstream", filelen);
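FindStringInBuffer itself is not shown in this article. As a rough sketch of the same idea, the code below does a bounded search and also guards against one pitfall: the text "stream" occurs inside "endstream". The names find_from, find_stream_pair and NOT_FOUND are hypothetical, not taken from the download.

```c
#include <stddef.h>
#include <string.h>

#define NOT_FOUND ((size_t)-1)

/* Offset of the first occurrence of `search` in buffer[from..len),
   or NOT_FOUND. Works on binary data, unlike strstr. */
static size_t find_from(const char *buffer, size_t len, size_t from,
                        const char *search)
{
    size_t n = strlen(search);
    if (n == 0 || from > len || n > len - from) return NOT_FOUND;
    for (size_t i = from; i + n <= len; i++)
        if (memcmp(buffer + i, search, n) == 0) return i;
    return NOT_FOUND;
}

/* Locate the next "stream" / "endstream" keyword pair at or after `from`.
   On success stores both keyword offsets and returns 1; otherwise 0. */
int find_stream_pair(const char *buffer, size_t len, size_t from,
                     size_t *stream_kw, size_t *endstream_kw)
{
    size_t s = find_from(buffer, len, from, "stream");

    /* Skip a match that is really the tail of "endstream". */
    while (s != NOT_FOUND && s >= 3 && memcmp(buffer + s - 3, "end", 3) == 0)
        s = find_from(buffer, len, s + 6, "stream");
    if (s == NOT_FOUND) return 0;

    /* Only look for "endstream" after the stream keyword. */
    size_t e = find_from(buffer, len, s + 6, "endstream");
    if (e == NOT_FOUND) return 0;

    *stream_kw = s;
    *endstream_kw = e;
    return 1;
}
```

A caller can loop by passing the previous endstream offset plus 9 (the length of "endstream") as the next `from` value until the function returns 0.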

And then once the data portion is identified, it is inflated as follows:

C++
z_stream zstrm; ZeroMemory(&zstrm, sizeof(zstrm));
zstrm.avail_in = streamend - streamstart + 1;
zstrm.avail_out = outsize;
zstrm.next_in = (Bytef*)(buffer + streamstart);
zstrm.next_out = (Bytef*)output;
int rst1 = inflateInit(&zstrm);
if (rst1 == Z_OK)
{
  int rst2 = inflate (&zstrm, Z_FINISH);
  if (rst2 >= 0)
  {
    //Ok, got something, extract the text:
    size_t totout = zstrm.total_out;
    ProcessOutput(fileo, output, totout);
  }
  //Release the memory zlib allocated in inflateInit:
  inflateEnd(&zstrm);
}
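Before inflating, the data portion has to be narrowed down from the keyword offsets: the end-of-line marker after "stream" and the one before "endstream" are not part of the compressed data. The Pascal translation posted in the comments does this with a few character checks; a C sketch of the same logic might look like the following. The function name and parameters are hypothetical.

```c
#include <stddef.h>

/* Given the offsets of the "stream" and "endstream" keywords, compute the
   half-open range [*data_start, *data_end) holding the stream data.
   Returns 1 if the range is non-empty, otherwise 0. */
int stream_data_bounds(const char *buffer, size_t stream_kw,
                       size_t endstream_kw,
                       size_t *data_start, size_t *data_end)
{
    size_t start = stream_kw + 6;   /* 6 == strlen("stream") */
    size_t end = endstream_kw;

    /* Skip the CR LF or LF that follows the "stream" keyword. */
    if (start + 1 < end && buffer[start] == '\r' && buffer[start + 1] == '\n')
        start += 2;
    else if (start < end && buffer[start] == '\n')
        start += 1;

    /* Drop the CR LF or LF that precedes the "endstream" keyword. */
    if (end >= start + 2 && buffer[end - 2] == '\r' && buffer[end - 1] == '\n')
        end -= 2;
    else if (end >= start + 1 && buffer[end - 1] == '\n')
        end -= 1;

    if (end <= start) return 0;
    *data_start = start;
    *data_end = end;
    return 1;
}
```

With these bounds, the number of compressed bytes is data_end - data_start. A more robust reader would use the /Length entry of the stream dictionary instead of guessing from line endings, since the last byte of compressed data can itself be 0x0a.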

The main work is done in the ProcessOutput function, which processes the uncompressed stream to extract the text portion of each text object. It looks as follows:

C++
void ProcessOutput(FILE* file, char* output, size_t len)
{
  //Are we currently inside a text object?
  bool intextobject = false;
  //Is the next character literal 
  //(e.g. \\ to get a \ character or \( to get ( ):
  bool nextliteral = false;

  //() Bracket nesting level. Text appears inside ()
  int rbdepth = 0;

  //Keep previous chars to extract numbers etc.:
  char oc[oldchar];
  int j=0;
  for (j=0; j<oldchar; j++) oc[j]=' ';

  for (size_t i=0; i<len; i++)
  {
    char c = output[i];
    if (intextobject)
    {
      if (rbdepth==0 && seen2("TD", oc))
      {
        //Positioning.
        //See if a new line has to start or just a tab:
        float num = ExtractNumber(oc,oldchar-5);
        if (num>1.0)
        {
          fputc(0x0d, file);
          fputc(0x0a, file);
        }
        if (num<1.0)
        {
          fputc('\t', file);
        }
      }
      if (rbdepth==0 && seen2("ET", oc))
      {
        //End of a text object, also go to a new line.
        intextobject = false;
        fputc(0x0d, file);
        fputc(0x0a, file);
      }
      else if (c=='(' && rbdepth==0 && !nextliteral) 
      {
        //Start outputting text!
        rbdepth=1;
        //See if a space or tab (>1000) is called for by looking
        //at the number in front of (
        float num = ExtractNumber(oc,oldchar-1);
        if (num>0)
        {
          if (num>1000.0)
          {
            fputc('\t', file);
          }
          else if (num>100.0)
          {
            fputc(' ', file);
          }
        }
      }
      else if (c==')' && rbdepth==1 && !nextliteral) 
      {
        //Stop outputting text
        rbdepth=0;
      }
      else if (rbdepth==1) 
      {
        //Just a normal text character:
        if (c=='\\' && !nextliteral)
        {
          //Only print out next character 
          //no matter what. Do not interpret.
          nextliteral = true;
        }
        else
        {
          nextliteral = false;
          if ( ((c>=' ') && (c<='~')) || (((unsigned char)c>=128) && ((unsigned char)c<255)) )
          {
            fputc(c, file);
          }
        }
      }
    }
    //Store the recent characters for 
    //when we have to go back for a number:
    for (j=0; j<oldchar-1; j++) oc[j]=oc[j+1];
    oc[oldchar-1]=c;
    if (!intextobject)
    {
      if (seen2("BT", oc))
      {
        //Start of a text object:
        intextobject = true;
      }
    }
  }
}
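ProcessOutput relies on two helpers, seen2 and ExtractNumber, that are not reproduced in this article. The sketch below is a C rendering modelled on the Pascal translation posted in the comments, so treat it as an approximation of the original helpers rather than the author's exact code. The value 15 for oldchar and the is_blank helper are taken from, or assumed from, that translation.

```c
#include <ctype.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

#define oldchar 15  /* number of recent characters kept (assumed value) */

/* Whitespace that may delimit an operator (assumed set: space, CR, LF). */
static bool is_blank(char c)
{
    return c == ' ' || c == '\r' || c == '\n';
}

/* True if the two-character token `search` (e.g. "BT") was just read.
   recent[oldchar-1] holds the newest character, which must be blank,
   as must the character in front of the token. */
bool seen2(const char *search, const char *recent)
{
    return recent[oldchar - 3] == search[0]
        && recent[oldchar - 2] == search[1]
        && is_blank(recent[oldchar - 1])
        && is_blank(recent[oldchar - 4]);
}

/* Scan backwards from search[lastcharoffset] for the nearest run of
   digits and dots and convert it to a number. Returns -1 if there is
   none. A leading minus sign is ignored, as in the Pascal version. */
float ExtractNumber(const char *search, int lastcharoffset)
{
    int end = lastcharoffset;
    while (end > 0 && !isdigit((unsigned char)search[end])) end--;

    int start = end;
    while (start > 0 &&
           (isdigit((unsigned char)search[start]) || search[start] == '.'))
        start--;

    int n = end - start;            /* number occupies search[start+1..end] */
    if (n <= 0 || n > oldchar) return -1.0f;

    char buf[oldchar + 1];
    memcpy(buf, search + start + 1, (size_t)n);
    buf[n] = '\0';
    return (float)atof(buf);
}
```

Because the scan stops at index 0 without examining it, a number that begins in the very first slot of the history buffer loses its first digit. With 15 characters of history this rarely matters for the short operands that precede TD or an opening parenthesis.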

License

This article has no explicit license attached to it but may contain usage terms in the article text or the download files themselves. If in doubt please contact the author via the discussion board below.



Written By
Web Developer
Canada

Comments and Discussions

 
A Pascal translation of this PDF text extractor [modified]
Domingo Alvarez, 29-Jul-09 13:25
unit pdftotext;

{$mode objfpc}{$H+}


//Converted from http://www.codeproject.com/KB/cpp/ExtractPDFText.aspx
//Original author http://www.codeproject.com/Members/NeWi
//Converted by Domingo Alvarez Duarte mingodadATgmailDOTcom
//Original source file pdf.cpp

//This file contains extremely crude Pascal source code to extract plain text
//from a PDF file. It is only intended to show some of the basics involved
//in the process and is by no means good enough for commercial use.
//But it can be easily modified to suit your purpose. Code is by no means
//warranted to be bug free or suitable for any purpose.

//Adobe has a web site that converts PDF files to text for free,
//so why would you need something like this? Several reasons:

//1) This code is entirely free, including for commercial use. It only
//   requires PASZLIB which is entirely free as well.

//2) This code tries to put tabs into appropriate places in the text,
//   which means that if your PDF file contains mostly one large table,
//   you can easily take the output of this program and directly read it
//   into Excel! Otherwise if you select and copy the text and paste it into
//   Excel there is no way to extract the various columns again.

//This code assumes that the PDF file has text objects compressed
//using FlateDecode (which seems to be standard).

//This code is free. Use it for any purpose.
//The author assumes no liability whatsoever for the use of this code.
//Use it at your own risk!

//PDF file strings (based on PDFReference15_v5.pdf from www.adobe.com):

//BT = Beginning of a text object, ET = end of a text object
//5 Ts = superscript
//-5 Ts = subscript
//Td move to start next line

interface

uses
   Classes;

function pdf2text(pdfFN: string): boolean;
function pdfStream2textStream(mStreamIn, mStreamOut: TMemoryStream): boolean;

implementation

uses
   SysUtils, paszlib;

const
   cStream = 'stream';
   cEndStream = 'endstream';
   cCR      = #13;
   cNL      = #10;
   cCRNL   = cCR + cNL;
   cTab      = #9;
   cBlanks = [' ', cCR, cNL];
   cDigits = ['0'..'9'];
   cDigitsDot = ['0'..'9','.'];


//Find a string in a buffer:
function FindStringInBuffer(buffer, search: PChar; buffersize: integer): integer;
var
   buffer0: PChar;
   len, i:   integer;
   fnd:      boolean;
begin
   buffer0 := buffer;

   len := strlen(search);
   fnd := False;
   while not fnd do
   begin
      fnd := True;
      for i := 0 to len - 1 do
      begin
         if (buffer[i] <> search[i]) then
         begin
            fnd := False;
            break;
         end;
      end;
      if (fnd) then
         exit(buffer - buffer0);
      Inc(buffer);
      if ((buffer - buffer0 + len) >= buffersize) then
         exit(-1);
   end;
   Result := -1;
end;

//Keep this many previous recent characters for back reference:
//#define oldchar 15
const
   cOldChar = 15;

//Convert a recent set of characters into a number if there is one.
//Otherwise return -1:
function ExtractNumber(search: PChar; lastCharOffset: integer): real;
var
   iStart, iEnd:         integer;
   buffer: array[0..(cOldChar + 5)] of char;
begin
   iEnd := lastcharoffset;
   while (iEnd > 0) and not (search[iEnd] in cDigits) do
      Dec(iEnd);
   iStart := iEnd;
   while (iStart > 0) and (search[iStart] in cDigitsDot) do
      Dec(iStart);
   Result := -1.0;
   //FillChar takes the byte count first and the fill value second:
   FillChar(buffer, sizeof(buffer), 0);
   strlcopy(buffer, search + iStart + 1, iEnd - iStart);
   if (buffer[0] <> #0) then result := StrToFloatDef(buffer, -1);
end;

//This method processes an uncompressed Adobe (text) object and extracts text.
procedure ProcessOutput(oStream: TStream; output: PChar; len: integer);
var
   inTextObject, nextLiteral: boolean;
   rbdepth, j, i: integer;
   oc:   array[0..cOldChar] of char;
   c:   char;
   num: real;

   //Check if a certain 2 character token just came along (e.g. BT):
   function seen2(search, recent: PChar): boolean;
   begin
      Result := (recent[cOldChar - 3] = search[0]) and
         (recent[cOldChar - 2] = search[1]) and
         (recent[cOldChar - 1] in cBlanks) and
         (recent[cOldChar - 4] in cBlanks);
   end;

begin
   //writeln(output);
   //Are we currently inside a text object?
   inTextObject := False;

   //Is the next character literal (e.g. \\ to get a \ character or \( to get ( ):
   nextLiteral := False;

   //() Bracket nesting level. Text appears inside ()
   rbdepth := 0;

   //Keep previous chars to extract numbers etc.:
   for j := 0 to cOldChar - 1 do
      oc[j] := ' ';

   for i := 0 to len - 1 do
   begin
      c := output[i];
      if (inTextObject) then
      begin
         if (rbdepth = 0) and seen2('TD', oc) then
         begin
            //Positioning.
            //See if a new line has to start or just a tab:
            num := ExtractNumber(oc, cOldChar - 5);
            if (num > 1.0) then
               oStream.Write(cCRNL, 2);
            if (num < 1.0) then
               oStream.Write(cTab, 1);
         end;
         if (rbdepth = 0) and seen2('ET', oc) then
         begin
            //End of a text object, also go to a new line.
            inTextObject := False;
            oStream.Write(cCRNL, 2);
         end
         else if (c = '(') and (rbdepth = 0) and (not nextLiteral) then
         begin
            //Start outputting text!
            rbdepth := 1;
            //See if a space or tab (>1000) is called for by looking
            //at the number in front of (
            num      := ExtractNumber(oc, cOldChar - 1);
            if (num > 0) then
            begin
               if (num > 1000.0) then
                  oStream.Write(cTab, 1)
               else if (num > 100.0) then
                  oStream.Write(' ', 1);
            end;
         end
         else if (c = ')') and (rbdepth = 1) and (not nextLiteral) then
         begin
            //Stop outputting text
            rbdepth := 0;
         end
         else if (rbdepth = 1) then
         begin
            //Just a normal text character:
            if (c = '\') and (not nextLiteral) then
            begin
               //Only print out next character no matter what. Do not interpret.
               nextliteral := True;
            end
            else
            begin
               nextliteral := False;
               if ((c >= ' ') and (c <= '~')) or
                  ((Byte(c) >= 128) and (Byte(c) < 255)) then
               begin
                  oStream.Write(c, 1);
               end;
            end;
         end;
      end;
      //Store the recent characters for when we have to go back for a number:
      for j := 0 to cOldChar - 2 do
         oc[j] := oc[j + 1];
      oc[cOldChar - 1] := c;
      if not inTextObject then
      begin
         if seen2('BT', oc) then
         begin
            //Start of a text object:
            inTextObject := True;
         end;
      end;
   end;
end;

function pdfStream2textStream(mStreamIn, mStreamOut: TMemoryStream): boolean;
var
   moreStreams: boolean;
   streamStart, streamEnd, nextStreamStart, filelen, outsize, i: integer;
   buffer, output: PChar;
   zstrm: TZstream;
begin
   buffer   := PChar(mStreamIn.Memory);
   filelen := mStreamIn.Size;
   output   := nil;
   outsize := 0;

   moreStreams := True;
   //Now search the buffer repeatedly for streams of data:
   while moreStreams do
   begin
      //Search for stream, endstream. We ought to first check the filter
      //of the object to make sure it is FlateDecode, but skip that for now!
      streamStart := FindStringInBuffer(buffer, cStream, filelen);
      streamEnd   := FindStringInBuffer(buffer, cEndStream, filelen);
      nextStreamStart := streamEnd + sizeof(cEndStream) + 1;
      if (streamStart > 0) and (streamEnd > streamStart) then
      begin
         //Skip to beginning and end of the data stream:
         Inc(streamStart, sizeof(cStream) {6});

         if (buffer[streamStart] = cCR {0x0d}) and
            (buffer[streamstart + 1] = cNL {0x0a}) then
            Inc(streamStart, 2)
         else if (buffer[streamstart] = cNL {0x0a}) then
            Inc(streamStart);

         if (buffer[streamend - 2] = cCR {0x0d}) and
            (buffer[streamend - 1] = cNL {0x0a}) then
            Dec(streamEnd, 2)
         else if (buffer[streamend - 1] = cNL {0x0a}) then
            Dec(streamEnd);

         //Assume output will fit into 10 times input buffer:
         i := (streamEnd - streamStart) * 10;
         if i > outsize then
         begin
            ReAllocMem(output, i);
            outsize := i;
         end;
         //Clear the output buffer (count first, then fill value):
         FillChar(output^, outsize, 0);

         //Now use zlib to inflate:
         //z_stream zstrm; ZeroMemory(&zstrm, sizeof(zstrm));
         FillChar(zstrm, SizeOf(zstrm), 0);

         zstrm.avail_in   := streamEnd - streamStart + 1;
         zstrm.avail_out := outsize;
         zstrm.next_in   := PByte(buffer + streamstart);
         zstrm.next_out   := Pbyte(output);

         if (inflateInit(zstrm) = Z_OK) and
            (inflate(zstrm, Z_FINISH) >= 0) then
            //Ok, got something, extract the text:
            ProcessOutput(mStreamOut, output, zstrm.total_out);

         Inc(buffer, nextStreamStart);
         Dec(filelen, nextStreamStart);
      end
      else
         morestreams := False;
   end;
   FreeMem(output);
   Result := True;
end;

function pdf2text(pdfFN: string): boolean;
var
   mStreamIn, mStreamOut: TMemoryStream;
begin
   mStreamIn   := TMemoryStream.Create;
   mStreamOut := TMemoryStream.Create;
   //Read the entire file into memory (!):
   mStreamIn.LoadFromFile(pdfFN);
   Result := pdfStream2textStream(mStreamIn, mStreamOut);
   mStreamIn.Free;
   mStreamOut.SaveToFile(pdfFN + '.txt');
   mStreamOut.Free;
end;

end.

modified on Thursday, July 30, 2009 6:07 AM
