Code to extract plain text from a PDF file

I finally figured it out. See this link: http://www.codeproject.com/KB/DLL/PDF2TXTVB.aspx

Hi, I am compiling the code using VC++ but I get this error. Why?
Syntax error: identifier _TChar
Thanks in advance
EJ

Try adding the stdafx.h header file to your project.

First of all, this code is great.

But for my project I don't want to create a text file; I want to get the text as a string so I can put it in my database.

Can anybody tell me how I can do this?

Thanks

Theo Mprotsis

Will you be uploading the release version? I am working on a project that might benefit.

Thanks!

ed

~"Watch your thoughts; they become your words. Watch your words; they become your actions.
Watch your actions; they become your habits. Watch your habits; they become your character.
Watch your character; it becomes your destiny."
-Frank Outlaw.

Hi,
I just wanted to find out if one can get a Visual Basic equivalent of the C code to extract the text. Any ideas?

Yes, I have not been able to find ANY examples in VB which perform the same task.

Hardly what I would call robust or efficient... but it works...

You need to reference the Microsoft Scripting Runtime and zlib.dll (I don't recall where I found that).

Option Explicit

Private Const oldchar = 15

Private Type StreamLimit
PtrStart As Long
PtrEnd As Long
End Type

'Declares
Private Declare Function ShellAbout Lib "shell32.dll" Alias "ShellAboutA" (ByVal hwnd As Long, ByVal szApp As String, ByVal szOtherStuff As String, ByVal hIcon As Long) As Long
Private Declare Sub CopyMemory Lib "kernel32" Alias "RtlMoveMemory" (hpvDest As Any, hpvSource As Any, ByVal cbCopy As Long)
Private Declare Function compress Lib "zlib.dll" (dest As Any, destLen As Any, src As Any, ByVal srcLen As Long) As Long
Private Declare Function uncompress Lib "zlib.dll" (dest As Any, destLen As Any, src As Any, ByVal srcLen As Long) As Long

Public Function FastExtractText(FileName As String)
Dim result As String
Dim InBuffer() As Byte
Dim BufferLength As Long
Dim Stream() As Byte
Dim Ptr As Long
Dim PDFData As String
Dim Limits As StreamLimit
Dim ThisString As String

'//Open the PDF source file:
InBuffer = OpenFileToBytes(FileName)
BufferLength = UBound(InBuffer)

'//Now search the buffer repeatedly for streams of data:
Ptr = 0
Limits = FastNextCleanStream(InBuffer, Ptr)
While ((Limits.PtrEnd > 0) And (Limits.PtrEnd < BufferLength))
'//Now use zlib to inflate:
PDFData = FastInflate(FileName, Limits)
ThisString = ProcessPDFData(PDFData)
result = result + ThisString
Limits = FastNextCleanStream(InBuffer, Limits.PtrEnd + 9)
DoEvents
Wend
FastExtractText = result

End Function

Private Function OpenFileToBytes(FileName As String, Optional Start As Long = 0, Optional Length As Long = 0) As Byte()
Dim FS As New Scripting.FileSystemObject
Dim InFile() As Byte
Dim FN As Long

If FS.FileExists(FileName) Then
ReDim InFile(FileLen(FileName))
FN = FreeFile
Open FileName For Binary As FN
If Start <= 0 Then
Get FN, , InFile
ElseIf Length <= 0 Then
Get FN, Start, InFile
Else
ReDim InFile(Length)
Get FN, Start, InFile
End If
OpenFileToBytes = InFile
Close FN
End If

End Function

Private Function ProcessPDFData(PDFData As String) As String
'Are we currently inside a text object?
Dim intextobject As Boolean
intextobject = False

'Is the next character literal (e.g. \\ to get a \ character or \( to get ( ):
Dim nextliteral As Boolean
nextliteral = False

'() Bracket nesting level. Text appears inside ()
Dim rbdepth As Integer
rbdepth = 0

'Keep previous chars to extract numbers etc.:
Dim oc(oldchar) As Byte

Dim Ptr As Long
Dim c As String
Dim num As Double
Dim Length As Long

Length = Len(PDFData)

Dim NxtBT As Long
Dim NxtET As Long
Dim NxtOB As Long
Dim NxtCB As Long

Dim result As String
NxtET = 1

Dim done As Boolean
done = False

Do While Not done
NxtBT = GetNextSpaced("BT", NxtET, PDFData)
If NxtBT <= 0 Then
'Nup - nothing left
Exit Do
End If
NxtET = GetNextSpaced("ET", NxtBT, PDFData)
NxtOB = FindNextNoEscape("(", NxtBT, PDFData)
Do While (NxtOB < NxtET)
NxtCB = FindNextNoEscape(")", NxtOB, PDFData)
If NxtCB < NxtET Then
If NxtCB >= 0 Then
If result <> "" Then
result = result + "" + Mid(PDFData, NxtOB + 1, NxtCB - NxtOB - 1)
Else
result = Mid(PDFData, NxtOB + 1, NxtCB - NxtOB - 1)
End If
NxtOB = FindNextNoEscape("(", NxtCB, PDFData)
Else
NxtOB = NxtET
End If
Else
Do Until NxtET > NxtCB
NxtET = GetNextSpaced("ET", NxtET, PDFData)
If NxtET <= 0 Then
'Data error...
Exit Do
End If
Loop
End If
Loop
Loop

ProcessPDFData = ReplaceSpecialCharacters(result)

End Function

Private Function FastNextCleanStream(sBuffer() As Byte, ByRef After As Long) As StreamLimit
Dim Limits As StreamLimit
Dim sDelimit As String
Dim eDelimiter As String

sDelimit = "stream"

Limits = FastGrabStreamLimits(sBuffer, After, sDelimit, "endstream")
After = After + Limits.PtrEnd + Len("endstream")
Limits = FastCleanStream(sBuffer, Limits)
FastNextCleanStream = Limits
End Function

Private Function GetNextSpaced(Chars As String, After As Long, InData As String) As Long
Dim DataLen As Long
Dim result As Long
Dim found As Boolean
Dim CheckChar As String

DataLen = Len(InData)
found = False
result = After + 1
Do While Not found And result > 0
result = InStr(result, InData, Chars, vbTextCompare)
If result <= 0 Then
Exit Do
End If
CheckChar = Mid(InData, result - 1, 1)
If CheckChar = " " Or CheckChar = Chr(10) Or CheckChar = Chr(13) Then
If (result + Len(Chars) + 2) <= DataLen Then
CheckChar = Mid(InData, result + Len(Chars), 1)
If CheckChar = " " Or CheckChar = Chr(10) Or CheckChar = Chr(13) Then
found = True
Exit Do
Else
result = result + Len(Chars)
End If
Else
Exit Do
End If
Else
result = result + Len(Chars)
End If
Loop

If found Then
GetNextSpaced = result
Else
GetNextSpaced = -1
End If
End Function

Private Function FindNextNoEscape(Char As String, After As Long, InData As String) As Long
Dim DataLen As Long
Dim result As Long
Dim found As Boolean
Dim CheckChar As String

found = False
result = After
Do While Not found And result > 0
result = InStr(result, InData, Char, vbBinaryCompare)
If result > After Then
CheckChar = Mid(InData, result - 1, 1)
If Not (CheckChar = "\") Then
found = True
Else
result = result + 1
End If
Else
found = True
End If
Loop

If found Then
FindNextNoEscape = result
Else
FindNextNoEscape = -1
End If
End Function

Private Function ReplaceSpecialCharacters(InString As String) As String
Dim result As String

result = InString
result = Replace(result, "\\", "'")
result = Replace(result, "\322", Chr(34)) ' "
result = Replace(result, "\323", Chr(34)) ' "
result = Replace(result, "\252", " TM")
result = Replace(result, "\320", "-")
result = Replace(result, "\311", "...")
result = Replace(result, "'", "\")
result = Replace(result, "\325", "'")
result = Replace(result, "\", "")
'Result = Replace(Result, "\)", ")")
ReplaceSpecialCharacters = result
End Function

Private Function FastGrabStreamLimits(Buffer() As Byte, After As Long, StartDelimiter As String, EndDelimiter As String) As StreamLimit
Dim result As StreamLimit

result.PtrStart = FindSpacedBytesLocation(After, Buffer, StartDelimiter)
'Result.PtrStart = GetNextSpaced(StartDelimiter, After, Buffer)
If result.PtrStart > 0 Then
result.PtrStart = result.PtrStart + Len(StartDelimiter)
result.PtrEnd = FindSpacedBytesLocation(result.PtrStart, Buffer, EndDelimiter)
Else
result.PtrStart = UBound(Buffer)
End If
FastGrabStreamLimits = result
End Function

Private Function FindSpacedBytesLocation(Start As Long, Buffer() As Byte, Search As String) As Long
Dim result As Long
Dim LastResult As Long
Dim done As Boolean

result = 0
Do Until done
LastResult = result
result = FindBytesLocation(Start, Buffer, Search)
done = True
If result = LastResult Then
result = 0
done = True
End If
If result <= 0 Then
done = True
End If
Loop
FindSpacedBytesLocation = result
End Function

Private Function FindBytesLocation(Start As Long, Buffer() As Byte, Search As String) As Long
Dim result As Long
Dim Ptr As Long
Dim Max As Long
Dim Srch As Byte
Dim SrchLen As Long
Dim found As Boolean

Max = UBound(Buffer)
Srch = Asc(Mid(Search, 1, 1))
SrchLen = Len(Search)
found = False
For result = Start To Max
If Buffer(result) = Srch Then
found = True
For Ptr = 1 To SrchLen - 1
If Not Buffer(result + Ptr) = Asc(Mid(Search, Ptr + 1, 1)) Then
found = False
Exit For
End If
Next Ptr
If found Then
Exit For
Else
found = False
result = result + Ptr
End If
End If
Next result

If found Then
FindBytesLocation = result
Else
FindBytesLocation = -1
End If
End Function

Private Function FastInflate(FileName As String, ZLData As StreamLimit) As String
Dim zstrm As New ZLIBTOOLLib.ZlibTool
Dim tmpZipped As String
Dim tmpUnZipped As String
Dim ZippedData() As Byte

tmpZipped = "C:\temp\tmpin": tmpUnZipped = "C:\temp\tmpout"
ZippedData = OpenFileToBytes(FileName, ZLData.PtrStart + 1, ZLData.PtrEnd - ZLData.PtrStart - 1)
DumpStreamToDisk ZippedData, tmpZipped
zstrm.InputFile = tmpZipped: zstrm.OutputFile = tmpUnZipped
zstrm.Decompress
FastInflate = OpenFileToText(tmpUnZipped)
Kill tmpZipped: Kill tmpUnZipped
End Function

Private Function FastCleanStream(Buffer As Variant, Limits As StreamLimit) As StreamLimit
Dim BufferLen As Long
Dim i As Long
Dim done As Boolean

BufferLen = Len(Buffer)
With Limits
done = False
i = 0
Do Until done
i = i + 1
If BufferLen < .PtrStart + i Then
done = True
Else
If Not _
(Buffer(.PtrStart + i) = 10 Or _
Buffer(.PtrStart + i) = 13 Or _
Buffer(.PtrStart + i) = 32) Then
done = True
End If
End If
Loop
.PtrStart = .PtrStart + i
done = False
i = 0
Do Until done
i = i + 1
If .PtrEnd - i <= 0 Then
done = True
Else
If Not _
(Buffer(.PtrEnd - i) = 10 Or _
Buffer(.PtrEnd - i) = 13 Or _
Buffer(.PtrEnd - i) = 32) Then
done = True
End If
End If
Loop
.PtrEnd = .PtrEnd - i
End With

FastCleanStream = Limits
End Function

Private Sub DumpStreamToDisk(Stream As Variant, FileName As String)
Dim OutFileNum As Long
Dim Data() As Byte

On Error Resume Next
Kill FileName
On Error GoTo 0
OutFileNum = FreeFile
Open FileName For Binary As OutFileNum
Data = Stream
Put OutFileNum, , Data
Close OutFileNum
End Sub

Private Function OpenFileToText(FileName As String) As String
Dim FS As New Scripting.FileSystemObject
Dim InFile As TextStream

If FS.FileExists(FileName) Then
Set InFile = FS.OpenTextFile(FileName)
If Not InFile.AtEndOfStream Then
OpenFileToText = InFile.ReadAll
End If
End If
End Function

Cheers Mate!

I've asked NeWi if he has code available, but I have not yet received a response.

Does anyone else know how to decrypt a text stream?

PS. I'm developing NeWi's code to parse the text metrics, which will give a more accurate return vis-a-vis end-of-line characters.

Midders

Hello,

I am wondering if you have made any progress or learned any new information regarding decrypting text streams?

Thanks,

Daniel

Not yet. At present I'm coding a system to open the pdf file, look for active objects and retrieve the text streams.

When reviewing the pdf spec, I discovered - amongst other issues - that text is not always written to file in the order that it appears on screen (along the lines of M$ Word fast saves).

(Disclaimer: IT'S NOT MY WORK; GIVE CREDIT TO WHOM IT BELONGS!)
You can find a paper on how to decrypt encrypted streams here:
http://www.cs.cmu.edu/~dst/Adobe/Gallery/anon21jul01-pdf-encryption.txt
It explains how to get the key and how to decrypt the encrypted streams.
Quick hits; look for:

% Encryption dictionary
94 0 obj
<<
    /Filter /Standard   % use the standard security handler
    /V 1                % algorithm 1
    /R 2                % revision 2
    /U (xxx...xxx)      % hashed user password (32 bytes)
    /O (xxx...xxx)      % hashed owner password (32 bytes)
    /P 65472            % flags specifying the allowed operations
>>
endobj

Then the encryption key is generated as follows:

1. Pad the user password out to 32 bytes, using a hardcoded
   32-byte string:
       28 BF 4E 5E 4E 75 8A 41 64 00 4E 56 FF FA 01 08
       2E 2E 00 B6 D0 68 3E 80 2F 0C A9 FE 64 53 69 7A
   If the user password is null, just use the entire padding
   string.  (I.e., concatenate the user password and the padding
   string and take the first 32 bytes.)
2. Append the hashed owner password (the /O entry above).
3. Append the permissions (the /P entry), treated as a four-byte
   integer, LSB first.
4. Append the file identifier (the /ID entry from the trailer
   dictionary).  This is an arbitrary string of bytes; Adobe
   recommends that it be generated by MD5 hashing various pieces
   of information about the document.
5. MD5 hash this string; the first 5 bytes of output are the
   encryption key.  (This is a 40-bit key, presumably to meet US
   export regulations.)
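
For anyone wiring this into pdf.c, steps 1 to 4 above might be sketched in C roughly as follows. This is only a sketch under assumptions: the function names `pad_password` and `build_key_input` are made up for illustration, the /O and /ID bytes are assumed to have been parsed out of the file already, and MD5 (step 5) is not part of the C standard library, so the hash is left to whatever implementation you link against.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* The hardcoded 32-byte padding string quoted in step 1. */
static const uint8_t PDF_PAD[32] = {
    0x28, 0xBF, 0x4E, 0x5E, 0x4E, 0x75, 0x8A, 0x41,
    0x64, 0x00, 0x4E, 0x56, 0xFF, 0xFA, 0x01, 0x08,
    0x2E, 0x2E, 0x00, 0xB6, 0xD0, 0x68, 0x3E, 0x80,
    0x2F, 0x0C, 0xA9, 0xFE, 0x64, 0x53, 0x69, 0x7A
};

/* Step 1: password followed by padding, truncated to 32 bytes.
   (Hypothetical helper name; pw may be NULL when pw_len is 0.) */
void pad_password(const uint8_t *pw, size_t pw_len, uint8_t out[32])
{
    assert(out != NULL);
    if (pw_len > 32) pw_len = 32;
    if (pw_len > 0) memcpy(out, pw, pw_len);
    memcpy(out + pw_len, PDF_PAD, 32 - pw_len);
}

/* Steps 1-4: assemble the byte string that step 5 feeds to MD5.
   Returns the number of bytes written, or 0 if out is too small. */
size_t build_key_input(const uint8_t *pw, size_t pw_len,
                       const uint8_t owner[32], uint32_t perms,
                       const uint8_t *file_id, size_t id_len,
                       uint8_t *out, size_t out_cap)
{
    size_t need = 32 + 32 + 4 + id_len;

    if (out_cap < need) return 0;
    pad_password(pw, pw_len, out);           /* step 1 */
    memcpy(out + 32, owner, 32);             /* step 2: /O entry */
    out[64] = (uint8_t)(perms);              /* step 3: /P, LSB first */
    out[65] = (uint8_t)(perms >> 8);
    out[66] = (uint8_t)(perms >> 16);
    out[67] = (uint8_t)(perms >> 24);
    if (id_len > 0) memcpy(out + 68, file_id, id_len);  /* step 4: /ID */
    return need;
}
```

The first 5 bytes of the MD5 digest of that buffer would then be the file key. With an empty user password, the first 32 bytes of the buffer are simply the padding string.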

Note that the inputs to this algorithm are: the user password
(typically empty) and various information specified in the PDF file.
All stream (and string) objects in the PDF file are encrypted. This
is sufficient to render the file useless (that is, if it weren't so
easy to decrypt). Stream/string decryption works like this:

1. Take the 5-byte file key (from above).
2. Append the 3 low-order bytes (LSB first) of the object number
   for the stream/string object being decrypted.
3. Append the 2 low-order bytes (LSB first) of the generation
   number.
4. MD5 hash that 10-byte string.
5. Use the first 10 bytes of the output as an RC4 key to decrypt
   the stream or string.  (This apparently still meets the US
   export regulations because it's a 40-bit key with an additional
   40-bit "salt".)
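
The per-object steps above could look something like this in C. Again this is a sketch, not tested against real PDF files: `build_object_key_input` and `rc4_crypt` are names I made up, the MD5 call in step 4 is left out because it depends on which library you use, and the RC4 routine is just the textbook algorithm.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Steps 1-3: file key, then the 3 low-order bytes of the object number
   and the 2 low-order bytes of the generation number, LSB first. */
void build_object_key_input(const uint8_t file_key[5], uint32_t obj_num,
                            uint16_t gen_num, uint8_t out[10])
{
    memcpy(out, file_key, 5);
    out[5] = (uint8_t)(obj_num);
    out[6] = (uint8_t)(obj_num >> 8);
    out[7] = (uint8_t)(obj_num >> 16);
    out[8] = (uint8_t)(gen_num);
    out[9] = (uint8_t)(gen_num >> 8);
}

/* Step 5: plain RC4, applied in place. RC4 is symmetric, so the same
   call encrypts and decrypts. */
void rc4_crypt(const uint8_t *key, size_t key_len,
               uint8_t *data, size_t data_len)
{
    uint8_t s[256];
    uint8_t tmp;
    size_t i, j = 0, n;

    assert(key != NULL && key_len > 0);

    /* Key scheduling: start from the identity permutation and mix in the key. */
    for (i = 0; i < 256; i++) s[i] = (uint8_t)i;
    for (i = 0; i < 256; i++) {
        j = (j + s[i] + key[i % key_len]) & 0xFF;
        tmp = s[i]; s[i] = s[j]; s[j] = tmp;
    }

    /* Keystream generation, XORed over the data. */
    i = 0;
    j = 0;
    for (n = 0; n < data_len; n++) {
        i = (i + 1) & 0xFF;
        j = (j + s[i]) & 0xFF;
        tmp = s[i]; s[i] = s[j]; s[j] = tmp;
        data[n] ^= s[(s[i] + s[j]) & 0xFF];
    }
}
```

In use, you would MD5 the 10 bytes from `build_object_key_input`, pass the first 10 bytes of the digest to `rc4_crypt` as the key, and only then hand the decrypted stream to zlib for inflating.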

To decrypt a PDF file (i.e., generate a new PDF file, identical except
that all encryption is removed), just filter the file, applying the
above algorithm to decrypt every stream and string object. Then
remove the /Encrypt entry in the trailer dictionary.

This is not complete, so please read the file!
(I just found it! IT'S NOT MY WORK; GIVE CREDIT TO WHOM IT BELONGS!)

Hi,
Thank you for sharing your knowledge. Do you have any idea how I can embed text in a created PDF?
I want to embed some hidden (non-readable) text into a PDF.

Best regards,
A. Riazi

Did you find a solution for the reverse task? If you did, please share: gokul_csse@yahoo.com

In the main function, the buffer char pointer is never deallocated. I realize this is a simple example, but anyone using this code for practical purposes should be aware of it.

Thanks for this gem of an article! This is great stuff.

I want to extract plain text and images from PDF files in VB6.

Replace this line:
for (size_t j=0; j<oldchar-1; j++) oc[j]=oc[j+1];

With this:

memmove(oc, oc+1, 14);

And you'll get a bit of a bump in performance.
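
To see that the two forms do the same thing, here is a small C sketch. It assumes `oc` is an array of `oldchar` (15) `char` elements, which matches the constant in the VB port above but is otherwise my assumption; `OLDCHAR`, `shift_loop` and `shift_memmove` are names made up for the comparison. Computing the byte count from the array size avoids the hard-coded 14.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define OLDCHAR 15  /* assumed history length, as in the VB port's oldchar */

/* The original form: shift every element one slot to the left. */
void shift_loop(char oc[OLDCHAR])
{
    size_t j;

    assert(oc != NULL);
    for (j = 0; j < OLDCHAR - 1; j++) oc[j] = oc[j + 1];
}

/* The suggested form: memmove is defined for overlapping source and
   destination, which memcpy is not. */
void shift_memmove(char oc[OLDCHAR])
{
    assert(oc != NULL);
    memmove(oc, oc + 1, (OLDCHAR - 1) * sizeof oc[0]);
}
```

Both leave the last element unchanged, so the caller still writes the newest character into `oc[OLDCHAR - 1]` afterwards.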

Hi, I have tried your sample code in VC++ as you said, and also tried the setup exe, but in both cases it generates an empty output file.
I have "Adobe Reader 6.0" installed on my PC.
So where is the problem? In the code, or in the version of Adobe Reader?
Thanks in advance,
Regards,
jignesh

Have you fixed the problem yet? I face the same one. If you have solved it, could you give me a sample of your solution? It would be appreciated. Thanks a lot!

Newer versions of PDF documents are sometimes encrypted. However, no password is required, so the document can still be displayed by readers. BUT the stream data still needs to be decrypted before it can be inflated. So when you run these PDF documents through the code above, no output is produced. I have found some snippets of code that explain how to do this, and I am thinking of adding it to pdf.c.
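
Until decryption is added, one cheap way to explain the empty output is to check whether the file mentions an /Encrypt entry at all. The C sketch below is only a heuristic: it does a raw byte search and does not parse the trailer dictionary, so it can give false positives (it would also match other names that begin with /Encrypt). The function name `pdf_mentions_encrypt` is made up for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Returns 1 if the bytes "/Encrypt" occur anywhere in buf, else 0.
   buf is the whole PDF file read into memory; len is its size. */
int pdf_mentions_encrypt(const unsigned char *buf, size_t len)
{
    static const char key[] = "/Encrypt";
    const size_t klen = sizeof key - 1;  /* exclude the trailing NUL */
    size_t i;

    assert(buf != NULL || len == 0);
    if (len < klen) return 0;
    for (i = 0; i + klen <= len; i++) {
        if (memcmp(buf + i, key, klen) == 0) return 1;
    }
    return 0;
}
```

If this returns 1, the extractor could print a warning that the streams are probably encrypted instead of silently writing an empty file.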

Anyone interested in an updated version of the code supplied here? Is this a problem for people out there?

I'd love to see an updated version of your code. Keep up the good work :)

Yes, being able to extract text from encrypted files would be excellent. If you can post any code, please do.

Also, to the guy who is "the only person in US & Europe who is approved by Adobe to teach the PDF file format": that [you are the only person in US & Europe] is EXACTLY the reason we need articles like this. The PDF spec is tough and not explained much elsewhere. Sure, the code is not perfect, but most of us need to get things done this century so that we can move on to the next thing our boss wants; we just don't have the time (or good reason) to invest in reading and understanding the whole PDF spec. For me this code was a life-saver, and it got me started in the right direction. Open source libraries are not a viable resource for commercial developers (because of the licensing restrictions), and they can be hell to understand as well.

Jim

That sounds interesting. Could you send me an updated version at shijie_xiong@yahoo.com?

Good good study, day day up

I would like a copy if you update it. Please email it to me via the link in this post (my handle's email).

Muchas gracias!

This is a problem that I am interested in
