Searching and replacing text in a PDF file with the iText library

Question

I need to find and replace a placeholder string in a PDF file. The PDF file is loaded with the iText library, and I have been trying to follow some code samples I have dug up, most of them written for the original Java implementation.

The problem is that the samples don't work for my PDF file. I get a PdfDictionary with PdfObjects, but when I try to filter out the objects that contain text I get no results. I know that there is text in there, because I first took a look at the contents of the file with a PDF parser. The parser will not allow me to make changes and write them back, but at least I know that there is something in there that can be found.

Taking a closer look at the PdfDictionary object, I found only one flavor of PdfObject in it: PdfIndirectReference. The name suggests that I must resolve these references to get objects which I can examine and modify, but I can't find any sample code for that.

What I have tried:

I have to work with an improvised setup with several computers and remote desktops at the moment, so I can't just post my experimental code right now. This is what I have:

1) Open a PdfReader (works)
2) Get a PdfDocument object with the reader (works)
3) Iterate through the pages of the document and get a PdfPage object (works)
4) (For each page) get a PdfDictionary from the page object (works)
5) Get Pdf objects from the dictionary with dictionary.Get(PdfName.Contents) (works)
6) Normally I would just have to iterate over the results from step 5), but I only get PdfIndirectReference objects. How can I resolve and edit these references?

// Requires: using System.IO; using System.Text; using iText.Kernel.Pdf;
MemoryStream stream;
PdfReader reader;
PdfDocument document;
PdfPage page;
PdfDictionary dict;
int pages;
int i;

using (stream = new MemoryStream(BinaryFile))
{
    using (reader = new PdfReader(stream))
    {
        using (document = new PdfDocument(reader))
        {
            pages = document.GetNumberOfPages();
            for (i = 1; i <= pages; i++)
            {
                page = document.GetPage(i);
                dict = page.GetPdfObject();
                var xcontent = dict.Get(PdfName.Contents);
                // Contents may be a single stream instead of an array, so guard the cast.
                PdfArray thearray = xcontent as PdfArray;
                if (thearray != null)
                {
                    foreach (PdfObject obj in thearray)
                    {
                        // these objects actually are PdfIndirectReferences
                        // converting them leads nowhere, so here is the point
                        // where I would have to resolve the reference and use whatever
                        // objects I might obtain that way.
                        PdfStream strm = obj as PdfStream;
                        if (strm != null)
                        {
                            byte[] data = strm.GetBytes();
                            UTF8Encoding enc = new UTF8Encoding();

                            string test = enc.GetString(data);
                        }
                    }
                }
            }
        }
    }
}
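
For reference, here is a minimal sketch of how the references might be resolved, assuming iText 7 for .NET. PdfIndirectReference.GetRefersTo() returns the object a reference points to, and PdfPage.GetContentStream(int) hands back the already resolved content streams of a page. The class and method names (PlaceholderReplacer, Resolve, Replace) are made up for illustration. The replacement only works when the placeholder appears as one contiguous, plainly encoded string in the content stream; text that is split across several operators or written with a font-specific encoding will not be found this way.

```csharp
using System.IO;
using System.Text;
using iText.Kernel.Pdf;

public static class PlaceholderReplacer
{
    // Hypothetical helper: follows an indirect reference to the stream it points to.
    // Returns null if the resolved object is not a stream.
    public static PdfStream Resolve(PdfObject obj)
    {
        if (obj is PdfIndirectReference reference)
        {
            obj = reference.GetRefersTo();
        }
        return obj as PdfStream;
    }

    // Hypothetical helper: replaces a placeholder in every page content stream
    // and returns the modified PDF as a byte array.
    public static byte[] Replace(byte[] pdfBytes, string placeholder, string replacement)
    {
        // ISO-8859-1 maps each byte to exactly one char, so untouched bytes
        // survive the round trip. UTF-8 could corrupt them.
        Encoding latin1 = Encoding.GetEncoding("ISO-8859-1");

        using (var input = new MemoryStream(pdfBytes))
        using (var output = new MemoryStream())
        {
            // A PdfWriter is needed; a document opened with only a reader is read-only.
            using (var document = new PdfDocument(new PdfReader(input), new PdfWriter(output)))
            {
                for (int i = 1; i <= document.GetNumberOfPages(); i++)
                {
                    PdfPage page = document.GetPage(i);
                    for (int j = 0; j < page.GetContentStreamCount(); j++)
                    {
                        // GetContentStream returns the resolved stream, not the reference.
                        PdfStream content = page.GetContentStream(j);
                        string text = latin1.GetString(content.GetBytes());
                        if (text.Contains(placeholder))
                        {
                            content.SetData(latin1.GetBytes(text.Replace(placeholder, replacement)));
                        }
                    }
                }
            }
            // Closing the document writes the PDF; ToArray still works on a closed MemoryStream.
            return output.ToArray();
        }
    }
}
```

In the loop from the question, Resolve(obj) would take the place of the direct cast to PdfStream. A call such as PlaceholderReplacer.Replace(BinaryFile, "PLACEHOLDER", "value") would then return the modified file; the placeholder text here is only an example.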

Posted 23-Mar-20 1:11am

CodeWraith

Updated 23-Mar-20 1:57am

v2

Comments

ZurdoDev 23-Mar-20 7:35am

It would help if you clicked Improve question and showed just some relevant code.

CodeWraith 23-Mar-20 7:58am

As you wish, but I doubt it will help very much. Everything so far is ok, but what can I do with the indirect references from there on?

ZurdoDev 23-Mar-20 8:03am

It always helps to make sure we understand what you are saying.

This might help: https://stackoverflow.com/questions/37014984/how-to-read-text-of-appearance-stream

CodeWraith 23-Mar-20 8:37am

Thanks. I already took a first look and it looks like I'm going in circles. The problem always is that I have to ask for what objects I want to see from the document and 'Contents' only yields the indirect references. I would be very happy to get to the point directly, but have no idea where the text is actually stored in the document or how to ask for this.
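
As to where the text lives: page text is normally drawn by text-showing operators inside the page content streams, so one way to check which pages contain the placeholder at all is to let iText's text extractor do the parsing. The following is only a sketch, assuming iText 7 for .NET; PlaceholderFinder and FindPages are made-up names. The extractor reports text as it would be rendered, so a hit here does not guarantee that the placeholder is stored as one contiguous string in the stream.

```csharp
using System.Collections.Generic;
using System.IO;
using iText.Kernel.Pdf;
using iText.Kernel.Pdf.Canvas.Parser;

public static class PlaceholderFinder
{
    // Hypothetical helper: returns the 1-based numbers of all pages
    // whose extracted text contains the placeholder.
    public static List<int> FindPages(byte[] pdfBytes, string placeholder)
    {
        var hits = new List<int>();
        using (var input = new MemoryStream(pdfBytes))
        using (var document = new PdfDocument(new PdfReader(input)))
        {
            for (int i = 1; i <= document.GetNumberOfPages(); i++)
            {
                // Parses the content streams of the page and returns the text found in them.
                string text = PdfTextExtractor.GetTextFromPage(document.GetPage(i));
                if (text.Contains(placeholder))
                {
                    hits.Add(i);
                }
            }
        }
        return hits;
    }
}
```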


This content, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)