
PDF Page Counter

24 Oct 2014 (CPOL)
Quickly count the number of pages in a collection of PDF documents

Introduction

My company, Red Cell Innovation Inc., provides a document scanning service. We often need a page count for a collection of PDF files for billing, quality control, scheduling, and estimating.

This is an application that I quickly whipped up to do that. It is written in a procedural style, in about 200 lines of code (including XAML and comments), in a single simple codebehind class.

Features

  • Simple: Drop a directory onto the application's window.
  • Fast: Scanned 20 GB of PDFs and counted 53,877 pages in 499 files in 7 seconds on an SSD (270 seconds on a network drive).

How It Works

Language: C# 5.0
.NET Framework: 4.5
UI Framework: WPF
Libraries: iTextSharp
Pattern: Codebehind, procedural

When the application starts, the user is prompted to drop files and/or folders into the application's window.

UI: Drop files and/or folders to be counted.

When files or folders are dropped, the Start method is invoked, changing the visibility of UI elements to the count screen.

UI: File and page counts

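The async Analyze method is then invoked. It uses Task.Run to queue the traversal on the thread pool, so the UI thread stays responsive while the filesystem is traversed recursively. For each dropped directory, Analyze calls itself with the PDF files found beneath it, which queues a further thread-pool task for that directory.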

C#
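private async Task Analyze (IEnumerable<string> filenames)
{
    // Run the traversal on a thread-pool thread so the UI stays responsive.
    await Task.Run(async () =>
    {
        foreach (string filename in filenames)
        {
            if (this._cancel)
                break;

            // Refresh the on-screen counters on the UI thread.
            Dispatcher.Invoke(Update);
            if (Directory.Exists(filename))
            {
                // Recurse into the PDFs found beneath a dropped directory.
                string[] nestedFilenames = Directory.GetFiles(filename, "*.pdf", SearchOption.AllDirectories);
                await Analyze(nestedFilenames);
                continue; // A directory is not itself a file to count.
            }

            this._files++;
            // Case-insensitive extension check without allocating a FileInfo.
            if (!string.Equals(Path.GetExtension(filename), ".pdf", StringComparison.OrdinalIgnoreCase))
                continue;

            this._filesPdf++;
            int pages = Count(filename);
            this._pages += pages;
        }
        Dispatcher.Invoke(Update);
    });
}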

private int Count (string filename)
{
    // Disposing the reader closes the file, so no explicit Close() is needed.
    using (var reader = new PdfReader(filename))
    {
        return reader.NumberOfPages;
    }
}
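
As a variation on the traversal above, here is a minimal sketch that swaps the _cancel flag for a CancellationToken and returns the totals instead of updating fields. The names PdfScanTotals, PdfScanner, and ScanAsync are hypothetical and are not part of the article's download. The page counting itself is left out so that the sketch needs nothing beyond the base class library.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical result type; not part of the article's source code.
public sealed class PdfScanTotals
{
    public int Files;     // every file encountered
    public int PdfFiles;  // files whose extension is .pdf (any case)
}

// Hypothetical helper illustrating the traversal only.
public static class PdfScanner
{
    public static Task<PdfScanTotals> ScanAsync(IEnumerable<string> paths, CancellationToken token)
    {
        // A single thread-pool task walks every dropped path.
        return Task.Run(() =>
        {
            var totals = new PdfScanTotals();
            foreach (string path in paths)
            {
                token.ThrowIfCancellationRequested();
                if (Directory.Exists(path))
                {
                    // EnumerateFiles streams results instead of building one large array.
                    foreach (string file in Directory.EnumerateFiles(path, "*", SearchOption.AllDirectories))
                    {
                        token.ThrowIfCancellationRequested();
                        Tally(file, totals);
                    }
                }
                else if (File.Exists(path))
                {
                    Tally(path, totals);
                }
            }
            return totals;
        }, token);
    }

    private static void Tally(string file, PdfScanTotals totals)
    {
        totals.Files++;
        if (string.Equals(Path.GetExtension(file), ".pdf", StringComparison.OrdinalIgnoreCase))
            totals.PdfFiles++;
    }
}
```

A caller would await PdfScanner.ScanAsync(paths, cts.Token) and call cts.Cancel() from a Cancel button. Unlike the original method, this sketch also counts non-PDF files inside dropped directories, which is a design choice rather than something the article specifies.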

The Count method uses the iTextSharp library to read the PDF files. Because PDF files are internally indexed, the document does not need to be scanned (see PDF Syntax). Instead, a PdfReader object is instantiated and its NumberOfPages property is read.

The system resources used are negligible.

Task Manager Performance

PDF Syntax

This could have been done quite easily without iTextSharp by writing a simple PDF parser; however, that would have increased the development time, which was only about an hour because I was already familiar with iTextSharp.

To accomplish this without iTextSharp we would read the PDF and follow the references.

The following is a syntactically correct and complete PDF file. To find the page count, we first check the trailer, which specifies object 1 as the Root. Object 1 contains the Catalog, which points to object 3 as the Pages node. Note how the Pages node has a /Count of 1 and lists a single page, described in object 4.

%PDF-1.4
1  0  obj
 <<  /Type /Catalog
  /Outlines  2 0 R
  /Pages  3 0 R
 >>
endobj
2  0  obj
 <<  /Type  /Outlines
  /Count  0
 >>
endobj
3  0  obj
 <<  /Type  /Pages
  /Kids  [ 4 0 R ]
  /Count  1
 >>
endobj
4  0  obj
 <<  /Type  /Page
  /Parent  3 0 R
  /MediaBox  [ 0  0  612  792 ]
  /Contents  5 0 R
  /Resources  <<  /ProcSet  6 0 R  >>
 >>
endobj
5  0  obj
 <<  /Length  35  >>
 stream
 <-- Page-marking operators -->
 endstream
endobj
6  0  obj
 [ /PDF ]
endobj
xref
0  7
0000000000  65535  f
0000000009  00000  n
0000000074  00000  n
0000000120  00000  n
0000000179  00000  n
0000000300  00000  n
0000000384  00000  n
trailer
 <<  /Size  7
  /Root  1 0 R
 >>
startxref
408
%%EOF
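
The reference-following steps described above can be sketched in a few lines. This is a deliberately naive illustration, not a general parser: it assumes an unencrypted file with a classic trailer and plain-text objects, as in the listing above. Files that use cross-reference streams or object streams (PDF 1.5 and later) or incremental updates would need a real parser. The name NaivePdfPageCounter and its methods are hypothetical.

```csharp
using System;
using System.Globalization;
using System.IO;
using System.Text;
using System.Text.RegularExpressions;

// Hypothetical helper: follows trailer -> /Root -> /Pages -> /Count.
public static class NaivePdfPageCounter
{
    public static int CountPages(string pdfText)
    {
        // 1. The last trailer names the document catalog via /Root.
        int trailerAt = pdfText.LastIndexOf("trailer", StringComparison.Ordinal);
        if (trailerAt < 0)
            throw new InvalidDataException("No classic trailer found.");
        Match root = Regex.Match(pdfText.Substring(trailerAt), @"/Root\s+(\d+)\s+(\d+)\s+R");
        if (!root.Success)
            throw new InvalidDataException("Trailer has no /Root reference.");

        // 2. The catalog points at the root of the page tree via /Pages.
        string catalog = FindObject(pdfText, root.Groups[1].Value, root.Groups[2].Value);
        Match pages = Regex.Match(catalog, @"/Pages\s+(\d+)\s+(\d+)\s+R");
        if (!pages.Success)
            throw new InvalidDataException("Catalog has no /Pages reference.");

        // 3. The root page-tree node's /Count is the total number of pages.
        string pageTree = FindObject(pdfText, pages.Groups[1].Value, pages.Groups[2].Value);
        Match count = Regex.Match(pageTree, @"/Count\s+(\d+)");
        if (!count.Success)
            throw new InvalidDataException("Page tree has no /Count entry.");
        return int.Parse(count.Groups[1].Value, CultureInfo.InvariantCulture);
    }

    // Returns the text between "N G obj" and the next "endobj".
    private static string FindObject(string pdfText, string number, string generation)
    {
        // The lookbehind stops "1 0 obj" from matching inside "11 0 obj".
        string pattern = @"(?<!\d)" + number + @"\s+" + generation + @"\s+obj\b(.*?)endobj";
        Match m = Regex.Match(pdfText, pattern, RegexOptions.Singleline);
        if (!m.Success)
            throw new InvalidDataException("Object " + number + " " + generation + " not found.");
        return m.Groups[1].Value;
    }

    public static int CountPagesInFile(string path)
    {
        // ISO-8859-1 maps every byte to exactly one char, so binary streams do not break decoding.
        return CountPages(File.ReadAllText(path, Encoding.GetEncoding("ISO-8859-1")));
    }
}
```

Applied to the listing above, the sketch reads /Root 1 0 R from the trailer, /Pages 3 0 R from object 1, and /Count 1 from object 3. Reading the whole file into memory is acceptable for a demonstration, but for multi-gigabyte collections a library such as iTextSharp remains the practical choice.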

Acknowledgements

iTextSharp is the work of Paulo Soares, Bruno Lowagie, et al.

PDF file syntax example is from PDF Reference, sixth edition. © 2006 Adobe® Systems Incorporated.

History

January 18, 2014: Application written
October 24, 2014: Article written

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)


Written By
Engineer, Robotic Assistance Devices / AITX
Canada
Yvan Rodrigues has 30 years of experience in information systems and software development for the industry. He is Senior Concept Designer at Robotic Assistance Devices.

He is a Certified Technician (C.Tech.), a professional designation granted by the Institute of Engineering Technology of Ontario (IETO).

Yvan draws on experience as owner of Red Cell Innovation Inc., Mabel's Labels Inc. as Manager of Systems and Development, the University of Waterloo as Information Systems Manager, and OTTO Motors as Senior Systems Engineer and Senior Concept Designer.

Yvan is currently focused on design of embedded systems.

Comments and Discussions

 
Question: How to start this Program?
Member 14847134, 28-May-20 22:41

Suggestion: Improvement of the code
wmjordan, 30-Oct-14 18:54
The simple constructor of PdfReader will load quite a lot of data.
Since you just have to get the page number of a PDF file, you should use the partial mode.
C#
new PdfReader (new RandomAccessFileOrArray (sourceFile), password)


Another approach is to P/Invoke the MuPDF engine, which is lightweight and faster than iTextSharp when opening a PDF file.
It is also great for generating PDF thumbnails, which may be a useful feature if you are to write a PDF file manager.
I've written an article to demonstrate how to get the page number of a PDF file and convert pages into Bitmap files.
Rendering PDF Documents with Mupdf and P/Invoke in C#
