My company, Red Cell Innovation Inc., provides a document scanning service. We often need a page count for a collection of PDF files for billing, quality control, scheduling, and estimating.
This is an application that I quickly whipped up to do that. It is written in a procedural style and comes to about 200 lines of code, including XAML and comments, in a single simple codebehind class.
- Simple: Drop a directory in the application.
- Fast: Scanned 20 GB of PDFs and counted 53,877 pages in 499 files in 7 seconds on an SSD (270 seconds on a network drive).
How It Works
| Language | C# 5.0 |
| .NET Framework | 4.5 |
| UI Framework | WPF |
| Libraries | iTextSharp |
| Pattern | Codebehind procedural |
When the application starts, the user is prompted to drop files and/or folders into the application's window.
When files or folders are dropped, the Start method is invoked, changing the visibility of UI elements to show the count screen. The async Analyze method is then invoked. It uses Task.Run to move the work onto a thread-pool thread that traverses the filesystem recursively. A thread is requested from the thread pool for each directory to be enumerated and its files counted.
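private async Task Analyze(IEnumerable<string> filenames)
{
    // Run the traversal on a thread-pool thread so the UI stays responsive.
    await Task.Run(async () =>
    {
        foreach (string filename in filenames)
        {
            if (Directory.Exists(filename))
            {
                // Directory: gather every PDF beneath it and recurse.
                // (The recursive call is reconstructed; it is not shown in the excerpt.)
                string[] nestedFilenames = Directory.GetFiles(filename, "*.pdf", SearchOption.AllDirectories);
                await Analyze(nestedFilenames);
                continue;
            }
            if (new FileInfo(filename).Extension.ToLower() != ".pdf")
                continue; // Skip anything that is not a PDF.
            int pages = Count(filename);
            this._pages += pages;
        }
    });
}

private int Count(string filename)
{
    // PdfReader comes from iTextSharp (namespace iTextSharp.text.pdf).
    using (var reader = new PdfReader(filename))
    {
        int pages = reader.NumberOfPages;
        return pages;
    }
}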
The Count method uses the iTextSharp library to read the PDF files. Since PDF files are internally indexed, the document does not need to be scanned from start to finish (see PDF Syntax). Instead, a PdfReader object is instantiated and its NumberOfPages property is read.
The system resources used are negligible.
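To illustrate what "internally indexed" means in practice, here is a minimal sketch of the first step a PDF reader takes: it looks near the end of the file for the startxref keyword, which gives the byte offset of the cross-reference table. The class and method names (PdfIndexLocator, ReadStartXref) are hypothetical and are not part of the application or of iTextSharp, and the 1024-byte tail window is an assumption that suits typical files.

```csharp
using System;
using System.Globalization;
using System.IO;
using System.Text;

// Hypothetical helper for illustration only.
public static class PdfIndexLocator
{
    // Returns the byte offset that follows the last "startxref" keyword.
    // Requires a seekable stream; only the tail of the file is read.
    public static long ReadStartXref(Stream pdf)
    {
        const int tailSize = 1024; // assumed window size
        long start = Math.Max(0, pdf.Length - tailSize);
        pdf.Seek(start, SeekOrigin.Begin);

        // Read the tail fully; Stream.Read may return fewer bytes than requested.
        byte[] buffer = new byte[(int)(pdf.Length - start)];
        int read = 0;
        while (read < buffer.Length)
        {
            int n = pdf.Read(buffer, read, buffer.Length - read);
            if (n == 0) break;
            read += n;
        }

        string tail = Encoding.ASCII.GetString(buffer, 0, read);
        int keyword = tail.LastIndexOf("startxref", StringComparison.Ordinal);
        if (keyword < 0)
            throw new FormatException("startxref not found in the last " + tailSize + " bytes.");

        // The offset is the first whitespace-delimited token after the keyword.
        string[] tokens = tail.Substring(keyword + "startxref".Length)
            .Split(new[] { ' ', '\r', '\n', '\t' }, StringSplitOptions.RemoveEmptyEntries);
        if (tokens.Length == 0)
            throw new FormatException("No offset follows startxref.");
        return long.Parse(tokens[0], CultureInfo.InvariantCulture);
    }
}
```

It could be called as PdfIndexLocator.ReadStartXref(File.OpenRead(path)). Only the last kilobyte is touched no matter how large the file is, which is consistent with the short scan times reported above.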
This could have been done quite easily without iTextSharp by writing a simple PDF parser; however, that would have increased the development time, which was about an hour since I was already familiar with iTextSharp.
To accomplish this without iTextSharp we would read the PDF and follow the references.
The following is a syntactically correct and complete PDF file. To find the Pages section, we first check the Trailer, which specifies reference 1 as the Root. We can see that object 1 contains the Catalog, which points to reference 3 as the Pages object. Note how the Pages object describes a single page, described in object 4.
%PDF-1.4
1 0 obj
<< /Type /Catalog
   /Outlines 2 0 R
   /Pages 3 0 R
>>
endobj
2 0 obj
<< /Type Outlines
   /Count 0
>>
endobj
3 0 obj
<< /Type /Pages
   /Kids [ 4 0 R ]
   /Count 1
>>
endobj
4 0 obj
<< /Type /Page
   /Parent 3 0 R
   /MediaBox [ 0 0 612 792 ]
   /Contents 5 0 R
   /Resources << /ProcSet 6 0 R >>
>>
endobj
5 0 obj
<< /Length 35 >>
stream
<-- Page-marking operators -->
endstream
endobj
6 0 obj
[ /PDF ]
endobj
xref
0 7
0000000000 65535 f
0000000009 00000 n
0000000074 00000 n
0000000120 00000 n
0000000179 00000 n
0000000300 00000 n
0000000384 00000 n
trailer
<< /Size 7
   /Root 1 0 R
>>
startxref
408
%%EOF
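The reference-following walk described above can be sketched in a few lines. In a PDF, the Pages object normally carries a /Count entry holding the number of pages beneath it, so the walk is trailer, then Root, then Pages, then /Count. This is only a sketch under loud assumptions: the class name NaivePdfPageCounter is hypothetical, and the code handles only simple files with a classic trailer, uncompressed objects, generation number 0, and no incremental updates. It is not how iTextSharp works internally and is no substitute for a real parser.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper for illustration only; assumes a simple, uncompressed PDF.
public static class NaivePdfPageCounter
{
    // Returns the text between "N 0 obj" and the next "endobj".
    private static string FindObject(string pdf, int number)
    {
        Match m = Regex.Match(pdf,
            @"(?<![0-9])" + number + @"\s+0\s+obj(.*?)endobj",
            RegexOptions.Singleline);
        if (!m.Success)
            throw new FormatException("Object " + number + " not found.");
        return m.Groups[1].Value;
    }

    // Reads an indirect reference such as "/Root 1 0 R" and returns the object number.
    private static int FindReference(string text, string key)
    {
        Match m = Regex.Match(text, "/" + key + @"\s+(\d+)\s+\d+\s+R\b");
        if (!m.Success)
            throw new FormatException("Reference /" + key + " not found.");
        return int.Parse(m.Groups[1].Value);
    }

    // Follows trailer -> Root (Catalog) -> Pages and returns the /Count value.
    public static int CountPages(string pdf)
    {
        int trailerAt = pdf.LastIndexOf("trailer", StringComparison.Ordinal);
        if (trailerAt < 0)
            throw new FormatException("No trailer found.");

        int root = FindReference(pdf.Substring(trailerAt), "Root");
        int pages = FindReference(FindObject(pdf, root), "Pages");

        Match count = Regex.Match(FindObject(pdf, pages), @"/Count\s+(\d+)");
        if (!count.Success)
            throw new FormatException("The Pages object has no /Count entry.");
        return int.Parse(count.Groups[1].Value);
    }
}
```

To try it on a file, the bytes can be read as text with a single-byte encoding, for example File.ReadAllText(path, Encoding.GetEncoding("iso-8859-1")), and passed to NaivePdfPageCounter.CountPages. Files that use cross-reference streams or compressed object streams have no plain-text trailer to find, which is one reason a library remains the practical choice.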
iTextSharp is the work of Paulo Soares, Bruno Lowagie, et al.
PDF file syntax example is from PDF Reference, sixth edition. © 2006 Adobe® Systems Incorporated.
| January 18, 2014 | Application written |
| October 24, 2014 | Article written |