
OCR image-only PDF files.

3 Aug 2020 | CPOL | 4 min read
A class library and command line utility to add OCR information to image-only PDF files

Introduction

Back in 2009 I wrote an article on how to index image-only PDF files. The task at that point was just to extract text from image-only PDF files in an iFilter so that the files could be indexed in a Windows environment. This time around the task was somewhat different: to add text information to the PDF files themselves. The benefit of this approach is that the files can be indexed by the standard Adobe/Microsoft iFilters, and the text can also be selected in visual tools (Adobe Reader, Chrome, Edge, ...).

Background

I needed to add OCR information to thousands of PDF files (actually stored in SQL Server). I wanted to create a script/utility that could be executed daily to index any new PDF files that don't already contain searchable text. After searching for a while, I found a recipe:

  1. Use Ghostscript to render the individual pages of the PDF to image (JPG) files
  2. Use Tesseract to OCR the images
  3. Store the extracted text back in the PDF

Take 1

Apparently, since I last touched Tesseract in 2009, a new feature has been added: the image and the OCR text can be exported together as a PDF file. I think you need Tesseract version 4+. So the solution seemed pretty simple, and the following batch file emerged within the next 20 minutes (see ocr.bat in the attached project):

Batch
rem adjust these paths to match the installed versions
set gs="C:\Program Files\gs\gs9.52\bin\gswin64c.exe"
set tesseract="C:\Program Files\Tesseract-OCR\tesseract.exe"

rem %~1 strips surrounding quotes, so quoted arguments are handled as well
if "%~1"=="" goto :badParams
if "%~2"=="" goto :badParams
if not exist "%temp%\ocr\" mkdir "%temp%\ocr\"
set nm=%~n1
SETLOCAL ENABLEDELAYEDEXPANSION

rem split the pdf into one 300 dpi jpeg per page
%gs% -dSAFER -dBATCH -dNOPAUSE -sDEVICE=jpeg -r300 -dTextAlphaBits=4 -o "%temp%\ocr\ocr_%nm%_%%04d.jpg" -f "%~1"

rem ocr each jpeg; the "pdf" config makes tesseract write <name>.pdf next to the image
for %%i in ("%temp%\ocr\ocr_%nm%_*.jpg") do %tesseract% "%%i" "%%~dpni" -l eng pdf
del "%temp%\ocr\ocr_%nm%_*.jpg"

rem combine the per-page pdfs (the list is unquoted, so the temp path must not contain spaces)
set ff=#
for %%i in ("%temp%\ocr\ocr_%nm%_*.pdf") do set ff=!ff! %%i
set ff=%ff:#=%
%gs% -dBATCH -dNOPAUSE -q -sDEVICE=pdfwrite -o "%~2" %ff%
del "%temp%\ocr\ocr_%nm%_*.pdf"

goto :eof

:badParams
echo usage: %0 pdf-In pdf-Out

As you can see, the script does the following:

  1. Uses Ghostscript to render the individual pages into %temp%\ocr\ocr_<name>_####.jpg files
  2. Runs Tesseract on each JPG file to create a matching %temp%\ocr\ocr_<name>_####.pdf file
  3. Uses Ghostscript to combine all the PDF files into the output file
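
For readers who would rather drive the same tools from C#, the first two command lines of the script could be assembled roughly as below. This is only a sketch: `OcrCommands` and its methods are hypothetical helpers that are not part of the attached project, and the switches are simply the ones already used in ocr.bat.

```csharp
using System;
using System.Diagnostics;

// Hypothetical helper, not part of the attached project.
public static class OcrCommands
{
    // Arguments that make Ghostscript render every page of pdfIn to a 300 dpi JPEG.
    // A %04d in outPattern is replaced by Ghostscript with the page number.
    public static string GhostscriptSplitArgs(string pdfIn, string outPattern)
    {
        return "-dSAFER -dBATCH -dNOPAUSE -sDEVICE=jpeg -r300 -dTextAlphaBits=4 "
             + "-o \"" + outPattern + "\" -f \"" + pdfIn + "\"";
    }

    // Arguments that make Tesseract write <outputBase>.pdf for one page image.
    public static string TesseractPdfArgs(string image, string outputBase)
    {
        return "\"" + image + "\" \"" + outputBase + "\" -l eng pdf";
    }

    // Runs an external tool, waits for it to finish and returns its exit code.
    public static int Run(string exePath, string args)
    {
        var psi = new ProcessStartInfo(exePath, args)
        {
            UseShellExecute = false,
            CreateNoWindow = true
        };
        using (var p = Process.Start(psi))
        {
            p.WaitForExit();
            return p.ExitCode;
        }
    }
}
```

Keeping the argument strings in separate methods makes them easy to inspect or log before any process is started.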

Take 1 Problems

Problems with solution 1 showed up very quickly for large PDF files (10+ pages):

  1. It was pretty slow. But I figured that, since it runs as a background process and only for new files, maybe I could live with this.
  2. The output file was much, much larger than the source, roughly 4 times. Multiplied by thousands of files, that became a show stopper.

Take 2 - HOCR2PDF

Other people have had the same problem. Enter HOCR2PDF (https://archive.codeplex.com/?p=hocrtopdf). Apparently Tesseract, aside from outputting OCR results as text or PDF, can also output them as hOCR files, which are effectively HTML files with the layout information encoded in them. So the task changes slightly:

  1. Use Ghostscript to split the PDF file into multiple JPGs
  2. Use Tesseract to convert the JPGs into hOCR files
  3. Parse the hOCR files
  4. Use a PDF library (iTextSharp) to add the text information to the output PDF
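
To give an idea of what the parsing step involves, here is a minimal sketch that pulls words and their bounding boxes out of hOCR markup with a regular expression. It assumes the word format Tesseract typically emits when run with the `hocr` config, a `span` of class `ocrx_word` whose `title` starts with `bbox left top right bottom`, with `class` as the first attribute. `HocrWord` and `HocrParser` are hypothetical names, and neither HOCR2PDF nor the attached project necessarily parses the file this way.

```csharp
using System;
using System.Collections.Generic;
using System.Net;
using System.Text.RegularExpressions;

// One recognized word and its bounding box in image pixels (origin at top-left).
public sealed class HocrWord
{
    public string Text;
    public int Left, Top, Right, Bottom;
}

// Hypothetical helper, not part of the attached project.
public static class HocrParser
{
    // Matches e.g.
    // <span class='ocrx_word' id='word_1_1' title='bbox 36 92 96 116; x_wconf 95'>Hello</span>
    static readonly Regex WordRx = new Regex(
        @"<span\s+class=['""]ocrx_word['""][^>]*?title=['""]bbox (\d+) (\d+) (\d+) (\d+)[^'""]*['""][^>]*>(.*?)</span>",
        RegexOptions.Singleline | RegexOptions.IgnoreCase);

    static readonly Regex TagRx = new Regex("<[^>]+>");

    public static List<HocrWord> Parse(string hocr)
    {
        var words = new List<HocrWord>();
        foreach (Match m in WordRx.Matches(hocr))
        {
            // Drop nested markup such as <strong> or <em>, then decode HTML entities.
            string text = WebUtility.HtmlDecode(TagRx.Replace(m.Groups[5].Value, "")).Trim();
            if (text.Length == 0) continue;
            words.Add(new HocrWord
            {
                Text = text,
                Left = int.Parse(m.Groups[1].Value),
                Top = int.Parse(m.Groups[2].Value),
                Right = int.Parse(m.Groups[3].Value),
                Bottom = int.Parse(m.Groups[4].Value)
            });
        }
        return words;
    }
}
```

A regular expression is enough for a sketch, but an HTML or XML parser would be a more robust choice if the markup varies between Tesseract versions.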

Take 2 - Problems

Well, HOCR2PDF had problems similar to those of my original script. It was still slow and the files were still pretty large, even though they were about half the size of those produced by solution #1.

Take 3 - Final

So I went ahead and created my own project.

After some troubleshooting and performance improvements, such as enabling compression and using a single font for a whole page, I found out that the size bloat boils down to a single iTextSharp function call:

stamp.GetImportedPage(stamp.Reader, pg)

That call alone seems to add about 30K per page, and the only reason it's needed is to get the page height. After replacing this call with

stamp.Reader.GetPageSizeWithRotation(pg).Height

the size bloat went away and, to my surprise, the output file actually became smaller than the source (probably due to enabling compression and removing unused objects).
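
The page height is presumably needed because OCR word boxes are measured in image pixels from the top-left corner, while PDF content is positioned in points (1/72 inch) from the bottom-left corner. The arithmetic can be sketched as follows. `PdfCoords` is a hypothetical helper, and the 300 dpi used in the example is an assumption taken from the `-r300` switch in ocr.bat, not necessarily the resolution the final project uses.

```csharp
using System;

// Hypothetical helper, not part of the attached project.
public static class PdfCoords
{
    const float PointsPerInch = 72f;

    // Converts a pixel length at the given render resolution to PDF points.
    public static float ToPoints(int pixels, float dpi)
    {
        return pixels * PointsPerInch / dpi;
    }

    // Returns the PDF y coordinate (measured up from the bottom of the page)
    // for a pixel row measured down from the top of the page image.
    public static float FlipY(int pixelY, float dpi, float pageHeightPoints)
    {
        return pageHeightPoints - ToPoints(pixelY, dpi);
    }
}
```

For example, on a page that is 792 points high and rendered at 300 dpi, pixel row 0 maps to y = 792 (the top of the page) and pixel row 3300 maps to y = 0 (the bottom). The `pageHeightPoints` argument is the kind of value that `stamp.Reader.GetPageSizeWithRotation(pg).Height` returns.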

To address the performance problem, I decided to run Tesseract for each page concurrently, using the ThreadPool. For large files (10+ pages), the performance boost was drastic.
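
The general pattern of fanning pages out to the thread pool and waiting for all of them can be sketched as below. This is an illustration under assumptions, not the code from the attached project: `PageRunner` is a hypothetical class, and `ocrPage` stands in for whatever function runs Tesseract on one page and returns its text.

```csharp
using System;
using System.Threading;

// Hypothetical helper, not part of the attached project.
public static class PageRunner
{
    // Runs ocrPage(pageNumber) for pages 1..pageCount on the thread pool
    // and returns the per-page results in page order.
    public static string[] RunAll(int pageCount, Func<int, string> ocrPage)
    {
        var results = new string[pageCount];
        if (pageCount == 0) return results;
        Exception firstError = null;

        using (var done = new CountdownEvent(pageCount))
        {
            for (int pg = 1; pg <= pageCount; pg++)
            {
                int page = pg; // copy, so each work item sees its own page number
                ThreadPool.QueueUserWorkItem(_ =>
                {
                    try { results[page - 1] = ocrPage(page); }
                    catch (Exception ex) { Interlocked.CompareExchange(ref firstError, ex, null); }
                    finally { done.Signal(); }
                });
            }
            done.Wait(); // block until every page has finished
        }

        // Report the first failure, if any, on the calling thread.
        if (firstError != null) throw new AggregateException(firstError);
        return results;
    }
}
```

Each work item writes to its own slot of the array, so no locking is needed for the results, and a call such as `PageRunner.RunAll(3, p => "page " + p)` returns the three strings in page order regardless of which page finishes first.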

 

Using the code

The project contains a class PdfOcr with one public method, OcrFile. It is used as follows:

string txt = new PdfOcr().OcrFile(fileIn, fileOut);

This code will OCR the fileIn PDF file, create fileOut and return the OCR text. The class may need to be customised by assigning to the following static variables:

  1. GhostScript - location of the Ghostscript executable
  2. Tesseract - location of the Tesseract executable
  3. wdiTemp - folder where temp files will be generated
  4. tmpPrfx - prefix for all the temp files

The class is stored in the Program.cs file along with the main program, which takes two arguments: the source and destination file names.

Points of Interest

  1. https://github.com/UB-Mannheim/tesseract/wiki - Tesseract Windows binaries
  2. https://www.ghostscript.com/download/gsdnld.html - Ghostscript binary download
  3. https://archive.codeplex.com/?p=hocrtopdf - HOCR2PDF .NET utility

History

  • initial release
  • 8/17/2023 - Made text transparent. Updated Zip file

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)


Written By
http://www.GaspMobileGames.com
United States
Writing code since 1987 using whatever language/environment you can imagine. Recently got into mobile games. Feel free to check them out at http://www.GaspMobileGames.com
