Converting Scanned Document Images to Searchable PDFs with OCR

14 Dec 2006, CPOL, 5 min read
Demonstrates the use of Atalasoft's DotImage GlyphReader OCR to enable .NET applications to digitize paper documents as searchable PDFs that can be indexed by search engines.

This article is in the Product Showcase section for our sponsors at CodeProject. These articles are intended to provide you with information on products and services that we consider useful and of value to developers.

Introduction

From health records, tax forms, and insurance claims to old memos, magazines, and books, businesses are digitizing paper every day. With better search technology now widely available, having searchable text for all these documents is an obvious win. The common way to get it is to use OCR (Optical Character Recognition) to translate the images into a document format that indexers already understand. The drawback is that we often lose the layout, images, and color of the original. And since no OCR is perfect, we still need the original image to be able to fix mistakes. What we want is a document format that looks like the original images when humans look at it, but like plain text when an indexer looks at it. And when we copy from the image, we want text put on the clipboard. This is the promise of the searchable PDF.

In a searchable PDF, the original scanned image is retained so any human can read the document. The textual content extracted via OCR is placed behind the image, so search indexers can see it and Acrobat Reader lets us select it as text. The ubiquity of desktop and enterprise search, ever-increasing OCR accuracy, and mass adoption of PDF are a powerful combination that makes searchable PDFs the ideal format for storing digitized paper.

This article demonstrates how simple it is to develop an application that generates searchable PDFs from scanned documents. The resulting files can be indexed by Google, SharePoint, Microsoft desktop search, and other applications that index PDF documents.

To help build this application, Atalasoft publishes an OCR framework that simplifies working with industry-leading OCR engines and our own highly accurate engine, GlyphReader. A free 30-day evaluation of the Atalasoft DotImage Document Imaging SDK, including the OCR module, GlyphReader, and all other add-ons, can be downloaded from atalasoft.com.

Using our framework, these steps are handled for you:

  1. Decompress the image.
  2. Pre-process the image to make OCR more accurate (including cleaning or deskewing it).
  3. OCR the image to extract the text.
  4. Re-encode the image in a choice of formats, including CCITT Group 4, JBIG2, JPEG, or JPEG2000, to keep the file size as small as possible.
  5. Construct a PDF with the image and the extracted text, with each word accurately positioned behind the appropriate place in the image.

Atalasoft's OCR framework includes a flexible Translator interface for producing output from the recognition process. For example, TextTranslator is available out of the box and generates a text stream. The Searchable PDF Module includes the PdfTranslator, which generates either text-only PDFs or image-with-hidden-text PDFs. Both are "searchable", but the latter includes the original image and is what we are going to use.
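To make the translator idea concrete, here is a minimal sketch of how an engine might choose an output translator by MIME type. Every type in it (ISketchTranslator, PlainTextSketchTranslator, SketchEngine) is a hypothetical stand-in invented for illustration. None of them is part of the DotImage API, and the real framework may resolve translators differently.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

// Hypothetical stand-ins, NOT Atalasoft's API: they only illustrate
// how an engine could pick an output translator by MIME type.
interface ISketchTranslator
{
    string MimeType { get; }
    void Write(IEnumerable<string> recognizedWords, TextWriter output);
}

// Writes the recognized words as a single space-separated line of text.
sealed class PlainTextSketchTranslator : ISketchTranslator
{
    public string MimeType { get { return "text/plain"; } }

    public void Write(IEnumerable<string> recognizedWords, TextWriter output)
    {
        output.Write(string.Join(" ", recognizedWords));
    }
}

sealed class SketchEngine
{
    private readonly List<ISketchTranslator> translators = new List<ISketchTranslator>();

    // Add-on translators are registered here before translating.
    public IList<ISketchTranslator> Translators { get { return translators; } }

    // Finds the first registered translator for the requested MIME type
    // and lets it write the output; fails if none is registered.
    public void Translate(IEnumerable<string> recognizedWords, string mimeType, TextWriter output)
    {
        foreach (ISketchTranslator t in translators)
        {
            if (string.Equals(t.MimeType, mimeType, StringComparison.OrdinalIgnoreCase))
            {
                t.Write(recognizedWords, output);
                return;
            }
        }
        throw new NotSupportedException("No translator registered for " + mimeType);
    }
}
```

In this sketch, requesting a MIME type with no registered translator fails with an exception. That is one plausible reason an add-on translator would have to be registered before it can be used.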

This article uses the following 2-page color TIFF as the source document to OCR. Shown here are lower-resolution images of the original scanned TIFF (a recent white paper from Atalasoft that was printed and scanned in color).

Image 2

Image 3

Extracting the Text into a Text File

Let's start with a method that simply extracts the text into a file. First, we create an ImageSource object to efficiently handle multi-page image files. Then we create the OCR engine, initialize it, translate the images to the desired MIME type, and shut down the engine.

C#
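void MakeText(string inFile, string outFile)
{
    // Wrap the input file in an image source; true requests all pages of a multi-page file.
    using (FileSystemImageSource fis =
           new FileSystemImageSource(new string[] { inFile }, true))
    {
        GlyphReaderEngine ocr = new GlyphReaderEngine();
        ocr.Initialize();
        try
        {
            // Recognize the pages and write the result as plain text.
            ocr.Translate(fis, "text/plain", outFile);
        }
        finally
        {
            // Shut the engine down even if recognition throws.
            ocr.ShutDown();
        }
    }
}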

The resulting text file obviously does not look at all like the original document, but it does contain the text. It also isn't stored in the same file as the image. We can do better.

Creating the Searchable PDF

For the next code sample, we'll use a PdfTranslator to create a searchable PDF. To do this we need to:

  1. Create an instance of the PdfTranslator
  2. Set its OutputType to TextUnderImage (to create a searchable PDF)
  3. Add it to the OcrEngine's Translators collection (since it's an add-on, it doesn't come pre-registered)
  4. Use the engine to translate with the output MIME type set to "application/pdf"

Here's the code:

C#
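void MakePdf(string inFile, string outFile)
{
    // Wrap the input file in an image source; true requests all pages of a multi-page file.
    using (FileSystemImageSource fis =
           new FileSystemImageSource(new string[] { inFile }, true))
    {
        GlyphReaderEngine ocr = new GlyphReaderEngine();

        // The PDF translator is an add-on, so register it with the engine first.
        PdfTranslator pdfTrans = new PdfTranslator();
        pdfTrans.OutputType = PdfTranslatorOutputType.TextUnderImage;
        ocr.Translators.Add(pdfTrans);

        ocr.Initialize();
        try
        {
            // Recognize the pages and write a PDF with the text under the image.
            ocr.Translate(fis, "application/pdf", outFile);
        }
        finally
        {
            // Shut the engine down even if recognition throws.
            ocr.ShutDown();
        }
    }
}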

The result is a high-quality searchable PDF! When the PDF is opened in Acrobat Reader (see screenshot below), all text in the document can be selected as real text, even though the visible part of this PDF is the actual color rasterized image.

The OCR Engine and PDF Translator handle all the details required to deskew the image, store it, produce accurate OCR, compress the image, accurately place the recognized text under the right part of the image, and generate the PDF document.

Simply having this file on your filesystem lets Google Desktop Search or Windows Desktop Search index the document properly, while the document itself still looks exactly like the original.

Image 4

Product Requirements

To add searchable PDF generation to your applications, you will need the following products from Atalasoft:

  • DotImage Document Imaging SDK
  • OCR GlyphReader Engine Module (runtimes are additional)
  • OCR Searchable PDF Module (includes 20 runtimes)

Everything is included in the DotImage SDK, which you can download and evaluate free for 30 days. Be sure to request Evaluation Licenses for the required products. Attached to this article are the resulting PDF and the C# 2.0 source code for a simple console application where the first argument is the input image file and the second argument is the resulting searchable PDF file.
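The attached source is not reproduced here, but a console wrapper of the kind described could look roughly like the sketch below. The class name SearchablePdfConsole, the Run method, and the exit codes are assumptions made for illustration. The conversion itself is passed in as a delegate, so the sketch compiles without the SDK.

```csharp
using System;
using System.IO;

// Hypothetical console wrapper; the names and exit codes are illustrative only.
static class SearchablePdfConsole
{
    // Returns a process exit code: 0 on success, 1 on bad usage, 2 on a conversion failure.
    // "convert" receives (input image path, output PDF path).
    public static int Run(string[] args, Action<string, string> convert, TextWriter error)
    {
        if (args == null || args.Length != 2)
        {
            error.WriteLine("Usage: <input image file> <output PDF file>");
            return 1;
        }
        if (!File.Exists(args[0]))
        {
            error.WriteLine("Input file not found: " + args[0]);
            return 1;
        }
        try
        {
            convert(args[0], args[1]);
            return 0;
        }
        catch (Exception ex)
        {
            error.WriteLine("Conversion failed: " + ex.Message);
            return 2;
        }
    }
}
```

Assuming MakePdf from this article is declared static, a Main method could simply return SearchablePdfConsole.Run(args, MakePdf, Console.Error).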

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)


Written By
Founder
United States
The founder and CEO of Atalasoft, provider of Document and Photo Imaging Toolkits for Microsoft .NET Developers and Document Imaging and Viewing for SharePoint

Comments and Discussions

 
Question: Does it support Chinese like charset?
fengjinzhi, 15-Jan-07 13:59

Answer: Re: Does it support Chinese like charset?
Bill Bither, 17-Jan-07 4:17

General: Re: Does it support Chinese like charset?
combina_2, 30-Jan-07 9:26

General: Re: Does it support Chinese like charset?
Bill Bither, 30-Jan-07 11:17

DotImage Document Imaging is a document imaging framework for .NET. The core SDK includes most codecs (such as TIFF), image processing commands such as deskew and despeckle, visual controls for Windows Forms and ASP.NET WebForms, and support for metadata, annotations, TWAIN scanning, and printing, among other features.

We also sell add-on modules to the core DotImage Document Imaging for features such as PDF Rasterization, Barcode Reading, and OCR. We offer the GlyphReader OCR engine, which is an add-on to DotImage, and we've also partnered with industry-leading OCR vendors ExperVision and Abbyy. Since you need support for Chinese characters in your OCR, I can recommend the following products:

DotImage Document Imaging SDK
DotImage Expervision OCR SDK w/ Asian Character Support

or:

DotImage Document Imaging SDK
DotImage OCR Abbyy Engine SDK

You can deploy this to a server for an ASP.NET application, in which case you'll also need to purchase a production server license.

More information on Atalasoft's OCR is available on our website at http://www.atalasoft.com/products/dotimage/ocr/

Please let me know if you have any more questions.

Bill Bither
Atalasoft, Inc.
http://www.atalasoft.com/

