OCR with Microsoft® Office

26 Oct 2007 | GPLv3 | 5 min read
Coming with Microsoft Office 2003, the MODI library offers you an easy but effective way to integrate Optical Character Recognition (OCR) functionality into your own applications.

Introduction

Optical Character Recognition (OCR) extracts text and layout information from document images. With the help of Microsoft Office Document Imaging Library (MODI), which is contained in the Office 2003 package, you can easily integrate OCR functionality into your own applications. In combination with the MODI Document Viewer control, you will have complete OCR support with only a few lines of code.

Important note: MS Office XP does not contain MODI; MS Office 2003 is required!

Getting Started

Adding the Library

First, add a reference to the library to your project: Microsoft Office Document Imaging 11.0 Type Library (located in MDIVWCTL.DLL).

Create a Document Instance and Assign an Image File

Supported image formats are TIFF, multi-page TIFF, and BMP.

C#
_MODIDocument = new MODI.Document(); 
_MODIDocument.Create(filename);
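
A slightly more defensive variant of these two lines could look like the sketch below. The class and method names (ModiLoader, OpenDocument) are hypothetical and not part of MODI, and the sketch assumes that the MODI type library is already referenced and that MODI reports unreadable files as COM errors.

```csharp
using System.IO;
using System.Runtime.InteropServices;

public static class ModiLoader
{
    // Hypothetical helper: returns a MODI document for the given image file,
    // or null if the file is missing or MODI cannot load it.
    public static MODI.Document OpenDocument(string filename)
    {
        if (!File.Exists(filename))
            return null;

        MODI.Document document = new MODI.Document();
        try
        {
            document.Create(filename);
            return document;
        }
        catch (COMException)
        {
            // Assumption: unsupported or damaged files surface as COM errors.
            Marshal.ReleaseComObject(document);
            return null;
        }
    }
}
```

When you are done with a document, calling its Close method (for example, document.Close(false) to discard changes) should release the image file again.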

Call the OCR Method

The OCR process is started by the MODIDocument.OCR method.

C#
// The MODI call for OCR 
_MODIDocument.OCR(_MODIParameters.Language, 
                  _MODIParameters.WithAutoRotation, 
                  _MODIParameters.WithStraightenImage);

The Document.OCR call processes all pages contained in the document. You can also run OCR on each page separately by calling the OCR method of a MODI.Image in the very same way. As you can see, the OCR method has three parameters:

  • Language
  • AutoRotation
  • StraightenImages

The use of these parameters depends on your specific imaging scenario.
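
As an illustration of the three parameters, the following sketch runs OCR page by page with fixed settings. The class and method names are hypothetical, English is chosen only as an example language, and the sketch assumes that MODI signals a page it cannot recognize with a COM error.

```csharp
using System.Runtime.InteropServices;

public static class ModiOcr
{
    // Runs OCR page by page so that one unreadable page does not abort the rest.
    // Returns the number of pages that were recognized without an error.
    public static int RecognizePages(MODI.Document document)
    {
        int recognized = 0;
        for (int i = 0; i < document.Images.Count; i++)
        {
            MODI.Image page = (MODI.Image)document.Images[i];
            try
            {
                page.OCR(MODI.MiLANGUAGES.miLANG_ENGLISH, // recognition language
                         true,   // auto-rotate pages that were scanned sideways
                         true);  // straighten slightly skewed scans
                recognized++;
            }
            catch (COMException)
            {
                // Assumption: pages without recognizable text raise a COM error.
            }
        }
        return recognized;
    }
}
```

Processing pages individually costs a few more lines than Document.OCR, but it lets you skip a bad page instead of losing the whole document.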

Screenshot - modiSettings.JPG

Tracking the OCR Progress

Since the whole recognition process can take a few seconds, you may want to keep an eye on the progress. The OnOCRProgress event serves this purpose.

C#
// add event handler for progress visualisation
_MODIDocument.OnOCRProgress += 
  new MODI._IDocumentEvents_OnOCRProgressEventHandler(this.ShowProgress);

// the handler itself is a method of the form
public void ShowProgress(int progress, ref bool cancel)
{
    statusBar1.Text = progress.ToString() + "% processed.";
}
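
The handler's second argument can also be used to stop a running recognition. The sketch below keeps that logic in a small class that is independent of any form. OcrProgressTracker and its members are hypothetical names, and the sketch assumes that MODI aborts the OCR pass once the handler sets cancel to true.

```csharp
using System;

public class OcrProgressTracker
{
    private volatile bool _cancelRequested;

    // Last percentage reported by MODI.
    public int LastProgress { get; private set; }

    // Call this from a Cancel button, for instance.
    public void RequestCancel()
    {
        _cancelRequested = true;
    }

    // Has the same signature as the OnOCRProgress handler (int, ref bool).
    public void OnProgress(int progress, ref bool cancel)
    {
        LastProgress = progress;
        Console.WriteLine(progress + "% processed.");
        if (_cancelRequested)
            cancel = true; // assumption: MODI stops when this flag is set
    }
}
```

You would hook it up like any other handler: _MODIDocument.OnOCRProgress += new MODI._IDocumentEvents_OnOCRProgressEventHandler(tracker.OnProgress). Keep in mind that a Cancel button can only react while OCR is running if the user interface gets a chance to process its events.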

The Document Viewer

Together with the MODI document model comes the MODI viewer component AxMODI.AxMiDocView. The viewer is contained in the same library as the document model (MDIVWCTL.DLL). With a single statement, you can assign the document to the viewer. The viewer offers many operations, such as selection and panning.

C#
axMiDocView1.Document = _MODIDocument;

To make the component available in Visual Studio, go to the Toolbox, open the context menu, select Add/Delete Elements..., and choose the COM Controls tab. Then search for Microsoft Office Document Imaging Viewer 11.0 and enable it.

Processing the Recognition Result

Working on the result structure is pretty straightforward. If you just want to use the full text, you simply need the image's Layout.Text property. As an example of further processing, here is a small statistics method:

C#
private void Statistic()
{
    // iterate through the document's structure and collect some statistics
    string statistic = "";
    for (int i = 0; i < _MODIDocument.Images.Count; i++)
    {
        int numOfCharacters = 0;
        int charactersHeights = 0;
        MODI.Image image = (MODI.Image)_MODIDocument.Images[i];
        MODI.Layout layout = image.Layout;
        // getting the page's words
        for (int j = 0; j < layout.Words.Count; j++)
        {
            MODI.Word word = (MODI.Word)layout.Words[j];
            // getting the word's characters
            for (int k = 0; k < word.Rects.Count; k++)
            {
                MODI.MiRect rect = (MODI.MiRect)word.Rects[k];
                charactersHeights += rect.Bottom - rect.Top;
                numOfCharacters++;
            }
        }
        // avoid dividing by zero on pages without any recognized characters
        float avHeight = numOfCharacters > 0
            ? (float)charactersHeights / numOfCharacters
            : 0f;
        statistic += "Page " + i + ": Average character height is: " +
                     avHeight.ToString("0.00") + " pixels!" + "\r\n";
    }
    MessageBox.Show("Document Statistic:\r\n" + statistic);
}
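
For the simpler case mentioned above, where only the full text is needed, a sketch along these lines collects Layout.Text from every page and writes it to a text file. The class and method names are hypothetical, and OCR must already have been run on the document.

```csharp
using System.IO;
using System.Text;

public static class ModiText
{
    // Concatenates the recognized text of all pages, one page after another.
    public static string GetFullText(MODI.Document document)
    {
        StringBuilder text = new StringBuilder();
        for (int i = 0; i < document.Images.Count; i++)
        {
            MODI.Image page = (MODI.Image)document.Images[i];
            text.AppendLine(page.Layout.Text);
        }
        return text.ToString();
    }

    // Writes the recognized text of the whole document to a UTF-8 text file.
    public static void SaveFullText(MODI.Document document, string path)
    {
        File.WriteAllText(path, GetFullText(document), Encoding.UTF8);
    }
}
```

This needs neither the viewer control nor any selection in the image.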

Searching

MODI also offers a full-featured built-in search. Since a document may contain several pages, you can use the search method to browse through the pages.

Screenshot - modiSearch.JPG

MODI offers several arguments to customize your search.

C#
// convert our search dialog properties to corresponding MODI arguments
object PageNum = _DialogSearch.Properties.PageNum;
object WordIndex = _DialogSearch.Properties.WordIndex;
object StartAfterIndex = _DialogSearch.Properties.StartAfterIndex;
object Backward = _DialogSearch.Properties.Backward;
bool MatchMinus = _DialogSearch.Properties.MatchMinus;
bool MatchFullHalfWidthForm = _DialogSearch.Properties.MatchFullHalfWidthForm;
bool MatchHiraganaKatakana = _DialogSearch.Properties.MatchHiraganaKatakana;
bool IgnoreSpace =_DialogSearch.Properties.IgnoreSpace;

To use the search function, you need to create an instance of the type MiDocSearchClass and pass all search arguments to its Initialize method:

C#
// initialize MODI search
MODI.MiDocSearchClass search = new MODI.MiDocSearchClass();
search.Initialize(
    _MODIDocument,
    _DialogSearch.Properties.Pattern,
    ref PageNum,
    ref WordIndex,
    ref StartAfterIndex,
    ref Backward,
    MatchMinus,
    MatchFullHalfWidthForm,
    MatchHiraganaKatakana,
    IgnoreSpace);

After the search instance is initialized, the search call itself is simple:

C#
MODI.IMiSelectableItem SelectableItem = null;
// the one and only search call
search.Search(null,ref SelectableItem);

You will find the search result in the referenced SelectableItem argument. The MODI search has impressive features and works very well, although it is restricted to plain-text search. In most real-world applications, you will need some kind of fuzzy searching, since the recognized text may be corrupted by individual OCR errors. But for a few lines of integration code, it is impressive functionality.
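
To give an idea of what fuzzy searching could mean here, the following sketch compares recognized words with a search pattern using the Levenshtein edit distance. This is not part of MODI. It is a generic technique, and the class and method names are made up for this example.

```csharp
using System;

public static class FuzzyMatch
{
    // Levenshtein distance: the number of single-character insertions,
    // deletions, and substitutions needed to turn a into b.
    public static int Distance(string a, string b)
    {
        int[,] d = new int[a.Length + 1, b.Length + 1];
        for (int i = 0; i <= a.Length; i++) d[i, 0] = i;
        for (int j = 0; j <= b.Length; j++) d[0, j] = j;

        for (int i = 1; i <= a.Length; i++)
        {
            for (int j = 1; j <= b.Length; j++)
            {
                int cost = a[i - 1] == b[j - 1] ? 0 : 1;
                d[i, j] = Math.Min(
                    Math.Min(d[i - 1, j] + 1, d[i, j - 1] + 1),
                    d[i - 1, j - 1] + cost);
            }
        }
        return d[a.Length, b.Length];
    }

    // True if word is within maxErrors edits of pattern, ignoring case.
    public static bool IsMatch(string word, string pattern, int maxErrors)
    {
        return Distance(word.ToLowerInvariant(),
                        pattern.ToLowerInvariant()) <= maxErrors;
    }
}
```

With one tolerated error, FuzzyMatch.IsMatch("lnvoice", "Invoice", 1) returns true, so a word in which the OCR engine confused "I" with "l" would still be found. You could apply such a check to the Text of each word in a page's Layout.Words.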

MODI, Office 2007 and Vista

Good news: both Office 2007 and Vista support MODI! It is not installed by default, but you can easily add the package through the installation options of Office 2007. Just rerun the setup.exe of your Office installation and choose the package as shown in the screenshot below.

Screenshot - modi_vista.jpg

About Document Processing

OCR is only one step in document processing. To get qualified access to the information in your paper-based documents, several steps and techniques are usually required:

Scanning

Before documents are available as images, they have to be digitized. This process is called 'scanning.' Two important standards are used for interacting with the scanning hardware: TWAIN and WIA. There are (at least) two good articles on CodeProject on how to use these APIs.

Image Processing

Although scanning devices keep getting better, several pre-processing methods can further improve the image quality before recognition, for instance noise reduction and angle correction.
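
As one example of noise reduction, the sketch below applies a 3x3 median filter to a grayscale image held in a two-dimensional byte array. It is a generic technique and not something MODI provides. The class name is hypothetical, and loading the pixels from a TIFF or BMP file is left out.

```csharp
using System;

public static class Denoise
{
    // 3x3 median filter over a grayscale image stored as [row, column].
    // Border pixels are copied unchanged; the source array is not modified.
    public static byte[,] Median3x3(byte[,] src)
    {
        int height = src.GetLength(0);
        int width = src.GetLength(1);
        byte[,] dst = (byte[,])src.Clone();
        byte[] window = new byte[9];

        for (int y = 1; y < height - 1; y++)
        {
            for (int x = 1; x < width - 1; x++)
            {
                // collect the pixel and its eight neighbours
                int n = 0;
                for (int dy = -1; dy <= 1; dy++)
                    for (int dx = -1; dx <= 1; dx++)
                        window[n++] = src[y + dy, x + dx];

                Array.Sort(window);
                dst[y, x] = window[4]; // the median of nine values
            }
        }
        return dst;
    }
}
```

A median filter removes isolated speckles, which scanners often produce, while keeping character edges sharper than a simple averaging filter would.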

OCR Itself

As the next step, OCR itself turns pixel-based images into layout and text elements. OCR can be called the 'highest' bottom-up technology, where the system has little or no knowledge about the business context. Recognizing handwritten documents is often called ICR (Intelligent Character Recognition).

Document Classification

In most business cases, you have certain target structures you want to fill with the document information. That is called 'Document Classification and Detail Extraction.' For instance, you might want to process invoices, or you have certain table structures to fill. In Document Processing Part II, you can see how this kind of content knowledge can be used.

Beyond

After that, you might have an address database you want to match the document addresses against. Due to 'noisy' environments or disordered information, you need more sophisticated techniques than simple SQL. In the last step, the extracted information is handed to the client application (like an ERP backbone), where customized workflow activities are triggered. The industry creates new names for that every couple of months: ECM (Enterprise Content Management), DMS (Document Management System), IDP (Intelligent Document Processing), DLC (Document Life Cycle).

Versions

  • 3 Apr 2007: Added Vista hints
  • 29 Sep 2006: Added search functions
  • 31 May 2005: Added references
  • 15 Apr 2005: Initial version

License

This article, along with any associated source code and files, is licensed under The GNU General Public License (GPLv3).


Written By
CEO Axonic Informationssysteme GmbH, Germany

Comments and Discussions

 
Re: Namespace name 'AxMODI' could not be found
sigfried4, 2-Feb-11 13:23
This happens when you don't have Office 2003, so the COM DLL can't be found. If you have Office 2007 or later, the only thing you have to do is update your references to the newer version of MODI (Office 2003 is version 11.0 and Office 2007 is version 12.0; I guess Office 2010 would be version 13.0, but I'm not sure). To accomplish the update, you can follow one of two processes:

Process A: (Visual Studio IDE is not necessary)
1.-Go to the folder where the solution is located (the folder that contains the file TableExtractor.csproj).
2.-Open the file TableExtractor.csproj with notepad.exe or another text editor.
3.-Look for the entry with the word AxMODI (this entry differs between versions of Visual Studio; in VS2010 it looks something like this):
<Project ...>
...
<ItemGroup>
...
<COMReference Include="AxMODI">
<Guid>{A5EDEDF4-2BBC-45F3-822B-E60C278A1A79}</Guid>
<VersionMajor>11</VersionMajor>
<VersionMinor>0</VersionMinor>
<Lcid>0</Lcid>
<WrapperTool>aximp</WrapperTool>
<Isolated>False</Isolated>
</COMReference>
...

4.-In this node, look for the "VersionMajor" property, which should have the value "11". This is where the version needs to change, so replace the value 11 with 12, or with whatever version your DLL has.
5.-Repeat steps 3 and 4 for the MODI entry and save the file. That's all.
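
If several projects need this change, the manual edit of Process A could be scripted roughly as follows. This is only a sketch: the class and method names are hypothetical, it assumes the project file contains COMReference entries like the one shown above, and you should back up the .csproj file before overwriting it.

```csharp
using System.Collections.Generic;
using System.Xml.Linq;

public static class ModiReferenceUpdater
{
    // Sets <VersionMajor> on the MODI and AxMODI COM references.
    // Returns the number of references that were changed.
    public static int SetMajorVersion(XDocument project, int major)
    {
        int changed = 0;
        // Copy the elements first so the tree is not modified while enumerating.
        List<XElement> elements = new List<XElement>(project.Descendants());
        foreach (XElement reference in elements)
        {
            // Compare local names so the MSBuild XML namespace does not matter.
            if (reference.Name.LocalName != "COMReference")
                continue;

            string include = (string)reference.Attribute("Include");
            if (include != "MODI" && include != "AxMODI")
                continue;

            foreach (XElement child in reference.Elements())
            {
                if (child.Name.LocalName == "VersionMajor")
                {
                    child.Value = major.ToString();
                    changed++;
                }
            }
        }
        return changed;
    }
}
```

A typical call would load the file with XDocument.Load, call SetMajorVersion(project, 12), and write the result back with project.Save.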

Process B:
1.-Open your solution (the TableExtractor.sln file) with Visual Studio.
2.-Open the Solution Explorer window (View->Solution Explorer or Ctrl+Alt+L), go to References, and delete both the MODI and AxMODI references.
3.-Add MODI to the references. In the Solution Explorer window, right-click the References element and click Add Reference. In the Add Reference window, select the COM tab, select "Microsoft Office Document Imaging <Version> Type Library", and click Accept. Replace <Version> with the number that appears in the selected element. MODI is now added to References.
4.-Add AxMODI to the references. This step is tricky, so IMHO the best way to accomplish it is:
4.1.-Open the Toolbox window.
4.2.-Right-click, choose "Add Tab", and name the tab MODI.
4.3.-Right-click, then choose "Choose Items ...".
4.4.-In the COM Components tab, search for "Microsoft Office Document Imaging Viewer Control <Version>" and mark the checkbox. <Version> refers to the version to which you are updating. The item is now added to the MODI tab created previously.
4.5.-Now, to add the reference to the project, right-click the TableExtractor element in Solution Explorer and choose "Add->Windows Form...". In the "Add new element" window, name it "deleteme.cs" and click the Add button.
4.6.-In Solution Explorer, double-click the "deleteme.cs" Windows form to open it in the designer.
4.7.-Drag the element "Microsoft Office Document Imaging Viewer Control <Version>" from the toolbox and drop it onto the Windows form. With this, the AxMODI reference has been added to the project.
4.8.-Now just delete the "deleteme.cs" Windows form created before.