Using DotImage to Scan Documents into the Cloud

Lou Franco

Nov 13, 2008

CPOL

Introduction

Got a stack of paper, a scanner, and an account on Scribd? Want to share those documents on your blog or with your co-workers? Then this project is for you.

With DotImage, Visual Studio, and a little bit of code, you can write a Scan-to-Scribd desktop application to put those documents online. Best of all, the upload code can be easily modified to handle any remote document repository that supports uploading documents as a service.

The first step is to build a basic scanning application with DotImage. If you would like videos that walk through the exact steps, see the following lessons (the whole series runs less than 23 minutes):

Video Tutorial: Lesson 1 - Basic Structure of a Capture Application
Video Tutorial: Lesson 2 - Implementing Save and AutoZoom
Video Tutorial: Lesson 3 - Getting a List of Devices and Scanning
Video Tutorial: Lesson 4 - Basic Cleanup of Scanned Documents

The basic steps are simple:

Create a WinForms application (get a free evaluation download of DotImage).
Drag a DotImage DocumentViewer to the form (set its Dock to fill).
Drag a ToolStrip to the form.
Add a ToolStripComboBox (to hold the list of scanners) to the ToolStrip, and name it tscbScanners.
Add a ToolStripButton (to initiate scanning) to the ToolStrip, and name it tsbScan.
Drag a DotImage Acquisition object to the Form (this is how you scan).

On Form Load, we want to fill in the list of installed scanners. To do so, call this function:

private void InitializeScannerList()
{
    tsbScan.Enabled = false;
    tscbScanners.Enabled = false;
    if (acquisition1.SystemHasTwain)
    {
        // Loop through each scanner, adding to list
        foreach (Device d in acquisition1.Devices)
        {
            string devName = d.Identity.ProductName;
            tscbScanners.Items.Add(devName);
            // Make sure the default one is selected
            if (d == acquisition1.Devices.Default)
            {
                tscbScanners.SelectedItem = devName;
            }
        }
        // If we have scanners, enable the scanning controls
        if (tscbScanners.Items.Count > 0)
        {
            tsbScan.Enabled = true;
            tscbScanners.Enabled = true;
        }
    }
}

When the scan button is pressed, we can scan documents with this code (in the tsbScan Click event handler):

private void tsbScan_Click(object sender, EventArgs e)
{
    // If a scanner is selected, use it to scan
    Device selectedDevice = GetSelectedDevice();
    if (selectedDevice != null)
    {
        selectedDevice.Acquire();
    }
}

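private Device GetSelectedDevice()
{
    // SelectedItem is null when no scanner has been chosen
    if (tscbScanners.SelectedItem == null)
    {
        return null;
    }

    // Look for the selected scanner and return it
    string selectedName = tscbScanners.SelectedItem.ToString();
    foreach (Device d in acquisition1.Devices)
    {
        if (selectedName == d.Identity.ProductName)
        {
            return d;
        }
    }
    return null;
}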

Every time an image is scanned, the Acquisition object’s ImageAcquired event will fire. Add a handler with this code to add the image to the document viewer:

// This function is called for each page. Add the page
// to the document viewer
private void acquisition1_ImageAcquired(object sender,
                                        AcquireEventArgs e)
{
    documentViewer1.Add(AtalaImage.FromBitmap(e.Image), "", "");
}

Once the document is scanned, we can upload to any service that accepts document uploads. Scribd (www.scribd.com) is a free document sharing website that has a web-service interface for uploading documents. There is an excellent open-source .NET library called Scribd.NET that makes interacting with the service relatively simple. You can get it here: http://www.codeplex.com/scribdnet.

Here is what you need to do to upload a document to Scribd using Scribd.NET.

Add a ToolStripProgressBar to your form (either in your ToolStrip or in a StatusStrip). Name it tspbUploadProgress.

Initialize the library with your API Key (get an API Key here: http://www.scribd.com/platform/account):

// Initialize and login
private void InitializeScribd()
{
    // Replace _apiKey and _secretKey with your own values
    Scribd.Net.Service.APIKey = _apiKey;
    Scribd.Net.Service.SecretKey = _secretKey;
    Scribd.Net.Service.EnforceSigning = true;
}
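
Hard-coded keys are easy to leak when the source is shared. One option, sketched here as an assumption rather than anything Scribd.NET requires, is to read them from environment variables. The class and the variable names below are hypothetical.

```csharp
using System;

// Hypothetical helper: reads the Scribd keys from environment variables
// so that they do not have to be compiled into the application.
static class ScribdKeys
{
    // Assumed variable names; choose whatever fits your setup.
    public const string ApiKeyVariable = "SCRIBD_API_KEY";
    public const string SecretKeyVariable = "SCRIBD_SECRET_KEY";

    // Returns the variable's value, or throws if it is missing or empty.
    public static string Require(string variableName)
    {
        string value = Environment.GetEnvironmentVariable(variableName);
        if (string.IsNullOrEmpty(value))
        {
            throw new InvalidOperationException(
                "Set the " + variableName + " environment variable first.");
        }
        return value;
    }
}
```

With this in place, InitializeScribd could assign ScribdKeys.Require(ScribdKeys.ApiKeyVariable) to Scribd.Net.Service.APIKey, and likewise for the secret key.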

Log the user in:

void LoginUser(string user, string password)
{
    // Subscribe to events
    User.LoggedIn += _loggedInHandler;
    User.LoginFailed += _loggedInHandler;

    // Sign into the service
    User.Login(user, password);
}

Handle the login events (called asynchronously):

// This method is called on login.
void User_LoggedIn(object sender, UserEventArgs e)
{
    User.LoggedIn -= _loggedInHandler;
    User.LoginFailed -= _loggedInHandler;
    // Remember whether the login succeeded
    _scribdInitialized = e.Success;
}

Declare _loggedInHandler and _scribdInitialized:

private EventHandler<UserEventArgs> _loggedInHandler;
private static bool _scribdInitialized = false;

And initialize _loggedInHandler like this:

_loggedInHandler
= new EventHandler<UserEventArgs>(User_LoggedIn);
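
The login code keeps its delegate in a field so that the handler can detach itself as soon as the event has fired, which matters because the events appear to be static and would otherwise keep the form subscribed. The same subscribe-once idea can be sketched without Scribd.NET. LoginSource below is a hypothetical stand-in and not part of that library.

```csharp
using System;

// Hypothetical stand-in for an API that raises an event when login finishes.
sealed class LoginSource
{
    public event EventHandler<EventArgs> LoggedIn;

    public void RaiseLoggedIn()
    {
        EventHandler<EventArgs> handler = LoggedIn;
        if (handler != null)
        {
            handler(this, EventArgs.Empty);
        }
    }
}

static class OneShotHandlerSketch
{
    // Subscribes a handler that removes itself on first use, raises the
    // event the given number of times, and returns how often it ran.
    public static int CountCalls(int raises)
    {
        LoginSource source = new LoginSource();
        int calls = 0;

        EventHandler<EventArgs> handler = null;
        handler = delegate(object sender, EventArgs e)
        {
            source.LoggedIn -= handler; // detach before doing any work
            calls++;
        };
        source.LoggedIn += handler;

        for (int i = 0; i < raises; i++)
        {
            source.RaiseLoggedIn();
        }
        return calls;
    }
}
```

However many times the event is raised, the handler runs once, which is the behavior User_LoggedIn gets by unsubscribing _loggedInHandler at the top of the method.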

Before you upload, you need to handle events that Scribd.NET raises once the file is uploaded and saved:

private void InitializeScribdEventHandlers()
{
    Document.Uploaded +=
        new EventHandler<DocumentEventArgs>(Document_Uploaded);
    Document.Saved +=
        new EventHandler<DocumentEventArgs>(Document_Saved);
    Document.UploadProgressChanged +=
        new EventHandler<System.Net.UploadProgressChangedEventArgs>
        (Document_UploadProgressChanged);
    Service.Error += new EventHandler<ScribdEventArgs>(Service_Error);
}

Scribd.NET uploads files by name, so we need to save the document first. Here’s how you use DotImage to save it to a temporary TIFF file:

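// Save the document as a TIFF in a temporary location
// so that we can pass a path to the Scribd API
private string SaveDocumentAsTempTif()
{
    // Path.GetTempFileName() would also create an empty .tmp file that
    // is never used, so build a unique .tif path in the temp folder
    string tempName = Path.Combine(Path.GetTempPath(),
        Guid.NewGuid().ToString("N") + ".tif");
    documentViewer1.Save(tempName, new TiffEncoder());
    return tempName;
}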

Here’s how you upload (AccessTypes is a Scribd.NET type that you can use to specify if the document is public or private):

private void UploadFileToScribd(string filename, AccessTypes accessType)
{
    Scribd.Net.Document.UploadAsync(filename, accessType);
}

And, here’s how you handle the events:

// Called by Scribd API to report an error
void Service_Error(object sender, ScribdEventArgs e)
{
    MessageBox.Show(this, "Scribd Error: " + e.Message,
        "Error", MessageBoxButtons.OK, MessageBoxIcon.Error);
}

// Called by Scribd API to show progress
void Document_UploadProgressChanged(object sender,
                                    System.Net.UploadProgressChangedEventArgs e)
{
    tspbUploadProgress.Value = e.ProgressPercentage;
}

// Called by Scribd API when the document is uploaded
void Document_Uploaded(object sender, DocumentEventArgs e)
{
    if (e.Document != null)
    {
        // _title is a String that you set up before uploading
        e.Document.Title = _title;
        e.Document.Save();
    }
}

// Called by Scribd API when the document is saved
void Document_Saved(object sender, DocumentEventArgs e)
{
    // Here the document is uploaded and saved, so
    // you can update your UI to reflect this
}
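
The temporary file written before the upload is never deleted in the code above. A minimal sketch of a cleanup helper follows, assuming you keep the saved path in a field (called _tempPath here, a hypothetical name) and that Document_Saved is a reasonable place to call it.

```csharp
using System;
using System.IO;

static class TempFileCleanup
{
    // Tries to delete the file and returns true if it no longer exists.
    // Failures such as a file that is still locked are reported as false
    // instead of being thrown, so a UI handler can ignore or retry.
    public static bool TryDelete(string path)
    {
        if (string.IsNullOrEmpty(path))
        {
            return true; // nothing to clean up
        }
        try
        {
            // File.Delete does not throw when the file is already gone.
            File.Delete(path);
            return !File.Exists(path);
        }
        catch (IOException)
        {
            return false;
        }
        catch (UnauthorizedAccessException)
        {
            return false;
        }
    }
}
```

In Document_Saved, this could be called as TempFileCleanup.TryDelete(_tempPath) once the upload has finished.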

So you see how easy it is to scan and upload documents. The same basic structure could be used to upload documents into Amazon’s S3, the new Microsoft Azure SQL Data Services, Google Docs, any ECM that supports the emerging CMIS standard for documents, SharePoint, etc.

To get the full code and a build of this project, go to Atalasoft’s Scan Documents to Scribd Project page.

BONUS: Converting to a searchable PDF before uploading

TIFFs are fine for scans, but Scribd will not index them, so you will not be able to search for the documents later. Instead, we can OCR the document and create a PDF with the original page image on top and the recognized text beneath it. That way, we get a document that looks like a scan but can be found by indexers. This is called a searchable PDF, and it is easy to create with DotImage.

Since OCR is a time-consuming process, it’s best to do it in the background with a BackgroundWorker object. Here is how you do it:

Add a BackgroundWorker to the form and name it saveAsSearchablePdfBackground. Set its WorkerReportsProgress property to true.
Go to the events tab of the properties pane for this object and add handlers for the three events: DoWork, ProgressChanged, RunWorkerCompleted.

Here is the function for creating a searchable PDF with DotImage OCR (call it in the DoWork handler with your saved TIFF’s name and the name you want for the PDF):

// Create a searchable PDF (Image with text behind it)
private void CreateSearchablePdf(string tif, string pdf)
{
    using (TesseractEngine ocrEngine = new TesseractEngine())
    {
        ocrEngine.Initialize();
        ocrEngine.PreprocessingOptions.Deskew = false;
        try
        {
            ocrEngine.DocumentProgress += new
                OcrDocumentProgressEventHandler(
                ocrEngine_DocumentProgress);
            ocrEngine.Translators.Add(new PdfTranslator());
            ocrEngine.Translate(new FileSystemImageSource(
                new string[]{tif}, true), "application/pdf", pdf);
        }
        finally
        {
            ocrEngine.ShutDown();
        }
    }               
}

Here’s how you handle the progress event from the OCR Engine:

// Called by OCR engine (in background thread). 
// Need to call worker process ReportProgress so that the call
// to update the progress bar happens in the right thread.
void ocrEngine_DocumentProgress(object sender,
                                OcrDocumentProgressEventArgs e)
{
    if (e.ProgressIsValid)
    {
        saveAsSearchablePdfBackground.ReportProgress(e.Progress);
    }
}

And here are the handlers for the BackgroundWorker’s other events:

// Called indirectly by ReportProgress on the background worker
private void saveAsSearchablePdfBackground_ProgressChanged(
    object sender, ProgressChangedEventArgs e)
{
    tspbUploadProgress.Value = e.ProgressPercentage;
}

// Called when DoWork is complete
private void saveAsSearchablePdfBackground_RunWorkerCompleted(
    object sender, RunWorkerCompletedEventArgs e)
{
    tspbUploadProgress.Visible = false;
}
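
The DoWork handler itself is not listed above. The following is a rough, self-contained sketch of how the two file names might be passed to it and how the result might come back, using only the standard BackgroundWorker API. PdfJob and RunJob are hypothetical names, and the OCR call is replaced by a stand-in so the example runs without DotImage.

```csharp
using System;
using System.ComponentModel;
using System.Threading;

static class DoWorkSketch
{
    // Hypothetical argument type: the two paths CreateSearchablePdf takes.
    public sealed class PdfJob
    {
        public string TifPath;
        public string PdfPath;
    }

    // Runs one job on a BackgroundWorker and blocks until it completes.
    // A WinForms app would not block; it would react in RunWorkerCompleted.
    public static string RunJob(PdfJob job)
    {
        string result = null;
        using (ManualResetEvent done = new ManualResetEvent(false))
        using (BackgroundWorker worker = new BackgroundWorker())
        {
            worker.WorkerReportsProgress = true;

            worker.DoWork += delegate(object sender, DoWorkEventArgs e)
            {
                PdfJob current = (PdfJob)e.Argument;
                // In the form, this is where you would call
                // CreateSearchablePdf(current.TifPath, current.PdfPath).
                // The stand-in below only reports that it finished.
                ((BackgroundWorker)sender).ReportProgress(100);
                e.Result = current.PdfPath;
            };

            worker.RunWorkerCompleted +=
                delegate(object sender, RunWorkerCompletedEventArgs e)
            {
                // e.Error holds any exception thrown inside DoWork;
                // reading e.Result is only safe when it is null.
                if (e.Error == null)
                {
                    result = (string)e.Result;
                }
                done.Set();
            };

            worker.RunWorkerAsync(job);
            done.WaitOne();
        }
        return result;
    }
}
```

Passing the paths through RunWorkerAsync and e.Argument keeps the background code from reading form fields, and checking e.Error in the completion handler gives you a place to report an OCR failure before attempting the upload.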

Upload the PDF as before (inside of the RunWorkerCompleted handler if you want to do it automatically).

About Atalasoft

Atalasoft, Inc. provides ECM imaging technology to ISVs, Systems Integrators, and Enterprises with thousands of customers, and millions of end users worldwide. Specializing in zero-footprint, AJAX-enabled web image viewing, Atalasoft provides the tools to migrate enterprise solutions from the desktop to the web. For almost a decade, Atalasoft has produced imaging technology products including DotImage – the leading imaging toolkit for .NET developers, and Vizit SP – Document Viewing and Imaging for SharePoint.
