Introduction
Got a stack of paper, a scanner, and an account on Scribd? Want to share those documents on your blog or with your co-workers? Then this project is for you.
With DotImage, Visual Studio, and a little bit of code, you can write a Scan-to-Scribd desktop application to put those documents online. Best of all, the upload code can be easily modified to handle any remote document repository that supports uploading documents as a service.
The first step is to build a basic scanning application with DotImage. If you would like to see videos detailing the exact steps for doing that, go to these links (the whole lesson series runs less than 23 minutes):
Video Tutorial: Lesson 1 - Basic Structure of a Capture Application
Video Tutorial: Lesson 2 – Implementing Save and AutoZoom
Video Tutorial: Lesson 3 – Getting a List of Devices and Scanning
Video Tutorial: Lesson 4 – Basic Cleanup of Scanned Documents
The basic steps are simple:
- Create a WinForms application (get a free evaluation download of DotImage).
- Drag a DotImage DocumentViewer to the form (set its Dock to fill).
- Drag a ToolStrip to the form.
- Add a ToolStripComboBox (to hold the list of scanners) to the ToolStrip, and name it tscbScanners.
- Add a ToolStripButton (to initiate scanning) to the ToolStrip, and name it tsbScan.
- Drag a DotImage Acquisition object to the Form (this is how you scan).
On Form Load, we want to fill in the list of installed scanners. To do so, call this function:
private void InitializeScannerList()
{
    tsbScan.Enabled = false;
    tscbScanners.Enabled = false;
    if (acquisition1.SystemHasTwain)
    {
        foreach (Device d in acquisition1.Devices)
        {
            string devName = d.Identity.ProductName;
            tscbScanners.Items.Add(devName);
            if (d == acquisition1.Devices.Default)
            {
                tscbScanners.SelectedItem = devName;
            }
        }
        if (tscbScanners.Items.Count > 0)
        {
            tsbScan.Enabled = true;
            tscbScanners.Enabled = true;
        }
    }
}
When the scan button is pressed, we can scan documents with this code (in the tsbScan Click event handler):
private void tsbScan_Click(object sender, EventArgs e)
{
    Device selectedDevice = GetSelectedDevice();
    if (selectedDevice != null)
    {
        selectedDevice.Acquire();
    }
}
private Device GetSelectedDevice()
{
    // SelectedItem is null when nothing has been chosen in the combo box.
    if (tscbScanners.SelectedItem == null)
    {
        return null;
    }
    string selectedName = tscbScanners.SelectedItem.ToString();
    foreach (Device d in acquisition1.Devices)
    {
        if (d.Identity.ProductName == selectedName)
        {
            return d;
        }
    }
    return null;
}
Every time an image is scanned, the Acquisition object’s ImageAcquired event will fire. Add a handler with this code to add the image to the document viewer:
private void acquisition1_ImageAcquired(object sender,
    AcquireEventArgs e)
{
    documentViewer1.Add(AtalaImage.FromBitmap(e.Image), "", "");
}
Once the document is scanned, we can upload to any service that accepts document uploads. Scribd (www.scribd.com) is a free document sharing website that has a web-service interface for uploading documents. There is an excellent open-source .NET library called Scribd.NET that makes interacting with the service relatively simple. You can get it here: http://www.codeplex.com/scribdnet.
Here is what you need to do to upload a Document to Scribd using Scribd.NET.
- Add a ToolStripProgressBar to your form (either in your ToolStrip or in a StatusStrip). Name it tspbUploadProgress.
- Initialize the library with your API Key (get an API Key here: http://www.scribd.com/platform/account):
private void InitializeScribd()
{
    Scribd.Net.Service.APIKey = _apiKey;
    Scribd.Net.Service.SecretKey = _secretKey;
    Scribd.Net.Service.EnforceSigning = true;
}
- Log the user in:
void LoginUser(string user, string password)
{
    User.LoggedIn += _loggedInHandler;
    User.LoginFailed += _loggedInHandler;
    User.Login(user, password);
}
- Handle the login events (called asynchronously):
void User_LoggedIn(object sender, UserEventArgs e)
{
    // The same handler serves both events, so detach it from both.
    User.LoggedIn -= _loggedInHandler;
    User.LoginFailed -= _loggedInHandler;
    _scribdInitialized = e.Success;
}
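- Declare _loggedInHandler and _scribdInitialized:
private EventHandler<UserEventArgs> _loggedInHandler;
private static bool _scribdInitialized = false;
- And initialize _loggedInHandler like this (do it before calling LoginUser, for example in the form's constructor; subscribing a null delegate does nothing, so the handler would never run):
_loggedInHandler = new EventHandler<UserEventArgs>(User_LoggedIn);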
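The pattern in the login steps (one stored delegate, attached to both the success and the failure event, then detached as soon as either fires) can be tried in isolation. The following is a minimal sketch under stated assumptions: FakeUser, LoginResultEventArgs, and LoginTracker are hypothetical names invented for this example, not Scribd.NET types, and unlike the real service FakeUser raises its events synchronously.

```csharp
using System;

// Hypothetical event args; stands in for a type like UserEventArgs.
public class LoginResultEventArgs : EventArgs
{
    public bool Success { get; }
    public LoginResultEventArgs(bool success) { Success = success; }
}

// Hypothetical stand-in for a service class with static login events.
public static class FakeUser
{
    public static event EventHandler<LoginResultEventArgs> LoggedIn;
    public static event EventHandler<LoginResultEventArgs> LoginFailed;

    // Pretends to log in and raises exactly one of the two events.
    public static void Login(string user, string password)
    {
        bool ok = password == "secret";
        EventHandler<LoginResultEventArgs> handler = ok ? LoggedIn : LoginFailed;
        handler?.Invoke(null, new LoginResultEventArgs(ok));
    }
}

public class LoginTracker
{
    // Stored once so that the exact same delegate can be detached later.
    private readonly EventHandler<LoginResultEventArgs> _handler;

    public bool Initialized { get; private set; }
    public int CallCount { get; private set; }

    public LoginTracker()
    {
        _handler = OnLoginResult;
    }

    public void Login(string user, string password)
    {
        FakeUser.LoggedIn += _handler;
        FakeUser.LoginFailed += _handler;
        FakeUser.Login(user, password);
    }

    private void OnLoginResult(object sender, LoginResultEventArgs e)
    {
        // Detach from both events so the next login attempt starts clean.
        FakeUser.LoggedIn -= _handler;
        FakeUser.LoginFailed -= _handler;
        CallCount++;
        Initialized = e.Success;
    }
}
```

Calling Login twice on one LoginTracker runs the handler once per attempt, because each attempt attaches the delegate again after the previous result detached it.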
Before you upload, you need to handle events that Scribd.NET raises once the file is uploaded and saved:
private void InitializeScribdEventHandlers()
{
    Document.Uploaded +=
        new EventHandler<DocumentEventArgs>(Document_Uploaded);
    Document.Saved +=
        new EventHandler<DocumentEventArgs>(Document_Saved);
    Document.UploadProgressChanged +=
        new EventHandler<System.Net.UploadProgressChangedEventArgs>
            (Document_UploadProgressChanged);
    Service.Error += new EventHandler<ScribdEventArgs>(Service_Error);
}
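Scribd.NET uploads files by name, so we save the document first. Here’s how you use DotImage to save it to a temporary TIFF:
private string SaveDocumentAsTempTif()
{
    // Path.GetTempFileName() creates an empty .tmp file on disk, so appending
    // ".tif" to its result would leave that file orphaned. Build a unique
    // name in the temp directory instead.
    string tempName = Path.Combine(Path.GetTempPath(),
        Path.GetRandomFileName() + ".tif");
    documentViewer1.Save(tempName, new TiffEncoder());
    return tempName;
}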
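The code in this article never deletes the temporary TIFF, so files will pile up in the temp directory unless you remove them yourself. Here is a minimal sketch of one way to handle that, assuming you delete the file once you know the upload has finished. TempFiles, NewTempPath, and TryDelete are hypothetical helper names made up for this example; only the System.IO calls are real.

```csharp
using System;
using System.IO;

// Hypothetical helper; not part of DotImage or Scribd.NET.
public static class TempFiles
{
    // Builds a unique path in the user's temp directory with the given
    // extension. No file is created on disk.
    public static string NewTempPath(string extension)
    {
        string name = Path.ChangeExtension(Path.GetRandomFileName(), extension);
        return Path.Combine(Path.GetTempPath(), name);
    }

    // Deletes the file if possible. Returns false instead of throwing when
    // the file is locked or access is denied, so a cleanup failure does not
    // interrupt the caller.
    public static bool TryDelete(string path)
    {
        try
        {
            File.Delete(path); // does not throw if the file is already gone
            return true;
        }
        catch (IOException)
        {
            return false;
        }
        catch (UnauthorizedAccessException)
        {
            return false;
        }
    }
}
```

A natural place to call TryDelete is wherever your application learns that the upload is done; which Scribd.NET event that corresponds to is something to confirm against the library itself.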
Here’s how you upload (AccessTypes is a Scribd.NET type that you can use to specify if the document is public or private):
private void UploadFileToScribd(string filename, AccessTypes accessType)
{
    Scribd.Net.Document.UploadAsync(filename, accessType);
}
And here’s how you handle the events:
void Service_Error(object sender, ScribdEventArgs e)
{
    MessageBox.Show(this, "Scribd Error: " + e.Message,
        "Error", MessageBoxButtons.OK, MessageBoxIcon.Error);
}
void Document_UploadProgressChanged(object sender,
    System.Net.UploadProgressChangedEventArgs e)
{
    tspbUploadProgress.Value = e.ProgressPercentage;
}
void Document_Uploaded(object sender, DocumentEventArgs e)
{
    if (e.Document != null)
    {
        // _title is a field on the form that holds the title to give the document.
        e.Document.Title = _title;
        e.Document.Save();
    }
}
void Document_Saved(object sender, DocumentEventArgs e)
{
    // Nothing to do in this sample.
}
So you see how easy it is to scan and upload documents. The same basic structure could be used to upload documents into Amazon’s S3, the new Microsoft Azure SQL Data Services, Google Docs, any ECM that supports the emerging CMIS standard for documents, SharePoint, etc.
To get the full code and a build of this project go to Atalasoft’s Scan Documents to Scribd Project page.
BONUS: Converting to a searchable PDF before uploading
TIFFs are fine for scans, but Scribd will not index them, so you will not be able to search for the documents later. Instead, we can OCR the document and create a PDF with the original page image on top and the text that it represents beneath. That way, we get a document that looks like a scan but can be found by indexers. This is called a searchable PDF, and it is easy to create with DotImage.
Since OCR is a time-consuming process, it’s best to do it in the background with a BackgroundWorker object. Here is how you do it:
- Add a BackgroundWorker to the form and name it saveAsSearchablePdfBackground. Set its WorkerReportsProgress property to true.
- Go to the events tab of the properties pane for this object and add handlers for the three events: DoWork, ProgressChanged, RunWorkerCompleted.
- Here is the function for creating a searchable PDF with DotImage OCR (call it in the DoWork handler with your saved TIFF’s name and the name you want for the PDF):
private void CreateSearchablePdf(string tif, string pdf)
{
    using (TesseractEngine ocrEngine = new TesseractEngine())
    {
        ocrEngine.Initialize();
        ocrEngine.PreprocessingOptions.Deskew = false;
        try
        {
            ocrEngine.DocumentProgress += new
                OcrDocumentProgressEventHandler(
                    ocrEngine_DocumentProgress);
            ocrEngine.Translators.Add(new PdfTranslator());
            ocrEngine.Translate(new FileSystemImageSource(
                new string[]{tif}, true), "application/pdf", pdf);
        }
        finally
        {
            ocrEngine.ShutDown();
        }
    }
}
- Here’s how you handle the progress event from the OCR Engine:
void ocrEngine_DocumentProgress(object sender,
    OcrDocumentProgressEventArgs e)
{
    if (e.ProgressIsValid)
    {
        saveAsSearchablePdfBackground.ReportProgress(e.Progress);
    }
}
- And here are the handlers for the BackgroundWorker’s other events:
private void saveAsSearchablePdfBackground_ProgressChanged(
    object sender, ProgressChangedEventArgs e)
{
    tspbUploadProgress.Value = e.ProgressPercentage;
}
private void saveAsSearchablePdfBackground_RunWorkerCompleted(
    object sender, RunWorkerCompletedEventArgs e)
{
    tspbUploadProgress.Visible = false;
}
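Both the OCR step and the upload step write to the same tspbUploadProgress bar. If you run them back to back, the bar will fill up twice. One possible way to present a single overall value is sketched below, assuming each phase reports a percentage from 0 to 100 (worth checking for the OCR engine's progress value). ProgressMath and Overall are hypothetical names made up for this example.

```csharp
using System;

// Hypothetical helper; not part of DotImage or Scribd.NET.
public static class ProgressMath
{
    // Maps a 0-100 percentage inside one phase onto an overall 0-100 scale
    // that is split evenly between phaseCount phases. Out-of-range phase
    // percentages are clamped so the result is always between 0 and 100.
    public static int Overall(int phaseIndex, int phaseCount, int phasePercent)
    {
        if (phaseCount <= 0)
        {
            throw new ArgumentOutOfRangeException(nameof(phaseCount));
        }
        if (phaseIndex < 0 || phaseIndex >= phaseCount)
        {
            throw new ArgumentOutOfRangeException(nameof(phaseIndex));
        }
        int clamped = Math.Max(0, Math.Min(100, phasePercent));
        return (phaseIndex * 100 + clamped) / phaseCount;
    }
}
```

With two phases, the ProgressChanged handler could assign ProgressMath.Overall(0, 2, e.ProgressPercentage) to the bar and the upload progress handler could assign ProgressMath.Overall(1, 2, e.ProgressPercentage), so OCR covers the first half of the bar and the upload covers the second.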
Upload the PDF as before (inside the RunWorkerCompleted handler if you want to do it automatically).
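To chain the conversion and the upload, the worker needs to receive the TIFF path and hand the PDF path back. BackgroundWorker supports this through the RunWorkerAsync argument and the Result property. The following is a minimal sketch of that hand-off, with the DotImage and Scribd.NET calls replaced by comments. ConvertThenUpload, UploadedFile, and Wait are hypothetical names made up for this example, and the wait handle exists only so the sketch can be run outside a form.

```csharp
using System;
using System.ComponentModel;
using System.IO;
using System.Threading;

// Hypothetical class showing the BackgroundWorker hand-off on its own.
public class ConvertThenUpload
{
    private readonly BackgroundWorker _worker = new BackgroundWorker();
    private readonly ManualResetEventSlim _done = new ManualResetEventSlim(false);

    // Records which file would have been uploaded.
    public string UploadedFile { get; private set; }

    public ConvertThenUpload()
    {
        _worker.WorkerReportsProgress = true;
        _worker.DoWork += Worker_DoWork;
        _worker.RunWorkerCompleted += Worker_RunWorkerCompleted;
    }

    // Starts the background conversion for the given TIFF path.
    public void Start(string tifPath)
    {
        _worker.RunWorkerAsync(tifPath);
    }

    // Only needed outside a form: blocks until the completed handler ran.
    public bool Wait(int milliseconds)
    {
        return _done.Wait(milliseconds);
    }

    private void Worker_DoWork(object sender, DoWorkEventArgs e)
    {
        string tif = (string)e.Argument;
        string pdf = Path.ChangeExtension(tif, ".pdf");
        // In the real application: CreateSearchablePdf(tif, pdf);
        e.Result = pdf; // handed to RunWorkerCompleted
    }

    private void Worker_RunWorkerCompleted(object sender,
        RunWorkerCompletedEventArgs e)
    {
        // Reading e.Result throws if the work failed or was cancelled,
        // so check those cases first.
        if (e.Error == null && !e.Cancelled)
        {
            // In the real application: UploadFileToScribd(pdfPath, accessType);
            UploadedFile = (string)e.Result;
        }
        _done.Set();
    }
}
```

In a WinForms application the completed handler runs on the UI thread, so it can touch controls such as the progress bar directly and the wait handle is unnecessary.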
About Atalasoft
Atalasoft, Inc. provides ECM imaging technology to ISVs, Systems Integrators, and Enterprises with thousands of customers, and millions of end users worldwide. Specializing in zero-footprint, AJAX-enabled web image viewing, Atalasoft provides the tools to migrate enterprise solutions from the desktop to the web. For almost a decade, Atalasoft has produced imaging technology products including DotImage – the leading imaging toolkit for .NET developers, and Vizit SP – Document Viewing and Imaging for SharePoint.
Lou Franco is the Director of Engineering at Atalasoft, provider of the leading .NET Imaging SDK (DotImage) and the Document Viewer for SharePoint (Vizit).
http://atalasoft.com/products/dotimage
http://vizitsp.com