Click here to Skip to main content
15,881,248 members
Articles / Programming Languages / C#

Data Scraping from Image using Tesseract

Rate me:
Please Sign up or sign in to vote.
5.00/5 (7 votes)
31 Mar 2018Apache2 min read 22.8K   1.7K   12   1
Scrape data from image using Tesseract OCR engine

Introduction

Data Science is a growing field. According to CRISP DM model and other Data Mining models, we need to collect data before mining out knowledge and conduct predictive analysis. Data Collection can involve data scraping, which includes web scraping (HTML to Text), image to text and video to text conversion. When data is in text format, we usually use text mining techniques to mine out knowledge.

In this article, I am going to introduce you to Optical Character Recognition (OCR) to convert images to text. I developed Just Another Tesseract Interface (JATI) to convert images into text files, and consolidate them into a set of text data for text mining and natural language processing.

JATI interface with Tesseract OCR engine to convert image into text. I have included the source code. In this article, I am going to explain interfacing of the popular open source Tesseract OCR engine using C#.

Image 1

Selecting the Image Portion to Convert

To OCR the whole image, it is easy, but I want to select a portion of the image to OCR. This can improve the accuracy of the result also. Hence, in JATI, user can click on the picturebox image and drag to draw a rectangle to select the portion. The selected area will then be cropped. The following are the steps to accomplish this.

References:

  1. http://www.c-sharpcorner.com/UploadFile/hirendra_singh/how-to-make-image-editor-tool-in-C-Sharp-cropping-image/
  2. https://stackoverflow.com/questions/34551800/get-the-exact-size-of-the-zoomed-image-inside-the-picturebox

Include the System.Drawing library:

C#
using System.Drawing;

Mouse Down event for PictureBox1:

C#
void PictureBox1MouseDown(object sender, MouseEventArgs e)
        {
            try {
           
             if (e.Button == System.Windows.Forms.MouseButtons.Left)
             {
                 Cursor = Cursors.Cross;
                startX = e.X;
                startY = e.Y;
               
                selPen = new Pen(Color.Red, 1);
              }
             
             pictureBox1.Refresh();
            }
           
            catch(Exception ex) {
               
            }
        }

Mouse Move event for PictureBox1:

C#
void PictureBox1MouseMove(object sender, MouseEventArgs e)
        {
            try {
            if(e.Button == System.Windows.Forms.MouseButtons.Left) {
                pictureBox1.Refresh();   
                //Cursor = Cursors.Cross;
                curX = e.X;
                curY = e.Y;
               
                Rectangle rect = new Rectangle(startX, startY, curX - startX, curY - startY);
                pictureBox1.CreateGraphics().DrawRectangle(selPen, rect);               
            }
            }
           
            catch(Exception ex) {
               
            }
           
        }

Mouse Up event for PictureBox1:

C#
void PictureBox1MouseUp(object sender, MouseEventArgs e)
        {
            try {
            Cursor = Cursors.Arrow;
       
            Rectangle rect = new Rectangle(startX, startY, curX-startX, curY-startY);
          
            Bitmap OriginalImage = new Bitmap(pictureBox1.Image, pictureBox1.Width, pictureBox1.Height);
            Bitmap _img = new Bitmap(curX-startX, curY-startY);

            Graphics g = Graphics.FromImage(_img);

            g.InterpolationMode = System.Drawing.Drawing2D.InterpolationMode.HighQualityBicubic;
            g.PixelOffsetMode = System.Drawing.Drawing2D.PixelOffsetMode.HighQuality;
            g.CompositingQuality = System.Drawing.Drawing2D.CompositingQuality.HighQuality;

            g.DrawImage(OriginalImage, 0, 0, rect, GraphicsUnit.Pixel);
 
            pictureBox2.Image = _img;
            pictureBox2.SizeMode = PictureBoxSizeMode.Zoom;
            pictureBox2.Width = _img.Width;
            pictureBox2.Height = _img.Height;
              
            }
           
            catch(Exception ex) {
               
            }
        }

The above code crops the selected image portion and places it into picturebox2. Following is the detailed explanation.

Create a new rectangle object for the selection:

C#
Rectangle rect = new Rectangle(startX, startY, curX-startX, curY-startY);

Save the original image into a Bitmap object:

C#
Bitmap OriginalImage = new Bitmap(pictureBox1.Image, pictureBox1.Width, pictureBox1.Height);

Create a new Bitmap Object:

C#
Bitmap _img = new Bitmap(curX-startX, curY-startY);

Create a Graphics Object based on the new Bitmap Object:

C#
Graphics g = Graphics.FromImage(_img);

Settings of Graphics Object:

C#
g.InterpolationMode = System.Drawing.Drawing2D.InterpolationMode.HighQualityBicubic;
g.PixelOffsetMode = System.Drawing.Drawing2D.PixelOffsetMode.HighQuality;
g.CompositingQuality = System.Drawing.Drawing2D.CompositingQuality.HighQuality;

Cropped the image based on selection and put into pictureBox2:

C#
g.DrawImage(OriginalImage, 0, 0, rect, GraphicsUnit.Pixel);
pictureBox2.Image = _img;

To get the selected coordinates for the image, I use:

C#
string selCoordinates = "(" + startX.ToString() + "," + startY.ToString() + 
                        "," + curX.ToString() + "," + curY.ToString() + ")";

Image to Text Recognition using Tesseract

I use Tesseract OCR engine to convert images into text. To interface with Tesseract OCR engine, include System.Diagnostic library:

C#
using System.Diagnostics;

Save the cropped image selection from pictureBox2 into a temporary directory:

C#
pictureBox2.Image.Save(Directory.GetCurrentDirectory() + "/JATI/temp/temp" + ".png");

Set the input file and output file for Tesseract OCR engine:

C#
string input = Directory.GetCurrentDirectory() + "/JATI/temp/temp" + ".png";
string output = Directory.GetCurrentDirectory() + "/JATI/temp/temp" + ".txt";

Create the Process and put in the arguments:

C#
Process myProcess = Process.Start(Directory.GetCurrentDirectory() + 
"/JATI/tesseract.exe", "--tessdata-dir ./JATI/ " + input + " " + 
output.Replace(".txt", "") + " -l " + languageTextBox.Text + " -psm " + psmTextBox.Text);

Wait for the process to exit:

C#
myProcess.WaitForExit();

License

This article, along with any associated source code and files, is licensed under The Apache License, Version 2.0


Written By
Founder SVBook Pte. Ltd.
Singapore Singapore
Eric Goh is a data scientist, software engineer, adjunct faculty and entrepreneur with years of experiences in multiple industries. His varied career includes data science, data and text mining, natural language processing, machine learning, intelligent system development, and engineering product design. He founded SVBook Pte. Ltd. and extended it with DSTK.Tech and EMHAcademy.com. DSTK.Tech is where Eric develops his own DSTK data science softwares (public version). Eric also published “Learn R for Applied Statistics” at Apress, and published some books at LeanPub, Google Books, Amazon kindle, and SVBook Pte. Ltd. He teaches the content at EMHAcademy.com, Udemy, SkillShare, BitDegree, Simpliv, and developed 28 courses, 7 advanced certificates. Eric is also an adjunct faculty at Universities and Institutions.

Eric Goh has been leading his teams for various industrial projects, including the advanced product code classification system project which automates Singapore Custom’s trade facilitation process, and Nanyang Technological University's data science projects where he develop his own DSTK data science software (NTU version). He has years of experience in C#, Java, C/C++, SPSS Statistics and Modeller, SAS Enterprise Miner, R, Python, Excel, Excel VBA and etc. He won Tan Kah Kee Young Inventors' Merit Award 2007, and Shortlisted Entry for TelR Data Mining Challenge.

Eric holds a Masters of Technology degree from the National University of Singapore, an Executive MBA degree from U21Global (currently GlobalNxt) and IGNOU, a Graduate Diploma in Mechatronics from A*STAR SIMTech (a national research institute located in Nanyang Technological University), Coursera Specialization Certificate in Business Statistics and Analysis (Excel) from Rice University, IBM Data Science Professional Certificate (Python, SQL), and Coursera Verified Certificate in R Programming from Johns Hopkins University. He possessed a Bachelor of Science degree in Computing fr

Comments and Discussions

 
QuestionError---- out of memory Pin
Imran AS Shaikh31-May-18 23:27
Imran AS Shaikh31-May-18 23:27 

General General    News News    Suggestion Suggestion    Question Question    Bug Bug    Answer Answer    Joke Joke    Praise Praise    Rant Rant    Admin Admin   

Use Ctrl+Left/Right to switch messages, Ctrl+Up/Down to switch threads, Ctrl+Shift+Left/Right to switch pages.