Creating an OCR function

We'll change the previous example to work with Tesseract. Start by adding tesseract/baseapi.h and fstream to the include list:

#include <opencv2/opencv.hpp>
#include <tesseract/baseapi.h>

#include <vector>
#include <fstream>

Then, we'll create a global TessBaseAPI object that represents our Tesseract OCR engine:

tesseract::TessBaseAPI ocr;

The ocr engine is completely self-contained. If you want to create multi-threaded OCR software, create a separate TessBaseAPI object in each thread, and the execution will be fairly thread-safe. You only need to make sure that two threads never write to the same file; otherwise, you must synchronize that operation yourself.

Next, we will create a function called identifyText that runs the OCR:

const char* identifyText(Mat input, const char* language = "eng")
{
   // (Re)initialize the engine for the requested language
   ocr.Init(NULL, language, tesseract::OEM_TESSERACT_ONLY);
   ocr.SetPageSegMode(tesseract::PSM_SINGLE_BLOCK);
   ocr.SetImage(input.data, input.cols, input.rows, 1, input.step);

   // Run the recognition and print some debug information
   const char* text = ocr.GetUTF8Text();
   cout << "Text:" << endl;
   cout << text << endl;
   cout << "Confidence: " << ocr.MeanTextConf() << endl;

   // Return the recognized text (the caller must release it with delete[])
   return text;
}

Let's explain this function line by line. In the first line, we initialize Tesseract by calling the Init function, which has the following signature:

int Init(const char* datapath, const char* language, 
OcrEngineMode oem)

Let's explain each parameter:

  • datapath: This is the path to the root directory of the tessdata files. The path must end with a forward slash (/) character. The tessdata directory contains the language files that you installed. Passing NULL to this parameter makes Tesseract search its installation directory, which is where this folder is normally located. It's common to change this value to args[0] when deploying an application, and include the tessdata folder in your application path.
  • language: This is a three-letter language code (for example, eng for English, por for Portuguese, or hin for Hindi). Tesseract supports loading multiple language codes by using the + sign. Therefore, passing eng+por will load both the English and Portuguese languages. Of course, you can only use languages you have previously installed; otherwise, the loading process will fail. A language config file may specify that two or more languages must be loaded together. To prevent that, you may use a tilde (~). For example, you can use hin+~eng to guarantee that English is not loaded with Hindi, even if it is configured to do so.
  • oem (OcrEngineMode): This selects the OCR algorithms that will be used. It can have one of the following values:
    • OEM_TESSERACT_ONLY: Uses just Tesseract. It's the fastest method, but it is also the least precise.
    • OEM_CUBE_ONLY: Uses the Cube engine. It's slower, but more precise. This will only work if your language was trained to support this engine mode. To check if that's the case, look for .cube files for your language in the tessdata folder. The support for English language is guaranteed.
    • OEM_TESSERACT_CUBE_COMBINED: This combines both Tesseract and Cube to achieve the best possible OCR classification. This engine has the best accuracy and the slowest execution time.
    • OEM_DEFAULT: This infers the strategy based on the language config file, command-line config file or, in the absence of both, uses OEM_TESSERACT_ONLY.
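The two string parameters follow simple formatting rules (a trailing / on the data path, language codes joined by +), so they are easy to build programmatically. The following is a minimal sketch using only the standard library; normalizeDataPath and joinLanguages are hypothetical helper names introduced here for illustration and are not part of Tesseract.

```cpp
#include <string>
#include <vector>

// Hypothetical helper: ensure a non-empty tessdata path ends with '/'.
// An empty string is returned unchanged, so the caller can still decide
// to pass NULL to Init in that case.
std::string normalizeDataPath(std::string path)
{
    if (!path.empty() && path.back() != '/')
        path += '/';
    return path;
}

// Hypothetical helper: join language codes with '+',
// for example {"eng", "por"} becomes "eng+por".
std::string joinLanguages(const std::vector<std::string>& codes)
{
    std::string result;
    for (const std::string& code : codes) {
        if (!result.empty())
            result += '+';
        result += code;
    }
    return result;
}
```

The resulting strings can then be handed to Init through c_str(). A code prefixed with a tilde, such as ~eng, is passed through unchanged.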

It's important to emphasize that the Init function can be executed many times. If a different language or engine mode is provided, Tesseract will clear the previous configuration and start again. If the same parameters are provided, Tesseract is smart enough to simply ignore the command. The Init function returns 0 in case of success and -1 in case of failure.
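Because Init reports failure only through its return value, it is easy to overlook a missing language file. One option is to turn a non-zero result into an exception. This is a minimal sketch, not the book's method: initOrThrow is a hypothetical helper, and it is written as a template only so that it compiles without the Tesseract headers. With the real library, Engine would be tesseract::TessBaseAPI and Mode would be tesseract::OcrEngineMode.

```cpp
#include <stdexcept>
#include <string>

// Hypothetical helper: call engine.Init(datapath, language, mode) and
// throw std::runtime_error if the call does not return 0 (success).
template <typename Engine, typename Mode>
void initOrThrow(Engine& engine, const char* datapath,
                 const char* language, Mode mode)
{
    if (engine.Init(datapath, language, mode) != 0) {
        // Guard against a null language pointer when building the message
        const std::string name = language ? language : "(null)";
        throw std::runtime_error(
            "Could not initialize the OCR engine for language: " + name);
    }
}
```

With the global engine from this section, the call would look like initOrThrow(ocr, NULL, language, tesseract::OEM_TESSERACT_ONLY).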

Our program will then proceed by setting the page segmentation mode:

ocr.SetPageSegMode(tesseract::PSM_SINGLE_BLOCK); 

There are several segmentation modes available:

  • PSM_OSD_ONLY: Using this mode, Tesseract will just run its preprocessing algorithms to detect orientation and script.
  • PSM_AUTO_OSD: This tells Tesseract to do automatic page segmentation with orientation and script detection.
  • PSM_AUTO_ONLY: This does page segmentation, but avoids doing orientation, script detection, or OCR.
  • PSM_AUTO: This does page segmentation and OCR, but avoids doing orientation or script detection.
  • PSM_SINGLE_COLUMN: This assumes a single column of text of variable sizes.
  • PSM_SINGLE_BLOCK_VERT_TEXT: This treats the image as a single uniform block of vertically aligned text.
  • PSM_SINGLE_BLOCK: This assumes a single block of text, and is the default configuration. We will use this flag since our preprocessing phase guarantees this condition.
  • PSM_SINGLE_LINE: Indicates that the image contains only one line of text.
  • PSM_SINGLE_WORD: Indicates that the image contains just one word.
  • PSM_CIRCLE_WORD: Indicates that the image contains just one word arranged in a circle.
  • PSM_SINGLE_CHAR: Indicates that the image contains a single character.

Notice that Tesseract already implements deskewing and text segmentation algorithms, as most OCR libraries do. It is still useful to understand these algorithms, since you can supply your own preprocessing phase for specific needs, which improves text detection in many cases. For example, if you are creating an OCR application for old documents, the default threshold used by Tesseract may produce a dark background. Tesseract may also be confused by borders or severe text skewing.

Next, we call the SetImage method with the following signature:

void SetImage(const unsigned char* imagedata, int width, 
int height, int bytes_per_pixel, int bytes_per_line);

The parameters are almost self-explanatory, and most of them can be read directly from our Mat object:

  • imagedata: A raw byte array containing image data. The Mat class in OpenCV exposes a public member called data, which is a direct pointer to the pixel data.
  • width: Image width.
  • height: Image height.
  • bytes_per_pixel: Number of bytes per pixel. We are using 1, since we are dealing with a binary image. If you want the code to be more generic, you could also use the Mat::elemSize() function, which provides the same information.
  • bytes_per_line: Number of bytes in a single line. We are using the Mat::step property since some images add trailing bytes.
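To see why bytes_per_line is passed separately from the width, consider a buffer whose rows are padded. The sketch below uses only the standard library; paddedRowBytes and pixelOffset are hypothetical helpers, and the 4-byte alignment in the usage note is an illustrative assumption, not a statement about how OpenCV allocates memory. In OpenCV, a similar situation arises when a Mat is a view into a larger image, so its step is larger than cols multiplied by the element size.

```cpp
#include <cstddef>

// Hypothetical helper: bytes occupied by one row when every row is
// padded up to a multiple of `alignment` bytes.
std::size_t paddedRowBytes(std::size_t width, std::size_t bytesPerPixel,
                           std::size_t alignment)
{
    const std::size_t raw = width * bytesPerPixel;
    return (raw + alignment - 1) / alignment * alignment;
}

// Hypothetical helper: byte offset of pixel (x, y) in a buffer whose
// rows are bytesPerLine bytes apart.
std::size_t pixelOffset(std::size_t x, std::size_t y,
                        std::size_t bytesPerPixel, std::size_t bytesPerLine)
{
    return y * bytesPerLine + x * bytesPerPixel;
}
```

For a 10-pixel-wide, 1-byte-per-pixel image with rows padded to 4 bytes, each row takes 12 bytes, so pixel (3, 2) sits at offset 2 * 12 + 3 rather than 2 * 10 + 3. Passing the width instead of the real stride would make every row after the first start at the wrong position.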

Then, we call GetUTF8Text to run the recognition itself. The recognized text is returned, encoded as UTF-8 and without a BOM. Before returning it, we also print some debug information.

MeanTextConf returns a confidence index, which may be a number from 0 to 100:

   auto text = ocr.GetUTF8Text();
   cout << "Text:" << endl;
   cout << text << endl;
   cout << "Confidence: " << ocr.MeanTextConf() << endl;
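
Once the text and its confidence are available, an application will often want to tidy the string and decide whether to trust it. The following is a minimal sketch using only the standard library. OcrResult, trimTrailingWhitespace, and isReliable are hypothetical names, and the default threshold of 70 is an arbitrary assumption that should be tuned for your own images.

```cpp
#include <cstddef>
#include <string>

// Hypothetical result type: recognized text plus its mean confidence (0 to 100).
struct OcrResult {
    std::string text;
    int confidence;
};

// Remove trailing spaces, tabs, and line breaks from the recognized text.
std::string trimTrailingWhitespace(std::string text)
{
    const std::size_t last = text.find_last_not_of(" \t\r\n");
    if (last == std::string::npos)
        return std::string();  // the text was empty or all whitespace
    text.erase(last + 1);
    return text;
}

// Accept a result only if it has text and its confidence reaches the
// threshold. The default of 70 is an illustrative assumption.
bool isReliable(const OcrResult& result, int minConfidence = 70)
{
    return !result.text.empty() && result.confidence >= minConfidence;
}
```

Copying the buffer returned by GetUTF8Text into a std::string, as OcrResult does, also lets you release the original buffer with delete[] right away.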