Extracting text from an image

The process of extracting text from an image is called Optical Character Recognition (OCR). This is useful when the text that needs to be processed is embedded in an image. For example, the text on license plates, road signs, and printed directions can only be processed once it has been recognized.

We can perform OCR using Tess4j (http://tess4j.sourceforge.net/), a Java JNA wrapper for the Tesseract OCR API. We will demonstrate how to use the API with an image captured from the Wikipedia article on OCR (https://en.wikipedia.org/wiki/Optical_character_recognition#Applications). The Javadoc for the API can be found at http://tess4j.sourceforge.net/docs/docs-3.0/. The image we use is shown here:

[Image: excerpt from the Wikipedia article on OCR]

Using Tess4j to extract text

The ITesseract interface contains numerous OCR methods. The doOCR method takes a file and returns a string containing the words found in the file, as shown here:

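import java.io.File;
import net.sourceforge.tess4j.ITesseract;
import net.sourceforge.tess4j.Tesseract;
import net.sourceforge.tess4j.TesseractException;
import static java.lang.System.out;

ITesseract instance = new Tesseract();
try {
    // doOCR reads the image file and returns the recognized text
    String result = instance.doOCR(new File("OCRExample.png"));
    out.println(result);
} catch (TesseractException e) {
    // Handle exceptions
}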

Part of the output is shown next:

OCR engines nave been developed into many lunds oiobiectorlented OCR applicatlons, sucn as reoeipt OCR, involoe OCR, check OCR, legal billing document OCR
They can be used ior
- Data entry ior business documents, e g check, passport, involoe, bank statement and receipt
- Automatic number plate recognnlon

As you can see, the output contains numerous errors. Often the quality of an image needs to be improved before it can be processed correctly. Techniques for improving the quality of the output can be found at https://github.com/tesseract-ocr/tesseract/wiki/ImproveQuality. For example, we can use the setLanguage method to specify the language of the text being processed. Also, the doOCR method often works better on TIFF images.

In the next example, we used an enlarged portion of the previous image, as shown here:

[Image: enlarged portion of the previous image]

The output is much better, as shown here:

OCR engines have been developed into many kinds of object-oriented OCR applications, such as receipt OCR,
invoice OCR, check OCR, legal billing document OCR.
They can be used for:
. Data entry for business documents, e.g. check, passport, invoice, bank statement and receipt
. Automatic number plate recognition

These examples highlight the need for the careful cleaning of data.
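
The enlargement used for the second example can also be done in code before the image is passed to doOCR. The following is a minimal sketch that uses only the standard Java 2D and ImageIO classes. The class name OcrPreprocess, the method name enlargeToGray, the scale factor of 2.0, and the file names passed on the command line are illustrative assumptions and are not part of Tess4j. The grayscale conversion is an extra step added here for illustration; the example above only enlarged the image.

```java
import java.awt.Graphics2D;
import java.awt.RenderingHints;
import java.awt.image.BufferedImage;
import java.io.File;
import java.io.IOException;
import javax.imageio.ImageIO;

public class OcrPreprocess {

    // Returns a grayscale copy of the source image, enlarged by the given factor.
    // (Hypothetical helper: the name and the choice of grayscale are assumptions.)
    static BufferedImage enlargeToGray(BufferedImage source, double factor) {
        int width = Math.max(1, (int) Math.round(source.getWidth() * factor));
        int height = Math.max(1, (int) Math.round(source.getHeight() * factor));

        // TYPE_BYTE_GRAY makes Java 2D convert colors to gray while drawing
        BufferedImage scaled =
                new BufferedImage(width, height, BufferedImage.TYPE_BYTE_GRAY);
        Graphics2D g = scaled.createGraphics();
        // Bicubic interpolation usually gives smoother character edges when enlarging
        g.setRenderingHint(RenderingHints.KEY_INTERPOLATION,
                RenderingHints.VALUE_INTERPOLATION_BICUBIC);
        g.drawImage(source, 0, 0, width, height, null);
        g.dispose();
        return scaled;
    }

    public static void main(String[] args) throws IOException {
        if (args.length < 2) {
            System.out.println("Usage: java OcrPreprocess <input image> <output png>");
            return;
        }
        BufferedImage original = ImageIO.read(new File(args[0]));
        if (original == null) {
            System.out.println("Unsupported image format: " + args[0]);
            return;
        }
        // A factor of 2.0 is an arbitrary starting point, not a recommended value
        BufferedImage prepared = enlargeToGray(original, 2.0);
        ImageIO.write(prepared, "png", new File(args[1]));
    }
}
```

The file written by this sketch can then be passed to doOCR in place of the original image. Whether enlarging helps, and by how much, depends on the image, so the factor is worth experimenting with.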
