Identifying paragraph blocks

The next step is to perform connected component analysis to find blocks that correspond to paragraphs. OpenCV has a function for this, which we previously used in Chapter 5, Automated Optical Inspection, Object Segmentation, and Detection. This is the findContours function:

vector<vector<Point>> contours;
findContours(dilated, contours, RETR_EXTERNAL, CHAIN_APPROX_SIMPLE);

In the first parameter, we pass our dilated image. The second parameter is the vector that receives the detected contours. With RETR_EXTERNAL, we retrieve only the external contours, and with CHAIN_APPROX_SIMPLE, we use the simple contour approximation. The image contours are presented as follows. Each tone of gray represents a different contour:

The last step is to identify the minimum rotated bounding rectangle of each contour. OpenCV provides a handy function for this operation called minAreaRect. This function receives a vector of arbitrary points and returns a RotatedRect containing the bounding box. This is also a good opportunity to discard unwanted rectangles, that is, rectangles that are obviously not text. Since we are making software for OCR, we'll assume that each block of text contains a group of letters. With this assumption, we'll discard rectangles in the following situations:

The rectangle width or height is too small, that is, smaller than 20 pixels. This will help discard border noise and other small artifacts.
The rectangle has a width/height proportion smaller than two. That is, rectangles that resemble a square, such as image icons, or that are much taller than they are wide, will also be discarded.

There's a little caveat in the second condition. Since we are dealing with rotated bounding boxes, we must test whether the bounding box angle is smaller than -45 degrees. If it is, the text is vertically rotated, so the proportion that we must take into account is height/width.

Let's check this out by looking at the following code:

vector<RotatedRect> areas;

//For each contour
for (const auto& contour : contours)
{
   //Find its rotated rect
   auto box = minAreaRect(contour);

   //Discard very small boxes
   if (box.size.width < 20 || box.size.height < 20)
         continue;

   //Discard square-shaped boxes and boxes
   //that are taller than they are wide
   double proportion = box.angle < -45.0 ?
         box.size.height / box.size.width :
         box.size.width / box.size.height;

   if (proportion < 2)
         continue;

   //Add the box
   areas.push_back(box);
}

Let's see which boxes this algorithm selected:

That's certainly a good result!

Notice that the second condition in the preceding code will also discard single letters. This is not a big issue since we are creating an OCR preprocessor, and single symbols are usually meaningless without context information. One example of such a case is page numbers. Page numbers will be discarded by this process since they usually appear alone at the bottom of the page, so their size and proportion will not pass the tests above. But this will not be a problem, since after the text passes through the OCR, you will end up with a large number of text files with no page divisions at all.
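To make the effect of the two discard rules concrete, here is a minimal sketch that applies them to a few hand-picked boxes without linking against OpenCV. The Box struct and the looksLikeText and keepTextBoxes names are hypothetical and exist only for this illustration; Box simply mirrors the width, height, and angle fields that the filter reads from a RotatedRect. The sketch also assumes the angle convention the text relies on, where a value below -45 degrees means the text is vertically rotated.

```cpp
#include <vector>

// Hypothetical stand-in for the RotatedRect fields used by the filter.
struct Box {
    float width;
    float height;
    float angle; // degrees; below -45 is treated as vertically rotated text
};

// Returns true when the box passes both rules from the text:
// each side is at least minSide pixels, and the long/short proportion
// (measured along the text direction) is at least minProportion.
bool looksLikeText(const Box& box, float minSide = 20.0f,
                   double minProportion = 2.0) {
    // Rule 1: discard very small boxes
    if (box.width < minSide || box.height < minSide)
        return false;

    // Rule 2: discard square-shaped boxes and boxes taller than wide,
    // swapping the sides when the box is rotated past -45 degrees
    const double proportion = box.angle < -45.0f
        ? static_cast<double>(box.height) / box.width
        : static_cast<double>(box.width) / box.height;

    return proportion >= minProportion;
}

// Keeps only the boxes that pass looksLikeText.
std::vector<Box> keepTextBoxes(const std::vector<Box>& boxes) {
    std::vector<Box> kept;
    for (const auto& box : boxes) {
        if (looksLikeText(box))
            kept.push_back(box);
    }
    return kept;
}
```

Under these assumptions, a 400 x 80 box at angle 0 is kept (proportion 5), while a 24 x 30 box, roughly the footprint of a single letter or a page number, is rejected because its proportion is 0.8. An 80 x 400 box at -80 degrees is kept, since the sides are swapped before the proportion is computed.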

We'll place the contour detection and box filtering code in a function with the following signature:

vector<RotatedRect> findTextAreas(Mat input)
