When using certain PSMs with certain inputs, the PageIterator::Baseline function produces incorrect results because of a bug in how it obtains line bounding boxes. I noticed this when using psm 8 (single word). This affects API users trying to get a line's baseline, and also causes incorrect results in CLI output formats that report the baseline (.hocr).
While this is most noticeable when calling it->Baseline through the API, the phenomenon can be demonstrated using the CLI with the example image below.
The word in the image is recognized correctly, with the same bounding box, whether psm is set to 6 (single block) or 8 (single word). However, the latter does not calculate the baseline correctly.
When setting psm to 6, the baseline attribute is set to -0.036 0, which is correct.
tesseract simple_c2.png stdout --oem 0 --psm 6 hocr
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
 <head>
  <title></title>
  <meta http-equiv="Content-Type" content="text/html;charset=utf-8"/>
  <meta name='ocr-system' content='tesseract 5.1.0-471-gbc490' />
  <meta name='ocr-capabilities' content='ocr_page ocr_carea ocr_par ocr_line ocrx_word ocrp_wconf'/>
 </head>
 <body>
  <div class='ocr_page' id='page_1' title='image "simple_c2.png"; bbox 0 0 328 194; ppageno 0; scan_res 96 96'>
   <div class='ocr_carea' id='block_1_1' title="bbox 85 84 195 104">
    <p class='ocr_par' id='par_1_1' lang='eng' title="bbox 85 84 195 104">
     <span class='ocr_line' id='line_1_1' title="bbox 85 84 195 104; baseline -0.036 0; x_size 24.310345; x_descenders 5.3103447; x_ascenders 5">
      <span class='ocrx_word' id='word_1_1' title='bbox 85 84 195 104; x_wconf 83'>Tesseract</span>
     </span>
    </p>
   </div>
  </div>
 </body>
</html>
However, when setting psm to 8, the baseline attribute is set to -0 -2.005, which is incorrect.
tesseract simple_c2.png stdout --oem 0 --psm 8 hocr
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
 <head>
  <title></title>
  <meta http-equiv="Content-Type" content="text/html;charset=utf-8"/>
  <meta name='ocr-system' content='tesseract 5.1.0-471-gbc490' />
  <meta name='ocr-capabilities' content='ocr_page ocr_carea ocr_par ocr_line ocrx_word ocrp_wconf'/>
 </head>
 <body>
  <div class='ocr_page' id='page_1' title='image "simple_c2.png"; bbox 0 0 328 194; ppageno 0; scan_res 96 96'>
   <div class='ocr_carea' id='block_1_1' title="bbox 85 84 195 104">
    <p class='ocr_par' id='par_1_1' lang='eng' title="bbox 85 84 195 104">
     <span class='ocr_line' id='line_1_1' title="bbox 85 84 195 104; baseline -0 -2.005; x_size 24.310345; x_descenders 5.3103447; x_ascenders 5">
      <span class='ocrx_word' id='word_1_1' title='bbox 85 84 195 104; x_wconf 83'>Tesseract</span>
     </span>
    </p>
   </div>
  </div>
 </body>
</html>
Cause
I investigated, and the root cause is that the PageIterator::Baseline function assumes that the line's bounding box has already been calculated. However, this is not always the case. PageIterator::Baseline gets the line's bounding box using row->bounding_box(), which does not force these values to be calculated. It simply returns the default values (-32767 or 32767) if they have not been calculated already.
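To make the failure mode concrete, here is a minimal, self-contained sketch of a cached bounding box that starts out with inverted sentinel values. MiniBox, MiniRow, recompute, and make_stale_row are hypothetical names invented for this illustration. They are not Tesseract's TBOX or ROW types, and the sketch only assumes the default values match the ones reported above.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for a bounding box; NOT Tesseract's TBOX.
struct MiniBox {
  // Inverted sentinel defaults, chosen to match the values reported above.
  int16_t left = INT16_MAX;
  int16_t bottom = INT16_MAX;
  int16_t right = -INT16_MAX;
  int16_t top = -INT16_MAX;

  // True while the box has never been grown to cover anything.
  bool is_null() const { return left >= right || top <= bottom; }

  // Grow this box so that it also covers `other`.
  void merge(const MiniBox &other) {
    left = std::min(left, other.left);
    bottom = std::min(bottom, other.bottom);
    right = std::max(right, other.right);
    top = std::max(top, other.top);
  }
};

// Hypothetical row whose cached box is only valid after recompute().
struct MiniRow {
  std::vector<MiniBox> words;
  MiniBox cached;

  // Returns the cached box as-is; it does not trigger a recalculation.
  const MiniBox &bounding_box() const { return cached; }

  // Rebuilds the cached box as the union of all word boxes.
  void recompute() {
    cached = MiniBox{};
    for (const MiniBox &word : words) {
      cached.merge(word);
    }
  }
};

// Builds a row holding one word box but never recomputes the cached box.
MiniRow make_stale_row() {
  MiniRow row;
  row.words.push_back(MiniBox{85, 84, 195, 104});
  return row;
}
```

In this sketch, make_stale_row().bounding_box() still reports the sentinel values even though the row contains a word, and only a call to recompute() yields a usable box. That mirrors the assumption PageIterator::Baseline makes about the row it is given.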
The uncalculated bounding box can be confirmed by adding tprintf statements within the PageIterator::Baseline function:
bool PageIterator::Baseline(PageIteratorLevel level, int *x1, int *y1, int *x2, int *y2) const {
  if (it_->word() == nullptr) {
    return false; // Already at the end!
  }
  ROW *row = it_->row()->row;
  WERD *word = it_->word()->word;
  TBOX box = (level == RIL_WORD || level == RIL_SYMBOL) ? word->bounding_box() : row->bounding_box();
  tprintf("Box: %d,%d -> %d,%d\n", box.left(), box.bottom(), box.right(), box.top());
  int left = box.left();
  ICOORD startpt(left, static_cast<int16_t>(row->base_line(left) + 0.5));
  int right = box.right();
  ICOORD endpt(right, static_cast<int16_t>(row->base_line(right) + 0.5));
  // Rotate to image coordinates and convert to global image coords.
  startpt.rotate(it_->block()->block->re_rotation());
  endpt.rotate(it_->block()->block->re_rotation());
  *x1 = startpt.x() / scale_ + rect_left_;
  *y1 = (rect_height_ - startpt.y()) / scale_ + rect_top_;
  *x2 = endpt.x() / scale_ + rect_left_;
  *y2 = (rect_height_ - endpt.y()) / scale_ + rect_top_;
  tprintf("Baseline: (%d,%d)->(%d,%d)\n", *x1, *y1, *x2, *y2);
  return true;
}
When run with psm set to 8, this produces the following:
Box: 32767,32767 -> -32767,-32767
Baseline: (32767,-990)->(-32767,1342)
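The endpoints land far outside the 328x194 image because the baseline is evaluated at the sentinel x coordinates rather than at the real left and right edges of the line. The sketch below reproduces the shape of that computation under simplifying assumptions: a straight-line baseline, a scale of 1, no block rotation, and zero rectangle offsets. The Line and Endpoints types, the baseline_endpoints function, and the slope and intercept used in the usage note are all hypothetical illustration values, not numbers taken from Tesseract.

```cpp
#include <cstdint>

// Hypothetical straight-line baseline y = m * x + c in bottom-up coordinates.
struct Line {
  double m;
  double c;
  double at(int x) const { return m * x + c; }
};

// Baseline endpoints in top-down image coordinates.
struct Endpoints {
  int x1;
  int y1;
  int x2;
  int y2;
};

// Evaluates the baseline at the left and right edges of a box, rounds the
// way the function above does, and flips y into image coordinates.
// Assumes scale 1, no rotation, and zero rectangle offsets.
Endpoints baseline_endpoints(const Line &line, int left, int right, int image_height) {
  int y_left = static_cast<int16_t>(line.at(left) + 0.5);
  int y_right = static_cast<int16_t>(line.at(right) + 0.5);
  return Endpoints{left, image_height - y_left, right, image_height - y_right};
}
```

With an illustrative Line{0.036, 86.0} and an image height of 194, the real box edges 85 and 195 give y values inside the image, while the sentinel edges 32767 and -32767 give a start point above the image and an end point below it, with x1 greater than x2. Any slope derived from such endpoints no longer describes the text line.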
Potential Fixes
I think there are three potential approaches for fixing this:
1. Add a check in PageIterator::Baseline for whether the default value is being returned, and if it is, calculate the actual bounding box.
2. Edit the row->bounding_box() function to calculate the bounding box if it has never been calculated before.
3. Investigate why the bounding boxes are not calculated for certain psm settings, and edit so they are calculated upon creation.
Ubuntu 22.04 Jammy
tesseract 5.1.0-471-gbc490
leptonica-1.82.0
libgif 5.1.9 : libjpeg 8d (libjpeg-turbo 2.1.1) : libpng 1.6.37 : libtiff 4.3.0 : zlib 1.2.11 : libwebp 1.2.2 : libopenjp2 2.4.0
Found AVX2
Found AVX
Found FMA
Found SSE4.1