Extraction of text lines and text blocks on document images based on statistical modeling
7 March 1996
Proceedings Volume 2660, Document Recognition III; (1996) https://doi.org/10.1117/12.234699
Event: Electronic Imaging: Science and Technology, 1996, San Jose, CA, United States
Abstract
In this paper, we developed statistical models that characterize the text line and text block structures on document images using the bounding boxes of text words. We posed the extraction problem as finding the text lines and text blocks that maximize the Bayesian probability of the text lines and text blocks given the observed text word bounding boxes. We derived the so-called probabilistic linear displacement model (PLDM) to model text line structures from text word bounding boxes. We also developed an augmented PLDM to characterize text block structures from text line bounding boxes. By systematically gathering statistics from a large population of document images, we were able to validate our models experimentally and determine the proper model parameters. We designed and implemented an iterative algorithm that uses these probabilistic models to extract the text lines and text blocks. We report the quantitative performance of the algorithm in terms of the rates of missed, false, correct, split, merged, and spurious detections of text lines and text blocks.
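The Bayesian formulation mentioned in the abstract can be written out generically as follows. This is only a sketch of the standard maximum a posteriori criterion, not the paper's own notation: the symbols W (the set of observed text word bounding boxes), S (a candidate grouping into text lines and text blocks), and \hat{S} (the selected grouping) are assumptions introduced here for illustration, and the abstract does not give the form of the likelihood or prior terms.

```latex
% Assumed notation (not from the paper):
%   W       = observed text word bounding boxes
%   S       = candidate text line / text block structure
%   \hat{S} = structure selected by the extraction
\hat{S}
  = \arg\max_{S} P(S \mid W)
  = \arg\max_{S} \frac{P(W \mid S)\, P(S)}{P(W)}
  = \arg\max_{S} P(W \mid S)\, P(S)
```

The last step holds because P(W) does not depend on S. Under this reading, a model such as the PLDM would supply the probability terms for text lines, with the augmented version playing the same role for text blocks.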
© (1996) COPYRIGHT Society of Photo-Optical Instrumentation Engineers (SPIE). Downloading of the abstract is permitted for personal use only.
Su S. Chen, Robert M. Haralick, and Ihsin T. Phillips "Extraction of text lines and text blocks on document images based on statistical modeling", Proc. SPIE 2660, Document Recognition III, (7 March 1996); https://doi.org/10.1117/12.234699
PROCEEDINGS
12 PAGES

