Proceedings Article | 29 January 2007
Proc. SPIE. 6500, Document Recognition and Retrieval XIV
KEYWORDS: Imaging systems, Matrices, Image processing, Computing systems, Machine learning, Associative arrays, Analytical research, Optical character recognition, Statistical modeling, Systems modeling
Although modern OCR technology is capable of handling a wide variety of document images, there is no single
OCR engine that performs equally well on all documents in a given script. Naturally, each OCR
engine has its strengths and weaknesses, and therefore different engines tend to differ in accuracy across
documents and in the errors they make on the same document image. While the idea of using multiple OCR engines
to boost output accuracy is not new, most existing systems do not go beyond variations on majority
voting. Although this approach works well in many cases, it has limitations, especially when the OCR technology
used to process a given script has not yet fully matured. Our goal is to develop a system called MEMOE (for
"Multi-Evidence Multi-OCR-Engine") that combines, in an optimal or near-optimal way, output streams of
one or more OCR engines together with various types of evidence extracted from these streams as well as from
original document images, to produce output of higher quality than that of the individual OCR engines, or of
majority voting applied to multiple OCR output streams. Furthermore, we aim to improve the accuracy of OCR
output on images whose accuracy would otherwise be low enough to significantly impair downstream processing.
The MEMOE system functions as an OCR engine taking document images and some configuration parameters
as input and producing a single output text stream. In this paper, we describe the design of the system, the
evidence types it uses, and how they are incorporated into MEMOE in the form of filters. Initial tests on
two corpora of Arabic documents show that, even in its initial configuration, the system is superior to a
voting algorithm and that even more improvement may be achieved by incorporating additional evidence types
into the system.
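
For readers unfamiliar with the majority-voting baseline that the abstract compares against, the following is a minimal sketch of token-level voting. It is not the MEMOE method or the authors' voting implementation. It assumes the engine outputs have already been aligned into equal-length token lists, and the function names (majority_vote, vote_streams) and the tie-breaking rule are hypothetical choices made for illustration.

```python
from collections import Counter


def majority_vote(candidates):
    """Return the most frequent candidate at one aligned position.

    Assumption for this sketch: ties are broken in favour of the
    engine listed first, since plain voting has no other evidence.
    """
    counts = Counter(candidates)
    best = max(counts.values())
    for token in candidates:
        if counts[token] == best:
            return token


def vote_streams(streams):
    """Combine pre-aligned OCR token streams by per-position voting.

    `streams` is a list of token lists, one per engine. Alignment is
    assumed to have been done beforehand; this sketch does not do it.
    """
    if len({len(s) for s in streams}) != 1:
        raise ValueError("streams must be non-empty and of equal length")
    return [majority_vote(list(column)) for column in zip(*streams)]


if __name__ == "__main__":
    engines = [
        ["the", "cat", "sat"],
        ["the", "cot", "sat"],
        ["tho", "cat", "sat"],
    ]
    print(vote_streams(engines))
```

With only two engines, or when every engine makes a different error at the same position, the vote reduces to the arbitrary tie-break. This is one way to see the limitation noted above and the motivation for bringing in additional evidence.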