Cross-validation comparison of NIST OCR databases

Patrick J. Grother

doi:10.1117/12.143632

14 April 1993

Proceedings Volume 1906, Character Recognition Technologies; (1993) https://doi.org/10.1117/12.143632
Event: IS&T/SPIE's Symposium on Electronic Imaging: Science and Technology, 1993, San Jose, CA, United States

Abstract

The quality of reference databases for optical character recognition is vital to the meaningful assessment of classification algorithms. NIST has produced two databases of segmented handprinted characters obtained from socially distinct writer populations. Two approaches to the comparison of the databases are described. The first uses the eigenvalue spectrum of the covariance matrix as an a priori measure of the variance intrinsic to the data. The second cross-validates the datasets using classification error to quantify the difficulty of OCR. The eigenvalue spectra from the training partitions of the datasets are generated during the production of the Karhunen-Loeve (KL) transforms, the leading components of which are used as prototype features for a classifier. The eigenspectra are used to quantify the diversity of the character sets, and the Bhattacharyya distance is used to measure class separability. The digits, uppers and lowers from the two populations of 500 writers are partitioned into N disjoint sets. The KL transforms of each such set are used for testing, while the remaining N-1 sets form the training prototypes for a PNN nearest neighbor classifier. Recognition error rates and their variances are calculated over the N partitions for both databases independently. This quantifies intra-database diversity. The inter-database results, or 'cross' terms, obtained by training and testing on different databases, indicate the generality of the training set. The results for digits suggest that the second NIST database (used nominally for testing) is significantly harder than the first (training) set; the testing images are 11% more variant. The NIST training data classifies partitions of itself with 1.7% error, and the test set with 6.8% error. Conversely, the test set generalizes to both itself and the training data with 3.5% error. This effect has also been reported using non-NIST classifiers.
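The procedure summarized in the abstract can be illustrated with a minimal sketch. It is not the author's implementation, and it rests on several assumptions: a plain 1-nearest-neighbor rule stands in for the PNN classifier, the KL basis is recomputed from the training folds in each round, and the data are synthetic Gaussian clusters rather than NIST character images. All function names and parameters (`kl_basis`, `cross_validate`, `cross_database_error`, `make_synthetic`, `n_components`) are hypothetical.

```python
import numpy as np

def kl_basis(train, n_components):
    """Mean, descending eigenvalue spectrum, and leading eigenvectors
    of the training covariance matrix (the KL transform basis)."""
    mean = train.mean(axis=0)
    cov = np.cov(train - mean, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)      # eigh returns ascending order
    order = np.argsort(eigvals)[::-1]           # reorder to descending
    return mean, eigvals[order], eigvecs[:, order[:n_components]]

def nn_classify(train_feats, train_labels, test_feats):
    """1-nearest-neighbor rule (assumption: stand-in for the PNN classifier)."""
    diffs = test_feats[:, None, :] - train_feats[None, :, :]
    d2 = (diffs ** 2).sum(axis=2)               # squared Euclidean distances
    return train_labels[d2.argmin(axis=1)]

def cross_validate(images, labels, n_folds=5, n_components=8, seed=0):
    """Intra-database term: mean and variance of error over N disjoint folds."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(images)), n_folds)
    errors = []
    for k in range(n_folds):
        test_idx = folds[k]
        train_idx = np.concatenate([folds[j] for j in range(n_folds) if j != k])
        # Assumption: the basis is derived from the N-1 training folds only.
        mean, _, basis = kl_basis(images[train_idx], n_components)
        train_feats = (images[train_idx] - mean) @ basis
        test_feats = (images[test_idx] - mean) @ basis
        pred = nn_classify(train_feats, labels[train_idx], test_feats)
        errors.append(np.mean(pred != labels[test_idx]))
    errors = np.array(errors)
    return float(errors.mean()), float(errors.var())

def cross_database_error(train_images, train_labels, test_images, test_labels,
                         n_components=8):
    """Inter-database 'cross' term: train on one database, test on the other."""
    mean, _, basis = kl_basis(train_images, n_components)
    pred = nn_classify((train_images - mean) @ basis, train_labels,
                       (test_images - mean) @ basis)
    return float(np.mean(pred != test_labels))

def make_synthetic(n_per_class=100, dim=16, seed=0):
    """Two Gaussian clusters as placeholder data (not NIST images)."""
    rng = np.random.default_rng(seed)
    a = rng.normal(0.0, 1.0, size=(n_per_class, dim))
    b = rng.normal(5.0, 1.0, size=(n_per_class, dim))
    return np.vstack([a, b]), np.repeat([0, 1], n_per_class)

if __name__ == "__main__":
    db1 = make_synthetic(seed=1)
    db2 = make_synthetic(seed=2)
    print("intra-database (mean, variance):", cross_validate(*db1, n_components=4))
    print("cross term, db1 -> db2:", cross_database_error(*db1, *db2, n_components=4))
```

In this sketch the descending eigenvalues returned by `kl_basis` play the role of the eigenspectrum used as an a priori variance measure, while the two error functions correspond to the intra-database and 'cross' terms respectively.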

Citation

Patrick J. Grother "Cross-validation comparison of NIST OCR databases", Proc. SPIE 1906, Character Recognition Technologies, (14 April 1993); https://doi.org/10.1117/12.143632
