Scaling up document-image classifiers to handle an unlimited variety of document and image types poses serious
challenges to conventional trainable classifier technologies. Highly versatile classifiers demand representative
training sets which can be dauntingly large: in investigating document content extraction systems, we have
demonstrated the advantages of employing as many as a billion training samples in approximate k-nearest
neighbor (kNN) classifiers sped up using hashed K-d trees. We report here on an algorithm, which we call online
bin-decimation, for coping with training sets that are too big to fit in main memory, and we show empirically that
it is superior to offline pre-decimation, which simply discards a large fraction of the training samples at random
before constructing the classifier. The key idea of bin-decimation is to enforce, approximately, an upper bound on
the number of training samples stored in each K-d hash bin; an adaptive statistical technique allows this to be
accomplished online and in linear time, while reading the training data exactly once. An experiment on 86.7M
training samples reveals a 23-times speedup with less than 0.1% loss of accuracy (compared to pre-decimation);
or, for another value of the upper bound, a 60-times speedup with less than 5% loss of accuracy. We also compare
it to four other related algorithms.
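The abstract does not specify the adaptive statistical technique, but its stated properties (online, linear time, single pass, approximate per-bin cap) match per-bin reservoir sampling. The sketch below is one plausible instance under that assumption; `hash_fn` and `cap` are illustrative placeholders, not names from the paper.

```python
import random
from collections import defaultdict

def bin_decimate(samples, hash_fn, cap, seed=0):
    """One-pass per-bin decimation sketch: retain at most `cap` training
    samples in each K-d hash bin, the retained subset being a uniform
    random sample of everything that hashed to that bin."""
    rng = random.Random(seed)
    bins = defaultdict(list)   # bin key -> retained samples
    counts = defaultdict(int)  # bin key -> total samples seen so far
    for s in samples:
        key = hash_fn(s)
        counts[key] += 1
        if len(bins[key]) < cap:
            # bin not yet full: keep the sample unconditionally
            bins[key].append(s)
        else:
            # bin full: overwrite a stored sample with probability
            # cap / counts[key], so each sample seen in this bin is
            # retained with equal probability
            j = rng.randrange(counts[key])
            if j < cap:
                bins[key][j] = s
    return bins

# Illustrative usage: 1000 samples spread over 4 bins, capped at 10 each.
bins = bin_decimate(range(1000), lambda s: s % 4, cap=10)
```

Because each sample is read exactly once and handled in O(1) expected time, the whole pass is linear in the training-set size, matching the abstract's claims; unlike offline pre-decimation, dense bins are thinned while sparse bins keep all their samples.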