Missing values make pattern analysis difficult, particularly with limited available data. In longitudinal research,
missing values accumulate, thereby aggravating the problem. Here we consider how to deal with temporal
data with missing values in handwriting analysis. In studying the development of individuality in handwriting,
we encountered the fact that feature values are missing for several individuals at several time
instances. Six algorithms, i.e., random imputation, mean imputation, most likely independent value imputation,
and three methods based on Bayesian networks (static Bayesian network, parameter EM, and structural EM),
are compared on children's handwriting data. We evaluate the accuracy and robustness of the algorithms
under different ratios of missing data and draw conclusions about their relative merits. Specifically, the static
Bayesian network method is used for our data, which contain around 5% missing values, since it provides adequate
accuracy at low computational cost.
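Two of the simpler baselines above can be sketched in a few lines; the functions below are illustrative only (the paper's feature vectors and Bayesian-network methods are not reproduced):

```python
def mean_impute(column):
    """Replace missing entries (None) with the mean of the observed values."""
    observed = [v for v in column if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in column]

def mode_impute(column):
    """Replace missing entries (None) with the most frequent observed value,
    a rough stand-in for most-likely-independent-value imputation."""
    observed = [v for v in column if v is not None]
    mode = max(set(observed), key=observed.count)
    return [mode if v is None else v for v in column]
```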
There is little work done in the analysis of children's handwriting, which can be useful in developing automatic evaluation systems and in quantifying handwriting individuality. We consider the statistical analysis of children's handwriting in early grades. Samples of handwriting of children in Grades 2-4 who were taught the Zaner-Bloser style were considered. The commonly occurring word "and" written in cursive style as well as hand-print were extracted from extended writing. The samples were assigned feature values by human examiners using a truthing tool. The human examiners looked at how the children constructed letter formations in their writing, looking for similarities and differences from the instructions taught in the handwriting copy book. These similarities and differences were measured using a feature space distance measure. Results indicate that the handwriting develops towards more conformity with the class characteristics of the Zaner-Bloser copybook which, with practice, is the expected result. Bayesian networks were learnt from the data to enable answering various probabilistic queries, such as determining students who may continue to produce letter formations as taught during lessons in school and determining the students who will develop a different and/or variation of the those letter formations and the number of different types of letter formations.
Forensic identification is the task of determining whether or not observed evidence arose from a known source. It involves
determining a likelihood ratio (LR): the ratio of the joint probability of the evidence and source under the identification
hypothesis (that the evidence came from the source) to that under the exclusion hypothesis (that the evidence did not arise from
the source). In LR-based decision methods, particularly in handwriting comparison, a variable number of evidence inputs is
used. A decision based on many pieces of evidence can result in nearly the same LR as one based on few pieces of evidence.
We consider methods for distinguishing between such situations. One of these is to provide confidence intervals together
with the decisions and another is to combine the inputs using weights. We propose a new method that generalizes the
Bayesian approach and uses an explicitly defined discount function. Empirical evaluation with several data sets including
synthetically generated ones and handwriting comparison shows greater flexibility of the proposed method.
During the last few years many document recognition methods have been developed to determine whether a
handwriting specimen can be attributed to a known writer. However, in practice, the work-flow of the document
examiner continues to be manual-intensive. Before a systematic, computational approach can be developed, an
articulation of the steps involved in handwriting comparison is needed. We describe the work-flow of handwritten
questioned document examination, as described in a standards manual, and the steps where existing automation
tools can be used. A well-known ransom note case is considered as an example, where one encounters testing for
multiple writers of the same document, determining whether the writing is disguised, known writing is formal
while questioned writing is informal, etc. The findings for the particular ransom note case using the tools are
given. Observations are also made toward developing a more fully automated approach to handwriting examination.
Forensic individualization is the task of associating observed evidence with a specific source. The likelihood ratio
(LR) is a quantitative measure that expresses the degree of uncertainty in individualization, where the numerator
represents the likelihood that the evidence corresponds to the known and the denominator the likelihood that
it does not correspond to the known. Since the number of parameters needed to compute the LR is exponential
in the number of feature measurements, a commonly used simplification is the use of likelihoods based on
distance (or similarity) given the two alternative hypotheses. This paper proposes an intermediate method
which decomposes the LR as the product of two factors, one based on distance and the other on rarity. It was
evaluated using a data set of handwriting samples, by determining whether two writing samples were written
by the same/different writer(s). The accuracy of the distance and rarity method, as measured by error rates, is
significantly better than that of the distance-only method.
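The distance-and-rarity decomposition can be sketched as follows; the Gaussian models of the distance under each hypothesis, and all parameter values, are invented for illustration (the paper's actual density estimates are not reproduced):

```python
import math

def gaussian_pdf(x, mu, sigma):
    """Density of a univariate Gaussian at x."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def lr_distance_rarity(d, rarity, same_params=(0.1, 0.05), diff_params=(0.5, 0.2)):
    """LR as the product of a distance factor and a rarity factor.

    d      : distance between the two writing samples
    rarity : reciprocal of the probability of the observed letter formation,
             estimated from a reference population
    The (mu, sigma) pairs model d under the same-writer and different-writer
    hypotheses; these values are illustrative placeholders.
    """
    distance_factor = gaussian_pdf(d, *same_params) / gaussian_pdf(d, *diff_params)
    return distance_factor * rarity
```

A small distance and a rare formation both drive the LR up, which is the point of keeping the two factors separate.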
We provide a statistical basis for reporting the results of handwriting examination by questioned document (QD)
examiners. As a facet of QD examination, the analysis and reporting of handwriting
examination suffers from the lack of statistical data concerning the frequency of occurrence of combinations of
particular handwriting characteristics. QD examiners tend to assign probative values to specific handwriting
characteristics and their combinations based entirely on the examiner's experience and power of recall. The
research uses data bases of handwriting samples that are representative of the US population. Feature lists of
characteristics provided by QD examiners are used to determine which frequencies need to be evaluated.
Algorithms are used to automatically extract those characteristics; e.g., a software tool for extracting most
of the characteristics of the most common letter pair, th, is functional. For each letter combination the
marginal and conditional frequencies of their characteristics are evaluated. Based on statistical dependencies
of the characteristics the probability of any given letter formation is computed. The resulting algorithms are
incorporated into a system for writer verification known as CEDAR-FOX.
Over the last century forensic document science has developed progressively more sophisticated pattern recognition
methodologies for ascertaining the authorship of disputed documents. These include advances not only
in computer-assisted stylometrics but also in forensic handwriting analysis. We present a writer verification method
and an evaluation of an actual historical document written by an unknown writer. The questioned document
is compared against two known handwriting samples of Herman Melville, a 19th century American author who
has been hypothesized to be the writer of this document. The comparison led to a high confidence result that
the questioned document was written by the same writer as the known documents. Such methodology can be
applied to many such questioned documents in historical writing, both in literary and legal fields.
In the analysis of handwriting in documents a central task is that of determining line structure of the text,
e.g., number of text lines, location of their starting and end-points, line-width, etc. While simple methods can
handle ideal images, real world documents have complexities such as overlapping line structure, variable line
spacing, line skew, document skew, noisy or degraded images etc. This paper explores the application of the
Hough transform method to handwritten documents with the goal of automatically determining global document
line structure in a top-down manner which can then be used in conjunction with a bottom-up method such as
connected component analysis. The performance is significantly better than other top-down methods, such as
the projection profile method. In addition, we evaluate the performance of skew analysis by the Hough transform
on handwritten documents.
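A toy version of the Hough accumulation step, assuming foreground pixel coordinates have already been extracted from the page image:

```python
import math
from collections import Counter

def hough_lines(points, n_theta=180, rho_res=1.0):
    """Vote in (theta, rho) space for each foreground pixel; peaks in the
    accumulator correspond to dominant line directions in the document."""
    acc = Counter()
    for (x, y) in points:
        for t in range(n_theta):
            theta = math.pi * t / n_theta
            # normal form of a line: rho = x cos(theta) + y sin(theta)
            rho = x * math.cos(theta) + y * math.sin(theta)
            acc[(t, round(rho / rho_res))] += 1
    return acc

# A horizontal "text line" y = 5: every point votes for theta = 90 deg, rho = 5.
acc = hough_lines([(x, 5) for x in range(20)])
```

In a line-finding pipeline the strongest accumulator cells would seed the top-down line hypotheses that a bottom-up connected-component pass then refines.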
Many governments have some form of "direct democracy" legislation procedure whereby individual citizens can
propose various measures creating or altering laws. Generally, such a process is started with the gathering of a
large number of signatures. There is interest in whether or not there are fraudulent signatures present in such a
petition, and if so what percentage of the signatures are indeed fraudulent. However, due to the large number
of signatures (tens of thousands), it is not feasible to have a document examiner verify the signatures directly.
Instead, there is interest in creating a subset of signatures where there is a high probability of fraud that can be
verified. We present a method by which a pairwise comparison of signatures can be performed and subsequent
sorting can generate such subsets.
A novel statistical model for determining whether a pair of documents, a known and a questioned, were written
by the same individual is proposed. The goal of this formulation is to learn the specific uniqueness of style in a
particular author's writing, given the known document. Since there are often insufficient samples to extrapolate
a generalized model of a writer's handwriting based solely on the document, we instead generalize over the
differences between the author and a large population of known different writers. This is in contrast to an earlier
model proposed whereby probability distributions were specified a priori, without learning. We show the performance
of the model along with a comparison to the non-learning, older model, which shows significant improvement.
Writer adaptation or specialization is the adjustment of handwriting recognition algorithms to a specific writer's
style of handwriting. Such adjustment yields significantly improved recognition rates over counterpart general
recognition algorithms. We present the first unconstrained off-line handwriting adaptation algorithm for Arabic
presented in the literature. We discuss an iterative bootstrapping model which adapts a writer-independent
model to a writer-dependent model using a small number of words achieving a large recognition rate increase in
the process. Furthermore, we describe a confidence weighting method which generates better results by weighting
words based on their length. We also discuss script features unique to Arabic, and how we incorporate them into
our adaptation process. Even though Arabic has many more character classes than languages such as English,
significant improvement was observed.
The testing set consisting of about 100 pages of handwritten text had an initial average overall recognition
rate of 67%. After the basic adaptation was finished, the overall recognition rate was 73.3%. As the improvement
was most marked for the longer words, and the set of confidently recognized longer words contained many fewer
false results, a second method was presented using them alone, resulting in a recognition rate of about 75%.
Initially, these words had a 69.5% recognition rate, improving to about a 92% recognition rate after adaptation.
A novel hybrid method is presented with a rate of about 77.2%.
Line segmentation is the first and the most critical pre-processing step for a document recognition/analysis task.
Complex handwritten documents with lines running into each other impose a great challenge for the line segmentation
problem due to the absence of online stroke information. This paper describes a method to disentangle
lines running into each other, by splitting and associating the correct character strokes to the appropriate lines.
The proposed method can be used along with an existing algorithm that identifies such overlapping lines in
documents. A stroke tracing method is used to intelligently segment the overlapping components. The method
uses slope and curvature information of the stroke to disambiguate the course of the stroke at cross points. Once
the overlapping components are segmented into strokes, a statistical method is used to associate the strokes with the appropriate lines.
Word segmentation is the most critical pre-processing step for any handwritten document recognition and/or
retrieval system. When the writing style is unconstrained (written in a natural manner), recognition of individual
components may be unreliable, so they must be grouped together into word hypotheses before recognition
algorithms can be used. This paper describes a gap metrics based machine learning approach to separate a line
of unconstrained handwritten text into words. Our approach uses a set of both local and global features, which
is motivated by the ways in which human beings perform this kind of task. In addition, in order to overcome
the disadvantage of different distance computation methods, we propose a combined distance measure computed
using three different methods. The classification is done by using a three-layer neural network. The algorithm is
evaluated using an unconstrained handwriting database that contains 50 pages (1,026 lines, 7,562 word images) of
handwritten documents. The overall accuracy is 90.8%, which shows a better performance than a previous approach.
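A much-simplified sketch of gap-based word grouping; the paper's combined distance measure and three-layer neural network are replaced here by a single adaptive threshold, so this only illustrates the overall shape of the task:

```python
def segment_words(component_boxes, threshold_ratio=1.5):
    """Group connected-component bounding boxes on one text line into words
    by thresholding horizontal gaps. Boxes are (x_left, x_right) intervals
    sorted left to right; a gap well above the average is taken as
    an inter-word gap."""
    gaps = [component_boxes[i + 1][0] - component_boxes[i][1]
            for i in range(len(component_boxes) - 1)]
    avg_gap = sum(gaps) / len(gaps) if gaps else 0
    words, current = [], [component_boxes[0]]
    for box, gap in zip(component_boxes[1:], gaps):
        if gap > threshold_ratio * avg_gap:
            words.append(current)
            current = []
        current.append(box)
    words.append(current)
    return words
```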
Matching of partial fingerprints has important applications in both biometrics and forensics. It is well-known that
the accuracy of minutiae-based matching algorithms decreases dramatically as the number of available minutiae
decreases. When singular structures such as core and delta are unavailable, general ridges can be utilized. Some
existing highly accurate minutiae matchers do use local ridge similarity for fingerprint alignment. However, ridges
cover relatively larger regions, and therefore ridge similarity models are sensitive to non-linear deformation. An
algorithm is proposed here to utilize ridges more effectively, by using representative ridge points. These points
are represented similar to minutiae and used together with minutiae in existing minutiae matchers with simple
modification. Algorithm effectiveness is demonstrated using both full and partial fingerprints. The performance
is compared against two minutiae-only matchers (Bozorth and k-minutiae). Effectiveness with full fingerprint
matching is demonstrated using the four databases of FVC2002, where the error rate decreases by 0.2-0.7% using
the best matching algorithm. The effectiveness is more significant in the case of partial fingerprint matching,
which is demonstrated with sixty partial fingerprint databases generated from FVC2002 (with five levels of numbers
of minutiae available). When only 15 minutiae are available the error rate decreases by 5-7.5%. Thus the method,
which involves selecting representative ridge points, minutiae matcher modification, and a group of minutiae
matchers, demonstrates improved performance on full and especially partial fingerprint matching.
The paper describes the use of Conditional Random Fields (CRFs), utilizing contextual information, in automatically
labeling extracted segments of scanned documents as Machine-print, Handwriting and Noise. The result of
such a labeling can serve as an indexing step for a context-based image retrieval system or a bio-metric signature
verification system. A simple region growing algorithm is first used to segment the document into a number of
patches. A label for each such segmented patch is inferred using a CRF model. The model is flexible enough
to include signatures as a type of handwriting and isolate them from machine-print and noise. The robustness of
the model is due to the inherent nature of modeling neighboring spatial dependencies in the labels as well as
the observed data using CRF. Maximum pseudo-likelihood estimates for the parameters of the CRF model are
learnt using conjugate gradient descent. Inference of labels is done by computing the probability of the labels
under the model with Gibbs sampling. Experimental results show that this approach assigns correct labels to
95.75% of the data. The CRF-based model is shown to be superior to Neural Network and Naive Bayes classifiers.
A new technique to segment a handwritten document into distinct lines of text is presented. Line segmentation
is the first and the most critical pre-processing step for a document recognition/analysis task. The proposed
algorithm starts by obtaining an initial set of candidate lines from the piece-wise projection profile of the
document. The lines traverse around any obstructing handwritten connected component by associating it to the
line above or below. A decision on associating such a component is made by (i) modeling the lines as bivariate
Gaussian densities and evaluating the probability of the component under each Gaussian, or (ii) using a probability
obtained from a distance metric. The proposed method is robust in handling skewed documents and those with
lines running into each other. Experimental results show that on 720 documents (which includes English, Arabic
and children's handwriting) containing a total of 11,581 lines, 97.31% of the lines were segmented correctly. In
an experiment over 200 handwritten images with 78,902 connected components, 98.81% of them were associated
to the correct lines.
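The Gaussian association step (i) can be sketched as follows, simplified to axis-aligned (diagonal-covariance) line models; the line parameters below are illustrative, not fitted from any document:

```python
import math

def bivariate_gaussian_logpdf(pt, mean, var):
    """Log-density of a 2-D point under an axis-aligned bivariate Gaussian,
    a simplified stand-in for the line models described above."""
    return sum(
        -0.5 * math.log(2 * math.pi * v) - (x - m) ** 2 / (2 * v)
        for x, m, v in zip(pt, mean, var)
    )

def associate(component_centroid, line_models):
    """Assign a connected component to the text line whose Gaussian gives
    its centroid the highest probability."""
    return max(range(len(line_models)),
               key=lambda i: bivariate_gaussian_logpdf(component_centroid,
                                                       *line_models[i]))

# Two illustrative line models: ((mean_x, mean_y), (var_x, var_y)).
lines = [((100, 20), (900, 16)), ((100, 60), (900, 16))]
```

An obstructing component near y = 55 is pulled to the second line; near y = 22, to the first.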
The fingerprint verification task answers the question of whether or not two fingerprints belong to the same finger. The paper focuses on the classification aspect of fingerprint verification. Classification is the third and final step, after the two earlier steps of feature extraction, where a known set of features (minutiae points) has been extracted from each fingerprint, and scoring, where a matcher has determined a degree of match between the two sets of features. Since this is a binary classification problem involving a single variable, the commonly used threshold method is related to the so-called receiver operating characteristic (ROC). In the ROC approach the optimal threshold on the score is determined so as to decide match or non-match. Such a method works well when there is a well-registered fingerprint image. On the other hand, more sophisticated methods are needed when there exists only a partial imprint of a finger, as in the case of latent prints in forensics or due to limitations of the biometric device. In such situations it is useful to consider classification methods based on computing the likelihood ratio of match/non-match. Such methods are commonly used in some biometric and forensic domains, such as speaker verification, where there is a much higher degree of uncertainty. This paper compares the two approaches empirically for the fingerprint classification task when the number of available minutiae is varied. In both the ROC-based and likelihood ratio methods, learning is from a general population of an ensemble of pairs, each of which is labeled as being from the same finger or from different fingers. In the ROC-based method the best operating point is derived from the ROC curve. In the likelihood method the distributions of same-finger and different-finger scores are modeled using Gaussian and Gamma distributions. The performances of the two methods are compared for varying numbers of minutiae points available.
Results show that the likelihood method performs better than the ROC-based method when fewer minutiae points are available. Both methods converge to the same accuracy as more minutiae points are available.
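A minimal sketch of the likelihood-ratio classifier, with a Gaussian genuine-score model and a Gamma impostor-score model as described above; the parameter values are invented for illustration and are not the fitted values from the paper:

```python
import math

def gaussian_pdf(s, mu, sigma):
    """Density of the genuine (same-finger) score model."""
    return math.exp(-0.5 * ((s - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def gamma_pdf(s, k, theta):
    """Density of the impostor (different-finger) score model
    with shape k and scale theta."""
    return s ** (k - 1) * math.exp(-s / theta) / (math.gamma(k) * theta ** k)

def classify_lr(score, genuine=(0.8, 0.1), impostor=(2.0, 0.1)):
    """Declare a match iff LR = p(score | same) / p(score | different) > 1."""
    lr = gaussian_pdf(score, *genuine) / gamma_pdf(score, *impostor)
    return lr > 1.0
```

Unlike a fixed ROC threshold, the decision boundary here follows wherever the two fitted densities cross, which is what helps when few minutiae shift the score distributions.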
The design and performance of a system for spotting handwritten Arabic words in scanned document images is presented. The three main components of the system are a word segmenter, a shape-based matcher for words and a search interface. The user types a query in English within a search window; the system finds the equivalent Arabic word, e.g., by dictionary look-up, and locates word images in an indexed (segmented) set of documents. A two-step approach is employed in performing the search: (1) prototype selection: the query is used to obtain a set of handwritten samples of that word from a known set of writers (these are the prototypes), and (2) word matching: the prototypes are used to spot each occurrence of those words in the indexed document database. A ranking is performed on the entire set of test word images, where the ranking criterion is a similarity score between each prototype word and the candidate words based on global word shape features. A database of 20,000 word images contained in 100 scanned handwritten Arabic documents written by 10 different writers was used to study retrieval performance. Using five writers for providing prototypes and the other five for testing, with manually segmented documents, 55% precision is obtained at 50% recall. Performance increases as more writers are used for training.
A signature verification method that combines recognition methods of one-dimensional signals, e.g., speech and on-line handwriting, and two-dimensional images, e.g., holistic word recognition in OCR and off-line handwriting is described. In the one-dimensional approach, a sequence of data is obtained by tracing the exterior contour of the signature which allows the application of string-matching algorithms. The upper and lower contours of the signature are first determined by ignoring small gaps between signature components. The contours are combined into a single sequence so as to define a pseudo-writing path. To match two signatures a non-linear normalization method, viz., dynamic time warping, is applied to segment them into curves. Shape descriptors based on Zernike moments are extracted as features from each segment. A harmonic distance is used for measuring signature similarity. The two-dimensional approach is based on using features describing the word-shape. When the two methods are combined, the overall performance is significantly better than either method alone. With a database of 1320 genuines and 1320 forgeries the combination method has an accuracy of 90%.
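The non-linear alignment step can be illustrated with the classic dynamic time warping recurrence (the contour segmentation, Zernike features, and harmonic distance are not reproduced here):

```python
def dtw(a, b, dist=lambda x, y: abs(x - y)):
    """Classic dynamic time warping between two feature sequences: a
    stand-in for the non-linear normalization that aligns the
    pseudo-writing paths of two signature contours."""
    INF = float("inf")
    n, m = len(a), len(b)
    D = [[INF] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # each cell extends the cheapest of match, insertion, deletion
            D[i][j] = dist(a[i - 1], b[j - 1]) + min(
                D[i - 1][j], D[i][j - 1], D[i - 1][j - 1])
    return D[n][m]
```

A repeated sample in one sequence costs nothing under warping, which is why DTW tolerates the speed variations that a rigid point-to-point comparison would penalize.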
New machine learning strategies are proposed for person identification which can be used in several biometric
modalities such as friction ridges, handwriting, signatures and speech. The biometric or forensic performance
task answers the question of whether or not a sample belongs to a known person. Two different learning paradigms
are discussed: person-independent (or general learning) and person-dependent (or person-specific learning). In
the first paradigm, learning is from a general population of ensemble of pairs, each of which is labelled as being
from the same person or from different persons- the learning process determines the range of variations for given
persons and between different persons. In the second paradigm the identity of a person is learnt when presented
with multiple known samples of that person- where the variation and similarities within a particular person are
learnt. The person-specific learning strategy is seen to perform better than general learning (5% higher performance
with signatures). Improvement of person-specific performance with an increasing number of samples is also demonstrated.
Search aspects of a system for analyzing handwritten documents are described. Documents are indexed using global image features, e.g., stroke width and slant, as well as local features that describe the shapes of words and characters. Image indexing is done automatically using page analysis, page segmentation, line separation, word segmentation and recognition of words and characters. Two types of search are permitted: search based on global features of the entire document, and search using features at the local level. For the second type of search, i.e., local, all the words in the document are characterized and indexed by various features, and this forms the basis of different search techniques. The paper focuses on local search and describes four tasks: word/phrase spotting, text to image, image to text and plain text. Performance in terms of precision/recall and word ranking is reported on a database of handwriting samples from about 1,000 individuals.
Existing word image retrieval algorithms suffer from either low retrieval precision or high computational complexity. We present an effective and efficient approach for word image matching using gradient-based binary features. Experiments over a large database of handwritten word images show that the proposed approach consistently outperforms the best existing handwritten word image retrieval algorithm, Dynamic Time Warping (DTW) with profile-based shape features. Not only does the proposed approach have much higher retrieval accuracy, but it is also 893 times faster than DTW.
Using handwritten characters, we address two questions: (i) what is the group identification performance of different alphabets (upper and lower case), and (ii) what are the best characters for the verification task (same writer/different writer discrimination), knowing demographic information about the writer such as ethnicity, age or sex. The Bhattacharyya distance is used to rank different characters by their group discriminatory power, and the k-NN classifier to measure the individual performance of characters for group identification. Given the tasks of identifying the correct gender/age/ethnicity or handedness, the accumulated performance of characters varies between 65% and 85%.
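For reference, the Bhattacharyya distance between two univariate Gaussian feature distributions, the quantity used above for ranking characters, can be computed as:

```python
import math

def bhattacharyya_gaussian(mu1, var1, mu2, var2):
    """Bhattacharyya distance between two univariate Gaussians:
    a mean-separation term plus a variance-mismatch term."""
    return (0.25 * (mu1 - mu2) ** 2 / (var1 + var2)
            + 0.5 * math.log((var1 + var2) / (2 * math.sqrt(var1 * var2))))
```

The distance is zero for identical distributions and grows with class separation, so ranking characters by it orders them by group discriminatory power.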
Several dissimilarity measures for binary vectors are formulated and examined for their recognition capability in handwriting identification, for which binary micro-features are used to characterize handwritten character shapes. For eight dissimilarity measures, i.e., Jaccard-Needham, Dice, Correlation, Yule, Russell-Rao, Sokal-Michener, Rogers-Tanimoto and Kulzinsky, the discriminatory power of ten individual characters and their combination is exhaustively studied. Conclusions are made on how to choose a dissimilarity measure and how to combine hybrid features.
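Three of the listed measures can be sketched from the standard 2x2 agreement counts; the Sokal-Michener dissimilarity is taken here as one minus the simple matching coefficient, which may differ from the exact normalization used in the study:

```python
def binary_counts(u, v):
    """Counts of (1,1), (1,0), (0,1), (0,0) positions between two binary vectors."""
    s11 = sum(1 for a, b in zip(u, v) if a == 1 and b == 1)
    s10 = sum(1 for a, b in zip(u, v) if a == 1 and b == 0)
    s01 = sum(1 for a, b in zip(u, v) if a == 0 and b == 1)
    s00 = sum(1 for a, b in zip(u, v) if a == 0 and b == 0)
    return s11, s10, s01, s00

def jaccard(u, v):
    """Jaccard-Needham dissimilarity: disagreements over non-(0,0) positions."""
    s11, s10, s01, _ = binary_counts(u, v)
    return (s10 + s01) / (s11 + s10 + s01)

def dice(u, v):
    """Dice dissimilarity: disagreements with (1,1) matches double-weighted."""
    s11, s10, s01, _ = binary_counts(u, v)
    return (s10 + s01) / (2 * s11 + s10 + s01)

def sokal_michener(u, v):
    """One minus the simple matching coefficient: disagreements over all positions."""
    s11, s10, s01, s00 = binary_counts(u, v)
    return (s10 + s01) / (s11 + s10 + s01 + s00)
```

The measures differ mainly in whether (0,0) agreements count and how (1,1) matches are weighted, which is exactly what makes the choice matter for sparse micro-feature vectors.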
In our previous work on writer identification, a database of handwriting samples (written in English) of over one thousand individuals was created, and two types of computer-generated features of sample handwriting were extracted: macro and micro features. Using these features, writer identification experiments were performed: given that a document is written by one of n writers, the task is to determine the writer. With n = 2, we correctly determined the writer with 99% accuracy using micro features of only 10 characters in the writing; with n = 1000, the accuracy dropped to 80%. To obtain higher performance, we propose a combination of macro and micro level features. First, macro level features are used in a filtering model: the computer program is presented with multiple handwriting samples from a large number (1000) of writers, and the question posed is: which of the samples are consistent with a test sample? As a result of using the filtering model, a reduced set of documents (100) is obtained and presented to the final identification model, which uses the micro level features. We improved our writer identification system from 80% to 87.5% by the proposed filtering-combination technique when n = 1000.
A study was undertaken to determine the power of handwriting to distinguish between individuals. Handwriting samples of one thousand five hundred individuals, representative of the US population with respect to gender, age, ethnic group, etc., were obtained. Differences in handwriting were analyzed using computer algorithms for extracting features from scanned images of handwriting. Attributes characteristic of the handwriting were obtained, e.g., line separation, slant, character shapes, etc. These attributes, which are a subset of the attributes used by expert document examiners, were used to quantitatively establish individuality by using machine learning approaches. Using global attributes of handwriting and very few characters in the writing, the ability to determine the writer with a high degree of confidence was established. The work is a step towards providing scientific support for admitting handwriting evidence in court. The mathematical approach and the resulting software also have the promise of aiding the expert document examiner.
This paper describes an off-line handwritten document data collection effort conducted at CEDAR and discusses systems that manage the document image data. We introduce the CEDAR letter, discuss its completeness and then describe the specification of the CEDAR letter image database consisting of writer data and features obtained from a handwriting sample, statistically representative of the U.S. population. We divide the document image and information management system into four systems: (1) acquisition, (2) archiving, (3) indexing and retrieval and (4) display systems. This paper discusses the issues raised by constructing the CEDAR letter database and by its potential usefulness to document image analysis, recognition, and identification fields.
The problem of writer identification based on similarity is formalized by defining a distance between character- or word-level features and finding the most similar writings, or all writings that are within a certain threshold distance. Among many features, we consider stroke direction and pressure sequence strings of a character as character-level image signatures for writer identification. Since the conventional definition of edit distance is not directly applicable, we present newly defined and modified edit distances depending upon their measurement types. Finally, we present a prototype stroke direction and pressure sequence string extractor used for writer identification. The importance of this study is the attempt to give a definition of the distance between two characters based on the two types of strings.
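The baseline recurrence being modified is the standard edit distance; the pluggable substitution cost below is a hypothetical placeholder for the measurement-type-specific definitions introduced in the paper:

```python
def edit_distance(s, t, sub_cost=None):
    """Levenshtein distance between two symbol strings, e.g. stroke
    direction sequences. sub_cost lets a caller plug in modified
    substitution costs (default: unit cost unless symbols are equal)."""
    if sub_cost is None:
        sub_cost = lambda a, b: 0 if a == b else 1
    n, m = len(s), len(t)
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        D[i][0] = i  # delete all of s[:i]
    for j in range(m + 1):
        D[0][j] = j  # insert all of t[:j]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            D[i][j] = min(D[i - 1][j] + 1,          # deletion
                          D[i][j - 1] + 1,          # insertion
                          D[i - 1][j - 1] + sub_cost(s[i - 1], t[j - 1]))
    return D[n][m]
```

A directional variant would, for instance, charge less for substituting adjacent stroke directions than for opposite ones.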
This paper describes an approach to retrieving information from document images stored in a digital library by means of knowledge-based layout analysis and logical structure derivation techniques. Queries on document image content are categorized in terms of the type of information that is desired, and are parsed to determine the type of document from which information is desired, the syntactic level of the information desired, and the level of analysis required to extract the information. Using these clauses in the query, a set of salient documents are retrieved, layout analysis and logical structure derivation are performed on the retrieved documents, and the documents are then analyzed in detail to extract the relevant logical components. A 'document browser' application, being developed based on this approach, allows a user to interactively specify queries on the documents in the digital library using a graphical user interface, provides feedback about the candidate documents at each stage of the retrieval process, and allows refinements of the query based on the intermediate results of the search. Results of a query are displayed either as an image or as formatted text.
In our recent research, we found that visual inter-word relations can be useful for different stages of English text recognition, such as character segmentation and postprocessing. Different methods have been designed for different stages. In this paper, we propose a unified approach to using visual contextual information for text recognition. Each word image has a lattice, which is a data structure to keep results of segmentation, recognition and visual inter-word relation analysis. A lattice allows ambiguity and uncertainty at different levels. A lattice-based unification algorithm is proposed to analyze information in the lattices of two or more visually related word images and upgrade their contents. Under the approach, different stages of text recognition can be accomplished by the same set of operations -- inter-word relation analysis and lattice-based unification. The segmentation and recognition result of a word image can be propagated to visually related word images and can contribute to their recognition. In this paper, the formal definition of the lattice, the unification operators and their uses are discussed in detail.
Cherry Blossom is a machine-printed Japanese document recognition system developed at CEDAR in recent years. This paper focuses on the character recognition part of the system. For Japanese character classification, two feature sets are used: the local stroke direction feature, and the gradient, structural, and concavity feature. Based on each of these features, two different classifiers are designed: the so-called minimum error (ME) subspace classifier and the fast nearest-neighbor (FNN) classifier. While the original version of the FNN classifier uses only Euclidean distance, the new version uses both Euclidean distance and the distance calculation defined in the ME subspace method. This integration improved performance significantly. The number of character classes handled by these classifiers is about 3,300 (including alphanumeric, kana, and level-1 JIS Kanji). The classifiers were trained and tested on 200 ppi character images from the CEDAR Japanese character image CD-ROM.
The hand-printed address recognition system described in this paper is a part of the Name and Address Block Reader (NABR) system developed by the Center of Excellence for Document Analysis and Recognition (CEDAR). NABR is currently being used by the IRS to read address blocks (hand-print as well as machine-print) on fifteen different tax forms. Although machine-print address reading was relatively straightforward, hand-print address recognition has posed some special challenges due to demands on processing speed (with an expected throughput of 8450 forms/hour) and recognition accuracy. We discuss various subsystems involved in hand-printed address recognition, including word segmentation, word recognition, digit segmentation, and digit recognition. We also describe control strategies used to make effective use of these subsystems to maximize recognition accuracy. We present system performance on 931 address blocks in recognizing various fields, such as city, state, ZIP Code, street number and name, and personal names.
Determining the readability of documents is an important task. Human readability pertains to the scenario in which a document image is ultimately presented to a human to read. Machine readability pertains to the scenario in which the document is subjected to an OCR process. In either case, poor image quality might render a document unreadable; a document image that is human readable is often not machine readable. It is often advisable to filter out documents of poor image quality before sending them to either machine or human for reading. This paper is about the design of such a filter. We describe various factors that affect document image quality, and the accuracy with which metrics based on document image quality can predict the extent of human and machine readability. We illustrate the interdependence of image quality measurement and enhancement by means of two applications that have been implemented: (1) reading handwritten addresses on mailpieces and (2) reading handwritten U.S. Census forms. We also illustrate the degradation of OCR performance as a function of image quality. On an experimental test set of 517 document images, the image quality metric (measuring fragmentation due to poor binarization) correctly predicted 90% of the time that certain documents were of poor quality (fragmented characters) and hence not machine readable.
An important aspect of document understanding is document logical structure derivation, which involves knowledge-based analysis of document images to derive a symbolic description of their structure and contents. Domain-specific as well as generic knowledge about document layout is used in order to classify, logically group, and determine the read-order of the individual blocks in the image, i.e., translate the physical structure of the document into a layout-independent logical structure. We have developed a computational model for the derivation of the logical structure of documents. Our model uses a rule-based control structure, as well as a hierarchical multi-level knowledge representation scheme in which knowledge about various types of documents is encoded into a document knowledge base and is used by reasoning processes to make inferences about the document. An important issue addressed in our research is the kind of domain knowledge that is required for such analysis. A document logical structure derivation system (DeLoS) has been developed based on the above model, and has achieved good results in deriving the logical structure of complex multi-articled documents such as newspaper pages. Applications of this approach include its use in information retrieval from digital libraries, as well as in comprehensive document understanding systems.
Traditionally, a Chinese or Japanese optical character reader (OCR) has to represent each character category individually as one or more feature prototypes, or as a structural description composed of manually derived components such as radicals. Here we propose a new approach in which various kinds of visual similarities between different Chinese characters are analyzed automatically at the feature level. Using this method, character categories are related to each other by training on fonts, and character images from a text page can be related to each other based on the visual similarities they share. This method provides a way to interpret character images from a text page systematically, rather than as a sequence of isolated character recognitions. The use of the method for postprocessing in Japanese text recognition is also discussed.
The proposed approach addresses the problem of recognizing touching characters by forming a closed-loop system between segmentation and isolated character classification for mutually beneficial feedback. Multiple hypotheses are generated and ranked on the basis of various constraints, such as classifier confidence, geometric or structural requirements, and contextual information, at every intermediate step. The method uses a variable-width window sliding across the word and results in a tree structure whose intermediate nodes represent validated characters. Each partial path is assigned a confidence value on the basis of segmentation confidence, recognition confidence, and first-order transition probability, if applicable. Final candidates for the field truth are ranked according to the path confidence value. The proposed system is more robust since it uses all the available knowledge sources, including context, at run time at every intermediate step.
In lexicon-based recognition of machine-printed word images, the given lexicon can be quite extensive, and recognition performance is closely related to its size: performance drops quickly as the lexicon grows. Here, we present an algorithm to improve word recognition performance by reducing the size of the given lexicon. The algorithm uses the information provided by the first and last characters of a word. Given a word image and a lexicon that contains the word in the image, the first and last characters are segmented and then recognized by a character classifier. The possible candidates based on the classifier's results are selected, yielding a sub-lexicon. A word shape analysis algorithm is then applied to produce the final ranking of the given lexicon. The algorithm was tested on a set of machine-printed gray-scale word images that includes a wide range of print types and qualities.
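The first/last-character reduction step can be sketched in a few lines of Python; the function name and candidate sets below are illustrative assumptions, not the paper's actual implementation:

```python
def reduce_lexicon(lexicon, first_candidates, last_candidates):
    """Keep only words whose first and last characters appear among the
    character classifier's candidate sets for those two positions."""
    return [w for w in lexicon
            if w and w[0] in first_candidates and w[-1] in last_candidates]

# Suppose the classifier returned {'s', 'r'} for the first character
# and {'e', 't'} for the last one.
words = ["street", "state", "suite", "road", "route"]
sub_lexicon = reduce_lexicon(words, {"s", "r"}, {"e", "t"})
# "road" is excluded: its last character matches neither candidate.
```

The resulting sub-lexicon is what the word shape analysis stage would then rank.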
This paper presents a robust text segmentation algorithm for printed Japanese documents. A divide-and-conquer approach is proposed to handle a large variety of image qualities and print styles. The approach can adapt its processing strategies according to the text quality, i.e., a method using diverse knowledge sources is exploited to segment degraded text, while a fast, simple method is used for good-quality text. Since the algorithm can adaptively select methods for different scenarios, the segmentation is highly efficient in terms of speed and accuracy. The segmenter has three modules for image preprocessing, line segmentation, and character segmentation. The preprocessor uses statistical information about the image's connected components to globally estimate character size, and uses the projection profile to determine image quality. The line segmenter requires a 'thresholding and smoothing' step prior to line extraction if the image is noisy. During character segmentation, the character segmenter first tries to locate components which contain touching characters. If touching characters exist, an algorithm which includes profile-based splitting and classifier-based multiple hypothesis processing is invoked to perform the segmentation.
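As a rough illustration of the projection-profile idea used by the line segmenter, a horizontal ink profile can be scanned for runs of non-empty rows. This is a minimal sketch under the assumption of a clean binary (0/1) image, not the paper's adaptive method:

```python
def line_boundaries(img, gap=0):
    """Horizontal projection profile: rows whose ink count exceeds `gap`
    form text lines; returns a list of (start, end) row intervals."""
    profile = [sum(row) for row in img]
    lines, start = [], None
    for r, v in enumerate(profile):
        if v > gap and start is None:
            start = r                 # a text line begins
        elif v <= gap and start is not None:
            lines.append((start, r))  # the line ends at row r
            start = None
    if start is not None:             # line runs to the bottom edge
        lines.append((start, len(profile)))
    return lines

# Two short text lines separated by a blank row:
page = [[0, 0], [1, 1], [1, 0], [0, 0], [1, 1], [0, 0]]
# line_boundaries(page) -> [(1, 3), (4, 5)]
```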
There is an increasing trend in recent OCR research to improve recognition performance by combining several complementary algorithms, and numerous successful applications of classifier combination have been described in the literature. However, most combination algorithms neglect the fact that classifier performance depends on various pattern and image characteristics. More effective combination can be achieved if that dependency information is used to combine classifiers dynamically. Two types of dynamic selection are distinguished. Postconditioned selection seeks a better approximation to the unconditioned distribution of classifier outputs. Preconditioned selection captures the variations in the density function of classifier outputs conditioned on the inputs. Although both types of selection have the potential to improve combination performance, we argue that preconditioned selections have lower error bounds than postconditioned selections. The difficulties of applying preconditioned selection are identifying the characteristic variables and estimating their effects on classifier outputs. A solution using a neural network to achieve preconditioned selection is suggested. Two examples on handprinted digit recognition are used to illustrate the potential gain.
Mathematical morphology has been widely used in recent years in the image processing area. Due to the nonlinear nature of morphological operations, their application to some image noise removal problems has achieved very impressive results. However, most morphological algorithms to date use only a single structuring element. To optimize performance on various parts of an image, we introduce morphological operations using an adaptive structuring element for noise removal in binary and grayscale images. These techniques were used in the preprocessing stage of character recognition problems at CEDAR, SUNY at Buffalo, New York. Demonstrations of the improved performance of the algorithm are provided.
The word shape analysis approach to text recognition is motivated by discoveries in psychological studies of the human reading process. It attempts to describe and compare the shape of the word as a whole object without trying to segment and recognize the individual characters, thereby bypassing the errors committed in character segmentation and classification. However, the large number of classes and the large variation and distortion expected among patterns belonging to the same class make accurate recognition difficult for conventional pattern recognition approaches. A word shape analysis approach using ideal word patterns to overcome this difficulty and improve recognition performance is described in this paper. A special word pattern that characterizes a word class, called the ideal word pattern, is extracted from different sample patterns of the word class and stored in memory. Recognition of a new word pattern is achieved by comparing it with the ideal word pattern of each word class. The process of generating the ideal word pattern of each word class is proposed. The algorithm was tested on a set of machine-printed gray-scale word images that included a wide range of print types and qualities.
A new thresholding algorithm is presented to address strong noise, complex patterns, poor contrast, and variable modalities in gray-scale histograms. It is based on document image texture analysis and consists of three steps. First, candidate thresholds are produced from gray-scale histogram analysis. Then, texture features associated with each candidate threshold are computed from the corresponding run-length histogram. Finally, the optimal threshold is selected according to a goodness evaluation, so that the most desirable document texture features are produced. Only one pass through the image is required for optimal threshold selection. The test set consisted of 9,000 machine-printed address blocks from an unconstrained U.S. mail stream. Over 99.6% of the images were visually well binarized. In an objective test, a system run on 594 mail address blocks, which contained many difficult images, showed that an 8.1% higher character recognition rate was achieved compared to a previous algorithm due to Otsu.
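The first step, producing candidate thresholds from the gray-scale histogram, can be approximated by picking the local minima (valleys) of the histogram. This sketch covers candidate generation only; the run-length texture evaluation of the later steps is omitted, and the valley heuristic is an assumption rather than the paper's exact method:

```python
def candidate_thresholds(hist):
    """Return gray levels at local minima of a histogram; each valley
    separates two modes and is a plausible binarization threshold."""
    return [i for i in range(1, len(hist) - 1)
            if hist[i] < hist[i - 1] and hist[i] <= hist[i + 1]]

# A bimodal-looking histogram with valleys at levels 1 and 3:
# candidate_thresholds([5, 2, 6, 1, 4]) -> [1, 3]
```

Each candidate would then be scored by the texture features of the run-length histogram it induces, and the best-scoring one selected.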
A fast lexicon search based on a trie is presented. Traditionally, a successful trie retrieval requires each element of the input string to find an exact match along a particular path in the trie. However, this approach cannot verify a garbled input string generated by imperfect character segmentation and recognition in an OCR: the string may contain multiple candidates for a character position, or no candidate at all due to segmentation error. The proposed representation scheme, an extension of the trie, allows lexicon look-up even when only some of the string elements are specified. An input string is presented as a regular expression, and all paths in the trie that satisfy the expression are considered word candidates. The candidates can then be ranked based on character classifier decisions. Confusion in character recognition is handled by allowing an expression component to have more than one character candidate, while segmentation error is overcome by postulating that a word region contains a certain number of unknown characters. Three extensions have been made to the trie data structure for position-independent access and fast exhaustive search of all valid paths: (1) bidirectional linkage between two nodes at adjacent levels to enable trie traversal in either direction, (2) nodes with the same letter at the same word position are linked, so that all words with that letter at a specific position can be located immediately, and (3) an index table which records the entry points of letters at specific word positions in the trie, to facilitate trie access at an arbitrary letter position. This representation scheme has been tested on postal address field recognition. The trie represents 11,157 words; the average access time is 0.02 sec on a SUN SPARC2, with an average of 3 candidates.
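The core idea, matching a partially specified string against a trie, can be sketched with a plain dictionary-based trie, where each pattern element is either a set of character candidates (recognition confusion) or None for an unknown character (segmentation error). The helper names are hypothetical, and the bidirectional links and index table of the actual scheme are not modeled:

```python
def build_trie(words):
    """Nested-dict trie; '$' marks the end of a word."""
    root = {}
    for w in words:
        node = root
        for ch in w:
            node = node.setdefault(ch, {})
        node["$"] = True
    return root

def match(node, pattern, prefix=""):
    """Find all words satisfying a pattern: each element is a set of
    candidate characters, or None to accept any character."""
    if not pattern:
        return [prefix] if "$" in node else []
    out = []
    cands, rest = pattern[0], pattern[1:]
    for ch, child in node.items():
        if ch == "$":
            continue
        if cands is None or ch in cands:
            out += match(child, rest, prefix + ch)
    return out

trie = build_trie(["main", "mall", "mill", "hall"])
# Position 0 recognized as 'm', position 1 unknown, positions 2-3 'l':
# sorted(match(trie, [{"m"}, None, {"l"}, {"l"}])) -> ['mall', 'mill']
```

The surviving candidates would then be ranked by the character classifier's confidences, as in the paper.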
An image enhancement technique for mail piece images based on window statistics is presented. The approach has been developed to increase image quality for the subsequent segmentation and block analysis in the real-time address block location (RT-ABL) system, which processes a stream of mail piece images and locates the destination address block. As a framework for this approach, window statistics consisting of the local average α, the local standard deviation σ, and the center pixel value ρ over a 9 × 9 window are used. The approach includes contrast enhancement, bleed-through removal, and binarization. Contrast enhancement and bleed-through removal are achieved through a nonlinear mapping M(α, σ, ρ) obtained empirically. A simple and efficient binarization is also obtained using the ratio of the gray-scale output pixel value ρ′ produced by the mapping to α. Major advantages of this method are the avoidance of the black-out or white-out encountered with other binarization methods on low-contrast images, and improved character segmentation that helps the segmentation tool locate key components of the destination address block, such as the state abbreviation or ZIP Code. Examples of images transformed using the method are presented along with a discussion of performance comparisons.
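The window statistics themselves are straightforward to compute. The sketch below uses a (2k+1)×(2k+1) window clipped at the image border; the actual system uses 9×9, and the empirical nonlinear mapping M(α, σ, ρ) is not reproduced here:

```python
import statistics

def window_stats(img, r, c, k=1):
    """Local average and (population) standard deviation over a
    (2k+1) x (2k+1) window centered at (r, c), clipped at the border."""
    rows = range(max(0, r - k), min(len(img), r + k + 1))
    cols = range(max(0, c - k), min(len(img[0]), c + k + 1))
    vals = [img[i][j] for i in rows for j in cols]
    return statistics.mean(vals), statistics.pstdev(vals)

img = [[0, 0, 0],
       [0, 9, 0],
       [0, 0, 0]]
# At the center, the 3x3 window covers all nine pixels:
# window_stats(img, 1, 1) -> (1, 2.828...)
```

The nonlinear mapping would then combine these statistics with the center pixel value to produce the enhanced output pixel.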
We present a fast new algorithm for grayscale character recognition. By operating on grayscale images directly, we attempt to maximize the information that can be extracted; traditional recognition based on binarization of images is thereby avoided. Previous work using gradient-based contour encoding was applied in a restricted domain where objects were size-invariant and fewer patterns had to be distinguished, so a simpler feature set sufficed. In character recognition, a feature extractor has to handle character images of arbitrary size. Our algorithm extracts local contour features based on the pixel gradients present in an image. A gradient map, i.e., the gradient magnitude and direction at every pixel, is computed; this map is thresholded to avoid responses to noise and spurious artifacts, and is then partitioned coarsely. Character contours are encoded by quantizing gradient directions into a few representative direction bins. This method does not require that images be normalized to a fixed grid size before feature extraction. Features extracted by this method have been used to train a two-layer neural network classifier. Experiments with machine-printed numeric (10-class) and alphanumeric-and-punctuation (77-class) character images, captured at 300 ppi from mailpiece images, have been conducted. Currently, 99.4% of numerals and 93.4% of the 77-class images are recognized correctly from a testing set of 5,420 numeric and 24,988 character images. When shape and case confusions are allowed, the 77-class recognition performance improves to 97.5%. A similar experiment with binarized 77-class images resulted in 92.1% correctly recognized test images. This method is being extended to handwritten character recognition.
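A minimal version of the thresholded gradient-direction histogram can be sketched as follows. Central differences and eight direction bins are illustrative assumptions, not the paper's exact feature set (which partitions the map coarsely rather than pooling globally):

```python
import math

def gradient_features(img, n_bins=8, thresh=1.0):
    """Histogram of quantized gradient directions over interior pixels;
    gradients with magnitude below `thresh` are discarded as noise."""
    hist = [0] * n_bins
    for r in range(1, len(img) - 1):
        for c in range(1, len(img[0]) - 1):
            gx = img[r][c + 1] - img[r][c - 1]   # central difference in x
            gy = img[r + 1][c] - img[r - 1][c]   # central difference in y
            mag = math.hypot(gx, gy)
            if mag < thresh:
                continue                          # suppress noise responses
            angle = math.atan2(gy, gx) % (2 * math.pi)
            hist[int(angle / (2 * math.pi) * n_bins) % n_bins] += 1
    return hist

# A vertical intensity edge: all surviving gradients point in the +x
# direction, so only bin 0 is populated.
edge = [[0, 0, 10, 10] for _ in range(4)]
# gradient_features(edge) -> [4, 0, 0, 0, 0, 0, 0, 0]
```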
Optical character recognition (OCR) is traditionally applied to binary-valued imagery, although text is always scanned and stored in gray scale. Binarization of a multivalued image may remove important topological information from characters and introduce noise into the character background. Low-quality imagery, produced by poorly printed text and improper image lift, magnifies the shortcomings of this process. A character classifier is proposed to recognize gray-scale characters by extracting structural features from character outlines. A fast, local-contrast-based gray-scale edge detector has been developed to locate character boundaries. A pixel is considered an edge pixel if its gray value is below a threshold and it has a neighbor whose gray value is above the threshold. Edges are then thinned to one pixel wide. Structural features are extracted from the edges by convolving them with a set of feature templates. Currently, 16 features, such as strokes, curves, and corners, are considered. The extracted features are compressed to form a binary vector of 576 features, which is used as input to a classifier. This approach is being tested on machine-printed characters extracted from mail address blocks. Characters are sampled at 300 ppi and quantized with 8 bits. Experimental results also demonstrate that recognition rates can be improved by enhancing image quality prior to boundary detection.
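The edge-detection rule stated above (a pixel below the threshold with a neighbor above it) translates directly into code; the 4-neighborhood and function name are illustrative choices:

```python
def edge_pixels(img, t):
    """Return (row, col) of pixels whose gray value is below threshold t
    and which have at least one 4-neighbor above t."""
    h, w = len(img), len(img[0])
    edges = set()
    for r in range(h):
        for c in range(w):
            if img[r][c] >= t:
                continue                      # not a dark (character) pixel
            for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                nr, nc = r + dr, c + dc
                if 0 <= nr < h and 0 <= nc < w and img[nr][nc] > t:
                    edges.add((r, c))         # dark pixel on a bright border
                    break
    return edges

# Dark pixels adjacent to the bright pixel (value 200) are edge pixels:
# edge_pixels([[0, 200], [0, 0]], 128) -> {(0, 0), (1, 1)}
```

In the full system these edge pixels would then be thinned to one-pixel-wide contours before template convolution.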
The assignment of a nine-digit ZIP Code (ZIP + 4 Code) to the digital image of a machine-printed address block is a problem of central importance in automated mail sorting. The problem is especially difficult since most addresses do not contain ZIP + 4 Codes, and often the information that must be read to match an address to one of the 28 million entries in the ZIP + 4 file is erroneous, incomplete, or missing altogether. This paper discusses a system for interpreting a machine-printed address and assigning a ZIP + 4 Code using a constraint satisfaction approach. Words in an address block are first segmented and parsed to assign probable semantic categories. Word images are then recognized by a combination of digit, character, and word recognition algorithms. The control structure uses a constraint satisfaction problem-solving approach to match the recognition results to an entry in the ZIP + 4 file. It is shown how this technique can both determine correct responses and compensate for incomplete or erroneous information. Experimental results demonstrate the success of this system: in a recent test on over 1,000 machine-printed address blocks, the ZIP + 4 encode rate was over 73 percent, compared to a success rate of about 45 percent for current postal OCRs. Additionally, the word recognition algorithm recognizes over 92 percent of the input images (over 98 percent in the top 10 choices).
A robust algorithm for offline cursive script recognition is described. The algorithm uses a generate-and-test paradigm to analyze cursive word images. The generate phase of the algorithm intelligently segments the word after analyzing certain structural features present in the word. The test phase determines the most likely character candidates among the segmentation points by using a recognition algorithm trained on generalized cursive letter shapes. In a sense, word recognition is done by sliding a variable sized window across the word looking for recognizable characters and strokes. The output of this system is a list of all plausible interpretations of the word. This list is then analyzed by a two-step contextual post-processor which first matches all of the interpretations to a supplied dictionary using a string matching algorithm. This eliminates the least likely interpretations. The remaining candidates are then analyzed for certain character spatial relationships (local reference line finder) to finally rank the dictionary. The system has the advantage of not requiring explicit word training yet is able to recognize many handwriting styles. This system is being successfully tested on a database of handwritten words extracted from live mail with dictionary sizes of up to 300 words. Planned extensions include developing a multilevel generate-and-test paradigm which can handle any type of handwritten word.
A regression method is proposed to combine the decisions of multiple character recognition algorithms. The method computes a weighted sum of the rank scores produced by the individual classifiers and derives a consensus ranking; the weights are estimated by logistic regression analysis. Two experiments are discussed in which the method was applied to recognize degraded machine-printed characters and handwritten digits. The results show that the combination outperforms each of the individual classifiers.
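The combination rule, a weighted sum of rank scores, can be sketched as below. The weights would come from the logistic regression fit, which is not reproduced here, and the hypothetical rank scores simply assign higher values to better ranks:

```python
def combine_rankings(rankings, weights):
    """Weighted sum of rank scores from several classifiers.
    Each ranking maps a class label to a rank score (higher = better);
    returns labels sorted by combined score, best first."""
    scores = {}
    for ranking, w in zip(rankings, weights):
        for label, s in ranking.items():
            scores[label] = scores.get(label, 0.0) + w * s
    return sorted(scores, key=scores.get, reverse=True)

# Two classifiers disagree on the top choice; the regression weights
# (hypothetical here) favor the first classifier.
r1 = {"a": 3, "b": 2, "c": 1}
r2 = {"b": 3, "a": 2, "c": 1}
# combine_rankings([r1, r2], [0.7, 0.3]) -> ['a', 'b', 'c']
```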
Diagnosis of a malfunctioning physical system is the task of identifying those component parts whose failures are responsible for discrepancies between observed and correct system behavior. The goal of interactive diagnosis is to repeatedly select the best information-gathering action to perform until the device is fixed. We developed a probabilistic diagnosis theory that incorporates probabilistic reasoning into model-based diagnosis. In addition to the structural and functional information normally used in model-based diagnosis, probabilities of component failure are also used to solve the two major subtasks of interactive model-based diagnosis: hypothesis generation and action selection. This paper describes a model-based diagnostic system built according to our probabilistic theory. The major contributions of this paper are the incorporation of probabilistic reasoning into model-based diagnosis and the integration of repair as part of diagnosis. The integration of diagnosis and repair makes it possible to effectively troubleshoot failures in complex systems.