This paper considers methods for classifying data extracted from images of business documents after recognition. The peculiarities of analyzing recognized text are pointed out, and the mechanism for identifying recognized words is described. The advantages and disadvantages of the Levenshtein distance are listed. Other string distance metrics are considered: Jaro–Winkler similarity, the multiset metric, and the Most Frequent K Characters (MFKC) metric. The standard Levenshtein distance is compared with these metrics. A modification of the Levenshtein distance is proposed that accounts for the peculiarities of recognized characters. Experimental results illustrating the application of the proposed distance are provided.
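The abstract does not spell out how the Levenshtein distance is modified, so the following is only a minimal sketch of one plausible reading: substitutions between characters that OCR commonly confuses are charged less than ordinary substitutions. The `CONFUSABLE` table, its cost values, and the function names are assumptions made for illustration and are not taken from the paper.

```python
from typing import Dict, FrozenSet

# Hypothetical table of visually similar character pairs and the reduced
# substitution cost charged for each. The pairs and values are illustrative
# assumptions, not figures from the paper.
CONFUSABLE: Dict[FrozenSet[str], float] = {
    frozenset(("O", "0")): 0.2,
    frozenset(("l", "1")): 0.2,
    frozenset(("I", "1")): 0.2,
    frozenset(("S", "5")): 0.3,
    frozenset(("B", "8")): 0.3,
}


def substitution_cost(a: str, b: str) -> float:
    """Return 0 for equal characters, a reduced cost for confusable pairs, else 1."""
    if a == b:
        return 0.0
    return CONFUSABLE.get(frozenset((a, b)), 1.0)


def weighted_levenshtein(s: str, t: str) -> float:
    """Levenshtein distance with unit insert/delete cost and weighted substitution."""
    # prev[j] holds the distance between the processed prefix of s and t[:j].
    prev = [float(j) for j in range(len(t) + 1)]
    for i, cs in enumerate(s, start=1):
        curr = [float(i)]
        for j, ct in enumerate(t, start=1):
            curr.append(min(
                prev[j] + 1.0,                            # delete cs
                curr[j - 1] + 1.0,                        # insert ct
                prev[j - 1] + substitution_cost(cs, ct),  # substitute cs -> ct
            ))
        prev = curr
    return prev[-1]
```

With this weighting, a recognized word such as "INV0ICE" stays much closer to the dictionary word "INVOICE" than it would under the standard distance, where the same substitution costs a full unit.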
This paper proposes an approach for matching digitized copies of business documents. The task arises when comparing two versions of the same document, a genuine one and a possible forgery, to find modifications; for example, in the banking sector, where contracts are concluded in paper form and fraud must be avoided. The proposed method matches two documents by comparing images of text lines using a Variational Autoencoder (VAE) trained on genuine images and by calculating the Fisher information metric to find modifications. Experiments were conducted on the public Payslips dataset (in French). The results show high quality and reliability in finding document forgeries and are compared with the results of a method that applies OCR and image matching.
This article focuses on methods for detecting falsifications in scanned copies of business documents. The task arises when comparing two copies of a business document signed by two parties, in order to detect possible changes made by one of the parties. The problem is relevant, for instance, in the banking sector when agreements are signed on paper. The article considers partial matching of flexible documents, in which text attributes may change and unintentional modifications of non-essential words may occur. A method for comparing two scanned images based on recognition and analysis of word N-gram sequences is proposed. The method was tested on a private dataset and demonstrated high quality and reliability in finding differences between two samples of one agreement-type document.
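As a rough illustration of comparing two recognized texts through word N-grams, the sketch below lists the N-grams of a candidate copy that never occur in the reference copy. It assumes the OCR output is already available as plain strings; the whitespace tokenization, the default `n = 3`, and the function names are assumptions, and the article's actual matching procedure is likely more elaborate.

```python
from typing import List, Tuple


def word_ngrams(text: str, n: int = 3) -> List[Tuple[str, ...]]:
    """Split text on whitespace and return its consecutive word n-grams in order."""
    words = text.lower().split()
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]


def unmatched_ngrams(reference: str, candidate: str, n: int = 3) -> List[Tuple[str, ...]]:
    """Return the n-grams of the candidate text that do not occur in the reference."""
    reference_set = set(word_ngrams(reference, n))
    return [gram for gram in word_ngrams(candidate, n) if gram not in reference_set]


# Example: a single changed amount makes every n-gram containing it unmatched.
reference_text = "the bank pays the client 100 euro"
candidate_text = "the bank pays the client 900 euro"
suspicious = unmatched_ngrams(reference_text, candidate_text)
```

Because each word takes part in up to `n` overlapping N-grams, a single altered word produces a small cluster of unmatched N-grams, which helps localize the change.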
This work considers methods for comparing digitized copies of administrative documents. The problem arises, for example, in the banking sector when contracts are concluded in paper form and two copies of a document signed by two parties must be compared to find possible modifications made by one party. The proposed method of document image comparison is based on a combination of several ways of comparing images of words, which serve as descriptors of text feature points. Testing was conducted on the public Payslip Dataset (French). The results showed high quality and reliability in finding differences between two images that are versions of the same document.
This paper considers problems in developing stochastic models consistent with the results of character image recognition in a video stream. Assumptions about the structure and properties of the constructed models are formulated. The model components are described using the Dirichlet distribution and its generalizations. The parameters of these distributions are determined using statistical estimation methods. The Akaike information criterion is used to rank the models. The agreement of the proposed theoretical distributions with the sample data is verified.
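For reference, the Akaike information criterion is AIC = 2k - 2 ln L, where k is the number of estimated parameters and L is the maximized likelihood; lower values are preferred. The sketch below ranks candidate models by that formula. The function names and the input format (a mapping from model name to a log-likelihood and parameter count) are assumptions, and the paper's fitting procedure is not reproduced here.

```python
from typing import Dict, List, Tuple


def aic(log_likelihood: float, num_params: int) -> float:
    """Akaike information criterion: 2k - 2 ln L."""
    return 2.0 * num_params - 2.0 * log_likelihood


def rank_by_aic(models: Dict[str, Tuple[float, int]]) -> List[Tuple[str, float]]:
    """Sort models, given as name -> (maximized log-likelihood, parameter count),
    from lowest (best) to highest AIC."""
    scored = [(name, aic(ll, k)) for name, (ll, k) in models.items()]
    return sorted(scored, key=lambda item: item[1])
```

The criterion penalizes extra parameters, so a generalized distribution is ranked above the plain Dirichlet distribution only when its gain in log-likelihood outweighs the added parameter count.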
This paper states the problem of comparing digitized pages of official papers. The problem arises when two customer copies, signed at different times by two parties, are compared in order to find possible modifications introduced by one of the parties. It is of practical significance in the banking sector, where contracts are concluded in paper format. A recognition-based comparison method is suggested, which compares two bags of words obtained as the recognition results of the master and test pages. The described experiments were conducted using the Tesseract OCR engine and a Siamese neural network. The advantages of the suggested method are the stable operation of the comparison algorithm and its high precision; one disadvantage is its dependence on the chosen OCR.
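The bag-of-words comparison can be sketched as a multiset difference between the words recognized on the master page and those recognized on the test page. The sketch assumes the recognized text is already available as plain strings and uses simple whitespace tokenization; the OCR step and the Siamese network mentioned in the abstract are not covered, and the function names are illustrative.

```python
from collections import Counter
from typing import Tuple


def bag_of_words(text: str) -> Counter:
    """Count lowercase whitespace-separated words in a recognized page."""
    return Counter(text.lower().split())


def bag_difference(master: str, test: str) -> Tuple[Counter, Counter]:
    """Return (words missing from the test page, words extra on the test page)."""
    master_bag = bag_of_words(master)
    test_bag = bag_of_words(test)
    # Counter subtraction keeps only words with a positive remaining count.
    return master_bag - test_bag, test_bag - master_bag


missing, extra = bag_difference(
    "pay 100 euro to the client",
    "pay 900 euro to the client",
)
```

A bag of words ignores word order, so the comparison tolerates layout differences between the two scans, but it also cannot detect a change that only rearranges existing words.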