Duplicate document detection in DocBrowse

Vikram Chalana; Andrew G. Bruce; Thien Nguyen

doi:10.1117/12.304630

1 April 1998 Duplicate document detection in DocBrowse

Vikram Chalana, Andrew G. Bruce, Thien Nguyen

Proceedings Volume 3305, Document Recognition V; (1998) https://doi.org/10.1117/12.304630
Event: Photonics West '98 Electronic Imaging, 1998, San Jose, CA, United States

Abstract

Duplicate documents are frequently found in large databases of digital documents, such as those found in digital libraries or in the government declassification effort. Efficient duplicate document detection is important not only to allow querying for similar documents, but also to filter out redundant information in large document databases. We have designed three different algorithm to identify duplicate documents. The first algorithm is based on features extracted from the textual content of a document, the second algorithm is based on wavelet features extracted from the document image itself, and the third algorithm is a combination of the first two. These algorithms are integrated within the DocBrowse system for information retrieval from document images which is currently under development at MathSoft. DocBrowse supports duplicate document detection by allowing (1) automatic filtering to hide duplicate documents, and (2) ad hoc querying for similar or duplicate documents. We have tested the duplicate document detection algorithms on 171 documents and found that text-based method has an average 11-point precision of 97.7 percent while the image-based method has an average 11- point precision of 98.9 percent. However, in general, the text-based method performs better when the document contains enough high-quality machine printed text while the image- based method performs better when the document contains little or no quality machine readable text.

Citation Download Citation

Vikram Chalana, Andrew G. Bruce, and Thien Nguyen "Duplicate document detection in DocBrowse", Proc. SPIE 3305, Document Recognition V, (1 April 1998); https://doi.org/10.1117/12.304630

ACCESS THE FULL ARTICLE

INSTITUTIONAL
Select your institution to access the SPIE Digital Library.

SELECT YOUR INSTITUTION

PERSONAL
Sign in with your SPIE account to access your personal subscriptions or to use specific features such as save to my library, sign up for alerts, save searches, etc.

PERSONAL SIGN IN

No SPIE Account? Create one

PURCHASE THIS CONTENT

SUBSCRIBE TO DIGITAL LIBRARY

50 downloads per 1-year subscription

Members: $195

Non-members: $335 ADD TO CART

25 downloads per 1 - year subscription

Members: $145

Non-members: $250 ADD TO CART

PURCHASE SINGLE ARTICLE

Includes PDF, HTML & Video, when available