1 April 1998 Duplicate document detection in DocBrowse
Vikram Chalana, Andrew G. Bruce, Thien Nguyen
Proceedings Volume 3305, Document Recognition V; (1998) https://doi.org/10.1117/12.304630
Event: Photonics West '98 Electronic Imaging, 1998, San Jose, CA, United States
Abstract
Duplicate documents are frequently found in large databases of digital documents, such as those in digital libraries or in the government declassification effort. Efficient duplicate document detection is important not only to allow querying for similar documents, but also to filter out redundant information in large document databases. We have designed three different algorithms to identify duplicate documents. The first algorithm is based on features extracted from the textual content of a document, the second algorithm is based on wavelet features extracted from the document image itself, and the third algorithm is a combination of the first two. These algorithms are integrated within the DocBrowse system for information retrieval from document images, which is currently under development at MathSoft. DocBrowse supports duplicate document detection by allowing (1) automatic filtering to hide duplicate documents, and (2) ad hoc querying for similar or duplicate documents. We have tested the duplicate document detection algorithms on 171 documents and found that the text-based method has an average 11-point precision of 97.7 percent while the image-based method has an average 11-point precision of 98.9 percent. However, in general, the text-based method performs better when the document contains enough high-quality machine-printed text, while the image-based method performs better when the document contains little or no high-quality machine-readable text.
© (1998) COPYRIGHT Society of Photo-Optical Instrumentation Engineers (SPIE). Downloading of the abstract is permitted for personal use only.
Vikram Chalana, Andrew G. Bruce, and Thien Nguyen "Duplicate document detection in DocBrowse", Proc. SPIE 3305, Document Recognition V, (1 April 1998); https://doi.org/10.1117/12.304630
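The abstract reports results as an average 11-point precision but does not define the measure. A minimal sketch follows, assuming the conventional information-retrieval definition: interpolated precision averaged over the recall levels 0.0, 0.1, ..., 1.0 for a single ranked query result. The function name `eleven_point_average_precision` and its parameters are hypothetical and are not taken from the paper or from DocBrowse.

```python
from typing import List


def eleven_point_average_precision(ranked_relevance: List[bool],
                                   total_relevant: int) -> float:
    """Average interpolated precision at recall 0.0, 0.1, ..., 1.0.

    ranked_relevance[i] is True when the document at rank i + 1 is a
    true duplicate of the query; total_relevant is the number of true
    duplicates in the whole collection. (Assumed definition, see text.)
    """
    if total_relevant <= 0:
        raise ValueError("total_relevant must be positive")

    # Record (recall, precision) after each retrieved document.
    points = []
    hits = 0
    for rank, is_relevant in enumerate(ranked_relevance, start=1):
        if is_relevant:
            hits += 1
        points.append((hits / total_relevant, hits / rank))

    total = 0.0
    for i in range(11):
        level = i / 10
        # Interpolated precision: the best precision achieved at any
        # recall greater than or equal to this level (0 if never reached).
        reachable = [p for r, p in points if r >= level - 1e-12]
        total += max(reachable, default=0.0)
    return total / 11


if __name__ == "__main__":
    # Two true duplicates in the collection, found at ranks 1 and 3.
    print(eleven_point_average_precision([True, False, True, False], 2))
```

Under this definition, a figure such as those in the abstract would be obtained by averaging the per-query value over all test queries; whether the authors computed it exactly this way is not stated in the abstract.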
KEYWORDS
Databases
Detection and tracking algorithms
Feature extraction
Optical character recognition
Image retrieval
Algorithm development
Vector spaces
