Optical Character Recognition (OCR) technology is used in digitization projects to convert paper documents into electronic form. OCR systems typically introduce some recognition errors. As a result, if two exact copies of a given paper document are submitted for digitization, the resulting electronic copies will usually differ slightly.
In such situations, traditional de-duplication technology is ineffective. Hash-based de-duping, typically built on a CRC checksum or an MD5 hash, only matches files whose content is byte-for-byte identical, so a single misrecognized character produces a different hash value. Even if the original paper documents are exact duplicates, the de-duping does not work because the digitized files are rarely exact duplicates.
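The point can be illustrated with a minimal Python sketch. The two strings are hypothetical OCR outputs of the same page, and the character-level similarity from the standard library's `difflib` is only a stand-in for a near-duplicate measure; it is not Equivio's method.

```python
import hashlib
from difflib import SequenceMatcher

# Two hypothetical OCR outputs of the same page. In the second one the
# engine has misread "rn" as "m", a typical recognition error.
scan_a = "The tenant shall return the premises in good condition."
scan_b = "The tenant shall retum the premises in good condition."


def md5_hex(text: str) -> str:
    """Return the MD5 digest of the text, as used in exact-match de-duping."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()


def similarity(a: str, b: str) -> float:
    """Return a character-level similarity ratio between 0.0 and 1.0."""
    return SequenceMatcher(None, a, b).ratio()


# Exact-match de-duping compares hashes, so the two scans look unrelated.
hashes_match = md5_hex(scan_a) == md5_hex(scan_b)

# A similarity score still shows that the texts are almost identical.
score = similarity(scan_a, scan_b)

print(hashes_match)
print(round(score, 3))
```

The hash comparison reports a mismatch although only two characters differ, while the similarity ratio stays close to 1.0. A near-duplicate detector works with scores of this kind rather than with exact hash equality.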
In digitization projects, this generates significant business costs, such as degraded repository quality, low precision in search results and the redundant review of documents that are actually duplicates.
Equivio provides a convenient solution to the problem of de-duping in OCR scenarios. In digitization projects, Equivio's ability to detect and group near-duplicates is used to identify duplicates. By grouping the near-duplicates, Equivio allows reviewers of the documents to investigate the differences, identify redundancies and ascertain whether the documents originated from paper copies of the same source document.
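As a rough illustration of grouping near-duplicates, the sketch below links any two documents whose text similarity reaches a threshold and returns the resulting groups. The function name `group_near_duplicates`, the 0.9 threshold, the document IDs and the sample texts are all assumptions made for this example; the source does not describe Equivio's actual algorithm, and this is not it.

```python
from difflib import SequenceMatcher


def group_near_duplicates(docs: dict[str, str], threshold: float = 0.9) -> list[list[str]]:
    """Group document IDs that are linked by a similarity >= threshold."""
    ids = list(docs)
    parent = {doc_id: doc_id for doc_id in ids}

    def find(doc_id: str) -> str:
        # Follow parent links to the group representative, compressing the path.
        while parent[doc_id] != doc_id:
            parent[doc_id] = parent[parent[doc_id]]
            doc_id = parent[doc_id]
        return doc_id

    # Compare every pair of documents and merge the groups of similar ones.
    for i, a in enumerate(ids):
        for b in ids[i + 1:]:
            if SequenceMatcher(None, docs[a], docs[b]).ratio() >= threshold:
                parent[find(b)] = find(a)

    # Collect the members of each group under its representative.
    groups: dict[str, list[str]] = {}
    for doc_id in ids:
        groups.setdefault(find(doc_id), []).append(doc_id)
    return list(groups.values())


# Hypothetical OCR outputs: the first two come from copies of the same page.
docs = {
    "scan_001": "Invoice 4471: payment due within thirty days of receipt.",
    "scan_002": "lnvoice 4471: payrnent due within thirty days of receipt.",
    "scan_003": "Minutes of the board meeting held on the first of March.",
}

groups = group_near_duplicates(docs)
print(groups)  # [['scan_001', 'scan_002'], ['scan_003']]
```

A reviewer would then open each multi-document group and inspect the differences. Comparing every pair is only practical for small collections, so a real system would need a more scalable way of finding candidate pairs.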
Typical usage scenarios for Equivio in digitization projects include legal discovery, archives and records management projects for government and corporations, health care and insurance.