Equivio's patent-pending algorithm addresses the technological challenge of detecting
near-duplicate files. This problem has confronted the software industry since the early 1990s.

The near-duplicate problem is analogous to the problem of detecting exact duplicates. The traditional approach to exact-duplicate detection is to scan all files and apply a bit-by-bit comparison to files of the same size. To compare two files of the same size, both must be loaded into memory, so the operation is I/O-bound. Performance improves if the files are kept in memory, but the operation then becomes memory-bound.
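The traditional approach described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `find_exact_duplicates` is hypothetical, and the standard-library `filecmp.cmp` with `shallow=False` stands in for the bit-by-bit comparison.

```python
import filecmp
import os
from collections import defaultdict
from itertools import combinations


def find_exact_duplicates(root):
    """Return pairs of files under `root` whose contents are byte-identical."""
    # Step 1: group files by size; files of different sizes cannot be identical.
    by_size = defaultdict(list)
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.isfile(path):
                by_size[os.path.getsize(path)].append(path)

    # Step 2: compare every pair inside each same-size group, byte by byte.
    duplicates = []
    for paths in by_size.values():
        for a, b in combinations(sorted(paths), 2):
            if filecmp.cmp(a, b, shallow=False):
                duplicates.append((a, b))
    return duplicates
```

Every comparison in step 2 reads both files from disk, which is why the approach is I/O-bound unless the files are cached in memory.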

The graph below illustrates a typical Windows XP directory (c:\windows). It shows how the number of files sharing the same size correlates with file size. In this typical directory, 4,600 files have the same size as at least one other file, and 45,000 file comparisons are required to verify whether their content is the same. The problem is computationally difficult: the traditional solution demands O(N²) approximate string matching operations, where N is the number of scanned files.
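The comparison count depends on how the files are distributed across size groups: a group of k same-size files needs at most k(k-1)/2 pairwise comparisons. The sketch below shows the arithmetic on made-up sizes. The helper `count_comparisons` and the sample data are hypothetical, and the sketch does not reproduce the c:\windows figures, since the individual group sizes are not given here.

```python
from collections import Counter


def count_comparisons(file_sizes):
    """Worst-case number of pairwise comparisons when only same-size files are compared."""
    groups = Counter(file_sizes)  # size -> number of files with that size
    return sum(k * (k - 1) // 2 for k in groups.values())


# Hypothetical sizes: three files of 10 bytes, two of 20 bytes, one of 30 bytes.
sizes = [10, 10, 10, 20, 20, 30]
print(count_comparisons(sizes))              # 3 + 1 + 0 = 4 comparisons
print(len(sizes) * (len(sizes) - 1) // 2)    # 15 if every file were compared with every other
```

Grouping by size reduces the work, but a single large group still grows quadratically with the number of files in it.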

[Graph: number of files with the same size versus file size, c:\windows]

An improvement on this naïve approach is to compute a "Cyclic Redundancy Check" (CRC) code from the content of each file, and then compare only files with the same CRC. A CRC is a hash function over the content of the file (the result is often called a "signature"). The probability that two files with different content have the same signature is extremely low. In the Windows XP directory, no two files with a size greater than zero have the same CRC. Another hash function widely used for de-duplication is MD5.
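A minimal sketch of signature-based grouping follows, assuming Python's standard `zlib.crc32` and `hashlib.md5` as the CRC and MD5 implementations. The function names `file_signature` and `group_by_signature` are illustrative and not part of any product.

```python
import hashlib
import zlib
from collections import defaultdict


def file_signature(path, use_md5=False, chunk_size=65536):
    """Compute a CRC-32 (default) or MD5 signature of a file, reading it in chunks."""
    crc = 0
    md5 = hashlib.md5() if use_md5 else None
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            if use_md5:
                md5.update(chunk)
            else:
                crc = zlib.crc32(chunk, crc)  # running CRC over all chunks so far
    return md5.hexdigest() if use_md5 else format(crc, "08x")


def group_by_signature(paths, use_md5=False):
    """Map each signature to the files that share it; only these are duplicate candidates."""
    groups = defaultdict(list)
    for path in paths:
        groups[file_signature(path, use_md5)].append(path)
    return {sig: files for sig, files in groups.items() if len(files) > 1}
```

Each file is read once to compute its signature, so the pairwise comparison is replaced by a lookup on the signature value.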

CRC and MD5 are useful for detecting exact duplicates, but they do not address the near-duplicate problem. They generate exactly the same signature for exact duplicate files, but totally different signatures for similar files. This applies even if two files differ by only one letter, or if the content is the same but the formatting is different.
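The effect of a one-letter change can be seen with a short experiment. The sketch below again assumes the standard `zlib.crc32` and `hashlib.md5`; the sample sentences and helper names are made up for illustration.

```python
import hashlib
import zlib

original = b"The quick brown fox jumps over the lazy dog"
edited = b"The quick brown fox jumps over the lazy cog"  # one letter changed


def crc32_hex(data):
    """CRC-32 of `data` as an 8-digit hex string."""
    return format(zlib.crc32(data), "08x")


def md5_hex(data):
    """MD5 of `data` as a 32-digit hex string."""
    return hashlib.md5(data).hexdigest()


def differing_hex_digits(a, b):
    """Count positions at which two equal-length hex signatures differ."""
    return sum(x != y for x, y in zip(a, b))


for name, sign in (("CRC-32", crc32_hex), ("MD5", md5_hex)):
    s1, s2 = sign(original), sign(edited)
    print(name, s1, s2, "differing hex digits:", differing_hex_digits(s1, s2))
```

The two inputs differ in a single byte, yet neither signature gives any indication of how close they are, so these signatures cannot be used to rank files by similarity.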

Equivio's innovation is the ability to generate similar signatures for similar files. Equivio solves the near-duplicate problem while providing the precision, performance and scalability required for enterprise applications.


© 2004-2008 Equivio. All rights reserved.