|
Equivio's patent-pending algorithm addresses the technological challenge of detecting near-duplicate files. This problem has challenged the software industry since the early 1990's.
The near-duplicate problem is analogous to the problem of detecting exact duplicates. The traditional approach for exact duplication detection is to scan all files, and to apply bit-by-bit comparison to files of the same size. In order to compare two files of the same size, the files need to be loaded to memory and compared. Hence, this operation is I/O bounded. In order to improve performance, the files need to be stored in memory. At this point, it becomes a memory-bounded operation.
The graph below illustrates a typical windows XP directory (c:\windows). The graph shows how the number of files with the same size correlates with file size. In this typical directory there are 4,600 files with the same size. The number of file comparisons required to verify same content is 45,000. The problem is computationally difficult, and the traditional solution demands O(N2) approximate string matching, where N is the number of scanned files.
|