Conceptual search is used to retrieve documents that relate to a given word or idea, even if the search term itself does not appear in them. To cull non-relevant data, many litigation teams start out with traditional keyword searching and then advance to conceptual search. Standard keywords are limited because people cannot always anticipate the different ways that others refer to items or subjects of interest. Conceptual search technology addresses this challenge.
As the term suggests, near-duplicates are not exact copies of an electronic file but files with small textual or formatting differences. Typically, 30 to 50 percent of the electronic files in a case are near-duplicates.
The key differences between near-duplicate groupings and sets of documents retrieved by a conceptual search query are:
1. Homogeneous. Near-duplicate groupings are relatively homogeneous because the documents have only small differences, whereas documents returned by a conceptual search can be totally different in content and are retrieved together only because they relate to the same subject. The homogeneity of a near-duplicate grouping enables bulk handling of the set.
2. Mutually exclusive. Near-duplicate groupings are mutually exclusive, meaning that a document can belong to one and only one near-duplicate set. This is an important factor when you are using near-duplicate groupings to ensure the consistency of document treatment in a review process.
3. Pre-processed. Conceptual search queries are launched on the fly in order to retrieve additional documents related to a given document or subject. Near-duplicate groupings, on the other hand, are pre-processed persistent entities. As persistent categories, the near-duplicate groupings can be used to organize the review process, including document assignment, document flow, and quality assurance.
4. Email thread analysis. Near-duplicate groupings are a key component in condensing email chains. The need for near-duplicate groupings in email threads is manifest, for example, in grouping personalized emails, such as forms, mailers and postmaster replies, where the body of the email is the same with the exception of the "Dear $name" or specific details. The near-duplicate groupings are also useful for capturing email threads from OCR data.
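To make the mutually exclusive, pre-processed nature of near-duplicate groupings concrete, here is a minimal Python sketch. It is not the method used by any particular eDiscovery product: the word-shingle representation, the Jaccard similarity measure, the 0.6 threshold, and the names `shingles`, `jaccard`, and `group_near_duplicates` are all assumptions chosen for illustration. The sample emails are invented and mirror the "Dear $name" case described above.

```python
from itertools import combinations


def shingles(text, k=3):
    """Return the set of k-word shingles in a text (lowercased)."""
    words = text.lower().split()
    if len(words) < k:
        return {tuple(words)}
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets: |intersection| / |union|."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def group_near_duplicates(docs, threshold=0.6):
    """Partition docs (id -> text) into mutually exclusive groups.

    Documents whose shingle sets meet the (assumed) similarity threshold
    are merged with a union-find structure, so every document ends up in
    exactly one group.
    """
    parent = {doc_id: doc_id for doc_id in docs}

    def find(x):
        # Follow parent links to the group representative,
        # shortening the path along the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    sets = {doc_id: shingles(text) for doc_id, text in docs.items()}
    for a, b in combinations(docs, 2):
        if jaccard(sets[a], sets[b]) >= threshold:
            parent[find(a)] = find(b)

    groups = {}
    for doc_id in docs:
        groups.setdefault(find(doc_id), []).append(doc_id)
    return sorted(sorted(group) for group in groups.values())


# Invented sample data: two personalized mailers and one unrelated email.
emails = {
    "m1": "Dear Alice, your account statement for March is now available "
          "online. Please log in to view it.",
    "m2": "Dear Bob, your account statement for March is now available "
          "online. Please log in to view it.",
    "m3": "The quarterly budget meeting has been moved to Thursday at ten "
          "in the conference room.",
}

print(group_near_duplicates(emails))  # [['m1', 'm2'], ['m3']]
```

Because the groups are computed once over the whole collection and each document lands in exactly one of them, the result can be stored and reused for assignment and quality assurance, unlike a query that is run on demand. One caveat of this particular sketch is that merging on pairwise similarity can chain documents together, so a production system would likely use a stricter grouping rule and a scalable technique rather than comparing every pair.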
The new white paper "eDiscovery Document Review: Understanding the Four Key Differences between Conceptual Searching and Near-Duplicate Grouping" discusses the significantly different use scenarios and results for each method. This white paper can be downloaded from here.