In the past 15 to 18 months, we have seen an increasing number of requests to use near-duplicate detection to dedupe whole document collections. These requests usually stem from one of the following four scenarios:
- Long-running litigation where there is a need to refresh collection
- Turnover in the legal team resulting in the use of one or more vendors to process disparate collections over disparate systems
- Re-use of previously collected data from a separate litigation
- The need to synthesize productions from various parties in a multi-party litigation
It is easy to be seduced into thinking that applying technology will minimize the time and labor costs associated with these challenges. But like the sailor lured in by the siren’s call, there are real dangers in relying on near-duplication to dedupe.
The Difference Between Deduplication and Near-Duplication
The process of file deduplication eliminates multiple copies of the same file. Using criteria such as the email header, content, date and time sent, and attachment content, the software assigns a hash value to each file. Those hash values are then compared against one another to identify duplicates, and all but one of the files with matching hash values are removed from the deliverable. In eDiscovery, however, deduplication is performed at the family level rather than the document level. This means that the same Word document attached to two unique emails will not be deduped, because each copy belongs to a unique family.
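The family-level hashing described above can be sketched in a few lines of Python. This is a simplified illustration, not any vendor's actual implementation: the field names (`header`, `body`, `sent`, `attachments`) are assumptions, and real tools use more elaborate normalization before hashing.

```python
import hashlib

def family_hash(email):
    """Hash the fields deduplication typically keys on: header, body,
    date/time sent, and the content of every attachment in the family."""
    h = hashlib.sha256()
    for field in (email["header"], email["body"], email["sent"]):
        h.update(field.encode("utf-8"))
    for attachment in email["attachments"]:
        h.update(attachment.encode("utf-8"))
    return h.hexdigest()

def dedupe(emails):
    """Keep the first family seen for each hash; drop exact duplicates."""
    seen, kept = set(), []
    for email in emails:
        digest = family_hash(email)
        if digest not in seen:
            seen.add(digest)
            kept.append(email)
    return kept
```

Because the attachment content is folded into the family hash, the same Word document attached to two different emails yields two different hashes, so neither family is removed.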
Unlike deduplication, near-duplication is performed at the document level rather than the family level. (Read: four workflows that use near-duplication during review.) Near-duplication is the process by which the textual content of a document is compared against all other documents, creating groups of documents with similar content. Each grouped document is assigned a similarity value based on how closely it matches the lead document in its group. It is important to note that near-duplication does not remove documents; it only groups similar documents together. And while deduplication factors family content into the hash values it compares, near-duplication does not; it focuses solely on the individual documents. This means that the Word document from the example above would be assigned a value of 'duplicate' yet still included in the collection during near-duplication.
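The grouping step can be sketched as follows. This is a minimal illustration using greedy grouping over Jaccard similarity of word shingles; production near-duplication engines use their own proprietary text-comparison methods, and the threshold and function names here are assumptions.

```python
def shingles(text, k=3):
    """Break text into overlapping k-word sequences for comparison."""
    words = text.lower().split()
    return {tuple(words[i:i + k]) for i in range(max(1, len(words) - k + 1))}

def jaccard(a, b):
    """Overlap of two shingle sets: 1.0 = identical text, 0.0 = disjoint."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def group_near_dupes(docs, threshold=0.8):
    """Greedy grouping: the first ungrouped doc becomes the lead; every
    other doc similar enough to the lead joins, tagged with its score.
    Nothing is removed -- every document lands in some group."""
    groups = []
    remaining = list(docs.items())
    while remaining:
        lead_id, lead_text = remaining.pop(0)
        lead_sh = shingles(lead_text)
        group = [(lead_id, 1.0)]
        still_ungrouped = []
        for doc_id, text in remaining:
            score = jaccard(lead_sh, shingles(text))
            if score >= threshold:
                group.append((doc_id, score))
            else:
                still_ungrouped.append((doc_id, text))
        remaining = still_ungrouped
        groups.append(group)
    return groups
```

Note that two identical documents end up in the same group with a similarity of 1.0, but both remain in the output, which is exactly why near-duplication cannot substitute for deduplication.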
As described above, the process of deduplication is not at all similar to the process of near-duplicate detection. The danger lies in misunderstanding the major distinctions between the two processes and attempting to use one to perform the other's task. The three major distinctions are:
- Per Family (email + attachment) vs. Per Document
- Textual Analysis vs. File Analysis
- Duplicates vs. Similarities
Near-Duplication is Still an Effective Strategy for Litigation Support
Although these differences prevent near-duplication from serving as a deduplication tool, there is still a viable strategy for its use, and it can be leveraged to aid in all four of the scenarios mentioned at the beginning of this post.
This strategy is simple…
Use near-duplication to group similar documents together. Create batches, including both the new and the old document sets, based on those groupings. Sort each batch to show the master or pivot document at the top, with all other documents in the group ordered by similarity from highest to lowest. Finally, display the responsive, confidential, and issue fields in the document list. The reviewer can now see at a glance how the previously reviewed documents line up with the newly added ones, and can quickly code the similar documents using a one-click or mass-edit function. By creating a visual grouping, you eliminate blind coding and the subjective removal of near-dupes, save hours of labor, and leverage the prior work product to inform the new.
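The batch-ordering step above can be sketched as follows, assuming each near-dupe group is a list of `(doc_id, similarity)` pairs and prior coding decisions are available as a lookup. The field names (`responsive`, `confidential`, `issue`) and helper name are illustrative, not a specific platform's API.

```python
def build_review_batch(group, prior_coding):
    """Order a near-dupe group for review: pivot document first, then the
    rest sorted by similarity from highest to lowest, each row carrying
    any earlier coding so the reviewer can propagate it in one pass.

    group: list of (doc_id, similarity), pivot at index 0
    prior_coding: dict mapping doc_id -> coding fields from the old set
    """
    pivot, members = group[0], group[1:]
    ordered = [pivot] + sorted(members, key=lambda m: m[1], reverse=True)
    return [
        {"doc": doc_id, "similarity": score, **prior_coding.get(doc_id, {})}
        for doc_id, score in ordered
    ]
```

When the batch is displayed this way, the previously coded pivot sits at the top and uncoded new documents line up directly beneath it in descending similarity order, which is what makes one-click or mass-edit propagation safe and fast.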