In today’s typical enterprise environment, collection points expand far beyond an exchange server and a share drive. From the cloud and mobile data to social media and chat, collection points are incredibly diverse, leading to data-related issues such as overcollection and the proliferation of duplicate records.
In 2019, I am sure that most litigators and eDiscovery professionals understand the premise of deduplication: the processing tool gathers strings of data, converts those strings into hash codes, compares the hash values, identifies matching records, and flags one as unique and the others as duplicates. That being said, I’m not sure everyone realizes that the process for hashing loose files and attachments is different from the process used to hash email. For non-email records, the system hashes the file in its entirety, creating a unique “fingerprint” from the file signature. Email files, however, are hashed by extracting a variety of metadata fields and combining those values into a string that is then run through the hashing algorithm. Each processing tool performs this task differently from the next, but in most cases the metadata values include common email fields: From, To, CC, BCC, Sent Date and Time, Email Subject, Email Body, and attachment names. Any variation in those fields will result in a different hash value.
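To make the distinction concrete, here is a minimal Python sketch of the two approaches. The field list, the delimiter, and the choice of MD5 are illustrative assumptions; each commercial processing tool has its own implementation.

```python
import hashlib

def hash_loose_file(path):
    """Non-email records: hash the file's bytes in their entirety.
    (Illustrative; real tools may hash in chunks or use other digests.)"""
    with open(path, "rb") as f:
        return hashlib.md5(f.read()).hexdigest()

def hash_email(msg):
    """Email records: concatenate selected metadata fields into one
    string, then hash that string. Field names are hypothetical."""
    fields = ("from", "to", "cc", "bcc", "sent",
              "subject", "body", "attachment_names")
    combined = "|".join(str(msg.get(f, "")) for f in fields)
    return hashlib.md5(combined.encode("utf-8")).hexdigest()
```

Note how fragile the email approach is: two messages identical except for a single trailing whitespace character in the body will produce entirely different hash values, which is exactly the failure mode discussed below.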
Top deduplication issues
As data is managed at the source, subtle changes may take place that render deduplication ineffective. Because the hashing algorithms no longer recognize these files as duplicates, matching records continue to enter review populations, leading to review of the same file over and over. But what would cause these values to differ when collecting data across a single enterprise? While several things may cause identical records to not deduplicate, in our experience five reasons are the most common.
Reason #1: Enterprise email migration
As more organizations embark on cloud strategies, they are migrating their email from on-premises Exchange servers to cloud-based Microsoft Office 365. During the migration process, slight changes happen to a certain percentage of email files, resulting in deduplication failure. These changes can be as simple as added white space or a hard return. We’ve also found that certain characters are treated differently in different versions of Outlook.
Reason #2: Mobile device collection
When collecting mobile data, we find that images and icons are commonly extracted as attachments, while that same message in enterprise email systems leaves those image files intact. Because attachments are part of the hashing algorithm, an email collected from both the phone and the Exchange server is unlikely to dedupe under these conditions.
Reason #3: Custodian email management practices
Individuals have all sorts of email management styles, and sometimes those practices result in email deduplication issues. For example, I had a colleague who was in the habit of removing attachments from email before running his archival routine. Other folks store their email in rich text instead of the more common HTML format. Like mobile email, this practice creates far more attachments, and the font changes generate more whitespace in the body of an email. These subtle changes are likely to result in deduplication issues.
Reason #4: File manipulation before processing
Recently we worked on a case where the enterprise email archival system created a wrapper, or journal email, to which the original email was attached. The collecting party had developed a method for separating the original email from the wrapper and saving those files to disk. This process certainly eliminated the email wrapper problem, but it created a deduplication problem, because the archived emails now had different properties than the non-archived email collected from Exchange.
Reason #5: Collection from multiple email clients
Anyone who follows politics knows that people often use personal email to conduct business. In many cases individuals will copy a Gmail, Yahoo or AOL account when emailing from their business account. When those accounts are collected, we frequently find deduplication discrepancies between like emails.
Addressing the challenges
Understanding that email deduplication is no longer comprehensive is the first step in identifying a solution for the problem of redundant data in a review. Currently, there are a couple of well-known options for attempting to deal with this issue: email threading and near deduplication. Additional options, like metadata matching, continue to be explored but have yet to gain widespread adoption. We believe that as more attention is given to the issues caused by the proliferation of data, these alternate solutions will gain more acceptance and come to be seen as a necessary part of the eDiscovery toolkit.
How can we identify and pull these duplicates from review?
Recognizing that hash value analysis is the gold standard in deduplication, additional processes are available to address this problem. Three options are email threading, near duplicate detection and metadata matching. Email threading can be used to pull conversations together, and some threading systems identify textually duplicate emails. Recall, though, that the same subtle differences in body content that cause deduplication to fail will also cause text comparison to fail. Near duplicate detection shares that limitation and has another of its own: emails with long recipient lists but different content can be linked together as near duplicates, because they share a high percentage of similar content. This brings us to metadata matching.
Metadata Matching Solution
Metadata matching is a process used by EQ Data Intelligence Consultants to compare a select number of metadata fields to identify matching emails. Those fields are Internet Message ID, Email Subject, Chron Date and Time, Number of Attachments, and Email BCC. The field values for each record are strung together and compared, and matches are flagged with one of three values: Master, Match, or Unique. Although each of the above fields has a significant role in the process, the key field is the Internet Message ID, a unique identifier created by the email client when the message is sent by the system. This is why emails collected from both Exchange and Gmail can match. The Master and Match docs are joined together by another new field, Match Family, which is populated with the docid of the Master record. While not a true deduplication process, this matching process has proven to significantly reduce redundant data in review populations where traditional deduplication and email threading do not. Before removing data, we provide a sample of match families for review to confirm that the process is working as intended.
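The grouping logic described above can be sketched in a few lines of Python. This is a simplified illustration, not EQ's actual implementation: the field names, the choice of the first document in each group as Master, and the record structure are all assumptions for the example.

```python
from collections import defaultdict

# Illustrative match-key fields (hypothetical names)
KEY_FIELDS = ("internet_message_id", "subject",
              "chron_date", "num_attachments", "bcc")

def metadata_match(emails):
    """Group emails on concatenated key fields, then flag each record:
    singletons are Unique; in multi-record groups the first doc is
    Master and the rest are Match, linked back via a Match Family
    field holding the Master's docid. Mutates records in place."""
    groups = defaultdict(list)
    for doc in emails:
        key = "|".join(str(doc.get(f, "")) for f in KEY_FIELDS)
        groups[key].append(doc)
    for docs in groups.values():
        if len(docs) == 1:
            docs[0]["match_status"] = "Unique"
        else:
            master = docs[0]
            master["match_status"] = "Master"
            for doc in docs[1:]:
                doc["match_status"] = "Match"
                doc["match_family"] = master["docid"]
    return emails
```

Because the key leans on the Internet Message ID rather than the message body, copies of the same email collected from different clients can still group together even when body whitespace differs, which is exactly where hash-based deduplication breaks down.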
Interested in learning more? Check out this White Paper, titled Picking Up Where Deduplication Leaves Off.
Chad Jones is Director of the Data Intelligence Group at EQ Consulting, a service offering of Special Counsel. Chad recently hosted a webinar: Is Deduplication Broken? Connect with Chad on LinkedIn or via email today!