I’m proud to be the first to announce that the Electronic Discovery Reference Model (EDRM) has developed a specification for cross-platform identification of duplicate email messages, allowing for ready detection of duplicate messages that waste review time and increase cost. Leading e-discovery service and software providers support the new specification, making it possible for lawyers to improve discovery efficiency by a simple addition to requests for production. If that sounds too good to be true, read on and learn why and how it works.

THE PROBLEM

The triumph of information technology is the ease with which anyone can copy, retrieve and disseminate electronically stored information. Yet, for email in litigation and investigations, that blessing comes with the curse of massive replication, obliging document reviewers to assess and re-assess nearly identical messages for relevance and privilege. Duplicate messages waste time and money and carry a risk of inconsistent characterization. Seeing the same thing over and over makes a tedious task harder.

Electronic discovery service providers and software tools ameliorate these costs, burdens and risks using algorithms to calculate hash values—essentially digital fingerprints—of segments of email messages, comparing those hash values to flag duplicates. Hash deduplication works well, but stumbles when minor variations prompt inconsistent outcomes for messages reviewers regard as being “the same.” Hash deduplication fails altogether when messages are exchanged in forms other than those native to email communications—a common practice in U.S. electronic discovery where efficient electronic forms are often printed to static page images.
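To make that fragility concrete, here is a minimal Python sketch. The function name, the fields hashed, the choice of MD5 and the sample message are all assumptions for illustration, not any vendor’s actual method. It shows how a difference as small as a line-ending convention produces different hash values for what a reviewer would call the same message.

```python
import hashlib

def content_hash(sender: str, subject: str, body: str) -> str:
    """Hash selected message segments, as a hypothetical platform might."""
    joined = "\n".join([sender, subject, body])
    return hashlib.md5(joined.encode("utf-8")).hexdigest()

# The "same" message as two hypothetical tools might extract it:
# one keeps Windows line endings in the body, the other normalizes them.
platform_a = content_hash("amy@example.com", "Q3 forecast", "See attached.\r\n")
platform_b = content_hash("amy@example.com", "Q3 forecast", "See attached.\n")

# One stray carriage return is enough to change the digest.
hashes_match = platform_a == platform_b
print(hashes_match)
```

Each platform is internally consistent, so deduplication works within one tool; the mismatch only appears when values from different tools are compared.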

Without the capability to hash identical segments of identical formats across different software platforms, reviewers cannot easily identify duplicates or readily determine what’s new versus what’s been seen before. When identical messages are processed by different tools and vendors or produced in different forms (so-called “cross-platform productions”), identification of duplicate messages becomes an error-prone, manual process or requires reprocessing of all documents.

Astonishingly, no cross-platform method of duplicate identification has emerged despite decades spent producing email in discovery and billions of dollars burned by reviewing duplicates.

Wouldn’t it be great if there were a solution to this delay, expense and tedium?

THE SOLUTION

When parties produce email in discovery and investigations, it’s customary to supply information about the messages called “metadata” in accompanying “load files.” Load files convey Bates numbers/Document IDs, message dates, sender, recipients and the like. Ideally, the composition of load files is specified in a well-crafted request for production or production protocol. Producing metadata is a practice that’s evolved over time to prompt little argument. For service providers, producing one more field of metadata is trivial, rarely requiring more effort than simply ticking a box.

The EDRM has crafted a new load file field called the EDRM Message Identification Hash (MIH), described in the EDRM Email Duplicate Identification Specification.

Gaining the benefit of the EDRM Email Duplicate Identification Specification is as simple as requesting that load files contain an EDRM Message Identification Hash (MIH) for each email message produced. The EDRM Email Duplicate Identification Specification is an open specification, so no fees or permissions are required to use it, and leading e-discovery service and software providers already support it. For others, it’s simple to generate the MIH without redesigning software or impeding workflows. The EDRM has also made free tools available that support the specification.

Any party with the MIH of an email message can readily determine if a copy of the message exists in their collection. Armed with MIH values for emails, parties can flag duplicates even when those duplicates take different forms, enabling native message formats to be compared to productions supplied as TIFF or PDF images.

The routine production of the MIH supports duplicate identification across platforms and parties. By requesting the EDRM MIH, parties receiving rolling or supplemental productions will know if they’ve received a message before, allowing reviewers to dedicate resources to new and unique evidence. Email messages produced by different parties in different forms using different service providers can be compared to instantly surface or suppress duplicates. Cross-platform email duplicate identification means that email productions can be compared across matters, too. Parties receiving production can easily tell if the same message was or was not produced in other cases. Cross-platform support also permits a cross-border ability to assess whether a message is a duplicate without the need to share personally-identifiable information restricted from dissemination by privacy laws.
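As a rough illustration of that comparison, the Python sketch below checks a supplemental production against an earlier one using only the MIH column of each load file. Everything specific here is an assumption made for the example: the comma-separated layout, the field names DocID and MIH, the Bates-style identifiers and the placeholder hash values. Real load files often use other delimiters and field names.

```python
import csv
import io

def read_mih_index(load_file_text, id_field="DocID", mih_field="MIH"):
    """Map each MIH value to the document IDs that carry it."""
    index = {}
    for row in csv.DictReader(io.StringIO(load_file_text)):
        # Hex digests are compared case-insensitively.
        mih = row[mih_field].strip().lower()
        if mih:
            index.setdefault(mih, []).append(row[id_field])
    return index

def split_new_and_seen(earlier_index, later_index):
    """Separate a later production into new documents and repeats."""
    new_ids, seen = [], {}
    for mih, doc_ids in later_index.items():
        if mih in earlier_index:
            for doc_id in doc_ids:
                seen[doc_id] = earlier_index[mih]  # earlier copies of this message
        else:
            new_ids.extend(doc_ids)
    return new_ids, seen

# Made-up load files; the MIH values are placeholders, not real hashes.
FIRST_PRODUCTION = """DocID,MIH
ABC000001,0123456789abcdef0123456789abcdef
ABC000002,fedcba9876543210fedcba9876543210
"""
SUPPLEMENTAL_PRODUCTION = """DocID,MIH
XYZ000101,0123456789ABCDEF0123456789ABCDEF
XYZ000102,00112233445566778899aabbccddeeff
"""

new_ids, seen = split_new_and_seen(
    read_mih_index(FIRST_PRODUCTION),
    read_mih_index(SUPPLEMENTAL_PRODUCTION),
)
print(new_ids)  # documents not matched to anything in the first production
print(seen)     # later DocID -> earlier DocIDs sharing the same MIH
```

Only hash values and document identifiers are read, so the comparison needs no message content. A match is a prompt for the reviewer rather than proof of identical content, which is why the Specification’s considerations for use matter.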

IS THIS REALLY NEW?

Yes, and unprecedented. As noted, e-discovery service providers and law firm or corporate e-discovery teams have long employed cryptographic hashing internally to identify duplicate messages, but each does so differently depending on the process and software platform employed (sometimes in ways they regard as proprietary), making it infeasible to compare hash values across providers and platforms. Even if competitors could agree to employ a common method, subtle differences in the way each processes and normalizes messages would defeat cross-platform comparison.

The EDRM Email Duplicate Identification Specification doesn’t require software platform and service providers to depart from the proprietary ways they deduplicate email. Instead, the Specification contemplates that e-discovery software providers add the ability to produce the EDRM MIH to their platform and that service providers supply a simple-to-determine Message Identification Hash (MIH) value that sidesteps the challenges just described by taking advantage of an underutilized feature of email communication standards called the “Message ID” and pairing it with the power of hash deduplication. If it sounds simple, it is–and by design. It’s far less complex than traditional approaches but sacrifices little or no effectiveness or utility. Crucially, it doesn’t require any difficult or expensive departure from the way parties engage in discovery and production of email messages.
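The idea of pairing the Message ID with a hash can be sketched in a few lines of Python. This is a hedged illustration, not the Specification’s reference implementation: the function name is hypothetical, and the choice of MD5 and the normalization shown (trimming surrounding whitespace and angle brackets) are assumptions. Consult the Specification for the authoritative algorithm and normalization rules.

```python
import hashlib
from email import message_from_string
from typing import Optional

def message_identification_hash(raw_message: str) -> Optional[str]:
    """Hash the Message-ID header of a raw email (assumed MD5, assumed normalization)."""
    msg = message_from_string(raw_message)
    message_id = msg["Message-ID"]  # header lookup is case-insensitive
    if message_id is None:
        return None  # no Message-ID header, so nothing to hash
    # Assumed normalization: drop surrounding whitespace and angle brackets.
    normalized = message_id.strip().strip("<>")
    return hashlib.md5(normalized.encode("utf-8")).hexdigest()

# Two made-up copies of one message, differing in header order, header
# capitalization, line endings and an appended disclaimer.
COPY_ONE = (
    "Message-ID: <abc123@mail.example.com>\n"
    "Subject: Q3 forecast\n"
    "\n"
    "See attached.\n"
)
COPY_TWO = (
    "Subject: Q3 forecast\n"
    "Message-Id:   <abc123@mail.example.com>  \n"
    "\n"
    "See attached.\r\n\r\nThis email may be confidential.\n"
)

mih_one = message_identification_hash(COPY_ONE)
mih_two = message_identification_hash(COPY_TWO)
print(mih_one == mih_two)
```

Because only the Message ID feeds the hash, differences in body text, attachments or formatting do not change the value. That is the source of both the method’s cross-platform reach and the caveats about messages that share a Message ID yet differ in content.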

WHAT SHOULD YOU DO TO BENEFIT?

All you need to do to begin reaping the benefits of cross-platform message duplicate identification is amend your Requests for Production to include the EDRM Message Identification Hash (MIH) among the metadata values routinely produced in load files. Because the specification is prominently published by the leading standards organization in e-discovery, it’s likely the producing party’s service provider or litigation support staff know what’s required. But if not, you can refer them to the EDRM Email Duplicate Identification Specification & Guidelines published at https://edrm.net/active-projects/dupeid/.

HOW DO YOU LEARN MORE?

The EDRM publishes a comprehensive set of resources describing and supporting the Specification & Guidelines that can be found at https://edrm.net/active-projects/dupeid/. All persons and firms deploying the EDRM MIH to identify duplicate messages should familiarize themselves with the considerations for its use.

EDRM WANTS YOUR FEEDBACK

The EDRM welcomes any feedback you may have on this new method of identifying cross-platform email duplicates or on any of the resources provided. We are interested in further ideas you may have and expect the use of the EDRM MIH to evolve over time. You can post any feedback or questions at https://edrm.net/active-projects/dupeid/.

7 thoughts on “Introducing the EDRM E-Mail Duplicate Identification Specification and Message Identification Hash (MIH)”

Pierre said:

February 16, 2023 at 12:18 PM

This is very, very cool! Hopefully it gets the industry traction required to get widely adopted.

davidkeithtobin said:

February 16, 2023 at 12:24 PM

I guess I’m missing something. If I get native emails from various sources, my e-discovery software will filter out the dupes. If the emails are PDF/TIFF, a near-dupe feature is available. I suppose it would be helpful, when getting emails produced as TIFF/load file from various sources, to have that extra field?

- craigball said:
  
  February 16, 2023 at 1:05 PM
  
  If your production is made up of all the messages in their “native” form, yes, you could process all such messages on the same platform and employ that platform’s dedupe mechanism for email. But in that role, you’re surely the producing party.
  
  As a requesting party, how often is email produced natively, and is there consensus about what “native form” means when you speak of email? EDB? PST? OST? MSG? EML? It sounds straightforward until you try to do it, even when you are able to process it all on a single platform. But what do you do when the data has already been processed on different platforms and includes a mix of production formats processed by different tools and vendors?
  The normalization and concatenation won’t be the same. Items will hash differently. You can near-dupe as you posit, but it will require some costly processing, and the level of deduplication will be much less efficient or will prove perilously overexclusive of items that should not be deemed dupes.
  
  Don’t take my word. Try taking multiple productions of overlapping e-mail collections processed on different platforms and produced in different formats (say, some TIFF+, some PDF+ and some MHT or MSG). You’ll see why the level of duplicate identification will disappoint and prompt a lot of needless cost in terms of review.
  
Matthew said:

February 16, 2023 at 7:13 PM

It has been a pleasure to work alongside you and the all-star team on this initiative, Craig. I hope it is rapidly adopted as a global standard.

Cory Noonan said:

February 20, 2023 at 2:30 PM

I’ve been using this method for years to help address deduplication across platforms, even down to the definition of a valid internet message ID. Glad to see it is now a standard.

However, care should be taken not to give this message ID hash the same respect as a hash from a processing engine. There are instances where you will get the same message ID and different content. Probably the most frequent cause of this is archived or “stubbed” messages, where some email systems truncate old email body text or remove attachments. These stubbed emails will get the same message ID as their counterparts. There are also some edge cases where one email will have a corrupt attachment and the other will have a valid one. This method also does not take into account the BCC situation, where the sender’s copy will have the BCC and the recipients’ copies will not. These will have the same message ID, and you could unintentionally call the one with the BCC the duplicate and drop it from review, missing critical data. I’ve also seen cases where emails with the same message ID will have different footer disclaimer text.

There’s a reason processing engines don’t simply use a message ID hash. All that said, it is darn near the best we can do short of getting processing engines to agree on a standard way to hash emails.

- craigball said:
  
  February 20, 2023 at 4:36 PM
  
  Thank you for your comment. If you published your method at a date prior to September 2016, kindly cite me to that publication as I’d like to read it. I proposed the method in a post published in 2016 (https://craigball.net/2016/09/04/cross-matter-vendor-message-id/). If you will look at the EDRM publication cited, you will please note that all of the contraindicated instances you mention have been addressed under “considerations.”
  