A Bit About Deduplication

The 4th of July is one of my very favorite holidays, second only to Thanksgiving. We try to do patriotic things like construct kitschy neighborhood parade floats or, as we did at breakfast, stand and sing a rousing rendition of the national anthem, hoping that I can still hit the high notes (I did). Last night, to get in the mood, I watched the 2008 BBC 6-part series Stephen Fry in America, which follows the wry English entertainer as he races about all fifty U.S. states in his trademark London cab. In Boston, Fry discussed contradictions in the American character with the late Peter Gomes, a pastor and Harvard professor of divinity whom Fry described as “a black, gay, Republican Baptist.” Gomes observed that, “One of the many things one can say about this country is that we dislike complexity, so we will make simple solutions to everything that we possibly can, even when the complex answer is obviously the correct answer or the more intriguing answer. We want a simple ‘yes’ or ‘no,’ or a flat out ‘this’ or an absolutely certain ‘that.’”

Gomes wasn’t talking about electronic discovery, but he could have been.

For a profession that revels in convoluted codes and mind-numbing minutiae, lawyers and judges are queerly alarmed at the complexity and numerosity of ESI. They speak of ESI only in terms that underscore its burden, never extolling its benefits. They demand simple solutions without looking beyond the (often misleading) big numbers to recognize that the volume they vilify is mostly just the same stuff, replicated over and over again. It’s a sad truth that much of the time and money expended on e-discovery in the U.S. is wasted on lawyers reviewing duplicates of information that could have been easily, safely and cheaply culled from the collection. Sadder still, the persons best situated to eradicate this waste are the ones most enriched by it. Once, I might have said “innocently enriched by it,” but no more.

The oft-overlooked end of discovery is proving a claim or defense in court. So, the great advantage of ESI is its richness and revealing character. It’s better evidence in the sense of its more-candid content and the multitude of ways it sheds light on attitudes and actions. Another advantage of ESI is the ease with which it can be disseminated, collected, searched and deduplicated. This post is about deduplication, and why it might be attorney malpractice not to understand it well and use it routinely.

A decade or three ago, the only way to know if a document was a copy of something you’d already seen was to look at it again…and again…and again. It was slow and sloppy; but, it kept legions of lawyers employed and minted fortunes in fees for large law firms.

With the advent of electronic document generation and digital communications, users eschewed letters and memos in favor of e-mail messages and attachments. Buoyed by fast, free e-mail, paper missives morphed into dozens of abbreviated exchanges. Sending a message to three or thirty recipients was quick and cheap. No photocopies, envelopes or postage were required, and the ability to communicate without the assistance of typists, secretaries or postal carriers extended the work day.

But we didn’t start doing much more unique work. That is, human productivity didn’t burgeon, and sunsets and sunrises remained about 12 hours apart. In the main, we merely projected smaller slices of our work into more collections. And, I suspect any productivity gained from the longer workday was quickly surrendered to the siren song of eBay or Facebook.

Yes, there is more stuff. Deduplication alone is not a magic bullet. But there is not as much more stuff as the e-discovery doomsayers suggest. Purged of replication and managed sensibly with capable tools, ESI volume is still quite wieldy.

And that’s why I say a lot of the fear and anger aimed at information inflation is misplaced. If you have the tools and the skills to collect the relevant conversation, avail yourself of the inherent advantages of ESI and eradicate the repetition, e-discovery is just…discovery.

Some organizations imagine they’ve dodged the replication bullet through the use of single-instance archival storage solutions. But were they to test the true level of replication in their archives, they’d be appalled at how few items actually exist as single instances. In their messaging systems alone, I’d suggest that upwards of a third of the message volume consists of duplicates despite single-instance features. In some collections, forty percent wouldn’t surprise me.

But in e-discovery—and especially in that platinum-plated phase called “attorney review”—just how much replication is too much, considering that replication risk manifests not only as wasted time and money but also as inconsistent assessments? Effective deduplication isn’t something competent counsel may regard as being optional. I’ll go further: Failing to deduplicate substantial collections of ESI before attorney review is tantamount to cheating the client.

Just because so many firms have gotten away with it for so long doesn’t make it right.

I’ve thought more about this of late as a consequence of a case where the producing party sought to switch review tools and couldn’t figure out how to exclude the items they’d already produced from the ESI they were loading to the new tool. This was a textbook case for deduping, because no one benefits by paying lawyers to review items already reviewed and produced; no one, that is, but the producing party’s counsel, who was unabashedly gung-ho to skip deduplication and jump right to review.

I pushed hard for deduplication before review. This isn’t altruism; responding parties aren’t keen to receive a production bloated by stuff they’d already seen. Replication wastes the recipient’s time and money, too.

The source data were Outlook .PSTs from various custodians, each under 2GB in size. The form of production was single messages as .MSGs. Reportedly, the new review platform (actually a rather old concept search tool) was incapable of accepting an overlay load file that could simply tag the items already produced, so the messages already produced would have to be culled from the .PSTs before they were loaded. Screwy, to be sure; but, we take our cases as they come, right?

A somewhat obscure quirk of the .MSG message format is that when the same Outlook message is exported as an .MSG at different times, each exported message generates a different hash value because of embedded time-of-creation values. [A hash value is a digital “fingerprint,” unique for all practical purposes, that can be calculated for any digital object to facilitate authentication, identification and deduplication.] The differing hash values make it impossible to use hashes of .MSGs for deduplication without processing (i.e., normalizing) the data to a format better suited to the task.
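To make the effect concrete, here is a minimal Python sketch. It does not read real .MSG files (which are binary compound files); the `ExportTime` line is a made-up stand-in for the embedded time of creation, and the message text is invented for illustration. The only point is that one differing timestamp is enough to change the hash of the whole file.

```python
import hashlib


def file_hash(data: bytes) -> str:
    """Return the MD5 digest of a block of bytes as a hex string."""
    return hashlib.md5(data).hexdigest()


# Hypothetical message content, shared by both exports.
body = b"Subject: Q3 forecast\r\n\r\nNumbers attached.\r\n"

# Hypothetical stand-ins for the same message exported on two different days.
# The "ExportTime" line is an assumption that mimics an embedded creation time.
export_monday = b"ExportTime: 2012-07-02T09:15:00\r\n" + body
export_friday = b"ExportTime: 2012-07-06T16:40:00\r\n" + body

# Identical bytes always hash to the same value...
same_content = file_hash(body) == file_hash(body)

# ...but the two exported files do not, although the message is "the same".
same_file = file_hash(export_monday) == file_hash(export_friday)

print("content matches:", same_content)
print("exported files match:", same_file)
```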

Here, a quick primer on deduplication might be useful.

Mechanized deduplication of ESI can be grounded on three basic approaches:

1. Hashing the ESI as a file (i.e., a defined block of data) containing the ESI using the same hash algorithm (e.g., MD5 or SHA1) and comparing the resulting hash value for each file. If they match, the files hold the same data. This tends not to work for e-mail messages exported as files because, when an e-mail message is stored as a file, messages that we regard as identical in common parlance (such as identical message bodies sent to multiple recipients) are not identical in terms of their byte content. The differences tend to reflect either variations in transmission seen in the message header data (the messages having traversed different paths to reach different recipients) or variations in time (the same message containing embedded time data when exported to single message storage formats as discussed above with respect to the .MSG format).
2. Hashing segments of the message using the same hash algorithm and comparing the hash values for each corresponding segment to determine relative identicality. With this approach, a hash value is calculated for the various parts of a message (e.g., Subject, To, From, CC, Message Body, and Attachments) and these values are compared to the hash values calculated against corresponding parts of other messages to determine if they match. This method requires exclusion of those parts of a message that are certain to differ (such as portions of message headers containing server paths and unique message IDs) and normalization of segments, so that contents of those segments are presented to the hash algorithm in a consistent way.
3. Textual comparison of segments of the message to determine if certain segments of the message match to such an extent that the messages may be deemed sufficiently “identical” to allow them to be treated as the same for purposes of review and exclusion. This is much the same approach as (2) above, but without the use of hashing as a means to compare the segments.

Arguably, a fourth approach entails a mix of these methods.
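The segment-hashing approach might be sketched as follows in Python, using only the standard library. This is an illustration under stated assumptions, not how any particular e-discovery tool works: the choice of fields, the normalization rules (collapsing whitespace, lower-casing addresses, dropping display names and sorting) and the use of SHA-1 are all my own choices, and the sample messages are invented. Headers such as Received and Message-ID are simply never hashed.

```python
import hashlib
import re
from email import message_from_string
from email.utils import getaddresses


def _digest(text: str) -> str:
    """SHA-1 of a text segment, as hex."""
    return hashlib.sha1(text.encode("utf-8")).hexdigest()


def _norm_text(value: str) -> str:
    """Collapse whitespace runs so line wrapping does not affect the hash."""
    return re.sub(r"\s+", " ", value or "").strip()


def _norm_addresses(msg, header: str) -> str:
    """Bare addresses only (no display names), lower-cased and sorted."""
    pairs = getaddresses(msg.get_all(header, []))
    return ";".join(sorted({addr.lower() for _name, addr in pairs if addr}))


def segment_hashes(raw: str) -> dict:
    """Hash each segment of a message separately (assumed field set)."""
    msg = message_from_string(raw)
    # Non-multipart parts: the first is treated as the body, the rest as attachments.
    leaves = [part for part in msg.walk() if not part.is_multipart()]
    body = leaves[0].get_payload(decode=True) or b""
    attachments = sorted(
        hashlib.sha1(part.get_payload(decode=True) or b"").hexdigest()
        for part in leaves[1:]
    )
    return {
        "subject": _digest(_norm_text(msg.get("Subject", ""))),
        "from": _digest(_norm_addresses(msg, "From")),
        "to": _digest(_norm_addresses(msg, "To")),
        "cc": _digest(_norm_addresses(msg, "Cc")),
        "body": _digest(_norm_text(body.decode("utf-8", errors="replace"))),
        "attachments": _digest("|".join(attachments)),
    }


# Two invented copies of one message, as received by different custodians.
copy_a = (
    "Received: from mx1.example.com\n"
    "Message-ID: <111@example.com>\n"
    "From: Ann Lee <ann@example.com>\n"
    "To: Bob <bob@example.com>, carol@example.com\n"
    "Subject: Q3 forecast\n"
    "\n"
    "Numbers attached.\n"
)
copy_b = (
    "Received: from mx7.example.net\n"
    "Message-ID: <999@example.net>\n"
    "From: ann@example.com\n"
    "To: CAROL@example.com, Robert <bob@example.com>\n"
    "Subject: Q3 forecast\n"
    "\n"
    "Numbers attached.\n"
)

print("segments match:", segment_hashes(copy_a) == segment_hashes(copy_b))
```

Comparing the per-segment values, rather than one combined value, also shows which segment differs when two messages fail to match.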

All of these approaches can be frustrated by working from differing forms of the “same” data because, from the standpoint of the tools which compare the information, the forms are significantly different. Thus, if a message has been ‘printed’ to a TIFF image, the bytes which make up the TIFF image bear no digital resemblance to the bytes which comprise the corresponding e-mail message, any more than a photo of a rose smells or feels like the rose.

In short, changing forms of ESI changes data, and changing data changes hash values. Deduplication by hashing requires that the same source data and the same algorithms be employed in a consistent way. This is easy and inexpensive to accomplish, but requires that a compatible workflow be observed to ensure that evidence is not altered in processing so as to prevent the application of simple and inexpensive mechanized deduplication.

When parties cannot deduplicate e-mail, the reasons will likely be one or more of the following:

They are working from different forms of the ESI;
They are failing to consistently exclude inherently non-identical data (like message headers and IDs) from the hash calculation;
They are not properly normalizing the message data (such as by ordering all addresses alphabetically without aliases);
They are using different hash algorithms;
They are not preserving the hash values throughout the process; or
They are changing the data.
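Two of these failures, mismatched algorithms and hash values that get lost along the way, can be guarded against by keeping the algorithm name with every stored value. The Python sketch below is one hedged way to do that; `HashRecord`, `make_record` and `same_item` are hypothetical names of my own, not part of any review platform.

```python
import hashlib
from typing import NamedTuple


class HashRecord(NamedTuple):
    """A hash value stored together with the algorithm that produced it."""
    algorithm: str  # e.g. "md5" or "sha1"
    digest: str


def make_record(data: bytes, algorithm: str) -> HashRecord:
    """Hash the data with the named algorithm and keep both facts together."""
    return HashRecord(algorithm, hashlib.new(algorithm, data).hexdigest())


def same_item(a: HashRecord, b: HashRecord) -> bool:
    """Compare two records, refusing to compare across algorithms."""
    if a.algorithm != b.algorithm:
        raise ValueError(
            f"cannot compare {a.algorithm} with {b.algorithm} hash values"
        )
    return a.digest == b.digest


# The same bytes hashed two ways yield values that can never match each other.
data = b"identical message content"
ours = make_record(data, "md5")
theirs = make_record(data, "sha1")
print(ours.digest == theirs.digest)
```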

Once I was permitted to talk to the sensible technical personnel on the other side, it was clear there were several ways to skin this cat and exclude the items already produced from further review. It would require use of a tool that could more intelligently hash the messages, and not as a monolithic data block; but, there are several such tools extant. Because the PSTs were small (each under 2GB), the tool I suggested would cost the other side only $100.00 (or about ten Big Law billing minutes). I wonder how many duplicates must be excluded from review to recoup that princely sum?
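In outline, excluding the items already produced is a set-membership test on a content-based key. The Python sketch below is a simplified illustration, not the tool used in the case: messages are represented as plain dictionaries with hypothetical field names, the key is an assumed combination of normalized sender, recipients, subject and body, and the sample data are invented.

```python
import hashlib


def message_key(msg: dict) -> str:
    """Hypothetical dedupe key built from normalized fields only.

    Export-specific data (such as an embedded export time) is ignored.
    """
    parts = [
        msg["sender"].strip().lower(),
        ";".join(sorted(a.strip().lower() for a in msg["recipients"])),
        " ".join(msg["subject"].split()),
        " ".join(msg["body"].split()),
    ]
    return hashlib.sha1("\x1f".join(parts).encode("utf-8")).hexdigest()


def cull_produced(collection: list, produced: list) -> list:
    """Return only the messages whose key is absent from the produced set."""
    produced_keys = {message_key(m) for m in produced}
    return [m for m in collection if message_key(m) not in produced_keys]


# Invented sample data: one item in the new load was already produced.
already_produced = [
    {"sender": "ann@example.com", "recipients": ["bob@example.com"],
     "subject": "Q3 forecast", "body": "Numbers attached.",
     "export_time": "2012-05-01T10:00:00"},
]
new_load = [
    {"sender": "Ann@Example.com", "recipients": ["bob@example.com"],
     "subject": "Q3 forecast", "body": "Numbers  attached.\n",
     "export_time": "2012-07-02T09:15:00"},
    {"sender": "bob@example.com", "recipients": ["ann@example.com"],
     "subject": "RE: Q3 forecast", "body": "Thanks.",
     "export_time": "2012-07-02T09:15:00"},
]

to_review = cull_produced(new_load, already_produced)
print(len(to_review), "of", len(new_load), "items still need review")
```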

Deduplication pays big dividends even in imperfect implementations. Any duplicate that can be culled is time and money saved at multiple points in the discovery process, and deduplication delivers especially big returns when accomplished before review. Deduplication is not a substitute for processes like predictive coding or enhanced search that also foster significant savings and efficiencies; but, few other processes allow users to reap rewards as easily, quickly or cheaply as effective deduplication.

10 thoughts on “A Bit About Deduplication”

Joshua N. Rubin said:

July 4, 2012 at 11:56 PM

I’d propose the following ideal email deduplication and production protocol based on the three approaches you set out above:

1. Counsel agree to agree on a bilateral deduplication method before any preproduction deduplication occurs.

2. Counsel agree not to hash whole email files for the reasons you cite.

3. Counsel agree to use hashed field comparisons. Alternatively, counsel agree to use textual comparisons if their method is cheaper than hashing fields or eliminates trivially different versions, and if any risk of missing discoverable documents is justified by proportionality.

4. If the case warrants the expense, counsel agree that the respondent will record the actual owners of the mailboxes from which each email is being produced, and will produce the list to the proponent in electronic form. This allows the respondent to deduplicate without regard to server paths or message IDs while preserving evidence of actual receipt.

- John Martin said:
  
  July 9, 2012 at 7:55 AM
  
  Remember that only the author’s email will contain the BCC field, so take that into account if you go with “field-based hashes”. There are also distribution lists to consider.
  

Ball in your Court

~ Musings on e-discovery & forensics.