At last week’s ILTACON in Washington, D.C., Beth Patterson, Chief Legal & Technology Services Officer for Allens in Sydney, asked a panel why e-discovery service providers couldn’t standardize hash values so as to support identification and deduplication across products and collections. If they did, you could use work from one matter in another. If an e-mail is privileged in one case, there’s a good chance it’s privileged in another; so, wouldn’t it be splendid to be able to flag its counterparts to ensure it doesn’t slip through without review?
Beth asked a great question, and one regrettably characterized by the panel as “a big technical challenge.”
One panelist got off on the right foot: He said, “I’ve created artificial hashes in the past where what I had to do was aggregate and normalize metadata across different data sets to create a custom fingerprint to do that.” But, he added, “that’s probably not defensible, and it’s also really cumbersome.”
Pressed by Beth, the panel pushed back. “It’s because artificial hashes are kind of complicated,” one panelist offered, and not “a trivial technical problem.” The panel questioned whether MD5 hashes were the appropriate standard or whether SHA-1 would be required, positing that cross-matter deduplication is “something that requires significant buy-in across a broad spectrum of people.” Beth’s request was ultimately dismissed as “not an easy challenge” and one that would be confounded by “people, process and technology” and “the MD5 hash stuff.”
ILTACON is the rare venue where reasonably well-adjusted and -socialized people engage in lively discussions of such things. It’s not just that ILTA folks understand the technology issues (“GEEKS!”), we’re passionate about them (“NERDS!”) and debate them respectfully as peers (“WUSSIES!”).
Beth’s idea deserved more credit than it got. It really is a trivial technical problem, and one that could be resolved without much programming or politics.
Then, why don’t we have a proven means to uniquely identify messages across vendors? I suspect it’s due to a lack of leadership and validation. Insofar as I’m aware, no one has published a standard methodology for cross-vendor identification or established that it works. Certainly, no one has managed to get something accepted as a de facto industry standard, in the nature of, say, the Concordance load file format or EDRM XML. Instead, we invent reasons why it’s just too darn hard.
To be clear, any e-discovery tool worth its salt employs a method to hash and deduplicate messages; unfortunately, they don’t employ the same method. Each tool approaches the task in a slightly different way and, when it comes to comparisons based on hash values, even the most minute variation in the data hashed generates a markedly different hash value. This article looks at how to get everybody on the same page when it comes to generating consistent, hash-based message identifiers across vendors and matters.
Let’s start by recounting a few facts about hashing, then examining how these facts relate to e-mail message and loose file identification and deduplication in e-discovery.
HASH FACT 1: Same data, same algorithm, same hash value.
HASH FACT 2: Different data, different hash value.
HASH FACT 3: Hash values that look “close” to one another do not signify similar messages; the resemblance of two digests tells you nothing about the resemblance of the data hashed.
HASH FACT 4: Hashing is a one-way process; i.e., the message digest cannot be reverse engineered to learn the content of the message.
If you’re reading this, I trust you already know what “hashing” is and that the most common hash algorithm (i.e., “mathematical formula”) employed in e-discovery is called the MD5. But, you may not know that the “MD” in MD5 stands for “Message Digest,” a synonym for “hash” (as a noun, not a verb).
In the context of hashing, the term “message” refers to any data that’s hashed, not necessarily an e-mail message. But in this post, we are going to zero in on the tricky business of uniquely identifying e-mail messages, principally because generating consistent MD5 hash values for loose documents across vendors isn’t a problem…or certainly shouldn’t be.
Why is one easy and the other tricky? It’s because a properly preserved and -processed logical file will hold the same data whether it comes from one custodian or another and whether it comes from an e-mail, a network file share or a thumb drive. Too, if the loose file hasn’t been altered or corrupted, it will generate the same hash value each and every time it’s retrieved from a storage medium, extracted from a container (e.g., a compressed container employing lossless compression like a Zip file) or decoded from base64-encoded content in a transmitting e-mail. Because the data comprising the file is the same from its first logical byte to its last, it will hash identically using the same hash algorithm, each and every time, anywhere, from any source. See Hash Fact 1.
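The stability of loose-file hashing is easy to demonstrate. A minimal Python sketch (the chunked read is just a memory-friendly convention; any faithful read of the same bytes yields the same digest):

```python
import hashlib

def md5_of_file(path: str) -> str:
    """Return the MD5 digest of a file's contents, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Run this against the same file retrieved from a file share, carved from a Zip or decoded from an e-mail attachment, and the digest is identical every time, because the bytes are identical every time.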
In contrast, an e-mail message exported from a PST container to a single message (.MSG) format will be different each time it’s exported because the conversion process incorporates and embeds different internal timestamp information in the message each time it’s saved. More to the point, the “same” e-mail message retrieved from different accounts will hold different data because the message traversed a different transmission path to its addressee and arrived at a different time in a different account.
There are other reasons the “same” e-mail holds different data. One reason is that e-mail client programs tend not to store messages as discrete blocks of data faithful to the structure of the message received but will divide constituent parts of the message into records and fields in a database. Another is that e-mail client programs and servers may “alias” sender and recipient addresses, substituting a name from a contacts list for its associated e-mail address or making other small changes intended to aid the user but wreaking havoc in terms of hash-value consistency across systems.
The upshot is that, where you can hash a loose file repeatedly and repeatedly generate a consistent hash value using the same hash algorithm, you often cannot obtain the same hash value for the “same” message extracted at different times or from different sources because it’s not truly the same data first byte to last. See Hash Fact 2.
I’ve sometimes put the word “same” in quotation marks to underscore that what humans deem to be identical messages is far more forgiving than what computers employing hash algorithms deem identical. Humans typically regard two different addressees as receiving the “same” e-mail, though the messages derive from different accounts. Humans focus on the rendered content, not the binary content. The notion of “sameness” for humans is a judgment call. Computers employing hash algorithms perform calculations on the precise sequence of data ingested; “sameness” is objective and rigid. Again, different data, different hash values, and as previously noted, the different data doesn’t prompt correspondingly different hash values. See Hash Fact 3.
So, when I say that any similarity between the digest values of two messages is unrelated to any similarity between the messages, I mean there is even less correlation than we would expect to see between two people of similar appearance having similar fingerprints. In practical terms, there is no correlation at all between appearance and fingerprints, and none between different messages and their message digests (often called “digital fingerprints”). The digest values are simply different, and the difference is one-way; that is, it tells you nothing about what was different in the messages or how significant the difference might be. See Hash Fact 4.
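These facts are easy to see in a few lines of Python: a one-character difference in the input produces an utterly unrelated digest, and nothing in either digest hints at how small the change was. (The sample text is, of course, made up for illustration.)

```python
import hashlib

# Two "messages" differing by a single trailing character:
a = hashlib.md5(b"Please review the attached agreement.").hexdigest()
b = hashlib.md5(b"Please review the attached agreement!").hexdigest()

print(a)
print(b)
# The two digests bear no meaningful resemblance (Hash Facts 2 and 3),
# and neither can be reversed to recover the text (Hash Fact 4).
```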
Reliable Identification Without Identicality
So, computers are maddeningly exacting in their comparison of e-mail messages for identicality, and e-mail systems embed quirky variations into e-mail messages that humans pragmatically judge to be the “same.” It’s a conundrum; but, all is not lost! E-discovery tools work around the variations seen within the “same” e-mail by hashing only those parts of the message that will be identical if the messages are the “same” from the perspective of human judgment. Because the various software and service providers go about the task in different ways, some regard their approach as a “secret sauce.” It’s really not. It’s a process that should be documented and open.
Formulating a cross-vendor and -matter methodology for message identification requires we resolve threshold questions of selection, normalization and concatenation:
- Selection: Which parts of the message are suited to hash comparison and by what algorithm?
- Normalization: How will the data be tweaked for consistency of presentation?
- Concatenation: In what order will these parts be presented to the hash algorithm?
At first blush, it might seem that achieving the most precise and defensible identification methodology requires we hash as much of the message as we can while avoiding inherently different features. After all, we want our method to be “forensically-sound,” right?
Perhaps not. In fact, it’s practical to treat two messages as being the “same” by comparing just a handful of characteristics. Not perfect, mind you, but practical.
The fewer pieces of a message that must be normalized and concatenated to generate a hash value, the quicker, easier and less error-prone the process. The tradeoff is an increased risk of hash collision or omission; that is, the risk that two different messages will generate the same hash value or that the process will fail to flag multiple instances of the same message (however we define “same”).
Quantifying the Risk of Hash Collision
How much risk is too much, and at what point does the risk differential become so small as to be meaningless?
The ILTACON panel raised the specter that the MD5 algorithm posed too great a risk of hash collision such that use of the SHA-1 hash algorithm was required for the process to be “defensible.”
Let’s put this objection in context so you can see why the concern is unwarranted:
MD5 hashes are 128-bit values (2^128 possible digests), putting the chances of an MD5 hash collision at 1 in 340 undecillion or, more precisely, 1 in 340,282,366,920,938,463,463,374,607,431,768,211,456. SHA-1 hashes are 160-bit values (2^160), so a hash collision with SHA-1 is 2^32 times less likely than with MD5. Coincidentally, 2^32 is equal to the number of angels that can dance on the head of a pin.
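For readers who want to check the arithmetic, Python integers are arbitrary precision, so the number spaces can be computed exactly:

```python
md5_space = 2 ** 128    # count of possible MD5 digests
sha1_space = 2 ** 160   # count of possible SHA-1 digests

print(md5_space)                 # 340282366920938463463374607431768211456
print(sha1_space // md5_space)   # 4294967296, i.e., 2**32 times larger
```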
I challenge any reader to put 340 undecillion into human scale. Believe me, I’ve tried, and find myself analogizing it to, say, the number of atoms in the Milky Way galaxy. It’s unfathomably large. One IT guy put it this way:
“If you had a job that paid you 390 trillion dollars per hour (US) you would have to work 24 hours per day, 7 days per week, 365 days per year for a just a little less than 100 quadrillion years to earn 340 undecillion dollars.”
Specifying US dollars was a helpful touch, right?
In this application, challenging defensibility based on MD5 versus SHA-1 is silly. We aren’t battling Lex Luthor trying to insert a forged record into the Krypton Library of Universal Knowledge. We just want to eke out cost savings from re-tasking the fruits of prior reviews. Considering the human scale of potentially different messages in an e-mail collection (even one as vast as all the e-mail in the world), using a cosmic-scale tool like the MD5 hash is number space overkill. So, suggesting that the number space needs to be 2^32 times larger is like ordering a sandwich at Subway in parsecs. “I’m dieting this week, so I’ll have the 4.93895e-18 parsec meatball with provolone, toasted on herb cheese please.“
Our risk of a hash collision won’t grow out of our choice of hash algorithm; instead, it’s entirely dependent upon the potential for the data in the compared messages to be the same despite the messages being different. That occurs when, e.g., a message is fabricated using the body of another message or when the constituents of the message selected for hashing are insufficiently unique, i.e., generic and likely to repeat. Message subject lines, dates and addressees are all highly likely to recur over a collection. These are essential data points for review, but they aren’t very useful when it comes to hash identification and deduplication. For these purposes, we want to home in on the most distinctive features of the individual message.
The flip side of hash collision is the risk of omission: that two messages deemed the same in human terms will be identified as different by the machine and assigned different hash values for purposes of cross-matter identification and deduplication. This typically occurs when the methods used to transmit, decode, parse, store and collect the message add, omit or alter features of the data. I’ve already discussed aliasing errors, but hashing is frustrated by something as simple as collecting one character too many or one too few from a message body or header. Whatever features of the message are hashed must be collected in a precise and consistent way for every message. This means that the methodology employed must favor characteristics of the message that tend to be preserved in precise and consistent ways.
This may entail keeping features of messages we usually throw away in e-discovery. Accordingly, an industry standard approach to message identification may work better going forward than applied to legacy data. That was the case with human fingerprinting and DNA evidence in their day. You have to start somewhere.
So, let’s look at our candidates for comparison.
The organization and constitution of e-mail messages is governed by a series of proposed, voluntary standards circulated as Requests for Comments or RFCs. No e-mail system is obliged to adhere to the proposed RFC standards…that is, if it doesn’t care whether its messages can successfully navigate other servers and networks. So, the notion that the RFCs are merely voluntary is counterpoised against the recognition that e-mail isn’t very useful if messages can’t reach their intended recipients. The RFCs are e-mail structural standards, and strong ones at that. Accordingly, they are our jumping off point to identify unique features of messages to include in our hash.
There are many parts to e-mail messages that e-mail users never see but that are crucial to the ability of e-mail systems to successfully transmit, order and present the messages. One of these is the Message-ID (msg-id). Here’s an example of a msg-id value:
Per RFC 5322 (arguably the most important RFC respecting e-mail), “The message identifier (msg-id) itself MUST be a globally unique identifier for a message. The generator of the message identifier MUST guarantee that the msg-id is unique.”
That’s all well and good, except the msg-id is not perfectly unique in real world e-mail collections. Message IDs may be malformed or spoofed, prompting a msg-id collision between messages that aren’t identical.
But, how perfect must the methods we employ be to be good enough to support cross-vendor and -matter identification? Compared to an unobtainable, flawless methodology, nothing suffices; but compared to current alternatives (i.e., nothing), it’s a great leap forward.
First Candidate for a cross-vendor and -matter identifier for e-mail messages
Service providers and processors would supply an MD5 hash of each message’s msg-id value, computed against the string of characters between the angle brackets. For the msg-id in the example above, the identifier would be: 249f818730a5872d8efc660d19d27832
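A sketch of that computation in Python (the msg-id shown at the end is hypothetical; a real standard would also need to pin down details I’ve assumed here, like the character encoding):

```python
import hashlib
import re

def msg_id_identifier(message_id_header: str) -> str:
    """MD5 of the characters between the msg-id's angle brackets."""
    match = re.search(r"<([^>]+)>", message_id_header)
    if match is None:
        raise ValueError("no msg-id found in header value")
    return hashlib.md5(match.group(1).encode("utf-8")).hexdigest()

# Hypothetical msg-id, for illustration only:
print(msg_id_identifier("Message-ID: <20170822.123456.alpha@mail.example.com>"))
```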
The advantages of using the hash of the msg-id value instead of the msg-id itself are that the hash value is shorter and always of a fixed length (i.e., a 32-character hexadecimal value). Too, using the hash serves to shield the fully qualified domain name customarily appearing at the end of the msg-id value.
The advantage to using the msg-id value alone is that it entails few normalization and no concatenation issues. It’s simple…but not perfect.
If the message contains no msg-id value (which might occur with some oddball in-house e-mail system), the method fails. If attachments have been stripped from the message, the stripped version of the message will match the unstripped version because they are the “same” originating message and share the same message ID.
Second Candidate for a cross-vendor and -matter identifier for e-mail messages
To eschew confusion stemming from stripped versions of messages, we can include other values in the data stream that will be hashed. Adding the message body or any of the message header values like To, From, CC or Subject won’t serve to fix the stripped attachment problem; for that, we must include the attachments in our hash or at least some component uniquely tied to the presence of the attachments. One or more attachment boundary values might suffice, and would be trivial to hash; but the safest bet is the attachments themselves. To promote efficiency, attachments would be included in the identification hash only when the flag Content-Type: multipart/mixed is present. Another way to speed and simplify the process is to hash the concatenated hash values of the attachments seen in the message. Attachment hash values are routinely calculated in processing e-mail. Using these in the identification hash obviates the need to incur the time and processing cycles needed to re-hash large objects.
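One way to sketch the second candidate in Python, assuming the per-attachment MD5 values are already in hand from processing (the function and parameter names are illustrative, not a proposed spec):

```python
import hashlib

def message_identifier(msg_id: str, attachment_md5s: list[str],
                       multipart_mixed: bool) -> str:
    """Hash the msg-id, plus the attachments' own hashes when present."""
    stream = msg_id
    if multipart_mixed and attachment_md5s:
        # Sort so attachment order in the MIME tree can't change the result.
        stream += "".join(sorted(h.lower() for h in attachment_md5s))
    return hashlib.md5(stream.encode("utf-8")).hexdigest()
```

Because a stripped copy and an unstripped copy now feed different data to the algorithm, they receive different identifiers, which is exactly the distinction the first candidate could not draw.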
Normalization and Concatenation
Once we start hashing parts in addition to the msg-id, we must set standards for normalization and concatenation. Normalization serves to eliminate variations in the data that are not indicia of any meaningful difference between the messages but that serve to generate different hash values. Examples include stray spaces, tabs, line feeds, differences in case and other formatting anomalies prompted by differences in e-mail applications and systems. Other data that should be normalized are date and time values. Routinely, date and time values will be re-formatted by an e-mail client to reflect, e.g., European date ordering, local time zones and 12- and 24-hour clocks, to name just a few variants that make hash matching impossible.
These variations must be normalized, viz., presented to the hash algorithm in a consistent manner. For example, time and date values might be converted to UTC expressed in Win32 time formats. All spaces, tabs and line feeds might be stripped and all characters converted to uppercase values. Programmatically, these are trivial tasks. Precisely how it’s done is less important than that it is done consistently each and every time the message is hashed for identification. It sounds complex to a lawyer, but it’s plain vanilla to those who work with digital data.
When the values that will be hashed are normalized, the resulting data comprise strings of information that must be joined end-to-end to be fed into the hash algorithm. This ‘stringing together’ is called concatenation, and it’s crucial that the order of the strings be consistent whenever their hash is computed because any change of order will result in a different hash value. So, a standard for a cross-matter and -vendor message identifier must establish the order of concatenation. Sometimes this will entail specifying the fields by name; other times it’s sufficient to specify an ordering methodology (e.g., alphabetic order). Whatever works here–any approach that’s not needlessly complicated or processor intensive.
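Pulling the pieces together, a toy normalize-then-concatenate routine might look like the following. The particular choices here (all whitespace stripped, characters folded to uppercase, fields concatenated in alphabetical order) are illustrative assumptions, not a proposal; the point is only that every implementation must make the same choices.

```python
import hashlib
from datetime import datetime, timezone

def normalize(value: str) -> str:
    """Strip all whitespace and fold to uppercase."""
    return "".join(value.split()).upper()

def normalize_date(dt: datetime) -> str:
    """Express a timestamp in UTC, in one fixed format."""
    return dt.astimezone(timezone.utc).strftime("%Y%m%d%H%M%S")

def identifier(fields: dict[str, str]) -> str:
    """Concatenate normalized fields in a fixed (alphabetical) order, then hash."""
    stream = "".join(normalize(fields[key]) for key in sorted(fields))
    return hashlib.md5(stream.encode("utf-8")).hexdigest()
```

Fed the “same” message by two different tools, both arrive at the same digest so long as both normalize the values and order the fields identically, which is the whole ballgame.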
If you’ve made it this far, thanks. None of what I’ve put forth is groundbreaking. Again, vendors do this all the time; unfortunately, the vendor community hasn’t settled upon a consistent, published process that supports reliable identification of messages outside their own systems. A standard would allow us to re-use work product and even to de-duplicate across messages produced in different forms. Imagine being able to reliably match and deduplicate messages whether in native, TIFF or PDF forms. An effective identifier makes that possible, and much more.
For this to work, parties must stop jettisoning data from messages like mad balloonists throwing out sandbags. That means that evidence must remain in forms that not only retain the points of comparison but also support simple, reliable extraction of same. Native and near-native forms are best, but even a properly-populated TIFF plus load file production could be matched using a sensible set of standards.
A working standard doesn’t require the imprimatur of EDRM, The Sedona Conference or NIST. The MD5 hash algorithm was just something put out there by one smart guy, Ron Rivest, and adopted by multitudes. All a message identification standard requires is that it work and be used. Again, leadership and validation. The standard could have your name on it. I hope this post helps someone figure it out.
If you want to learn more about hashing and deduplication, you might look at these posts:
Deduplication: Why Computers See Differences in Files that Look Alike
Thanks for posting on this, Craig. I think this is very important, and something that would be very useful in many scenarios such as:
matters where you have data from multiple sources (Mac Mail, Exchange and an email archive system) where the metadata has been altered and you can’t comprehensively deduplicate
Maybe it will be called the ‘Patterson’ or ‘Patterson-Ball’ deduplication method!
Ross Johnson said:
Thanks Craig, you eloquently describe many of the challenges and trade offs that us e-discovery vendors must deal with when detecting duplicates.
Like you say, things get more complicated when dealing with email. As a further example of the insanity, some versions of Outlook don’t even store received HTML bodies as is, but instead encapsulate them to RTF before storing. You then need a non-trivial de-encapsulating RTF reader to retrieve the original HTML for further comparison (an open-source version of our de-encapsulator is available at https://github.com/mazira/rtf-parser).
I’ve recently been working on methods to detect cross-format duplicates and similar files in GoldFynch (SaaS e-discovery platform, GoldFynch.com), and I’ve concluded that it’s infeasible to really distinguish the two with much confidence. That is to say, I can tell you that this TIFF and this Word document contain 95% the same text, or that they visually look 95% the same, but I can’t automatically tell if that difference is due to an actual difference in the text (due to a revision), or an artifact of OCR or image processing. In GoldFynch, we’ve decided to show exact-duplicates (hashing) and potential-duplicates (fuzzier methods) separately on our interface.
I believe that an ideal future standard for de-duplication and interoperability should address / describe the following:
1. Whole-file hash for high-confidence exact-duplicate detection
2. Email “key fields” hash for high-confidence exact-duplicate detection (likely a normal hash using many of your suggestions)
3. Some text-based and/or image-based content identifier, likely some sort of locality-sensitive hash, for near-duplicate detection
With these 3 levels of standard identification, there would be so many possibilities. Of course there is still the main issue with standards, which is getting them actually adopted. I suppose if 193 nations can get together in a building in New York, a few vendors should be able to ‘hash’ it out!
Thanks for weighing in, Ross. Ironically, and for the reasons you note, I think e-mail is the lesser challenge compared to loose documents in changed versions across formats. Header data is inherent to e-mails but absent, altered or stripped away when loose docs change formats. But cross-format comparison of loose documents is not the problem I’m seeking to solve here. You make the good point that subtle differences in document versions may be too subtle–and too critical to the issues in contention–to casually put it down to imaging artifacts. It’s a problem perhaps best addressed by relentlessly moving away from static imaged productions.
I don’t think we need an accord among vendors for an e-mail standard to take hold as much as I believe it could be accomplished by the publication of a functional, reasonably-reliable methodology given a name so as to allow it to be unequivocally specified as required by litigants seeking the ability to ID and de-dupe in the manner I described. If “we” build it, people will spec it, and vendors will have to supply it. That’s the hope, anyway.
Pingback: Cross-Matter & -Vendor Message ID - @ComplexD | @ComplexD
Mike McBride said:
Interesting stuff Craig, and like you, I don’t really see what the problem is here. Most eDiscovery processing software is already making some sort of determination on what to include in the MD5 calculation when dealing with a PST, for example. (The tool is calculating the MD5 as it extracts each message out of the container.) We then routinely use that MD5 to deduplicate within a matter, so why would making some sort of determination about what to include in the MD5 not be an appropriate way to deduplicate across matters? Typically, we see something along the lines of To, From, CC, Subject, Email body, attachments, and then some options about whether to include a date or a BCC field in the software itself. As long as the calculation is done consistently across the matters, technically speaking, it would work. If you include the BCC field in the hash in one case and not the other, there wouldn’t be any duplicates, for example, so that would present a problem. No, it wouldn’t be perfect from a forensic point of view, but would it be reasonable and repeatable?
I’m not a lawyer, I’ll let you make that call! 😉
Thank you for weighing in. I’m happy you see it as a straightforward problem to solve. It *should* be as simple as you say; but it is complicated for the fields you propose because the data in those fields may not be extracted, processed or normalized in a consistent way across tools. That’s why I’m trying to encourage a shift away from the characteristics that human reviewers look to in making a determination of identicality (those you propose, in the main) and instead focus on features of the message that machines employ as indicia of uniqueness and identicality. The results are comparable, but my experiments suggest my approach is less error prone.
Good point about the bcc field. In most instances, I would prefer not to de-dupe on bcc, instead treating the blind copy and a recipient copy as independent. I just like the belt and suspenders of same, if it serves to ensure deliberate review of the item. There’s just something about that bcc that elevates its potential to be significant (i.e., shedding light on motive and intention).
Mike McBride said:
Agreed, my point was not to say we should absolutely use those fields, it was more of an overarching fact that processing software is already deciding how to calculate an MD5 based on a selection of metadata, why would coming up with a selection of metadata to use across matters be so very complicated? It simply takes a combination of legal and technical experts to agree to it, and we move forward. I know, as a technologist that seems simple to me, but nothing in the legal industry is ever quite that simple. 😉
I couldn’t have said it better. Thanks.
Pingback: E-Discovery Lessons from the Huma Abedin E-Mails | Ball in your Court
Dennis Kiker said:
Boy, am I late in coming to this conversation. Another excellent post, Craig, and much appreciated. I’ve finally been disabused of the notion that hash values are unavoidably different. And I also appreciate the context for MD5 vs. SHA-1. A little context goes a long way. Parsecs, one might say.
Since I am reading this nearly a year after the original post, and knowing how widely read your material is, it seems safe to say that the real problem (as you note) is not technical, but a matter of leadership and consensus. Does this suggest that the need is not as great as I would have thought? Given how many companies are routinely involved with matters that would benefit from the ability to leverage work product from one case to the next, where is the demand for standardization? I learned here that my Gmail message gets to my friends’ Hotmail and Yahoo accounts due to a voluntary standard that everyone complies with because to do otherwise would render the email product useless in the real world. Perhaps we just don’t have the same underlying demand driver for an entity identification standard in eDiscovery?
Thanks. Not too late. Very glad you stopped by, Dennis, as I greatly value your views.
Yes, you’re right that the demand has not been there. I’m inclined to look to Steve Jobs’ infamous remark in Business Week twenty years ago: “A lot of times, people don’t know what they want until you show it to them.” Quality and efficiency have never driven demand in e-discovery, and I can’t offer you any rationale for that. I’m labeled “cynical” when I note that, in a per-hour and per-gigabyte compensation world, the savings that flow from efficiencies tend to come from the pockets of those most responsible for finding and implementing efficiencies. Wonder why they don’t?
You or I could write a “standard,” but standards written by one or two persons tend to gain traction only when promulgated under the aegis of a committee or entity. The other problem is that few want a standard that enables something they’ve avoided having to do by simply claiming it was too difficult. However easy it is to do something, not doing anything is easier.
Where is the trade organization for e-discovery service providers? Ostensibly, it was EDRM; but EDRM never realized its full promise much beyond its signature flowchart, not even in XML. I smell failure when it comes to EDRM TAR standards because some can’t abide a rising tide that floats competitors’ boats. As for me, all I can do is write “Here’s How to Do It” pieces and hope some try their hands at it and discover that it’s no big deal after all.