At last week’s ILTACON in Washington, D.C., Beth Patterson, Chief Legal & Technology Services Officer for Allens in Sydney, asked a panel why e-discovery service providers couldn’t standardize hash values so as to support identification and deduplication across products and collections. If they did, you could use work from one matter in another. If an e-mail is privileged in one case, there’s a good chance it’s privileged in another; so, wouldn’t it be splendid to be able to flag its counterparts to ensure it doesn’t slip through without review?
Beth asked a great question, and one regrettably characterized by the panel as “a big technical challenge.”
One panelist got off on the right foot: He said, “I’ve created artificial hashes in the past where what I had to do was aggregate and normalize metadata across different data sets to create a custom fingerprint to do that.” But, he added, “that’s probably not defensible, and it’s also really cumbersome.”
Pressed by Beth, the panel pushed back. “It’s because artificial hashes are kind of complicated,” one panelist offered, and not “a trivial technical problem.” The panel questioned whether MD5 hashes were the appropriate standard or whether SHA-1 would be required, positing that cross-matter deduplication is “something that requires significant buy-in across a broad spectrum of people.” Beth’s request was ultimately dismissed as “not an easy challenge” and one that would be confounded by “people, process and technology” and “the MD5 hash stuff.”
ILTACON is the rare venue where reasonably well-adjusted and -socialized people engage in lively discussions of such things. It’s not just that ILTA folks understand the technology issues (“GEEKS!”), we’re passionate about them (“NERDS!”) and debate them respectfully as peers (“WUSSIES!”).
Beth’s idea deserved more credit than it got. It really is a trivial technical problem, and one that could be resolved without much programming or politics.
Then, why don’t we have a proven means to uniquely identify messages across vendors? I suspect it’s due to a lack of leadership and validation. Insofar as I’m aware, no one has published a standard methodology for cross-vendor identification or established that it works. Certainly, no one has managed to get something accepted as a de facto industry standard, in the nature of, say, the Concordance load file format or EDRM XML. Instead, we invent reasons why it’s just too darn hard.
To be clear, any e-discovery tool worth its salt employs a method to hash and deduplicate messages; unfortunately, they don’t employ the same method. Each tool approaches the task in a slightly different way and, when it comes to comparisons based on hash values, even the most minute variation in the data hashed generates a markedly different hash value. This article looks at how to get everybody on the same page when it comes to generating consistent, hash-based message identifiers across vendors and matters.
Let’s start by recounting a few facts about hashing, then examining how these facts relate to e-mail message and loose file identification and deduplication in e-discovery.
HASH FACT 1: Same data, same algorithm, same hash value.
HASH FACT 2: Different data, different hash value.
HASH FACT 3: No hash value is “close” to another in any way that reflects similarity between the messages the digests represent.
HASH FACT 4: Hashing is a one-way process; i.e., the message digest cannot be reverse engineered to reveal the content of the message.
If you’re reading this, I trust you already know what “hashing” is and that the most common hash algorithm (i.e., “mathematical formula”) employed in e-discovery is called the MD5. But, you may not know that the “MD” in MD5 stands for “Message Digest,” a synonym for “hash” (as a noun, not a verb).
In the context of hashing, the term “message” refers to any data that’s hashed, not necessarily an e-mail message. But in this post, we are going to zero in on the tricky business of uniquely identifying e-mail messages, principally because generating consistent MD5 hash values for loose documents across vendors isn’t a problem…or certainly shouldn’t be.
Why is one easy and the other tricky? It’s because a properly preserved and -processed logical file will hold the same data whether it comes from one custodian or another and whether it comes from an e-mail, a network file share or a thumb drive. Too, if the loose file hasn’t been altered or corrupted, it will generate the same hash value each and every time it’s retrieved from a storage medium, extracted from a container (e.g., a compressed container employing lossless compression like a Zip file) or decoded from base64-encoded content in a transmitting e-mail. Because the data comprising the file is the same from its first logical byte to its last, it will hash identically using the same hash algorithm, each and every time, anywhere, from any source. See Hash Fact 1.
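Hash Fact 1 is easy to see in a few lines of code. This is a minimal sketch using Python’s standard hashlib, with made-up content standing in for a file’s bytes:

```python
import hashlib

# The same bytes produce the same digest every time, no matter where
# the "file" came from or how many times we hash it.
data = b"Quarterly forecast, v3 - FINAL"  # stand-in for a loose file's contents

first = hashlib.md5(data).hexdigest()
second = hashlib.md5(data).hexdigest()

assert first == second  # Hash Fact 1: same data, same algorithm, same hash value
print(first)            # a 32-character hexadecimal value
```

Run it a thousand times, on a thousand machines, and the digest never changes so long as the bytes don’t.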
In contrast, an e-mail message exported from a PST container to a single message (.MSG) format will be different each time it’s exported because the conversion process incorporates and embeds different internal timestamp information in the message each time it’s saved. More to the point, the “same” e-mail message retrieved from different accounts will hold different data because the message traversed a different transmission path to its addressee and arrived at a different time in a different account.
There are other reasons the “same” e-mail holds different data. One reason is that e-mail client programs tend not to store messages as discrete blocks of data faithful to the structure of the message received but will divide constituent parts of the message into records and fields in a database. Another is that e-mail client programs and servers may “alias” sender and recipient addresses, substituting a name from a contacts list for its associated e-mail address or making other small changes intended to aid the user but wreaking havoc in terms of hash-value consistency across systems.
The upshot is that, where you can hash a loose file repeatedly and repeatedly generate a consistent hash value using the same hash algorithm, you often cannot obtain the same hash value for the “same” message extracted at different times or from different sources because it’s not truly the same data first byte to last. See Hash Fact 2.
I’ve sometimes put the word “same” in quotation marks to underscore that what humans deem to be identical messages is far more forgiving than what computers employing hash algorithms deem identical. Humans typically regard two different addressees as receiving the “same” e-mail, though the messages derive from different accounts. Humans focus on the rendered content, not the binary content. The notion of “sameness” for humans is a judgment call. Computers employing hash algorithms perform calculations on the precise sequence of data ingested; “sameness” is objective and rigid. Again: different data, different hash values; and the degree of difference in the data bears no relation to the degree of difference in the hash values. See Hash Fact 3.
So, when I say that any similarity between the digest values of two messages is unrelated to any similarity between the messages, I mean the correlation is even weaker than the one we’d expect between two people of similar appearance having similar fingerprints. In practical terms, there is no correlation at all between appearance and fingerprints, and none between different messages and their message digests (often called “digital fingerprints”). And because hashing is one-way, the differing digest values tell you nothing about what was different in the messages or how significant the difference might be. See Hash Fact 4.
Reliable Identification Without Identicality
So, computers are maddeningly exacting in their comparison of e-mail messages for identicality, and e-mail systems embed quirky variations into e-mail messages that humans pragmatically judge to be the “same.” It’s a conundrum; but, all is not lost! E-discovery tools work around the variations seen within the “same” e-mail by hashing only those parts of the message that will be identical if the messages are the “same” from the perspective of human judgment. Because the various software and service providers go about the task in different ways, some regard their approach as a “secret sauce.” It’s really not. It’s a process that should be documented and open.
Formulating a cross-vendor and -matter methodology for message identification requires we resolve threshold questions of selection, normalization and concatenation:
- Selection: Which parts of the message are suited to hash comparison and by what algorithm?
- Normalization: How will the data be tweaked for consistency of presentation?
- Concatenation: In what order will these parts be presented to the hash algorithm?
At first blush, it might seem that achieving the most precise and defensible identification methodology requires we hash as much of the message as we can while avoiding inherently different features. After all, we want our method to be “forensically-sound,” right?
Perhaps not. In fact, it’s practical to treat two messages as being the “same” by comparing just a handful of characteristics. Not perfect, mind you, but practical.
The fewer pieces of a message that must be normalized and concatenated to generate a hash value, the quicker, easier and less error-prone the process. The tradeoff is an increased risk of hash collision or omission; that is, the risk that two different messages will generate the same hash value or that the process will fail to flag multiple instances of the same message (however we define “same”).
Quantifying the Risk of Hash Collision
How much risk is too much, and at what point does the risk differential become so small as to be meaningless?
The ILTACON panel raised the specter that the MD5 algorithm posed too great a risk of hash collision such that use of the SHA-1 hash algorithm was required for the process to be “defensible.”
Let’s put this objection in context so you can see why the concern is unwarranted:
MD5 hashes are 128-bit values (2^128 possible values), putting the chances of an MD5 hash collision at 1 in 340 undecillion (an undecillion being 10^36) or, more precisely, 1 in 340,282,366,920,938,000,000,000,000,000,000,000,000. SHA-1 hashes are 160-bit values (2^160), so a hash collision with SHA-1 is 2^32 times less likely than with MD5. Coincidentally, 2^32 is equal to the number of angels that can dance on the head of a pin.
I challenge any reader to put 340 undecillion into human scale. Believe me, I’ve tried, and find myself analogizing it to, say, the number of atoms in the Milky Way galaxy. It’s unfathomably large. One IT guy put it this way:
“If you had a job that paid you 390 trillion dollars per hour (US) you would have to work 24 hours per day, 7 days per week, 365 days per year for a just a little less than 100 quadrillion years to earn 340 undecillion dollars.”
Specifying US dollars was a helpful touch, right?
In this application, challenging defensibility based on MD5 versus SHA-1 is silly. We aren’t battling Lex Luthor trying to insert a forged record into the Krypton Library of Universal Knowledge. We just want to eke out cost savings from re-tasking the fruits of prior reviews. Considering the human scale of potentially different messages in an e-mail collection (even one as vast as all the e-mail in the world), using a cosmic-scale tool like the MD5 hash is number space overkill. So, suggesting that the number space needs to be 2^32 times larger is like ordering a sandwich at Subway in parsecs. “I’m dieting this week, so I’ll have the 4.93895e-18 parsec meatball with provolone, toasted on herb cheese please.“
Our risk of a hash collision won’t grow out of our choice of hash algorithm; instead, it’s entirely dependent upon the potential for the data in the compared messages to be the same despite the messages being different. That occurs when, e.g., a message is fabricated using the body of another message or when the constituents of the message selected for hashing are insufficiently unique, i.e., generic and likely to repeat. Message subject lines, dates and addressees are all highly likely to recur over a collection. These are essential data points for review, but they aren’t very useful when it comes to hash identification and deduplication. For these purposes, we want to home in on the most unique features of the individual message.
The flip side of the collision risk is the risk of omission: that two messages deemed the same in human terms will be identified as different by the machine and assigned different hash values for purposes of cross-matter identification and deduplication. This typically occurs when the methods used to transmit, decode, parse, store and collect the message add, omit or alter features of the data. I’ve already discussed aliasing errors, but hashing is frustrated by something as simple as collecting one character too many or one too few from a message body or header. Whatever features of the message are hashed must be collected in a precise and consistent way for every message. This means that the methodology employed must favor characteristics of the message that tend to be preserved in precise and consistent ways.
This may entail keeping features of messages we usually throw away in e-discovery. Accordingly, an industry standard approach to message identification may work better going forward than applied to legacy data. That was the case with human fingerprinting and DNA evidence in their day. You have to start somewhere.
So, let’s look at our candidates for comparison.
The organization and constitution of e-mail messages is governed by a series of proposed, voluntary standards circulated as Requests for Comments, or RFCs. No e-mail system is obliged to adhere to the proposed RFC standards…if it doesn’t care whether its messages can successfully navigate other servers and networks. So, the notion that the RFCs are merely voluntary is counterpoised against the recognition that e-mail isn’t very useful if messages can’t reach their intended recipients. The RFCs are e-mail structural standards, and strong ones at that. Accordingly, they are our jumping off point to identify unique features of messages to include in our hash.
There are many parts to e-mail messages that e-mail users never see but that are crucial to the ability of e-mail systems to successfully transmit, order and present the messages. One of these is the Message-ID (msg-id). Here’s an example of a msg-id value:
Per RFC 5322 (arguably the most important RFC respecting e-mail), “The message identifier (msg-id) itself MUST be a globally unique identifier for a message. The generator of the message identifier MUST guarantee that the msg-id is unique.”
That’s all well and good, except the msg-id is not perfectly unique in real world e-mail collections. Message IDs may be malformed or spoofed, prompting a msg-id collision between messages that aren’t identical.
But, how perfect must the methods we employ be to be good enough to support cross-vendor and -matter identification? Compared to an unobtainable, flawless methodology, nothing suffices; but compared to current alternatives (i.e., nothing), it’s a great leap forward.
First Candidate for a cross-vendor and -matter identifier for e-mail messages
Service providers and processors would supply an MD5 hash of each message’s msg-id value, computed against the string of characters between the angle brackets. For the msg-id in the example above, the identifier would be: 249f818730a5872d8efc660d19d27832
The advantages of using the hash of the msg-id value instead of the msg-id itself are that the hash value is shorter and always of a fixed length (i.e., a 32-character hexadecimal value). Too, using the hash serves to shield the fully qualified domain name customarily appearing at the end of the msg-id value.
The advantage to using the msg-id value alone is that it entails few normalization and no concatenation issues. It’s simple…but not perfect.
If the message contains no msg-id value (which might occur with some oddball in-house e-mail system), the method fails. And if attachments have been stripped from the message, the stripped version will match the unstripped version because they are the “same” originating message and share the same message ID.
Second Candidate for a cross-vendor and -matter identifier for e-mail messages
To avoid confusion stemming from stripped versions of messages, we can include other values in the data stream that will be hashed. Adding the message body or any of the message header values like To, From, CC or Subject won’t serve to fix the stripped attachment problem; for that, we must include the attachments in our hash, or at least some component uniquely tied to the presence of the attachments. One or more attachment boundary values might suffice and would be trivial to hash; but the safest bet is the attachments themselves. To promote efficiency, attachments would be included in the identification hash only when the flag Content-Type: multipart/mixed is present. Another way to speed and simplify the process is to hash the concatenated hash values of the attachments seen in the message. Attachment hash values are routinely calculated in processing e-mail; using them in the identification hash obviates the need to incur the time and processing cycles needed to re-hash large objects.
Normalization and Concatenation
Once we start hashing parts in addition to the msg-id, we must set standards for normalization and concatenation. Normalization serves to eliminate variations in the data that are not indicia of any meaningful difference between the messages but that serve to generate different hash values. Examples include stray spaces, tabs, line feeds, differences in case and other formatting anomalies prompted by differences in e-mail applications and systems. Other data that should be normalized are date and time values. Routinely, date and time values will be re-formatted by an e-mail client to reflect, e.g., European date ordering, local time zones and 12- and 24-hour clocks, to name just a few variants that make hash matching impossible.
These variations must be normalized, viz., presented to the hash algorithm in a consistent manner. For example, time and date values might be converted to UTC expressed in Win32 time formats. All spaces, tabs and line feeds might be stripped and all characters converted to uppercase values. Programmatically, these are trivial tasks. Precisely how it’s done is less important than that it is done consistently each and every time the message is hashed for identification. It sounds complex to a lawyer, but it’s plain vanilla to those who work with digital data.
When the values that will be hashed are normalized, the resulting data comprise strings of information that must be joined end-to-end to be fed into the hash algorithm. This ‘stringing together’ is called concatenation, and it’s crucial that the order of the strings be consistent whenever their hash is computed because any change of order will result in a different hash value. So, a standard for a cross-matter and -vendor message identifier must establish the order of concatenation. Sometimes this will entail specifying the fields by name; other times it’s sufficient to specify an ordering methodology (e.g., alphabetic order). Whatever works here–any approach that’s not needlessly complicated or processor intensive.
If you’ve made it this far, thanks. None of what I’ve put forth is groundbreaking. Again, vendors do this all the time; unfortunately, the vendor community hasn’t settled upon a consistent, published process that supports reliable identification of messages outside their own systems. A standard would allow us to re-use work product and even to de-duplicate across messages produced in different forms. Imagine being able to reliably match and deduplicate messages whether in native, TIFF or PDF forms. An effective identifier makes that possible, and much more.
For this to work, parties must stop jettisoning data from messages like mad balloonists throwing out sandbags. That means that evidence must remain in forms that not only retain the points of comparison but also support simple, reliable extraction of same. Native and near-native forms are best, but even a properly-populated TIFF plus load file production could be matched using a sensible set of standards.
A working standard doesn’t require the imprimatur of EDRM, The Sedona Conference or NIST. The MD5 hash algorithm was just something put out there by one smart guy, Ron Rivest, and adopted by multitudes. All a message identification standard requires is that it work and be used. Again, leadership and validation. The standard could have your name on it. I hope this post helps someone figure it out.
If you want to learn more about hashing and deduplication, you might look at these posts: