We all need certainty in our lives; we need to trust that two and two is four today and will be tomorrow.  But the more we learn about any subject, the more we’re exposed to the qualifiers and exceptions that belie perfect certainty.  It’s a conundrum for me when someone writes about cryptographic hashing, the magical math that allows an infinite range of numbers to match to a finite complement of digital fingerprints. Trying to simplify matters, well-meaning authors say things about hashing that just aren’t so.  Their mistakes are inconsequential for the most part—what they say is true enough–but it’s also misleading enough to warrant caveats useful in cross-examination.

I’m speaking of the following two assertions:

  1. Hash values are unique; i.e., two different files never share a hash value.
  2. Hash values are irreversible, i.e., you can’t deduce the original message using its hash value.

Both statements are wrong.

1. Hash values are not unique
Hash values can’t be unique because, even if there are vast numbers of different hash values (say, 2128 for MD5 or 2256 for SHA-256), those are still finite deterministic ranges tasked to serve as proxies for a much larger number of possible inputs.  If you try to use the MD5 algorithm to hash any input that constitutes the variation 2128+1 and larger, you must inevitably and unavoidably generate a hash that will be identical to one of the 2128 variations that came before (a so-called hash collision). Put another way, 340,282,366,920,938,463,463,374,607,431,768,211,455 is a very, very, VERY big number; but there are an infinite number of larger numbers, every one of which will generate MD5 hash values that match the MD5 hash value of one of the smaller numbers.  In math, this is the “pigeonhole principle” where, if you have m mailboxes and n letters placed in them, a mailbox contains more than one letter whenever n is greater than m.  It’s axiomatic.

It’s reasonable to counter, “Unique doesn’t mean mathematically unique; it means practically unique because we don’t have the practical ability to find two different messages that produce colliding hash values.”  It once was thought that hash collisions were “computationally infeasible;” a begrudging recognition that hash collisions can be created but only if you had eons and a supercomputer to fashion collisions.  An esteemed e-discovery colleague once called hash collisions “theoretical math exercises performed on high-speed computers [that] do not pose any real threats to the integrity of hash.”  That quickly proved to be wishful thinking.  We do have the ability; it’s been feasible for fifteen years and it’s gotten to the point that generating MD5 hash collisions is no big deal.

So, no one  should claim that hash values are “unique” because they never were and never will be.  That said, though I wouldn’t use the MD5 algorithm in applications that require collision resistance, I’m perfectly comfortable using MD5 hashes in the ways we use them in e-discovery.  The risk is real but largely irrelevant for, e.g., deduplication or de-NISTing.

2. Hash values are reversible
I’m not just picking on colleagues.  I recently wrote, “Hash functions are one-way calculations, meaning you can’t reverse (‘invert’) a hash value and ascertain the data corresponding to the hash value in the same way that you can’t decode a human fingerprint to deduce an individual’s eye color or IQ.  It identifies, but it doesn’t reveal.”  Even as I typed those words, I knew there to be a crucial exception to the proposition that hash values are irreversible.  It’s called a Rainbow Table and it can sometimes facilitate the reversing of hash values to reveal input data.

A Rainbow Table is precomputed list of hash values that correspond to a finite number of possible messages.  Suppose a group of movie fans wanted to see who could correctly pick the Best Picture but didn’t want to reveal their choices to one-another before the Academy opened the envelopes on Oscar Night.  Everyone could make their selection, hash their pick and share the hash value with the other players.  That way, no one who’d guessed wrong could claim they’d made the correct prediction because the hash value given all other players wouldn’t match to the correct pick.  If hash values were truly one way, no player could know another player’s choice from the hash values shared.

Can another player reverse engineer the hash values to decode everyone’s picks?  Yes and no.

There’s no magic algorithm that can unravel the hash values to get back to the selections; however, you don’t need one because the number of possible Best Picture choices is so limited–between five and ten nominees–it’s trivial to calculate the hash values of all the choices and generate a table of all possible inputs.  Matching the hashes submitted to the handful of hashes in the table enable the cheater to know everyone’s choices.

Extrapolating to a bigger range of possible inputs, consider a system where users choose a seven-character password constructed from one of the 95 printable ASCII characters on a standard U.S. keyboard.  [There are 128 ASCII characters, but 33 of them do not produce printed characters, so you can’t use them in passwords].  The number of potential seven-character passwords equals 957 or 69,833,729,609,375 possibilities.  Is 70 trillion a big honking number?  You bet!  But, is it beyond the calculation and search capabilities of a good personal computer?  Heck no! And you don’t need to tabulate your own Rainbow Table because hackers have done it for you and published the tables for anyone to download.

It’s common for online sites and software to store users’ passwords as hash values so that even those with high-level access to the site can’t see users’ passwords in plain text.  But with the right Rainbow Table, it’s child’s play for someone who steals the password hashes to reverse them to reveal the corresponding passwords.  Sites rectify this vulnerability using a process called “salting” where random data is stored and computed with the hash to inject enough variability to defeat Rainbow Table attacks.

Whew!  I’m happy to have gotten those asterisks off my chest.  It may not seem like a big deal but knowing the exceptions to categorical assertions about hashing can be a powerful cross examination tool.  I’ve seen expert witnesses undone by unqualified assertions as well as lawyers squander opportunities for impeachment by failing to know the limits of certainty.