Deduplication: Why Computers See Differences in Files that Look Alike

08 Wednesday Jul 2015

Posted by craigball in Computer Forensics, E-Discovery

An employee of an e-discovery service provider asked me to help him explain to his boss why deduplication works well for native files but frequently fails when applied to TIFF images. The question intrigued me because it requires we dip our toes into the shallow end of cryptographic hashing and dispel a common misconception about electronic documents.

Most people regard a Word document file, a PDF or TIFF image made from the document file, a printout of the file and a scan of the printout as being essentially “the same thing.” Understandably, they focus on content and pay little heed to form. But when it comes to electronically stored information, the form of the data—the structure, encoding and medium employed to store and deliver content–matters a great deal. As data, a Word document and its imaged counterpart are radically different data streams from one-another and from a digital scan of a paper printout. Visually, they are alike when viewed as an image or printout; but digitally, they bear not the slightest resemblance.

Exactly three years ago, I posted here concerning the challenge of deduplicating “the same” data in different formats. I addressed deduplication of e-mail messages then; now, let’s look at the issue with respect to word processed documents and their printed and imaged counterparts.

I’ll start by talking about hashing, as a refresher for you old hands and to bring newbies up to speed. Then we will look at how hashing is used to deduplicate files and wrap up by examining examples of the “same” data in a variety of common formats seen in e-discovery and explore why they will and won’t deduplicate. At that point, it should be clear why deduplication works well for native files but frequently fails when applied to TIFF images.

Hashing
My students at UTexas Law School and the Georgetown E-Discovery Training Academy spend considerable time learning that all ESI is just a bunch of numbers. Their readings and exercises about Base2 (binary), Base10 (decimal), Base16 (hexadecimal) and Base64; as well as about the difference between single-byte encoding schemes (like ASCIII) and double-byte encoding schemes (like Unicode) may seem like a wonky walk in the weeds; but the time is well spent when the students snap to the crucial connection between numeric encoding and our ability to use math to cull, filter and cluster data. It’s a necessary precursor to their gaining Proustian “new eyes” for ESI.

Because ESI is just a bunch of numbers, we can use algorithms (mathematical formulas) to distill and compare those numbers. In e-discovery, one of the most used and –useful family of algorithm are those which manipulate the very long numbers that comprise the content of files (the “message”) in order to generate a smaller, fixed length value called a “Message Digest” or “hash value.” The calculation process is called “hashing,” and the most common hash algorithms in use in e-discovery are MD5 (for Message Digest five) and SHA-1 (for Secure Hash Algorithm one)..

Using hash algorithms, any volume of data—from the tiniest file to the contents of entire hard drives and beyond—can be uniquely expressed as an alphanumeric sequence of fixed length. When I say “fixed length,” I mean that no matter how large or small the volume of data in the file, the hash value computed will (in the case of MD5) be distilled to a value written as 32 hexadecimal characters (0-9 and A-F). It’s hard to understand until you’ve figured out Base16; but, those 32 characters represent 340 trillion, trillion, trillion different possible values (2¹²⁸ or 16³²).

Hash algorithms are one-way calculations, meaning that although the hash value identifies just one sequence of data, it reveals nothing about the data; much as a fingerprint uniquely identifies an individual but reveals nothing about their appearance or personality.

Hash algorithms are simple in their operation: a number is inputted (and here, the “number” might be the contents of a file, a group of files, i.e., all files produced to the other side, or the contents of an entire hard drive or server storage array), and a value of fixed length emerges at a speed commensurate with the volume of data being hashed.

For example, the MD5 hash value of Lincoln’s Gettysburg Address in plain (Notepad) text is E7753A4E97B962B36F0B2A7C0D0DB8E8. Anyone, anywhere performing the same calculation on the same data will get the same unique value in a fraction of a second. But change “Four score and seven” to “Five score” and the hash becomes 8A5EF7E9186DCD9CF618343ECF7BD00A. However subtle the alteration—an omitted period or extra space—the hash value changes markedly. Hashing sounds like rocket science—and it’s a miraculous achievement—but it’s very much a routine operation, and the programs used to generate digital fingerprints are freely available and easy to use. Hashing lies invisibly at the heart of everyone’s computer and Internet activities and supports processes vitally important to electronic discovery, including identification, filtering, Bates numbering, authentication and deduplication.

Hashing for Deduplication
A modern hard drive holds trillions of bytes, and even a single Outlook e-mail container file typically comprises billions of bytes. Accordingly, it’s easier and faster to compare 32-character/16 byte “fingerprints” of voluminous data than to compare the data itself, particularly as the comparisons must be made repeatedly when information is collected and processed in e-discovery. In practice, each file ingested and item extracted is hashed and its hash value compared to the hash values of items previously ingested and extracted to determine if the file or item has been seen before. The first file is sometimes called the “pivot file,” and subsequent files with matching hashes are suppressed as duplicates, and the instances of each duplicate and certain metadata is typically noted in a deduplication or “occurrence” log.

When the data is comprised of loose files and attachments, a hash algorithm tends to be applied to the full contents of the files. Notice that I said to “contents.” Some data we associate with files is not actually stored inside the file but must be gathered from the file system of the device storing the data. Such “system metadata” is not contained within the file and, thus, is not included in the calculation when the file’s content is hashed. A file’s name is perhaps the best example of this. Recall that even slight differences in files cause them to generate different hash values. But, since a file’s name is not typically housed within the file, you can change a file’s name without altering its hash value.

So, the ability of hash algorithms to deduplicate depends upon whether the numeric values that serve as building blocks for the data differ from file-to-file. Keep that firmly in mind as we consider the many forms in which the informational payload of a document may manifest.

A Word .DOCX document is constructed of a mix of text and rich media encoded in Extensible Markup Language (XML), then compressed using the ubiquitous Zip compression algorithm. It’s a file designed to be read by Microsoft Word.

When you print the “same” Word document to an Adobe PDF format, it’s reconstructed in a page description language specifically designed to work with Adobe Acrobat. It’s structured, encoded and compressed in an entirely different way than the Word file and, as a different format, carries a different binary header signature, too.

When you take the printed version of the document and scan it to a Tagged Image File Format (TIFF), you’ve taken a picture of the document, now constructed in still another different format—one designed for TIFF viewer applications.

To the uninitiated, they are all the “same” document and might look pretty much the same printed to paper; but as ESI, their structures and encoding schemes are radically different. Moreover, even files generated in the same format may not be digitally identical when made at different times. For example, no two optical scans of a document will produce identical hash values because there will always be some variation in the data acquired from scan to scan. Small differences perhaps; but, any difference at all in content is going to frustrate the ability to generate matching hash values.

Opinions are cheap; testing is truth; so to illustrate this, I created a Word document of the text of Lincoln’s Gettysburg Address. First, I saved it in the latest .DOCX Word format. Then, I saved a copy in the older .DOC format. Next, I saved the Word document to a .PDF format, using both the Save as PDF and Print to PDF methods. Finally, I printed and scanned the document to TIFF and PDF. Without shifting the document on the scanner, I scanned it several times at matching and differing resolutions.

I then hashed all the iterations of the “same” document and, as the table below demonstrates, none of them matched hash wise, not even the successive scans of the paper document:

Thus, file hash matching–the simplest and most defensible approach to deduplication–won’t serve to deduplicate the “same” document when it takes different forms or is made optically at different times.

Now, here’s where it can get confusing. If you copied any of the electronic files listed above, the duplicate files would hash match the source originals, and would handily deduplicate by hash. Consequently, multiple copies of the same electronic files will deduplicate, but that is because the files being compared have the same digital content. But, we must be careful to distinguish the identicality seen in multiple iterations of the same file from the pronounced differences seen when different electronic versions are generated at different times from the same content. One notable exception seen in my testing was that successively saving the same Word document to a PDF format in the same manner sometimes generated identical PDF files. It didn’t occur consistently (i.e., if enough time passed, changes in metadata in the source document triggered differences prompting the calculation of different hash values); but it happened, so was worth mentioning.

21 thoughts on “Deduplication: Why Computers See Differences in Files that Look Alike”

ESIDence said:

July 8, 2015 at 10:22 PM

Hashing the various files produced in your test created, unsurprisingly, vastly different digital signatures, as your primer explains.

For those who may have felt lost following the logic trail, a visually-striking view of the differences sensed by hashing algorithms results from simply opening each of the seemingly-identical versions in a “non-native” tool, e.g., using “Open With” Notepad of each file (because Notepad is a ‘native’ tool for only the most basic text format). A casual comparison of just the first few lines reveals almost 100% uniqueness of the content that humans perceive as near-identical duplicates.

Because of its beautiful simplicity, readers who are just now discovering the truth of digital differences in apparent duplicates can easily (and should) re-create the test you used.

Just create and save a 2-line TXT document in Notepad, open the file in Word and “save as” a DOCX file, print the DOCX to PDF, print (to paper) and scan the PDF to TIFF – and then open each file (the original TXT, the Word DOCX, the .PDF and TIFF versions) using Notepad.

You’ll see firsthand what the computer sees – not a trace of duplicates.

LikeLike
- craigball said:
  
  July 9, 2015 at 12:23 AM
  
  Thank you for suggesting an alternate way to see the difference that hashing sees. I did not take your approach because it is not as effective at demonstrating the differences between the same document imaged at different times and similar documents digitized in the same or similar ways. In those cases, I’m sure you’d agree that looking at the first few lines (“file headers”) is unlikely to reveal much, if any, difference to the observer using Notepad, although the hashes would differ markedly, frustrating deduplication.
  
  LikeLike
Matthew Golab said:

July 8, 2015 at 10:50 PM

Excellent as always Craig. It would be interesting to see the file size differences in the test that you did.

It would be great if there could be a universal way of deduplication against emails, adopted by all ediscovery processing systems – my point being that it appears that the systems that I’ve come across appear to have subtle differences in the MD5s that they generate and so it doesn’t appear feasible for parties to be able to exchange MD5s and be able to deduplicate between each others’ productions – or at least I haven’t yet experienced this.

LikeLike
- craigball said:
  
  July 9, 2015 at 1:10 AM
  
  Dear Matthew:
  
  Adding the file sizes was a fine idea! I’ve revised the table to do so. Thank you.
  
  If the parties standardized on the right file format for e-mail and normalized fields in exactly the same way, hash deduplication should be possible across parties–error-prone, perhaps; but possible. It would never work for the MSG format; but, EML and RFC formats might suffice.
  
  LikeLike
  - Matthew Golab said:
    
    July 9, 2015 at 2:54 PM
    
    Thank you Craig.
    
    Its fascinating to see the size differences between the two PDFs, in that the Word generated PDF is close enough to being the same size as the native Word, whereas the save as PDF is significantly larger.
    
    On the inter-party deduplication, yes it would be ideal if things could be ‘tweaked’ to do this, or perhaps another consideration could be that the inter-party exchange protocol be adjusted so that the mandatory fields that the parties exchange include those that are used to generate MD5s (except I guess for the body).
    
    It would then be a little easier in ‘kind of’ / ‘quasi’ inter-party deduplication in that you are kind of comparing like with like. No doubt though this would get derailed with good old email address / alias shenanigans – ie if for whatever reason I have an email which is the full email address, and the other party has an identical copy where say the email address used the local alias and email address etc.
    
    Something to think about in the future, as this would (in Australia at least) be of assistance to all parties as the potential is that you may be able to reduce the review universe from the other party.
    
    On the MD5 v EML / RFC formats, I’d be interested in hearing further on this, as well as email recipients and when the email address is using an alias and when it isn’t.
    
    LikeLike
Pingback: Week’s Top 3 in e-discovery | Corporate E-Discovery by Damian Durrant
David Tobin said:

July 9, 2015 at 7:50 AM

Great post.

LikeLike
Pingback: Deduplication: Why Computers See Differences in Files that Look Alike | @ComplexD
Pingback: Craig Ball Explains HASH Deduplication As Only He Can: eDiscovery Best Practices | eDiscoveryDaily
Jesse said:

July 10, 2015 at 9:48 AM

Can you explain further (possibly in another post) why the Gettysburg Address in plain text will always get the same hash? I had thought that hash values were assigned anew to each document set — and that this was part of the value of using hashing.

Does every set of characters have a pre-determined hash? The possibilities seem endless. Even a trillion, trillion, trillion could not cover the possibilities.

LikeLike
- craigball said:
  
  July 10, 2015 at 1:53 PM
  
  The Gettysburg Address written in ASCII characters is a big number; to wit:
  
  74 20 63 61 75 73 65 20 66 6f 72 20 77 68 69 63
  68 20 74 68 65 79 20 67 61 76 65 20 74 68 65 20
  6c 61 73 74 20 66 75 6c 6c 20 6d 65 61 73 75 72
  65 20 6f 66 20 64 65 76 6f 74 69 6f 6e 20 2d 2d
  20 74 68 61 74 20 77 65 20 68 65 72 65 20 68 69
  67 68 6c 79 20 72 65 73 6f 6c 76 65 20 74 68 61
  74 20 74 68 65 73 65 20 64 65 61 64 20 73 68 61
  6c 6c 20 6e 6f 74 20 68 61 76 65 20 64 69 65 64
  20 69 6e 20 76 61 69 6e 20 2d 2d 20 74 68 61 74
  20 74 68 69 73 20 6e 61 74 69 6f 6e 2c 20 75 6e
  64 65 72 20 47 6f 64 2c 20 73 68 61 6c 6c 20 68
  61 76 65 20 61 20 6e 65 77 20 62 69 72 74 68 20
  6f 66 20 66 72 65 65 64 6f 6d 20 2d 2d 20 61 6e
  64 20 74 68 61 74 20 67 6f 76 65 72 6e 6d 65 6e
  74 20 6f 66 20 74 68 65 20 70 65 6f 70 6c 65 2c
  20 62 79 20 74 68 65 20 70 65 6f 70 6c 65 2c 20
  66 6f 72 20 74 68 65 20 70 65 6f 70 6c 65 2c 20
  73 68 61 6c 6c 20 6e 6f 74 20 70 65 72 69 73 68
  20 66 72 6f 6d 20 74 68 65 20 65 61 72 74 68 2e
  
  Anyone, anywhere who calculates an MD5 hash against that precise numeric sequence will generate the same hash value, just as anyone anywhere multiplying two times two should get four. Bigger numbers, same constancy.. Hash values are not “assigned.” They are calculated.
  
  Yes, every numeric value has the same hash value, assuming no deviation in the value, no computer failure and use of the same hash algorithm. This fact enables limited use of a hash reversal technique called generation of a Rainbow Table; but, that calculation of all potential hashes is only feasible when the universe of potential messages is fairly modest.
  
  Because we can say “trillion, trillion. trillion” doesn’t begin to help us fathom the enormity of that number. Think “more than the number of atoms in the Milky Way galaxy” and perhaps you can get a sense of the astronomic size of of an undecilion.
  
  LikeLike
Pingback: Top 10 from Texas Bar Today: Mulligans, Ambiguity, and Deduplication | Texas Bar Today
Mike said:

July 10, 2015 at 1:23 PM

Great explanations. I had a question – does the 32 character MD5 hash represent 340 trillion . . . or 340 billion . . . ? I see others listing it with billions not trillions – for example http://e-discoveryissues.blogspot.com/2010/10/can-hash-values-leave-your-e-discovery.html

LikeLike
- craigball said:
  
  July 10, 2015 at 1:44 PM
  
  Ordinarily, I would respond that it’s 340 undecillion; but, however one expresses the value–340 undecillion = 340 billion billion billion billion = 340 trillion trillion trillion, all we are talking about are different ways to state the same value. Four times nine is thirty-six (zeroes in a billion to the fourth power) and three times twelve is is also thirty-six (zeroes in trillion cubed).
  
  So, which is heavier, a pound of goose down or a pound of lead shot?
  
  LikeLike
  - Mike said:
    
    July 10, 2015 at 1:52 PM
    
    Thanks for the clarification
    
    LikeLike
Chris Janak said:

July 15, 2015 at 12:53 PM

This is precisely why “near duplicate” identification is useful in that it generally can identify similar or identical content even when file signatures and hash values are different.

LikeLike
- WTKJD said:
  
  July 24, 2015 at 5:01 PM
  
  Exactly, Chris. The simple reduction of cooking this sauce is to understand the difference between technical de-duplication based on the binary of a file reduced to the hash signature that Craig discusses at length, and content de-duplication based on other methodologies, which is tacitly set up in the premise, but ignored (for good reason, as it is not the basis for the question presented.)
  
  I have always bristled at the term “near duplicate” identification because it is based on a technical understanding, not the common ken of the consumer (litigator.) At the end of the day, identifying duplicate content is more important to addressing the review problem than the form of the various files presenting that content. When discussing de-duplication with a neophyte lawyer or technologist, they always assume the process is broader than it actually is. This answers the question of “I thought we de-duplicated, why am I seeing the same document over and over again?” at least on the technical level.
  
  At the end of the day, there is still a very broad gap of understanding in answering the question “What is a duplicate?”
  
  LikeLike
Pingback: QuickLINK » QuickLINK #23
Mark Holland-Avery said:

October 26, 2015 at 5:30 PM

System metadata is not contained within the file. There is a “not” missing in the sentence: Such “system metadata” is contained within the file and, thus, is not included in the calculation when the file’s content is hashed.

LikeLike
Pingback: Cross-Matter & -Vendor Message ID | Ball in your Court
yegermal semahegn said:

January 5, 2018 at 12:11 PM

good explanation!!!!!!!!!!

LikeLike