An employee of an e-discovery service provider asked me to help him explain to his boss why deduplication works well for native files but frequently fails when applied to TIFF images. The question intrigued me because it requires we dip our toes into the shallow end of cryptographic hashing and dispel a common misconception about electronic documents.
Most people regard a Word document file, a PDF or TIFF image made from the document file, a printout of the file and a scan of the printout as being essentially “the same thing.” Understandably, they focus on content and pay little heed to form. But when it comes to electronically stored information, the form of the data—the structure, encoding and medium employed to store and deliver content–matters a great deal. As data, a Word document and its imaged counterpart are radically different data streams from one-another and from a digital scan of a paper printout. Visually, they are alike when viewed as an image or printout; but digitally, they bear not the slightest resemblance.
Exactly three years ago, I posted here concerning the challenge of deduplicating “the same” data in different formats. I addressed deduplication of e-mail messages then; now, let’s look at the issue with respect to word processed documents and their printed and imaged counterparts.
I’ll start by talking about hashing, as a refresher for you old hands and to bring newbies up to speed. Then we will look at how hashing is used to deduplicate files and wrap up by examining examples of the “same” data in a variety of common formats seen in e-discovery and explore why they will and won’t deduplicate. At that point, it should be clear why deduplication works well for native files but frequently fails when applied to TIFF images.
My students at UTexas Law School and the Georgetown E-Discovery Training Academy spend considerable time learning that all ESI is just a bunch of numbers. Their readings and exercises about Base2 (binary), Base10 (decimal), Base16 (hexadecimal) and Base64; as well as about the difference between single-byte encoding schemes (like ASCIII) and double-byte encoding schemes (like Unicode) may seem like a wonky walk in the weeds; but the time is well spent when the students snap to the crucial connection between numeric encoding and our ability to use math to cull, filter and cluster data. It’s a necessary precursor to their gaining Proustian “new eyes” for ESI.
Because ESI is just a bunch of numbers, we can use algorithms (mathematical formulas) to distill and compare those numbers. In e-discovery, one of the most used and –useful family of algorithm are those which manipulate the very long numbers that comprise the content of files (the “message”) in order to generate a smaller, fixed length value called a “Message Digest” or “hash value.” The calculation process is called “hashing,” and the most common hash algorithms in use in e-discovery are MD5 (for Message Digest five) and SHA-1 (for Secure Hash Algorithm one)..
Using hash algorithms, any volume of data—from the tiniest file to the contents of entire hard drives and beyond—can be uniquely expressed as an alphanumeric sequence of fixed length. When I say “fixed length,” I mean that no matter how large or small the volume of data in the file, the hash value computed will (in the case of MD5) be distilled to a value written as 32 hexadecimal characters (0-9 and A-F). It’s hard to understand until you’ve figured out Base16; but, those 32 characters represent 340 trillion, trillion, trillion different possible values (2128 or 1632).
Hash algorithms are one-way calculations, meaning that although the hash value identifies just one sequence of data, it reveals nothing about the data; much as a fingerprint uniquely identifies an individual but reveals nothing about their appearance or personality.
Hash algorithms are simple in their operation: a number is inputted (and here, the “number” might be the contents of a file, a group of files, i.e., all files produced to the other side, or the contents of an entire hard drive or server storage array), and a value of fixed length emerges at a speed commensurate with the volume of data being hashed.
For example, the MD5 hash value of Lincoln’s Gettysburg Address in plain (Notepad) text is E7753A4E97B962B36F0B2A7C0D0DB8E8. Anyone, anywhere performing the same calculation on the same data will get the same unique value in a fraction of a second. But change “Four score and seven” to “Five score” and the hash becomes 8A5EF7E9186DCD9CF618343ECF7BD00A. However subtle the alteration—an omitted period or extra space—the hash value changes markedly. Hashing sounds like rocket science—and it’s a miraculous achievement—but it’s very much a routine operation, and the programs used to generate digital fingerprints are freely available and easy to use. Hashing lies invisibly at the heart of everyone’s computer and Internet activities and supports processes vitally important to electronic discovery, including identification, filtering, Bates numbering, authentication and deduplication.
Hashing for Deduplication
A modern hard drive holds trillions of bytes, and even a single Outlook e-mail container file typically comprises billions of bytes. Accordingly, it’s easier and faster to compare 32-character/16 byte “fingerprints” of voluminous data than to compare the data itself, particularly as the comparisons must be made repeatedly when information is collected and processed in e-discovery. In practice, each file ingested and item extracted is hashed and its hash value compared to the hash values of items previously ingested and extracted to determine if the file or item has been seen before. The first file is sometimes called the “pivot file,” and subsequent files with matching hashes are suppressed as duplicates, and the instances of each duplicate and certain metadata is typically noted in a deduplication or “occurrence” log.
When the data is comprised of loose files and attachments, a hash algorithm tends to be applied to the full contents of the files. Notice that I said to “contents.” Some data we associate with files is not actually stored inside the file but must be gathered from the file system of the device storing the data. Such “system metadata” is not contained within the file and, thus, is not included in the calculation when the file’s content is hashed. A file’s name is perhaps the best example of this. Recall that even slight differences in files cause them to generate different hash values. But, since a file’s name is not typically housed within the file, you can change a file’s name without altering its hash value.
So, the ability of hash algorithms to deduplicate depends upon whether the numeric values that serve as building blocks for the data differ from file-to-file. Keep that firmly in mind as we consider the many forms in which the informational payload of a document may manifest.
A Word .DOCX document is constructed of a mix of text and rich media encoded in Extensible Markup Language (XML), then compressed using the ubiquitous Zip compression algorithm. It’s a file designed to be read by Microsoft Word.
When you print the “same” Word document to an Adobe PDF format, it’s reconstructed in a page description language specifically designed to work with Adobe Acrobat. It’s structured, encoded and compressed in an entirely different way than the Word file and, as a different format, carries a different binary header signature, too.
When you take the printed version of the document and scan it to a Tagged Image File Format (TIFF), you’ve taken a picture of the document, now constructed in still another different format—one designed for TIFF viewer applications.
To the uninitiated, they are all the “same” document and might look pretty much the same printed to paper; but as ESI, their structures and encoding schemes are radically different. Moreover, even files generated in the same format may not be digitally identical when made at different times. For example, no two optical scans of a document will produce identical hash values because there will always be some variation in the data acquired from scan to scan. Small differences perhaps; but, any difference at all in content is going to frustrate the ability to generate matching hash values.
Opinions are cheap; testing is truth; so to illustrate this, I created a Word document of the text of Lincoln’s Gettysburg Address. First, I saved it in the latest .DOCX Word format. Then, I saved a copy in the older .DOC format. Next, I saved the Word document to a .PDF format, using both the Save as PDF and Print to PDF methods. Finally, I printed and scanned the document to TIFF and PDF. Without shifting the document on the scanner, I scanned it several times at matching and differing resolutions.
I then hashed all the iterations of the “same” document and, as the table below demonstrates, none of them matched hash wise, not even the successive scans of the paper document:
Thus, file hash matching–the simplest and most defensible approach to deduplication–won’t serve to deduplicate the “same” document when it takes different forms or is made optically at different times.
Now, here’s where it can get confusing. If you copied any of the electronic files listed above, the duplicate files would hash match the source originals, and would handily deduplicate by hash. Consequently, multiple copies of the same electronic files will deduplicate, but that is because the files being compared have the same digital content. But, we must be careful to distinguish the identicality seen in multiple iterations of the same file from the pronounced differences seen when different electronic versions are generated at different times from the same content. One notable exception seen in my testing was that successively saving the same Word document to a PDF format in the same manner sometimes generated identical PDF files. It didn’t occur consistently (i.e., if enough time passed, changes in metadata in the source document triggered differences prompting the calculation of different hash values); but it happened, so was worth mentioning.