I’m teaching e-discovery at the University of Texas Law School this semester, and though it’s been a lot of work, and challenging to conform my peripatetic practice to a fixed routine, I love being back in the classroom with bright students. So far, I’m pretty sure no one in the class has learned more than I have.
I’ve learned that however witty I imagine I might be in front of a lawyer audience, I’m not funny in the slightest to a bunch of stressed out 2Ls. And, I’ve discovered that I need fresh technology metaphors because references to pre-1990 devices draw blank stares. Despite the resurgent coolness of vinyl, twenty-somethings have never heard of a “tone arm” nor experienced an “auto reverse” cassette deck. Of course, what were you thinking, old timer?!?!
Unlike practicing lawyers, law students don’t devote all their creative ingenuity to fashioning arguments why they can’t (or shouldn’t have to) learn the nuts and bolts of information technology. I tell the class it will be on the midterm, and they have all the motivation they need to wrap their nimble noggins around sectors and clusters, hashing and hex. The power to test those you teach is awesome, and may be what’s missing from CLE. You can bet you’d see better speakers and more attentive listeners if attendees had to pass a test on the material to get their CLE credit. But I digress.
The greatest delights are the excellent questions from the students. I thought I’d start posting a few of the better ones here with my replies, in case what they want to know happens to be what’s bugging you, too.
KB asked “In class you explained that computing a hash value is ‘digitally fingerprinting’ a file, so that there is a record of what you produced natively. As far as I understood it, if the document is then changed at all (such as by being accessed on a different date) the hash value of the document will change because the document properties have changed. I’m confused then about the value of this as a means to identify what you produced because of how easily it can be changed. Or have I misunderstood how easily the hash number can later be changed?”
It’s a very good question, and a source of confusion to many in e-discovery.
One of the points I’m trying to nail down with the class is the distinction between information that resides within a file and that which resides without it. It’s an important distinction for, inter alia, preservation, production and authentication. We touch on this crucial distinction in the context of metadata. Application metadata resides within the file and moves with the file, not changing unless the contents of the file are altered. System metadata resides outside the file and can be altered without impacting the contents of the file. Hashing the file hashes its contents, not information about the file. That is, you only hash what’s stored inside the file, not its system metadata.
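For readers who like to see it for themselves, here is a minimal sketch of that point in Python. It assumes SHA-256 as the hash algorithm (nothing above depends on a particular one), and the file name and contents are made up for illustration. The sketch hashes a file, changes its name, extension and dates, and hashes it again.

```python
import hashlib
import os
import tempfile
from pathlib import Path


def file_digest(path):
    """Return the SHA-256 hex digest of a file's contents only."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        # Read in chunks so large files need not fit in memory.
        for chunk in iter(lambda: f.read(65536), b""):
            h.update(chunk)
    return h.hexdigest()


def demo_system_metadata():
    """Hash a file, alter its system metadata, and hash it again."""
    with tempfile.TemporaryDirectory() as tmp:
        original = Path(tmp) / "memo.doc"  # hypothetical file
        original.write_bytes(b"Privileged and confidential")
        before = file_digest(original)

        # Change the name and extension (stored by the file system).
        renamed = original.with_suffix(".jpg")
        original.rename(renamed)
        # Set the accessed and modified dates to 1 January 2000.
        os.utime(renamed, (946684800, 946684800))

        after = file_digest(renamed)
        return before, after


before, after = demo_system_metadata()
print("hash unchanged:", before == after)
```

The hashing function is handed only the bytes read from inside the file. The name and dates never reach it, which is why changing them cannot move the digest.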
In class, I asked how one might access a file to investigate its contents without changing the file and the class correctly answered, “You could make a working copy of the file without changing the file you’re seeking to preserve.” That’s possible because certain file operations–copying, moving–don’t alter the contents of a file, although they necessarily change system metadata about the file. Think about why that must be the case by hearkening back to the card catalog and library book analogy I use to explain file systems. You can do anything you want to the card in the card catalog, but nothing you do to the cards changes the books in the stacks.
System metadata has to change when you make a copy of a file because the copy has to be stored somewhere, and the system has to track where the copy resides, and the new file’s name, size and MAC dates (Modified/Accessed/Created), including the date it was created on that system. Importantly, a “creation date” in a Windows master file table doesn’t mean “date authored.” It means date created on that storage medium, so copies of files typically have creation dates later than the date they were last modified! When a lawyer sees a file that’s been modified before it was created, the lawyer may think “fraud!” When a forensic examiner sees it, the examiner thinks “copy.” It’s important for lawyers to understand the disconnect between use of the term “creation” to mean authoring in casual conversation and the meaning afforded “creation” when referencing metadata. It’s yet another area where the use of precise technical language needs to matter to lawyers, especially when discussing technical topics with technicians.
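A rough sketch of the "copy" scenario, again assuming SHA-256 and an invented file. It uses Python's `shutil.copy2`, which tries to carry the last modified date over to the copy. The creation ("birth") time is reported only on platforms where Python exposes `st_birthtime`, so treat that value as optional.

```python
import hashlib
import os
import shutil
import tempfile
from pathlib import Path


def sha256_of(path):
    """SHA-256 hex digest of a file's contents."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def copy_and_compare():
    with tempfile.TemporaryDirectory() as tmp:
        source = Path(tmp) / "contract.doc"  # hypothetical file
        source.write_bytes(b"Last edited long ago")
        old = 946684800  # 1 January 2000, as seconds since the epoch
        os.utime(source, (old, old))  # pretend it was last modified then

        dest_dir = Path(tmp) / "working_copy"
        dest_dir.mkdir()
        copy = Path(shutil.copy2(source, dest_dir / source.name))

        info = copy.stat()
        return {
            "same_hash": sha256_of(source) == sha256_of(copy),
            "modified_date_preserved": abs(info.st_mtime - old) < 2,
            "different_location": source.parent != copy.parent,
            # Creation time on this medium; None where unavailable.
            "copy_birthtime": getattr(info, "st_birthtime", None),
        }


result = copy_and_compare()
print(result)
```

Where a birth time is available, it will be the moment the copy was made, years after the preserved modified date. That is the "modified before created" pattern an examiner reads as "copy."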
So, back to hashing. If you hash the file, you have a reliable way (the “digital fingerprint”) to establish the precise content of the file when produced. The producing party maintains a pristine set of production, just as they might have kept a true and correct set of a paper production in the old days. They hash the digital production (individually and/or collectively) to be able to later prove the statement, “This is exactly what we gave you in response to your Request for Production No. 6.”
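Hashing a production "individually and/or collectively" might look something like the sketch below. The folder and file names are hypothetical, SHA-256 is an assumption, and the way the collective value is built (a hash of the sorted per-file hashes) is just one reasonable choice, not a standard.

```python
import hashlib
import tempfile
from pathlib import Path


def hash_production(folder):
    """Return ({relative name: digest}, one digest for the whole set)."""
    folder = Path(folder)
    manifest = {}
    for path in sorted(p for p in folder.rglob("*") if p.is_file()):
        name = path.relative_to(folder).as_posix()
        manifest[name] = hashlib.sha256(path.read_bytes()).hexdigest()
    # Collective hash: digest of the sorted per-file digests, so it
    # depends on file contents only, not on file names or order.
    joined = "\n".join(sorted(manifest.values())).encode("ascii")
    return manifest, hashlib.sha256(joined).hexdigest()


def demo_production():
    with tempfile.TemporaryDirectory() as tmp:
        production = Path(tmp) / "RFP_06"  # hypothetical production set
        production.mkdir()
        (production / "letter.doc").write_bytes(b"Dear counsel")
        (production / "ledger.xls").write_bytes(b"1,2,3")
        manifest, set_hash = hash_production(production)

        # Alter one file's contents and hash the set again.
        (production / "ledger.xls").write_bytes(b"1,2,4")
        changed_manifest, changed_set_hash = hash_production(production)
        return manifest, set_hash, changed_manifest, changed_set_hash


manifest, set_hash, changed_manifest, changed_set_hash = demo_production()
for name, digest in manifest.items():
    print(digest, name)
print("set:", set_hash)
```

Kept alongside the pristine set, a manifest like this lets the producing party show later which files, if any, differ from what was handed over.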
You are absolutely correct that, if the party receiving the production opens the file with the native applications from something other than read-only media, they run the risk of changing the contents of the files (e.g., the embedded properties a/k/a the application metadata). So, they must work from a copy, just as we would expect them to do if they were planning to mess around with a paper set of production (I think most careful lawyers keep a pristine original of the paper document production set they receive, free of their own notes, highlighting, etc., and work on a copy). Of course, it’s much, much easier, faster and cheaper to make a working copy of digital information than of a paper counterpart.
One saving grace in all of this is so-called “read only media.” Unless they have been “write protected” by the use of special tools or software, electromagnetic storage media, such as hard drives, floppies and tape, are read-write media. That is, you can easily alter their contents. Happily, most recordable optical media (CD-Rs and DVD-Rs) are read only media, such that, once written, they will not change with subsequent access. When we produce ESI on optical media, we are supplying a pristine copy that won’t change with usage because it is “read only.” There are such things as re-writable optical media (called CD-RW and DVD-RW), but they are rarely seen today and not used for e-production.
I wrote a short column on some of this many years ago called “In Praise of Hash,” but I think it’s still useful. Take a look at it and then please think about these questions. Will the hash value of a file change if you:
- Change the file’s extension from .DOC to .JPG?
- Copy the file to a thumb drive?
- Save a Microsoft Word file as a PDF?
- E-mail the file to me as an attachment and I forward it back to you?
- Paste the text of a Word document into the body of an e-mail, mail it to yourself and then carefully paste the exact same text into a new Word document?
- Print a Word document using the MS Word program but do not otherwise save or change the contents of the document?
- Rename the document in Windows without opening it in Word?
- Rename the document by saving it under a new name using Word?
1) No. The name of a file is a system metadata value stored in the master file table outside of the file, so changing a file’s name or extension won’t alter its contents and so won’t change its hash value.
2) No. Copying will change a file’s system metadata values, including its location, creation date and last accessed date, but as these are all stored outside of the file, the hash shouldn’t change.
3) Yes. Word documents and Adobe PDF documents may look the same onscreen, but they are encoded in entirely different ways. Hence, changing the format of the file changes the file’s contents and changes its hash value.
4) No. Attachments to e-mails are encoded to base64 for transmission, but they are decoded back to their native formats on arrival without altering the contents of the file. They should hash identically no matter how many times they ride the digital rails as an e-mail attachment.
5) Yes. This is a tricky one, but you can be certain the files will not hash identically due to differences in embedded data, particularly the file create time. Yes, Created Date is a system metadata value, but Microsoft Word embeds a second, more resilient create date in the file itself. Because the data in the file is slightly different, the hash values will be different.
6) Yes, because Microsoft Office stores a document’s last printed date inside the file. When application metadata changes, hash values change.
7) No. You probably got this one right. A file’s name is stored in the master file table, outside the file, so it can be changed without changing the contents of the file, keeping the file’s hash value the same.
8) Yes. Another tricky one. Word documents are somewhat unique in storing certain names given the file within the file as an embedded property. When you change a file’s name from within Word by saving the file under a new name, you alter the embedded file name and thus change the hash value. You may be changing other embedded metadata as well, but any change within the file is enough to alter the hash value.
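Two of these answers are easy to check with a few lines of Python. The sketch below is illustrative only. It assumes SHA-256, and it builds the e-mail with Python's standard `email` package, which base64-encodes binary attachments by default. The "document" layout with a last printed field is made up to stand in for embedded application metadata and is not the real Word file format. The dates are invented, too.

```python
import hashlib
from email import message_from_bytes, policy
from email.message import EmailMessage


def sha256(data):
    """SHA-256 hex digest of a bytes value."""
    return hashlib.sha256(data).hexdigest()


def mail_round_trip(data, filename="memo.doc"):
    """Attach bytes to a message, serialize it, parse it, detach them."""
    msg = EmailMessage()
    msg["Subject"] = "Production"
    msg.set_content("See attached.")
    msg.add_attachment(data, maintype="application",
                       subtype="octet-stream", filename=filename)
    wire = msg.as_bytes()  # what would travel between mail servers
    received = message_from_bytes(wire, policy=policy.default)
    attachment = next(received.iter_attachments())
    return attachment.get_content()  # decoded back to the original bytes


def build_document(body, last_printed):
    """Hypothetical file layout with one embedded metadata field."""
    return (b"BODY:" + body.encode("utf-8")
            + b"\nLASTPRINTED:" + last_printed.encode("utf-8") + b"\n")


# Question 4: ride the digital rails three times.
attachment = bytes(range(256)) * 4  # stand-in for a native file
traveled = attachment
for _ in range(3):
    traveled = mail_round_trip(traveled)
print("attachment hash unchanged:", sha256(attachment) == sha256(traveled))

# Questions 5, 6 and 8: same visible text, one embedded value differs.
d1 = sha256(build_document("Same visible text", "2012-03-01T09:00"))
d2 = sha256(build_document("Same visible text", "2012-03-01T09:01"))
print("embedded change alters hash:", d1 != d2)
```

Even though the two made-up documents differ by a single character, their digests are completely different. A hash can't tell a trivial change from a wholesale rewrite. It only shows that the contents are no longer identical.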