I’m teaching e-discovery at the University of Texas Law School this semester, and though it’s been a lot of work, and challenging to conform my peripatetic practice to a fixed routine, I love being back in the classroom with bright students. So far, I’m pretty sure no one in the class has learned more than I have.
I’ve learned that however witty I imagine I might be in front of a lawyer audience, I’m not funny in the slightest to a bunch of stressed out 2Ls. And, I’ve discovered that I need fresh technology metaphors because references to pre-1990 devices draw blank stares. Despite the resurgent coolness of vinyl, twenty-somethings have never heard of a “tone arm” nor experienced an “auto reverse” cassette deck. Of course, what were you thinking, old timer?!?!
Unlike practicing lawyers, law students don’t devote all their creative ingenuity to fashioning arguments why they can’t (or shouldn’t have to) learn the nuts and bolts of information technology. I tell the class it will be on the midterm, and they have all the motivation they need to wrap their nimble noggins around sectors and clusters, hashing and hex. The power to test those you teach is awesome, and may be what’s missing from CLE. You can bet you’d see better speakers and more attentive listeners if attendees had to pass a test on the material to get their CLE credit. But I digress.
The greatest delights are the excellent questions from the students. I thought I’d start posting a few of the better ones here with my replies, in case what they want to know happens to be what’s bugging you, too.
KB asked “In class you explained that computing a hash value is ‘digitally fingerprinting’ a file, so that there is a record of what you produced natively. As far as I understood it, if the document is then changed at all (such as by being accessed on a different date) the hash value of the document will change because the document properties have changed. I’m confused then about the value of this as a means to identify what you produced because of how easily it can be changed. Or have I misunderstood how easily the hash number can later be changed?”
I replied:
It’s a very good question, and a source of confusion to many in e-discovery.
One of the points I’m trying to nail down with the class is the distinction between information that resides within a file and that which resides without it. It’s an important distinction for, inter alia, preservation, production and authentication. We touch on this crucial distinction in the context of metadata. Application metadata resides within the file and moves with the file, not changing unless the contents of the file are altered. System metadata resides outside the file and can be altered without impacting the contents of the file. Hashing the file hashes its contents, not information about the file. That is, you only hash what’s stored inside the file, not its system metadata.
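To see the inside/outside distinction in action, here is a minimal sketch in Python. It assumes SHA-256 as the hash algorithm (any algorithm would show the same behavior), and the file names and the `demo` helper are made up for illustration. It hashes a file, renames it, resets its timestamps, and only then changes a byte of its contents.

```python
import hashlib
import os
import tempfile

def sha256_of_file(path):
    """Hash only the bytes stored inside the file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()

def demo():
    """Return the hash before and after three kinds of change (illustrative only)."""
    with tempfile.TemporaryDirectory() as folder:
        original = os.path.join(folder, "memo.doc")
        with open(original, "wb") as f:
            f.write(b"Contents of the memo")
        before = sha256_of_file(original)

        # Renaming (even changing the extension) touches only system metadata.
        renamed = os.path.join(folder, "memo.jpg")
        os.rename(original, renamed)
        after_rename = sha256_of_file(renamed)

        # Resetting the accessed and modified times also leaves the contents alone.
        os.utime(renamed, (1_000_000_000, 1_000_000_000))
        after_touch = sha256_of_file(renamed)

        # Appending a single byte changes what is stored inside the file.
        with open(renamed, "ab") as f:
            f.write(b".")
        after_edit = sha256_of_file(renamed)
    return before, after_rename, after_touch, after_edit
```

The first three values returned by `demo()` should match one another, and the fourth should differ, because only the last step changed anything inside the file.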
In class, I asked how one might access a file to investigate its contents without changing the file and the class correctly answered, “You could make a working copy of the file without changing the file you’re seeking to preserve.” That’s possible because certain file operations–copying, moving–don’t alter the contents of a file, although they necessarily change system metadata about the file. Think about why that must be the case by hearkening back to the card catalog and library book analogy I use to explain file systems. You can do anything you want to the card in the card catalog, but nothing you do to the cards changes the books in the stacks.
System metadata has to change when you make a copy of a file because the copy has to be stored somewhere, and the system has to track where the copy resides, and the new file’s name, size and MAC dates (Modified/Accessed/Created), including the date it was created on that system. Importantly, a “creation date” in a Windows master file table doesn’t mean “date authored.” It means date created on that storage medium, so copies of files typically have creation dates later than the date they were last modified! When a lawyer sees a file that’s been modified before it was created, the lawyer may think “fraud!” When a forensic examiner sees it, the examiner thinks “copy.” It’s important for lawyers to understand the disconnect between use of the term “creation” to mean authoring in casual conversation and the meaning afforded “creation” when referencing metadata. It’s yet another area where the use of precise technical language needs to matter to lawyers, especially when discussing technical topics with technicians.
So, back to hashing. If you hash the file, you have a reliable way (the “digital fingerprint”) to establish the precise content of the file when produced. The producing party maintains a pristine set of production, just as they might have kept a true and correct set of a paper production in the old days. They hash the digital production (individually and/or collectively) to be able to later prove the statement, “This is exactly what we gave you in response to your Request for Production No. 6.”
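Hashing a production "individually and/or collectively" might be sketched as below. This is one possible scheme, not a standard: the `build_manifest` and `verify` names are hypothetical, SHA-256 is an assumed choice of algorithm, and the collective value is simply a hash over the sorted per-item hashes so that it does not depend on file names.

```python
import hashlib

def hash_bytes(data):
    """Return the SHA-256 hex digest of a file's contents."""
    return hashlib.sha256(data).hexdigest()

def build_manifest(production):
    """production maps a file name to its bytes.

    Returns (per-item hashes, one collective hash over the whole set).
    """
    per_item = {name: hash_bytes(data) for name, data in production.items()}
    combined = hashlib.sha256()
    # Sort the item hashes so the collective value ignores names and ordering.
    for item_hash in sorted(per_item.values()):
        combined.update(item_hash.encode("ascii"))
    return per_item, combined.hexdigest()

def verify(production, manifest):
    """True if the set in hand hashes exactly as the recorded manifest says."""
    return build_manifest(production) == manifest
```

A producing party could record the manifest at the time of production and later run `verify` against its pristine set to support the statement that this is exactly what was produced.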
You are absolutely correct that, if the party receiving the production opens the file with the native applications from something other than read-only media, they run the risk of changing the contents of the files (e.g., the embedded properties a/k/a the application metadata). So, they must work from a copy, just as we would expect them to do if they were planning to mess around with a paper set of production (I think most careful lawyers keep a pristine original of the paper document production set they receive, free of their own notes, highlighting, etc., and work on a copy). Of course, it’s much, much easier, faster and cheaper to make a working copy of digital information than of a paper counterpart.
One saving grace in all of this is so-called “read only media.” Unless they have been “write protected” by the use of special tools or software, electromagnetic storage media (hard drives, floppies and tape) are read-write media. That is, you can easily alter their contents. Happily, most recordable optical media (CD-Rs and DVD-Rs) are read only media, such that, once written, they will not change with subsequent access. When we produce ESI on optical media, we are supplying a pristine copy that won’t change with usage because it is “read only.” There are such things as re-writable optical media (called CD-RW and DVD-RW), but they are rarely seen today and not used for e-production.
I wrote a short column on some of this many years ago called “In Praise of Hash,” but I think it’s still useful. Take a look at it and then please think about these questions. Will the hash value of a file change if you:
1) Change the file’s extension from .DOC to .JPG?
2) Copy the file to a thumb drive?
3) Save a Microsoft Word file as a PDF?
4) E-mail the file to me as an attachment and I forward it back to you?
5) Paste the text of a Word document into the body of an e-mail, mail it to yourself and then carefully paste the exact same text into a new Word document?
6) Print a Word document using the MS Word program but do not otherwise save or change the contents of the document?
7) Rename the document in Windows without opening it in Word?
8) Rename the document by saving it under a new name using Word?
Answers:
1) No. The name of a file is a system metadata value stored in the master file table outside of the file, so changing a file’s name or extension won’t alter its contents and so won’t change its hash value.
2) No. Copying will change a file’s system metadata values, including its location, creation date and last accessed date, but as these are all stored outside of the file, the hash shouldn’t change.
3) Yes. Word documents and Adobe PDF documents may look the same onscreen, but they are encoded in entirely different ways. Hence, changing the format of the file changes the file’s contents and changes its hash value.
4) No. Attachments to e-mails are encoded to base64 for transmission, but they are decoded back to their native formats on arrival without altering the contents of the file. They should hash identically no matter how many times they ride the digital rails as an e-mail attachment.
5) Yes. This is a tricky one, but you can be certain the files will not hash identically due to differences in embedded data, particularly the file create time. Yes, Created Date is a system metadata value, but Microsoft Word embeds a second, more resilient create date in the file itself. Because the data in the file is slightly different, the hash values will be different.
6) Yes, because Microsoft Office stores a document’s last printed date inside the file. When application metadata changes, hash values change.
7) No. You probably got this one right. A file’s name is stored in the master file table, outside the file, so it can be changed without changing the contents of the file, keeping the file’s hash value the same.
8) Yes. Another tricky one. Word documents are unusual in storing certain names given the file within the file as an embedded property. When you change a file’s name from within Word by saving the file under a new name, you alter the embedded file name and thus change the hash value. You may be changing other embedded metadata as well, but any change within the file is enough to alter the hash value.
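Answers 3 and 4 can be checked with a few lines of Python. This is a rough sketch under stated assumptions: the `attachment` bytes are a stand-in for a real native file, SHA-256 is an arbitrary choice of algorithm, and the two text encodings are only an analogy for "same words, different file format," not a real Word-to-PDF conversion.

```python
import base64
import hashlib

def sha256_hex(data):
    """Return the SHA-256 hex digest of a byte string."""
    return hashlib.sha256(data).hexdigest()

# Stand-in for the bytes of a native file attached to an e-mail.
attachment = bytes(range(256)) * 4

# Base64 is how the attachment travels; decoding restores the original bytes.
encoded = base64.b64encode(attachment)
decoded = base64.b64decode(encoded)
survives_email = sha256_hex(decoded) == sha256_hex(attachment)

# The same words stored in two different encodings are different bytes,
# so they hash differently even though they read the same onscreen.
as_utf8 = "Quarterly report".encode("utf-8")
as_utf16 = "Quarterly report".encode("utf-16")
format_changes_hash = sha256_hex(as_utf8) != sha256_hex(as_utf16)
```

Both `survives_email` and `format_changes_hash` should come out true: the round trip through base64 is lossless, while re-encoding the same text produces a different file.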
jimshook said:
Craig, as always great stuff and nicely explained. In particular the Q&A section really helps to confirm an understanding of this process.
What do you see as the most important practical use of hashes by lawyers in civil litigation? Hashes are clearly a must for systems to maintain and to verify “objects” as they are processed, but that’s on the machine level. Do you see much hands-on use in the courtroom or even in motion practice? Could we use them more fully, say, as part of authentication?
Craig Ball said:
Dear Jim:
Thanks for the kind words. Especially appreciated coming from one as steeped in EDD as you.
It’s a fair question. One could argue–as many lawyers do–that knowing the technology is overkill because lawyers can simply hire someone to do it for them. That’s just incompetence at a client’s cost, and I think you and I both see the value in lawyers understanding hashing, so I took your question to ask whether lawyers should ever have occasion to actually do hashing.
I do foresee many lawyers being hands on with discovery tools in the future in much the same way as I use them today. So, I do see lawyers using homemade hash sets to tag or exclude known files and hashing to authenticate or challenge native items claimed to be unaltered from the file as produced. As you say, some of their use of hashing may be shrouded in an app, like near-deduplication, or, as I hope and expect, it will be something that post-digital lawyers will simply learn to use because it’s simple, powerful and cheap.
Just today, a lawyer claimed two files were identical because they had the same name and, on quick perusal, looked much the same. Had the lawyer had a free hash app installed, he could have simply right-clicked the files and hashed both instantly. I use hashes as a quick, unique identifier, looking just to the first or last four characters of the hash value (16^4) for a quick-and-dirty way to compare groups of files. This is especially useful when files have been renamed to conform to Bates numbers. Yes, I also foresee a time when an opposing counsel using a native file as evidence might have to produce the hash of the file to demonstrate its authenticity, if challenged.
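The quick-and-dirty comparison by the first four characters of a hash could look something like the sketch below. The function names and sample file names are invented for illustration, and SHA-256 is an assumed algorithm. Four hexadecimal characters allow 16^4 = 65,536 possible values, so a matching short identifier suggests, but does not prove, that two files are identical; the full hash settles it.

```python
import hashlib
from collections import defaultdict

# Four hex characters can take 16 * 16 * 16 * 16 distinct values.
POSSIBLE_VALUES = 16 ** 4

def short_id(data, length=4):
    """Return the first few hex characters of the file's hash."""
    return hashlib.sha256(data).hexdigest()[:length]

def group_by_short_id(files):
    """files maps a file name to its bytes; group names by short identifier."""
    groups = defaultdict(list)
    for name, data in files.items():
        groups[short_id(data)].append(name)
    return dict(groups)
```

Files renamed to Bates numbers land in the same group as their originals because the name plays no part in the hash.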
So, right now, hashes are for lawyer geeks like me, much as computers and the Internet were early in their history. But others will come along. They always do.
Ralph Losey said:
Your questions 5, 6 and 8 are devilishly tricky. Excellent article on one of my favorite topics. Now if only lawyers would let go of their old Bates stamps!
Rosa Waller said:
Hi Craig — I opened a Word document, went to File > Print and then closed without saving. The Hash did not change. What am I missing? I am using WinMD5Free.exe to hash the file. Thanks!
craigball said:
Did your Last Printed Date in the Properties tab (within Word) update to reflect that you printed the document from within Word? If it did, the hash would change (because the file changed). If not, I can’t say what’s going on without some testing. Sorry.
BernardPollardIsAnAss said:
Craig,
Thanks for the educational article, as well as all the other great information you offer. I completed viewing of your Ten Nerdy Things CLE last night and found it, and the reading materials, to be very helpful in understanding some of the more technical points of e-Discovery. When reading this post and quiz I had a question pop into my mind. In the case where one Word document (the child document) is embedded within another Word document or a spreadsheet (a standalone parent, not attached to an e-mail), would the parent document hash value change if someone made a change to the contents of the embedded Word document? I have seen this example on recent reviews where a meeting minutes summary will have other documents embedded within. My guess is that the hash values of both documents must be changed as a result. Thanks again.
craigball said:
Hard to say in the abstract, and will likely depend on the method by which the documents are linked or embedded as well as the version of the apps. Office applications employ a linking method called OLE (for Object Linking and Embedding). If the content in the parent is merely a pointer to the changed file, I wouldn’t expect the parent file’s hash to change. If any embedded data changes, then the hash changes. I’d have to test to say for sure.
Nick Carl said:
Craig,
I thoroughly enjoyed reading your informative article. It does a great job of summarizing many of the concerns regarding collection. I’d like your thoughts on a challenge that has come to me recently. File hashing of the file’s data and application metadata ensures accuracy when copying and moving ESI, but I came across a statement in which the author said vendors may not create the hash in the same way. Obviously, MD5 and SHA-1 would yield different values, but I believe they were implying that different parts of the file may be included in the hash. Have you ever heard of this? Thanks!
craigball said:
Yes. I expect they were referencing a form of near-deduplication employed when the items being hashed are deemed identical in practical terms but incorporate some bytewise differences that, although they alter the hash value, are deemed unimportant for purposes of comparison. The best example would be e-mail messages sought to be deduplicated as between copies of the same message dispatched to different recipients (e.g., CCs). These “same” messages will necessarily reflect different header data for different addressees and so will hash differently, even though we want to treat them as being “identical” for deduplication purposes.
To deduplicate these, the common practice is to hash segments of the message source so as to exclude the parts of the message deemed irrelevant from the standpoint of deduplication. These segment hashes are then compared to determine relative identicality.
The problem you reference stems from the potential for different service providers and tools to generate and compare these segment hashes in varying ways (e.g., in terms of how they resolve the order of addressees or concatenate segments).
But these are differences attributable to implementation. That is, they are not hashing exactly the same data, so they get different hash values. Otherwise, the same hash algorithm will generate the same hash value when the data being hashed is exactly the same data.
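A toy version of segment hashing shows why two tools can disagree. Everything in this sketch is an assumption made for illustration: the `segment_hash` name, the choice of fields, the whitespace normalization, the separator character and SHA-256 are not drawn from any particular vendor's method.

```python
import hashlib

def segment_hash(message, fields=("from", "date", "subject", "body")):
    """Hash only the chosen segments of a message, joined in the given order.

    message is a dict of field name to text; fields not listed are ignored.
    """
    parts = []
    for field in fields:
        value = message.get(field, "")
        # Collapse runs of whitespace so trivial reflowing does not matter.
        parts.append(" ".join(value.split()))
    joined = "\x1f".join(parts)  # separator keeps adjacent fields from blurring
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()

copy_for_alice = {
    "from": "craig@example.com",
    "to": "alice@example.com",
    "date": "Mon, 1 Jan 2024 09:00:00 -0600",
    "subject": "Production set",
    "body": "Attached is the production.",
}
# Same message as delivered to a different addressee.
copy_for_bob = dict(copy_for_alice, to="bob@example.com")

same_message = segment_hash(copy_for_alice) == segment_hash(copy_for_bob)

# A second tool that concatenates the same segments in another order
# hashes different data and so reports a different value.
other_order = ("subject", "from", "date", "body")
tools_disagree = segment_hash(copy_for_alice, other_order) != segment_hash(copy_for_alice)
```

Both `same_message` and `tools_disagree` should be true: excluding the addressee makes the copies match, while a different concatenation order changes the data being hashed and therefore the result.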
I hope that clarifies more than complicates.
Nick Carl said:
It does clarify, thanks.
One more question, that hopefully isn’t too far off topic. I’ve read comments here and there about the need to preserve ALL file metadata during processing. Realizing the impact of the file system to creation and last accessed dates, how are these dates perceived by courts and juries? It seems that the chain of custody would have to be pretty well documented to prove that these dates hadn’t been altered by any copying process.
Have you seen these dates called into question?
Thanks!
craigball said:
Sure, I’ve seen the dates called into question when they were corrupted or misunderstood. As a general proposition with notable exceptions (e.g., databases), I certainly think it behooves litigants to preserve “all” application metadata because application metadata is part-and-parcel of the evidence file. In that regard, I’m simply saying that litigants should preserve the integrity of the electronic evidence items themselves, which isn’t too radical a proposition.
As to system metadata (and, again, speaking generally, not universally), there is a complement of system metadata that I believe should be routinely collected and preserved for electronic evidence, including the filename, last modified date, file path, custodian and/or originating source identification, file size and hash value. When the integrity of the evidence is legitimately called into question, other system metadata values and contextual information may be implicated; but, in the run-of-the-mill e-discovery situation, collecting and producing the “dogtag” values I’ve listed strikes me as the right balance of utility and burden. Last accessed dates are unreliable and creation dates are misleading. They aren’t the dates I value most when it comes to loose files. When it comes to e-mail messages, the complete RFC5322 mail content is the most important data to be preserved. I’m generally less interested in container file metadata.
Kumarshankar said:
Craig,
A very nice article on hashing and its practical importance. I work in the field of digital forensics and was researching hash values and their dependency on various factors of a document. While doing the research I came across a very weird situation: the hash value of a Word document changed on altering the properties of the file such as “Author Name, comments, Title, Subject” etc. in the “Details” tab while viewing the properties of the Word file. When I read your article and some other articles on the internet, I came across the terms “System Metadata” and “Application Metadata”. Now, as per the information found on the internet, the properties of the Word document fall under System Metadata, which has nothing to do with the contents of the file and should not alter the hash value of the file. However, every time the properties are changed, the hash value changes. This indicates that the properties of the file such as “Title, Subject, Tags, Author’s Name” etc. have some relation with the contents of the document or Application Metadata. Can you throw some light on this, as I am indeed very confused and am not able to find a suitable explanation for this?
Thanks in Advance
Kumarshankar
craigball said:
Thank you for enjoying the article. Somehow, some bad information has made it onto the Internet. First time for everything, right? 😉
There is nothing “weird” about what you describe. It’s normal. The Internet is weird.
It’s common for people to misunderstand the Properties pane of a Word document. As displayed, Properties are composed of a mix of application and system metadata for the convenience of users who couldn’t care less where those data points reside. Application metadata is part of the file and resides within the file. Accordingly it’s hashed when the file is hashed (again, because it’s part of the file) and travels with the file when copied or transmitted. That is, application metadata is CONTENT. By contrast, system metadata resides outside the file (typically in the Master File Table–MFT–of a Windows system). Examples include Modified, Accessed and Created (MAC) dates as well as the file’s name, location, archive status and other parameters. System metadata is CONTEXT.
So, if we consider your examples (“Title, Subject, Tags, Author’s Name”), all of these are Application Metadata and reside within the file (remember, Title and Subject are not synonymous with file name). Change any application metadata value and the hash of the file changes. Rename the Word file (using the OS, not using the Word app) and the hash value of the Word file will not change. That’s because the renaming only effects a change in the MFT, not within the file.
Now, here’s where almost EVERYONE gets confused, so sit down: The MAC dates and file name are NOT stored within a Word file because they are system metadata. HOWEVER, there are similar values that ARE stored within a Word file (such as the originating name given the file and its creation date measured in 100ns intervals from 1/1/1601). These confuse the hell out of people who can’t keep the values straight, especially when they see them presented side-by-side in the Properties pane. You’ve just got to know the difference and keep them straight in your head to do forensics (to do it reasonably well, at any rate). The Properties pane pulls data from the file (application metadata) AND the MFT (system metadata) and presents them together for convenience. Does that help?
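For anyone who meets one of those raw values, a date counted in 100ns intervals from 1/1/1601 can be converted to a readable date with a short sketch like this. The function name is made up, and treating the value as UTC is an assumption; where a given application stores the value, and how to extract it, is outside this example.

```python
from datetime import datetime, timedelta, timezone

# The counting starts at midnight on January 1, 1601 (assumed here to be UTC).
FILETIME_EPOCH = datetime(1601, 1, 1, tzinfo=timezone.utc)

def filetime_to_datetime(ticks):
    """Convert a count of 100-nanosecond intervals since 1/1/1601 to a datetime."""
    # Ten 100ns intervals make one microsecond, the finest unit datetime keeps.
    return FILETIME_EPOCH + timedelta(microseconds=ticks // 10)
```

As a sanity check, `filetime_to_datetime(116444736000000000)` lands on midnight, January 1, 1970 UTC, the familiar Unix starting point.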
Moral of the story: Don’t believe everything you read on the Internet. Investigate and run your own tests to determine what’s accurate.
Kumarshankar said:
Thank you so much for clearing my doubt. Hope to communicate more with you in the future.