If you’re on this turf, chances are you already know that de-NISTing is a technique used in e-discovery and computer forensics to reduce the number of files requiring review by excluding standard components of the computer’s operating system and off-the-shelf software applications like Word, Excel and other parts of Microsoft Office. Everyone has this digital detritus on their systems; things like Windows screen saver images, document templates, clip art, system sound files and so forth. It’s the stuff that comes straight off the installation disks, and it’s just noise to a document review.
It’s called “de-NISTing” because those noise files are identified by matching their hash values (i.e., digital fingerprints) to a huge list of software hash values maintained and published by the National Software Reference Library, a project of the National Institute of Standards and Technology (NIST). The NIST list is free to download, and pretty much everyone who processes data for e-discovery and computer forensic examination uses it. If you’re paying a vendor to de-NIST, you probably think you’re getting value for the service. I expect nearly everybody who de-NISTs believes that they’re culling the most common operating system and application files. I mean, that’s the whole point, right?
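For readers who want to see the mechanics, here is a minimal sketch of hash-based culling in Python. It is an illustration under stated assumptions, not any vendor's actual process: the function names are hypothetical, and the known-file list is assumed to be a plain text file with one hex SHA-1 per line, which is a simplification and not the format NIST publishes.

```python
import hashlib
from pathlib import Path


def sha1_of_file(path, chunk_size=1 << 20):
    """Return the uppercase hex SHA-1 of a file, read in chunks."""
    digest = hashlib.sha1()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest().upper()


def load_known_hashes(list_path):
    """Load one hex hash per line (assumed format), skipping blank lines."""
    with open(list_path, encoding="ascii") as handle:
        return {line.strip().upper() for line in handle if line.strip()}


def de_nist(root, known_hashes):
    """Split the files under root into (culled, kept) by hash membership."""
    culled, kept = [], []
    for path in sorted(Path(root).rglob("*")):
        if path.is_file():
            target = culled if sha1_of_file(path) in known_hashes else kept
            target.append(path)
    return culled, kept
```

A file is culled only when its content hash appears in the list, so a renamed copy of a known file is still caught, while a known file that has been altered by even one byte is kept for review.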
Sorry to burst your bubble.
Earlier this summer, I began to wonder why de-NISTing was doing such a poor job reducing the volume of files in systems I’d collected for review. These were late-model systems running Windows Vista or Windows 7 and the latest release of Microsoft Office. That is, they were the sort of machine one is likely to encounter in millions of homes and businesses today.
The NIST list is updated four times a year, and I was using the very latest release; but, most of the noise files I expected would be excluded by de-NISTing weren’t going away. So, I ran a test. I created a pristine install of Windows 7 on a sterile hard drive. The pristine install consisted of 47,690 files, and everything on the drive that wasn’t fashioned on the fly as part of the install process came straight off the Windows installation disk.
But, do you know how many of those 47,690 files were on the latest NIST list? Just 7,277! That’s right, the NIST list misses 85% of the files in a pristine Windows 7 installation.
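The arithmetic behind that figure is easy to check. The short Python sketch below uses only the two counts reported above; the `coverage` helper is a hypothetical name introduced for illustration.

```python
def coverage(matched, total):
    """Return the fraction of files whose hashes appeared on the list."""
    if total <= 0:
        raise ValueError("total must be a positive file count")
    return matched / total


# Counts reported for the pristine Windows 7 install.
matched, total = 7_277, 47_690
hit = coverage(matched, total)
print(f"matched: {hit:.1%}, missed: {1 - hit:.1%}")
# prints "matched: 15.3%, missed: 84.7%"
```

Rounded to the nearest whole number, the 84.7% miss rate is the 85% quoted above.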
Some of you surely share my astonishment. The rest of you are rightly thinking, “Craig really needs to get out more.”
But seriously folks, that’s a terrible performance, and it translates into real, honest-to-goodness wasted wampum for litigants when the noise files that should have been culled pass through one of the pay-by-the-gigabyte tolls downstream.
I did some exploring and found that one reason the NIST list missed so many noise files is that NIST hasn’t yet processed Windows 7 for addition to the list. More than 350 million machines run Windows 7, but apparently none at NIST. Arrrgh! What’s more, the NIST list doesn’t include the components of Microsoft Office 2010 either. Only 100 million machines run Office 2010.
The purpose of this post isn’t to disparage some overworked government technician trying to catch up with last year’s work. Instead, I’m asking whether some vendors are using hash lists that they call NIST lists but that they’ve actually cobbled together on their own. If you can trace and defend your process for augmenting the NIST list, great. There’s nothing wrong with making your own exclusionary hash set; but, don’t try to pass it off as the official, government-issued NIST list. Your Prada knockoffs may be pretty, but they aren’t the real McCoy.
As with Prada bags and Rolex watches, authenticity is a key component of value and inspirator of confidence. We don’t quarrel about de-NISTing because the roster of items excluded derives from a government agency through a controlled, transparent process anyone can validate or test, as I did. When vendors employ proprietary, undocumented exclusion mechanisms for ESI under the rubric of de-NISTing, it may be a better process (or not); but, it’s not a trustworthy process.
Michael said:
Craig,
I can certainly understand your frustration. However, I must comment on several of your observations. First of all, as you correctly point out, the NIST hash sets are free. You get what you pay for, and I hardly think it’s fair to criticize NIST if they are behind in publishing updated lists for use by vendors in ediscovery. So what! I think that if the vendors are that helpless then they should get what they pay for.
Here’s a thought! Maybe if some vendors chipped in a little and provided a little funding or assistance for the NIST organization instead of doing everything they can to make a quick buck off of something they can get for free then the industry might be headed in a better direction.
As you point out, consumers who do not know what they are actually buying are oftentimes coaxed into believing the quick, silver-tongued quips of the marketing teams and their claims that they effectively eliminate ALL NON USER based files prior to processing. This is a FALSE claim, even if they did have an up-to-date list. The fact of the matter is that there will never be a complete and up-to-date list containing every single file that is a part of an application and has no value. Obviously, the reason for this is that it is impossible to gather every single program and create these lists.
Concerning your comment about cobbled together lists: you know as well as I do that if I were to document the process by which I obtained an MD5 or SHA1 hash value for the content of a given file and added that as a supplement to the values supplied by NIST, then as long as my process for obtaining those values was repeatable by another person, using the same set of circumstances, even with different software, those values would be defensible. Not even you, one of the smartest, silver-tongued lawyers I am aware of when it comes to electronic discovery, would be successful in disputing that. All that simply needs to be advertised by the vendor is that they use a combination of digital fingerprint values derived from multiple sources to accomplish the elimination of known program related files. Pretty simple.
Hash values have come to be of good use for many different things, not just elimination of files. Hash values can also be used to find files that you are looking for, such as stolen IP, trade secrets, customer lists, etc. What are we going to say here, that since the hash value of a known non system type file was matched to a file on another computer in the search for stolen gems, the find is no good since it’s not part of the current NIST library?
Since you took the time to do a clean install of Windows 7 to point out the number of files that were missing in the NIST hash sets, might I suggest you make a hash set of those files using one of the many applications at your disposal (I know you know how to do this, but if not, feel free to email me and I will send you instructions).
Then, maybe, once you were done with that, you would be gracious enough to submit that hash set to NIST, which may reduce their workload a bit. Maybe they will use it, maybe they won’t, but at the end of the day no one could say that you pointed out an issue and weren’t willing to do anything to assist in a solution.
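As a rough illustration of the kind of documented, repeatable hash set described in this comment, the Python sketch below walks a directory and records SHA-1 and MD5 values with a provenance label. The function names, column names and `source_label` parameter are hypothetical choices made for this example; they do not reproduce NIST's format or any particular tool's output.

```python
import csv
import hashlib
from pathlib import Path


def hash_file(path, chunk_size=1 << 20):
    """Return (sha1_hex, md5_hex) for one file, reading it once in chunks."""
    sha1, md5 = hashlib.sha1(), hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            sha1.update(chunk)
            md5.update(chunk)
    return sha1.hexdigest().upper(), md5.hexdigest().upper()


def write_hash_set(root, out_csv, source_label):
    """Write one CSV row per file under root and return the file count.

    out_csv should live outside root so the manifest does not hash itself.
    """
    root = Path(root)
    count = 0
    with open(out_csv, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(["SHA-1", "MD5", "RelativePath", "Size", "Source"])
        for path in sorted(root.rglob("*")):
            if path.is_file():
                sha1, md5 = hash_file(path)
                writer.writerow([sha1, md5, path.relative_to(root).as_posix(),
                                 path.stat().st_size, source_label])
                count += 1
    return count
```

Because the values depend only on file content, anyone who runs a different hashing tool over the same installation media should arrive at the same SHA-1 and MD5 columns, which is what makes such a supplement checkable.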
craigball said:
Michael:
Thanks for the comment. I agree there’s nothing untoward about a vendor compiling their own well-documented, defensible list of non-user files and employing it for volume reduction. My point is that such a list shouldn’t be sold as being the NIST list. There is a value to the use of hash sets from a trusted government source such that any two lists from the same release will be identical.
I don’t take issue with most of what you say, but I hope you will forgive me a few quibbles. The NIST list is free in the same way that other government services are free. You and I paid for it, and I think we have the right any citizen possesses to demand that the services provided to us by government represent a good use of our resources.
A grassroots effort to fix the problem is a fine idea, except that the folks at NIST won’t accept donations of software from anyone other than the publisher. Further (and with good cause), they will not accept hash values you or I might submit. They have a process, and I respect that.
I’d already expressed to another reader my willingness to contribute money to buy the software about which I groused, and I know I’m not alone in such willingness among my colleagues. I didn’t know anyone WAS saying I was pointing out an issue but was unwilling to assist. In fact, I’m ready, willing and able to assist, if I can. And, yes, I will put my money where my mouth is.
I don’t take your point about the NIST list issues somehow impacting the use of hashes to look for contraband files (or de-duplication). One really has nothing to do with the other, except that hashing happens to be a tool employed in each.
Thanks again for sharing your thoughts and for the (sort of) flattering words.
Doug White said:
NSRL became aware of this blog post on Feb. 6, 2012. We try to keep up with items like this, but didn’t see it at the time.
We have had multiple Windows 7 installations in place against which we test the hashsets, and for a period, the Windows 7 OS coverage was well below our expectations. In Dec. 2011, we were able to identify a type of “container” file that we were not recursing into, we addressed that issue, and reprocessed all Win 7 OS media. These will be in our March 2012 2.36 release.
We have applied the hashset against our Win 7 Ultimate and Professional 32 and 64 bit installations. RDS 2.36 appears to identify a majority of the Win 7 OS files.
While the 2.36 CD ISOs are not yet available, a zip file containing a 2.36 “minimal” hashset, as described at http://www.nsrl.nist.gov/Downloads.htm#reduced, is online for a short time at http://www.nsrl.nist.gov/ftp/RDS_2.36/RDS_236m.zip so you may test the coverage of the latest version.
NSRL appreciates ALL feedback from the community. We endeavor to respond in a timely manner, and I encourage contacting us directly at nsrl@nist.gov to enable NSRL to turn a solution around within a publication cycle.
http://tinyurl.com/crealusk24781 said:
This unique blog post, “De-NISTing: De-FECTive Ball in your Court” was very good. I am printing out a backup to demonstrate to my associates. Regards, Forrest