A common question in investigations of alleged data theft is, “Are any of our stolen files on our competitor’s systems?”  Forensic examiners track purloined IP using several strategies, among them searching for matching filenames, hash values, metadata and content.  Any of these can be altered by data thieves seeking to cover their tracks, but most are too confident or too dim to bother.

A current matter underscored the pitfalls of filename and hash searches, prompting me to reflect on a long-ago case where hash searches caused headaches.  The old case stemmed from the settlement of a data theft claim that required periodic audits of hashes of the defendants’ data to ensure that stolen data hadn’t re-emerged.  The plaintiff sought sanctions because its expert found hash values in the audit that matched hashes tied to stolen PowerPoint presentations.  The defendants were dumbfounded, certain they’d adhered to the settlement and not used any purloined PowerPoints.

When I stepped in, I confirmed there were matching hash values, but none matched the PowerPoint PPT and PPTX files of interest.  Instead, the hashes matched only benign component image data within the presentations.  The components hashed were standard slide backgrounds (e.g., “woodgrain”) found in any copy of PowerPoint.  Both parties possessed PowerPoints using some of the same generic design elements, but none were the same presentations.  The hashing tool so thoroughly explored the files that embedded images were hashed separately from the files in which they were used and matched other generic elements in other presentations.  No threat at all!
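To make the component-matching problem concrete, here is a minimal Python sketch.  It relies on one fact, that a PPTX file is a ZIP container whose slides and images are stored as separate parts, and on several assumptions of mine: the function names, the use of SHA-256 and the toy "decks" (tiny ZIP archives, not valid presentations) are illustrative only and are not the tool or algorithm used in the case.  Legacy PPT files use a different container format and aren't covered here.

```python
import hashlib
import io
import zipfile


def sha256_hex(data: bytes) -> str:
    """Hash a byte string with SHA-256 (the algorithm choice is an assumption)."""
    return hashlib.sha256(data).hexdigest()


def container_and_part_hashes(file_bytes: bytes) -> tuple[str, dict[str, str]]:
    """Return the hash of the whole file plus one hash per embedded part."""
    whole = sha256_hex(file_bytes)
    parts: dict[str, str] = {}
    with zipfile.ZipFile(io.BytesIO(file_bytes)) as archive:
        for info in archive.infolist():
            if not info.is_dir():
                parts[info.filename] = sha256_hex(archive.read(info))
    return whole, parts


def build_demo_deck(slide_text: str, background: bytes) -> bytes:
    """Build a tiny ZIP standing in for a .pptx (hypothetical, not a real deck)."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("ppt/slides/slide1.xml", slide_text)
        archive.writestr("ppt/media/image1.jpeg", background)
    return buffer.getvalue()


# Two unrelated "presentations" that happen to use the same stock background.
woodgrain = b"stand-in bytes for a stock woodgrain background"
deck_a = build_demo_deck("<slide>Plaintiff sales plan</slide>", woodgrain)
deck_b = build_demo_deck("<slide>Defendant picnic agenda</slide>", woodgrain)

hash_a, parts_a = container_and_part_hashes(deck_a)
hash_b, parts_b = container_and_part_hashes(deck_b)

# Hashes that appear among the embedded parts of both decks.
shared_part_hashes = set(parts_a.values()) & set(parts_b.values())

print("Whole-file hashes match:", hash_a == hash_b)
print("Shared embedded-part hashes:", len(shared_part_hashes))
```

In this toy example, the two whole-file hashes differ while one embedded-part hash is shared.  A tool that reports part-level hashes alongside file-level hashes will surface that shared background as a "match" even though the presentations have nothing else in common.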

Still other matching files turned out to be articles freely distributed at an industry trade show and zero-byte “null” files that would match any similarly empty files on any machine.  When every hash match was scrutinized, none proved to be stolen data.  Away went the sanctions motion.

The moral of the story is, although it’s extremely unlikely that two different files will share the same hash value, matching hash values don’t always signify the “same” file in practical terms.  Matching files may derive from independent sources, could be benign components of compilations or might match because they hold little or no content.  The math is powerful, but it mustn’t displace common sense.
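One way to apply that common sense is to triage hash matches before anyone cries foul.  The sketch below is a hypothetical illustration, not a standard workflow: the `HashMatch` record, the `triage` function, the category labels and the idea of a `known_benign` hash list (for example, hashes of stock application files or widely distributed documents) are all assumptions of mine, as is the choice of SHA-256.

```python
import hashlib
from dataclasses import dataclass

# Hash of zero bytes of input; every empty file on every machine shares it.
EMPTY_SHA256 = hashlib.sha256(b"").hexdigest()


@dataclass
class HashMatch:
    """One reported match (hypothetical record layout)."""
    path: str
    sha256: str
    size: int


def triage(match: HashMatch, known_benign: set[str]) -> str:
    """Sort a hash match into a rough category for examiner follow-up."""
    if match.size == 0 or match.sha256 == EMPTY_SHA256:
        return "empty file"
    if match.sha256 in known_benign:
        return "known benign or widely distributed"
    return "needs examiner review"


# Illustrative data only; these are not hashes from any real matter.
stock_background = hashlib.sha256(b"stock slide background").hexdigest()
trade_show_article = hashlib.sha256(b"trade show handout").hexdigest()
unexplained = hashlib.sha256(b"something unexplained").hexdigest()

known_benign = {stock_background, trade_show_article}
matches = [
    HashMatch("placeholder.tmp", EMPTY_SHA256, 0),
    HashMatch("ppt/media/image1.jpeg", stock_background, 48_213),
    HashMatch("handout.pdf", trade_show_article, 310_552),
    HashMatch("roadmap.pptx", unexplained, 2_104_330),
]

for m in matches:
    print(f"{m.path}: {triage(m, known_benign)}")
```

A triage pass like this doesn't decide anything; it only separates matches that have an innocent explanation on their face from those that deserve a closer look.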

In the ongoing matter, the method used to identify contraband data was simple filename matching.  The requesting party sought to identify instances of a file called “Book3.xlsx,” and the search turned up hundreds of identically named files in the producing party’s data, though not a single one hash-matched the file of interest.

Why so many false positives?  It turns out Microsoft Excel assigns an incremented name to any new spreadsheet (despite earlier-opened sheets having been closed) so long as even one prior sheet remains open.  So, if you’ve created eight Excel spreadsheets, renamed them and closed all but one, the next new sheet will be named Book9.xlsx by default.  The name “Book3.xlsx” signified only that two prior spreadsheets had been created in the same session.  The takeaway is that, in any large collection, expect to turn up instances of various Book(n).xlsx files created when a user saved a sheet under its default name and exited without renaming it.
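A filename hit on a default name deserves a second test before it's treated as meaningful.  The following Python sketch flags names fitting the Book(n).xlsx pattern and keeps only the filename hits that also match the hash of the file of interest.  The function names and sample data are hypothetical, and the pattern assumes the English-language default stem "Book"; other language versions of Excel may use a different stem.

```python
import hashlib
import re
from pathlib import PureWindowsPath

# Matches default-style names such as Book3.xlsx (English-language stem assumed).
DEFAULT_EXCEL_NAME = re.compile(r"Book\d+\.xlsx", re.IGNORECASE)


def is_default_excel_name(path: str) -> bool:
    """True when the final path component looks like an unrenamed Excel default."""
    # PureWindowsPath accepts both backslash and forward-slash separators.
    name = PureWindowsPath(path).name
    return DEFAULT_EXCEL_NAME.fullmatch(name) is not None


def confirm_by_hash(name_hits: list[tuple[str, str]], target_hash: str) -> list[str]:
    """Keep only the filename hits whose hash equals the target file's hash."""
    return [path for path, file_hash in name_hits if file_hash == target_hash]


# Illustrative data only: (path, hash) pairs for files named like the target.
target_hash = hashlib.sha256(b"contents of the file of interest").hexdigest()
name_hits = [
    (r"C:\Users\pat\Documents\Book3.xlsx", hashlib.sha256(b"pat's budget").hexdigest()),
    (r"C:\Users\lee\Desktop\Book3.xlsx", hashlib.sha256(b"lee's roster").hexdigest()),
]

confirmed = confirm_by_hash(name_hits, target_hash)
print("Default-style names:", [p for p, _ in name_hits if is_default_excel_name(p)])
print("Confirmed by hash:", confirmed)
```

Here both hits carry a default-style name and neither is confirmed by hash, which mirrors what happened in the matter described above.  A hash mismatch doesn't rule out an altered copy, so content and metadata comparison may still be warranted.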

Electronic search, whether by hash, filename, metadata or keyword, is an invaluable tool in investigation and e-discovery, but it is best used with a modicum of common sense by those who appreciate its limitations.