This is the eighth in a series revisiting Ball in Your Court columns and posts from the primordial past of e-discovery, updating and critiquing in places, and hopefully restarting a few conversations.  As always, your comments are gratefully solicited.

The Path to Production: Harvest and Population

(Part III of IV)

[Originally published in Law Technology News, December 2005]

On the path to production, we’ve explored e-mail’s back alleys and trod the mean streets of the data preservation warehouse district.  Now, let’s head to the heartland: it’s data harvest time.

After attorney review, data harvest is byte-for-byte the costliest phase of electronic data discovery.  Scouring servers, local hard drives and portable media to gather files and metadata is an undertaking no company wants to repeat because of poor planning.

The Harvest
Harvesting data demands a threshold decision: Do you collect all potentially relevant files, then sift for responsive material, or do you separate the wheat from the chaff in the field, collecting only what reviewers deem responsive?  When a corporate defendant asks employees to segregate responsive e-mail (or a paralegal goes from machine to machine or account to account selecting messages), the results are “field filtered.” Today, we’d call this “targeted collection.”

Field filtering holds down cost by reducing the volume for attorney review, but it increases the risk of repeating the collection effort, loss or corruption of evidence and inconsistent selections.  If keyword or concept searches alone are used to field filter data, the risk of under-inclusive production skyrockets.
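To see why keyword-only field filtering risks under-inclusion, consider a minimal sketch in Python. The function and the sample messages are hypothetical; the point is simply that a message responsive in substance but worded differently slips through the cull.

```python
def field_filter(messages, keywords):
    """Keep only messages containing at least one keyword (case-insensitive).

    Illustrates the under-inclusion risk: responsive messages that use
    different vocabulary than the search terms are silently dropped.
    """
    kws = [k.lower() for k in keywords]
    return [m for m in messages if any(k in m.lower() for k in kws)]


# Hypothetical collection: the third message is responsive in substance
# but never uses the search term, so keyword culling misses it.
msgs = [
    "Re: Raptor entity accounting",
    "lunch tomorrow?",
    "the special purpose vehicle numbers look off",
]
print(field_filter(msgs, ["raptor"]))
```

Running the sketch returns only the first message; the third, equally responsive, never makes it into the review set.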

Initially more expensive, comprehensive harvesting (unfiltered but defined by business unit, locale, custodian, system or medium) saves money when new requests and issues arise.  A comprehensive collection can be searched repeatedly at little incremental expense, and broad preservation serves as a hedge against spoliation sanctions.  Companies embroiled in serial litigation or compliance production benefit most from comprehensive collection strategies.

A trained reviewer “picks up the lingo” as review proceeds, but a requesting party can’t frame effective keyword searches without knowing the argot of the opposition.  Strategically, a producing party requires an opponent to furnish a list of search terms for field filtering and seeks to impose a “one list, one search” restriction.  The party seeking discovery must either accept inadequate production or force the producing party back to the well, possibly at the requesting party’s cost.

Chain of Custody
Any harvest method must protect evidentiary integrity.  A competent chain of custody tracks the origins of e-evidence by, e.g., system, custodian, folder, file and dates.  There’s more to e-mail than what you see on screen, so it’s wise to preempt attacks on authenticity by preserving complete headers and encoded attachments.

Be prepared to demonstrate that no one tampered with the data between the time of harvest and its use in court.  Custodial testimony concerning handling and storage may suffice, but better approaches employ cryptographic hashing of data — “digital fingerprinting” — to prove nothing has changed.
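Hashing is easy to demonstrate. The sketch below, using Python’s standard-library hashlib, computes a SHA-256 fingerprint of a file in chunks; the function name and chunk size are illustrative choices, not a prescribed protocol. Hash the file at harvest, hash it again before use in court, and matching digests show the bytes are unchanged.

```python
import hashlib


def fingerprint(path, algorithm="sha256", chunk_size=65536):
    """Compute a cryptographic hash ("digital fingerprint") of a file.

    Reads in chunks so large evidence files don't have to fit in memory.
    Identical digests at harvest and at trial demonstrate the data was
    not altered in between.
    """
    digest = hashlib.new(algorithm)
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

In practice the digests themselves become part of the chain-of-custody record, logged alongside source system, custodian and collection date.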

There’s more to an e-mail than its contents: there’s metadata, too.  Each e-mail is tracked and indexed by the e-mail client (“application metadata”) and every file holding e-mail is tracked and indexed by the computer’s file system (“system metadata”).  E-mail metadata is important evidence in its own right, helping to establish whether and when a message was received, read, forwarded, changed or deleted.  Metadata’s evidentiary significance garnered scant attention until Williams v. Sprint, 2005 WL 2401626 (D. Kan. Sept. 29, 2005), where in a dispute over production of spreadsheets, the court held that a party required to produce electronic documents as kept in the ordinary course of business must produce metadata absent objection, agreement or protective order.

System metadata is particularly fragile.  Just copying a file from one location to another alters the file’s metadata, potentially destroying critical evidence.  Ideally, your data harvest shouldn’t corrupt metadata, but if it might, archive the metadata beforehand.  Though unwieldy, a spreadsheet reflecting original metadata is preferable to spoliation. EDD and computer forensics experts can recommend approaches to resolve these and other data harvest issues.  Today, we use load files for this purpose, but “load files” just wasn’t part of the everyday lawyer lexicon in 2005.
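Archiving system metadata before a copy operation can be as simple as a script that snapshots each file’s stat record into a spreadsheet-friendly CSV. This is a minimal sketch using Python’s standard library; the fields captured and the function name are illustrative, and real collection tools record far more (and use forensically sounder methods).

```python
import csv
import datetime
import os


def archive_metadata(paths, out_csv):
    """Snapshot system metadata to a CSV before copying can alter it.

    Records size and modified/accessed/changed timestamps for each file,
    producing the kind of "spreadsheet reflecting original metadata"
    described above.
    """
    def iso(ts):
        return datetime.datetime.fromtimestamp(ts).isoformat()

    with open(out_csv, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["path", "size_bytes", "modified", "accessed", "changed"])
        for p in paths:
            st = os.stat(p)
            writer.writerow(
                [p, st.st_size, iso(st.st_mtime), iso(st.st_atime), iso(st.st_ctime)]
            )
```

Run it against the harvest targets first; then, even if the copy disturbs the originals’ timestamps, the pre-copy values survive.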

Processing and Population
However scrupulous your e-mail harvest, what you’ve reaped isn’t ready to be text searched.  It’s a mish-mash of incompatible formats on different media: database files from Microsoft Exchange or Lotus Domino Servers, .PST and .NSF files copied from local hard drives, HTML fragments of browser-based e-mail and .PDF or .tiff images.  Locked, encrypted and compressed, it’s not text, so keyword searches fail.

Before search tools or reviewers can do their jobs, harvested data must be processed to populate the review set, i.e., deciphered and reconstituted as words by opening password-protected items, decrypting and decompressing container files and running optical character recognition on image files.  Searching now will work, but it’ll be slow going thanks to the large volume of duplicate items.  Fortunately, there’s a fix for that, too.
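One small piece of that processing pipeline, decompressing a container and reconstituting its members as text, can be sketched with Python’s standard-library zipfile module. This is a deliberately simplified illustration: real EDD processing also handles password-protected and encrypted items, nested containers, proprietary formats like .PST, and OCR for images, none of which appear here.

```python
import zipfile


def extract_texts(container_path):
    """Open a compressed container and decode each member as text.

    Once decompressed and decoded, the contents can be indexed, so
    keyword searches that would fail against the raw container succeed.
    """
    texts = {}
    with zipfile.ZipFile(container_path) as zf:
        for name in zf.namelist():
            with zf.open(name) as member:
                texts[name] = member.read().decode("utf-8", errors="replace")
    return texts
```

Each opaque container that processing cracks open this way adds more searchable text to the review population.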

Tomorrow: de-duplication, deliverables, documentation and the destination on the path to production.

Little has changed in the collection process in ten years, which is one reason we haven’t seen the efficiencies and economies that e-discovery should engender.  But we’ve learned some things.

The first lesson, gleaned from the NIST TREC Legal Track work that followed this article, is that the quality of lexical (e.g., keyword) search improves dramatically if you go back to the well.  That is, a one-time keyword search of an ESI collection might return results that are about 80% non-responsive junk and 20% what you seek.  But if you make that first production and let the parties return to the well with a second set of searches, the second effort returns results that are about 40% non-responsive junk and 60% what you’re looking for.  That second trip to the well yields much sweeter water.

Another thing we’ve learned: if you’re going to employ predictive coding, you’re better off if you don’t cull the collection using keywords.  Costs of ingestion aside, more is better when you’re using technology-assisted review.  Of course, costs of TAR ingestion are hard to set aside because they’re so high.  The happy news is that, since you must ingest and process to perform keyword search, the cost of using advanced analytics can be expected to fall significantly as the two ingestion and processing steps collapse into one.  It’s not much more costly to generate the n-grams needed for predictive coding at the same time you’re tokenizing the data.  If you have no idea what I just said, don’t worry about it.  Most people live rich, productive lives never knowing any of this stuff.
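For readers who do want to know, tokenizing and n-gram generation look roughly like this. The sketch below is a naive illustration in Python: the regular-expression tokenizer is far cruder than what real processing engines use, but it shows why producing n-grams adds little work once you’re tokenizing anyway.

```python
import re


def tokenize(text):
    """Naive tokenizer: lowercase and split on anything that isn't a
    letter, digit or apostrophe. Real EDD engines are more sophisticated."""
    return re.findall(r"[a-z0-9']+", text.lower())


def ngrams(tokens, n=2):
    """Slide a window of width n over the token stream to produce the
    word n-grams that predictive coding engines score as features.

    Since this is one extra pass over tokens already in hand, it costs
    little beyond the tokenization itself."""
    return [tuple(tokens[i : i + n]) for i in range(len(tokens) - n + 1)]


print(ngrams(tokenize("The quick brown fox"), 2))
```

Tokenizing “The quick brown fox” and asking for bigrams yields (“the”, “quick”), (“quick”, “brown”) and (“brown”, “fox”).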

The holy grail isn’t better collection or even better analytics.  The best approach will be preservation in situ coupled with advanced search and analytics embedded in the enterprise information governance environment.  When most potentially responsive data resides in the cloud, the best search and analytics software will sit right beside it all the time.  Jargon aside, this means we will know what we have and whether it’s responsive or privileged without having to copy it and hand it over to a vendor and an army of reviewers.  The iceman no longer cometh, and someday, neither will the ESI service provider.