Whether you call it “predictive coding” or “technology-assisted search,” the time is nigh when we will leave much of the heavy lifting of search to machines trained to find responsive documents. These tools won’t be heuristic marvels like the HAL 9000 envisioned by Arthur C. Clarke, but they probably won’t try to kill us either.

We’ll train these tools by presenting them with examples of patently responsive documents culled by flesh-and-blood reviewers from key custodians’ ESI. Using sophisticated algorithms that analyze these “seed sets” and identify patterns, the tools will ferret out other documents like the examples. Because the tools can be trained to find similar ESI from any documents, we won’t be limited to seed sets drawn from actual documents. We can train the tools with contrived documents: fabrications modeled on the genuine items we hope to find. I call this “imagining the evidence,” and it’s not nearly as crazy as it sounds.

Today, it’s commonplace for an opponent to contribute to the list of terms used to search for responsive documents. Unfortunately, experience and study bear out that keyword search is a relatively ineffective means to identify responsive documents, especially when search terms are selected without careful analysis of the collection or testing for precision.

Keyword search is a frustratingly literal technology. If responsive documents contain terms even slightly different from those searched, the responsive documents will likely be missed. Stemming, alternate spellings and synonyms help, but keywords are, at best, a crude tool when you know the collection well and a crapshoot when you don’t.
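To make that literalness concrete, here is a minimal sketch of whole-word keyword matching. The function name `keyword_hits` and the three sample documents are invented for illustration; real review platforms add stemming, wildcards and proximity operators, so treat this as a picture of the failure mode rather than of any particular product.

```python
import re

def keyword_hits(documents, terms):
    """Return the indexes of documents containing any term as a whole word."""
    patterns = [re.compile(r"\b" + re.escape(term.lower()) + r"\b") for term in terms]
    return [
        index
        for index, text in enumerate(documents)
        if any(pattern.search(text.lower()) for pattern in patterns)
    ]

# Hypothetical collection, invented for this example.
documents = [
    "Please shred the invoices before the audit.",
    "We shredded every invoice last Friday.",
    "Lunch menu for the holiday party.",
]

# The literal term finds only the first document.
print(keyword_hits(documents, ["shred"]))

# The second document surfaces only after someone thinks to add the variant.
print(keyword_hits(documents, ["shred", "shredded"]))
```

The second document is plainly responsive, yet it stays hidden until a human anticipates the past tense and adds it to the list.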

In contrast, predictive coding is not as linguistically fussy as keyword search. If an opponent submits contrived examples of the sorts of documents they seek, it’s far more likely a similar document will surface than if keywords alone were used. Just as important, it’s less likely that a responsive document will be lost in a blizzard of false hits.
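A toy version of the idea can be sketched in a few lines. This is not how any commercial predictive coding tool works; those rely on trained classifiers and far richer features. As a stand-in, the sketch below scores each document against a contrived seed using cosine similarity over character trigrams. The function names, the seed and the documents are all assumptions made up for the example.

```python
from collections import Counter
from math import sqrt

def trigram_profile(text):
    """Count the overlapping three-character sequences in lowercased text."""
    cleaned = " ".join(text.lower().split())
    return Counter(cleaned[i:i + 3] for i in range(len(cleaned) - 2))

def cosine(a, b):
    """Cosine similarity between two trigram count profiles."""
    dot = sum(count * b[gram] for gram, count in a.items() if gram in b)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def rank_by_similarity(seed, documents):
    """Return (score, index) pairs, most similar to the seed first."""
    seed_profile = trigram_profile(seed)
    scored = [
        (cosine(seed_profile, trigram_profile(text)), index)
        for index, text in enumerate(documents)
    ]
    return sorted(scored, reverse=True)

# A contrived seed: a fabricated example of the kind of message sought.
seed = "Shred the invoices before the auditors arrive."

# Hypothetical collection, invented for this example.
documents = [
    "Lunch menu for the holiday party.",
    "We shredded every invoice last Friday.",
    "Quarterly parking assignments are attached.",
]

for score, index in rank_by_similarity(seed, documents):
    print(f"{score:.3f}  {documents[index]}")
```

The message about shredded invoices ranks first even though it never uses the exact words “shred” or “invoices,” because the comparison rewards overall resemblance rather than demanding a literal term match.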

The use of contrived examples may ruffle some feathers. I can almost hear a chorus of, “How dare they draft such a vile thing? It’s libelous!” Certainly, we will need to ensure that contrived seed sets aren’t mistaken for genuine evidence. But the methodology is sound, and how we will go about “imagining the evidence” is likely to be a topic of discussion in the negotiation of search protocols once use of predictive coding and other enhanced search technologies is the norm.