This is the second in a series revisiting Ball in Your Court columns and posts from the primordial past of e-discovery, updating and critiquing them in places and, hopefully, restarting a few conversations.  As always, your comments are gratefully solicited.

Unclear on the Concept

 [Originally published in Law Technology News, May 2005]

A colleague buttonholed me at the American Bar Association’s recent TechShow and asked if I’d visit with a company selling concept search software to electronic discovery vendors.  Concept searching allows electronic documents to be found based on the ideas they contain instead of particular words. A concept search for “exploding gas tank” should also flag documents that address fuel-fed fires, defective filler tubes and the Ford Pinto. An effective concept search engine “learns” from the data it analyzes and applies its own language intelligence, allowing it to, e.g., recognize misspelled words and explore synonymous keywords.

I said, “Sure,” and was delivered into the hands of an earnest salesperson who explained that she was having trouble persuading courts and litigators that the company’s concept search engine worked. How could the company reach them and establish credibility?  She extolled the virtues of its better mousetrap, including its ability to catch common errors, like typing “manger” when you mean “manager.”

But when we tested the product against its own 100,000 document demo dataset, it didn’t catch misspelled terms or search for synonyms. It couldn’t tell “manger” from “manager.” Phrases were hopeless. Worse, it didn’t reveal its befuddlement. The program neither solicited clarification of the query nor offered any feedback revealing that it was clueless on the concept.

The chagrined company rep turned to her boss, who offered, “100,000 documents are not enough for it to really learn. The program only knows a word is misspelled when it sees it spelled both ways in the data it’s examining and makes the connection.”

The power of knowledge lies in using what’s known to make sense of the unknown. If the software only learns what each dataset teaches it, it brings nothing to the party. Absent from the application was a basic lexicon of English usage; there was nothing as fundamental as Webster’s Dictionary or Roget’s Thesaurus. There was no vetting for common errors, no “fuzzy” searching and no reference foundation of any kind. The application was the digital equivalent of an idiot savant (and I’m taking the savant on faith because this application is the plumbing behind some major vendors’ products).
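To make the missing layer concrete, here is a minimal Python sketch of what a reference foundation adds: a thesaurus lookup for synonyms plus edit-distance (“fuzzy”) matching for misspellings. The tiny THESAURUS and the sample documents are hypothetical, and plain Levenshtein distance is just one common fuzzy-matching technique, not a description of how any vendor’s engine works.

```python
# Hypothetical, toy-sized reference thesaurus; a real tool would ship a full lexicon.
THESAURUS = {
    "manager": {"supervisor", "boss"},
    "explode": {"burst", "ignite"},
}

def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: fewest single-character inserts, deletes or substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute (free when letters match)
            ))
        prev = curr
    return prev[-1]

def expand_query(term: str) -> set[str]:
    """The query term plus any synonyms found in the reference thesaurus."""
    term = term.lower()
    return {term} | THESAURUS.get(term, set())

def fuzzy_search(term: str, documents: dict[str, str], max_edits: int = 1) -> set[str]:
    """IDs of documents with a word within max_edits of the term or one of its synonyms."""
    targets = expand_query(term)
    hits = set()
    for doc_id, text in documents.items():
        words = text.lower().split()
        if any(edit_distance(w, t) <= max_edits for w in words for t in targets):
            hits.add(doc_id)
    return hits

docs = {
    "d1": "Ask the manger to approve the budget",  # misspelled "manager"
    "d2": "The supervisor signed off",             # synonym only
    "d3": "Quarterly sales report",                # unrelated
}
print(sorted(fuzzy_search("manager", docs)))  # ['d1', 'd2']
```

Because “manger” is one edit away from “manager,” the misspelled memo is flagged without the dataset ever having to show both spellings. The trade-off is that fuzzy matching is blind to meaning, so a document about an actual manger would be flagged too, and a person still has to sort that out.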

Taking the Fifth?
In the Enron/Andersen litigation, I was fortunate to play a minor role for lead plaintiff’s counsel as an expert monitoring the defendant’s harvesting and preservation of electronic evidence.  The digital evidence alone quickly topped 200 terabytes, far more information than if you digitized all the books in the Library of Congress. Printed out, the paper would reach from sea to shining sea several times.  These gargantuan volumes (and, increasingly, those seen in routine matters) can’t be examined without automated tools. There just aren’t enough associates, contract lawyers and paralegals in the world to mount a manual review, nor the money to pay for it. Of necessity, lawyers are turning to software to divine relevancy and privilege.

But as the need for automated e-discovery tools grows, so do the risks of using them.  It’s been 20 years since the only study I’ve seen pitting human reviewers against search tools. Looking at a 350,000-page litigation database (paltry by current standards), the computerized searches turned up just 20 percent of the relevant documents found by the flesh-and-bone reviewers.

The needle-finding tools have improved, but the haystacks are much, much larger now. Are automated search tools performing well enough for us to use them as primary evidence harvesting tools?

Metrics for a Daubert World
Ask an e-discovery vendor about performance metrics and you’re likely to draw a blank look or trigger a tap dance that would make the late Ann Miller proud.  How many e-discovery products have come to market without any objective testing demonstrating their efficacy? Where is the empirical data showing how concept searching stacks up against human reviewers? How has each retrieval system performed against the National Institute of Standards and Technology text retrieval test collections?

If the vendor response is, “We’ve never tested our products against real people or government benchmarks,” how are users going to persuade a judge it was a sound approach come the sanctions hearing?

We need to apply the same Daubert-style standards [Daubert v. Merrell Dow Pharmaceuticals, Inc., No. 92-102, 509 U.S. 579 (1993)] to these systems that we would bring to bear against any other vector for junk science: Has it been rigorously tested?  Peer-reviewed?  What are its known error rates?
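The metrics a vendor should be able to produce are standard information-retrieval measures. Below is a minimal Python sketch, assuming human reviewers’ relevance calls are treated as ground truth: recall is the share of human-identified relevant documents the tool also found, precision is the share of the tool’s hits that humans judged relevant, and error rate is the share of reviewed documents the tool classified wrongly. The function name and the toy data are illustrative only.

```python
def retrieval_metrics(tool_hits: set[str], human_relevant: set[str],
                      reviewed: set[str]) -> dict[str, float]:
    """Score a search tool against human relevance calls on the same reviewed set."""
    # Only documents the humans actually examined can be scored.
    tool_hits = tool_hits & reviewed
    human_relevant = human_relevant & reviewed
    true_pos = len(tool_hits & human_relevant)   # both agree: relevant
    false_pos = len(tool_hits - human_relevant)  # tool hit, humans disagree
    false_neg = len(human_relevant - tool_hits)  # relevant, but the tool missed it
    return {
        "precision": true_pos / len(tool_hits) if tool_hits else 0.0,
        "recall": true_pos / len(human_relevant) if human_relevant else 0.0,
        "error_rate": (false_pos + false_neg) / len(reviewed) if reviewed else 0.0,
    }

reviewed = {f"doc{i}" for i in range(10)}
human_relevant = {"doc0", "doc1", "doc2", "doc3", "doc4"}
tool_hits = {"doc0", "doc5"}
print(retrieval_metrics(tool_hits, human_relevant, reviewed))
# {'precision': 0.5, 'recall': 0.2, 'error_rate': 0.5}
```

The toy numbers are contrived so that recall comes out at 0.2: the tool found one of the five documents the humans deemed relevant.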

Calibration and Feedback
Like the airport security staff periodically passing contraband through the x-ray machines and metal detectors to check the personnel and equipment, automated search systems must be periodically tested against an evolving sample of evidence scrutinized by human intelligence.  Without this ongoing calibration, the requesting party may persuade the court that your net’s so full of holes, only a manual search will suffice.  If that happens, what can you do but settle?
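The airport-style spot check might be sketched as follows: draw a fresh random sample from the collection, have people code it, and estimate the tool’s recall on that sample against a threshold agreed on in advance. Everything here is an assumption for illustration: `human_review` stands in for real reviewers, the collection is invented, and the 0.75 threshold is arbitrary, not a legal or industry standard.

```python
import random
from typing import Callable, Optional

def calibration_check(
    collection: set[str],
    tool_hits: set[str],
    human_review: Callable[[str], bool],
    sample_size: int,
    min_recall: float,
    seed: Optional[int] = None,
) -> tuple[Optional[float], bool]:
    """Estimate a search tool's recall on a random, human-reviewed sample."""
    rng = random.Random(seed)
    # Sort first so that a given seed always draws the same sample.
    population = sorted(collection)
    sample = rng.sample(population, k=min(sample_size, len(population)))
    # Human judgment is the yardstick the tool is measured against.
    relevant = [doc for doc in sample if human_review(doc)]
    if not relevant:
        return None, False  # nothing relevant in the sample: draw a larger one
    found = sum(1 for doc in relevant if doc in tool_hits)
    recall = found / len(relevant)
    return recall, recall >= min_recall

# Hypothetical collection: humans deem even-numbered documents relevant,
# but the tool only caught the relevant ones numbered below 60.
collection = {f"doc{i:03d}" for i in range(100)}
tool_hits = {f"doc{i:03d}" for i in range(0, 60, 2)}

def human_review(doc: str) -> bool:
    return int(doc[3:]) % 2 == 0

# A fresh seed each period gives the "evolving sample" the text calls for.
print(calibration_check(collection, tool_hits, human_review,
                        sample_size=40, min_recall=0.75, seed=2005))
```

A 40-document sample yields only a rough estimate. In this toy collection the true recall is 0.6 (30 of the 50 relevant documents), and the sampled figure will wander around that value depending on the draw.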

Thanks to two excellent teachers, I read Solzhenitsyn in seventh grade and Joyce Carol Oates in ninth. I imagine that if I re-read those authors today, I’d get more from them than my adolescent sensibilities allowed.  Likewise, if software gets smarter as it looks at greater and greater volumes of information, is there a mechanism to revisit data processed before the software acquired its “wisdom,” lest it derive no more than my 11-year-old brain gleaned from One Day in the Life of Ivan Denisovich?  What is the feedback loop that ensures the connections forged while progressing through the dataset are applied to the entire dataset?

For example, in litigation about a failed software development project, the project team got into the habit of referring to the project amongst themselves as the “abyss” and the “tar baby.”  Searches for the insider lingo, as concepts or keywords, are likely to turn up e-mails confirming that the project team knowingly poured client monies into a dead end.

If the software doesn’t make this connection until it processes the third wave of data, what about what it missed in waves one and two?  Clearly, the way the data is harvested and staged affects what is located and produced.  Of course, this epiphany risk (not realizing what you saw until after you’ve reviewed a lot of stuff) afflicts human examiners too, along with the fatigue, inattentiveness and sloth to which machines are immune.

But we trust that a diligent human examiner will sense when a newly forged connection should prompt re-examination of material previously reviewed.

Will the software know to ask, “Hey, will you re-attach those hard drives you showed me yesterday?  I’ve figured something out.”
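The re-examination described above can be sketched as a loop that, whenever the search vocabulary grows, rescans every earlier wave rather than only the newest one. This is a minimal Python illustration, not how any particular product works: `learn_aliases` is a hypothetical stand-in for whatever associations a real engine forms (here, a regex that picks up a nickname introduced beside a known term), and “Project Atlas” is an invented project name.

```python
import re

# Hypothetical learner: picks up a nickname when a document that already
# matches a known term introduces it with 'aka "..."' or 'we call it "..."'.
ALIAS = re.compile(r'(?:aka|we call it) "([^"]+)"', re.IGNORECASE)

def learn_aliases(wave: dict[str, str], terms: set[str]) -> set[str]:
    learned = set()
    for text in wave.values():
        if any(term in text.lower() for term in terms):
            learned.update(alias.lower() for alias in ALIAS.findall(text))
    return learned

def review_in_waves(waves, seed_terms, learn_terms, rescan=True):
    """Search data wave by wave; on learning new terms, optionally rescan earlier waves."""
    terms = {t.lower() for t in seed_terms}
    seen: dict[str, str] = {}  # every document processed so far
    hits: set[str] = set()
    for wave in waves:
        seen.update(wave)
        new_terms = learn_terms(wave, terms) - terms
        terms |= new_terms
        # The feedback loop: a grown vocabulary is applied to everything already seen.
        to_scan = seen if (rescan and new_terms) else wave
        for doc_id, text in to_scan.items():
            if any(term in text.lower() for term in terms):
                hits.add(doc_id)
    return hits, terms

waves = [
    {"e1": "The abyss ate another $200k this month"},
    {"e2": "Lunch on Friday?"},
    {"e3": 'Status report on Project Atlas, aka "abyss"'},
]
print(sorted(review_in_waves(waves, {"Project Atlas"}, learn_aliases)[0]))                # ['e1', 'e3']
print(sorted(review_in_waves(waves, {"Project Atlas"}, learn_aliases, rescan=False)[0]))  # ['e3']
```

With `rescan=False`, the wave-one e-mail about the “abyss” is never flagged, which is exactly the miss the questions above worry about.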

Concept Search Tools
Though judges and requesting parties must be wary of concept search tools absent proof of their reliability, even flawed search tools have their place in the trial lawyer’s toolbox.

Concept searching helps overcome limitations of optical character recognition, where seeking a match to particular text may be frustrated by OCR’s inability to read some fonts and formats. It also works as a lens through which to view the evidence in unfamiliar ways, see relationships that escaped notice and better understand your client’s data universe while framing filtering strategies.

I admire the way EDD-savvy Laura Kibbe, (former) in-house counsel for pharmaceutical giant Pfizer, Inc., uses concept searching.  She understands the peril of using it to filter data and won’t risk having to explain to the court how concept searching works and why it might overlook discoverable documents.  Instead, Laura uses concept searching to brainstorm keywords for traditional word searches and then uses it again as a way to prioritize her review of harvested information.

For producing parties inclined to risk use of concept searching as a filtering tool, inviting the requesting party to contribute keywords and concepts for searching is an effective strategy to forestall finger pointing about non-production.  The overwhelming volume and the limitations of the tools compel transformation of electronic discovery to a collaborative process.  Working together, both sides can move the spotlight away from the process and back onto the merits of the case.

Looking at this article a decade later, I’m chagrined that I did not better understand the nascent technology that evolved into what we now call “TAR” and “Predictive Coding,” but I’m proud to have issued this challenge early in 2005: “How has each retrieval system performed against the National Institute of Standards and Technology text retrieval test collections?”  Few in the litigation sphere had any notion that there was such a thing as NIST TREC, and no one had publicly suggested a need for a legal track.  Happily, Jason Baron and Douglas Oard took up the cudgel and, in 2006, the TREC Legal Track was born.  All the credit belongs to those who executed on the idea, whatever its genesis.

In the ensuing years, predictive coding tools and workflows have come to incorporate the features I lamented as absent from the implementation I’d seen. “Feedback loops,” as I crudely termed them, and continuous re-assessment of the collection by active machine learning are commonplace and much esteemed today. However, I’ve moved away from the conviction that we need Daubert-style affirmation of technology-assisted review tools.  I got that wrong.  The better approach, and one I’m gratified to have advocated so long ago, is to require a healthy level of transparency of process and, where feasible, a high degree of collaboration.

The company I spoke of disappeared from the scene just a year after publication.  To the extent its technology is baked into some surviving tool, here’s hoping it’s only used against really large collections.

Finally, I wince at my pretense of precocity in working in how I’d read Alexander Solzhenitsyn and Joyce Carol Oates at tender ages.  I’m not sure whom I was trying to impress, but what a lame effort it was!