Earlier this year, I delivered the keynote address at a corporate event in Canada. I called the talk, “Spoiled and Deluded: Ugly Truths about Electronic Search.” I lamented how our happy experience with Google and online legal research has left us woefully unprepared (“spoiled”) for the extreme difficulty of search in e-discovery, then dashed a few misconceptions about the efficacy of searching ESI in accepted ways (“deluded”). Dear Reader, we need to be brutally frank about search, because in a world where the organization of information has gone the way of the typewriter and the file room, effective, efficient search is something we cannot manage without.
Search has two non-exclusive ways to fail: your query may not retrieve the information you seek, or it may retrieve information you weren’t seeking. The measure of the first is called “recall”; the measure of the second, “precision.” We want all of what we’re looking for (high recall) and only what we’re looking for (high precision).
Recall and Precision aren’t friends. In e-discovery, they’re barely on speaking terms. Every time Recall has a tea party, Precision crashes with his biker buddies and breaks the dishes.
It’s easy as pie to achieve high recall of responsive information in e-discovery. You simply grab it all: 100% of the data = 100% recall. But if only one item in a hundred is what you seek, your precision stinks: it’s just 1%. You’ll look at 99 irrelevant documents for each one worth reviewing. Some call this The Practice of Law, and most lawyers mistakenly regard it as the safest course, lest a party fail to produce something or produce something that should have been withheld.
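If the arithmetic seems abstract, here it is as a minimal Python sketch with hypothetical counts:

```python
# The "grab it all" strategy, in numbers. Hypothetical collection:
# 100 responsive documents hiding among 10,000.
collection_size = 10_000
responsive = 100                # the documents we actually want
retrieved = collection_size    # grab everything

true_positives = responsive    # grabbing it all catches all 100

recall = true_positives / responsive    # 100 / 100    = 1.00
precision = true_positives / retrieved  # 100 / 10,000 = 0.01

print(f"Recall:    {recall:.0%}")    # 100% -- we missed nothing
print(f"Precision: {precision:.0%}") # 1%   -- 99 duds per keeper
```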
Because it’s time consuming, it’s expensive. Worse, it doesn’t work very well. People make assessment errors; and making lots of assessments, they make lots of errors. My friend and fellow commentator, Ralph Losey, lately blogged about the shortcomings of search and review, calling them “dark secrets.” Don’t miss Ralph’s posts, but know that he has a penchant for revealing secrets that are “secret” in the same way that the square root of 256 is “secret.” Most won’t know it’s 16 off the top of their head; but like the problems of search, it’s easy to figure out if you’re even slightly curious! Kudos to Ralph for using the ploy of revealing “secrets” to inspire curiosity.
The errors we make in search can be subtle and hypertechnical, but they usually aren’t. Most mistakes I see in keyword search are of the boneheaded variety. If we eliminate the dumbest mistakes, we improve the quality of e-discovery and markedly trim its cost. Search will ever be a battle between Recall and Precision, but avoiding boneheaded errors will limit casualties.
I write today to feature a few such mistakes and encourage you to suggest a few of your own in the comments. I hope to turn this dialogue into a longer article or column.
Boneheaded Mistake 1: Searching for a custodian’s name or e-mail address in the custodian’s e-mail
If you run a list of search terms including a custodian’s name or e-mail address against that custodian’s own e-mail, you should expect to get hits on virtually every message, rendering the search useless. I know some of you are saying, “Craig, no one’s that boneheaded!” I say, “Wanna bet?” I see this mistake with regularity. I see it made by big firms touting their e-discovery expertise. I see it made by plaintiffs and defendants. I see vendors content to run these searches without flagging the error. Ask yourself: how often are the proposed search term lists exchanged between counsel carefully broken out by particular custodians or forms of ESI to be searched?
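A toy illustration of why this fails, in Python with a hypothetical two-message mailbox: the custodian’s own address appears in the headers of essentially every message, so every message “hits.”

```python
# Hypothetical two-message mailbox for custodian jsmith@acme.com.
mailbox = [
    {"headers": "From: jsmith@acme.com\nTo: counsel@firm.com",
     "body": "Attached are the board minutes."},
    {"headers": "From: vendor@supplier.com\nTo: jsmith@acme.com",
     "body": "Invoice enclosed."},
]

terms = ["jsmith@acme.com", "board minutes"]

for term in terms:
    hits = sum(1 for msg in mailbox
               if term.lower() in (msg["headers"] + " " + msg["body"]).lower())
    print(f"{term!r}: hits {hits} of {len(mailbox)} messages")
# 'jsmith@acme.com' hits 2 of 2 -- every message -- so the hit tells you nothing.
```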
Boneheaded Mistake 2: Assuming the Tool can Run the Search
Unless you plan to read everything, you can’t search ESI for keywords without a search tool; and if you’ve ever tried to drive a screw with a hammer, you know that tools do some tasks better than others, and some tasks they don’t do at all. Why can’t you use Google to know what’s in your fridge? Because the information isn’t online…yet. Every ESI search tool has limitations: the data may not have been collected, or it wasn’t indexed, or the search syntax is wrong. Most e-discovery searches are run against an index of the words in the ESI; but text indexers don’t index information that isn’t text (like pictures of words that haven’t been run through an optical character recognition process). They don’t index text they can’t access, like encrypted documents or documents encoded in unfamiliar ways. And they don’t index so-called “noise” or “stop” words deemed so common they’d gum up the works. I call this the “To Be or Not to Be” problem, because all of the words in Hamlet’s famous question tend not to be indexed in e-discovery.
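To see the stop word problem in miniature, here’s a toy inverted index in Python that drops noise words the way many indexers do; the stop list here is illustrative, not any particular tool’s:

```python
# A toy inverted index that skips "noise"/stop words. Hamlet's
# question vanishes entirely because every word in it is a stop word.
STOP_WORDS = {"to", "be", "or", "not", "the", "a", "of", "and", "is"}

docs = {1: "To be or not to be, that is the question",
        2: "The merger closed in March"}

index = {}
for doc_id, text in docs.items():
    for word in text.lower().replace(",", "").split():
        if word not in STOP_WORDS:          # noise words never get indexed
            index.setdefault(word, set()).add(doc_id)

def search(query):
    words = [w for w in query.lower().split() if w not in STOP_WORDS]
    if not words:
        return set()        # every query word was a stop word
    return set.intersection(*(index.get(w, set()) for w in words))

print(search("to be or not to be"))  # set() -- no hits, though doc 1 says it
print(search("merger"))              # {2}
```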
A related mistake is using the wrong or an unsupported search syntax. Not every search tool supports every common feature of search (e.g., wildcard characters, Boolean constructs, stemming, proximity searches or regular expressions), and not every tool uses the same methods or characters to deploy the same features. If you’re not certain how the search tool processes *, !, ?, /w and %, don’t assume they work as you imagine.
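One way to avoid guessing is to test what a wildcard should match before trusting it. A sketch, assuming the common (but by no means universal) convention that * means “zero or more characters”:

```python
# Translate a wildcard term into a regex and run it over known text to
# verify what it matches. This translation is one common convention,
# not any specific vendor's implementation.
import re

def wildcard_to_regex(term):
    # '*' = zero or more word characters, '?' = exactly one.
    pattern = re.escape(term).replace(r"\*", r"\w*").replace(r"\?", r"\w")
    return re.compile(rf"\b{pattern}\b", re.IGNORECASE)

sample = "The manager managed to mangle the management report."
rx = wildcard_to_regex("manag*")
print(rx.findall(sample))  # ['manager', 'managed', 'management'] -- not 'mangle'
```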
Boneheaded Mistake 3: Not Testing Searches
Much of what distinguishes a mistake as boneheaded is the ease with which it could have been avoided. When a party to a lawsuit once proposed the letter “S” be used as a search term, I didn’t need to test it to know that it was a boneheaded choice. But what about all those terms that routinely occur in file paths or are inevitably encountered in profusion within ESI having nothing to do with the case? If you don’t know whether the list of keywords you’re about to run includes some of these terms, what is the boneheaded thing to do? Right! Run them against your entire ESI collection without testing them first!
Even search terms that appear bulletproof can surprise you. Test your searches to be sure they perform as expected.
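A minimal sketch of such a test in Python, with hypothetical terms and a hypothetical sample, flagging any term that hits more than half the sample:

```python
# Run candidate terms against a small sample before unleashing them on
# the whole collection. Terms and sample documents are made up.
sample_docs = [
    "Minutes of the well abandonment committee",
    "Re: deposition of the zinc layer on the substrate",
    "Lunch Friday?",
]
terms = ["well", "deposition", "s"]   # "s" really was proposed once

for term in terms:
    rate = sum(term in d.lower() for d in sample_docs) / len(sample_docs)
    flag = "  <-- investigate before running!" if rate > 0.5 else ""
    print(f"{term!r}: hits {rate:.0%} of sample{flag}")
```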
Boneheaded Mistake 4: Not Looking at the Data!
How much chatter about whether it’s raining outside will you listen to before looking out the window? Don’t just natter on about the quantity of hits to evaluate your search; check the quality of the hits. Look at the data! Fifteen minutes spent looking at the data can eliminate weeks or months of reviewing crappy results and a zillion dollars spent in motion practice.
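Looking at the data needn’t mean reading every hit. A minimal sketch of pulling a reproducible random sample for eyeballing, with a stand-in hit list:

```python
import random

def sample_for_eyeballing(hits, n=25, seed=42):
    """Pull a reproducible random sample of hits for manual review."""
    random.seed(seed)   # fixed seed so others can re-draw the same sample
    return random.sample(hits, min(n, len(hits)))

hits = [f"doc_{i}" for i in range(5_000)]   # stand-in for a real hit list
for doc in sample_for_eyeballing(hits, n=5):
    print("review:", doc)
```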
Boneheaded Mistake 5: Ignoring the Exceptions List
It’s the rare e-discovery effort where everything processes without exception. There will typically be hundreds or thousands of items that are encrypted, corrupt, unrecognized or unreadable. A report of these exceptions is usually generated during processing. Too often, these exceptions are forgotten soon after they’re identified or are misclassified as benign. It’s a calculated risk to decide that the exceptional items can be ignored; but, to forget these exceptions exist is a boneheaded mistake.
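A minimal sketch of a first-pass triage, in Python with hypothetical exception records, tallying the report by reason so no bucket quietly vanishes:

```python
from collections import Counter

# Hypothetical processing exceptions; the field names are illustrative.
exceptions = [
    {"file": "budget.xlsx",     "reason": "encrypted"},
    {"file": "scan_014.tif",    "reason": "no extractable text"},
    {"file": "old_db.mdb",      "reason": "unrecognized format"},
    {"file": "board_deck.pptx", "reason": "encrypted"},
]

for reason, count in Counter(e["reason"] for e in exceptions).most_common():
    print(f"{reason}: {count} item(s)")
# Decide -- and document -- what happens to each bucket; don't just file it away.
```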
That’s five boneheaded mistakes to prime the pump. Now, how about sharing a few of your own?
P.S. Testing terms? Looking at data? To do these things, you need a desktop tool that makes it possible; better still, one that’s dirt cheap and extraordinarily powerful. If that sounds useful, don’t miss this important post. Time is running out!
Ed Fiducia said:
Bonehead Mistake 6: Assuming That Deduplication Solves My Problem
While a technical discussion of deduplication methods is too deep to go into here, suffice it to say that both MD5 and SHA-1 have their limitations. They will, of course, find truly duplicative files and are trustworthy to that end from a statistical standpoint. The rub is the definition of a truly duplicative file. Email header info? Addressee information? Documents that have been converted from one format to another (say, Word to PDF) will not deduplicate; the sketch after this comment illustrates why. We are still left with thousands upon thousands of near-duplicates that must be identified and reviewed. This leads not only to a dramatic increase in review costs, but to a dramatic increase in the probability that documents will be coded inconsistently. Spend more money, get worse results. Not a good combination. Which leads me to suggest….
Bonehead Mistake 7: Reviewing Fifty Custodians When Five Will Do
Preserve everything? You bet! Review everything? Not in my book.
The knee-jerk reaction is to lay the blame on the Plaintiff’s attorneys, who love to ask for everything. IMHO, equal responsibility is held by the Defense attorneys who fail to negotiate this process from the very beginning in the Meet and Confer. As a service provider, you’d think that I would be in the Process and Review Everything Camp, but over the past 18 years I have seen case after case that demonstrates to me that if the scope of ediscovery is limited at the beginning… with appropriate caveats to allow for additional discovery requests if the evidence points to it… everybody wins.
Ed Fiducia
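A minimal Python sketch of Ed’s point, using made-up stand-in bytes: a byte-for-byte copy hashes identically, but the “same” document saved in two formats does not.

```python
import hashlib

def md5_of(data: bytes) -> str:
    return hashlib.md5(data).hexdigest()

# Stand-in bytes; real files would be read from disk.
word_bytes = b"DOCX...quarterly report..."      # report.docx
pdf_bytes  = b"%PDF-1.7...quarterly report..."  # the same report saved as PDF
exact_copy = word_bytes                          # a byte-for-byte duplicate

print(md5_of(word_bytes) == md5_of(exact_copy))  # True:  hash dedupe catches this
print(md5_of(word_bytes) == md5_of(pdf_bytes))   # False: the near-duplicate survives
```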
Craig Ball said:
Thanks, Ed. Exactly the sort of contribution I’m looking for. Your #6 might also be thought of as, “Believing there’s one magical technology that does it all.” Hash de-dupe is an excellent technology, but as you ably note, it only goes so far and no farther until you add in refinements like shingling or segmented hashing that go a little farther but don’t solve all problems.
Your #7 is brilliant, and so true. I get far more, far faster from a deep dive into a few key custodians’ e-mail than a mass amalgamation of dozens or (shudder) hundreds of people’s stuff grabbed from everywhere.
Dave Swider said:
Here’s one we see pretty often: Searching for names without anticipating variations encountered during actual usage. As an example, we’ll see a search for a name with many potential variations such as “Robert Smith”. No variations are specified by the client; no Rob, no Bob, no Bobby, no Robby, not even an email address.
Similarly, we’ll be requested to search for only the complete firm name: all 5 names as an exact string. No domain search, no [first] w/2 [last]. (A sketch after this comment shows what fuller name matching might look like.)
Similarly, we’ll see use of a wildcard or terms that will be far too expansive or terms that are created and applied without any knowledge of the documents in the case. I worked on a metals manufacturing case that involved laying one material on top of another in a process called “deposition”. Guess what term appeared on the pot. priv terms list? Common offender in groundwater cases: “well”.
And yes, “Not looking at the data” can be a huge failure point. I can’t count how many times I’ve had counsel remove terms due to no more analysis than “that’s too many hits”.
That said, I think the number one boneheaded move by legal staff is simply not bothering to understand how data works and how they can best apply tools that will make their outcomes better. Our best clients are those that treat data not like documents, but like data.
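Here’s a minimal sketch of fuller name matching, per Dave’s point: nicknames, the email address, and a first-within-two-words-of-last proximity match rather than one exact string. The nickname table and regex convention are illustrative, not any tool’s actual syntax.

```python
import re

NICKNAMES = {"robert": ["rob", "bob", "bobby", "robby"]}  # illustrative table

def name_patterns(first, last, email=None):
    firsts = [first.lower()] + NICKNAMES.get(first.lower(), [])
    # first name within ~2 words of last name, a rough stand-in for w/2
    pats = [rf"\b{f}\W+(?:\w+\W+)?{last.lower()}\b" for f in firsts]
    if email:
        pats.append(re.escape(email.lower()))
    return [re.compile(p) for p in pats]

text = "please forward this to bob m. smith and copy rsmith@acme.com"
for rx in name_patterns("Robert", "Smith", email="rsmith@acme.com"):
    if rx.search(text):
        print("hit:", rx.pattern)
# An exact search for "robert smith" alone would have missed both hits.
```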
Craig Ball said:
Thanks, Dave. You really nailed a few more of them brilliantly. Great anecdote about the term, “deposition.” Love that point about treating data like data, not like docs. How do we begin to explain the difference to those who don’t want to know?
Dave Swider said:
I’ve used an explanation like “imagine a box of documents that would magically get rid of duplicates or arrange email into conversations. How amazing would that be? How about if we could ask the documents which ones contained specific terms and they’d raise their little paper hands for us?”
It’s pretty pedantic, but it gets the point across – to the right audience.
Marc Hirschfeld said:
Here is one that I never see attorneys talk about and negotiate…neglecting to run searches through filenames and file folders. I often find a treasure trove of information when the folder that contains most of the relevant information contains a search word but some of the documents within it don’t. It is as if the user pre-identified these documents as relevant; but because the filename and folder weren’t included in the index, they aren’t searched.
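A minimal sketch of Marc’s point: search the file path alongside the extracted text, because the folder name may carry the term even when the documents don’t. Paths, contents and the term here are hypothetical.

```python
# Hypothetical files: path mapped to extracted text.
files = {
    r"C:\Users\jsmith\Corporate Documents\notes.txt": "lunch schedule",
    r"C:\Users\jsmith\Misc\memo.txt": "corporate documents attached",
}

term = "corporate documents"
for path, text in files.items():
    in_path = term in path.lower()
    in_text = term in text.lower()
    if in_path or in_text:
        print(f"{path}  (path hit: {in_path}, text hit: {in_text})")
# Index only the body text and the first file -- the one the user filed
# under a telling folder name -- never surfaces.
```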
Craig Ball said:
Thanks, Marc. That’s odd, because it should hit on the message’s file path unless searched discretely (i.e., separated by spaces). You raise an important issue when parties elect to search archived and journaled mail because, if grabbed before it hits the end user’s mailbox, messages don’t hold any information about the end user’s characterization by foldering. I’m not sure I’d call that boneheaded, as it comes in more as an EDD 201 course subject, don’t you think? Still, a useful contribution. Keep ’em coming!
Marc Hirschfeld said:
I just ran a test on LAW PreDiscovery, which uses dtSearch on the back end. I searched for a discrete term appearing in a file path of a current matter I am working on, namely “corporate documents,” and it did not show in my results unless the term was in the full text of the document.
Debbie Westwood said:
Bonehead Mistake #8: Failing to appreciate how much raw data is involved in even “small” litigation matters.
There is often a complete failure of imagination of what (for example) 1TB of data really means for the traditional litigation workflow. When you can go and pick up a 2TB drive for a couple of hundred dollars or less, and it’s the size of a paperback book, there is a kind of TARDIS effect – surely nothing that small can contain something so big!
It’s not entirely their fault – we swim in data, have 500GB drives in our laptops, 64GB on our iPads, 20GB (or was it 50GB?) in our Cloud Drive, or SkyDrive, or Dropbox. Everyone is used to having GBs of data at their fingertips. But of course we are not often asked to review the contents of our personal laptops, and often the drives are not full, or are full of movies, music and other large media files. Nor do we really have to actively manage all that data.
I think that experience of “Big Personal Data”, which does not have to be reviewed, managed, or categorized in any meaningful way, tends to infect some people’s understanding of what “Big Litigation Data” means in terms of workflow, cost, and process management.
Craig Ball said:
Thanks, Debbie. Very astute observations, and I love the reference to the TARDIS. Ironically, I see an opposing sentiment as well, just as boneheaded in its way. I get frustrated when people speak of data as being impossibly large, often bolstered by a preposterous page equivalency. The cognitive disconnect is that people will confidently state that a user’s data is equal to a billion printed pages, never considering that no human being generates a billion pages of personal work product in a lifetime. As you wisely note, it’s not the size of the media, it’s the size of the files and file elements. One hundred gigs of movies, music or geophysical imaging is easy to manage. A gigabyte of short text messages may be enormously challenging.
Debbie Westwood said:
Ah, this is a variation on your bonehead mistake #4: not looking at the data. Not only should your results (or output) be verified at each step in your eDiscovery workflow, it’s also helpful to know what you’re starting with *before* you get it into your workflow.
Tinzing said:
Great topic. Here to share one from an outsider’s experience.
Bonehead Mistake #9: Providing a litigation hold notice to the client and expecting the client to know what “relevant” means. And then there’s dreading to pick up the phone to call the other side’s attorney who wants ESI, but wants all of it in PDF format. Imagine producing more than 10GB of data as PDFs because that is the other side’s preferred choice. They throw all the technical boilerplate into the request, but ultimately they want static PDF images. I just laugh out loud picturing the other side at a printer, printing all this data… 🙂
The Electronic Discovery Attorney said:
Craig, I would add this mistake to your list. It is focused on the review portion of the eDiscovery process: you can’t just shut contract attorneys in a room with computers to review documents. They require supervision and guidance. That means somebody who is knowledgeable about the matter has to check their work. This should be the policy whether a firm is directly managing the contractors or they are being managed by a third-party vendor.
Dave Swider said:
Great comment regarding review. Another potential boneheaded mistake: failure to do sampling on the review sets. In the event there’s an inadvertent release of privileged documents, being able to point to some statistical tests that were run on the production is likely more useful than “we checked everything twice”.
Similarly, after running potential privilege terms, it’s a good idea to run samples on documents not containing potential privilege terms to see if there are other names, domains or terms that weren’t included in the original term set.
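A minimal sketch of that null-set sampling, with hypothetical terms and documents: pull a random sample of the documents that hit no privilege term and eyeball them for names or domains the list missed.

```python
import random

priv_terms = ["privileged", "attorney", "work product"]  # hypothetical screen

docs = {f"doc_{i}": f"routine correspondence {i}" for i in range(1_000)}
docs["doc_7"] = "note from our outside counsel at smithlegal.com"  # a miss

null_set = [name for name, text in docs.items()
            if not any(t in text.lower() for t in priv_terms)]

random.seed(1)
for name in random.sample(null_set, 10):  # eyeball these for missed privilege
    print("check:", name, "->", docs[name])
```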
Ann Marie Gibbs said:
Another review oversight we see is a failure to “update” the review set when a “false hit” is running up the review bill. This relates to the mistakes where a client declines to accept excellent advice on search selection criteria. If you can’t get them to understand the problem on the front end, you have a second bite at the apple on the back end.
Bill Onwusah said:
How about searching for a term that shows up in the footer of every single document produced by the organisation? Such as the firm’s name?