I just returned from Santa Fe where I spoke on a panel with Judges Paul Grimm and Rebecca Pallmeyer at the always excellent ALI Current Developments in Employment Law program. I opened our sessions with a presentation I call “Spoiled and Deluded: The Shakespearean Tragedy of Search in E-Discovery.” The presentation addresses the discontinuity between what lawyers believe their search tools can accomplish and the practical limits of same.
While I was explaining the role of stop words in indexed search and lamenting what I call the “to be or not to be” problem (i.e., the inability of some text indexing tools to find that most famous of English language phrases because its constituent words are often omitted by text parsers), Judge Pallmeyer stopped me and said, “Is that true?”
When a federal district judge pointedly asks you if what you are telling the audience is true, it’s an opportune time to catch your breath and collect your thoughts before responding.
“Yes, Judge,” I answered, “It’s true.”
She countered, incredulously, “But surely I can find ‘to be or not to be’ if I put it in quotes, right?”
“No, Your Honor,” I replied. “If it’s been excluded from the index, no search will find what’s not there to be found.”
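The mechanics behind that answer can be shown with a toy inverted index. This is only a minimal sketch: the stop list, the tokenizer and the function names (`build_index`, `phrase_search`) are assumptions made up for illustration, not the behavior of any particular e-discovery product.

```python
import re

# Hypothetical stop list for illustration; real tools ship their own lists.
STOP_WORDS = {"a", "an", "and", "be", "is", "not", "of", "or", "that", "the", "to"}

def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())

def build_index(docs, stop_words):
    """Map each indexed token to {doc_id: [positions]}, skipping stop words."""
    index = {}
    for doc_id, text in docs.items():
        for pos, tok in enumerate(tokenize(text)):
            if tok in stop_words:
                continue  # the token never reaches the index
            index.setdefault(tok, {}).setdefault(doc_id, []).append(pos)
    return index

def phrase_search(index, phrase):
    """Return ids of docs holding the phrase's tokens at consecutive positions."""
    tokens = tokenize(phrase)
    if not tokens or any(tok not in index for tok in tokens):
        return set()  # a token missing from the index means no hits
    hits = set()
    for doc_id, starts in index[tokens[0]].items():
        for start in starts:
            if all(start + i in index[tok].get(doc_id, [])
                   for i, tok in enumerate(tokens)):
                hits.add(doc_id)
                break
    return hits

docs = {
    "A": "To be or not to be, that is the question.",
    "B": "Two heads are better than one.",
}

full_index = build_index(docs, stop_words=set())
stopped_index = build_index(docs, stop_words=STOP_WORDS)

print(phrase_search(full_index, "to be or not to be"))     # document A is found
print(phrase_search(stopped_index, "to be or not to be"))  # empty: nothing indexed
print(phrase_search(stopped_index, "two heads"))           # document B is found
```

In this sketch, quoting the phrase changes nothing. The quotes only tell the search to require adjacent positions, and there are no positions to compare when every word of the phrase was discarded at indexing time.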
Of course, not every search tool uses the same stop words in the same way, nor are all similarly hamstrung by hyphens, diacriticals, numeric content and other common search pitfalls. But it got me thinking about the value of a standard, freely available corpus for testing search tools, something anyone can pull into their tool of choice and then run baseline searches against to assess whether the tool finds all instances of all the queried words and phrases. That got me compiling a list of forty words and phrases, each word or phrase appearing one or more times within ten of the most common file types seen in e-discovery. These were collected in a compressed Zip file and can be downloaded from here. The file is less than one megabyte in size, holds ten files named “A” through “J” and its MD5 hash value is 3CA3D949DA69ED06F61A007B8D080CD5. An RTF list of the search terms can be found here. The 40 words and phrases chosen are:
- 1564 – 1616
- Area 51
- Do That to Me One More Time
- Everybody Loves Somebody Sometime
- To be or not to be
- All’s well that ends well
- Veni, vidi, vici
- Ask not what your country can do for you
- Lord, is it I?
- I never met a man I didn’t like
- Able was I ere I saw Elba
- Two heads are better than one
- Come and take it
- Between us, yes or no?
- Any way out of this?
- What is she into?
- Zoë Baird
- Café Lattè
- Annuit cœptis
- E Pluribus Unum
- Plan B
- Kiss my A**
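For readers who want to confirm the download is intact before testing, here is a minimal sketch of checking a file against the MD5 value quoted above. The function names and the local file name in the usage note are hypothetical; only the hash value comes from this post.

```python
import hashlib

# MD5 value quoted in the post for the Zip file.
EXPECTED_MD5 = "3CA3D949DA69ED06F61A007B8D080CD5"

def md5_of_file(path, chunk_size=65536):
    """Return the uppercase hex MD5 digest of a file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest().upper()

def matches_expected(path, expected=EXPECTED_MD5):
    """True if the file's MD5 digest equals the expected value (case-insensitive)."""
    return md5_of_file(path) == expected.upper()
```

Calling `matches_expected("search_test_corpus.zip")`, where the file name is whatever you saved the download as, should return `True` for an unaltered copy.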
I grant it’s an oddball olio. The first ten search terms were obscure things I was required to memorize for my 7th grade General Language class at Horace Mann School in Riverdale, N.Y. I had to know a “zarf” is an ornamental hot coffee cup holder, a “strigil” is a skin scraper, “triskaidekaphobia” is the fear of the number 13 and William Shakespeare lived from 1564-1616. I included these to get some use out of them after forty-odd years cluttering my brain.
But there’s a method to my madness in the other selections, which reflect, inter alia, famous numeric values and historic phrases, titles of four pop songs and two phrases from the back of a U.S. dollar. I even included “Come and Take It,” from an early Texas flag. Whether because of the presence of stop words excluded from many indexes or the inclusion of diacriticals and other confounding features, these phrases are designed to give fits to search tools that employ indexed search.
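To see why diacriticals and numeric content cause trouble, consider how two different tokenizers carve up a few of the terms. This is a sketch under stated assumptions: `ascii_tokens` and `unicode_tokens` are hypothetical stand-ins for a naive parser and a more forgiving one, and real tools make their own choices.

```python
import re
import unicodedata

def fold_diacritics(text):
    """Strip combining marks after NFKD decomposition (e.g. 'ë' becomes 'e')."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def ascii_tokens(text):
    """A naive tokenizer that keeps only runs of unaccented ASCII letters."""
    return re.findall(r"[a-z]+", text.lower())

def unicode_tokens(text):
    """A more permissive tokenizer: fold diacritics, keep letters and digits."""
    return re.findall(r"\w+", fold_diacritics(text).lower())

# Escapes are used so the accented characters are unambiguous in source form.
terms = [
    "Zo\u00eb Baird",       # Zoë Baird
    "Caf\u00e9 Latt\u00e8", # Café Lattè
    "Annuit c\u0153ptis",   # Annuit cœptis
    "1564 \u2013 1616",     # 1564 – 1616
    "Area 51",
]

for term in terms:
    print(term, ascii_tokens(term), unicode_tokens(term))
```

The naive tokenizer splits “Zoë” into “zo”, drops the digits entirely and leaves “Area 51” as just “area”. The permissive one recovers “zoe” and “cafe”, but the ligature in “cœptis” has no Unicode decomposition, so it survives folding and a query typed as “coeptis” would still miss it unless the tool maps ligatures separately.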
Just for diabolical fun, I’ve also hidden the search terms in various forms and within various areas of the documents which your search tool may or may not detect. Some are embedded as images. Some fill metadata fields. Some were simply made harder for a reviewer to see. I’ve also corrupted the header of one document and password-protected another. The measure of your tool’s performance isn’t just that it finds a single instance of the forty words and phrases in each document, but how many instances of each it correctly identifies across all documents using the specified search terms. Here’s an occurrence matrix listing the minimum number of discrete occurrences of each search term appearing in each of the ten lettered documents in the collection:
I concede that this testing corpus is far from optimum in terms of putting common e-discovery search and indexing tools through their paces. My hope is that it prompts others to build and distribute better ones.
If your tool of choice falls far short of catching all the occurrences in all the files when deployed with customary settings, even this homespun effort should prompt you to ask why that might be and to consider what you’re missing. Happy hunting!