I just returned from Santa Fe where I spoke on a panel with Judges Paul Grimm and Rebecca Pallmeyer at the always excellent ALI Current Developments in Employment Law program. I opened our sessions with a presentation I call “Spoiled and Deluded: The Shakespearean Tragedy of Search in E-Discovery.” The presentation addresses the discontinuity between what lawyers believe their search tools can accomplish and the practical limits of same.
While I was explaining the role of stop words in indexed search and lamenting what I call the “to be or not to be” problem (i.e., the inability of some text indexing tools to find that most famous of English language phrases because its constituent words are often omitted by text parsers), Judge Pallmeyer stopped me and said, “Is that true?”
When a federal district judge pointedly asks you if what you are telling the audience is true, it’s an opportune time to catch your breath and collect your thoughts before responding.
“Yes, Judge,” I answered, “It’s true.”
She countered, incredulously, “But surely I can find ‘to be or not to be’ if I put it in quotes, right?”
“No, Your Honor,” I replied. “If it’s been excluded from the index, no search will find what’s not there to be found.”
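The mechanics behind that answer are easy to demonstrate. Here is a toy sketch (not any particular product's engine, and using a sample stop word list of my own choosing) of an inverted index that discards stop words at indexing time. Once the words are dropped, even a quoted phrase search has nothing to look up:

```python
# Toy illustration: a phrase made entirely of stop words vanishes from
# an index that drops them. The stop word list here is a small sample.
STOP_WORDS = {"to", "be", "or", "not", "the", "a", "an", "of", "that", "is"}

def build_index(docs):
    """Map each indexed word to the set of documents containing it."""
    index = {}
    for doc_id, text in docs.items():
        for word in text.lower().split():
            if word in STOP_WORDS:
                continue  # the word is never written to the index at all
            index.setdefault(word, set()).add(doc_id)
    return index

docs = {"hamlet": "to be or not to be that is the question"}
index = build_index(docs)

# A quoted phrase search still resolves its words against the index first.
# None of the phrase's words were indexed, so nothing can ever match.
print(all(word in index for word in "to be or not to be".split()))  # False
print("question" in index)  # True: only the non-stop word survived
```

Quoting the phrase changes how hits are combined, not what was indexed; if the words were never written to the index, the quotes can't conjure them back.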
Of course, not every search tool uses the same stop words in the same way, nor are all similarly hamstrung by hyphens, diacriticals, numeric content and other common search pitfalls. But it got me thinking about the value of a standard, freely available corpus for testing search tools, something anyone can pull into their tool of choice and then run baseline searches against to assess whether the tool finds all instances of all the queried words and phrases. That got me compiling a list of forty words and phrases, each word or phrase appearing one or more times within ten of the most common file types seen in e-discovery. These were collected in a compressed Zip file and can be downloaded from here. The file is less than one megabyte in size, holds ten files named “A” through “J” and its MD5 hash value is 3CA3D949DA69ED06F61A007B8D080CD5. An RTF list of the search terms can be found here. The 40 words and phrases chosen are:
- Zarf
- Strigil
- Xyster
- Bis
- Sporran
- Duniwassal
- Hidalgo
- Triskaidekaphobia
- Pixilated
- 1564 – 1616
- 3.14159
- 911
- 9/11
- WD-40
- Area 51
- 867-5309
- R-E-S-P-E-C-T
- Do That to Me One More Time
- Everybody Loves Somebody Sometime
- To be or not to be
- All’s well that ends well
- Veni, vidi, vici
- Ask not what your country can do for you
- Lord, is it I?
- I never met a man I didn’t like
- Able was I ere I saw Elba
- Two heads are better than one
- Come and take it
- Between us, yes or no?
- Any way out of this?
- What is she into?
- Résumé
- Resumé
- Zoë Baird
- Café Lattè
- Annuit cœptis
- E Pluribus Unum
- Plan B
- C.V.
- Kiss my A**
I grant it’s an oddball olio. The first ten search terms were obscure things I was required to memorize for my 7th grade General Language class at Horace Mann School in Riverdale, N.Y. I had to know a “zarf” is an ornamental hot coffee cup holder, a “strigil” is a skin scraper, “triskaidekaphobia” is the fear of the number 13 and William Shakespeare lived from 1564-1616. I included these to get some use out of them after forty-odd years cluttering my brain.
But there’s a method to my madness in the other selections, which reflect, inter alia, famous numeric values and historic phrases, titles of four pop songs and two phrases from the back of a U.S. dollar. I even included “Come and Take It,” from an early Texas flag. Whether because of the presence of stop words excluded from many indexes or the inclusion of diacriticals and other confounding features, these phrases are designed to give fits to search tools that employ indexed search.
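The diacritical entries are sneakier than they look. “Résumé” can be stored two different ways that display identically: with a precomposed é, or with a plain e followed by a combining accent. An index that doesn't normalize Unicode treats them as different words, and one that folds accents entirely collapses “Résumé” into “Resume.” A short sketch of both behaviors using Python's standard `unicodedata` module:

```python
# Why "Résumé" can slip past a naive search: identical-looking strings
# can differ at the code point level.
import unicodedata

composed = "R\u00e9sum\u00e9"       # "Résumé" with precomposed é
decomposed = "Re\u0301sume\u0301"   # "Résumé" as e + combining acute accent

print(composed == decomposed)  # False: same glyphs, different code points

def nfc(s):
    """Normalize to composed form so equivalent strings compare equal."""
    return unicodedata.normalize("NFC", s)

print(nfc(composed) == nfc(decomposed))  # True after normalization

def strip_accents(s):
    """Accent folding: decompose, then drop the combining marks."""
    return "".join(c for c in unicodedata.normalize("NFD", s)
                   if not unicodedata.combining(c))

print(strip_accents(composed))  # Resume
```

Whether a tool normalizes, folds, or does neither determines whether “Résumé,” “Resumé” and “Resume” are one term or three, which is exactly what these entries probe.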
Just for diabolical fun, I’ve also hidden the search terms in various forms and within various areas of the documents which your search tool may or may not detect. Some are embedded as images. Some fill metadata fields. Some were simply made harder for a reviewer to see. I’ve also corrupted the header of one document and password-protected another. The measure of your tool’s performance isn’t just that it finds a single instance of the forty words and phrases in each document, but how many instances of each it correctly identifies across all documents using the specified search terms. Here’s an occurrence matrix listing the minimum number of discrete occurrences of each search term appearing in each of the ten lettered documents in the collection:
I concede that this testing corpus is far from optimum in terms of putting common e-discovery search and indexing tools through their paces. My hope is that it prompts others to build and distribute better ones.
If your tool of choice falls far short of catching all the occurrences in all the files when deployed with customary settings, even this homespun effort should prompt you to ask what might be afoot and consider what you’re missing. Happy hunting!
Andy Wilson (logikcull.com) said:
Challenge accepted Craig. Just ran this set through logikcull.com without any modifications to it. Here’s the output in CSV, PDF, TEXT, and NATIVE:
https://dl.dropboxusercontent.com/u/5590447/logikcull/ZARFBARF0001.zip
Logikcull didn’t fix the corrupted file or break the PDF password, but I think it did fairly well. Check out the load file and text files. Noticed you dumped a lot of the keywords into the metadata fields and speaker notes. Clever.
If you’re curious, Logikcull is built on a distributed Lucene index, so it abides by the Lucene indexing rules (with slight mods for eDiscovery purposes). But the text and metadata extraction are done mostly by native apps, like MS Word, etc. And the OCR text for docs H, I, J is done by ABBYY.
Great test. Thanks for posting.
John Tredennick said:
Thank you Craig. This was another great post.
We use the MarkLogic search engine and it does not have stop words. You can search for “The” or “To be or not to be” with our engine.
I loaded up the files and text just for fun and ran our PowerSearch utility against it. Here is a link to the report http://bit.ly/160EyfM
You can see that we hit on all the terms in all of the documents. We don’t list the hit counts in this report but you can see how many instances hit when you view the preview with highlights enabled.
We did not try to break the PDF encryption nor correct the corrupted Word file. We are not a forensics shop and would normally look to our partners for this service if the client requested it.
This was a great service including the hidden comments and white one-point font text.
Greg Buckles said:
Nice job Craig. We have an evolving set of spiked validation files that our clients use for acceptance and validation testing. Great work, and I'm not surprised that most of the dtSearch-based applications have not jumped up to proclaim their results. There are quite a few alternative index systems (like Catalyst’s MarkLogic XML index) that do not use stop words, but the traditional indexes (dtSearch, ISYS, Lucene, SQL, etc.) use them for efficiency. Trust but VERIFY!
Pingback: Finding Handwritten Documents from among Thousands of Scans
Pingback: Finding Handwritten Documents from among Thousands of Scans | Catalyst E-Discovery Search Blog
Pingback: Finding Handwritten Documents from among Thousands of Scanned Files | Catalyst Repository Systems