Acrobat to the Rescue: Searching Unsearchable Productions

In a perverse irony, lawyers often ‘brag’ about how little they know about information technology; but in situations where admitting confusion could help them, they clam up. Abraham Lincoln said, “Better to remain silent and be thought a fool than to speak out and remove all doubt.” But with respect to problems in electronic discovery, it’s foolish to stay silent.

Sadly, many requesting parties are flummoxed by what’s produced to them. Rather than confess their confusion, they suffer in silence, opening or printing TIFF images one page at a time with nary a clue how to search what they’ve received. And when a production arrives broken—lacking some essential element required for completeness or functionality—the silent majority often don’t know what they’re missing. Instead, they laboriously flail away at the evidence, hoping to turn up something useful. It’s a painful and unnecessary ordeal.

Case in point: a client received a production of about 5,000 documents, mostly e-mail messages, all produced as Adobe Portable Document Format files, or PDFs. Though the documents derived from inherently searchable electronic originals, all the PDFs were created without a searchable text layer, and no extracted text or any fielded data were furnished in accompanying load files. Ouch!

E-discovery denizens reading this will grasp the deviousness of the production. It ruthlessly destroys any ability to search or sort the documents electronically and runs afoul of the Federal mandate stating, “If the responding party ordinarily maintains the information it is producing in a way that makes it searchable by electronic means, the information should not be produced in a form that removes or significantly degrades this feature.” Comments to Rule 34(b) of the Federal Rules of Civil Procedure.

Innocent mistake? Hardly. The producing party is a Fortune 50 corporation with a storied history of discovery abuse. It’s not their first rodeo.

The producing party surely knows that it will have to supply a replacement production, if sought; but, it also knows that most requesting parties won’t raise a ruckus for fear that an objection will prompt a humiliating, “apparently you don’t understand how to use what we gave you.” With the lack of e-discovery competence extant, most opponents will let it pass unaware. Ignorance is bliss, more so when you can take advantage of the ignorant.

But stripping out searchability and holding back load files is advantageous even when sprung on a savvy opponent like my client. It buys time. Depositions must be put off and discovery deadlines or trial dates moved. Opponents squander resources fiddling with the broken production, drafting motions and hiring experts. It’s a tactic that rarely engenders sanctions or cost-shifting because few judges are going to punish a producing party who agrees to promptly supplement and supply the missing data. Every dog gets one bite…per lawsuit.

So, if you’ve received a production like the brain-dead PDFs mentioned, how do you muddle through and deny your opponent the benefit of such delaying tactics? There’s no pat answer, but I’ll describe the quick-and-dirty approach I took to assist a lawyer who, on the eve of depositions, said, “I’ve just got to go forward with what I’ve got.”

If you’re stuck with unsearchable document images, there are three things you can do to add electronic searchability:

You can obtain the native source document or a near-native counterpart;
You can obtain extracted text and the requisite load file data that pairs the text with the images; or,
You can run optical character recognition (OCR) against the images to extract text.

The third option is the only one you can undertake without obtaining further production from the other side, so it was the only option here.[1]

For the most part, the PDFs produced held clean text. That is, because they derived from electronic originals, there were few handwritten annotations, skewed scans, funky fonts or other characteristics to confound OCR. OCR is error-prone at its best; but, it performs abysmally on anything but clean text images.

Once they had the extracted text of the documents in an electronic format, my clients would need a means to pair the extracted text with the correct page image and to search the text. If the mechanism employed indexed the text so as to speed search and supported Boolean and proximity searching, even better.
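To make those search requirements concrete, here is a minimal Python sketch of what “indexing” buys you. It is not how Acrobat works internally; it is an illustration under stated assumptions. The page identifiers (such as "DOC0001.pdf#1"), the sample text and the function names are all hypothetical. Each word is recorded against the page it came from, along with its position on that page, so that Boolean and proximity questions can be answered without rereading every document.

```python
import re
from collections import defaultdict

TOKEN = re.compile(r"[a-z0-9]+")


def tokenize(text):
    """Lowercase the text and split it into simple alphanumeric words."""
    return TOKEN.findall(text.lower())


def build_index(pages):
    """Map each word to {page_id: [word positions]} for the given OCR text."""
    index = defaultdict(dict)
    for page_id, text in pages.items():
        for position, term in enumerate(tokenize(text)):
            index[term].setdefault(page_id, []).append(position)
    return index


def search_and(index, *terms):
    """Boolean AND: page ids containing every term."""
    sets = [set(index.get(term.lower(), {})) for term in terms]
    return set.intersection(*sets) if sets else set()


def search_or(index, *terms):
    """Boolean OR: page ids containing at least one term."""
    sets = [set(index.get(term.lower(), {})) for term in terms]
    return set().union(*sets)


def search_near(index, first, second, distance):
    """Proximity: page ids where the two words fall within `distance` words."""
    first, second = first.lower(), second.lower()
    hits = set()
    for page_id in search_and(index, first, second):
        positions_a = index[first][page_id]
        positions_b = index[second][page_id]
        if any(abs(a - b) <= distance for a in positions_a for b in positions_b):
            hits.add(page_id)
    return hits


# Hypothetical OCR output, keyed by "file#page" so each hit points to an image.
pages = {
    "DOC0001.pdf#1": "Please review the attached contract before the deposition.",
    "DOC0002.pdf#1": "The contract was signed. Weeks later, after many "
                     "unrelated emails, a deposition was scheduled.",
    "DOC0003.pdf#3": "Lunch on Friday?",
}
index = build_index(pages)

print(sorted(search_and(index, "contract", "deposition")))
print(sorted(search_near(index, "contract", "deposition", 5)))
```

The AND query returns both contract pages, while the proximity query keeps only the page where the two words sit close together. Because every hit is a page identifier, the text stays paired with the image it was read from.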

So, I turned to Adobe Acrobat. The old version 9 Pro edition of Acrobat on my machine is up-to-date enough to create Acrobat Portfolios, run OCR against the contents and even optimize the index for speedier search. It also supports Boolean and proximity searching in a simple-to-use interface that includes a preview mechanism and a basic way to annotate notable documents.

While you need Adobe Acrobat version 9, 10 or 11 to create a Portfolio, the recipient of the Portfolio just needs the ubiquitous, free Acrobat Reader application to open, view and search it. A PDF Portfolio supports a simple browser-style viewer format in Acrobat Reader, so the documents are very quick to peruse.

Here, I need to reiterate the key difference between Adobe Acrobat products that just seems to stymie so many. Adobe gives away a program called Adobe Reader. It reads PDF formats, but it doesn’t create them. Repeat: it doesn’t create PDFs or Portfolios. It just reads them. It’s called “Reader.” Why? Because IT DOESN’T CREATE PDFs. It’s free, so enjoy what it does, which is read PDFs. Only.

Adobe sells products called Acrobat (so named because you have to perform gymnastics to get people to understand that the Reader product just reads PDFs). The Acrobat products create PDFs, including Portfolios, from Version 9 forward. This is how Adobe makes money: free reader, $350 writer.

But like most law offices, you already have a copy of the Adobe Acrobat program. The writer, not the…oh, never mind.

To create the searchable Portfolio from almost 5,000 non-searchable PDFs comprising 1.7GB of data, I began by copying the PDFs I wanted to make searchable into a separate folder. Next, I ran Adobe Acrobat and selected “Create PDF Portfolio” from the File menu. The Edit PDF Portfolio window seen below opened.

[Screenshot: the Edit PDF Portfolio window]

I then selected “Add Existing Folder” from the bottom of the window and pointed the program to the folder I’d filled with unsearchable PDFs. Acrobat began assembling the Portfolio from the files. It took only a few minutes to ‘bind’ the documents into a virtual notebook; however, what I had wasn’t yet searchable.
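I did the staging step (copying the PDFs into a separate folder) by hand, but with thousands of files spread across subfolders it can be scripted. What follows is a minimal sketch using only Python’s standard library, not part of the Acrobat workflow itself. The function name and the folder names in the usage note are hypothetical. It copies every PDF it finds into one flat folder, refuses to overwrite files that share a name, and reports how many files and bytes were staged so you can check the totals against the production.

```python
import shutil
from pathlib import Path


def stage_pdfs(source_dir, staging_dir):
    """Copy every PDF under source_dir into staging_dir.

    Returns (number_of_files, total_bytes). Raises FileExistsError rather
    than overwrite when two PDFs in different subfolders share a file name.
    """
    source = Path(source_dir)
    staging = Path(staging_dir)
    staging.mkdir(parents=True, exist_ok=True)
    staging_resolved = staging.resolve()

    # Collect first, so files copied into a staging folder nested inside the
    # source folder are never picked up as new input.
    pdfs = [
        path
        for path in sorted(source.rglob("*"))
        if path.is_file()
        and path.suffix.lower() == ".pdf"
        and staging_resolved not in path.resolve().parents
    ]

    count = 0
    total_bytes = 0
    for pdf in pdfs:
        target = staging / pdf.name
        if target.exists():
            raise FileExistsError(f"Name collision while staging: {pdf}")
        shutil.copy2(pdf, target)  # copy2 also preserves file timestamps
        count += 1
        total_bytes += target.stat().st_size
    return count, total_bytes
```

Calling stage_pdfs("production", "to_ocr") with your own folder names in place of those placeholders returns a count and byte total to compare against what the producing party says it delivered. The originals are copied, never moved, so the production as received stays untouched.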

The next step was to run optical character recognition against all the documents in the Portfolio. Adobe Acrobat has a built-in basic OCR capability. From the Document menu, I selected OCR Text Recognition>Recognize Text in Multiple Files Using OCR. The dialog box that appeared allowed me to Add Files > Add Open Files. As I’d not yet saved it, the portfolio in progress was called “Portfolio1.pdf” by default. I selected it and my Output Options; then, I left for dinner because it would take hours for Acrobat to extract text from an estimated 30-40 thousand page images using optical character recognition.
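If you want a rough idea of how long a dinner to plan, the arithmetic is simple. The sketch below uses the figures from this production (about 5,000 documents and an estimated 30,000 to 40,000 page images). The one-second-per-page OCR rate is purely an assumption for illustration; I didn’t time Acrobat, and your machine and documents will differ.

```python
def ocr_hours(pages, seconds_per_page):
    """Estimated OCR run time in hours at a constant per-page rate."""
    return pages * seconds_per_page / 3600


DOCUMENTS = 5_000                  # approximate document count in the production
PAGE_ESTIMATES = (30_000, 40_000)  # estimated range of page images
ASSUMED_SECONDS_PER_PAGE = 1.0     # hypothetical rate, not a measured figure

for pages in PAGE_ESTIMATES:
    per_document = pages / DOCUMENTS
    hours = ocr_hours(pages, ASSUMED_SECONDS_PER_PAGE)
    print(f"{pages:,} pages: about {per_document:.0f} pages per document, "
          f"roughly {hours:.1f} hours of OCR")
```

Swap in a rate measured on a small sample of your own documents before relying on the estimate.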

Before you vendors reading this add, “Our tool would be better for this,” please remember that the goal here was fast and cheap. Your wares cost more than free and carry a steeper learning curve than an application law firms already have and use. Adobe Acrobat doesn’t deliver the benefits of applications purpose-built for e-discovery; but, it’s the butter knife that serves as a decent screwdriver in a pinch.

When the OCR engine completed its work, all of the documents in the collection were text searchable…sort of. Text in uncommon typefaces or unclear to the OCR engine was rendered incorrectly or not at all. Grayscale content remained largely unsearchable. What emerged was far more utile than what was produced, but fell short of what should be exchanged in e-discovery.

Searching was slow because each PDF in the portfolio had to be searched one-by-one. To speed search, the next step was to generate an index for the contents of the portfolio. From the Advanced menu, I selected Document Processing, set my parameters and generated an index.[2] I let this run for a few hours more until completion.

Now, I had something I could give my client to enable his team to run text and proximity searches against the collection,[3] even if the only tool they had to use was a free copy of Acrobat Reader. It’s even feasible for reviewers using Acrobat to add tags in each document’s description field (or in a custom field added by the reviewer) and sort by those fields and tags. A lagniappe of the process is that, by consolidating the PDFs into a Portfolio, they’re compressed and stored more efficiently. Even with the added text, the searchable Portfolio is one-third the collective size of the documents it holds.

My client can now prepare for depositions. Acrobat rode to the rescue; yet, the Portfolio workaround detailed here is far from optimum. It’s triage: quick, low cost and preferable to having no review platform and no ability to search the production, but no substitute for a proper production.

[1] I suppose you could have typists recreate all of the text in the documents manually; but, I shudder to think what that would cost.

[2] In Acrobat 10 and 11, look for this option in the Tools menu.

[3] For Boolean and proximity searches, use the Advanced Search dialogue box. If you have trouble getting the Advanced Search box to appear (as I did with Acrobat 9), try this: Open Acrobat, then open the Advanced Search dialogue box and only then open the Portfolio file. The window stays open and supports the advanced search options.

4 thoughts on “Acrobat to the Rescue: Searching Unsearchable Productions”

Ralph Losey said:

July 23, 2013 at 6:34 AM

Good article from a technical and practical standpoint. Please note the footnote links did not work for me. Also, you do not mention the name of the Fortune 50 co., that supposedly has a routine bad faith attitude and actions towards e-discovery. How come? Kind of impugns the integrity of all 50 cos when you do that. This kind of thing, intentional violation of the rules is very mean spirited and unethical IMO, and can be fixed by courts if handled properly. You don’t get one free intentional violation in the courts I’m in.

So now to the respectful criticism of your article. The article intro makes it seem like a common practice by big cos in law suits to engage in this kind of behavior (altering documents to remove metadata and making them difficult to use, and in your example, timing a production so they can’t prepare for a depo, etc.), which, IMO, is sanctionable behavior by any attorney or party (see Bray & Gillespie where both the P and his attorneys were sanctioned for this conduct).

I have only run into intentionally stripped productions a few times, and it was always by the small to medium sized plaintiffs. I have never heard of a large co. doing this (Bray & Gillespie was a small Central Florida developer), and as an attorney who has primarily done defense work, I object to the suggestion that this is a common practice by the big bad corp world and unethical defense lawyers. It is not, at least not in my neighborhood or corps I deal with. They try hard to do the right thing, and are often faced with plaintiffs who do not try hard, and do not even try. Instead, as your article correctly notes, they brag about their ignorance of all things tech. How many P’s lawyers attend the CLEs you tirelessly teach at? How many defense counsel? Who is trying here and who is not?

- craigball said:
  
  July 23, 2013 at 11:17 AM
  
  Dear Ralph:
  
  Thanks for the comment. You certainly are reading more into what I’ve said than the words I used. Methinks the Losey doth protest too much. 😉 The object of the piece is not an indictment of the Fortune 50 or Fortune 500. I speak of a particular state of affairs that actually occurred in my life this week. When do I ever name my clients or their opponents in my writings? The educational purpose is rarely enhanced by naming names.
  
  I said this defendant has a “storied history of discovery abuse.” I didn’t say e-discovery abuse. I can’t speak to their e-discovery actions, abusive or otherwise, beyond what I saw with my own eyes. There is no mention of “big bad corporations” or “unethical defense counsel,” nor any suggestion of same. You read that into what I actually wrote, just as you seem to equate “requesting party” with “plaintiff,” before you tar the plaintiff’s bar with an awfully big brush. Why single out the north or south sides of the dockets? Discovery abuse cuts both ways, and neither side has a lock on the moral high ground.
  
  Despite your experience, Ralph, I do see what I described in this post in courts all over the country, including within Florida (I can name names, but not here). Sanctions are a high hurdle, and one often must catch a party in shenanigans more than once to secure more than a slap on the wrist. Note how many reported e-discovery sanctions cases require about three strikes before you’re out. [Curative measures are not sanctions, as the proposed new rules make clear.] Sanctions are hard to get unless you catch someone red-handed in intentional misconduct…and maybe that’s not such a bad thing.
  
  Please appreciate that your practice and mine are very different. I work for both plaintiffs and defendants and for both labor and management. Most often, I serve only the Court as a Master or neutral. I’m typically brought in when something is wrong. Accordingly, I see more abusive practice and more serious abuse. I expect oncologists see a lot of cancer.
  
  As to who is trying out there? When it comes to the quality and efficiency of electronic discovery, neither side of the docket should be lining up for pats on the back just yet. Bragging about a lack of information technology competence seems more a function of age than whether one serves plaintiffs or defendants.
  
  Again, thanks for your comment. At least we agree that the sort of thing I describe has got to stop and should be sanctioned when seen.
  

Ball in your Court

~ Musings on e-discovery & forensics.
