Degradation: How TIFF+ Disrupts Search

Recently, I wrote on the monstrous cost of TIFF+ productions compared to the same data produced as native files. I’ve wasted years trying to expose the loss of utility and completeness caused by converting evidence to static formats. I should have recognized that no one cares about quality in e-discovery; they only care about cost. But I cannot let go of quality because one thing the Federal Rules make clear is that producing parties are not permitted to employ forms of production that significantly impair the searchability of electronically stored information (ESI).

In the “ordinary course of business,” none but litigators “ordinarily maintain” TIFF images as substitutes for native evidence. When requesting parties seek production in native forms, responding parties counter with costly static image formats by claiming they are “reasonably usable” alternatives. However, the drafters of the 2006 Rules amendments were explicit in their prohibition:

[T]he option to produce in a reasonably usable form does not mean that a responding party is free to convert electronically stored information from the form in which it is ordinarily maintained to a different form that makes it more difficult or burdensome for the requesting party to use the information efficiently in the litigation. If the responding party ordinarily maintains the information it is producing in a way that makes it searchable by electronic means, the information should not be produced in a form that removes or significantly degrades this feature.

FRCP Rule 34, Committee Notes on Rules – 2006 Amendment.

I contend that substituting a form that costs many times more to load and host counts as making the production more difficult and burdensome to use. But what is little realized or acknowledged is the havoc that so-called TIFF+ productions wreak on searchability, too. It boggles the mind, but when I share what I’m about to relate with opposing counsel, they immediately retort, “that’s not true.” They deny the reality without checking its truth, without caring whether what they assert has a basis in fact. And I’m talking about lawyers claiming deep expertise in e-discovery. It’s disheartening, to say the least.

A little background: We all know that ESI is inherently electronically searchable. There are quibbles to that statement but please take it at face value for now. When parties convert evidence in native forms to static image forms like TIFF, the process strips away all electronic searchability. A monochrome screenshot replaces the source evidence. Since the Rules say you can’t remove or significantly degrade searchability, the responding party must act to restore a measure of searchability. They do this by extracting text from the native ESI and delivering it in a “load file” accompanying the page images. This is part of the “plus” when people speak of TIFF+ productions.

E-discovery vendors then seek to pair the page images with the extracted text in a manner that allows some text searchability. Vendors index the extracted text to speed search, a mapping process intended to display the page where the text was located when mapped. This is important because where the text appears in the load file dictates what page will be displayed when the text is searched and determines whether features like proximity search and even predictive coding work as well as we have a right to expect. Upshot: The location and juxtaposition of extracted text in the load file matters significantly in terms of accurate searchability. If you don’t accept that, you can stop reading.

Now, let’s consider the structure of modern electronic evidence. We could talk about formulae in spreadsheets or speaker notes in presentations, but those are not what we fight over when it comes to forms of production. Instead, I want to focus on Microsoft Word documents and those components of Word documents called Comments and Tracked Changes, particularly Comments because these aren’t “metadata” by any stretch. Comments are user-contributed content, typically communications between collaborators. Users see this content on demand, and it’s highly contextual and positional because it is nearly always a comment on adjacent body text. It’s NOT the body text, and it’s not much use when it’s separated from the body text. Accordingly, Word displays comments as marginalia, giving them the power of place but not enmeshing them with the body text.

But what happens to these contextual comments when you extract the text of a Word document to a load file and then index the load files?

There are three ways I’ve seen vendors handle comments and all three significantly degrade searchability:

First, they suppress comments altogether and do not capture the text in the load files. This is content deletion. It’s like the content was never there and you can’t find the text using any method of electronic search. Responding parties don’t disclose this deletion nor is it grounded on any claim of privilege or right. Spoliation is just S.O.P.

Second, they merge the comments into the adjacent body text. This has the advantage of putting the text more-or-less on the same page where it appears in the source, but it also serves to frustrate proximity search and analytics. The injection of the comment text between a word combination or phrase causes searches for that word combo or phrase to fail. For example, if your search was for ignition w/3 switch and a four-word comment comes between “ignition” and “switch,” the search fails.

Third, and frequently, vendors aggregate comments and dump them at the end of the load file with no clue as to the page or text they reference. No links. No pointers. Every search hitting on comment text takes you to the wrong page, devoid of context.
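The second and third failure modes are easy to reproduce. What follows is a minimal sketch in Python, not any vendor’s actual indexing logic: `within` is a hypothetical stand-in for a review platform’s w/N proximity operator, `page_of_hit` is a hypothetical stand-in for the text-to-page mapping, and the sample body text, comment and page layout are invented for illustration. Real platforms tokenize and count word distance in their own ways.

```python
import re

def tokenize(text):
    """Lowercase word tokens; a simplification of what a real index does."""
    return re.findall(r"[a-z0-9']+", text.lower())

def within(text, first, second, n):
    """Hypothetical 'first w/n second': True if the two terms occur
    no more than n word positions apart anywhere in the text."""
    tokens = tokenize(text)
    pos_first = [i for i, t in enumerate(tokens) if t == first]
    pos_second = [i for i, t in enumerate(tokens) if t == second]
    return any(abs(a - b) <= n for a in pos_first for b in pos_second)

def page_of_hit(pages, term):
    """Return the 1-based number of the first page whose text contains term."""
    for number, text in enumerate(pages, start=1):
        if term in tokenize(text):
            return number
    return None

# Invented sample data: one sentence of body text and a four-word comment.
body = "The ignition switch failed during testing."
comment = "Legal must review this"

# Second handling: the comment text is merged into the adjacent body text.
merged = "The ignition " + comment + " switch failed during testing."

# Third handling: the comment belongs to page 1 but is dumped after the last page.
native_pages = [body, "Warranty terms follow on the last page."]
appended_pages = [native_pages[0], native_pages[1] + " " + comment]

print(within(body, "ignition", "switch", 3))    # True: the terms are adjacent
print(within(merged, "ignition", "switch", 3))  # False: four comment words intervene
print(page_of_hit(appended_pages, "legal"))     # 2, though the comment sits on page 1
```

The phrase is in the document both times; only the arrangement of the extracted text changed. The same goes for the page mapping: the hit on the comment is real, but it lands on the page where the text was dumped, not the page it annotates.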

Some of what I describe are challenges inherent to dealing with three-dimensional data using two-dimensional tools. Native applications deal with Comments, speaker notes and formulae three-dimensionally. We can reveal that data as needed, and it appears in exactly the way witnesses use it outside of litigation. But flattening native forms to static images and load files destroys that multidimensional capability. Vendors do what they can to add back functionality; but we should not pretend the results are anything more than a pale shadow of what’s possible when native forms are produced. I’d call it a tradeoff, but that implies requesting parties know what’s being denied them. How can requesting party’s counsel know what’s happening when responding parties’ counsel haven’t a clue what their tools do, yet misrepresent the result?

But now you know. Check it out. Look at the extracted text files produced to accompany documents with comments and tracked changes. Ask questions. Push back. And if you’re producing party’s counsel, fess up to the evidence vandalism you do. Defend it if you must but stop denying it. You’re better than that.
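If you hold the native Word file for even one produced document, checking for suppressed comments takes little effort. Here is a minimal sketch, assuming a .docx source: a .docx file is a ZIP archive, and Word stores comment text in the part named `word/comments.xml`. The function names `docx_comments` and `missing_comments` are my own, and the match is a crude whitespace-normalized, case-insensitive substring test. It flags comments that never made it into the extracted text (the first failure mode); it says nothing about whether a surviving comment ended up in the right place.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by the parts inside a .docx archive.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"

def docx_comments(docx):
    """Return the text of each comment in a .docx (a path or file-like object)."""
    with zipfile.ZipFile(docx) as archive:
        if "word/comments.xml" not in archive.namelist():
            return []  # the document has no comments part
        root = ET.fromstring(archive.read("word/comments.xml"))
    comments = []
    for node in root.iter(f"{{{W_NS}}}comment"):
        # Join the text runs of each paragraph, then the paragraphs themselves.
        paragraphs = [
            "".join(t.text or "" for t in p.iter(f"{{{W_NS}}}t"))
            for p in node.iter(f"{{{W_NS}}}p")
        ]
        comments.append(" ".join(paragraphs).strip())
    return comments

def missing_comments(docx, extracted_text):
    """List the comments whose text appears nowhere in the produced extracted text."""
    haystack = " ".join(extracted_text.split()).lower()
    return [
        c for c in docx_comments(docx)
        if " ".join(c.split()).lower() not in haystack
    ]
```

Call `missing_comments` with the native file and the contents of the matching extracted text file from the load file; anything it returns is comment text that was in the evidence and is absent from the production.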

7 thoughts on “Degradation: How TIFF+ Disrupts Search”

Michelle F. said:

January 15, 2020 at 2:52 PM

So how do you handle redacting privileged and sensitive information in items such as Outlook mailboxes and messages? We can upload them to NUIX web reviewer, redact and Bates number, but then it spits them out in Adobe format.

- craigball said:
  
  January 15, 2020 at 3:01 PM
  
  How do you handle redacting privileged and sensitive information? The answer is you do it in reasonable ways that work. I almost always agree that the producing party can redact using static images with non-redacted text restored, if that’s what they want to do. Most cases involve limited redaction and so I work with my opponents to find low-cost solutions that balance the legitimate needs of producing parties to shield privileged information against the legitimate needs of a requesting party to discover non-privileged content. There are ways to redact natively, but I don’t push those unless the producing party wants to use them. I don’t let the redaction tail wag the production dog; which is to say, I don’t let the need to redact a modest percentage of a collection trigger undue expense and degraded search for the rest of the production.
  
Eric Beard said:

February 11, 2020 at 9:22 AM

Craig,

What are your thoughts on those vendors that are moving away from the traditional TIF/JPG static delivery and instead converting native evidence to searchable PDFs? Will the PDF replace the TIF/JPG standard, especially with the scaling / compression improvements and the ability to convert direct color for color? It seems that with this approach, the extracted text is embedded in the PDF, and therefore the mismatch of extracted text vs image location no longer applies.

As for the comments and tracked changes, this can still be an issue, but why not produce those Word documents natively, along with the converted / branded file? Until the review applications have a way to store the metadata / extracted content and mimic it from those Microsoft documents vs the text dump that it currently creates, providing the native file seems the ideal solution when it comes to search accuracy.

Ethan said:

April 21, 2020 at 1:52 PM

I like this thinking, but it’s also worth taking a threshold (admittedly upstream) idea into consideration too – this ‘2-D’ extracted text ‘deficiency’ is less an artifact of production format and more just a (valid) complaint about the present state and variability of text extraction in discovery software.
A thought experiment may help – will the resulting extracted text be any different between a native production subsequently loaded into discovery software XYZ, and one processed with discovery software A and transferred as a Tiff+load? In either case, receiving counsel is looking at text that’s processed by discovery software XYZ, and in either case subject to how and where discovery software XYZ integrates the comments and edits.

- craigball said:
  
  April 21, 2020 at 2:04 PM
  
  I take your point in terms of the structure of the index used for search; however, there’s a material difference in terms of how the information item under review may appear to the reviewer and hence its characterization as responsive or not. A reviewer of a TIFF+ production tends not to see the comments in context when reviewing the TIFF images. A reviewer using a proper native viewer should be able to see comments. Reviewers for a producing party have (or should have) the option to see the comments in their proper place. A requesting party getting TIFF+ has no option to see them at all.
  
Pingback: Potential Drawbacks of Non-Native Disclosure – The eDiscovery Channel

realrecords said:

June 29, 2020 at 8:00 AM

Excellent article, clearly stated as always. A main point, or “mantra,” that I hold, and that should be common knowledge if it isn’t already, is reinforced here: “content without context is just noise.” Losing context in load files or discovery collections of any type simply defeats the concept of defensible discovery.

Aaron Taylor
GreenLight Discovery LLC

Ball in your Court

~ Musings on e-discovery & forensics.
