Be honest. Wouldn’t you love to stick it to the plaintiffs? Wouldn’t your corporate client or carrier be ecstatic if you could make litigation much more expensive for those greedy opportunists bringing frivolous suits and demanding discovery? What if you could make discovery not just more costly, but make it, say, five times more costly, ten times more costly, than it is for you? Really bring the pain. Would you do it?
Now that I have your attention–and the attention of plaintiffs’ counsel wondering if they’ve stumbled into a closed meeting at a corporate counsel retreat—I want to show you this is real. Not just because I say so, but because you prove it to yourself. You do the math.
Math! You didn’t say there would be math!
Stop. You know you’re good at math when the numbers come with dollar signs. Legendary Texas trial lawyer W. James Kronzer used to say to me, “I’m no good at math, Herman; but I can divide any number by three.” That was back when a third was the customary contingent fee.
Even after you do the math, you’re not going to believe it; instead, you’ll conclude it can’t be true. Surely nothing so unjust could have escaped my notice. Why would Courts allow this? How can I be such a sap?
The real question is this: What am I going to do about it?
First, a few facts to get on the same page.
FACT 1: Requesting Parties May Specify a Form or Forms of Production
When you pursue discovery, you may call what you seek “documents” and mentally equate them to paper records, but it’s electronically stored information (ESI). ESI must be produced in specified forms of production, either in native forms (being the form stored and used in the ordinary course of business) or in a static image format (a black and white screenshot of each page called a TIFF image plus a load file or files holding text and metadata). There are also near-native forms of production, such as when e-mailboxes are produced as individual messages called MSGs or EMLs.
If you’re thinking, ‘No, we just print it out and deal with it on paper,’ then this essay isn’t for you. All the best, but this is for lawyers handling cases with lots of information; lawyers with thousands or millions of documents and messages to plow through.
The Federal Rules of Civil Procedure and most states’ rules empower a requesting party to specify the form or forms in which electronically stored information is to be produced. FRCP Rule 34(b)(1)(C). If a requesting party fails to specify a form of production (and you should NEVER fail to specify), a producing party must supply ESI in the form or forms in which it is ordinarily maintained or in a reasonably usable form or forms. FRCP Rule 34(B)(2)(e)(ii)
Fact 2: Most E-Discovery Service Providers are “In the Cloud” and Charge by Data Volume
Whether in native or static image format, ESI must be processed (“ingested”) and hosted to be searchable and reviewable. [Again, if your reaction is, ‘we just print it out’ or ‘we don’t search it electronically,’ this isn’t for you.] Native forms are processed to extract their text and metadata, then indexed for search. TIFF and load file productions are indexed for search and processed to pair the page images with text and metadata. Either way, you pay a vendor to prepare the production for viewing and then pay a recurring “hosting” charge for online access to the production. The fees charged are based on the volume of data processed and/or hosted. More data costs more money. If you receive 10 times as much data, you pay a commensurate amount more to ingest and host. Vendors usually assess hosting fees as a monthly subscription, so the more data they host for you, the more you pay every month for the life of the case.
It’s no accident that vendor charges are opaque and vary wildly; but, at the bottom line, the rule is more data, more dollars.
Fact 3: TIFF images of native files are much larger than the native files.
More data isn’t the same thing as more information because not all electronic forms of information are equally efficient. When you convert native forms to static images and load files you explode the size of production by many multiples, and static productions come burdened by the further cost of impaired searchability, diminished functionality and lost color, animation and rich media. With TIFF, you get less and pay more. Not 50% or 100% more; perhaps a thousand percent more and beyond. This is notably the case for Word documents, PowerPoint presentations, Excel spreadsheets and collections of e-mail messages and attachments–the native forms at the heart of electronic discovery. The difference is genuine, material and carries a big bottom-line cost.
That’s a categorical statement, and some will immediately search for an exception. They will wonder, is it possible to fashion a native file larger than its TIFF counterpart? You could certainly construct a PowerPoint or Word document so laden with hi-resolution color photos, sound and video that, once you strip away the rich content, a static black & white image would occupy a size smaller than the native. But is a TIFF shorn of sound, video and color comparable, and is such a file representative of most collections produced in e-discovery? An emphatic “no,” on both counts, and it’s not an apples-to-apples” comparison.
A production must be reasonably usable. The TIFF without sound and video isn’t. When you add back the rich media and produce with extracted sound and video files, the TIFF production is indeed larger than the native, and more unwieldy.
You Do the Math
If you have an e-discovery processing tool available to you, it’s easy to process a collection of native Word files to a TIFF+ production then compare the collective file size of the TIFF images and extracted text to the collective size of the native files. Once you see how many times bigger the TIFF+ set is versus the native, you’ll understand why requesting parties pay so much more to load and host TIFF+ productions over native productions.
If you don’t have a processing tool, you’ll need a way to generate monochrome Group IV 600 dpi TIFF images like those produced in e-discovery. You could use an online file conversion tool, or you could save the DOCX file as a PDF and then use Adobe Acrobat to convert the file to a series of single page TIFF images. For a proper comparison, generate the TIFF images as single-page monochrome Group IV images at 600 dpi resolution. No multipage or color TIFF sets.
I used both approaches, processing a 540KB Word document grabbed online in Nuix Workstation 8, and also pulling the file into Adobe Acrobat (Create PDF from File), then using Acrobat to Save As TIFF (with Save as TIFF Settings configured to CCITT G4, Grayscale LZW, Colorspace: Monochrome and Resolution 600/pixels/inch as in figure at right). The Nuix-generated TIFF set was 4.83MB and the Adobe-generated set was 4.95MB. So, one TIFF set was 8.9 times larger and the other was 9.2 times larger than the native. You can find the same file I used at this link, but you don’t need to use the same file. You can use your own files or files from real cases. The point is to test it to your satisfaction using methods and evidence unassailable to you.
I use an exercise like the following in my teaching. Why don’t you try to solve it with your own tools?
Exercise: Calculate the Cost Difference Flowing from Alternate Forms of Production
There may be many variables that go into computing the cost of vendor services for e-discovery, and the charges for ingestion, processing, hosting and export are just parts of a more complicated puzzle. The purpose of this exercise is to gauge the difference that forms of production may make as a component of overall cost.
Problem: You are a requesting party in a federal case, and you have made a timely, compliant and unambiguous written request for production of responsive information in native and near-native forms. You have expressly requested that Microsoft Word documents be produced in their native .DOC or .DOCX formats. Your opponent instead produces Word documents to you as multiple .TIFF image files accompanied by a load file containing the extracted text from each document. When you object, your opponent counters that “this is what they always do” and that “TIFF plus load file is reasonably usable, so the Rules gave them the right to substitute TIFFs for natives.”
Assume that your opponent has produced 10,000 different Word documents which (for ease in making the calculation) are all exactly the same size as the native and converted file size for the file found at this link (a 540KB Word version of the Zubulake V opinion). Thus, the aggregate loaded and hosted native volume is 5.4GB (gigabytes). Further, assume that none of the documents are privileged or require redaction. None are hash-matching duplicates of any other items produced.
You’ve contracted with an e-discovery service provider to load and host the documents produced so you can review and tag the documents for use in the case. The service provider charges by the gigabyte to load and host the data month-to-month. The provider proposes to charge $20/gigabyte/month for loading and hosting the data
Any fraction of a gigabyte will be rounded up to a full gigabyte when calculating charges. The extracted text file size for the single file is 126KB or 1.26GB for all 10,000 files.
You intend to approach the Court to compel your opponent to produce the documents in the form you designated, and in addition to raising issues of utility, completeness and integrity, you want to determine whether the form produced to you will prove more expensive to load and host for the one-year period you expect to have the data online.
Question: If you accept the production in TIFF and load file, approximately how much more will it cost you over twelve months versus the same production in native forms?
How to Solve this Problem:
Step 1: Normalize the file sizes. Because the prices are quoted in gigabytes, you will want to express all data volumes in gigabytes, rather than as kilobytes or megabytes.
Remember: A kilobyte is one thousand bytes. A megabyte is one thousand kilobytes and a gigabyte is one thousand megabytes.
Step 2: Calculate the cost of Native Production using normalized values:
Native Production: Ten thousand files, each 540KB in size, is 5,400,000KB or 5.4GB. As noted above, productions in any format are processed to create a searchable index of extracted text; so, we must also add the extracted text for all the files, which will be ten thousand times 126KB or 1.26GB. An index is typically more compact than the aggregate extracted text, but I’ll use the aggregate value to assure a conservative comparison. The cost to load and host for one year would be:
Load and Host (5.4GB PLUS 1.26GB, rounded up to 7GB at $20.00/GB/month x 12 months) = $1,680.00
Step 3: Calculate the cost of TIFF and Text Load File Production using normalized values:
TIFF Plus Production: Ten thousand TIFF image sets, each 4.83 MB in size, is 48.3GB. We again add the extracted text for all the files, 1.26GB. Any fraction of a gigabyte must be rounded up to the next whole gigabyte. Consequently, the value we use for the aggregate size to load and host TIFF+ is the sum of 48.3GB plus 1.26GB rounded up to the next whole gigabyte or 50 gigabytes.
Load and host (50GB at $20.00/GB/month x 12 months) = $12,000.00
The cost difference would be ($12,00.00 less $1,680.00) = $10,320.00.
Producing in TIFF costs the requesting party seven times as much as the same data produced as native files.
This is an ultraconservative example. The exemplar file has no tracked changes or embedded comments and 10,000 documents is hardly a big production set. The differential is typically much larger (fifteen times more is commonplace) and will run to hundreds of thousands of dollars in large matters like MDL cases.
Cost is cause enough to demand production in native forms, but when an opponent produces in native formats, you’re getting what the other side used in the ordinary course of business. It’s the real evidence. It’s a form witnesses recognize. It’s complete and utile. Crucially, you can convert native forms to other forms–including static image formats–for those times you may want alternative formats.
But it doesn’t work both ways. You can’t convert TIFF images back to native originals. Not really. You can’t slim bloated static images down to svelte native forms. You can’t restore animations, color, formulas, tracked changes and comments, application metadata or hash values.
With TIFF productions, you’re stuck. You must pay vendors to ingest and host at grossly inflated data volumes. You have no choice. It’s like buying a car and the dealer delivers it encased in a block of concrete. You’re not going anywhere.
When you focus on facts, TIFF+ productions are less for more. Not a bit more, many multiples more. Real out-of-pockets dollars. So, if you want to fly in the face of FRCP Rule One’s goal of just, speedy and inexpensive litigation and really do the other side dirty, here’s the way. And as for all you requesting parties letting the other side do this to you, isn’t it time to stop the joke being on you?
Andrew Caspersonn said:
I whole heartedly agree! Whenever we receive productions with TIFs I shudder.
How do you feel about production of PDF files created straight from the native and stamped with a document ID. These are often of similar size to the natives and while don’t have the same utility as a native, they are easier to distribute, are less likely to be accidently amended and can be tracked back to the database or original source using the document ID.
We have done this for a while in Australia and while we may do some productions in native, the majority are PDF based.
I wholeheartedly agree that PDF productions are superior to TIFF in every important way. They are nowhere near as bloated as TIFF images and can transport embedded searchable text. In fact, they can carry binary content, so could even deliver the native file, too. PDF poses issues for embedded comments, search of same, threading of messaging, spreadsheets and overall utility but not the debacle of TIFF, to be sure.
Garrett Discovery (@garrettdiscover) said:
Craig – Great work on this article. I see this all of the time and it is another game in eDiscovery designed to inflate costs which drives me nuts because the cost of eDiscovery is not going down.
While i stopped to read your article, I was in the middle of writing a long article about the push by eDiscovery vendors to host all files in the cloud including forensic images of computers and phones. It feels that you and a few others are the only ones promoting ways to reduce cost and burden.
It feels as if no one is talking about culling ESI prior to hosting in the cloud using filters such as by file type, date parameter, custodian and keywords. It escapes me as to why anyone would promote hosting an entire forensic image of a computer in the cloud that contains mostly irrelevant ESI. As an example, I recently had two vendors quote a large municipality to host 15 custodians ESI email for 285k and 315k for six months for a grand total of 4.1TB. I found it completely a waste as the ESI needed was only six months worth totaling 34GB. After excluding certain emails that were clearly spam and automated notifications we were able to filter the data to 14GB of data using a non cloud eDiscovery system. Hosting the 14GB cost around $1500 for six months in the cloud and $900 in consulting time to thin the ESI. It is horrible that some of the industry leaders are not promoting ways to reduce the cost and instead want you to host as much as you can and index in their cloud software and charge by the gig.
Kudos on the hard work
Bharat Chovatiya said:
A great article from you as always.
A math correction suggested: You wrote “A kilobyte is one thousand bytes. A megabyte is one thousand kilobytes and a gigabyte is one thousand megabytes”
It is actually A kilobyte is 1024 bytes. A megabyte is 1024 kilobytes and a gigabyte is 1024 megabytes.
I take your point; but I’m speaking of decimal values for ease of computation, and I’ve used the accepted terminology for decimal byte values. The binary units you describe are called kibibytes, mebibytes and gibibytes. I’ve sometimes used the terms interchangeably, hopefully only when the distinction is unimportant.
Erin Rubenstein said:
Love this article, Craig. It’s amazing how powerful the argument, “but this is how we’ve always done it” is. Lawyers generally aren’t technologists, particularly old judges. Dollars and cents at least pulls everyone’s attention.
I’ve worked with several clients who are convinced that they’ll unintentionally provide sensitive information in native productions. They feel that TIFFs are safer. Do you have any thoughts on any the state of native production adjunct technology (e.g., native redaction tools including spreadsheets, audio, and video content)?
No thoughts on native redaction beyond those I’ve written about here a year ago. https://craigball.net/2019/04/23/mueller-mueller-more-e-discovery-lessons-from-bill-and-bob/
The risk of unintentionally providing sensitive information using native production is a strawman because you’re not entitled to cleanse content from evidence without legal justification and disclosure. Using proper tools, it’s not a concern at all, anymore than it is reasonable to deny requesting parties native forms because they can’t be “trusted” to use them like grownups. You simply don’t do reviews with native applications.
In a recent article, I offered the following:
If ESI were like paper, you could open each item in its associated program (its native application), review the contents and decide whether the item is relevant or privileged. But, ESI is much different than paper documents in crucial ways:
• ESI collections tend to be exponentially more voluminous than paper collections;
• ESI is stored digitally, rendering it unintelligible absent electronic processing;
• ESI carries metainformation that is always of practical use and may be probative evidence;
• ESI is electronically searchable while paper documents require laborious human scrutiny;
• ESI is readily culled, filtered and deduplicated, and inexpensively stored and transmitted;
• ESI and associated metadata change when opened in native applications;
These and other differences make it impractical and risky to approach e-discovery via the piecemeal use of native applications as viewers. Search would be inconsistent and slow, and deduplication impossible. Too, you’d surrender all the benefits of mass tagging, filtering and production. Lawyers who learn that native productions are superior to other forms of production may mistakenly conclude that native production suggests use of native applications for review. Absolutely not! Native applications are not suited to e-discovery, and you shouldn’t use them for review. E-discovery review tools are the only way to go.
Doug Austin said:
Great article as always, Craig! For those lawyers who got into the business with the understanding that “they said there would be no math”, it’s notable just how important it is to keep costs manageable — especially on the plaintiff’s side of the ledger.
One adjustment, however: I believe that most hosting providers still extract and host the text for the native files as well (at least CloudNine does) to support a common format for hit highlighting within their platform — it’s just easier that way for users. So, in your example, that would add another net total GB to the native hosting — 5.4 GB plus 1.26 GB = 6.66 GB (yikes!), which would round to 7 GB. Total hosting would be 7 GB x $20 x 12 = $1,680 for the year.
So, slightly less of a difference, but still certainly a notable difference in hosting charges with TIFF+ productions.
Agree and accept the correction gratefully. I wrote that both methods entail the creation of indices but failed to include the byte volume of same on the hosted side in both calculations. I’m fixing it now, and thank you again for keeping me on track.
Pingback: Degradation: How TIFF+ Disrupts Search | Ball in your Court