I got a call from a lawyer I don’t know on Sunday evening. He reported that he’d received production of ESI from a financial institution and spent the weekend going through it. He’d found TIFF images of the pages of electronic documents, but couldn’t search them. He also found a lot of “Notepad documents.” He’d sought native production, so thought it odd that they produced so many pictures of documents and plain text files.
As it’s unlikely a bank would rely on Windows Notepad as its word processor, I probed further and learned that the production included folders of TIFF images, folders of .TXT files (those “Notepad documents”) and folders of files with odd extensions like .DAT and .OPT. My caller didn’t know what to do with these.
By now, you’ve doubtlessly figured out that my caller received an imaged production from an opponent who blew off his demand for native forms and simply printed to electronic paper. The producing party expected the requesting party to buy or own an old-fashioned review tool capable of cobbling together page images with extracted text and metadata in load files. Without such a tool, the production would be wholly unsearchable and largely unusable. When my caller protests, the other side will tell him how all those other files represent the very great expense and trouble they’ve gone to in order to make the page images searchable, as if furnishing load files to add crude searchability to page images of inherently searchable electronic documents constitutes some great favor.
It brings to mind that classic Texas comeback, “Don’t piss in my boot and tell me it’s raining.”
It also reminds me that not everyone knows about load files, those unsung digital sherpas tasked to tote metadata and searchable text otherwise lost when ESI is converted to TIFF images. Grasping the fundamentals of load files is important to fashioning a workable electronic production protocol, whether you’re dealing with TIFF images, native file formats or a mix of the two. I’ve been wanting to write about load files for a long time, but avoided it because I just hate the damn things! So, this post is a load (file) off my mind.
In simplest terms, load files carry data that has nowhere else to go. They are called load files because they are used to load data into, i.e., to “populate” a database. They first appeared in discovery in the 1980s in order to add a crude level of electronic searchability to paper documents. Then as now, paper documents were scanned to TIFF image formats and the images subjected to optical character recognition (OCR). Unlike Adobe PDF images, TIFF images weren’t designed to integrate searchable text; consequently, the text garnered using OCR was stored in simple ASCII text files named with the Bates number of the corresponding page image. Compared to paper documents alone, imaging and OCR added functionality. It was 20th century computer technology improving upon 19th century printing technology, and if you were a lawyer in the Reagan-era, this was Star Wars stuff.
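To make the naming convention concrete, here is a minimal Python sketch that pairs each page image with the text file sharing its Bates number. The function name, the folder arguments and the .tif/.tiff and .txt extensions are assumptions made for illustration; a real production may lay things out differently.

```python
from pathlib import Path

IMAGE_SUFFIXES = {".tif", ".tiff"}  # assumed image extensions


def pair_images_with_text(image_dir, text_dir):
    """Map each Bates number to (image path, text path or None)."""
    # Index the text files by file name without extension (the Bates number).
    texts = {
        p.stem: p
        for p in Path(text_dir).iterdir()
        if p.suffix.lower() == ".txt"
    }
    pairs = {}
    for img in sorted(Path(image_dir).iterdir()):
        if img.suffix.lower() in IMAGE_SUFFIXES:
            # A missing text file shows up as None, i.e. an unsearchable page.
            pairs[img.stem] = (img, texts.get(img.stem))
    return pairs
```

Any page whose second element comes back as None has an image but no accompanying text, which is worth knowing before anyone tries to search the set.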
Metadata is “data about data.” While we tend to think of metadata as a feature unique to electronic documents, paper documents have metadata, too. They come from custodians, offices, files, folders, boxes and other physical locations that must be tracked. Still more metadata takes the form of codes, tags and abstracts reflecting reviewers’ assessments of documents. Then as now, all of this metadata needs somewhere to lodge as it accompanies page images on their journey to document review database tools (a/k/a “review platforms”) like Concordance or Summation, venerable products that survive to this day. This data goes into load files.
Finally, we employ load files as a sort of road map and as assembly instructions laying out, inter alia, where document images and their various load files are located on disks or other media used to store and deliver productions and how the various pieces relate to one-another.
So, to review, some load files carry extracted text to facilitate search, some carry metadata about the documents and some carry information about how the pieces of the production are stored and how they fit together. Load files are used because neither paper nor TIFF images are suited to carrying the same electronic content; and if it weren’t supplied electronically, you couldn’t load it into review tools or search it using computers.
Before we move on, let’s spend a moment on the composition of load files. If you were going to record many different pieces of information about a group of documents, you might create a table for that purpose. Possibly, you’d use the first column of your table to give each document a number, then the next column for the document’s name and then each succeeding column would carry particular pieces of information about the document. You might make it easier to tell one column from the next by drawing lines to delineate the rows and columns, like so:
Those lines separating rows and columns serve as delimiters; that is, as a means to (literally) delineate one item of data from the next. Vertical and horizontal lines serve as excellent visual delimiters for humans, where computers work well with characters like commas, tabs and such. So, if the data from the table were contained in a load file, it might appear as follows:
Note how each comma replaces a column divider and each line signifies another row. Note also that the first or “header” row is used to define the type of data that will follow and the manner in which it is delimited. When commas are used to separate values in a load file, it’s called (not surprisingly) a “comma separated values” or CSV file. CSV files are just one example of standard forms used for load files. More commonly, load files adhere to formats compatible with the Concordance and Summation review tools. Concordance load files typically use the file extension DAT and the þ¶þ characters as delimiters, e.g.:
Concordance Load File
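For readers who want to see how such a file is taken apart, here is a minimal Python sketch. It assumes the common convention in which þ wraps every field and the separator shown as ¶ is stored as ASCII character 20; if a file stores a literal ¶, pass that as the delimiter instead. The function names and the field names in the usage note are hypothetical, and the sketch ignores embedded newlines within fields.

```python
QUOTE = "\u00fe"  # þ, the text qualifier wrapped around every field
DELIM = "\x14"    # ASCII 20, usually displayed as ¶ (an assumption; check your file)


def split_dat_line(line, delim=DELIM, quote=QUOTE):
    """Split one delimited line into its field values."""
    line = line.rstrip("\r\n")
    if not (line.startswith(quote) and line.endswith(quote)):
        raise ValueError("line is not wrapped in text qualifiers")
    # Drop the outer qualifiers, then split on the qualifier-delimiter-qualifier run.
    return line[1:-1].split(quote + delim + quote)


def read_dat(lines, delim=DELIM):
    """Treat the first line as the header row; return one dict per record."""
    rows = [split_dat_line(ln, delim) for ln in lines if ln.strip()]
    header, records = rows[0], rows[1:]
    return [dict(zip(header, rec)) for rec in records]
```

Given a header line of þBEGDOCþ¶þAUTHORþ and a data line of þ0000005þ¶þR. Jonesþ, read_dat yields a single record tying BEGDOC to 0000005 and AUTHOR to R. Jones.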
Summation load files typically use the file extension DII, but do not structure content in the same way as Concordance load files; instead, Summation load files separate each record like so:
Summation Load File
; Record 1
@C ENDDOC 0000004
@C PGCOUNT 4
@C AUTHOR J. Smith
; Record 2
@T 0000005
@DOCID 0000005
@MEDIA eDoc
@C ENDDOC 0000005
@C PGCOUNT 1
@C AUTHOR R. Jones
@DATESAVED 02/03/2013
@EDOC \NATIVE\Memo.docx
; Record 3
@T 0000006
@DOCID 0000006
@MEDIA eDoc
@C ENDDOC 0000073
@C PGCOUNT 68
@C AUTHOR H. Block
@DATESAVED 04/14/2013
@EDOC \NATIVE\Taxes_2013.xlsx
; Record 4
@T 0000074
@DOCID 0000074
@MEDIA eDoc
@C ENDDOC 0000089
@C PGCOUNT 15
@C AUTHOR A. Dobey
@DATESAVED 05/25/2013
@EDOC \NATIVE\Policy.pdf
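A rough Python sketch of reading records shaped like these follows. It rests on assumptions drawn only from the sample: one @ token per line, lines starting with a semicolon are comments, @T opens a new record, and @C introduces a custom field name followed by its value. The function name is hypothetical, and real DII files have features this does not handle.

```python
def parse_dii(text):
    """Return a list of dicts, one per record, from DII-style text."""
    records = []
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        # Skip blanks, "; Record n" comments and anything that is not a token.
        if not line or line.startswith(";") or not line.startswith("@"):
            continue
        token, _, rest = line[1:].partition(" ")
        rest = rest.strip()
        if token == "T":
            # @T marks the start of a new record.
            current = {"T": rest}
            records.append(current)
            continue
        if current is None:
            # Tolerate fields that appear before any @T token.
            current = {}
            records.append(current)
        if token == "C":
            # "@C NAME value": the first word names the custom field.
            name, _, value = rest.partition(" ")
            current[name] = value.strip()
        else:
            current[token] = rest
    return records
```

Fed Record 2 above, the function returns one dictionary whose AUTHOR entry is R. Jones and whose EDOC entry is \NATIVE\Memo.docx.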
Just as placing data in the wrong row or column of a table renders the table unreliable and potentially unusable, errors in load files render the load file unreliable, and any database it populates is potentially unusable. Just a single absent, misplaced or malformed delimiter can result in numerous data fields being incorrectly populated. Load files have always been an irritant and a hazard; but, the upside was they supplied a measure of searchability to unsearchable paper documents.
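One inexpensive safeguard is to count the fields in every row before loading anything. The Python sketch below does this for a comma-delimited file using the standard csv module; the function name and the sample column names in the usage note are assumptions for illustration, not part of any review tool.

```python
import csv
import io


def find_misaligned_rows(text, delimiter=","):
    """Return (expected field count, [(row number, actual count), ...])."""
    reader = csv.reader(io.StringIO(text), delimiter=delimiter)
    header = next(reader)
    problems = []
    # Row 1 is the header, so data rows are numbered from 2.
    for row_no, row in enumerate(reader, start=2):
        if not row:
            continue  # ignore blank lines
        if len(row) != len(header):
            problems.append((row_no, len(row)))
    return len(header), problems
```

If a header declares DocID, Name and Author but a later row has lost one comma, that row is reported with two fields instead of three. A matching count does not prove the data is right, but a mismatch is a sure sign that fields will land in the wrong columns.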
Fast forward to a post-personal computer, post-Internet era.
The overwhelming majority of documents and communications are created and stored electronically, and only the tiniest fraction of these will ever be printed. Electronic documents are inherently searchable and do things that paper documents can’t, like dynamically apply formulas to numbers (spreadsheets), animate text and images (presentations) or carry messages and tracked changes made visible or invisible at will (word processed documents). Electronic documents also have complements of information within and without called metadata that tend to be lost when electronic documents are printed or imaged. Some of this metadata has evidentiary value (e.g., date and time information) and some has organizational value (e.g., file names).
Because electronic documents are inherently electronically searchable, there’s no need to image them or use optical character recognition to extract searchable text. Moreover, there’s less need for error-prone load files to populate review tools. Despite these advantages, many lawyers prefer to approach electronic documents in the same way they handled paper documents. That is, they convert searchable electronic documents to non-searchable, non-functional TIFF images and then attempt to graft on electronic searchability by extracting text and metadata to load files.
So, why is an old, error-prone method of data transfer still used in electronic discovery? Good question; because it’s not cheaper, and it’s certainly not better. Mostly, it’s just familiar, and its users have a sunk cost in outmoded tools and techniques. Why do some people still use thermal fax paper (for that matter, why do they still use fax machines)?
To be fair, there’s a lingering need for load files in e-discovery, too. Even native electronic documents have outside-the-file or “system” metadata that must be loaded into review tools; plus, there’s still a need to keep track of such things as the original monikers of renamed native files and the layout of the production set on the production media. In e-discovery, load files, and the headaches they bring, will be with us for a while; understanding load files helps ease the pain.
ASCII is an acronym for American Standard Code for Information Interchange and describes one of the oldest and simplest standardized ways to use numbers (particularly binary numbers expressed as ones and zeroes) to denote a basic set of English language alphanumeric and punctuation characters.
Thanks for the Load File overview, it took me back several years to painful document migration projects, that had been conveniently erased from my conscious memory 🙂 And I totally agree that converting perfectly searchable documents to TIFF images (or even worse, physically printing and rescanning them) is lunacy! I have a question though: Why haven’t these review tools (and practices) moved to a more sensible XML format for load files, instead of the inherently error-prone CSVs?
The EDRM XML standard isn’t dramatically different in its operation than the Summation load file format. Both are extensible, assuming the parties settle on the tags they employ to name the fields. I think both are a tad less error prone than a tab or character delimited, serially-accessed format; but, neither are they the great leap forward as currently implemented in e-discovery.
The problem with all current load file formats is that they support an approach to production that’s like getting furniture from Ikea with missing hardware. Maybe the desk will come together, maybe it won’t. Either way, I’d rather get the complete thing I bargained for, as complete as the one in the store.
Only a few people use the EDRM standard these days, and I must tell you that it is a real, real pain to actually process. You might have text embedded in a random XML field, or that might actually just be a path to the file on disk. Attachments could be automatically de-duplicated inside the load file, making it a disaster to try and reconstruct the proper document sets later. And sometimes they actually de-duplicate document sets by custodian, too. Good luck finding your documents when whoever generated the load file has made it more ‘efficient.’
Concordance metadata load files with Opticon or IPro format image load files have become the de facto standard because they’re bog simple and easy to parse with automated tools.
Nick Dovedan said:
Most review platforms do incorporate an EDRM XML load file format.
Yes, and almost nobody uses it. 🙂
Wade Peterson said:
Well written thoughts. You might be interested in a couple of articles I recently wrote for the native files project of EDRM – “Proposing ENF, a New Standard for Managing Native Files” http://www.edrm.net/archives/19649
Dear Mr. Peterson:
Thank you. I’d seen your articles and, at first, I was a bit peeved that the EDRM was trying to cast the ENF as something novel while failing to acknowledge my similar proposal circulated among EDD leaders seven years ago.
But, I ultimately pulled my head out of my ass and assumed that you simply weren’t aware of what’s been extant before you came up with what you thought was “new.” (I might have done the same thing in a burst of enthusiasm.) Still, you didn’t acknowledge the EnCase LEF format, the X-Ways Evidence Container format, the open source AFF or any of several other more complete proposals and implementations that have been around for years. We are all on the same team in seeking ways to improve e-discovery; so, receiving credit matters less than turning good ideas into standards people embrace. I applaud your efforts. The picture in this comment is just a static image from my 2006 Powerpoint that can be downloaded from http://www.craigball.com/Production_Objects.ppt I hope you find it interesting, and perhaps it will contribute something that will help your project move forward.
May I add that I think that your suggestion of embedding the encrypted original into a redacted production is novel and forward-thinking; but, I suspect it’s not likely to be something lawyers will readily accept. Fears–even wholly irrational ones–seem to drive a lot of what lawyers do in e-discovery.
Wade Peterson said:
Thank you, I was not aware of your earlier work, which is excellent. And interesting that we have similar independent conclusions. As I mentioned in the article, it’s the “dialog” that I think is important, since I agree with you that standards need to evolve to encompass this new world we live in where native files are becoming more and more the predominant preference. The native files project within EDRM is a new working group, and these additional sources will be extremely helpful.
Pingback: A Load (File) Off my Mind | @ComplexD
Craig Kraus said:
As always, you took something that could be daunting to explain even to an IT guy like myself and laid it out perfectly. I enjoyed the article and it was definitely a nice refresher on how the Load file fits into the big picture. I especially liked the in-line examples of the Concordance and Summation load files to see the difference in delimiters. We will definitely be recommending this article to our next batch of EDIS students at Bryan University! 🙂
Thanks, Craig Kraus. That’s really kind. Makes the pain of constructing those files by hand all worthwhile! Regards to the Bryan students.
Andy Wilson (logikcull.com) said:
Awesomely comprehensive article Craig! And you now owe me a tissue to wipe the blood from my eyes as they can’t stop bleeding after looking at that Summation load file. Yikes!
DIIs are pretty rare these days; most people use the Opticon format, which is much, much better. But still a load file.
Riaan Engelbrecht said:
Thanks Mr Ball,
I really enjoyed this article, and now I do understand. As a newbie to e-discovery, every little bit helps. I am one of the Bryan students.
Laura Noble said:
How do you suggest dealing with opposing counsel who refuses to provide e-discovery to you in anything but TIFF with load files b/c that is the system (I’m assuming Summation or the like) that their firm uses? Despite my protests to provide in native file. Judge is downright hostile to e-discovery issues.
I deal with opponents whose wrongheaded postures are supported by hostile judges by losing on those points as painlessly and graciously as possible. That said, I’ve found that if I am tenacious and work hard to educate the court, even hostile judges may come around. Choose your battles wisely, and seek to get your foot in the door by focusing on production of spreadsheets in native forms.
This is great information. I am an IT Manager supporting a legal functional group and am shopping for a document review system. This helped bridge the gap between my understanding of technology and my ignorance of this topic.
Most e-discovery vendors these days are trending towards using Relativity as their review platform.
Do be aware that Relativity does not include a processing engine, so if you want to load data into it, you will need (you guessed it), load files.
Any competent e-discovery vendor must be able to handle load files, and they are very unlikely to go anywhere any time soon. Why? One word: interoperability.
There are quite a few e-discovery tools out there these days, and it is very, very difficult to integrate them all together. A lot of companies try to write their own custom solutions for moving the data around between ingestion engines, review platforms, and imaging systems. This, however, takes a lot of time and effort.
In most cases, they’re dealing with load files to do this. It’s the big reason why ETL tools exist.
Peter v said:
The issue isn’t so much the load files – as you mention, they are still useful even with native production, but rather, the continued use of image production over native production. Great comments. While I see some advantages to using image production, such as the static capture of the document contents, the fact that most production these days originates in electronic form really does make it seem wasteful to convert them to TIFF images.