Forms that Function

Over the course of the last decade, it’s been a Sisyphean task to get lawyers to lay aside rigid ideas about forms of production in e-discovery and focus on selecting forms that function.

“Forms that function.”  Forms of production that work.

Ever since the demanding class, “Architecture for Non-Architects” at Rice University, I’ve been a wannabe architect, and the battle cry, “form follows function,” my mantra.  It’s ascribed to Louis Sullivan, legendary American architect and Father of the Skyscraper.  “Form follows function” fairly defines what we think of as “modern,” and it’s a credo at the heart of the clearest idea I’ve had in a while, being that we should produce e-mail in forms that can be made to function in common e-mail client programs like Microsoft Outlook.

I don’t point to Outlook because I think it a suitable review platform for ESI (I don’t, though many use it that way).  I point to Outlook because it’s ubiquitous and, if a message is produced in a form that can be imported into Outlook, it’s a form likely to be searchable, sortable, utile and complete.  More, it’s a form that anyone can assimilate into whatever review platform they wish at lowest cost.

The criterion, “Will the form produced function in an e-mail client?” enables parties to explore a broad range of functional native and near-native forms, not just PSTs.  It’s an objective “acid test” to determine if e-mail will be produced in a reasonably usable form; that is, a form not too far degraded from the way the data is used by the parties and witnesses in the ordinary course.

Forms that Function retain essential features like Fielded Data, allowing users to reliably sort messages by date, sender, recipients and subject, as well as Message IDs, supporting the threading of messages into coherent conversations.  Forms that Function supply the UTC Offset Data within e-mails that allows messages originating from different time zones and using different Daylight Saving Time settings to be normalized across an accurate timeline.  Forms that Function don’t disrupt the Family Relationships between messages and attachments.  Forms that Function are inherently electronically searchable.
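For the technically curious, here is a minimal sketch of why those fields matter, using only Python’s standard library.  The two messages, their addresses and the helper name load_fields are invented for illustration; nothing here is drawn from a real production.  The point is simply that when the Date header keeps its UTC offset and the Message-ID and In-Reply-To headers survive, a few lines of code can put messages from different time zones on one timeline and tie a reply to its parent.

```python
from datetime import timezone
from email import message_from_string
from email.utils import parsedate_to_datetime

# Two hypothetical messages: an original sent from Tokyo and a reply sent from Austin.
RAW_MESSAGES = [
    "Message-ID: <b2@example.com>\n"
    "In-Reply-To: <a1@example.com>\n"
    "Date: Mon, 02 Dec 2013 18:00:00 -0600\n"
    "From: austin@example.com\n"
    "To: tokyo@example.com\n"
    "Subject: RE: Budget\n"
    "\n"
    "Reply sent from Austin.\n",
    "Message-ID: <a1@example.com>\n"
    "Date: Tue, 03 Dec 2013 08:00:00 +0900\n"
    "From: tokyo@example.com\n"
    "To: austin@example.com\n"
    "Subject: Budget\n"
    "\n"
    "Original sent from Tokyo.\n",
]


def load_fields(raw):
    """Pull the fielded data out of one message and normalize its date to UTC."""
    msg = message_from_string(raw)
    sent = parsedate_to_datetime(msg["Date"])  # offset-aware, thanks to the UTC offset
    return {
        "message_id": msg["Message-ID"],
        "in_reply_to": msg["In-Reply-To"],  # None when the header is absent
        "sender": msg["From"],
        "subject": msg["Subject"],
        "sent_utc": sent.astimezone(timezone.utc),
    }


# Sort on the normalized time, not on the local clock time shown in each header.
timeline = sorted((load_fields(raw) for raw in RAW_MESSAGES), key=lambda m: m["sent_utc"])

for m in timeline:
    print(m["sent_utc"].isoformat(), m["sender"], m["subject"])
```

Judged by the local clock times alone, the Austin reply (Monday evening) appears to precede the Tokyo original (Tuesday morning).  Normalized to UTC, the original comes first, as it must.  Strip the offset or the Message-ID when converting to a degraded form and that ordering and threading can no longer be rebuilt reliably.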

Best of all, producing Forms that Function means that all parties receive data in a form that anyone can use in any way they choose, visiting the costs of converting to alternate forms on the parties who want those alternate forms and not saddling parties with forms so degraded that they are functionally fractured and broken.

If you are a requesting party, don’t be bamboozled by an alphabet soup of file extensions when it comes to e-mail production (PST, OST, MSG, EML, DBX, NSF, MHTML, TIFF, PDF, RTF, TXT, DAT, XML).  Instead, tell the other side, “I want Forms that Function.  If it can be imported into Microsoft Outlook and work, that form will be fine by me.”

If the other side says, “We will pull all that information out of the messages and give it to you in a load file,” say, “No thanks, leave it where it lays, and give it to me in a Form that Functions!”

Revisiting ‘How Many Documents in a Gigabyte?’

I once wrote a column titled “Page Equivalency and Other Fables.”  It lambasted lawyers who larded their burden arguments with bogus page equivalencies like, “everyone knows a gigabyte of data equates to a pile of printed pages that would reach from Uranus to Earth.”  We still see wacky page equivalencies, and “from Uranus” still aptly describes their provenance.

Back in 2007, I wrote, “It’s comforting to quantify electronically stored information as some number of pieces of paper or bankers’ boxes.  Paper and lawyers are old friends.  But you can’t reliably equate a volume of data with a number of pages unless you know the composition of the data.  Even then, it’s a leap of faith.”

So, I’m happy to point you to some notable work by my friend, John Tredennick.  I’ve known John since the emerging technology was fire and watched with awe and admiration as John transitioned from old-school trial lawyer to visionary forensic technology entrepreneur running e-discovery service provider, Catalyst.  John is as close to a Renaissance man as anyone I know in e-discovery, and when John speaks, I listen.

Lately, John Tredennick shared some revealing metrics on the Catalyst blog looking at the relationship between data and document volumes, an update to his 2011 article called, How Many Documents in a Gigabyte?  John again examines document volumes seen in the data that Catalyst receives and processes for its customers and, crucially, parses the data by file type.  As the results bear out, the forms of the data still make an enormous difference in terms of data volume.  Even as between documents we think of as being “the same” (like Word .doc and .docx formats), the differences are striking.

For example, John’s data suggests that there are almost 60% more documents in a gigabyte of Word files in the .docx format (7,085) than in a gigabyte of files stored in the predecessor .doc format (4,472).  This makes sense because the newer .docx format incorporates zip compression, and text is highly compressible data.

[One exercise I require of the law students in my E-discovery class is to look at the file header of a Word .docx file to note its binary signature, PK, characteristic of a zip-compressed file and short for Phil Katz, author of the zip compression algorithm.  For grins, you can change the file extension of a .docx file to .zip and open it to see what a Word document really looks like under the hood.  Hint: it’s in XML].
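If you’d rather script that exercise than do it by hand, the following is a rough sketch in Python.  Because I can’t assume you have a particular Word file at hand, it builds a tiny stand-in archive in memory; the helper names looks_like_zip and build_stand_in_docx are mine, and the stand-in holds only placeholder XML, not a valid Word document.  The two member names it uses, [Content_Types].xml and word/document.xml, are ones you will find inside a real .docx.

```python
import io
import zipfile


def looks_like_zip(data: bytes) -> bool:
    """True when the data opens with the 'PK' signature of a zip archive."""
    return data[:2] == b"PK"


def build_stand_in_docx() -> bytes:
    """Build a small in-memory zip laid out like a .docx (placeholder content only)."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        archive.writestr("[Content_Types].xml", "<Types/>")
        archive.writestr("word/document.xml", "<document>Hello, discovery.</document>")
    return buffer.getvalue()


data = build_stand_in_docx()
print("First two bytes:", data[:2])
print("Zip signature present:", looks_like_zip(data))

# Opening the same bytes as a zip reveals the XML parts inside.
with zipfile.ZipFile(io.BytesIO(data)) as archive:
    members = archive.namelist()

for name in members:
    print(name)
```

To try it on a real document, read the file’s bytes in place of the stand-in and pass them to the same two steps.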

John reports a similar discrepancy between new and old Excel spreadsheet formats (1,883 .xlsx files per gigabyte versus 1,307 for .xls).  Here again, the .xlsx format builds in zip compression.

But, the results are reversed when it comes to PowerPoint presentations, with John finding that there are marginally fewer of the newer .pptx files in a gigabyte (505) than the older .ppt format files (580).  This makes sense to me because Microsoft phased out the .ppt format ten years ago.  Since then, presenters have gotten better about adding visual enhancements to deadly-dull PowerPoints, and they tend to add ‘fatter’ components like video clips.  The biggest factor is that pictures are highly incompressible, and common image formats (e.g., .jpg images) have always been compressed.  Compressing data that’s already compressed tends to increase, not decrease, its size.
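You can see both effects for yourself with a few lines of Python.  This is only a sketch under stated assumptions: the repeated sentence stands in for the text of a document, and a block of pseudo-random bytes stands in for already-compressed picture data, since well-compressed data looks much like random noise to a compressor.  It uses zlib, which applies the same Deflate method commonly used in zip files; the variable names are mine.

```python
import random
import zlib

# Stand-in for document text: ordinary, repetitive prose.
prose = b"Paper and lawyers are old friends, but data is not paper. " * 500

# Stand-in for already-compressed image data: pseudo-random bytes (an assumption,
# chosen because compressed data has little redundancy left to squeeze out).
rng = random.Random(2013)
noise = bytes(rng.getrandbits(8) for _ in range(len(prose)))

packed_prose = zlib.compress(prose)
packed_noise = zlib.compress(noise)

print("prose:", len(prose), "bytes before,", len(packed_prose), "bytes after")
print("noise:", len(noise), "bytes before,", len(packed_noise), "bytes after")
```

The prose shrinks dramatically, while the noise comes out slightly larger than it went in because the compressor adds its own overhead without finding anything to remove.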

Wisely, John speaks only of document volumes and makes no effort to project page equivalencies, not even by extrapolating some postulated ‘average-pages-per-file type.’  Anything like that would be as insupportable today as it was when I wrote about it in 2007.  Also, when you look at John’s post, note that there is no data supplied concerning TIFF images.  I’m not sure why, but I can promise you this: TIFF images are MUCH fatter files, costing far more in terms of storage space and ingestion costs than their native counterparts.  Had John added TIFF to the mix, I’m confident his weighted averages would have been much different…and far less useful–much like TIFF images as a form of production. 😉

Warm Holiday Greetings from Austin, Texas

Johnson City Christmas

’Twas the night before Christmas
at Ball in Your Court.
Not a syllable’s stirring.
We’re sipping mulled port!

The chestnuts are roasting, the wassailing’s started;
Don’t look for a posting ‘til Santa’s departed.
Au revoir data hash, and adieu data mapping.
I really must dash– I’ve got to get wrapping!

Thank you, dear reader, for all the perusing.
I hope it’s been helpful (and sometimes amusing).
And thank you, dear reader, for sharing your comments.
I cherish them deeply, those kudos and laments.

So, chide me and check me,
be quick to correct me;
I rarely get everything right.
Till next time we dish here,
I send you this wish, dear:
Merry Christmas, y’all, and to y’all a good night.

“Derogation of the Search for Truth”

In my last post, I addressed why search terms used to cull data sets in discovery should not be protected as attorney work product.  Today, I want to distinguish an attorney’s “investigative queries” (for case assessment, to hone searches or to identify privileged content) from “culling queries” (to generate data sets meeting a legal obligation, whether conceived by an attorney, client, vendor or expert).  I contend culling queries warrant no work product protection from disclosure.

Let’s assume a producing party has a sizable collection of potentially responsive electronic information.  Producing party concludes that it would be too costly, slow or unreliable to segregate the ESI by reading everything and, instead, decides to examine just those items that contain particular words or phrases.  Keyword queries thus serve to divide the ESI into two piles: one that will be reviewed by counsel and another that no one and nothing will qualitatively review.  The latter is the “discard pile.”  Culling queries may be applied iteratively, first to collect data from the enterprise and later to cull the collection for review.  The reductive process may entail the successive use of a client’s local and enterprise search capabilities and/or a law firm’s or vendor’s search tools.

The common thread is that each lexical search mechanism serves to exclude ESI lacking certain terms from substantive review.  No one ever assesses the discards for relevance or responsiveness.

Now, if we could be confident that keyword culling worked reasonably well and that the persons who came up with keywords were lexical magicians, there’d be no need to worry over the discard pile.  We could trust that what we don’t know doesn’t hurt us.

But we do know that a hefty slug of responsive items ends up in that discard pile.  We know this because studies and experience have established that keyword search is a crude, mechanical filter.  It leaves most of what we seek behind.
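To make the two piles concrete, here is a minimal sketch in Python of a keyword culling query.  Everything in it is hypothetical: the three documents, the two keywords and the function name cull are invented for illustration, and real search tools add stemming, proximity operators and much else.

```python
import re


def cull(documents, keywords):
    """Split documents into a review pile (a keyword hit) and a discard pile (no hit)."""
    pattern = re.compile(
        "|".join(r"\b" + re.escape(word) + r"\b" for word in keywords),
        re.IGNORECASE,
    )
    review, discard = {}, {}
    for doc_id, text in documents.items():
        pile = review if pattern.search(text) else discard
        pile[doc_id] = text
    return review, discard


# Hypothetical collection: D1 and D3 both concern the same brake problem.
documents = {
    "D1": "The brake defect was reported in March.",
    "D2": "Lunch on Friday?",
    "D3": "The brakes keep failing; don't put that in writing.",
}

review, discard = cull(documents, ["defect", "recall"])

print("Review pile: ", sorted(review))
print("Discard pile:", sorted(discard))
```

D3 is as responsive as D1, but because its author never used the chosen words, it lands in the discard pile where no one will ever read it.  The only way an opponent can spot that kind of gap is to know which terms were run.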

Whether we are leaving behind an endurable or unendurable volume of responsive items depends on just how poorly those keywords performed.  To gauge that, we’ve got to know what queries were run.

Transparency of Process No Peril to Work Product

I’m rarely moved to criticize the work of other commentators because, even when I don’t share their views, I applaud the airing of the issues their efforts bring.  But sometimes a proposition is just so blatantly ill-advised, so prone to unfairly tilt the litigation playing field, that any reader and every writer should stop and say, “Wait a second….”  One such article, currently running in the New York Law Journal and called No Disclosure: Why Search Terms Are Worthy of Court’s Protection, charges that judges who require disclosure of search terms “discount or misunderstand” what the authors term the “protected nature of key aspects of the e-discovery process,” namely filtering of data by use of search terms.  The authors think that disclosure of search terms used to exclude data from disclosure compromises the work product privilege and argue that judges should “recognize that a search term is more than a collection of words, rather, the culmination of an attorney’s interaction with the facts of the case.”

Espousing the sanctity of work product privilege to an audience of litigators is like saying, “I support our troops.”  It’s mom, baseball and apple pie.  It’s also popular to paint judges as addled abusers of discretion.  But let’s not let jingoism displace judgment.  Search terms are precisely what the authors claim they are not: search terms are a collection of words.  They are lexical filters.  Nothing more.

Search terms deserve no more protection from disclosure than date ranges, file types and other mechanical means employed to exclude data from scrutiny.  Search terms strip out information that will never see the light of day nor benefit from the application of lawyer judgment as to its relevance.  In that sense, search terms are anathema to the core principles of work product and warrant more, not less, scrutiny.

Good Questions!

Digging into the digital mail bag (there’s a skeuomorph for you), I received a series of thoughtful questions from a reader ready to dip his toes into the bracing waters of native production but harboring some of the same nagging concerns others raise when they ponder how to integrate native production into workflows born of legacy paper processes.

Here were the questions:

What is the exact difference between native and near-native?  My big concern with producing natively is from what I’ve read about the risks related to metadata/privilege and how to go about using native data at depositions and in court.  Are those concerns that can be dealt with simply?  Finally, how do you go about producing natively?  I presume by uploading data to an ftp site for the parties to access?  Are there other ways?

I replied: Continue reading

It’s the Parties’ Data, Stupid!

As the curtain comes down on 2013, I’m reflecting on where the weeks went.  This was the year of fights about forms; months spent endeavoring to persuade courts, opponents (and even my clients) that lawyers and judges have been peering into the wrong end of the telescope when it comes to forms of production.  We must stop focusing on the feeble forms lawyers use for review, and concentrate on the robust forms that parties use for everything else.

In discovery and disclosure we seek information from parties and third-parties.  We want the data used and created by, for and about parties and third-parties relating to the actions they took or didn’t take.  We don’t pursue discovery/disclosure against the lawyers in the case.  If we tried, our efforts would be confounded by claims of attorney-client privilege and attorney work product.  Apart from pro se lawyers with fools for clients, attorneys aren’t parties, and attorneys aren’t witnesses.  The forms your opposing counsel uses for review shouldn’t matter.  Discovery and disclosure is party-centric, not attorney-centric.

Ask parties about the forms of ESI they use daily and it’s doubtful you’ll hear a peep about TIFF images or load files.  Parties don’t use that junk; only Luddite lawyers do.  Clients use spreadsheet programs, word processors, mail and messaging applications and databases, to name a few.  When they create, communicate and collaborate, they do it using forms geared to native applications with file extensions like .XLSX, .DOCX, .PPTX, .MSG, etc.  They choose and use functional and complete native and near-native forms.  Those are the forms witnesses consult to reconstruct events and refresh their memories.  Those are the forms witnesses recognize at deposition and in trial.

Vote for Ride the Lightning and The Legal Geeks!

Along the right gutter of this page is a blogroll with links to the contributions of other e-discovery bloggers.  Two of the best of these are written by friends; so, I’m happy to note that Sharon Nelson’s Ride the Lightning and Bow Tie Law blogger Josh Gilliland’s other law blog, The Legal Geeks, have both been named to the ABA Journal’s Top 100 Blawg list.  Congratulations, Sharon and Josh!

But now they need our help squashing the competition like pesky bugs.

If you’re like me, you’ve spent the last few days immersed in family and feasting, and you’re looking for one more reason to delay the work that’s our last hurrah of 2013.  Here it is:

Please go to http://www.abajournal.com/blawg100 and vote for Ride the Lightning in the Legal Tech category and The Legal Geeks in the For Fun category.  It won’t take a minute or cost a penny, and you will be doing a solid to two good folks who give back so much.  Hope your Thanksgiving was delicious.

Cooperation in Practice: Georgetown Institute 2013

I’ve been buried in big data for the last week, so I look forward to my annual pilgrimage to our nation’s capital for the Georgetown University Law Center’s Advanced E-Discovery Institute, starting Thursday.  Some years the GULC AEDI is good, and some years it’s great; but, every year it’s a boisterous class reunion for the Sedona Bubble Boys and Girls for whom the Institute has become an unmissable event.

Though I mourn the Institute’s waning efforts to teach the “e” in e-discovery, the 2013 Institute nonetheless retains its two finest features: the opening case update and the closing judicial round table.  No other conference brings together a bigger, brighter constellation of e-discovery “rock star” judges than Georgetown.

This year, I’m gratified to be a part of a new track dedicated to teaching cooperation in practice, offered by the GULC in conjunction with The Sedona Conference®.  It’s a taste of the two-day Sedona Conference Cooperation Training program in which I served on the faculty in Phoenix last February.

Collecting Gmail for Preservation

I’m surprised how frequently I’m engaged to collect the contents of Gmail accounts in e-discovery, especially when the account is being collected solely for preservation, and there’s no compelling reason to entrust the task to a neutral.  I appreciate that hiring an expert offers greater assurance that the task will be approached with skill and experience, as well as that integrity of process can be supported by the testimony of someone unconnected with the client or law firm.  But, though collecting and validating the complete contents of a Gmail account can be tricky and tedious, it’s not all that difficult to do.  Happily, unless you do something really dumb, it’s unlikely that even a botched Gmail collection effort will harm the contents of the account.

For those seeking a low-cost, defensible mechanism to preserve Gmail content, this (long, dry) post lays out a detailed methodology for collection and preservation of the contents of a Gmail webmail account in the static form of a standard Outlook PST container file.  I will address various technical considerations, but few legal ones.  Whether or not the methods described in this post are legally sufficient in your case or compliant with Gmail’s terms of service is not my call, and I offer no opinions about same.

[NOTE TO READERS 10/14/14: When I wrote this post, there was not yet a backup capability built into Gmail.  Google now makes data tools available that support the creation of a rich archive of a user’s Google content, including Gmail, Contacts, Calendar and Google Drive.  You can find it in the Archive section of https://www.google.com/settings/datatools when logged into Google and can read more about it here.]