Degradation: How TIFF+ Disrupts Search

Recently, I wrote on the monstrous cost of TIFF+ productions compared to the same data produced as native files.  I’ve wasted years trying to expose the loss of utility and completeness caused by converting evidence to static formats.  I should have recognized that no one cares about quality in e-discovery; they only care about cost.  But I cannot let go of quality because one thing the Federal Rules make clear is that producing parties are not permitted to employ forms of production that significantly impair the searchability of electronically stored information (ESI).

In the “ordinary course of business,” none but litigators “ordinarily maintain” TIFF images as substitutes for native evidence.  When requesting parties seek production in native forms, responding parties counter with costly static image formats by claiming they are “reasonably usable” alternatives.  However, the drafters of the 2006 Rules amendments were explicit in their prohibition:

[T]he option to produce in a reasonably usable form does not mean that a responding party is free to convert electronically stored information from the form in which it is ordinarily maintained to a different form that makes it more difficult or burdensome for the requesting party to use the information efficiently in the litigation. If the responding party ordinarily maintains the information it is producing in a way that makes it searchable by electronic means, the information should not be produced in a form that removes or significantly degrades this feature.

 FRCP Rule 34, Committee Notes on Rules – 2006 Amendment.

I contend that substituting a form that costs many times more to load and host counts as making the production more difficult and burdensome to use.  But what is little realized or acknowledged is the havoc that so-called TIFF+ productions wreak on searchability, too.  It boggles the mind, but when I share what I’m about to relate below with opposing counsel, they immediately retort, “that’s not true.”  They deny the reality without checking its truth, without caring whether what they assert has a basis in fact.  And I’m talking about lawyers claiming deep expertise in e-discovery.  It’s disheartening, to say the least.

A little background: We all know that ESI is inherently electronically searchable.  There are quibbles to that statement but please take it at face value for now.  When parties convert evidence in native forms to static image forms like TIFF, the process strips away all electronic searchability.  A monochrome screenshot replaces the source evidence.  Since the Rules say you can’t remove or significantly degrade searchability, the responding party must act to restore a measure of searchability.  They do this by extracting text from the native ESI and delivering it in a “load file” accompanying the page images.  This is part of the “plus” when people speak of TIFF+ productions.

E-discovery vendors then seek to pair the page images with the extracted text in a manner that allows some text searchability.  Vendors index the extracted text to speed search, a mapping process intended to display the page where the text was located when mapped.  This is important because where the text appears in the load file dictates what page will be displayed when the text is searched and determines whether features like proximity search and even predictive coding work as well as we have a right to expect.  Upshot: The location and juxtaposition of extracted text in the load file matters significantly in terms of accurate searchability.  If you don’t accept that, you can stop reading.

Now, let’s consider the structure of modern electronic evidence.  We could talk about formulae in spreadsheets or speaker notes in presentations, but those are not what we fight over when it comes to forms of production. Instead, I want to focus on Microsoft Word documents and those components of Word documents called Comments and Tracked Changes; particularly Comments because these aren’t “metadata” by any stretch.  Comments are user-contributed content, typically communications between collaborators.  Users see this content on demand and it’s highly contextual and positional because it is nearly always a comment on adjacent body text.  It’s NOT the body text, and it’s not much use when it’s separated from the body text.  Accordingly, Word displays comments as marginalia, giving them the power of place but not enmeshing them with the body text.

But what happens to these contextual comments when you extract the text of a Word document to a load file and then index the load files?

There are three ways I’ve seen vendors handle comments and all three significantly degrade searchability:

First, they suppress comments altogether and do not capture the text in the load files.  This is content deletion.  It’s like the content was never there and you can’t find the text using any method of electronic search.  Responding parties don’t disclose this deletion nor is it grounded on any claim of privilege or right.  Spoliation is just S.O.P.

Second, they merge the comments into the adjacent body text. This has the advantage of putting the text more-or-less on the same page where it appears in the source, but it also serves to frustrate proximity search and analytics.  The injection of the comment text between a word combination or phrase causes searches for that word combo or phrase to fail.  For example, if your search was for ignition w/3 switch and a four-word comment comes between “ignition” and “switch,” the search fails.

Third, and frequently, vendors aggregate comments and dump them at the end of the load file with no clue as to the page or text they reference.  No links.  No pointers.  Every search hitting on comment text takes you to the wrong page, devoid of context.
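The three handling methods above can be sketched in a few lines of Python. This is a minimal illustration, not any vendor's actual extraction or indexing code: the sample sentence, the four-word comment, and the helper names tokenize and within are all hypothetical, and w/3 is assumed here to mean "within three words, in either order," a definition that varies by review platform.

```python
import re

def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())

def within(tokens, first, second, distance):
    """True if `second` occurs within `distance` words of `first`, in either order."""
    first_at = [i for i, t in enumerate(tokens) if t == first]
    second_at = [i for i, t in enumerate(tokens) if t == second]
    return any(abs(i - j) <= distance for i in first_at for j in second_at)

# Hypothetical source document: body text plus a comment anchored on "ignition".
body = "The ignition switch may rotate out of the run position."
comment = "did legal approve this"

# 1. Comment suppressed: the extracted text is the body alone.
suppressed = body
# 2. Comment merged into the body at its anchor point.
merged = body.replace("ignition", "ignition " + comment, 1)
# 3. Comment dumped at the end of the extracted text.
appended = body + " " + comment

native_hit = within(tokenize(body), "ignition", "switch", 3)
merged_hit = within(tokenize(merged), "ignition", "switch", 3)
comment_found_when_suppressed = "legal" in tokenize(suppressed)
comment_near_anchor_when_appended = within(tokenize(appended), "legal", "ignition", 3)

print("body only, ignition w/3 switch:", native_hit)
print("merged, ignition w/3 switch:", merged_hit)
print("suppressed, search for 'legal':", comment_found_when_suppressed)
print("appended, legal w/3 ignition:", comment_near_anchor_when_appended)
```

In this toy case only the body-only text satisfies ignition w/3 switch. The merged text pushes the two words five positions apart, the suppressed text cannot return the comment at all, and the appended text severs the comment from the word it annotates.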

Some of what I describe are challenges inherent to dealing with three-dimensional data using two-dimensional tools.  Native applications deal with Comments, speaker notes and formulae three-dimensionally.  We can reveal that data as needed, and it appears in exactly the way witnesses use it outside of litigation.  But flattening native forms to static images and load files destroys that multidimensional capability.   Vendors do what they can to add back functionality; but we should not pretend the results are anything more than a pale shadow of what’s possible when native forms are produced.  I’d call it a tradeoff, but that implies requesting parties know what’s being denied them.  How can requesting party’s counsel know what’s happening when responding parties’ counsel haven’t a clue what their tools do, yet misrepresent the result?

But now you know.  Check it out.  Look at the extracted text files produced to accompany documents with comments and tracked changes.  Ask questions.  Push back.  And if you’re producing party’s counsel, fess up to the evidence vandalism you do.  Defend it if you must but stop denying it.  You’re better than that.

Don’t Let Plaintiffs’ Lawyers Read This!!

Be honest.  Wouldn’t you love to stick it to the plaintiffs?  Wouldn’t your corporate client or carrier be ecstatic if you could make litigation much more expensive for those greedy opportunists bringing frivolous suits and demanding discovery?  What if you could make discovery not just more costly, but make it, say, five times more costly, ten times more costly, than it is for you?  Really bring the pain.  Would you do it?

Now that I have your attention–and the attention of plaintiffs’ counsel wondering if they’ve stumbled into a closed meeting at a corporate counsel retreat—I want to show you this is real.  Not just because I say so, but because you prove it to yourself.  You do the math.

Math!  You didn’t say there would be math!

Stop.  You know you’re good at math when the numbers come with dollar signs.  Legendary Texas trial lawyer W. James Kronzer used to say to me, “I’m no good at math, Herman; but I can divide any number by three.”  That was back when a third was the customary contingent fee.

Even after you do the math, you’re not going to believe it; instead, you’ll conclude it can’t be true.  Surely nothing so unjust could have escaped my notice.  Why would Courts allow this?  How can I be such a sap?

The real question is this: What am I going to do about it?

Preserving Social Media Content: DIY

Social Media Content (SMC) is a rich source of evidence.  Photos and posts shed light on claims of disability and damages, establish malicious intent and support challenges to parental fitness–to say nothing of criminals who post selfies at crime scenes or holding stolen goods, drugs and weapons.  SMC may expose propensity to violence, hate speech, racial animus, misogyny or mental instability (even at the highest levels of government).  SMC is increasingly a medium for business messaging and the primary channel for cross-border communications.  In short, SMC and messaging are heirs-apparent to e-mail in their importance to e-discovery.

Competence demands swift identification and preservation of SMC.

Screen shots of SMC are notoriously unreliable, tedious to collect and inherently unsearchable.  Applications like X1 Social Discovery and service providers like Hanzo can help with SMC preservation; but frequently the task demands little technical savvy and no specialized tools.  Major SMC sites offer straightforward ways users can access and download their content.  Armed with a client’s login credentials, lawyers, too, can undertake the ministerial task of preserving SMC without greater risk of becoming a witness than if they’d photocopied paper records.

Collecting your Client’s SMC
Collecting SMC is a two-step process of requesting the data followed by downloading.  Minutes to hours or longer may elapse between a request and download availability. Having your client handle collection weakens the chain of custody; so, instruct the client to forward download links to you or your designee for collection.  Better yet, do it all yourself.

Obtain your client’s user ID and password for each account and written consent to collect. Instruct your client to change account passwords for your use, re-enabling customary passwords following collection.  Clients may need to temporarily disable two-factor account security.  Download data promptly as downloads are available briefly.

Collection Steps for Seven Social Media Sites
Facebook: After login, go to Settings>Your Facebook Information>Download Your Information.  Select the data and date ranges to collect (e.g., Posts, Messages, Photos, Comments, Friends, etc.).  Facebook will e-mail the account holder when the data is ready for download (from the Available Copies tab on the user’s Download Your Information page). Facebook also offers an Access Your Information link for review before download.

Privacy: A Wolf in Sheep’s Clothing?

Next week is Georgetown Law Center’s sixteenth annual Advanced E-Discovery Institute.  Sixteen years of a keen focus on e-discovery; what an impressive, improbable achievement!  Admittedly, I’m biased by longtime membership on its advisory board and my sometime membership on its planning committees, but I regard the GTAEDI confab of practitioners and judges as the best e-discovery conference still standing.  So, it troubles me how much of the e-discovery content of the Institute and other conferences is ceded to other topics, and one topic in particular, privacy, is being pushed to be the focus of the Institute in future.

This is not a post about the Georgetown Institute, but about privacy, particularly whether our privacy fears are stoked and manipulated by companies and counsel as an opportunistic means to beat back discovery.  I ask you: Is privacy a stalking horse for a corporate anti-discovery agenda?

A Primer on Processing and a Milestone

Today, I published my primer on processing.  It’s fifty-odd pages on a topic that’s warranted barely a handful of paragraphs anywhere else.  I wrote it for the upcoming Georgetown Law Center Advanced E-Discovery Institute and most of the material is brand new, covering a stage of e-discovery–a “black box” stage–where a lot can go quietly wrong.  Processing is something hardly anyone thinks about until it blows up.

Laying the foundation for a deep dive on processing required I include a crash course on the fundamentals of digitization and encoding.  My students at the University of Texas and at the Georgetown Academy have had to study encoding for years because I see it as the best base on which to build competency on the technical side of e-discovery.

The research for the paper confirmed what I’d long suspected about our industry.  Despite winsome wrappers, all the leading e-discovery tools are built on a handful of open source and commercial codebases, particularly for the crucial tasks of file identification and text extraction.  Nothing evil in that, but it does make you think about cybersecurity and pricing.  In the process of delving deeply into processing, I gained greater respect for the software architects, developers and coders who make it all work.  It’s complicated, and there are countless ways to run off the rails.  That the tools work as well as they do is an improbable achievement.  Still, there are ingrained perils you need to know, and tradeoffs to be weighed.

Working from so little prior source material, I had to figure a lot out by guess and by gosh.  I have no doubt I’ve misunderstood points and could have explained topics more clearly.  Please don’t hesitate to weigh in to challenge or correct.  Regular readers know I love to hear your thoughts and critiques.

I’ll be talking about processing in an ACEDS/Logikcull webcast tomorrow (Tuesday, November 5, 2019) at 1:00pm EST/10:00am PST.  I expect it’s not too late to register.

The milestone of the title is that this is my 200th blog post and it neatly coincides with my 200,000th unique visitor to the blog (actually 200,258, but who’s counting?).  When I started blogging here on August 20, 2011, I honestly didn’t know if anyone would stop by.  Two hundred thousand kind readers have rung the bell (and that’s excluding the many more spammers turned away).  I hope something I wrote along the way gave you some insight or a chuckle.  I’m intensely grateful for your attention.

By the way, if you’d like to come to the Georgetown Advanced E-Discovery Institute in Washington, D.C. on November 21-22, 2019, please use my speaker’s discount code to save $100.00.  The discount code is BALL (all caps).  Hope to see you!

Dig We Must: Get It in Writing

This isn’t a post about e-discovery per se, but it bears on process and integrity issues we face in cooperating to craft e-discovery expectations.  Still, it’s more parable than parallel.

My home in New Orleans sits at the intersection of two narrow streets built for horse and mule traffic.  It’s held its corner ground since 1881, serving as abattoir, ancestral home of a friend and now, my foot on the ground in the Big Easy.  New Orleanians are the friendliest folks.  You can strike up a spirited tête-à-tête with anyone since everyone has something to say about food, festivals, Saints football, Mardi Gras, the Sewerage and Water Board and the gross ineptitude of local government in its abject failure to deliver streets and sidewalks that don’t swallow you whole or otherwise conspire to kill or maim the populace.

That’s not to say the City does nothing in the way of maintaining infrastructure.  Right now, New Orleans is replacing its low-pressure gas lines with high-pressure lines.  Gas is a big deal where everyone eats red beans on Mondays, but it’s also useful for heating and, even now, still, for lighting.  So, every street must have new subterranean lines installed and new risers brought to gas meters.  I knew nothing of this until I awoke to find a crew with an excavator on my property destroying the curbs and antique brick sidewalks I’d lately installed at considerable expense.

Apple Card: Heavy Metal

I just got my Apple Card and, while I hardly need another credit card, I thought readers might be curious what the fuss is about. After all, it’s just a credit card, right?

Right, but it has some fancy features that set it apart from the other plastic in your wallet or purse.  First, it’s scarily easy to obtain.  On my iPhone, it took under a minute to be issued the electronic card with a $9,000 spending limit available in Wallet.  That was Tuesday.  Thursday morning, a courier dropped off the physical card packaged in the sleek style of all Apple’s premium products.  The fun began even before it was out of the box!

Although my Apple Pay credit account went live in a minute, as with all physical credit cards, the Apple Card must be activated before use.  For most cards, this requires time online or a phone call where you dial or speak a lot of digits.  With the Apple Card, you just hold the colorful sleeve it comes in against your iPhone and the NFC contactless communication capability embedded in the card does the rest.

The next surprise is that the card is crafted from laser-etched titanium, giving it a striking heft and rigidity.  Hone the edge of this baby and you’re MacGyver (or Oddjob, hat in hand).  Investing so much in the aesthetics of a credit card may seem silly; but, I confess that the, well, the beauty of the card impressed me.  Is it so wrong that something we touch several times daily be pleasing?

The next surprise is what’s not on the Apple Card versus every other card: There are no numbers.  No card number.  No CID security identifier.  No expiration date.  No signature block.  Just your name, three corporate logos, a chip and a swipe strip.  Here are photos of both sides of my Apple Card, something I’d never post for a conventional card:

If you want to know the card number and CID for the Apple Card, you must retrieve them in Wallet.  That’s a genuine layer of security.  By the same token, heaven help anyone who comes across a neanderthal with a carbon charge slip (anyone remember those?) who tries to rub transfer the card number.

There are some nifty usage management features, but the major marketing hook for the Apple Card is daily cash back on purchases.  How much cash back?  I’m not entirely sure because it varies.  It seems you get three percent back for purchases made from Apple and a handful of other merchants like Walgreens and Uber.  But for the most part, the cash back percentage looks to be two percent if you pay with Apple Pay.  If a merchant isn’t set up for Apple Pay, then it appears you must use the Apple Card as a conventional MasterCard, and get just one percent cash back.  That’s about the same benefit I now get with my AmEx Membership Rewards program with (in my mind) less exposure to a whopping interest charge if I’m ever late with a payment.  Too, the AmEx offers many perks to protect my purchases and travel.  Now and then, those behind-the-scenes benefits have proven really worthwhile.  I wonder whether Apple will stand behind its card users as reliably as AmEx?

Cash back is a splendid benefit, and beats the pants off cards that don’t offer rewards and perks.  So many cards do offer mileage benefits, club access and other rewards that it’s not easy to know which one is best.  The Apple Card carries no annual fee, making it worth a try, and if you buy a lot of Apple merchandise, that instant three percent back is a no-brainer.  Maybe the Apple Card will become my principal card; maybe not.  But, I’ll tell you one thing:  that titanium card is going to be hell to cut in half should I decide to close the account.

One last thing if it’s not already clear: Only iPhone users need apply.  An Android user might be able to finagle getting the Apple Card, but the real benefits only flow from using Apple Pay.

Cryptographic Hashing: “Exceptionally” Deep in the Weeds

We all need certainty in our lives; we need to trust that two and two is four today and will be tomorrow.  But the more we learn about any subject, the more we’re exposed to the qualifiers and exceptions that belie perfect certainty.  It’s a conundrum for me when someone writes about cryptographic hashing, the magical math that allows an infinite range of numbers to match to a finite complement of digital fingerprints. Trying to simplify matters, well-meaning authors say things about hashing that just aren’t so.  Their mistakes are inconsequential for the most part—what they say is true enough–but it’s also misleading enough to warrant caveats useful in cross-examination.

I’m speaking of the following two assertions:

  1. Hash values are unique; i.e., two different files never share a hash value.
  2. Hash values are irreversible, i.e., you can’t deduce the original message using its hash value.

Both statements are wrong.
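The excerpt stops before giving its reasons, so what follows is only one plausible illustration of why neither assertion holds as an absolute, not necessarily the argument the full post makes. It is a minimal Python sketch under two loudly stated assumptions: tiny_hash is a hypothetical, deliberately shrunken 8-bit hash (the first two hex characters of MD5) so that a collision turns up instantly, and the "message" to be recovered is a hypothetical four-digit PIN, a message space small enough to search exhaustively.

```python
import hashlib
from itertools import count

def tiny_hash(data: bytes) -> str:
    """Hypothetical 8-bit hash: the first 2 hex characters of the MD5 digest."""
    return hashlib.md5(data).hexdigest()[:2]

# 1. Uniqueness: only 256 tiny_hash values exist, so among any 257 different
#    messages at least two must share one (the pigeonhole principle).
seen = {}
for n in count():
    message = str(n).encode()
    digest = tiny_hash(message)
    if digest in seen:
        collision = (seen[digest], message, digest)
        break
    seen[digest] = message

print("different messages, same tiny hash:", collision)

# 2. Irreversibility: when the set of possible messages is small, hashing every
#    candidate and comparing digests recovers the original message.
secret_digest = hashlib.md5(b"4821").hexdigest()  # hypothetical PIN
recovered = next(
    f"{pin:04d}"
    for pin in range(10_000)
    if hashlib.md5(f"{pin:04d}".encode()).hexdigest() == secret_digest
)
print("recovered message:", recovered)
```

A full 128-bit MD5 value has vastly more room than the 8-bit toy, which is why accidental collisions are remote rather than impossible, and the second half works only because the candidate messages can be enumerated.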

Cryptographic Hashing: A Deeper Dive

It’s October (already?!?!) and–YIKES–I haven’t posted for two weeks.  I’m tapping away on a primer about e-discovery processing, a topic that’s received scant attention…ever.  One could be forgiven for thinking the legal profession doesn’t care what happens to all that lovely data when it goes off to be processed!  Yet, I know some readers share my passion for ESI and adore delving deeply into the depths of data processing.  So, here are a few paragraphs pulled from my draft addressing the well-worn topic of hashing in e-discovery where I attempt a foolhardy tilt at the competence windmill and seek to explain how hashing works and what those nutty numbers mean.  Be warned, me hearties, there be math ahead!  It’s still a draft, so feel free to push back; all criticism (constructive/destructive/dismissive) is warmly welcomed.

My students at the University of Texas School of Law and the Georgetown E-Discovery Training Academy spend considerable time learning that all ESI is just a bunch of numbers.  They muddle through readings and exercises about Base2 (binary), Base10 (decimal), Base16 (hexadecimal) and Base64; as well as about the difference between single-byte encoding schemes (ASCII) and double-byte encoding schemes (Unicode).  It may seem like a wonky walk in the weeds; but the time is well spent when the students snap to the crucial connection between numeric encoding and our ability to use math to cull, filter and cluster data.  It’s a necessary precursor to their gaining Proustian “new eyes” for ESI.
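A short Python sketch makes the "just a bunch of numbers" point concrete. It is an illustration only: the letter E is an arbitrary example, and UTF-16 (little-endian) stands in for the "double-byte" Unicode encoding mentioned above, which is an assumption, since Unicode text can also be stored in other encodings such as the variable-length UTF-8.

```python
import base64

letter = "E"
code_point = ord(letter)                   # the number behind the character

as_binary = format(code_point, "b")        # Base2
as_decimal = str(code_point)               # Base10
as_hex = format(code_point, "X")           # Base16
as_base64 = base64.b64encode(letter.encode("ascii")).decode("ascii")

print("binary: ", as_binary)
print("decimal:", as_decimal)
print("hex:    ", as_hex)
print("base64: ", as_base64)

# The same character stored under a single-byte and a double-byte scheme.
ascii_bytes = letter.encode("ascii")       # one byte per character
utf16_bytes = letter.encode("utf-16-le")   # two bytes per character here

print("ASCII bytes: ", ascii_bytes.hex(), f"({len(ascii_bytes)} byte)")
print("UTF-16 bytes:", utf16_bytes.hex(), f"({len(utf16_bytes)} bytes)")
```

Every line of output is a different notation for the same underlying value, which is why the same file can be compared, filtered and fingerprinted mathematically regardless of how it is displayed.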

Because ESI is just a bunch of numbers, we can use algorithms (mathematical formulas) to distill and compare those numbers.  Every student of electronic discovery learns about cryptographic hash functions and their usefulness as tools to digitally fingerprint files in support of identification, authentication, exclusion and deduplication.  When I teach law students about hashing, I tell them that hash functions are published, standard mathematical algorithms into which we input digital data of arbitrary size and the hash algorithm spits out a bit string (again, just a sequence of numbers) of fixed length called a “hash value.”  Hash values almost exclusively correspond to the digital data fed into the algorithm (termed “the message”) such that the chance of two different messages sharing the same hash value (called a “hash collision”) is exceptionally remote.  But because it’s possible, we can’t say each hash value is truly “unique.”

Using hash algorithms, any volume of data, from the tiniest file to the contents of entire hard drives and beyond, can be almost uniquely expressed as an alphanumeric sequence; in the case of the MD5 hash function, distilled to a value written as 32 hexadecimal characters (0-9 and A-F).  It’s hard to understand until you’ve figured out Base16; but, those 32 characters represent 340 trillion, trillion, trillion different possible values (2^128 or 16^32).
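The fixed length and the size of the value space are easy to check with Python's standard hashlib module. This is a minimal sketch: the two sample messages are arbitrary, and note that hexdigest() writes the letters in lowercase (a-f), which represent the same values as A-F.

```python
import hashlib

# Two messages of very different size (arbitrary sample data).
small_message = b"a"
large_message = b"a" * 1_000_000

small_digest = hashlib.md5(small_message).hexdigest()
large_digest = hashlib.md5(large_message).hexdigest()

# Both digests are 32 hexadecimal characters, regardless of input size.
print(len(small_digest), small_digest)
print(len(large_digest), large_digest)

# 32 hex characters at 4 bits each is 128 bits, so 16^32 equals 2^128.
possible_values = 16 ** 32
print(possible_values == 2 ** 128)
print(f"{possible_values:.3e}")  # roughly 3.4 x 10^38 possible values
```

Each hexadecimal character carries four bits, so 32 characters carry 128 bits, which is where the equivalence of 16^32 and 2^128 comes from.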

Preserving Android Evidence: Return of the Clones?

When computer forensics was in its infancy, examiners collected evidence from disks by copying their contents byte-for-byte to matching, sterilized disks, creating archival and working copies called “clones.”  Cloning drives was inefficient, expensive and error prone compared to the imaging processes that replaced it.  Yet, disk cloning worked for years, and countless cases were made on forensic evidence preserved by cloning and examined on cloned drives.

Now, cloning may be coming back; not to preserve hard drives but  to collect data from mobile devices backed up online, particularly Android phones.  If I’m right, it will be only a stopgap technique; but, it will also be an effective (if not terribly efficient) conduit by which mobile data preserved online can be collected and analyzed in discovery.

Case in point: Google’s recently expanded offering of cheap-and-easy online backup of Android phones, including SMS and MMS messaging, photos, video, contacts, documents, app data and more.  This is a leap forward for all obliged to place a litigation hold on the contents of Android phones — a process heretofore unreasonably expensive and insufficiently scalable for e-discovery workflows.  There just weren’t good ways to facilitate defensible, custodial-directed preservation of Android phone content.  Instead, you had to take phones away from users and have a technical expert image them one-by-one.

Now, it should be feasible to direct custodians to undertake a simple online preservation process for Android phones having many of the same advantages as the preservation methodology I described for iPhones two years ago.  Simple.  Scalable.  Inexpensive.

But unlike the iOS/iTunes methodology, Android backups live in the cloud.  At first, I anticipate there will be no means to download the complete Android backup to a PC for analysis.  Consequently, when we must process the preserved data for litigation, we may need to first restore the data to a factory-initialized “clean” phone as a means to localize the data for collection.  That’s not to say that Google won’t eventually offer a suitable takeout mechanism; after all, Google Takeout capabilities are second to none.  But, until we can back up Android content in a way that it can be faithfully and intelligibly retrieved directly from Google, examiners may revive the tried-and-true cloning of evidence to clean devices, then collecting from the restored device.  Everything old is new again.

It won’t be so bad to use this stopgap approach considering that e-discovery typically entails preservation of far more mobile sources than need ultimately be processed.  So, while backing up many online and cloning a few to clean phones certainly isn’t a perfect solution for Android evidence, it’s good enough and cheap enough that courts should give short shrift to parties claiming that preserving phone evidence is unduly burdensome or complex.  For, as my e-discovery colleagues love to say, “Perfect isn’t the standard.”  I agree.  But, neither is the standard, “we couldn’t be bothered, judge.”