
Ball in your Court

~ Musings on e-discovery & forensics.

Category Archives: E-Discovery

Yes, AI is Here. No, You’re Not Gone.

01 Thursday Aug 2024

Posted by craigball in E-Discovery, General Technology Posts

≈ 6 Comments

Yesterday, I sought to defend the value of my law school course on E-Discovery & Digital Evidence to a law Dean who readily conceded that she didn’t know what e-discovery was or why it would be an important thing for lawyers to understand.  It was a bracing experience.

My métier has always been litigation, to the point that everyone I work with sits in and around trial practice.  My close colleagues recognize that 90% of what trial lawyers do is geared to discovery and motion practice, and much of that motion practice is prompted by discovery disputes. So, hearing how a tax lawyer and academic viewed litigation was eye-opening, and troubling to the extent it impacts what’s taught to new lawyers.

Do you agree about the centrality of discovery to litigation, Dear Reader?

The Dean shared her sense that discovery is being replaced by AI and that “soon AI will handle the production of relevant information instead of lawyers.”  I replied that I expected the review phase to be abetted or supplanted by AI in the near term—that’s here—but it would be some time before all the tasks that come before review would be fully AI-enabled.

The idea that there are crucial tasks requiring lawyer intervention before review was surprising to her.  For those who don’t manage electronic discovery day-to-day, electronically stored information seems to magically appear in review tools.  But for e-discovery folks, the march through identification, preservation, collection and processing is our path, and we know that no one, and no AI, can undertake an assessment of the evidence without facing the data.

You’ve got to face the evidence to assess the evidence.

That’s axiomatic; but it’s downplayed by those shouting “AI! AI!”  As they say in these parts, “you’ve got to put the hay down where the goats can get it.”  Until AI is embedded in everything, until AI faces the data in every phone, cloud repository, storage medium and database in ways that support discovery, the goats can’t get to the hay.

The evidence in our cases is not a “collection” until it’s collected.  That doesn’t necessarily mean a copy must be made to isolate data of interest, but that remains the prevailing way that a discrete assemblage of potentially responsive ESI is marshaled before it is processed for search and review.  Not until that occurs does the evidence face human or AI review.



Garden Variety: Byte Fed. v. Lux Vending

12 Wednesday Jun 2024

Posted by craigball in E-Discovery

≈ 9 Comments

Tags

ESI Protocols, search

My esteemed colleagues, Kelly Twigger and Doug Austin, each posted about a recent discovery decision from the Middle District of Florida, case no. 8:23-cv-102-MSS-SPF, styled Byte Fed., Inc. v. Lux Vending LLC, decided by United States Magistrate Judge Sean Flynn on May 1, 2024.

Kelly and Doug share their customarily first-rate analyses of the ruling’s finding that the assertion of boilerplate objections operates as a waiver.  The Court spanked defendant, The Cardamone Consulting Group, LLC, for its conduct.  That’s been picked apart elsewhere, and I have nothing to add.  I write here to address a feature of the dispute that no one has discussed (and sadly, neither did the Court): the nature of the request for production that prompted the boilerplate objection of “vague and incomprehensible.”  We can learn much more from the case than just boilerplate = waiver.

Let’s look at the underlying request:

DOCUMENT REQUEST NO. 7:

All documents and electronically stored information that are generated in applying the search terms below to Your corporate email accounts (including but not limited to the email accounts for Nicholas Cardamone, Daniel Cardamone, and Patrick McCloskey):

Byte
Bitcoin w/s Florida
Stanton
ByteFederal
Bitcoin w/s trademark
Branden w/3 Tawil
Byte Federal
lawsuit
Brandon w/3 Mintz
most w/5 trusted
Scott w/3 Buchanan
DKI
Google w/s trademark
confusion or confused
Dynamic w/5 keyword

In its Motion to Compel, Plaintiff calls this request “clear on its face, and … a garden-variety type of request for production in connection with narrowly tailored search terms.”  The Plaintiff adds, “[y]et during the parties’ meet-and-confer, and although Cardamone’s counsel claimed that she was familiar with electronic discovery, the assertion was that her client – a company that has purportedly generated hundreds of millions of dollars in connection with online advertising and electronic data – ‘did not understand what to do.’”

So, Dear Reader, would you understand what to do? You’re steeped in electronic discovery—that’s why you’ve stopped by—but is the request clear, narrowly tailored and “garden-variety” such that we can apply it to a proper production workflow?  A few points to ponder:

1. There’s nothing in the Federal Rules of Civil Procedure that prohibits a request to run specific queries against databases, and email accounts are databases.  Rule 34 requires only that the request “describe with reasonable particularity each item or category of items to be inspected.” 

Conventional requests are couched in language geared to relevance; that is, the requests seek documents and ESI about a topic.  Counsel must then apply the law and the facts to guide clients in identifying responsive information.  Counsel reviews the information gathered and decides whether it’s responsive or should be withheld as a matter of right or privilege.

Over time, the notion took hold that sifting through electronically stored information was unduly burdensome, so opposing parties were expected to work together to fashion queries–“search terms” –to narrow the scope of review.  These keyword negotiations run the gamut from laughable to laudable. They’re duels between counsel frequently unarmed with knowledge of the search tools and processes or of the data under scrutiny.  In short, they use their ginormous lawyer brains to guess what might work if the digital world were as they imagine it to be.

Here, the plaintiff cuts to the chase, eschewing a request couched in relevance in favor of asking that specific searches be run: half of them Boolean constructs employing two types of proximity connectors. 
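For readers unfamiliar with the connectors: “w/3” conventionally means within three words, and “w/s” within the same sentence, though platforms differ in their exact semantics. A minimal sketch of a within-N-words test, illustrative only and not any review platform’s implementation:

```python
import re

def within_n_words(text: str, term_a: str, term_b: str, n: int) -> bool:
    """Crude "w/n" proximity test: True if term_a occurs within n words
    of term_b.  Real review tools add stemming, phrase handling and
    sentence-scoped ("w/s") variants; this is a teaching sketch."""
    words = re.findall(r"[A-Za-z0-9']+", text.lower())
    pos_a = [i for i, w in enumerate(words) if w == term_a.lower()]
    pos_b = [i for i, w in enumerate(words) if w == term_b.lower()]
    return any(abs(a - b) <= n for a in pos_a for b in pos_b)

# e.g., the request's "Brandon w/3 Mintz":
print(within_n_words("Email from Brandon J. Mintz re the lawsuit",
                     "Brandon", "Mintz", 3))  # True
```

That different tools implement these connectors differently is one reason negotiated search terms demand a shared understanding of the platform that will run them.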

Was this smart?   You decide.



Girding for the E-Savvy Opponent (Revisited)

26 Friday Apr 2024

Posted by craigball in Computer Forensics, E-Discovery, Uncategorized

≈ 7 Comments

Tags

competence, disclosure, discovery, edisclosure, eDiscovery

A friend shared that she was seeing the Carole King musical, “Beautiful,” and I recalled the time I caught it twice on different visits to London in 2015 because I enjoyed it so. I reflected on why I was in London in Summer nine years ago and came across a post from the time–a post that I liked well enough to revisit below. I predicted the emergence of the e-savvy opponent, something that has indeed come to pass, yet less widely and effectively than I’d hoped (and still hope for). A new generation of e-discoverers has emerged since, so perhaps the post will be fresh (and remain relevant) for more than a few, and sufficiently forgotten to feel fresh for the rest:

(From May 12, 2015): I am in Great Britain this week addressing an E-Discovery and Information Governance conclave, joined by esteemed American colleagues and friends, Jason Baron and Ralph Losey among other luminaries.  My keynote topic opening the conference is Girding for the E-Savvy Opponent. Here is a smattering of what I expect to say.

I arrived in London from Budapest in time to catch some of the events for the 70th anniversary of VE Day, marking the hard-won victory over Germany in the war that shortly followed the war that was to have ended all wars.

As we sported poppies and stood solemnly at the Cenotaph recalling the sacrifices made by our parents and grandparents, I mulled technology’s role in battle, and the disasters that come from being unprepared for a tech-savvy opponent.

It’s said that “Generals are always prepared to fight the last war.” This speaks as much to technology as to tactics.  Mounted cavalry proved no match for armored tanks.  Machine guns made trench warfare obsolete.  The Maginot Line became a punch line thanks to the Blitzkrieg: “Heavy fortifications?  No problem, mein Schatzi, ve vill just drive arount tem.”

In e-disclosure, we still fight the last war, smug in the belief that our opponents will never be e-savvy enough to defeat us.

Our old war ways have served so long that we are slow to recognize a growing vulnerability.  To date, our opponents have proved unsophisticated, uncreative and un-tenacious.  Oh, they make a feint against databases here and a half-hearted effort to get native production there; but, for the most part, they’re still fighting with hordes, horses and sabers.  We run roughshod over them.  We pacify them with offal and scraps.

But, we don’t think of it that way, of course.  We think we are great at all this stuff, and that the way we do things is the way it’s supposed to be done.  Large companies and big law firms have been getting away with abusive practices in e-disclosure for so long that they have come to view it as a birthright.  I am the 19th Earl of TIFF.  My father was the Royal Exchequer of Keywords.  I have more than once heard an opponent defend costly, cumbersome procedures that produce what I didn’t seek and didn’t want with the irrefutable justification of, “we did what we always do.”

Tech-challenged opponents make it easy.  They don’t appreciate how our arsenal of information has changed; so, they shoot at us with obsolete requests from the last war, the paper war.  They don’t grasp that the information they need now lives in databases and won’t be found by keywords.  They demand documents.  We have data.  They demand files.  We have sources.


But, our once tech challenged opponents will someday evolve into Juris Doctor Electronicus.  When they do, here is some of what to expect from them:

E-savvy counsel succeeds not by overreaching but by insisting on mere competence—competent scope, competent processes and competent forms of production.  Good, not just good enough.

Your most effective defense against e-savvy counsel is the Luddite judge who applies the standards of his or her former law practice to modern evidence. Your best strategy here is to continue to expose young lawyers to outmoded practices so that when they someday take the bench they will also know no better way.

Another strategy against e-savvy counsel is to embed outmoded practices in the rules and to immunize incompetence against sanctions.

But these are stopgap strategies–mere delaying tactics.  In the final analysis, the e-savvy opponent needn’t fear retrograde efforts to limit electronic disclosure. Today, virtually all evidence is born electronically; consequently, senseless restrictions on electronic disclosure cannot endure unless we are content to live in a society where justice abides in purposeful ignorance of the evidence.  We have not fallen so, and we will not fall that far.

The e-savvy opponent’s most powerful ally is the jurist who can distinguish between the high cost and burden occasioned by poor information governance and the high cost and burden that flows from overreaching by incompetent requests.  Confronted with a reasonable request, this able judge will give you no quarter because your IG house is not in order.

E-savvy counsel well understands that claims like, “that’s gone,” “we can’t produce it that way” and “we searched thoroughly” rarely survive scrutiny.

It’s not that no enterprise can match the skills of the e-savvy opponent. It’s that so few have ever had to do so.  Counsel for producing parties haven’t had to be particularly e-savvy because opposing counsel rarely were.

Sure, you may have been involved in the Black Swan discovery effort–the catastrophic case where a regulator or judge compelled you to go far beyond your normal scope. But is that sustainable? Could you do that on a regular basis if all of your opponents were e-savvy?

You may respond, “But we shouldn’t have to respond that way on a regular basis.” In fact, you should, because “e-savvy” in our opponents is something we must come to expect and because, if the opponent is truly e-savvy, their requests will likely smack of relevance and reasonableness.

Remember, the e-savvy opponent about which I warn is not the twit with a form or the wanker who’s simply trying to inflate the scope of the disclosure as a means to extort settlement.  They’re no match for you.  The e-savvy opponent to fear is the one who can persuade a court that the scope is appropriate and proportionate because it is, in fact, both.


Cloud Attachments: Versions and Purview

08 Monday Apr 2024

Posted by craigball in Computer Forensics, E-Discovery, Uncategorized

≈ 6 Comments

Tags

cloud attachments, eDiscovery, Linked attachments, M365, modern attachments, Purview

Last week, I dug into Cloud Attachments to email, probing the propensity of producing parties to shirk collection of linked documents.  Here, I want to discuss the versioning concern offered as a justification for non-production, and the use of hash duplicate identification to integrate supplementary productions with incomplete prior productions.

Recently on LinkedIn, Very Smart Guy, Rachi Messing, shared this re: cloud attachments,

the biggest issue at hand is not the technical question of how to collect them and search them, but rather what VERSION is the correct one to collect and search.

Is it:

1. The version that existed at the time the email was sent (similar to a point in time capture of a file that is attached to an email the traditional way)

2. The version that was seen the first time the recipient opened it (which may lead to multiple versions required based on the exact timing of multiple recipients opening at varying times)

3. The version that exists the final time a recipient opened it

4. The most recent version in existence

I understand why Rachi might minimize the collection and search issue. He’s knee deep in Microsoft M365 collection.  As I noted in my last post, Microsoft makes cloud attachment collection a feature available to its subscribers, so there’s really no excuse for the failure to collect and search cloud attachments in Microsoft M365. 

I’d reframe Rachi’s question: Once collected, searched and determined to be responsive, is the possibility that the version of a cloud attachment reviewed differs from the one transmitted a sufficient basis upon which to withhold the attachment from production?

Respecting the versioning concern, I responded to Rachi’s post this way:

The industry would profit from objective analysis of the instance (e.g., percentage) of Cloud attachments modified after transmittal. I expect it will vary from sector to sector, but we would benefit from solid metrics in lieu of the anecdotal accounts that abound. My suspicion is that the instance is modest overall, the majority of Cloud attachments remaining static rather than manifesting as collaborative documents. But my suspicion would readily yield to meaningful measurement.  … May I add that the proper response to which version to collect to assess relevance is not ‘none of them,’ which is how many approach the task.

Digging into the versioning issue demands I retread ground on cloud attachments generally.

A “Cloud Attachment” is what Microsoft calls a file transmitted via email in which the sender places the file in a private online repository (e.g., Microsoft OneDrive) and sends a link to the uploaded file to the intended recipients.  The more familiar alternative to linking a file as a cloud attachment is embedding the file in the email; accordingly, such “Embedded Attachments” are collected with the email messages for discovery and cloud attachments are collected (downloaded) from OneDrive, ideally when the email is collected for discovery.  As a rule-of-thumb, large files tend to be cloud attachments automatically uploaded by virtue of their size.  The practice of linking large files as cloud attachments has been commonplace for more than a decade.

Within the Microsoft M365 email environment, searching and collecting email, including its embedded and cloud attachments, is facilitated by a suite of features called Microsoft Purview.  Terming any task in eDiscovery “one-click easy” risks oversimplification, but the Purview eDiscovery (Premium “E5”) features are designed to make collection of cloud attachments to M365 email nearly as simple as ticking a box during collection.

When a party using Microsoft M365 email elects to collect (export) a custodian’s email for search, they must decide whether to collect files sent as cloud attachments so they may be searched as part of the message “family,” the term commonly applied to a transmitting message and its attachments.  Preserving this family relationship is important because the message tells you who received the attachments and when, while searching the attachments tells you what information was shared. The following screenshot from Microsoft illustrates the box checked to collect cloud attachments. Looks “one-click easy,” right?

By themselves, the cloud attachment links in a message reveal nothing about the content of the cloud attachments.  Sensibly, the target documents must be collected to be assessed.  As noted, files are not linked because they have some different character in terms of relevance; often they are linked because they are larger files, and to that extent, they hold a greater volume of potentially relevant information.

Just as it would not have been reasonable in the days of paper discovery to confine a search to documents on your desk but not in your desk, it’s not reasonable to confine a search of email attachments to embedded attachments but not cloud attachments.  Both are readily accessible to the custodians of the email using the purpose-built tools Microsoft supplies to its email customers.

Microsoft Purview collects cloud attachments as they exist at the time of collection; so, if the attachment was edited after transmittal, the attachment will reflect those edits.  The possibility that a document has been edited is not a new one in discovery; it goes to the document’s admissibility not its discoverability.  The relevance of a document for discovery depends on its content and logical unitization, and assessing content demands that it be searched, not ignored on the speculative possibility that it might have changed.

If a cloud attachment were changed after transmittal, those changes are customarily tracked within the document.  Accordingly, if a cloud attachment has run the gauntlet of search and review, any lingering suspicion that the document was changed may be resolved by, e.g., production of the version closest in time to transmittal or by the parties meeting and conferring.  Again, the possibility that a document has been edited is nothing new; and is merely a possibility.  It’s ridiculous to posit that a party may eschew collecting or producing all cloud attachments because some might have been edited.

Cloud attachments are squarely within the ambit of what must be assessed for relevance. The potential for a cloud attachment to be responsive is no less than that of an item transmitted as an embedded attachment.  The burden claimed by responding parties grows out of their failure to do what clearly should have been done in the first place; that is, it stems from the responding party’s decision to exclude potentially relevant, accessible documents from being collected and searched. 

If you’re smart, Dear Reader, you won’t fail to address cloud attachments explicitly in your proposed ESI Protocols and/or Requests for Production.  I can’t make this point too strongly, because you’re not likely to discover that the other side didn’t collect and search cloud attachments until AFTER they make a production, putting you in the unenviable posture of asking for families produced without cloud attachments to be reproduced with cloud attachments.  Anytime a Court hears that you are asking for something to be produced a second time in discovery, there’s a risk the Court may be misled by an objection grounded on Federal Rule of Civil Procedure 34(b)(2)(E)(iii), which states that “[a] party need not produce the same electronically stored information in more than one form.”  In my mind, “incomplete” and “complete” aren’t what the drafters of the Rule meant by “more than one form,” but be prepared to rebut the claim.

At all events, a party who failed to collect cloud attachments will bewail the need to do it right and may cite as burdensome the challenge of distinguishing items reviewed without cloud transmittals from those requiring review when made whole by the inclusion of cloud attachments.

Once a party collects cloud attachments and transmittals, there are various ways to distinguish between messages updated with cloud attachments and those previously reviewed without cloud attachments.  Identifying families previously collected that have grown in size is one approach.  Then, by applying a filter, only the attachments of these families would be subjected to supplementary keyword search and review.  The emails with cloud attachments that are determined to be responsive and non-privileged would be re-produced as families comprising the transmittal and all attachments (cloud AND embedded).  An overlay file may be used to replace items previously produced as incomplete families with complete families.  No doubt there are other efficient approaches.
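The size-comparison approach can be sketched simply. The dict-of-counts representation below is hypothetical; a real workflow would key on stable message identifiers assigned by the collection tool:

```python
def families_needing_review(before: dict, after: dict) -> set:
    """Compare a prior collection (made without cloud attachments) to a
    re-collection (made with them) and flag the family IDs that gained
    members, i.e., the families whose new attachments need supplementary
    search and review.  Both arguments map family ID -> item count.
    A sketch only; identifiers and structure are illustrative."""
    return {fam for fam, count in after.items()
            if count > before.get(fam, 0)}

before = {"FAM001": 2, "FAM002": 1}   # transmittal + embedded items only
after  = {"FAM001": 2, "FAM002": 3}   # FAM002 gained two cloud attachments
print(families_needing_review(before, after))  # {'FAM002'}
```

Only the flagged families proceed to supplementary keyword search; the rest were fully assessed the first time.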

If all transmittal messages were searched and assessed previously (albeit without their cloud attachments), there would not be a need to re-assess those transmittals unless they have become responsive by virtue of a responsive cloud attachment.  These “new” families need no de-duplication against prior production because they were not produced previously.  I know that sounds peculiar, but I promise it makes sense once you think through the various permutations.

With respect to using hash deduplication, the hash value of a transmittal does not change because you collect a NON-embedded cloud attachment; leastwise not unless you change the way you compute the hash value to incorporate the collected cloud attachment.  Hash deduplication of email has always entailed the hashing of selected components of messages because email headers vary.  Accordingly, a producing party need compare only the family segments that matter, not the ones that do not. In other words, de-duplicating what has been produced versus new material is a straightforward process for emails (and one that greatly benefits from use of the EDRM MIH). Producing parties do not need to undertake a wholesale re-review of messages; instead, they need to review for the first time those things they should have reviewed from inception.
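A sketch of the component-based hashing described above; the fields and normalization chosen here are illustrative, not any vendor’s actual recipe:

```python
import hashlib

def message_hash(msg: dict) -> str:
    """Hash only selected message components, a common deduplication
    approach since full email headers vary hop to hop.  The field list
    here is illustrative."""
    parts = [msg.get(f, "").strip().lower()
             for f in ("from", "to", "date", "subject", "body")]
    return hashlib.sha256("\x1f".join(parts).encode("utf-8")).hexdigest()

msg = {"from": "a@example.com", "to": "b@example.com",
       "date": "2024-04-08", "subject": "See attached", "body": "Link: ..."}
# Components excluded from the hash (like a per-hop Received header, or
# a separately collected cloud attachment) don't alter the value:
msg2 = dict(msg, received="from mx2.example.com; varies per hop")
print(message_hash(msg) == message_hash(msg2))  # True
```

Because the transmittal’s dedup hash is computed from the same components before and after the cloud attachments are collected, previously produced transmittals remain matchable against the supplemented families.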

I’ll close with a question for those who conflate cloud attachments (which reside in private cloud repositories) with hyperlinks to public-facing web resources, objecting that dealing with collecting cloud attachments will require collection of all hyperlinked content. What have you been doing with the hyperlinks in your messages until now? In my experience, loads of us include a variety of hyperlinks in email signature blocks. We’ve done it for years. In my email signature, I hyperlink to my email address, my website and my blog; yet, I’ve never had trouble distinguishing those links from embedded and cloud attachments. The need to integrate cloud attachments in eDiscovery is not a need to chase every hyperlink in an email. Doug Austin does a superb job debunking the “what about hyperlinks” strawman in Assumption One of his thoughtful post, “Five Assumptions About the Issue of Hyperlinked Files as Modern Attachments.”

Bottom Line: If you’re an M365 email user, you need to grab the cloud attachments in your Microsoft repositories. If you’re a Gmail user, you need to grab the cloud attachments in your Google Drive repositories. That a custodian might conceivably link to another repository is no reason to fail to collect from M365 and Gmail.



What’s All the Fuss About Linked Attachments?

29 Friday Mar 2024

Tags

ESI Protocols, hyperlinked files, Linked attachments, Purview

In the E-Discovery Bubble, we’re embroiled in a debate over “Linked Attachments.” Or should we say “Cloud Attachments,” or “Modern Attachments” or “Hyperlinked Files?” The name game aside, a linked or Cloud attachment is a file that, instead of being tucked into an email, gets uploaded to the cloud, leaving a trail in the form of a link shared in the transmitting message. It’s the digital equivalent of saying, “It’s in an Amazon locker; here’s the code” versus handing over a package directly.  An “embedded attachment” travels within the email, while a “linked attachment” sits in the cloud, awaiting retrieval using the link.

Some recoil at calling these digital parcels “attachments” at all. I stick with the term because it captures the essence of the sender’s intent to pass along a file, accessible only to those with the key to retrieve it, versus merely linking to a public webpage.  A file I seek to put in the hands of another via email is an “attachment,” even if it’s not an “embedment.” Oh, and Microsoft calls them “Cloud Attachments,” which is good enough for me.

Regardless of what we call them, they’re pivotal in discovery. If you’re on the requesting side, prepare for a revelation. And if you’re a producing party, the party’s over.

A Quick March Through History

Nascent email conveyed basic ASCII text but no attachments.  In the early 90s, the advent of Multipurpose Internet Mail Extensions (MIME) enabled files to hitch a ride on emails by encoding their binary content into ASCII text via Base64. This tech pivot meant attachments could join emails as encoded stowaways, to be unveiled upon receipt.
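Python’s standard email library shows this embedding at work: a binary attachment is Base64-encoded into the body of the message itself, which is why collecting the email necessarily collects the attachment (addresses and content below are made up for illustration):

```python
from email.message import EmailMessage

# Build a message with a traditional embedded attachment; per MIME, the
# binary payload is Base64-encoded into the message body itself.
msg = EmailMessage()
msg["From"] = "sender@example.com"
msg["To"] = "recipient@example.com"
msg["Subject"] = "See attached"
msg.set_content("Report attached.")
msg.add_attachment(b"quarterly figures...", maintype="application",
                   subtype="octet-stream", filename="report.xlsx")

raw = msg.as_string()
# The attachment travels inside the message as an encoded stowaway:
print("Content-Transfer-Encoding: base64" in raw)  # True
```

Contrast a cloud attachment: the message body would carry only a URL, and `raw` would contain nothing of the file’s content.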

For two decades, this embedding magic meant capturing an email also netted its attachments. But come the early 2010s, the cloud era beckoned. Files too bulky for email began diverting to cloud storage with emails containing only links or “pointers” to these linked attachments. 

The Crux of the Matter

Linked attachments aren’t newcomers; they’ve been lurking for over a decade. Yet, there’s a growing “aha” moment among requesters as they realize the promised exchange of digital parcels hasn’t been as expected. Increasingly—and despite contrary representations by producing parties—relevant, responsive and non-privileged attachments to email aren’t being produced because relevant, responsive and non-privileged attachments aren’t being searched.

Wait! What?  Say that again.

You heard me.  As attachments shifted from being embedded to being linked, producing parties simply stopped collecting and searching those attachments.

How is that possible?  Why didn’t they disclose that? 

I’ll explain if you’ll indulge me in another history lesson.

Echoes From the Past

Traditionally, discovery leaned on indexing the content of email and attachments for quicker search, bypassing the need to sift through each individually.  Every service provider employs indexed search. 

When attachments are embedded in messages, those attachments are collected with the messages, then indexed and searched.  But when those attachments are linked instead of embedded, collecting them requires an added step of downloading the linked attachments with the transmitting message.  You must do this before you index and search because, if you fail to do so, the linked attachments aren’t searched or tied to the transmitting message in a so-called “family relationship.”

They aren’t searched.  Not because they are immaterial or irrelevant or in any absolute sense, inaccessible; a linked attachment is as amenable to being indexed and searched as any other document. They aren’t searched because they aren’t collected; and they aren’t collected because it’s easier to blow off linked attachments than collect them.

Linked attachments, squarely under the producer’s control, pose a quandary. A link in an email is a dead-end for anyone but the sender and recipients and reveals nothing of the file’s content. These linked attachments could be brimming with relevant keywords yet remain unexplored if not collected with their emails.

So, over the course of the last decade, how many times has an opponent revealed that, despite a commitment to search a custodian’s email, they were not going to collect and search linked documents?

The curse and blessing of long experience is having seen it all before.  Every generation imagines they invented sex, drugs and rock-n-roll, and every new information and communication technology is followed by what I call the “getting-away-with-murder” phase in civil discovery.  Litigants claim that whatever new tech has wrought is “too hard” to deal with in discovery, and they get away with murder by not having to produce the new stuff until long after we have the means and methods to do so.  I lived through that with e-mail, native production, then mobile devices, web content and now, linked attachments.

This isn’t just about technology but transparency and diligence in discovery. The reluctance to tackle linked attachments under claims of undue burden echoes past reluctances with emerging technologies. Yet, linked attachments, integral to relevance assessments, shouldn’t be sidelined.

What is the Burden, Really?

We see conclusory assertions of burden notwithstanding that the biggest platforms like Microsoft and Google offer ‘pretty good’ mechanisms to deal with linked attachments.  So, if a producing party claims burden, it behooves the Court and requesting parties to inquire into the source of the messaging.  When they do, judges may learn that the tools and techniques to collect linked attachments and preserve family relationships exist, but the producing party elected not to employ them.  Granted, these tools aren’t perfect; but they exist, and perfect is not the standard, just as pretending there are no solutions and doing nothing is not the standard. 

Claims that collecting linked attachments pose an undue burden because of increased volume are mostly nonsense.  The longstanding practice has been to collect a custodian’s messages and ALL embedded attachments, then index and search them.  With few exceptions, the number of items collected won’t differ materially whether the attachment is embedded or linked (although larger files tend to be linked).  So, any party arguing that collecting linked attachments will require the search of many more documents than before is fibbing or out of touch.  I try not to attribute to guile that which may be explained by ignorance, so let’s go with the latter.

Half-Baked Solutions

Challenged for failing to search linked attachments, a responding party may protest that they searched the transmitting emails and even commit to collecting and searching linked attachments to emails containing search hits.  Sounds reasonable, right?  Yet, it’s not even close to reasonable. Here’s why:

When using lexical (e.g., keyword) search to identify potentially responsive e-mail “families,” the customary practice is to treat a message and its attachments as potentially responsive if either the content of the transmitting message or its attachment generates search “hits” for the keywords and queries run against them.  This is sensible because transmittals often say no more than, “see attached;” it’s the attachment that holds the hits.  Yet, stripped of its transmittal, you won’t know the timing or circulation of the attachment. So, we preserve and disclose email families.

But, if we rely upon the content of transmitting messages to prompt a search of linked attachments, we will miss the lion’s share of responsive evidence.  If we produce responsive documents without tying them to their transmittals, we can’t tell who got what and when.  All that “what did you know and when did you know it” matters.
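The customary family rule described above reduces to a few lines of code. This is only a sketch; the document structure is hypothetical, standing in for whatever your review platform exposes:

```python
def family_responsive(family):
    """A family (message + attachments) is potentially responsive
    if ANY member hits, whether transmittal or attachment."""
    return any(doc["hit"] for doc in family)

family = [
    {"id": "MSG-001", "hit": False},   # a "see attached" transmittal
    {"id": "ATT-001a", "hit": True},   # the attachment holds the hit
]
print(family_responsive(family))  # True
```

Searching only transmittals, by contrast, would test just the first member and miss this family entirely.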

Why Guess When You Can Measure?

Hopefully, you’re wondering: how many hits suggesting relevance occur in transmittals, and how many in attachments? How many occur in both? Great questions!  Happily, we can measure these things.  We can determine, on average, what percentage of hits falls within messages versus their attachments.

If you determine that, say, half of all hits fell within embedded attachments, you can fairly infer a similar rate for the linked attachments that went unsearched.  In that sense, you can estimate how much you’re missing and supply a key component of a proper proportionality analysis.
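A minimal sketch of that measurement, assuming your review platform can export a per-family flag for whether the transmittal hit, the attachment hit, or both (the field names here are hypothetical):

```python
from collections import Counter

def hit_distribution(families):
    """Tally where search hits land within email families.

    `families` is a list of dicts like:
      {"message_hit": bool, "attachment_hit": bool}
    one entry per message-plus-attachments family.
    """
    tally = Counter()
    for fam in families:
        if fam["message_hit"] and fam["attachment_hit"]:
            tally["both"] += 1
        elif fam["message_hit"]:
            tally["message_only"] += 1
        elif fam["attachment_hit"]:
            tally["attachment_only"] += 1
        else:
            tally["no_hit"] += 1
    return tally

# Toy data standing in for exported review-platform flags
families = [
    {"message_hit": True,  "attachment_hit": False},
    {"message_hit": False, "attachment_hit": True},
    {"message_hit": False, "attachment_hit": True},
    {"message_hit": True,  "attachment_hit": True},
]
tally = hit_distribution(families)
hit_families = sum(v for k, v in tally.items() if k != "no_hit")
# Share of hit families where ONLY the attachment hit: the portion
# you'd expect to miss if linked attachments go unsearched.
missed_share = tally["attachment_only"] / hit_families
print(tally, missed_share)
```

Run against embedded attachments you did search, the `attachment_only` share is a defensible estimate of what an unsearched linked-attachment population is costing you.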

So why don’t producing parties asserting burden supply this crucial metric? 

The Path Forward

Producing parties have been getting away with murder on linked attachments for so long that they’ve come to view it as an entitlement. Linked attachments are squarely within the ambit of what must be assessed for relevance.  The potential for a linked attachment to be responsive is no less than that of an item transmitted as an embedded attachment.  So, let’s stop pretending they have a different character in terms of relevance and devote our energies to fixing the process.

Collecting linked attachments isn’t as Herculean as some claim, especially with tools from giants like Microsoft and Google easing the process. The challenge, then, isn’t in the tools but in the willingness to employ them.

Do linked attachments pose problems?  They absolutely do!  I’ve glossed over ancillary issues of versioning and credentials because those concerns reside in the realm between good and perfect solutions. Collection methods must be adapted to them—with clumsy workarounds at first and seamless solutions soon enough.  But in acknowledging that there are challenges, we must also acknowledge that these linked attachments have been around for years, and they are evidence.  Waiting until the crisis stage to begin thinking about how to deal with them was a choice, and a poor one.  I shudder to think of the responsive information ignored every single day because this issue is inadequately appreciated by counsel and courts.

Happily, this is simply a technical challenge and one starting to resolve.  Speeding the race to resolution requires that courts stop giving a free pass to the practice of ignoring linked attachments.  Abraham Lincoln defined a hypocrite as a “man who murdered his parents, and then pleaded for mercy on the grounds that he was an orphan.”  Having created the problem and ignored it for years, producing parties’ pleas for mercy seem disingenuous.

In Conclusion

We’re at a crossroads, with technical solutions within reach and the legal imperative clearer than ever. It’s high time we bridge the gap between digital advancements and discovery obligations, ensuring that no piece of evidence, linked or embedded, escapes scrutiny.


Posted by craigball | Filed under Computer Forensics, E-Discovery, Uncategorized

≈ 18 Comments

Lessons from Lousy Lexical Search (and Tips to Do Better)

26 Monday Feb 2024

Posted by craigball in Computer Forensics, E-Discovery, Uncategorized

≈ 7 Comments

Preparing a talk about keyword search, I set out to distill observations gleaned from a host of misbegotten keyword search efforts, many from the vantage point of the court’s neutral expert née Special Master assigned to clean up the mess.  What emerged feels a bit…dark…and…uh…grouchy: like truths no one wants to hear because then we might be obliged to change–when we all know how profitable it is to bicker about keywords in endless, costly rounds of meeting and conferring.

The problems I’m dredging up have endured for decades, and their solutions have been clear and accessible for just as long.  So, why do we keep doing the same dumb things and expecting different outcomes?

In the 25+ years I’ve studied lexical search of ESI, I’ve learned that:

1. Lexical search is a crude tool that misses much more than it finds and leads to review of a huge volume of non-relevant information.  That said, even crude tools work wonders in the hands of skilled craftspeople who chip away with care to produce masterpieces.  The efficacy of lexical search increases markedly in the hands of adept practitioners who meticulously research, test and refine their search strategies.

2. Lawyers embrace lexical search despite knowing almost nothing about the limits and capabilities of search tools and without sufficient knowledge of the datasets and indices under scrutiny.  Grossly overestimating their ability to compose effective search queries, lawyers blithely proffer untested keywords and Boolean constructs.  Per Judge John Facciola a generation ago, lawyers think they’re experts in search “because they once used Google to find a Chinese restaurant in San Francisco that served dim sum and was open on Sundays.”

3. Without exception, every lexical search is informed and improved by the iterative testing of queries against a substantial dataset, even if that dataset is not the data under scrutiny.  Iterative testing is invaluable when queries are run against representative samples of the target data.  Every. Single. Time.

4. Hit counts alone are a poor measure of whether a lexical search is “good” or “bad.”  A “good” query may simply be generating an outsize hit count when run against the wrong dataset in the wrong way (e.g., searching for a person’s name in their own email).  Lawyers are too quick to exclude queries with high perceived hit counts before digging into the causes of poor precision.

5. A query’s success depends on how the dataset has been processed and indexed prior to search, challenging the assumption that search mechanisms just ‘work,’ as if by magic. 

6. Lexical search is a sloppy proxy for language; and language is replete with subtlety, ambiguity, polysemy and error, all serving to frustrate lexical search.  Effective lexical search adapts to accommodate subtlety, ambiguity, polysemy and error by, inter alia, incorporating synonyms, jargon and industry-specific language, common misspellings and alternate spellings (e.g., British vs. American spellings) and homophones, acronyms and initializations.

7. Lexical search’s utility lies equally in filtering out irrelevant data as it does in uncovering relevant information; so, it demands meticulous effort to mitigate the risk of overlooking pertinent documents.
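To illustrate point 6, here is a sketch of variant expansion: building a Boolean query that ORs together synonyms, regional spellings and known misspellings for each concept. The variant table below is invented for illustration; in practice, it comes from SME consultation, research and iterative testing:

```python
# Hypothetical variant table: each concept maps to synonyms, jargon,
# regional spellings and misspellings gathered while planning queries.
VARIANTS = {
    "color": ["color", "colour"],
    "organize": ["organize", "organise", "orgnize"],  # incl. a common typo
}

def expand_term(term):
    """Return the spellings/synonyms to OR together for a concept."""
    return VARIANTS.get(term, [term])

def to_boolean_query(terms):
    """Variants of each concept are ORed; concepts are ANDed."""
    clauses = ["(" + " OR ".join(expand_term(t)) + ")" for t in terms]
    return " AND ".join(clauses)

query = to_boolean_query(["color", "organize"])
print(query)  # (color OR colour) AND (organize OR organise OR orgnize)
```

The point isn’t the code; it’s that variant lists should be explicit, documented artifacts you can test and defend, not ad hoc guesses typed into a search box.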

Understanding some of these platitudes requires delving into the science of search and ESI processing.  A useful resource might be my 2019 primer on Processing in E-Discovery; admittedly not an easy read for all, but a window into the ways that processing ESI impacts searchability.

Fifteen years ago, I published a short paper called “Surefire Steps to Splendid Search” and set out ten steps that I promised would produce more effective, efficient and defensible queries.  Number 7 was:

“Test, Test, Test! The single most important step you can take to assess keywords is to test search terms against representative data from the universe of machines and data under scrutiny. No matter how well you think you know the data or have refined your searches, testing will open your eyes to the unforeseen and likely save a lot of wasted time and money.”

In the fullness of time, those ten steps ring as true today as when George Bush was in the White House. Then, as now, the greatest improvements in lexical search can be achieved with modest tweaks in methodology.  A stitch in time saves nine.

Another golden oldie is my 2012 collection of ten brief essays called “Shorties on Search.”

But, as much as I think those older missives hold up, and despite the likelihood that natural language prompts will soon displace old-school search queries, here’s a fresh recasting of my tips for better lexical search:

Essential Tips for Effective Lexical Search in Civil Discovery

Pre-Search Preparation:

  1. Understand the Dataset
    • Identify data sources and types, then tailor the search to the data.
    • Assess the volume and organization of the dataset.  Can a search of fielded data facilitate improved precision?
    • Review any pre-processing steps applied, like normalization of case and diacriticals or use of stop words in creating the searchable indices.
  2. Know Your Search Tools
    • Familiarize yourself with the tool’s syntax and keyword search capabilities.
    • Understand the tool’s limitations, especially with non-textual data and large documents.
  3. Consult with Subject Matter Experts (SMEs)
    • Engage SMEs for insights on relevant terminology and concepts.
    • Use SME knowledge to refine keyword selection and search strategies.

Search Term Selection and Refinement:

  1. Develop Comprehensive Keyword Lists
    • Include synonyms, acronyms, initializations, variants, and industry-specific jargon.
    • Consider linguistic and regional variations.
    • Account for misspellings, alternate spellings and common transposition errors.
  2. Utilize Boolean Logic and Advanced Operators
    • Apply Boolean operators and proximity searches effectively.
    • Experiment with wildcards and stemming for broader term inclusion.
  3. Iteratively Test and Refine Search Queries
    • Conduct sample searches to evaluate and refine search terms.
    • Adjust queries based on testing outcomes and new information.

Execution and Review:

  1. Provide for Consistent Implementation Across Parties and Service Providers
    • Use agreed-upon terms where possible.  The most defensible search terms and methods are those the parties choose collaboratively.
    • Ensure consistency in search term application across the datasets, over time and among multiple parties.
  2. Sample and Manually Review Results
    • Randomly sample search results to assess precision and recall.
    • Adjust search terms and strategies based on manual review findings.
  3. Negotiate Search Terms with Opposing Counsel
    • Engage in discussions to agree on search terms and methodologies.
    • Document agreements to preempt disputes over discovery completeness.
    • Make abundantly clear whether a non-privileged document hit by a query must be produced or whether (as most producing parties assume) the items hit may nevertheless be withheld after a review for responsiveness. 

Post-Search Analysis:

  1. Validate and Document the Search Process
    • Maintain comprehensive documentation of search terms, queries, exception items and decisions.  Never employ a set of queries to exclude items from discovery without the ability to document the queries and process employed.
    • Ensure the search methodology is defensible and compliant with legal standards.
  2. Adapt and Evolve Search Strategies
    • Remain flexible to adapt strategies as case evidence and requirements evolve.
    • Leverage lessons from current searches to refine future discovery efforts.
  3. Ensure Ethical and Legal Compliance
    • Adhere to privacy, privilege, and ethical standards throughout the discovery process.
    • Review and apply discovery protocols and court orders accurately.
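As a loose illustration of the sampling step above, one might draw a random sample of documents hit by a query and estimate precision from the manual review calls. The document IDs and review outcomes below are stand-ins for real data:

```python
import random

def sample_for_review(hit_ids, n=50, seed=42):
    """Draw a simple random sample of hit documents for manual review.
    A fixed seed makes the sample reproducible for documentation."""
    rng = random.Random(seed)
    return rng.sample(hit_ids, min(n, len(hit_ids)))

def estimate_precision(reviewed):
    """`reviewed` is a list of (doc_id, is_relevant) pairs from review."""
    relevant = sum(1 for _, rel in reviewed if rel)
    return relevant / len(reviewed)

hits = [f"DOC-{i:05d}" for i in range(1, 1001)]  # 1,000 hypothetical hits
sample = sample_for_review(hits, n=10)
# Stand-in for human relevance calls on the sampled documents:
reviewed = [(doc, i % 2 == 0) for i, doc in enumerate(sample)]
print(len(sample), estimate_precision(reviewed))
```

Recall is harder to estimate (it requires sampling the documents the query did NOT hit), but the same sampling discipline applies.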


Surviving a Registration Bomb Attack

02 Friday Feb 2024

Posted by craigball in Computer Forensics, E-Discovery, General Technology Posts, Personal

≈ 1 Comment

Tags

cyber-security, cybercrime, cybersecurity, phishing, security

It started just after 7:00 last night.  My mailbox swelled with messages confirming I’d subscribed to websites and newsletters around the world.  Within an hour, I’d received over 2,000 such messages, and they kept pouring in until I’d gotten 4,000 registration confirmations by 11:00pm. After that, the flood slowed to a trickle.

I was the victim of a registration bomb attack, a scary experience if you don’t grasp what’s happening or know how to protect yourself.  Fortunately, it wasn’t my first rodeo. 

During a similar attack a couple of years ago, I was like a dog on the Fourth of July–I didn’t know what was happening or how to deal with it.  But this time, my nerves weren’t wracked: I knew what was afoot and where the peril lay.

Cybersecurity is not my principal field of practice, but it’s a forensics-adjacent discipline and one where I try to keep abreast of developments.  So, much like a trial lawyer enjoying the rare chance to serve on a jury, I found being the target of a cyberattack as instructive as it was inconvenient.

While a registration bomb attack could be the work of a disgruntled reader (Hey! You can’t please everybody), more often it serves to mask attacks on legitimate accounts by burying notices of password resets, funds transfers or fraudulent credit card charges beneath a mountain of messages.  So, yes, you should treat a registration bomb attack as requiring immediate vigilance in terms of your finances.  Keep a weather eye out for small transfers, especially deposits into a bank account, as these signal efforts to link your account to another as a prelude to theft.  Likewise, look at your credit card transactions to ensure that recent charges are legitimate.  Finally—and the hardest to do amidst a deluge of registration notices—look for efforts to change credentials for e-commerce websites you use like Walmart.com or Amazon.com.

A registration bomb attack is a powerful reminder of the value of always deploying multifactor authentication (MFA) to protect your banking, brokerage and credit card accounts.  Those extra seconds expended on secure logins will spare you hours and days lost to a breach.  With MFA in place, an attacker who succeeds in changing your credentials won’t have the access codes texted to your phone, thwarting efforts to rob you.

The good news is that, if you’re vigilant in the hours a registration bomb is exploding in your email account and you have MFA protecting your accounts, you’re in good shape.

Now for the bad news: a registration bomb is a distributed attack, meaning that it uses a botnet to enlist a legion of unwitting, innocent participants—genuine websites—to do the dirty work of clogging your email account with registration confirmation requests.  Because the websites emailing you are legitimate, there’s nothing about their email to trigger a spam filter until YOU label the message as spam. Unfortunately, that’s what you must do: select the attack messages and label each one as spam.  Don’t bother to unsubscribe to the registrations; just label the messages as spam as quickly as you can. 

This is a pain. And you must be attuned to the potential to mistakenly blacklist senders whose messages you want at the same time you’re squashing the spam messages you don’t want and scanning for password change notices from your banks, brokers and e-commerce vendors.  It’s easier when you know how to select multiple messages before hitting the “spam” button (in Gmail, holding down the Shift key enables you to select a range of messages by selecting the first and last message in the range).  Don’t bother unsubscribing from the registrations; just label the messages as spam as quickly as you can.  Happily, the onslaught of registration spam will stop; thousands become hundreds and hundreds become dozens in just hours (though you’ll likely get stragglers for days).

Registration bombing attacks will continue so long as the web is built around websites sending registration confirmation messages—a process ironically designed to protect you from spam.   If you’ve deployed the essential mechanisms to protect yourself online, particularly strong, unique passwords, multifactor authentication and diligent review of accounts for fraudulent transactions, don’t panic; the registration bomb will be no more than a short-lived inconvenience.  This, too, shall pass.


Monica Bay, 1949-2023

30 Monday Oct 2023

Posted by craigball in Computer Forensics, E-Discovery, General Technology Posts, Personal, Uncategorized

≈ 11 Comments

I’m saddened to share that Monica Bay, the forceful, revered former editor of Law Technology News (now Legaltech News) has died after a long, debilitating illness.  Though a durable resident of New York City and Connecticut, Monica’s life ended in California where it began.  Monica described herself as a “provocateur,” an apt descriptor from one gifted in finding the bon mot.  Monica was a journalist with soaring standards whose writing exemplified the high caliber of work she expected from her writers.  I cannot overstate Monica’s importance to the law technology community in her 17 years at the helm of LTN.  Monica mentored multitudes and by sheer force of her considerable strength and will, Monica transformed LTN from an industry organ purveying press releases to an award-winning journal unafraid to speak truth to power.

In her time as editor, Monica was everywhere and indefatigable.  Monica was my editor for much of her tenure at LTN including nine years where I contributed a monthly column she dubbed “Ball in Your Court” (see what I mean about her mastery of the well-turned phrase?)  We had a complicated relationship and butted heads often, but my submissions were always better for Monica’s merciless blue pencil.  I owe her an irredeemable debt.  She pushed me to the fore.  You wouldn’t be reading this now if it weren’t for Monica Bay’s efforts to elevate me.  The outsize recognition and writing awards I garnered weren’t my doing but Monica’s.  If life were a movie, Monica would be the influential publisher who tells the writer plucked from obscurity, “I made you and I can break you!” And it would be true.

This elegy would have been far better if she’d edited it.

Trying to illuminate Monica, I turned to Gmail to refresh my memory but backed off when I saw we’d shared more than 2,200 conversations since 2005.  I’d forgotten how she once loomed so large in my life.  In some of those exchanges, Monica generously called me, “hands down my best writer,” but I wouldn’t be surprised if she said that to everyone in her stable of “campers.”  Monica knew how to motivate, cajole and stroke the egos of her contributors. She was insightful about ego, too.

In 2010 when I carped that there’s always too much to do, and always somebody unhappy with me, she counseled, “Like me, you are an intense personality, and we can be difficult to live with at times. but that intensity and drive is also what makes you who you are, why you are successful, and why you are a breathtakingly good writer.  My favorite people in the world are ‘difficult.’”

I wince as I write that last paragraph because as much as she was brilliant in managing egos, Monica didn’t love that part of her work. She confided, “I think we have to be mindful that we don’t exercise our egos in a way that constrains — or worse case, cripples — those around us. That’s the hard part.”

Monica observed of a well-known commentator of the era, “he wouldn’t be able to write if he had to excise ‘I’ from his vocabulary… he annoys me more than the Red Sox or Jacobs Fields gnats.” 

That reminds me that Monica had a personal blog called “The Common Scold.”  She named it for a Puritan-era cause of action where opinionated women were punished by a dunk in a pond.  I mostly remember it for its focus on New York Yankees baseball, which became a passion for Monica when she moved east despite a lifelong disinterest in sports.  Monica, who, insofar as I knew, never married, often referred to herself in the Scold as “Mrs. Derek Jeter.”  She was quirky that way and had a few quirky rules for writers.  One was that the word “solution” was banned, BANNED, in LTN.

To her credit, Monica Bay wasn’t afraid to nip at the hand that feeds.  Now, when every outlet has bent to the will of advertisers, Monica’s strict journalistic standards feel at once quaint and noble. Consider this excerpt from her 2009 Editorial Guidelines:

“Plain English: Law Technology News is committed to presenting information in a manner that is easily accessible to our readers. We avoid industry acronyms, jargon, and clichés, because we believe this language obfuscates rather than enhances understanding.

For example, the word “solution” has become meaningless and is banned from LTN unless it’s part of the name of a company.  Other words we edit out: revolutionary, deploy, mission critical, enterprise, strategic, robust, implement, seamless, initiative, -centric, strategic [sic], and form factor! We love plain English!”

Monica was many things more than simply an industry leader, from a wonderful choral singer to the niece of celebrated actress, Elaine Stritch.  She was my champion, mother figure, friend and scold.   I am in her debt.  And you are, too, Dear Reader, for Monica Bay pushed through barriers that fell under her confident stride.

Fifteen years ago, when Monica lost her father, and my mother was dying, we supported each other.  Monica called her dad’s demise the “great gift of dementia from the karma gods. No pain, just a gentle drift to his next destination.”  That beautifully describes her own shuffle off this mortal coil.  As the most loving parting gift I can offer my late, brilliant editor, I cede to her those last lovely words, “just a gentle drift to [her] next destination.”

[I have no information about services or memorials, but I look forward to commemorating Monica’s life and contributions with others who loved and admired her]

A nice tribute from Bob Ambrogi: https://www.lawnext.com/2023/10/i-am-deeply-saddened-to-report-the-death-of-monica-bay-friend-mentor-and-role-model-to-so-many-in-legal-tech.html and a sweet remembrance from Mary Mack: https://edrm.net/2023/10/the-warmest-and-most-uncommon-scold/


Introducing the EDRM E-Mail Duplicate Identification Specification and Message Identification Hash (MIH)

16 Thursday Feb 2023

Posted by craigball in Computer Forensics, E-Discovery, General Technology Posts, Uncategorized

≈ 7 Comments

I’m proud to be the first to announce that the Electronic Discovery Reference Model (EDRM) has developed a specification for cross-platform identification of duplicate email messages, allowing for ready detection of duplicate messages that waste review time and increase cost. Leading e-discovery service and software providers support the new specification, making it possible for lawyers to improve discovery efficiency by a simple addition to requests for production. If that sounds too good to be true, read on and learn why and how it works.

THE PROBLEM

The triumph of information technology is the ease with which anyone can copy, retrieve and disseminate electronically stored information. Yet, for email in litigation and investigations, that blessing comes with the curse of massive replication, obliging document reviewers to assess and re-assess nearly identical messages for relevance and privilege. Duplicate messages waste time and money and carry a risk of inconsistent characterization. Seeing the same thing over and over again makes a tedious task harder.

Electronic discovery service providers and software tools ameliorate these costs, burdens and risks using algorithms to calculate hash values—essentially digital fingerprints—of segments of email messages, comparing those hash values to flag duplicates. Hash deduplication works well, but stumbles when minor variations prompt inconsistent outcomes for messages reviewers regard as being “the same.” Hash deduplication fails altogether when messages are exchanged in forms other than those native to email communications—a common practice in U.S. electronic discovery where efficient electronic forms are often printed to static page images.
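The general technique can be sketched in a few lines. Note that real platforms differ in which segments they hash and how they normalize them, which is precisely why values from different tools don’t match; the field choices and normalization below are illustrative, not any vendor’s actual recipe:

```python
import hashlib

def dedupe_key(msg, fields=("from", "to", "date", "subject", "body")):
    """Hash selected, normalized message segments into a dedupe fingerprint.

    Tools diverge here: different segment choices and normalization
    rules yield incomparable hash values across platforms.
    """
    parts = []
    for f in fields:
        value = (msg.get(f) or "").strip().lower()
        value = " ".join(value.split())  # collapse runs of whitespace
        parts.append(value)
    digest_input = "\x1f".join(parts).encode("utf-8")
    return hashlib.sha256(digest_input).hexdigest()

a = {"from": "ann@example.com", "to": "bob@example.com",
     "date": "2023-02-16", "subject": "Report", "body": "See attached."}
b = dict(a, body="See  attached. ")  # differs by whitespace only
print(dedupe_key(a) == dedupe_key(b))  # True: normalization makes these match
```

Change any normalization rule (say, stop lowercasing the subject) and the same pair of messages may no longer match, which is the cross-platform failure described above.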

Without the capability to hash identical segments of identical formats across different software platforms, reviewers cannot easily identify duplicates or readily determine what’s new versus what’s been seen before. When identical messages are processed by different tools and vendors or produced in different forms (so-called “cross-platform productions”), identification of duplicate messages becomes an error-prone, manual process or requires reprocessing of all documents.

Astonishingly, no cross-platform method of duplicate identification has emerged despite decades spent producing email in discovery and billions of dollars burned by reviewing duplicates.

Wouldn’t it be great if there were a solution to this delay, expense and tedium?

THE SOLUTION

When parties produce email in discovery and investigations, it’s customary to supply information about the messages called “metadata” in accompanying “load files.” Load files convey Bates numbers/Document IDs, message dates, sender, recipients and the like. Ideally, the composition of load files is specified in a well-crafted request for production or production protocol. Producing metadata is a practice that’s evolved over time to prompt little argument. For service providers, producing one more field of metadata is trivial, rarely requiring more effort than simply ticking a box.

The EDRM has crafted a new load file field called the EDRM Message Identification Hash (MIH), described in the EDRM Email Duplicate Identification Specification.

Gaining the benefit of the EDRM Email Duplicate Identification Specification is as simple as requesting that load files contain an EDRM Message Identification Hash (MIH) for each email message produced. The EDRM Email Duplicate Identification Specification is an open specification, so no fees or permissions are required to use it, and leading e-discovery service and software providers already support the new specification. For others, it’s simple to generate the MIH without redesigning software or impeding workflows. Too, the EDRM has made free tools available supporting the specification.

Any party with the MIH of an email message can readily determine if a copy of the message exists in their collection. Armed with MIH values for emails, parties can flag duplicates even when those duplicates take different forms, enabling native message formats to be compared to productions supplied as TIFF or PDF images.

The routine production of the MIH supports duplicate identification across platforms and parties. By requesting the EDRM MIH, parties receiving rolling or supplemental productions will know if they’ve received a message before, allowing reviewers to dedicate resources to new and unique evidence. Email messages produced by different parties in different forms using different service providers can be compared to instantly surface or suppress duplicates. Cross-platform email duplicate identification means that email productions can be compared across matters, too. Parties receiving production can easily tell if the same message was or was not produced in other cases. Cross-platform support also permits a cross-border ability to assess whether a message is a duplicate without the need to share personally-identifiable information restricted from dissemination by privacy laws.

IS THIS REALLY NEW?

Yes, and unprecedented. As noted, e-discovery service providers and law firm or corporate e-discovery teams have long employed cryptographic hashing internally to identify duplicate messages; but each does so differently dependent upon the process and software platform employed—sometimes in ways they regard as being proprietary—making it infeasible to compare hash values across providers and platforms. Even if competitors could agree to employ a common method, subtle differences in the way each processes and normalizes messages would defeat cross-platform comparison.

The EDRM Email Duplicate Identification Specification doesn’t require software platform and service providers to depart from the proprietary ways they deduplicate email. Instead, the Specification contemplates that e-discovery software providers add the ability to produce the EDRM MIH to their platform and that service providers supply a simple-to-determine Message Identification Hash (MIH) value that sidesteps the challenges just described by taking advantage of an underutilized feature of email communication standards called the “Message ID” and pairing it with the power of hash deduplication. If it sounds simple, it is–and by design. It’s far less complex than traditional approaches but sacrifices little or no effectiveness or utility. Crucially, it doesn’t require any difficult or expensive departure from the way parties engage in discovery and production of email messages.
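As a loose illustration of the idea only (not the Specification’s actual normalization or algorithm, for which consult the EDRM publication), hashing the RFC 5322 Message-ID header might look like this:

```python
import hashlib
from email import message_from_string

RAW = """\
Message-ID: <CA+123abc@mail.example.com>
From: ann@example.com
To: bob@example.com
Subject: Q4 report

See attached.
"""

def message_id_hash(raw_message):
    """Hash the Message-ID header. Illustration only: the EDRM
    specification governs the real normalization and algorithm."""
    msg = message_from_string(raw_message)
    message_id = (msg["Message-ID"] or "").strip()
    return hashlib.sha256(message_id.encode("utf-8")).hexdigest()

print(message_id_hash(RAW))
```

Because the Message-ID travels with the message however it's produced (and survives even when the body is rendered to TIFF or PDF, so long as the header value is carried in the load file), a hash keyed to it can flag duplicates across platforms without exposing the identifier itself.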

WHAT SHOULD YOU DO TO BENEFIT?

All you need to do to begin reaping the benefits of cross-platform message duplicate identification is amend your Requests for Production to include the EDRM Message Identification Hash (MIH) among the metadata values routinely produced as load files. As a prominently published specification by the leading standards organization in e-discovery, it’s likely the producing party’s service provider or litigation support staff know what’s required. But if not, you can refer them to the EDRM Email Duplicate Identification Specification & Guidelines published at https://edrm.net/active-projects/dupeid/.

HOW DO YOU LEARN MORE?

The EDRM publishes a comprehensive set of resources describing and supporting the Specification & Guidelines that can be found at https://edrm.net/active-projects/dupeid/. All persons and firms deploying the EDRM MIH to identify duplicate messages should familiarize themselves with the considerations for its use.

EDRM WANTS YOUR FEEDBACK

The EDRM welcomes any feedback you may have on this new method of identifying cross platform email duplicates or on any of the resources provided. We are interested in further ideas you may have and expect the use of the EDRM MIH to evolve over time. You can post any feedback or questions at https://edrm.net/active-projects/dupeid/.


ChatGPT Proves a Mediocre Law Student

27 Friday Jan 2023

Posted by craigball in E-Discovery, General Technology Posts

≈ 6 Comments

I recently spent a morning testing ChatGPT’s abilities by giving it exercises and quizzes designed for my law and computer science graduate students. Overall, I was impressed with its performance, but also noticed that it’s frequently wrong but never in doubt: a mechanical mansplainer!

If you’re asking, “What is ChatGPT,” I’ll let it explain itself: “ChatGPT is a large language model developed by OpenAI. It is a type of machine learning model called a transformer, which is trained to generate text based on a given prompt. It is particularly well-suited to tasks such as natural language processing, text generation, and language translation. It is capable of understanding human language and generating human-like text, which makes it useful for a wide range of applications, such as chatbots, question-answering systems, and text completion.” I’d add that it’s the first artificial intelligence tool that anyone can play with, and it’s generating enormous excitement.

In terms of its legal knowledge, ChatGPT performed like a mediocre law student: it did well on essay questions and drafting exercises but struggled with multiple-choice questions. I couldn't assign it a definitive score on the 25-question quiz because some questions required image analysis, and I saw no way to include illustrations in the chat interface. Still, it performed poorly enough on the quiz for me to conclude that a passing score of 70 out of 100 points was out of reach.

One example of a question it struggled with was:

Which of the following is NOT an essential metadata value (a “metadata must”) to preserve and produce for a file?

  1. Custodian
  2. File Name
  3. Last Modified Date and Time
  4. Last Accessed Date and Time

The correct answer is Last Accessed Date and Time. Class readings underscore the unreliability of Last Accessed dates and times, and I supply a table of "Metadata Musts" for preservation and production. Presumably trained on other sources, ChatGPT did not seem to understand the importance of a proper chain of custody for electronic evidence and replied:

Custodian is not considered an essential metadata value to preserve and produce for a file. The other options (File Name, Last Modified Date and Time, Last Accessed Date and Time) are all considered essential metadata values that should be preserved and produced for a file in order to provide important information about the file and its history.

Since it wasn’t performing well on my quiz, I asked,

Can you suggest some multiple choice quiz questions for students studying the fundamentals of electronic discovery and electronic data storage?

The first suggestion was:

What is the process of collecting, processing, and producing electronically stored information (ESI) in response to a legal request called?

a. Electronic discovery

b. Digital forensics

c. E-discovery

d. Data retrieval

In ChatGPT's formulation, "Electronic discovery" is the right response and "E-discovery" is wrong!

Another suggestion:

What is the most common format for ESI production?

a. PDF

b. TIFF

c. Native file format

d. ASCII

That’s a challenging question because, as much as I’ve focused on and written about forms of production, I’ve not come across metrics establishing the most common format.  In supplying an answer key, ChatGPT identified Native file format as the correct answer, though I would have guessed “TIFF.”  I certainly wouldn’t have included ASCII as a choice.

Here are a few other quiz questions suggested by ChatGPT.  I’ve bolded what ChatGPT defined as the correct response and, in each instance, I’d conclude it’s either not the best answer or the question is so poorly constructed as to defy characterizing any single answer as the right one:

What is the process of identifying and eliminating irrelevant or privileged ESI called?

a. Culling

b. Filtering

c. Redaction

d. Processing

What is the process of reviewing ESI to identify relevant information called?

a. Review

b. Analysis

c. Coding

d. Searching

What is the process of converting ESI from its original format to a reviewable format called?

a. Conversion

b. Processing

c. Extraction

d. Production

What is the process of creating a backup copy of ESI for preservation purposes called?

a. Backup

b. Imaging

c. Archiving

d. Duplication

What is the process of destroying or disposing of ESI after it is no longer needed called?

a. Deletion

b. Erasure

c. Destruction

d. Purging

Encouraged that the law faculty might be safe for another semester, I turned ChatGPT to a drafting exercise:

You are to develop the best checklist of potential data sources that are candidates for legal preservation in any engagement—good enough that you’ll want to keep it to use in your own work. It does NOT have to be entirely original, and I encourage you to start with the best existing checklist(s) you can find and then make updates and improvements. Certainly, you should seek to add missing sources, but you may also see a better way to organize and present the content. Remember that this checklist is a list of sources and varieties not methods and, crucially, it should serve to enhance the completeness of the personal data map you will create (though it should NOT be limited to the scope of your personal digital footprint alone). Thus, generic checklist items like “Online sources” or “Local storage” won’t be sufficiently granular. Be sure to credit your sources and format it as a checklist.

ChatGPT responded:

Data Preservation Checklist

  1. Social media accounts (Facebook, Twitter, Instagram, LinkedIn, etc.)
  2. Email accounts (Gmail, Outlook, Yahoo, etc.)
  3. Cloud storage accounts (Google Drive, Dropbox, OneDrive, etc.)
  4. Mobile device data (text messages, call logs, contacts, etc.)
  5. Computer files (documents, photos, videos, etc.)
  6. Internet browsing history and cache
  7. Online marketplace accounts (eBay, Amazon, Etsy, etc.)
  8. Online banking and financial accounts (Paypal, bank accounts, credit cards, etc.)
  9. GPS location data
  10. Smart home device data (Amazon Echo, Google Home, etc.)
  11. Wearable device data (Fitbit, Apple Watch, etc.)
  12. Gaming accounts (Xbox Live, PlayStation Network, Steam, etc.)
  13. Voicemail and answering machine messages
  14. Instant messaging and chat logs (WhatsApp, Messenger, etc.)
  15. Social media direct messages and private messages
  16. Online dating profiles and messages
  17. Online forum and message board posts
  18. Online video and audio recordings (YouTube, Vimeo, etc.)
  19. Online photo and image sharing accounts (Flickr, Imgur, etc.)
  20. Online document collaboration and sharing accounts (Google Docs, Microsoft Office 365, etc.)

Source: Adapted from “Digital Preservation Checklist” by the National Archives and Records Administration (NARA) (https://www.archives.gov/digital-preservation/checklist) and “Digital Preservation Checklist for Individuals” by the Digital Preservation Coalition (https://www.dpconline.org/advice/digital-preservation-checklist-for-individuals)

Not bad, and certainly typical of what I get from students. Interestingly, both citations are dead links, which would be unacceptable from my students, and (arguably) ChatGPT simply ignored the requirement to format its answer as a checklist.

I had ChatGPT answer the question a second time sans the mandate that it credit sources. I was struck by the differences in the lists supplied and the disclaimer now appended:

  1. Social media accounts (e.g. Facebook, Twitter, Instagram, LinkedIn)
  2. Email accounts (e.g. Gmail, Yahoo, Outlook)
  3. Cloud storage accounts (e.g. Google Drive, Dropbox, iCloud)
  4. Online shopping accounts (e.g. Amazon, eBay, Etsy)
  5. Financial accounts (e.g. bank accounts, credit card accounts, PayPal)
  6. Mobile phone data (e.g. call logs, text messages, contacts)
  7. Computer hard drives
  8. USB drives and external hard drives
  9. Social media messaging apps (e.g. WhatsApp, iMessage, Messenger)
  10. Online forums and discussion boards
  11. Online dating profiles
  12. Online gaming accounts
  13. Web browsing history and cookies
  14. Internet service provider records
  15. GPS data (e.g. location data from mobile devices)
  16. Smart home device data (e.g. Amazon Echo, Google Home)
  17. Vehicle tracking data (e.g. OnStar, Tesla)
  18. Medical records (e.g. electronic health records)
  19. Educational records (e.g. transcripts, grades)
  20. Employee records (e.g. HR files, payroll records)

Note: This checklist is not exhaustive and may vary depending on the specific engagement or case. It is important to consult with legal and technical experts to determine the most relevant and appropriate data sources for preservation. Additionally, it is important to consider any privacy laws and regulations when collecting and preserving personal data.

I asked ChatGPT to identify “the most important legal opinions I should assign my students to read that will help them better understand electronic discovery in the United States” and it responded:

  1. Zubulake v. UBS Warburg LLC, 217 F.R.D. 309 (S.D.N.Y. 2003) – This case established the “Zubulake standards,” which are widely used as a guide for electronic discovery in the United States. It lays out the obligations of parties to preserve and produce electronically stored information (ESI) and the sanctions that can be imposed for failure to comply.
  2. Victor Stanley, Inc. v. Creative Pipe, Inc., 250 F.R.D. 251 (D. Md. 2008) – This case discusses the duty to preserve ESI and the consequences of spoliation (destruction or alteration of evidence). It also addresses the standard for imposing sanctions for spoliation.
  3. The Sedona Conference® Commentary on Achieving Quality in the E-Discovery Process, The Sedona Conference (2008) – This commentary provides guidance on best practices for e-discovery, including the preservation and production of ESI, the use of technology-assisted review (TAR) and predictive coding, and the handling of privileged information.
  4. Da Silva Moore v. Publicis Groupe, 287 F.R.D. 182 (S.D.N.Y. 2012) – This case addresses the use of TAR and predictive coding in e-discovery, and emphasizes the importance of cooperation and communication between parties when using these technologies.
  5. Morgan Stanley Smith Barney LLC v. Skowron, No. 11-CV-8335 (S.D.N.Y. Mar. 29, 2013) – This case discusses the use of proportionality in e-discovery, and the court’s authority to limit the scope of e-discovery in order to reduce the burden and expense of producing ESI.
  6. In re: Actos (Pioglitazone) Products Liability Litigation, MDL No. 6:11-md-2299 (W.D. La. 2013) – This case addresses the use of advanced analytics and data visualization tools in e-discovery, and highlights the importance of considering the costs and benefits of using these technologies in the e-discovery process.

Except for mischaracterizing a Sedona Conference Commentary as a "legal opinion," they're all fine suggestions, yet all are ten to twenty years old.

I asked ChatGPT a question posed by a student in my last class: “What is the difference between electronic discovery and computer forensics?”  It opined:

Electronic discovery (e-discovery) and computer forensics are both related to the collection and analysis of electronic data, but they serve different purposes.

Electronic discovery is the process of collecting, processing, and producing electronic data in response to a legal request or investigation. It is typically used in civil litigation and is designed to identify and produce relevant electronic evidence that may be used in court. The goal of e-discovery is to find information that is responsive to the legal request and that is not privileged or protected by any other legal exceptions.

Computer forensics, on the other hand, is the process of collecting, preserving, analyzing, and presenting electronic evidence in a manner that is admissible in court. It is typically used in criminal investigations and is designed to identify, recover, and preserve evidence of a crime or other misconduct. The goal of computer forensics is to find and preserve evidence that may be used to prove or disprove a crime or other wrongdoing.

In summary, electronic discovery is focused on finding and producing relevant electronic evidence in response to a legal request, while computer forensics is focused on finding and preserving evidence of a crime or other misconduct. Both processes require the use of specialized tools and techniques to collect and analyze electronic data, but they serve different purposes and are used in different types of legal proceedings.

A good answer.  My response in class focused on the relative accessibility and intelligibility of the ESI we deal with in e-discovery versus digital forensics, and I didn’t tie forensics to criminal investigations because so much of my work in the field has concentrated on civil cases.  But I can’t say I did any better than the AI.  I smiled more.

Finally, I asked who are the most important thought leaders in the field of electronic discovery?  ChatGPT replied:

  1. Ralph Losey: A well-known attorney and author who has written extensively on the topic of electronic discovery and has been recognized as a leading expert in the field.
  2. Craig Ball: A forensic expert and attorney who has played a key role in shaping the laws and practices related to electronic discovery through his work as a consultant, expert witness, and author.
  3. Tom O’Connor: A consultant, speaker, and author who is considered one of the leading experts on the use of technology in the legal field and has been instrumental in developing best practices for electronic discovery.
  4. George Socha: A consultant, speaker, and author who is considered a leading expert in the field of electronic discovery and has been instrumental in the development of industry standards and best practices.
  5. Maura Grossman: A consultant, speaker, and author who is considered one of the leading experts on the use of artificial intelligence and machine learning in electronic discovery and has been instrumental in advancing the field.
  6. E-Discovery Team: A team of experts who have written extensively on the topic of electronic discovery and have been recognized as thought leaders in the field through their work as consultants, speakers, and authors.
  7. Paul J. Lippe: A consultant, speaker, and author who is considered a leading expert in the field of electronic discovery and has been instrumental in the development of industry standards and best practices.
  8. John Simek: A consultant, speaker, and author who is considered one of the leading experts on the use of technology in the legal field and has been instrumental in developing best practices for electronic discovery.
  9. David Horrigan: A consultant, speaker, and author who is considered one of the leading experts on the use of technology in the legal field and has been instrumental in developing best practices for electronic discovery.
  10. Daniel Garrie: A consultant, speaker, and author who is considered a leading expert in the field of electronic discovery and has been instrumental in the development of industry standards and best practices.

Kudos to Ralph Losey for topping the list and garnering a second mention as “E-Discovery Team.”  I wouldn’t have considered Paul J. Lippe a candidate for an e-discovery-specific list, and I can’t very well commend ChatGPT for wokeness.  One woman?  No persons of color?  No jurists?  I’m flattered to be included among the other old, white men, but know well that there are plenty as or more deserving.
