
Ball in your Court

~ Musings on e-discovery & forensics.

Category Archives: Computer Forensics

A Dog and Its Tail: Don’t Let Version Uncertainty Cloud Linked Attachment Production

02 Thursday Apr 2026

Posted by craigball in Computer Forensics, E-Discovery, Law Practice & Procedure

≈ 4 Comments

Tags

ESI Protocols, Linked attachments

Two years ago, I wrote a pair of posts (3/29/24 and 4/8/24) about linked attachments—what Microsoft calls “Cloud Attachments”—arguing that producing parties had been getting away with murder by not collecting and searching them.  The argument was straightforward: a linked attachment is no less relevant than an embedded one, the tools to collect them exist, and the claimed burdens, while genuine, were exaggerated.

Nothing that’s happened since has changed that core proposition.  If anything, developments in case law, the Sedona Conference’s 2025 Commentary on collaboration platform discovery, and the emergence of proposed technical standards have reinforced it.  But those same developments carry a risk I want to flag: that the versioning question—which version of a linked attachment is the “right” one—is being elevated in ways that could hand producing parties a shiny new excuse for doing nothing.

What’s Changed in a Year

The landscape has shifted since then, largely in the right direction.

Courts are beginning to tiptoe toward scrutinizing what tools can actually do rather than accepting blanket claims of infeasibility.  The Carvana securities litigation is perhaps the most striking example: the court ordered a bounded forensic capability test using a specific tool, then expanded it when the initial pilot supported further testing.  That’s a different approach than we’ve seen before—a court saying, in effect, “show me what you can recover, don’t just tell me you can’t.”

The Sedona Conference published its Commentary on Discovery of Collaboration Platforms Data in 2025, acknowledging the distinct preservation, collection, and production challenges these platforms present.  When Sedona identifies a problem, that identification becomes part of the baseline against which “reasonable steps” under Rule 37(e) will be measured.  Parties who were aware of these challenges—and by now, every competent e-discovery practitioner should be—will find it increasingly hard to argue that their traditional, email-era workflow was good enough.

And a proposed technical standard—the Reconstruction-Grade eDiscovery Standard, authored by Peter Kozak and Brandon D’Agostino—has articulated an architectural framework for what preservation of collaborative evidence should look like.  It’s ambitious and thoughtful.  I want to engage with it constructively, because I think it gets several things right.  But I also want to sound a caution about how standards like this could be deployed in the real world of discovery disputes.

Two Problems

The RG standard does something valuable: it names and taxonomizes the specific ways that traditional preservation fails when evidence is collaborative, hyperlinked, and versioned.  Its framework identifies what it calls the “Preservation Gap” (the referenced content is never preserved at all) and the “Context Gap” (the content is preserved but not in the state it existed at the relevant time).  That’s a useful distinction.

But here’s where I part company—not with the standard’s laudable intent, but with the risk of how it may play out in the field.

The standard treats deterministic version resolution—preserving the as-sent version of a linked document, the version that existed when the message was transmitted—as a core conformance requirement.  Architecturally, I understand why.  If you’re building a system that aspires to reconstruction-grade fidelity, you want to capture the version the recipient would have seen when they clicked the link.  That’s the gold standard.

The problem is that the gold standard can become the enemy of any standard at all. 

To my eye, the versioning concern has been weaponized.  It goes like this: a requesting party asks for linked attachments.  The producing party raises the specter of versioning—“Which version do you want?  The as-sent version?  The as-accessed version?  The current version?  We can’t be sure which is the ‘right’ one, so the whole exercise is fraught with uncertainty.”  And that uncertainty becomes the justification for producing no version.  Not the wrong version.  No version.

That’s the tail wagging the dog.

The “Dog” Is Collection

The threshold obligation is to collect and search linked attachments.  Full stop.  A link in an email reveals nothing about the content of the linked document.  If you don’t collect the document, you can’t search it.  If you can’t search it, you can’t assess it for relevance.  And if you can’t assess it for relevance, you’re making a unilateral decision to exclude potentially responsive evidence—evidence that, but for a shift in how email systems handle large files, would have been embedded in the message and collected automatically.

That obligation exists independently of any versioning question.  It existed before anyone coined the term “reconstruction-grade.”  It existed when I wrote about it a year ago, and it existed for years before that.  “Perfect” is not the standard in e-discovery, but neither is “lousy.”

Beware, too, the half-measure.  A producing party, pressed on missing linked attachments, may offer to search the email text first and seek out the linked attachment only if the parent email hits on a keyword.  This sounds reasonable until you think about how email actually works.  It is exceedingly common for a transmitting email to say nothing more than “Please see attached” or “Here’s the draft we discussed,” while the attachment contains all the substantive content.  If the email text doesn’t trigger a keyword, the attachment—however rich in relevant material—never gets collected or searched.  And even if the attachment is later produced as a loose document, it won’t tie to its “parent” transmitting message.

When we search email families containing embedded attachments, we treat the family as responsive if either the message or the attachment generates a hit.  Any workflow that conditions collection of linked attachments on hits in the transmitting email inverts that logic and guarantees that a large share of responsive evidence will be missed.
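That family-level logic is simple enough to state in code.  A minimal sketch in Python, using hypothetical document IDs (nothing here reflects any particular review tool’s API):

```python
# Hypothetical document IDs; `hits` is the set of IDs matching search terms.
def family_is_responsive(family, hits):
    """Treat the family as responsive if the message OR any attachment hits."""
    return any(doc_id in hits for doc_id in family)

def parent_gated(family, hits):
    """The inverted workflow: collect attachments only if the parent email hits."""
    return family[0] in hits  # family[0] is the transmitting email

family = ["email-001", "attachment-001"]   # "Please see attached" + the substance
hits = {"attachment-001"}                  # only the attachment matches a keyword

print(family_is_responsive(family, hits))  # True: the family comes in
print(parent_gated(family, hits))          # False: responsive evidence missed
```

The second function is the half-measure in miniature: by testing only the transmitting email, it guarantees that “Please see attached” families with substantive attachments fall through the cracks.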

A producing party that collects and searches the current version of a linked attachment has done something meaningful.  They’ve brought the document into the review population.  They’ve assessed its content against the issues in the case.  They’ve preserved the family relationship between message and attachment.  They may not have captured the precise version that existed at send time, but they’ve captured a version—one that, in the overwhelming majority of cases, is likely to be the same or substantially similar to the transmitted version.

A producing party that collects nothing because of versioning uncertainty has done nothing.  Lousy.

The “Tail” Is Versioning

I don’t dismiss the versioning issue.  It’s real, and the RG standard is right to address it.  There are cases where the difference between the as-sent version and the current version matters enormously—a contract with terms that changed, a financial model with revised projections, a compliance policy that was softened after the relevant communication.  In those cases, producing the wrong version could mislead or, worse, could conceal what the actors actually relied upon.

But how often does this actually happen?

A year ago, I called for objective analysis: what percentage of cloud attachments are actually modified after transmittal?  I’m repeating the call, louder, because the industry still hasn’t answered it.

I have a strong intuition—and I want to be candid that it’s an intuition based on experience, not evidence—that the incidence of post-transmittal modification is modest overall.  My suspicion is that fewer than ten to twenty percent of linked attachments are meaningfully modified after being shared, and perhaps far fewer than that.  Most cloud attachments are final or near-final documents shared for information, not living collaborative drafts.  Someone emails a report, a slide deck, a signed contract.  The link is a delivery mechanism, not an invitation to co-author.

But I also suspect the percentage varies widely depending upon the culture.  An organization whose culture runs to emailing finished work product will have a very different modification profile than one where teams routinely share early drafts via links for iterative editing in SharePoint.  A law firm circulating closing documents will look different from a product team sharing design specs that change daily.  The incidence of versioning concerns is likely a function of organizational work style, not some universal constant.

Here’s the point: I don’t have solid metrics.  I believe what I’m describing here, but belief is not evidence, and I would readily yield my suspicion to meaningful measurement.  The data needed to resolve this question is not exotic.  Any organization with a reasonably mature M365 environment could sample and compare the version history of linked attachments against the timestamps of the messages that transmitted them.  The analysis would tell us, for a given corpus, what percentage of linked attachments were modified after the transmitting message was sent, how significantly they were modified, and how soon after transmittal the modifications occurred.  That’s a study someone should do—a vendor, a consultant, an academic, a standards body.  It would replace speculation with evidence and give courts and practitioners a rational basis for calibrating the proportionality of versioning remediation.  Too, litigants coming to Court seeking relief from the duty to collect linked attachments should collect the metrics to measure the claimed risk and burden.
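The sampling analysis described above is, at its core, a timestamp comparison once the data is in hand.  A rough sketch, assuming hypothetical records that pair each transmitting message’s sent time with the linked file’s last-modified time as reported by the platform’s version history:

```python
from datetime import datetime

# Hypothetical sample: each record pairs a message's sent time with the
# linked attachment's last-modified time from the platform's version history.
sample = [
    {"sent": datetime(2025, 3, 1, 9, 0), "last_modified": datetime(2025, 2, 28, 17, 0)},
    {"sent": datetime(2025, 3, 2, 9, 0), "last_modified": datetime(2025, 3, 5, 11, 0)},
    {"sent": datetime(2025, 3, 3, 9, 0), "last_modified": datetime(2025, 3, 3, 8, 0)},
]

# An attachment was modified post-transmittal if its last edit postdates the send.
modified_after = [r for r in sample if r["last_modified"] > r["sent"]]
rate = len(modified_after) / len(sample)
print(f"{rate:.0%} of sampled linked attachments modified after transmittal")
```

A real study would pull these values at scale (for example, from Microsoft Graph or SharePoint version history) and would also grade how substantively each later version differs, not merely whether it differs.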

Until we have that data, we’re arguing about a problem whose magnitude we don’t grasp, while ignoring a problem whose magnitude is obvious: linked attachments aren’t being collected as they should be.

Don’t Throw Out the Baby

I want to be clear about what I’m not saying.  I’m not saying the RG standard is wrong to aspire to as-sent version resolution.  I’m not saying versioning doesn’t matter.  And I’m not attributing to the standard’s authors any intent to create a new excuse for non-production.  Reading the standard carefully, its concept of graduated conformance levels and its emphasis on proportionality suggest the opposite intent.

But standards exist in an adversarial ecosystem.  A standard that defines three conformance levels—RG-Core, RG-Plus, RG-Max—can be turned into a shield by a party arguing: “Your Honor, we can’t achieve even RG-Core conformance, so we shouldn’t be required to attempt collection of linked attachments.”  That argument confuses the standard’s aspirational architecture with the floor of a party’s discovery obligations.

The floor is not reconstruction-grade fidelity.  The floor is reasonable steps under Rule 37(e) and the obligation to search and produce relevant, responsive, non-privileged material.  That floor requires, at minimum, that you collect linked attachments using the tools your platform provides, search them, and produce responsive documents—even if you’re producing the current version rather than the as-sent version.

To put it another way: producing the “wrong” version of a responsive document is a problem.  Producing no version of a responsive document is a bigger problem.

I’ve been accused of leaning toward the interests of plaintiffs on this topic.  That’s neither fair nor accurate.  I advocate for evidence.  I’m committed to getting to the evidence that resolves disputes in what Rule 1 of the Federal Rules calls a “just, speedy, and inexpensive” fashion.  Not perfect.  Certainly not at any cost.  But I won’t accommodate high-handed, evasive approaches to the duty to produce responsive, non-privileged evidence—and dressing up a refusal to collect linked attachments in the language of versioning complexity is exactly that.

What the Standard Gets Right

Credit where it’s due.  Several elements of the RG framework strike me as genuinely constructive:

Exception transparency.  The standard requires structured records of what couldn’t be collected and why.  In the current landscape, failures are silent.  A linked attachment that can’t be retrieved simply disappears—no record that it was attempted, no record that it failed, no record of why.  Requiring a producing party to document its failures is a significant improvement over the status quo, where the absence of evidence is invisible.  Notably, courts have already begun requiring this kind of transparency on an ad hoc basis.  In the Uber litigation, Judge Cisneros ordered two custom metadata fields—“Missing Google Drive Attachments” and “Non-Contemporaneous”—to flag gaps and version discrepancies in the production.  What the RG standard proposes as a systemic architectural requirement, courts are already imposing case by case.  Formalizing that expectation is a natural and constructive next step.

The Preservation Gap vs. Context Gap distinction.  Naming these as separate failure modes is useful because they have different legal implications.  The Preservation Gap—evidence that was never preserved at all—maps cleanly to Rule 37(e).  The Context Gap—evidence preserved in the wrong state—is doctrinally murkier.  Courts don’t yet have a clean framework for “you preserved it, but what you preserved isn’t what was communicated.”  Distinguishing the two helps practitioners and courts think more precisely about what went wrong and what remedies are appropriate.

Capability testing as an emerging judicial norm.  The companion post to the standard highlights Carvana and the broader trajectory of courts ordering parties to demonstrate what their tools can do.  This is a welcome and overdue development.  The e-discovery conversation around linked attachments has too often been dominated by conclusory assertions of infeasibility.  Capability testing replaces assertion with demonstration, and that benefits everyone—including producing parties who have invested in the right tools and want credit for doing so.

Where We Go from Here

The path forward requires distinguishing between the immediate obligation and the aspirational architecture.

The immediate obligation is collection.  If you’re on Microsoft 365, use Purview.  If you’re on Google Workspace, use Vault.  These tools aren’t perfect, but they exist, and they collect linked attachments.  The version you collect may be the current version rather than the as-sent version.  That’s a known limitation, not a reason to collect nothing.

The aspirational architecture is reconstruction-grade fidelity—as-sent version resolution, deterministic exception handling, reproducible exports.  That’s where the industry needs to go.  Tools like Forensic Email Collector are already demonstrating that historical version recovery is technically possible in many cases.  The Carvana court’s willingness to order capability testing suggests that judges are ready to push the envelope.

But the bridge between those two isn’t “wait until perfect tools exist.”  The bridge is “do what you can now, document what you can’t, and improve your capabilities over time.”

That’s what proportionality actually means.  Not perfection.  Not paralysis.  But reasonable, good-faith efforts commensurate with the stakes and the state of the art.

The versioning problem will resolve because courts will order testing, because tools will improve, because someone will finally produce the empirical data on post-transmittal modification rates (pretty please), and because standards like the RG framework will mature.  These are all good-faith efforts to move the law and the industry forward, and they deserve recognition for it.

In the meantime, the producing party’s obligation is clear: collect the linked attachments, search them, and produce what’s responsive.

The tail does not get to wag the dog.

Hat tip to Doug Austin for highlighting the publication of the Reconstruction-Grade eDiscovery Standard on his eDiscovery Today blog.  Doug continues to be an indispensable resource for practitioners trying to keep pace with developments in this space.

© 2026 Craig D. Ball.  All rights reserved.


Detecting Deep Fakes

24 Tuesday Feb 2026

Posted by craigball in ai, Computer Forensics, E-Discovery, General Technology Posts, Law Practice & Procedure

≈ 2 Comments

This morning, I was approached to present in Texas on deep fake evidence and what litigators need to know to confront it.  It’s to be called, “Real or Rigged: How to Know Whether Evidence Is Fake.” I realized, to my chagrin, that I didn’t have a paper I could hand out—no single place where I had pulled together the technical realities, evidentiary doctrine, and practical litigation tactics this subject demands. So, I wrote one. Whether I ultimately give the talk remains to be seen, but I’m hopeful the resulting article will prove useful to you. The paper—Forensic Tells: A Practitioner’s Guide to Detecting Deep Fakes and Authenticating Digital Evidence—runs about thirty pages and is available here.

The piece starts from a simple premise: digital evidence does not fall like manna from heaven; it has a provenance that speaks to its authenticity. It is fundamentally different from paper because it carries a payload of information about its origins and handling—metadata that functions as a chain of custody embedded within the file itself. In an era when AI systems can generate convincing photographs, videos, and audio recordings of events that never occurred, that metadata has become the last line of defense against manufactured reality.

While I regard myself as much more a student of AI than an authority, I’ve been writing about metadata and evidence as long as anyone on two legs; so, I hope I bring something of value to the topic.  You be the judge.  The article explains, in practical terms, how synthetic media is created, why fabricated media often lacks the coherent metadata of authentic recordings, and how lawyers can use that disparity to authenticate—or challenge—digital evidence. It also addresses the emerging “liar’s dividend,” the phenomenon whereby wrongdoers dismiss authentic recordings as fake simply because the technology exists to fabricate them.

More importantly, the article is written as a practitioner’s guide, not a technical treatise. It outlines concrete discovery strategies: demanding native files, targeting interrogatories and requests for admission, pursuing third-party records, and, where necessary, seeking forensic examination of source devices. It explains what to look for in metadata, what visual and auditory artifacts may signal manipulation, and how federal and Texas evidence rules—including Rules 901 and 902—apply to synthetic media challenges. It closes with a practical checklist and discussion of emerging provenance technologies that may someday make authentication easier—but, for now, make it more essential that lawyers understand how to ask the right questions.

Your feedback is always welcome and appreciated.


A Master Table of Truth

04 Tuesday Nov 2025

Posted by craigball in ai, Computer Forensics, E-Discovery, General Technology Posts, Law Practice & Procedure, Uncategorized

≈ 5 Comments

Tags

ai, artificial-intelligence, chatgpt, eDiscovery, generative-ai, law, technology

Lawyers using AI keep turning up in the news for all the wrong reasons—usually because they filed a brief brimming with cases that don’t exist. The machines didn’t mean to lie. They just did what they’re built to do: write convincingly, not truthfully.

When you ask a large language model (LLM) for cases, it doesn’t search a trustworthy database. It invents one. The result looks fine until a human checks: a judge, an opponent, or an intern with Westlaw access. That’s when fantasy law meets federal fact.

We call these fictions “hallucinations,” which is a polite way of saying “making shit up;” and though lawyers are duty-bound to catch them before they reach the docket, some don’t. The combination of an approaching deadline and a confident-sounding computer is a dangerous mix.

Perhaps a Useful Guardrail

It struck me recently that the legal profession could borrow a page from the digital forensics world, where we maintain something called the NIST National Software Reference Library (NIST NSRL). The NSRL is a public database of hash values for known software files. When a forensic examiner analyzes a drive, the NSRL helps them skip over familiar system files—Windows dlls and friends—so they can focus on what’s unique or suspicious.
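The NSRL workflow boils down to hashing files and checking membership in a published set.  A toy illustration (the hash set here is invented; the real NSRL publishes millions of values for known software files):

```python
import hashlib

# Invented stand-in for the NSRL's published hash set of known software files.
KNOWN_HASHES = {
    hashlib.sha1(b"standard Windows DLL contents").hexdigest(),
}

def triage(files):
    """Partition files into known (skippable) and unknown (worth examining)."""
    known, unknown = [], []
    for name, content in files.items():
        digest = hashlib.sha1(content).hexdigest()
        (known if digest in KNOWN_HASHES else unknown).append(name)
    return known, unknown

files = {
    "kernel32.dll": b"standard Windows DLL contents",  # matches a known hash
    "notes.docx": b"unique user document",             # doesn't: examine it
}
print(triage(files))  # (['kernel32.dll'], ['notes.docx'])
```

The examiner never looks at the known files; the filter exists so attention goes where it matters.  That is exactly the shape of the safeguard proposed below for citations.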

So here’s a thought: what if we had a master table of genuine case citations—a kind of NSRL for case citations?

Picture a big, continually updated, publicly accessible table listing every bona fide reported decision: the case name, reporter, volume, page, court, and year. When your LLM produces Smith v. Jones, 123 F.3d 456 (9th Cir. 2005), your drafting software checks that citation against the table.

If it’s there, fine—it probably references a genuine reported case.
If it’s not, flag it for immediate scrutiny.

Think of it as a checksum for truth. A simple way to catch the most common and indefensible kind of AI mischief before it becomes Exhibit A at a disciplinary hearing.
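At bottom, the “checksum for truth” is a set-membership test.  A minimal sketch with an invented two-row master table (Smith v. Jones is the hypothetical citation from above; Daubert is a real one):

```python
# Hypothetical slice of the master table: (case name, citation, court, year).
MASTER_TABLE = {
    ("Smith v. Jones", "123 F.3d 456", "9th Cir.", 2005),
    ("Daubert v. Merrell Dow Pharmaceuticals, Inc.", "509 U.S. 579", "U.S.", 1993),
}

def verify(name, cite, court, year):
    """True only if the exact normalized tuple appears in the master table."""
    return (name, cite, court, year) in MASTER_TABLE

print(verify("Smith v. Jones", "123 F.3d 456", "9th Cir.", 2005))  # True
print(verify("Smith v. Jones", "123 F.3d 456", "5th Cir.", 2005))  # False: flag it
```

A hit doesn’t prove the case says what the brief claims; a miss is the tripwire that forces a human to look.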

The Obstacles (and There Are Some)

Of course, every neat idea turns messy the moment you try to build it.

Coverage is the first challenge. There are millions of decisions, with new ones arriving daily. Some are published, some are “unpublished” but still precedential, and some live only in online databases. Even if we limited the scope to federal and state appellate courts, keeping the table comprehensive and current would be an unending job; but not an insurmountable obstacle.

Then there’s variation. Lawyers can’t agree on how to cite the same case twice. The same opinion might appear in multiple reporters, each with its own abbreviation. A master table would have to normalize all of that—an ambitious act of citation herding.

And parsing is no small matter. AI tools are notoriously careless about punctuation. A missing comma or swapped parenthesis can turn a real case into a false negative. Conversely, a hallucinated citation that happens to fit a valid pattern could slip past the check, which is why it can’t be the sole safeguard.
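The parsing pitfall is easy to demonstrate.  Here is a deliberately naive citation pattern (any real normalizer would need the full catalog of reporter abbreviations, which this sketch does not attempt):

```python
import re

# Naive volume-reporter-page pattern; real citation parsing is far messier.
CITE = re.compile(r"(?P<vol>\d+)\s+(?P<rep>[A-Z][\w.\s]*?)\s+(?P<page>\d+)")

def normalize(raw):
    """Return (volume, reporter, page) or None if nothing matches."""
    m = CITE.search(raw)
    if not m:
        return None
    return (int(m.group("vol")), " ".join(m.group("rep").split()), int(m.group("page")))

print(normalize("123 F.3d 456"))    # (123, 'F.3d', 456)
print(normalize("123 F. 3d, 456"))  # (123, 'F.', 3): punctuation noise yields garbage
```

The second result is the problem in a nutshell: a stray space and comma produce a confidently wrong parse, and a lookup keyed on it would miss a genuine case.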

Lastly, governance. Who would maintain the thing? Westlaw and Lexis maintain comprehensive citation data, but guard it like Fort Knox. Open projects such as the Caselaw Access Project and the Free Law Project’s CourtListener come close, but they’re not quite designed for this kind of validation task. To make it work, we’d need institutional commitment—perhaps from NIST, the Library of Congress, or a consortium of law libraries—to set standards and keep it alive.

Why Bother?

Because LLMs aren’t going away. Lawyers will keep using them, openly or in secret. The question isn’t whether we’ll use them—it’s how safely and responsibly we can do so.

A public master table of citations could serve as a quiet safeguard in every AI-assisted drafting environment. The AI could automatically check every citation against that canonical list. It wouldn’t guarantee correctness, but it would dramatically reduce the risk of citing fiction. Not coincidentally, it would have prevented most of the public excoriation of careless counsel we’ve seen.

Even a limited version—a federal table, or one covering each state’s highest court—would be progress. Universities, courts, and vendors could all contribute. Every small improvement to verifiability helps keep the profession credible in an era of AI slop, sloppiness and deep fakes.

No Magic Bullet, but a Sensible Shield

Let’s be clear: a master table won’t prevent all hallucinations. A model could still misstate what a case holds, or cite a genuine decision for the wrong proposition. But it would at least help keep the completely fabricated ones from slipping through unchecked.

In forensics, we accept imperfect tools because they narrow uncertainty. This could do the same for AI-drafted legal writing—a simple checksum for reality in a profession that can’t afford to lose touch with it.

If we can build databases to flag counterfeit currency and pirated software, surely we can build one to spot counterfeit law?

Until that day, let’s agree on one ironclad proposition: if you didn’t verify it, don’t file it.


Native or Not? Rethinking Public E-Mail Corpora for E-Discovery (Redux, 2013→2025)

16 Saturday Aug 2025

Posted by craigball in ai, Computer Forensics, E-Discovery, Uncategorized

≈ 2 Comments

Tags

ai, artificial-intelligence, chatgpt, eDiscovery, EDRM, generative-ai, Linked attachments, Purview, technology

Yesterday, I found myself in a spirited exchange with a colleague about whether the e-discovery community has suitable replacements for the Enron e-mail corpora—now more than two decades old—as a “sandbox” for testing tools and training students. I argued that the quality of the data matters: native or near-native e-mail collections remain essential to test processing and review workflows in ways that mirror real-world litigation.

The back-and-forth reminded me that, unlike forensic examiners or service providers, e-discovery lawyers may not know or care much about the nature of electronically-stored information until it finds its way to a review tool. I get that. If your interest in email is in testing AI coding tools, you’re laser-focused on text and maybe a handful of metadata; but if your focus is on the integrity and authenticity of evidence, or in perfecting processing tools, the originating native or near-native form of the corpus matters more.

What follows is a re-publication of a post from July 2013. I’m bringing it back because the debate over forms of email hasn’t gone away; the issue is as persistent and important as ever. A central takeaway bears repeating: the litmus test is whether a corpus hews to a fully RFC 5322-compliant format. If headers, MIME boundaries, and transport artifacts are stripped or incompletely synthesized, what remains ceases to be a faithful native or near-native format. That distinction matters, because even experienced e-discovery practitioners—those fixated on review at the far-right side of the EDRM—may not fully appreciate what an RFC 5322 email is, or how much fidelity is lost when working with post-processed sets.
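For readers who want to see what “fidelity” means concretely, Python’s standard email library can surface the difference between a header-rich RFC 5322 message and a stripped-down derivative.  A rough probe, not a validator (both sample messages are invented):

```python
from email import message_from_string
from email.policy import default

# Core RFC 5322 fields a faithful native message should carry.
REQUIRED = ("Message-ID", "Date", "From")

def fidelity_report(raw):
    """Report missing core headers, MIME structure, and transport hops."""
    msg = message_from_string(raw, policy=default)
    return {
        "missing_headers": [h for h in REQUIRED if h not in msg],
        "is_multipart": msg.is_multipart(),
        "received_hops": len(msg.get_all("Received") or []),
    }

native = (
    "Message-ID: <abc@example.com>\n"
    "Date: Mon, 01 Jan 2001 10:00:00 -0600\n"
    "From: ken.lay@enron.com\n"
    "Received: from mta1.example.com by mx.example.com\n"
    "\n"
    "Body text.\n"
)
stripped = "From: ken.lay@enron.com\n\nBody text.\n"

print(fidelity_report(native))    # no missing headers; one transport hop survives
print(fidelity_report(stripped))  # Message-ID and Date gone: fidelity lost
```

A post-processed corpus that fails a check like this can still be useful for testing review-stage analytics, but it can’t exercise the processing and authentication workflows that depend on those artifacts.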



Still on Dial-Up: Why It’s Time to Retire the Enron Email Corpus

15 Friday Aug 2025

Posted by craigball in Computer Forensics, E-Discovery, General Technology Posts

≈ 11 Comments

Tags

corpora, E-Discovery, eDiscovery, Enron, ESI, forensics

Early this century, when I was gaining a reputation as a trial lawyer who understood e-discovery and digital forensics, I was hired to work as the lead computer forensic examiner for plaintiffs in a headline-making case involving a Houston-based company called Enron.  It was a heady experience.

Today, everywhere you turn in e-discovery, Enron is still with us. Not the company that went down in flames more than two decades ago, but the Enron Email Corpus, the industry’s default demo dataset.

Type in “Ken Lay” or “Andy Fastow,” hit search, and watch the results roll in. For vendors, it’s the easy choice: free, legal, and familiar. But in 2025, it’s also frozen in time—benchmarking the future of discovery against the technological equivalent of a rotary phone. Or, now that AOL has lately retired its dial-up service, benchmarking it against a 56K modem.

How Enron Became Everyone’s Test Data

When Enron collapsed in 2001 amid accounting fraud and market-manipulation scandals, the U.S. Federal Energy Regulatory Commission (FERC) launched a sweeping investigation into abuses during the Western U.S. energy crisis. As part of that probe, FERC collected huge volumes of internal Enron email.

In 2003, in an extraordinary act of transparency, FERC made a subset of those emails public as part of its docket. Some messages were removed at employees’ request; all attachments were stripped.

The dataset got a second life when Carnegie Mellon University’s School of Computer Science downloaded the FERC release, cleaned and structured it into individual mailboxes, and published it for research. That CMU version contains roughly half a million messages from about 150 Enron employees.

A few years later, the Electronic Discovery Reference Model (EDRM)—where I serve as General Counsel—stepped in to make the corpus more accessible to the legal tech world. EDRM curated, repackaged, and hosted improved versions, including PST-structured mailboxes and more comprehensive metadata. Even after CMU stopped hosting it, EDRM kept it available for years, ensuring that anyone building or testing e-discovery tools had a free, legal dataset to use. [Note: EDRM no longer hosts the Enron corpus, but for those who like hunting antiques, you may find it (or parts of it) at CMU, Enrondata.org, Kaggle.com and, no joke, The Library of Congress].

Because it’s there, lawful, and easy, Enron became—and regrettably remains—the de facto benchmark in our industry.

Why Enron Endures

Its virtues are obvious:

  • Free and lawful to use
  • Large enough to exercise search and analytics tools
  • Real corporate communications with all their messy quirks
  • Familiar to the point of being an industry standard

But those virtues are also the trap. The data is from 2001—before smartphones, Teams, Slack, Zoom, linked attachments, and nearly every other element that makes modern email review challenging.

In 2025, running Enron through a discovery platform is like driving a Formula One race car on cobblestone streets.



Safety First: A Fun Day at the “Office”

16 Monday Dec 2024

Posted by craigball in Computer Forensics, E-Discovery, General Technology Posts, Personal

≈ 4 Comments

Tags

bosiet, caebs, drill-ship, forensics, offshore, vdr, voyage-data-recorder

As a forensic examiner, I’ve gathered data in locales ranging from vast, freezing data centers to the world’s largest classic car collection. Yet, wherever work has taken me, I’ve not needed special equipment or certifications beyond my forensic skills and tools.  That is, until I was engaged to inspect and acquire a Voyage Data Recorder aboard a drilling vessel operating in the Gulf of Mexico.

A Voyage Data Recorder (VDR) is the marine counterpart of the Black Box event recorder in an airliner.  It’s a computer like any other, but hardened and specialized.  Components are designed to survive a catastrophic event and tell the story of what transpired.

Going offshore by helicopter to a rig or vessel demands more than a willingness to go.  The vessel operator required that I have a BOSIET with CAEBS certification to come aboard.  That stands for Basic Offshore Safety Induction and Emergency Training with Compressed Air Emergency Breathing System.  It's sixteen hours of training, half online and half onsite and hands-on.  I suppose I was expected to balk, but I completed the course in Houston on Thursday.  Now, I'm the only BOSIET-with-CAEBS-certified lawyer/forensic examiner I know (for all the good that's likely to do me beyond this one engagement).  Still, it was a blast to train in a different discipline.

A BOSIET with CAEBS certification encompasses four units:

  1. Safety Induction
  2. Helicopter Safety and Escape Training (with CA-EBS) using a Modular Egress Training Simulator (METS)
  3. Sea Survival including Evacuation, TEMPSC, and Emergency First Aid
  4. Firefighting and Self Rescue Techniques

Doveryai, No Proveryai!

07 Wednesday Aug 2024

Posted by craigball in Computer Forensics, E-Discovery, General Technology Posts

≈ 4 Comments

I recently published an AI prompt designed to run against search terms and have the AI propose improvements.  Among the pitfalls I'd hoped to expose was the presence of "stop" or "noise" words: terms routinely excluded from search indices.  Searches incorporating stop words fail because terms not in the index won't be found.  Ensuring your searches don't include stop words is an essential step in framing effective queries.
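The check is simple enough to script.  Here's a minimal sketch of the idea; the stop-word set below is purely illustrative (every platform maintains its own default list, which you should verify against the vendor's documentation, as the rest of this post makes painfully clear):

```python
# Illustrative stop words only -- NOT any real platform's default list.
# Always verify the actual list against the tool vendor's documentation.
HYPOTHETICAL_STOP_WORDS = {
    "a", "an", "and", "are", "as", "at", "be", "but", "by", "for",
    "if", "in", "into", "is", "it", "no", "not", "of", "on", "or",
    "such", "that", "the", "their", "then", "there", "these",
    "they", "this", "to", "was", "will", "with",
}

def flag_stop_words(query: str, stop_words=HYPOTHETICAL_STOP_WORDS):
    """Return the terms in a query that a search index would likely drop."""
    terms = query.lower().split()
    return [t for t in terms if t in stop_words]

print(flag_stop_words("breach of the agreement"))  # ['of', 'the']
```

A query flagged this way won't fail outright, but the flagged terms contribute nothing, and a phrase search built around them may silently miss responsive documents.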

To help the AI recognize stop words, the prompt included a list of default stop words for well-known eDiscovery tools.  That is, I thought it did, but what I included in error (and have now replaced) was ChatGPT's rendition of stop words for the major tools.  I'd made a mental note to check the lists supplied but—DOH!—I plugged the list into the prompt and then forgot to do my due diligence.

I was feeling pretty good about the post and getting some nice feedback.  Last night, my dear friend and e-discovery Empress Mary Mack commented on the novelty of seeing the various stop word lists broken out in a ready reference.  I think echoes of Mary’s kind comment woke me at 4:00am, my subconscious screaming, “HEY DUMMY!  Did you verify those stop words?  Tell me you didn’t blindly trust an AI?!?”

So, long before sunrise, I was manually checking each stop word list against product websites and—lo and behold—every list was off: some merely incomplete but others not even close. ChatGPT hallucinated the lists, and I failed to do the crucial thing lawyers must do when using AI as a research assistant: Trust but verify.

No harm done, but I share my chagrin here to underscore that you just cannot trust a generative AI large language model to do your research without careful human assessment of the output.  I know this and let it slip my mind.  Last time for that.  I've corrected the prompt on my blog and hope I've gotten it right.  I post this to remind my readers that AI LLMs are great (USE THEM!) but they are no substitute for you.  Doveryai, no proveryai!


AI Prompt to Improve Keyword Search

04 Sunday Aug 2024

Posted by craigball in Computer Forensics, E-Discovery, General Technology Posts

≈ 15 Comments

Twenty years ago, I dreamed up a website where you would submit a list of eDiscovery keywords and queries and the site would critique the searches and suggest improvements to make them more efficient and effective. It would flag stop words, propose alternate spellings, and alert the user to pitfalls making searches less effective or noisy. I even envisioned it testing queries against a benign dataset to identify overly broad terms and false hits.

I believed this tool would be invaluable for helping lawyers enhance their search skills and achieve greater efficiency. Over the years, I tried to bring this idea to life, seeking proposals from offshore developers and pitching it to e-discovery software publishers as a value-add. In the end, a pipe dream. Even now, nothing like it exists.

The emergence of AI-powered Large Language Models like ChatGPT made me think what I'd hoped to bring to life years ago might finally be feasible. I wondered if I could create a prompt for ChatGPT that would achieve much of what I envisioned. So, I dedicated a sunny Sunday morning to playing "prompt engineer," a term cut from whole cloth for those who craft AI prompts to achieve desired outcomes.

The result was promising, a significant step forward for lawyers who struggle with search queries without understanding why some fail. Most search errors I encounter aren’t subtle. I’ve written about ways to improve lexical search, and the techniques aren’t rocket science, though they require some familiarity with how electronically stored information is indexed and how search syntaxes differ across platforms. Okay, maybe a little rocket science. But if you’re using a tool for critical tasks, shouldn’t you know what it can and cannot do?

Some believe refining keywords and queries is a waste of time, casting keyword search as obsolete. Perhaps on your planet, Klaatu, but here on Earth, lawyers continue using keywords with reckless abandon. I’m not defending that but neither will I ignore lawyers’ penchant for lexical search. Until the cost, reliability, and replicability of AI-enabled discovery improve, keywords will remain a tool for sifting through large datasets. However, we can use AI LLMs right now to enhance the performance and efficiency of shopworn approaches.


Girding for the E-Savvy Opponent (Revisited)

26 Friday Apr 2024

Posted by craigball in Computer Forensics, E-Discovery, Uncategorized

≈ 7 Comments

Tags

competence, disclosure, discovery, edisclosure, eDiscovery

A friend shared that she was seeing the Carole King musical, "Beautiful," and I recalled the time I caught it twice on different visits to London in 2015 because I enjoyed it so. I reflected on why I was in London that summer nine years ago and came across a post from the time, one I liked well enough to revisit below. I predicted the emergence of the e-savvy opponent, something that has indeed come to pass, though less widely and less effectively than I'd hoped (and still hope for). A new generation of e-discoverers has emerged since, so perhaps the post will be fresh (and remain relevant) for more than a few, and sufficiently forgotten to feel fresh for the rest:

(From May 12, 2015): I am in Great Britain this week addressing an E-Discovery and Information Governance conclave, joined by esteemed American colleagues and friends, Jason Baron and Ralph Losey among other luminaries.  My keynote topic opening the conference is Girding for the E-Savvy Opponent. Here is a smattering of what I expect to say.

I arrived in London from Budapest in time to catch some of the events for the 70th anniversary of VE Day, marking the hard-won victory over Germany in the war that shortly followed the war that was to have ended all wars.

As we sported poppies and stood solemnly at the Cenotaph recalling the sacrifices made by our parents and grandparents, I mulled technology’s role in battle, and the disasters that come from being unprepared for a tech-savvy opponent.

It’s said that “Generals are always prepared to fight the last war.” This speaks as much to technology as to tactics.  Mounted cavalry proved no match for armored tanks.  Machine guns made trench warfare obsolete.  The Maginot Line became a punch line thanks to the Blitzkrieg.  “Heavy fortifications?  No problem, mein schatzi, ve vill just drive arount tem.”

In e-disclosure, we still fight the last war, smug in the belief that our opponents will never be e-savvy enough to defeat us.

Our old war ways have served so long that we are slow to recognize a growing vulnerability.  To date, our opponents have proved unsophisticated, uncreative and un-tenacious.  Oh, they make a feint against databases here and a half-hearted effort to get native production there; but, for the most part, they’re still fighting with hordes, horses and sabers.  We run roughshod over them.  We pacify them with offal and scraps.

But, we don’t think of it that way, of course.  We think we are great at all this stuff, and that the way we do things is the way it’s supposed to be done.  Large companies and big law firms have been getting away with abusive practices in e-disclosure for so long that they have come to view it as a birthright.  I am the 19th Earl of TIFF.  My father was the Royal Exchequer of Keywords.  I have more than once heard an opponent defend costly, cumbersome procedures that produce what I didn’t seek and didn’t want with the irrefutable justification of, “we did what we always do.”

Tech-challenged opponents make it easy.  They don’t appreciate how our arsenal of information has changed; so, they shoot at us with obsolete requests from the last war, the paper war.  They don’t grasp that the information they need now lives in databases and won’t be found by keywords.  They demand documents.  We have data.  They demand files.  We have sources.

Girding for the Tech Savvy Opponent-IQPC 2015

But, our once tech challenged opponents will someday evolve into Juris Doctor Electronicus.  When they do, here is some of what to expect from them:

E-savvy counsel succeeds not by overreaching but by insisting on mere competence—competent scope, competent processes and competent forms of production.  Good, not just good enough.

Your most effective defense against e-savvy counsel is the Luddite judge who applies the standards of his or her former law practice to modern evidence. Your best strategy here is to continue to expose young lawyers to outmoded practices so that when they someday take the bench they will also know no better way.

Another strategy against e-savvy counsel is to embed outmoded practices in the rules and to immunize incompetence against sanctions.

But these are stopgap strategies–mere delaying tactics.  In the final analysis, the e-savvy opponent needn’t fear retrograde efforts to limit electronic disclosure. Today, virtually all evidence is born electronically; consequently, senseless restrictions on electronic disclosure cannot endure unless we are content to live in a society where justice abides in purposeful ignorance of the evidence.  We have not fallen so, and we will not fall that far.

The e-savvy opponent’s most powerful ally is the jurist who can distinguish between the high cost and burden occasioned by poor information governance and the high cost and burden that flows from overreaching by incompetent requests.  Confronted with a reasonable request, this able judge will give you no quarter because your IG house is not in order.

E-savvy counsel well understands that claims like, “that’s gone,” “we can’t produce it that way” and “we searched thoroughly” rarely survive scrutiny.

It’s not that no enterprise can match the skills of the e-savvy opponent. It’s that so few have ever had to do so.  Counsel for producing parties haven’t had to be particularly e-savvy because opposing counsel rarely were.

Sure, you may have been involved in the Black Swan discovery effort: the catastrophic case where a regulator or judge compelled you to go far beyond your normal scope. But is that sustainable? Could you do that on a regular basis if all of your opponents were e-savvy?

You may respond, “But we shouldn’t have to respond that way on a regular basis.” In fact, you should, because “e-savvy” in our opponents is something we must come to expect and because, if the opponent is truly e-savvy, their requests will likely smack of relevance and reasonableness.

Remember, the e-savvy opponent about which I warn is not the twit with a form or the wanker who’s simply trying to inflate the scope of the disclosure as a means to extort settlement.  They’re no match for you.  The e-savvy opponent to fear is the one who can persuade a court that the scope is appropriate and proportionate because it is, in fact, both.


Cloud Attachments: Versions and Purview

08 Monday Apr 2024

Posted by craigball in Computer Forensics, E-Discovery, Uncategorized

≈ 6 Comments

Tags

cloud attachments, eDiscovery, Linked attachments, M365, modern attachments, Purview

Last week, I dug into Cloud Attachments to email, probing the propensity of producing parties to shirk collection of linked documents.  Here, I want to discuss the versioning concern offered as a justification for non-production and the use of hash duplicate identification to integrate supplementary productions with incomplete prior productions.

Recently on LinkedIn, Very Smart Guy Rachi Messing shared this regarding cloud attachments:

the biggest issue at hand is not the technical question of how to collect them and search them, but rather what VERSION is the correct one to collect and search.

Is it:

1. The version that existed at the time the email was sent (similar to a point in time capture of a file that is attached to an email the traditional way)

2. The version that was seen the first time the recipient opened it (which may lead to multiple versions required based on the exact timing of multiple recipients opening at varying times)

3. The version that exists the final time a recipient opened it

4. The most recent version in existence

I understand why Rachi might minimize the collection and search issue. He's knee-deep in Microsoft M365 collection.  As I noted in my last post, Microsoft makes cloud attachment collection a feature available to its subscribers, so there's really no excuse for the failure to collect and search cloud attachments in Microsoft M365.

I’d reframe Rachi’s question: Once collected, searched and determined to be responsive, is the possibility that the version of a cloud attachment reviewed differs from the one transmitted a sufficient basis upon which to withhold the attachment from production?

Respecting the versioning concern, I responded to Rachi’s post this way:

The industry would profit from objective analysis of the incidence (e.g., percentage) of Cloud attachments modified after transmittal. I expect it will vary from sector to sector, but we would benefit from solid metrics in lieu of the anecdotal accounts that abound. My suspicion is that the incidence is modest overall, the majority of Cloud attachments remaining static rather than manifesting as collaborative documents. But my suspicion would readily yield to meaningful measurement.  … May I add that the proper response to which version to collect to assess relevance is not ‘none of them,’ which is how many approach the task.

Digging into the versioning issue demands I retread ground on cloud attachments generally.

A “Cloud Attachment” is what Microsoft calls a file transmitted via email in which the sender places the file in a private online repository (e.g., Microsoft OneDrive) and sends the intended recipients a link to the uploaded file.  The more familiar alternative is embedding the file in the email.  Such “Embedded Attachments” travel with the messages when email is collected for discovery; cloud attachments must be collected (downloaded) from the repository, ideally at the same time the email is collected.  As a rule of thumb, large files tend to become cloud attachments, automatically uploaded by virtue of their size.  The practice of linking large files as cloud attachments has been commonplace for more than a decade.

Within the Microsoft M365 email environment, searching and collecting email, including its embedded and cloud attachments, is facilitated by a suite of features called Microsoft Purview.  Terming any task in eDiscovery “one-click easy” risks oversimplification, but the Purview eDiscovery (Premium “E5”) features are designed to make collection of cloud attachments to M365 email nearly as simple as ticking a box during collection.

When a party using Microsoft M365 email elects to collect (export) a custodian’s email for search, they must decide whether to collect files sent as cloud attachments so they may be searched as part of the message “family,” the term commonly applied to a transmitting message and its attachments.  Preserving this family relationship is important because the message tells you who received the attachments and when, while searching the attachments tells you what information was shared. The following screenshot from Microsoft illustrates the box checked to collect cloud attachments. Looks “one-click easy,” right?

By themselves, the cloud attachment links in a message reveal nothing about the content of the cloud attachments.  Sensibly, the target documents must be collected to be assessed.  And as noted, files are not linked because they differ in relevance; often they are linked simply because they are larger files, and to that extent they may hold a greater volume of potentially relevant information.

Just as it would not have been reasonable in the days of paper discovery to confine a search to documents on your desk but not in your desk, it’s not reasonable to confine a search of email attachments to embedded attachments but not cloud attachments.  Both are readily accessible to the custodians of the email using the purpose-built tools Microsoft supplies to its email customers.

Microsoft Purview collects cloud attachments as they exist at the time of collection; so, if the attachment was edited after transmittal, the attachment will reflect those edits.  The possibility that a document has been edited is not a new one in discovery; it goes to the document’s admissibility not its discoverability.  The relevance of a document for discovery depends on its content and logical unitization, and assessing content demands that it be searched, not ignored on the speculative possibility that it might have changed.

If a cloud attachment were changed after transmittal, those changes are customarily tracked within the document.  Accordingly, if a cloud attachment has run the gauntlet of search and review, any lingering suspicion that the document was changed may be resolved by, e.g., production of the version closest in time to transmittal or by the parties meeting and conferring.  Again, the possibility that a document has been edited is nothing new; and is merely a possibility.  It’s ridiculous to posit that a party may eschew collecting or producing all cloud attachments because some might have been edited.

Cloud attachments are squarely within the ambit of what must be assessed for relevance. The potential for a cloud attachment to be responsive is no less than that of an item transmitted as an embedded attachment.  The burden claimed by responding parties grows out of their failure to do what clearly should have been done in the first place; that is, it stems from the responding party’s decision to exclude potentially relevant, accessible documents from being collected and searched. 

If you’re smart, Dear Reader, you won’t fail to address cloud attachments explicitly in your proposed ESI Protocols and/or Requests for Production.  I can’t make this point too strongly, because you’re not likely to discover that the other side didn’t collect and search cloud attachments until AFTER they make a production, putting you in the unenviable posture of asking for families produced without cloud attachments to be reproduced with cloud attachments.  Anytime a Court hears that you are asking for something to be produced a second time in discovery, there’s a risk the Court may be misled by an objection grounded on Federal Rule of Civil Procedure 34(b)(2)(E)(iii), which states that “[a] party need not produce the same electronically stored information in more than one form.”  In my mind, “incomplete” and “complete” aren’t what the drafters of the Rule meant by “more than one form,” but be prepared to rebut the claim.

At all events, a party who failed to collect cloud attachments will bewail the need to do it right and may cite as burdensome the challenge of distinguishing items reviewed without their cloud attachments from those requiring review when made whole by the inclusion of cloud attachments.

Once a party collects cloud attachments and transmittals, there are various ways to distinguish between messages updated with cloud attachments and those previously reviewed without cloud attachments.  Identifying families previously collected that have grown in size is one approach.  Then, by applying a filter, only the attachments of these families would be subjected to supplementary keyword search and review.  The emails with cloud attachments that are determined to be responsive and non-privileged would be re-produced as families comprising the transmittal and all attachments (cloud AND embedded).  An overlay file may be used to replace items previously produced as incomplete families with complete families.  No doubt there are other efficient approaches.
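The "families that grew" approach above is straightforward to operationalize.  Here's a minimal sketch under assumed inputs: hypothetical per-family item counts as they might be exported from a review platform before and after re-collection (the family IDs and the manifest format are my inventions for illustration):

```python
# Hypothetical family manifests: family ID -> total item count,
# as might be exported before and after re-collection with cloud attachments.
before = {"FAM-001": 2, "FAM-002": 1, "FAM-003": 3}
after  = {"FAM-001": 2, "FAM-002": 4, "FAM-003": 3, "FAM-004": 2}

def grown_families(before, after):
    """Families whose item count increased, plus newly collected families.
    Only these need supplementary search and review."""
    return sorted(
        fam for fam, count in after.items()
        if count > before.get(fam, 0)
    )

print(grown_families(before, after))  # ['FAM-002', 'FAM-004']
```

Only the flagged families would be routed to supplementary keyword search and review; unchanged families need no re-assessment.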

If all transmittal messages were searched and assessed previously (albeit without their cloud attachments), there would not be a need to re-assess those transmittals unless they have become responsive by virtue of a responsive cloud attachment.  These “new” families need no de-duplication against prior production because they were not produced previously.  I know that sounds peculiar, but I promise it makes sense once you think through the various permutations.

With respect to using hash deduplication, the hash value of a transmittal does not change because you collect a NON-embedded cloud attachment; leastwise not unless you change the way you compute the hash value to incorporate the collected cloud attachment.  Hash deduplication of email has always entailed the hashing of selected components of messages because email headers vary.  Accordingly, a producing party need compare only the family segments that matter, not the ones that do not. In other words, de-duplicating what has been produced versus new material is a straightforward process for emails (and one that greatly benefits from use of the EDRM MIH). Producing parties do not need to undertake a wholesale re-review of messages; instead, they need to review for the first time those things they should have reviewed from inception.
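To make the component-hashing point concrete, here's a sketch of the general technique.  The specific fields and normalization choices below are my illustration, not the EDRM MIH specification, which defines its own canonical components; the point is simply that because the hash is computed over selected message segments, later collecting a non-embedded cloud attachment leaves it unchanged:

```python
import hashlib

def family_segment_hash(sender, recipients, subject, sent, body):
    """Hash only selected, normalized message components (illustrative
    fields -- real specs like the EDRM MIH define their own).  Variable
    header fields and later-collected cloud attachments don't affect it."""
    canonical = "\x1e".join([          # record separator between segments
        sender.strip().lower(),
        ";".join(sorted(r.strip().lower() for r in recipients)),
        subject.strip().lower(),
        sent,                          # ISO 8601 sent date/time, as recorded
        body.strip(),
    ])
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

h1 = family_segment_hash("a@x.com", ["b@y.com"], "Q3 Report",
                         "2024-04-08T09:00:00Z", "See link.")
h2 = family_segment_hash("A@X.com", ["B@Y.com "], "Q3 Report ",
                         "2024-04-08T09:00:00Z", "See link.")
print(h1 == h2)  # True -- normalization makes these the same message
```

Because the transmittal's hash is stable, matching previously produced messages against the re-collected set is a simple lookup, not a re-review.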

I’ll close with a question for those who conflate cloud attachments (which reside in private cloud repositories) with hyperlinks to public-facing web resources, objecting that dealing with cloud attachments will require collection of all hyperlinked content. What have you been doing with the hyperlinks in your messages until now? In my experience, loads of us include a variety of hyperlinks in email signature blocks. We’ve done it for years. In my email signature, I hyperlink to my email address, my website and my blog; yet, I’ve never had trouble distinguishing those links from embedded and cloud attachments. The need to integrate cloud attachments in eDiscovery is not a need to chase every hyperlink in an email. Doug Austin does a superb job debunking the “what about hyperlinks” strawman in Assumption One of his thoughtful post, “Five Assumptions About the Issue of Hyperlinked Files as Modern Attachments.”
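Telling a cloud-attachment link apart from an ordinary hyperlink is usually a matter of looking at where the link points.  As a rough sketch (the repository domains listed are illustrative; real tenants use organization-specific hosts, and collection tools apply far more reliable link metadata than a domain test):

```python
from urllib.parse import urlparse

# Illustrative repository domains only; actual cloud-attachment links
# vary by tenant and provider.
CLOUD_REPOSITORY_DOMAINS = (
    "sharepoint.com", "1drv.ms", "docs.google.com", "drive.google.com",
)

def looks_like_cloud_attachment(url: str) -> bool:
    """Crude triage: does the link point at a private cloud repository
    rather than a public-facing web resource?"""
    host = urlparse(url).netloc.lower()
    return host.endswith(CLOUD_REPOSITORY_DOMAINS)

print(looks_like_cloud_attachment(
    "https://contoso-my.sharepoint.com/personal/report.docx"))  # True
print(looks_like_cloud_attachment("https://craigball.net"))     # False
```

The signature-block links the post mentions fail this test at a glance, which is the point: integrating cloud attachments does not require chasing every hyperlink.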

Bottom Line: If you’re an M365 email user, you need to grab the cloud attachments in your Microsoft repositories. If you’re a Gmail user, you need to grab the cloud attachments in your Google Drive repositories. That a custodian might conceivably link to another repository is no reason to fail to collect from M365 and Gmail.
