Native or Not? Rethinking Public E-Mail Corpora for E-Discovery (Redux, 2013→2025)

16 Saturday Aug 2025

Posted by craigball in ai, Computer Forensics, E-Discovery, Uncategorized

Tags

ai, artificial-intelligence, chatgpt, eDiscovery, EDRM, generative-ai, Linked attachments, Purview, technology

Yesterday, I found myself in a spirited exchange with a colleague about whether the e-discovery community has suitable replacements for the Enron e-mail corpora¹—now more than two decades old—as a “sandbox” for testing tools and training students. I argued that the quality of the data matters: native or near-native e-mail collections remain essential to test processing and review workflows in ways that mirror real-world litigation.

The back-and-forth reminded me that, unlike forensic examiners or service providers, ediscovery lawyers may not know or care much about the nature of electronically-stored information until it finds its way to a review tool. I get that. If your interest in email is in testing AI coding tools, you’re laser-focused on text and maybe a handful of metadata; but if your focus is on the integrity and authenticity of evidence, or in perfecting processing tools, the originating native or near-native form of the corpus matters more.

What follows is a re-publication of a post from July 2013. I’m bringing it back because the debate over forms of email hasn’t gone away; the issue is as persistent and important as ever. A central takeaway bears repeating: the litmus test is whether a corpus hews to a fulsome RFC-5322 compliant format. If headers, MIME boundaries, and transport artifacts are stripped or incompletely synthesized, what remains ceases to be a faithful native or near-native format. That distinction matters, because even experienced e-discovery practitioners—those fixated on review at the far-right side of the EDRM—may not fully appreciate what an RFC-5322 email is, or how much fidelity is lost when working with post-processed sets.

Continue reading →

Cloud Attachments: Versions and Purview

08 Monday Apr 2024

Posted by craigball in Computer Forensics, E-Discovery, Uncategorized

≈ 6 Comments

Tags

cloud attachments, eDiscovery, Linked attachments, M365, modern attachments, Purview

Last week, I dug into Cloud Attachments to email, probing the propensity of producing parties’ to shirk collection of linked documents. Here, I want to discuss the versioning concern offered as a justification for non-production and the use of hash duplicate identification to integrate supplementary productions with incomplete prior productions.

Recently on LinkedIn, Very Smart Guy, Rachi Messing, shared this re: cloud attachments,

the biggest issue at hand is not the technical question of how to collect them and search them, but rather what VERSION is the correct one to collect and search.

Is it:

1. The version that existed at the time the email was sent (similar to a point in time capture of a file that is attached to an email the traditional way)

2. The version that was seen the first time the recipient opened it (which may lead to multiple versions required based on the exact timing of multiple recipients opening at varying times)

3. The version that exists the final time a recipient opened it

4. The most recent version in existence

I understand why Rachi might minimize the collection and search issue. He’s knee deep in Microsoft M365 collection. As I noted in my last post, Microsoft makes cloud attachment collection a feature available to its subscribers, so there’s really no excuse for the failure to collect and search cloud attachments in Microsoft M365.

I’d reframe Rachi’s question: Once collected, searched and determined to be responsive, is the possibility that the version of a cloud attachment reviewed differs from the one transmitted a sufficient basis upon which to withhold the attachment from production?

Respecting the versioning concern, I responded to Rachi’s post this way:

The industry would profit from objective analysis of the instance (e.g., percentage) of Cloud attachments modified after transmittal. I expect it will vary from sector to sector, but we would benefit from solid metrics in lieu of the anecdotal accounts that abound. My suspicion is that the instance is modest overall, the majority of Cloud attachments remaining static rather than manifesting as collaborative documents. But my suspicion would readily yield to meaningful measurement. … May I add that the proper response to which version to collect to assess relevance is not ‘none of them,’ which is how many approach the task.

Digging into the versioning issue demands I retread ground on cloud attachments generally.

A “Cloud Attachment” is what Microsoft calls a file transmitted via email in which the sender places the file in a private online repository (e.g., Microsoft OneDrive) and sends a link to the uploaded file to the intended recipients. The more familiar alternative to linking a file as a cloud attachment is embedding the file in the email; accordingly, such “Embedded Attachments” are collected with the email messages for discovery and cloud attachments are collected (downloaded) from OneDrive, ideally when the email is collected for discovery. As a rule-of-thumb, large files tend to be cloud attachments automatically uploaded by virtue of their size. The practice of linking large files as cloud attachments has been commonplace for more than a decade.

Within the Microsoft M365 email environment, searching and collecting email, including its embedded and cloud attachments, is facilitated by a suite of features called Microsoft Purview. Terming any task in eDiscovery “one-click easy” risks oversimplification, but the Purview eDiscovery (Premium “E5”) features are designed to make collection of cloud attachments to M365 email nearly as simple as ticking a box during collection.

When a party using Microsoft M365 email elects to collect (export) a custodian’s email for search, they must decide whether to collect files sent as cloud attachments so they may be searched as part of the message “family,” the term commonly applied to a transmitting message and its attachments. Preserving this family relationship is important because the message tells you who received the attachments and when, where searching the attachments tells you what information was shared. The following screenshot from Microsoft illustrates the box checked to collect cloud attachments. Looks “one-click easy,” right?

By themselves, the cloud attachment links in a message reveal nothing about the content of the cloud attachments. Sensibly, the target documents must be collected to be assessed and as noted, the reason they are linked is not because they have some different character in terms of their relevance; many times they are linked because they are larger files, so to that extent, they hold a greater volume of potentially relevant information.

Just as it would not have been reasonable in the days of paper discovery to confine a search to documents on your desk but not in your desk, it’s not reasonable to confine a search of email attachments to embedded attachments but not cloud attachments. Both are readily accessible to the custodians of the email using the purpose-built tools Microsoft supplies to its email customers.

Microsoft Purview collects cloud attachments as they exist at the time of collection; so, if the attachment was edited after transmittal, the attachment will reflect those edits. The possibility that a document has been edited is not a new one in discovery; it goes to the document’s admissibility not its discoverability. The relevance of a document for discovery depends on its content and logical unitization, and assessing content demands that it be searched, not ignored on the speculative possibility that it might have changed.

If a cloud attachment were changed after transmittal, those changes are customarily tracked within the document. Accordingly, if a cloud attachment has run the gauntlet of search and review, any lingering suspicion that the document was changed may be resolved by, e.g., production of the version closest in time to transmittal or by the parties meeting and conferring. Again, the possibility that a document has been edited is nothing new; and is merely a possibility. It’s ridiculous to posit that a party may eschew collecting or producing all cloud attachments because some might have been edited.

Cloud attachments are squarely within the ambit of what must be assessed for relevance. The potential for a cloud attachment to be responsive is no less than that of an item transmitted as an embedded attachment. The burden claimed by responding parties grows out of their failure to do what clearly should have been done in the first place; that is, it stems from the responding party’s decision to exclude potentially relevant, accessible documents from being collected and searched.

If you’re smart, Dear Reader, you won’t fail to address cloud attachments explicitly in your proposed ESI Protocols and/or Requests for Production. I can’t make this point too strongly, because you’re not likely to discover that the other side didn’t collect and search cloud attachments until AFTER they make a production, putting you in the unenviable posture of asking for families produced without cloud attachments to be reproduced with cloud attachments. Anytime a Court hears that you are asking for something to be produced a second time in discovery, there’s a risk the Court may be misled by an objection grounded on Federal Rule of Civil Procedure Rule 34(b)(2)(E)(iii), which states that, [a] party need not produce the same electronically stored information in more than one form.” In my mind, “incomplete” and “complete” aren’t what the drafters of the Rule meant by “more than one form,” but be prepared to rebut the claim.

At all events, a party who failed to collect cloud attachments will bewail the need to do it right and may cite as burdensome the challenge of distinguishing items reviewed without cloud transmittals from those requiring review when made whole by the inclusion of cloud attachments.

Once a party collects cloud attachments and transmittals, there are various ways to distinguish between messages updated with cloud attachments and those previously reviewed without cloud attachments. Identifying families previously collected that have grown in size is one approach. Then, by applying a filter, only the attachments of these families would be subjected to supplementary keyword search and review. The emails with cloud attachments that are determined to be responsive and non-privileged would be re-produced as families comprising the transmittal and all attachments (cloud AND embedded). An overlay file may be used to replace items previously produced as incomplete families with complete families. No doubt there are other efficient approaches.

If all transmittal messages were searched and assessed previously (albeit without their cloud attachments), there would not be a need to re-assess those transmittals unless they have become responsive by virtue of a responsive cloud attachment. These “new” families need no de-duplication against prior production because they were not produced previously. I know that sounds peculiar, but I promise it makes sense once you think through the various permutations.

With respect to using hash deduplication, the hash value of a transmittal does not change because you collect a NON-embedded cloud attachment; leastwise not unless you change the way you compute the hash value to incorporate the collected cloud attachment. Hash deduplication of email has always entailed the hashing of selected components of messages because email headers vary. Accordingly, a producing party need compare only the family segments that matter, not the ones that do not. In other words, de-duplicating what has been produced versus new material is a straightforward process for emails (and one that greatly benefits from use of the EDRM MIH). Producing parties do not need to undertake a wholesale re-review of messages; instead, they need to review for the first time those things they should have reviewed from inception.

I’ll close with a question for those who conflate cloud attachments (which reside in private cloud respositories) with hyperlinks to public-facing web resources, objecting that dealing with collecting cloud attachments will require collection of all hyperlinked content. What have you been doing with the hyperlinks in your messages until now? In my experience, loads of us include a variety of hyperlinks in email signature blocks. We’ve done it for years. In my email signature, I hyperlink to my email address, my website and my blog; yet, I’ve never had trouble distinguishing those links from embedded and cloud attachments. The need to integrate cloud attachments in eDiscovery is not a need to chase every hyperlink in an email. Doug Austin does a superb job debunking the “what about hyperlinks” strawman in Assumption One of his thoughtful post, “Five Assumptions About the Issue of Hyperlinked Files as Modern Attachments.”

Bottom Line: If you’re an M365 email user; you need to grab the cloud attachments in your Microsoft repositories. If you’re a GMail user, you need to grab the cloud attachments in your Google Drive respositories. That a custodian might conceivably link to another repository is no reason to fail to collect from M365 and GMail.

Image

What’s All the Fuss About Linked Attachments?

29 Friday Mar 2024

Tags

ESI Protocols, hyperlinked files, Linked attachments, Purview

In the E-Discovery Bubble, we’re embroiled in a debate over “Linked Attachments.” Or should we say “Cloud Attachments,” or “Modern Attachments” or “Hyperlinked Files?” The name game aside, a linked or Cloud attachment is a file that, instead of being tucked into an email, gets uploaded to the cloud, leaving a trail in the form of a link shared in the transmitting message. It’s the digital equivalent of saying, “It’s in an Amazon locker; here’s the code” versus handing over a package directly. An “embedded attachment” travels within the email, while a “linked attachment” sits in the cloud, awaiting retrieval using the link.

Some recoil at calling these digital parcels “attachments” at all. I stick with the term because it captures the essence of the sender’s intent to pass along a file, accessible only to those with the key to retrieve it, versus merely linking to a public webpage. A file I seek to put in the hands of another via email is an “attachment,” even if it’s not an “embedment.” Oh, and Microsoft calls them “Cloud Attachments,” which is good enough for me.

Regardless of what we call them, they’re pivotal in discovery. If you’re on the requesting side, prepare for a revelation. And if you’re a producing party, the party’s over.

A Quick March Through History

Nascent email conveyed basic ASCII text but no attachments. In the early 90s, the advent of Multipurpose Internet Mail Extensions (MIME) enabled files to hitch a ride on emails via ASCII encoded in Base64. This tech pivot meant attachments could join emails as encoded stowaways, to be unveiled upon receipt.

For two decades, this embedding magic meant capturing an email also netted its attachments. But come the early 2010s, the cloud era beckoned. Files too bulky for email began diverting to cloud storage with emails containing only links or “pointers” to these linked attachments.

The Crux of the Matter

Linked attachments aren’t newcomers; they’ve been lurking for over a decade. Yet, there’s a growing “aha” moment among requesters as they realize the promised exchange of digital parcels hasn’t been as expected. Increasingly—and despite contrary representations by producing parties—relevant, responsive and non-privileged attachments to email aren’t being produced because relevant, responsive and non-privileged attachments aren’t being searched.

Wait! What? Say that again.

You heard me. As attachments shifted from being embedded to being linked, producing parties simply stopped collecting and searching those attachments.

How is that possible? Why didn’t they disclose that?

I’ll explain if you’ll indulge me in another history lesson.

Echoes From the Past

Traditionally, discovery leaned on indexing the content of email and attachments for quicker search, bypassing the need to sift through each individually. Every service provider employs indexed search.

When attachments are embedded in messages, those attachments are collected with the messages, then indexed and searched. But when those attachments are linked instead of embedded, collecting them requires an added step of downloading the linked attachments with the transmitting message. You must do this before you index and search because, if you fail to do so, the linked attachments aren’t searched or tied to the transmitting message in a so-called “family relationship.”

They aren’t searched. Not because they are immaterial or irrelevant or in any absolute sense, inaccessible; a linked attachment is as amenable to being indexed and searched as any other document. They aren’t searched because they aren’t collected; and they aren’t collected because it’s easier to blow off linked attachments than collect them.

Linked attachments, squarely under the producer’s control, pose a quandary. A link in an email is a dead-end for anyone but the sender and recipients and reveals nothing of the file’s content. These linked attachments could be brimming with relevant keywords yet remain unexplored if not collected with their emails.

So, over the course of the last decade, how many times has an opponent revealed that, despite a commitment to search a custodian’s email, they were not going to collect and search linked documents?

The curse and blessing of long experience is having seen it all before. Every generation imagines they invented sex, drugs and rock-n-roll, and every new information and communication technology is followed by what I call the “getting-away-with-murder” phase in civil discovery. Litigants claim that whatever new tech has wrought is “too hard” to deal with in discovery, and they get away with murder by not having to produce the new stuff until long after we have the means and methods to do so. I lived through that with e-mail, native production, then mobile devices, web content and now, linked attachments.

This isn’t just about technology but transparency and diligence in discovery. The reluctance to tackle linked attachments under claims of undue burden echoes past reluctances with emerging technologies. Yet, linked attachments, integral to relevance assessments, shouldn’t be sidelined.

What is the Burden, Really?

We see conclusory assertions of burden notwithstanding that the biggest platforms like Microsoft and Google offer ‘pretty good’ mechanisms to deal with linked attachments. So, if a producing party claims burden, it behooves the Court and requesting parties to inquire into the source of the messaging. When they do, judges may learn that the tools and techniques to collect linked attachments and preserve family relationships exist, but the producing party elected not to employ them. Granted, these tools aren’t perfect; but they exist, and perfect is not the standard, just as pretending there are no solutions and doing nothing is not the standard.

Claims that collecting linked attachments pose an undue burden because of increased volume are mostly nonsense. The longstanding practice has been to collect a custodian’s messages and ALL embedded attachments, then index and search them. With few exceptions, the number of items collected won’t differ materially whether the attachment is embedded or linked (although larger files tend to be linked). So, any party arguing that collecting linked attachments will require the search of many more documents than before is fibbing or out of touch. I try not to attribute to guile that which may be explained by ignorance, so let’s go with the latter.

Half Baked Solutions

Challenged for failing to search linked attachments, a responding party may protest that they searched the transmitting emails and even commit to collecting and searching linked attachments to emails containing search hits. Sounds reasonable, right? Yet, it’s not even close to reasonable. Here’s why:

When using lexical (e.g., keyword) search to identify potentially responsive e-mail “families,” the customary practice is to treat a message and its attachments as potentially responsive if either the content of the transmitting message or its attachment generates search “hits” for the keywords and queries run against them. This is sensible because transmittals often say no more than, “see attached;” it’s the attachment that holds the hits. Yet, stripped of its transmittal, you won’t know the timing or circulation of the attachment. So, we preserve and disclose email families.

But, if we rely upon the content of transmitting messages to prompt a search of linked attachments, we will miss the lion’s share of responsive evidence. If we produce responsive documents without tying them to their transmittals, we can’t tell who got what and when. All that “what did you know and when did you know it” matters.

Why Guess When You Can Measure?

Hopefully, you’re wondering how many hits suggesting relevance occur in transmittals and how many in attachments? How many occur in both? Great questions! Happily, we can measure these things. We can determine, on average, the percentage of messages that produce hits versus their attachments.

If you determine that, say, half of hits were within embedded attachments, then you can fairly attribute that character to linked attachments not being searched. In that sense, you can estimate how much you’re missing and ascertain a key component of a proper proportionality analysis.

So why don’t producing parties asserting burden supply this crucial metric?

The Path Forward

Producing parties have been getting away with murder on linked attachments for so long that they’ve come to view it as an entitlement. Linked attachments are squarely within the ambit of what must be assessed for relevance. The potential for a linked attachment to be responsive is no less than that of an item transmitted as an embedded attachment. So, let’s stop pretending they have a different character in terms of relevance and devote our energies to fixing the process.

Collecting linked attachments isn’t as Herculean as some claim, especially with tools from giants like Microsoft and Google easing the process. The challenge, then, isn’t in the tools but in the willingness to employ them.

Do linked attachments pose problems? They absolutely do! I’ve elided over ancillary issues of versioning and credentials because those concerns reside in the realm between good and perfect solutions. Collection methods must be adapted to them—with clumsy workarounds at first and seamless solutions soon enough. But in acknowledging that there are challenges, we must also acknowledge that these linked attachments have been around for years, and they are evidence. Waiting until the crisis stage to begin thinking about how to deal with them was a choice, and a poor one. I shudder to think of the responsive information ignored every single day because this issue is inadequately appreciated by counsel and courts.

Happily, this is simply a technical challenge and one starting to resolve. Speeding the race to resolution requires that courts stop giving a free pass to the practice of ignoring linked attachments. Abraham Lincoln defined a hypocrite as a “man who murdered his parents, and then pleaded for mercy on the grounds that he was an orphan.” Having created the problem and ignored it for years, it seems disingenuous to indulge requesting parties’ pleas for mercy.

In Conclusion

We’re at a crossroads, with technical solutions within reach and the legal imperative clearer than ever. It’s high time we bridge the gap between digital advancements and discovery obligations, ensuring that no piece of evidence, linked or embedded, escapes scrutiny.

Posted by craigball | Filed under Computer Forensics, E-Discovery, Uncategorized

≈ 18 Comments

Ball in your Court

~ Musings on e-discovery & forensics.

Tag Archives: Linked attachments

Native or Not? Rethinking Public E-Mail Corpora for E-Discovery (Redux, 2013→2025)

What’s All the Fuss About Linked Attachments?

Share this:

Share this:

Share this: