Yesterday, I found myself in a spirited exchange with a colleague about whether the e-discovery community has suitable replacements for the Enron email corpora¹ (now more than two decades old) as a “sandbox” for testing tools and training students. I argued that the quality of the data matters: native or near-native email collections remain essential to test processing and review workflows in ways that mirror real-world litigation.
The back-and-forth reminded me that, unlike forensic examiners or service providers, e-discovery lawyers may not know or care much about the nature of electronically stored information until it finds its way to a review tool. I get that. If your interest in email lies in testing AI coding tools, you’re laser-focused on text and maybe a handful of metadata fields; but if your focus is on the integrity and authenticity of evidence, or on perfecting processing tools, the originating native or near-native form of the corpus matters more.
What follows is a re-publication of a post from July 2013. I’m bringing it back because the debate over forms of email hasn’t gone away; the issue is as persistent and important as ever. A central takeaway bears repeating: the litmus test is whether a corpus hews to a fully RFC 5322-compliant format. If headers, MIME boundaries, and transport artifacts are stripped or incompletely synthesized, what remains ceases to be a faithful native or near-native format. That distinction matters, because even experienced e-discovery practitioners (those fixated on review at the far-right side of the EDRM) may not fully appreciate what an RFC 5322 email is, or how much fidelity is lost when working with post-processed sets.
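For readers who want to see what that litmus test looks like in practice, here is a minimal sketch in Python using only the standard library's `email` package. It is a rough heuristic, not a compliance validator: it simply flags a message that lacks the two header fields RFC 5322 requires (`Date` and `From`), that carries no `Received` transport headers, or that declares a multipart body without a MIME boundary. The function name `fidelity_report`, the two header lists, and the sample message are my own illustrative assumptions, not part of any e-discovery tool.

```python
from email import message_from_bytes

# RFC 5322 requires only an origination date and an originator field.
REQUIRED_HEADERS = ("Date", "From")
# Assumption: fields a natively collected message usually carries;
# their absence hints at post-processing but does not prove it.
EXPECTED_HEADERS = ("Message-ID", "Received", "MIME-Version")


def fidelity_report(raw: bytes) -> dict:
    """Return a rough checklist of native-format signals for one raw message."""
    msg = message_from_bytes(raw)
    return {
        "missing_required": [h for h in REQUIRED_HEADERS if msg[h] is None],
        "missing_expected": [h for h in EXPECTED_HEADERS if msg[h] is None],
        # Each mail server that relays a message prepends one Received header.
        "received_hops": len(msg.get_all("Received", [])),
        # A multipart message with no boundary parameter cannot be split into parts.
        "multipart_without_boundary": (
            msg.get_content_maintype() == "multipart"
            and msg.get_boundary() is None
        ),
    }


# Hypothetical post-processed item: text survives, transport artifacts do not.
stripped = b"From: alice@example.com\r\nSubject: Hello\r\n\r\nHello.\r\n"
print(fidelity_report(stripped))
```

A message that passes these checks is not thereby proven authentic; the point is only that a corpus failing them across the board has almost certainly been flattened somewhere between the mail server and the review platform.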
