Early this century, when I was gaining a reputation as a trial lawyer who understood e-discovery and digital forensics, I was hired to work as the lead computer forensic examiner for plaintiffs in a headline-making case involving a Houston-based company called Enron. It was a heady experience.
Today, everywhere you turn in e-discovery, Enron is still with us. Not the company that went down in flames more than two decades ago, but the Enron Email Corpus, the industry’s default demo dataset.
Type in “Ken Lay” or “Andy Fastow,” hit search, and watch the results roll in. For vendors, it’s the easy choice: free, legal, and familiar. But for 2025, it’s also frozen in time—benchmarking the future of discovery against the technological equivalent of a rotary phone. Or, now that AOL has lately retired its dial-up service, benchmarking it against a 56K modem.
How Enron Became Everyone’s Test Data
When Enron collapsed in 2001 amid accounting fraud and market-manipulation scandals, the U.S. Federal Energy Regulatory Commission (FERC) launched a sweeping investigation into abuses during the Western U.S. energy crisis. As part of that probe, FERC collected huge volumes of internal Enron email.
In 2003, in an extraordinary act of transparency, FERC made a subset of those emails public as part of its docket. Some messages were removed at employees’ request; all attachments were stripped.
The dataset got a second life when Carnegie Mellon University’s School of Computer Science downloaded the FERC release, cleaned and structured it into individual mailboxes, and published it for research. That CMU version contains roughly half a million messages from about 150 Enron employees.
A few years later, the Electronic Discovery Reference Model (EDRM)—where I serve as General Counsel—stepped in to make the corpus more accessible to the legal tech world. EDRM curated, repackaged, and hosted improved versions, including PST-structured mailboxes and more comprehensive metadata. Even after CMU stopped hosting it, EDRM kept it available for years, ensuring that anyone building or testing e-discovery tools had a free, legal dataset to use. [Note: EDRM no longer hosts the Enron corpus, but for those who like hunting antiques, you may find it (or parts of it) at CMU, Enrondata.org, Kaggle.com and, no joke, The Library of Congress].
Because it’s there, lawful, and easy, Enron became—and regrettably remains—the de facto benchmark in our industry.
Why Enron Endures
Its virtues are obvious:
- Free and lawful to use
- Large enough to exercise search and analytics tools
- Real corporate communications with all their messy quirks
- Familiar to the point of being an industry standard
But those virtues are also the trap. The data is from 2001—before smartphones, Teams, Slack, Zoom, linked attachments, and nearly every other element that makes modern email review challenging.
In 2025, running Enron through a discovery platform is like driving a Formula One race car on cobblestone streets.
Why There’ll Never Be Another Enron
The conditions that produced the Enron corpus are gone. Back then, there was:
- A spectacular corporate collapse (OK, still plenty of these)
- Bankruptcy, eliminating any ongoing operations to protect
- A regulator willing to make internal corporate communications public, and most crucially,
- A legal environment without GDPR, CCPA, or modern privacy expectations
Post-2000 privacy laws, heightened sensitivity to privilege and aggressive litigation strategies make a wholesale release of modern corporate mailboxes virtually impossible. Even massive corporate implosions now end in settlements, with data returned, destroyed, or locked behind protective orders.
Chances are we will never see the likes of the Enron email release.
When ‘Safe’ Isn’t Good Enough
That leaves us with a problem. The only large, lawful, realistic corporate dataset we all share is woefully out of date. The “safe” choice—Enron—is the wrong choice if you want to see how a 2025 platform handles contemporary realities:
- Cloud-hosted mail systems
- Embedded chat and meeting content
- Mobile-generated messages
- Linked attachments
- Mixed MIME types and non-standard encodings
- Encryption, redactions, and multifile attachments
Teaching e-discovery and digital evidence at the University of Texas Schools of Law, Computer Science and Information Science, I sidestep that by using something more modern and manageable: the John Podesta emails. Yes, they were stolen and released without consent. Yes, that’s ethically and legally fraught. But they’re from 2015, structured as PSTs with full headers and attachments, and a suitable size for students—around 50,000 messages. In a controlled educational setting, with disclaimers, they supply a realistic glimpse of the formats, quirks, and challenges of modern email collections that Enron simply can’t provide.
I wouldn’t expect vendors to demo with Podesta considering the optics, risks and size, but it underscores the problem: the most realistic training data is often the least “safe” to use.
A Practical Path Forward
If we can’t have another lawful, massive corporate email release, we need to stop pretending Enron is a valid stand-in and build datasets that reflect the present. That means:
- Synthetic corpora: purpose-built to mimic modern corporate environments, with realistic metadata, message formats, and attachment types. One example of this is the Avocado Research Email Collection available from the Linguistic Data Consortium, but it, too, is old data and too costly and restricted for general use.
- FOIA-based collections: modern government email releases converted into structured, searchable corpora.
- Anonymized donor data: corporate partners willing to contribute sanitized, non-privileged communications for research.
- Blended corpora: combinations of lawful sources to produce realistic size, variety, and complexity.
These won’t have the mystique of a real scandal’s raw inboxes, but they will give us a far better sense of how today’s tools handle today’s ESI challenges.
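For anyone weighing a candidate corpus against the checklist above, a rough profile of its MIME types, character sets, and attachments is a reasonable first look. What follows is a minimal sketch using only Python's standard library, not a vetted tool. It assumes the messages are already available as raw RFC 5322 bytes (for example, individual EML files). The names `profile_messages` and `build_sample_message` are hypothetical, invented for illustration.

```python
from collections import Counter
from email import message_from_bytes, policy
from email.message import EmailMessage


def profile_messages(raw_messages):
    """Tally MIME types, charsets, and attachments across raw email messages.

    raw_messages is an iterable of bytes objects, each one a complete
    message with headers (an assumption about how the corpus is stored).
    """
    mime_types = Counter()
    charsets = Counter()
    total = 0
    with_attachments = 0

    for raw in raw_messages:
        msg = message_from_bytes(raw, policy=policy.default)
        total += 1
        has_attachment = False

        for part in msg.walk():
            # Multipart containers only hold other parts; skip them.
            if part.is_multipart():
                continue
            mime_types[part.get_content_type()] += 1
            charset = part.get_content_charset()
            if charset:
                charsets[charset] += 1
            if part.get_content_disposition() == "attachment":
                has_attachment = True

        if has_attachment:
            with_attachments += 1

    return {
        "messages": total,
        "with_attachments": with_attachments,
        "mime_types": mime_types,
        "charsets": charsets,
    }


def build_sample_message():
    """Build one small message with an attachment, purely for demonstration."""
    msg = EmailMessage()
    msg["From"] = "alice@example.com"
    msg["To"] = "bob@example.com"
    msg["Subject"] = "Q3 forecast"
    msg.set_content("See attached.")
    msg.add_attachment(
        b"\x00\x01\x02",
        maintype="application",
        subtype="octet-stream",
        filename="forecast.bin",
    )
    return msg.as_bytes()


if __name__ == "__main__":
    report = profile_messages([build_sample_message()])
    print(report["messages"], report["with_attachments"])
    print(dict(report["mime_types"]))
```

A corpus that reports no attachments, a single MIME type, and one character set is unlikely to exercise a platform the way a modern collection would. A profile like this says nothing about linked attachments or chat content, which would need their own checks.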
Time to Log Off
Enron doesn’t need to vanish. It’s part of our field’s history, and for researchers in some settings it still has value. But for evaluating modern discovery software, it’s a dinosaur from the dial-up era, and it’s past time we stopped pretending otherwise.
The next time you see Enron in a demo, ask: If your tool can only shine on data from 2001, how will it perform on what I’m dealing with today? That’s the real benchmark.
For more on this, please look at tomorrow’s post: Native or Not? Rethinking Public E-Mail Corpora for E-Discovery (Redux, 2013→2025)
Postscript: After I penned this, I asked ChatGPT, “Has anyone published advocating for retirement of the Enron corpus in ediscovery?” It pointed to a 2019 post on CloudNine Software blog titled, “The Enron Data Set is No Longer a Representative Test Data Set: eDiscovery Best Practices.” The post carries no byline, but my guess is it’s the work of my dear friend and blogger extraordinaire, Doug Austin from Houston. Credit where it’s due.

Patrick T Cronin said:
Check out Kaggle; they have several email datasets (yes, including Enron). As tagged emails, these are really good for GenAI training.
craigball said:
Thanks. I perused their offerings and failed to find anything suitable in a native or near-native format. FWIW, I feel strongly that the corpus sought should be a compressed PST or MBOX, or a collection of EML or MSG single messages with full headers and attachments. A real collection with the full range of real-world anomalies we encounter.
simsong said:
Hi. I’m surprised you are not aware of the Digital Corpora, at https://digitalcorpora.org/ ? Simson
craigball said:
I am well aware, but I had no idea that it has come to include a sizeable real-world collection of email in native- or near-native format. When I looked just now, I saw only the same old samples circa 2012.
simsong said:
There’s more recent stuff than 2012. But it’s all scenario based. As you note, privacy issues preclude actual emails from real people. The real problem is that email typically is multi-party.
However, if you look here, you’ll see a lot of things since 2012, including some very useful scenarios: https://digitalcorpora.s3.amazonaws.com/s3_browser.html#corpora/scenarios/
Doug Austin said:
Hi Craig,
Seedless is a relatively new company that “generates realistic fictional data, like emails, chat messages, contracts, reports, patient health information and financial records, to enable companies to test, train and ultimately trust AI without compromising privacy or security.”
https://www.seedlessdata.com/
I can’t personally speak for the quality (or quantity) of the data they can generate, but their CEO is an industry expert with leadership experience in corporate, law firm and provider organizations.
Josh and I were going to do an interview but got sidetracked. I’ll reach out to him to see if we can get that done in the next week or so.
P.S.: Thanks for the shout out! As indicated by the date of the post, I’ve been looking for an alternative data set for a long time. So far, Enron is still the best we have.