Early this century, when I was gaining a reputation as a trial lawyer who understood e-discovery and digital forensics, I was hired to work as the lead computer forensic examiner for plaintiffs in a headline-making case involving a Houston-based company called Enron. It was a heady experience.
Today, everywhere you turn in e-discovery, Enron is still with us. Not the company that went down in flames more than two decades ago, but the Enron Email Corpus, the industry’s default demo dataset.
Type in “Ken Lay” or “Andy Fastow,” hit search, and watch the results roll in. For vendors, it’s the easy choice: free, legal, and familiar. But in 2025, it’s also frozen in time, which means benchmarking the future of discovery against the technological equivalent of a rotary phone. Or, now that AOL has retired its dial-up service, against a 56K modem.
How Enron Became Everyone’s Test Data
When Enron collapsed in 2001 amid accounting fraud and market-manipulation scandals, the U.S. Federal Energy Regulatory Commission (FERC) launched a sweeping investigation into abuses during the Western U.S. energy crisis. As part of that probe, FERC collected huge volumes of internal Enron email.
In 2003, in an extraordinary act of transparency, FERC made a subset of those emails public as part of its docket. Some messages were removed at employees’ request; all attachments were stripped.
The dataset got a second life when Carnegie Mellon University’s School of Computer Science downloaded the FERC release, cleaned and structured it into individual mailboxes, and published it for research. That CMU version contains roughly half a million messages from about 150 Enron employees.
A few years later, the Electronic Discovery Reference Model (EDRM), where I serve as General Counsel, stepped in to make the corpus more accessible to the legal tech world. EDRM curated, repackaged, and hosted improved versions, including PST-structured mailboxes and more comprehensive metadata. Even after CMU stopped hosting it, EDRM kept it available for years, ensuring that anyone building or testing e-discovery tools had a free, legal dataset to use. [Note: EDRM no longer hosts the Enron corpus, but those who like hunting antiques may find it (or parts of it) at CMU, Enrondata.org, Kaggle.com and, no joke, the Library of Congress.]
Because it’s there, lawful, and easy, Enron became—and regrettably remains—the de facto benchmark in our industry.
Why Enron Endures
Its virtues are obvious:
- Free and lawful to use
- Large enough to exercise search and analytics tools
- Real corporate communications with all their messy quirks
- Familiar to the point of being an industry standard
But those virtues are also the trap. The data is from 2001—before smartphones, Teams, Slack, Zoom, linked attachments, and nearly every other element that makes modern email review challenging.
In 2025, running Enron through a discovery platform is like driving a Formula One race car on cobblestone streets.
