
Ball in your Court

~ Musings on e-discovery & forensics.


Still on Dial-Up: Why It’s Time to Retire the Enron Email Corpus

Friday, 15 Aug 2025

Posted by craigball in Computer Forensics, E-Discovery, General Technology Posts


Tags

corpora, E-Discovery, eDiscovery, Enron, ESI, forensics

Early this century, when I was gaining a reputation as a trial lawyer who understood e-discovery and digital forensics, I was hired to work as the lead computer forensic examiner for plaintiffs in a headline-making case involving a Houston-based company called Enron. It was a heady experience.

Today, everywhere you turn in e-discovery, Enron is still with us. Not the company that went down in flames more than two decades ago, but the Enron Email Corpus, the industry’s default demo dataset.

Type in “Ken Lay” or “Andy Fastow,” hit search, and watch the results roll in. For vendors, it’s the easy choice: free, legal, and familiar. But in 2025, it’s also frozen in time, benchmarking the future of discovery against the technological equivalent of a rotary phone. Or, now that AOL has lately retired its dial-up service, against a 56K modem.

How Enron Became Everyone’s Test Data

When Enron collapsed in 2001 amid accounting fraud and market-manipulation scandals, the U.S. Federal Energy Regulatory Commission (FERC) launched a sweeping investigation into abuses during the Western U.S. energy crisis. As part of that probe, FERC collected huge volumes of internal Enron email.

In 2003, in an extraordinary act of transparency, FERC made a subset of those emails public as part of its docket. Some messages were removed at employees’ request; all attachments were stripped.

The dataset got a second life when Carnegie Mellon University’s School of Computer Science downloaded the FERC release, cleaned and structured it into individual mailboxes, and published it for research. That CMU version contains roughly half a million messages from about 150 Enron employees.

A few years later, the Electronic Discovery Reference Model (EDRM), where I serve as General Counsel, stepped in to make the corpus more accessible to the legal tech world. EDRM curated, repackaged, and hosted improved versions, including PST-structured mailboxes and more comprehensive metadata. Even after CMU stopped hosting it, EDRM kept it available for years, ensuring that anyone building or testing e-discovery tools had a free, legal dataset to use. [Note: EDRM no longer hosts the Enron corpus, but those who like hunting antiques may find it (or parts of it) at CMU, Enrondata.org, Kaggle.com and, no joke, The Library of Congress.]

Because it’s there, lawful, and easy, Enron became—and regrettably remains—the de facto benchmark in our industry.

Why Enron Endures

Its virtues are obvious:

  • Free and lawful to use
  • Large enough to exercise search and analytics tools
  • Real corporate communications with all their messy quirks
  • Familiar to the point of being an industry standard

But those virtues are also the trap. The data is from 2001—before smartphones, Teams, Slack, Zoom, linked attachments, and nearly every other element that makes modern email review challenging.

In 2025, running Enron through a discovery platform is like driving a Formula One race car on cobblestone streets.
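
For readers who want to see how dated their own copy is, here is a minimal sketch in Python that counts messages per mailbox and tallies them by the year in each Date header. It assumes a CMU-style layout of one folder per custodian holding plain-text message files; the function name summarize_corpus and that folder layout are my assumptions for illustration, not anything published by FERC, CMU, or EDRM.

```python
from collections import Counter
from email import message_from_binary_file
from email.utils import parsedate_to_datetime
from pathlib import Path


def summarize_corpus(root):
    """Count messages per mailbox folder and per year of the Date header.

    Assumes (hypothetically) that each first-level folder under `root`
    is one custodian's mailbox and every file beneath it is one message.
    """
    root = Path(root)
    per_mailbox = Counter()
    per_year = Counter()
    for path in root.rglob("*"):
        if not path.is_file():
            continue
        with path.open("rb") as fh:
            msg = message_from_binary_file(fh)
        # The first path component under root is treated as the mailbox name.
        per_mailbox[path.relative_to(root).parts[0]] += 1
        raw_date = msg.get("Date")
        try:
            year = parsedate_to_datetime(str(raw_date)).year if raw_date else None
        except (TypeError, ValueError):
            year = None  # missing or unparseable Date header
        per_year[year] += 1
    return per_mailbox, per_year
```

Calling summarize_corpus on the root of an extracted copy returns two counters. Messages with a missing or unparseable date are grouped under None rather than dropped, so the totals still reconcile with the file count.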
