Recently, I’ve weighed in on disputes where the parties were fighting over whether the e-mail production was sufficiently “native” to comply with the court’s orders to produce natively. In one matter, the question was whether Gmail could be produced in a native format, and in another, the parties were at odds about what forms are native to Microsoft Exchange e-mail. In each instance, I saw two answers: the technically correct one and the helpful one.
I am a vocal proponent of native production for e-discovery. Native is complete. Native is functional. Native is inherently searchable. Native costs less. I’ve explored these advantages in other writings and will spare you that here. But when I speak of “native” production in the context of databases, I am using a generic catchall term to describe electronic forms with superior functionality and completeness, notwithstanding the common need in e-discovery to produce less than all of a collection of ESI.
It’s a Database
When we deal with e-mail in e-discovery, we are usually dealing with database content. Microsoft Exchange, an e-mail server application, is a database. Microsoft Outlook, an e-mail client application, is a database. Gmail, a SaaS webmail application, is a database. Lotus Domino, Lotus Notes, Yahoo! Mail, Hotmail and Novell GroupWise—they’re all databases. It’s important to understand this at the outset because if you think of e-mail as a collection of discrete objects (like paper letters in a manila folder), you’re going to have trouble understanding why defining the “native” form of production for e-mail isn’t as simple as many imagine.
Native in Transit: Text per a Protocol
E-mail is one of the oldest computer networking applications. Before people were sharing printers, and long before the internet was a household word, people were sending e-mail across networks. That early e-mail was plain text, also called ASCII text or 7-bit (because you need just seven bits of data, one less than a byte, to represent each ASCII character). In those days, there were no attachments, no pictures, not even simple enhancements like bold, italic or underline.
Early e-mail was something of a free-for-all, implemented differently by different systems. So the fledgling internet community circulated proposals seeking a standard. They stuck with plain text in order that older messaging systems could talk to newer systems. These proposals were called Requests for Comment or RFCs, and they came into widespread use as much by convention as by adoption (the internet being a largely anarchic realm). The RFCs lay out the form an e-mail should adhere to in order to be compatible with e-mail systems.
The RFCs concerning e-mail have gone through several major revisions since the first one circulated in 1973. The latest protocol revision is called RFC 5322 (2008), which made obsolete RFC 2822 (2001) and its predecessor, RFC 822 (1982). Another series of RFCs (RFC 2045-47, RFC 4288-89 and RFC 2049), collectively called Multipurpose Internet Mail Extensions or MIME, address ways to graft text enhancements, foreign language character sets and multimedia content onto plain text emails. These RFCs establish the form of the billions upon billions of e-mail messages that cross the internet.
So, if you asked me to state the native form of an e-mail as it traversed the internet between mail servers, I’d likely answer, “plain text (7-bit ASCII) adhering to RFC 5322 and MIME.” In my experience, this is the same as saying “.EML format,” and it can be functionally the same as the MHT format, but only if the content of each message adheres strictly to the RFC and MIME protocols listed above. You can even change the file extension of a properly formatted message from EML to MHT and back in order to open the file in a browser or in a mail client like Outlook 2010. Try it. If you want to see what the native “plain text in transit” format looks like, change the extension from .EML to .TXT and open the file in Windows Notepad.
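To make that concrete, here is a minimal Python sketch using the standard library's email package. It builds a made-up message with a binary attachment, serializes it the way it would be written to an .EML file, and checks that every byte is 7-bit ASCII because MIME has encoded the attachment as base64 text. The addresses, subject and attachment are invented for illustration.

```python
from email import policy
from email.message import EmailMessage
from email.parser import BytesParser

# Build a small message with a text body and one binary attachment.
# Every address and value here is made up for illustration.
msg = EmailMessage()
msg["From"] = "alice@example.com"
msg["To"] = "bob@example.com"
msg["Subject"] = "Quarterly numbers"
msg["Date"] = "Mon, 29 Apr 2013 09:30:00 -0500"
msg.set_content("Numbers attached.\n")
msg.add_attachment(bytes(range(256)), maintype="application",
                   subtype="octet-stream", filename="numbers.bin")

# Serialize with CRLF line endings, as the message would look in transit
# or in an .eml file.
wire = msg.as_bytes(policy=policy.SMTP)

# The attachment is binary, yet the serialized message is all 7-bit ASCII
# because MIME carries the attachment as base64 text.
is_seven_bit = all(b < 128 for b in wire)

# Parse it back, much as a mail client would on import.
parsed = BytesParser(policy=policy.default).parsebytes(wire)
attachment = next(parsed.iter_attachments())
print(is_seven_bit, parsed["Subject"], attachment.get_filename())
```

Writing `wire` to a file with an .eml extension should give you something a typical mail client can open, and renaming it to .txt shows the plain text described above.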
The appealing feature of producing e-mail in exactly the same format in which the message traversed the internet is that it’s a form that holds the entire content of the message (header, message bodies and encoded attachments), and it’s a form that’s about as compatible as it gets in the e-mail universe. 
Unfortunately, the form of an e-mail in transit is often incomplete in terms of metadata it acquires upon receipt that may have probative or practical value; and the format in transit isn’t native to the most commonly-used e-mail server and client applications, like Microsoft Exchange and Outlook. It’s from these applications–these databases–that e-mail is collected in e-discovery.
Outlook and Exchange
Microsoft Outlook and Microsoft Exchange are database applications that talk to each other using a protocol (in effect, a shared language for the two programs) called MAPI, for Messaging Application Programming Interface. Microsoft Exchange is an e-mail server application that supports functions like contact management, calendaring, to-do lists and other productivity tools. Microsoft Outlook is an e-mail client application that accesses the contents of a user’s account on the Exchange Server and may synchronize such content with local (i.e., retained by the user) container files supporting offline operation. If you can read your Outlook e-mail without a network connection, you have a local storage file.
Practice Tip (and Pet Peeve): When your client or company runs Exchange Server and someone asks what kind of e-mail system your client or company uses, please don’t say “Outlook.” That’s like saying “iPhone” when asked what cell carrier you use. Outlook can serve as a front-end client to Microsoft Exchange, Lotus Domino and most webmail services; so saying “Outlook” just makes you appear out of your depth (assuming you are someone who’s supposed to know something about the evidence in the case).
Outlook: The native format for data stored locally by Outlook is a file or files with the extension PST or OST. Henceforth, I’m going to speak only of PSTs, but know that either variant may be seen. PSTs are container files. They hold collections of e-mail—typically stored in multiple folders—as well as content supporting other Outlook features. The native PST found locally on the hard drive of a custodian’s machine will hold all of the Outlook content that the custodian can see when not connected to the e-mail server.
Because Outlook is a database application designed for managing messaging, it goes well beyond simply receiving messages and displaying their content. Outlook begins by taking messages apart and using the constituent information to populate various fields in a database. What we see as an e-mail message using Outlook is actually a report queried from a database. The native form of Outlook e-mail carries these fields and adds metadata not present in the transiting message. The added metadata fields include such information as the name of the folder in which the e-mail resides, whether the e-mail was read or flagged and its date and time of receipt. Moreover, because Outlook is designed to “speak” directly to Exchange using their own MAPI protocol, messages between Exchange and Outlook carry MAPI metadata not present in the “generic” RFC 5322 messaging. Whether this MAPI metadata is superfluous or invaluable depends upon what questions may arise concerning the provenance and integrity of the message. Most of the time, you won’t miss it. Now and then, you’ll be lost without it.
Because Microsoft Outlook is so widely used, its PST file format is widely supported by applications designed to view, process and search e-mail. Moreover, the complex structure of a PST is so well understood that many commercial applications can parse PSTs into single message formats or assemble single messages into PSTs. Accordingly, it’s feasible to produce responsive messaging in a PST format while excluding messages that are non-responsive or privileged. It’s also feasible to construct a production PST without calendar content, contacts, to-do lists and the like. You’d be hard-pressed to find a better form of production for Exchange/Outlook messaging. Here, I’m defining “better” in terms of completeness and functionality, not compatibility with your ESI review tools.
MSGs: There’s little room for debate that the PST or OST container files are the native forms of data storage and interchange for a collection of messages (and other content) from Microsoft Outlook. But is there a native format for individual messages from Outlook, like the RFC 5322 format discussed above? The answer isn’t clear cut. On the one hand, if you were to drag a single message from Outlook to your Windows desktop, Outlook would create that message in its proprietary MSG format. The MSG format holds the complete content of its RFC 5322 cousin plus additional metadata; but it lacks information (like foldering data) that’s contained within a PST. It’s not “native” in the sense that it’s not a format that Outlook uses day-to-day; but it’s an export format that holds more message metadata unique to Outlook. All we can say is that the MSG file is a highly compatible near-native format for individual Outlook messages–more complete than the transiting e-mail and less complete than the native PST. Though it’s encoded in a proprietary Microsoft format (i.e., it’s not plain text), the MSG format is so ubiquitous that, like PSTs, many applications support it as a standard format for moving messages between applications.
Exchange: The native format for data housed in an Exchange server is its database file, prosaically called the Exchange Database and sporting the file extension .EDB. The EDB holds the account content for everyone in the mail domain; so unless the case is the exceedingly rare one that warrants production of all the e-mail, attachments, contacts and calendars for every user, no litigant hands over their EDB.
It may be possible to create an EDB that contains only messaging from selected custodians (and excludes privileged and non-responsive content) such that you could really, truly produce in a native form. But, I’ve never seen it done that way, and I can’t think of anything to commend it over simpler approaches.
So, if you’re not going to produce in the “true” native format of EDB, the desirable alternatives left to you are properly called “near-native,” meaning that they preserve the requisite content and essential functionality of the native form, but aren’t the native form. If an alternate form doesn’t preserve content and functionality, you can call it whatever you want. I lean toward “garbage,” but to each his own.
E-mail is a species of ESI that doesn’t suffer as mightily as, say, Word documents or Excel spreadsheets when produced in non-native forms. If one were meticulous in their text extraction, exacting in their metadata collection and careful in their load file construction, one could produce Exchange content in a way that’s sufficiently complete and utile as to make a departure from the native less problematic—assuming, of course, that one produces the attachments in their native forms. That’s a lot of “ifs,” and what will emerge is sure to be incompatible with e-mail client applications and native review tools.
Litmus Test: Perhaps we have the makings of a litmus test to distinguish functional near-native forms from dysfunctional forms like TIFF images and load files: Can the form produced be imported into common e-mail client or server applications?
You have to admire the simplicity of such a test. If the e-mail produced is so distorted that not even e-mail programs can recognize it as e-mail, that’s a fair and objective indication that the form of production has strayed too far from its native origins.
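A rough version of that litmus test can be sketched in a few lines of Python. This is only an approximation under my own assumptions: it asks whether the standard library's parser can read the file as a message, finds the headers I've arbitrarily chosen as a minimum, and records no structural defects. Real mail clients apply their own rules on import, so treat `REQUIRED_HEADERS` and `looks_like_email` as hypothetical names for illustration.

```python
from email import policy
from email.parser import BytesParser

# Hypothetical minimum: headers we expect any genuine e-mail to carry.
REQUIRED_HEADERS = ("From", "Date")

def looks_like_email(raw: bytes) -> bool:
    """Return True if raw parses as a message that has the expected
    headers and for which the parser recorded no top-level defects."""
    msg = BytesParser(policy=policy.default).parsebytes(raw)
    has_headers = all(msg[name] is not None for name in REQUIRED_HEADERS)
    return has_headers and not msg.defects

# A well-formed message, as it might appear in an .eml file.
good = (b"From: alice@example.com\r\n"
        b"Date: Mon, 29 Apr 2013 09:30:00 -0500\r\n"
        b"Subject: Lunch\r\n"
        b"\r\n"
        b"Noon works.\r\n")

# Extracted text of the kind that accompanies a TIFF image: the words
# survive, but nothing marks it as an e-mail.
bad = b"ABC0000123\r\nNoon works.\r\n"

print(looks_like_email(good), looks_like_email(bad))
```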
The question whether it’s feasible to produce Gmail in its native form triggered an order by U.S. Magistrate Judge Mark J. Dinsmore in a case styled Keaton v. Hannum, 2013 U.S. Dist. LEXIS 60519 (S.D. Ind. Apr. 29, 2013). It’s a seamy, sad suit brought pro se by an attorney named Keaton against both his ex-girlfriend, Christine Zook, and the cops who arrested Keaton for stalking Zook. It got my attention because the court cited a blog post I made three years ago. The Court wrote:
Zook has argued that she cannot produce her Gmail files in a .pst format because no native format exists for Gmail (i.e., Google) email accounts. The Court finds this to be incorrect based on Exhibit 2 provided by Zook in her Opposition Brief. [Dkt. 92 at Ex. 2 (Ball, Craig: Latin: To Bring With You Under Penalty of Punishment, EDD Update (Apr. 17, 2010)).] Exhibit 2 explains that, although Gmail does not support a “Save As” feature to generate a single message format or PST, the messages can be downloaded to Outlook and saved as .eml or .msg files, or, as the author did, generate a PDF Portfolio – “a collection of multiple files in varying format that are housed in a single, viewable and searchable container.” [Id.] In fact, Zook has already compiled most of her archived Gmail emails between her and Keaton in a .pst format when Victim.pst was created. It is not impossible to create a “native” file for Gmail emails.
Id. at 3.
I’m gratified when a court cites my work, and here, I’m especially pleased that the Court took an enlightened approach to “native” forms in the context of e-mail discovery. Of course, one strictly defining “native” to exclude near-native forms might be aghast at the loose lingo; but the more important takeaway from the decision is the need to strive for the most functional and complete forms when true native is out-of-reach or impractical.
Gmail is a giant database in a Google data center someplace (or in many places). I’m sure I don’t know what the native file format for cloud-based Gmail might be. Mere mortals don’t get to peek at the guts of Google. But, I’m also sure that it doesn’t matter, because even if I could name the native file format, I couldn’t obtain that format, nor could I faithfully replicate its functionality locally.
Since I can’t get “true” native, how can I otherwise mirror the completeness and functionality of native Gmail? After all, a litigant doesn’t seek native forms for grins. A litigant seeks native forms to secure the unique benefits native brings, principally functionality and completeness.
There is a range of options for preserving a substantial measure of the functionality and completeness of Gmail. One would be to produce in Gmail.
Yes, you could conceivably open a fresh Gmail account for production, populate it with responsive messages and turn over the access credentials for same to the requesting party. That’s probably as close to true native as you can get (though some metadata will change), and it flawlessly mirrors the functionality of the source. Still, it’s not what most people expect or want. It’s certainly not a form they can pull into their favorite e-discovery review tool.
Alternatively, as the Court noted in Keaton v. Hannum, an IMAP capture to a PST format (using Microsoft Outlook or a collection tool) is a practical alternative. The resultant PST won’t look or work exactly like Gmail (i.e., messages won’t thread in the same way and flagging will be different); but it will supply a large measure of the functionality and completeness of the Gmail source. Plus, it’s a form that lends itself to many downstream processing options.
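For readers who want to see the moving parts of an IMAP capture, here is a minimal Python sketch using the standard library's imaplib. It is not the Outlook-to-PST route the Court described: Python's standard library cannot write PSTs, so this version saves each message as an .eml file in a directory named for the folder. The function name `capture_folder`, the host name, the account and the password in the commented usage are placeholders, and I'm assuming the account has IMAP access enabled and suitable credentials.

```python
import imaplib
from pathlib import Path

def capture_folder(conn, folder: str, out_dir: Path) -> int:
    """Save every message in one IMAP folder as an .eml file.

    conn is an already-authenticated imaplib.IMAP4 or IMAP4_SSL
    connection. Returns the number of messages written.
    """
    out_dir.mkdir(parents=True, exist_ok=True)
    # Open read-only so the capture does not change flags on the server.
    conn.select(folder, readonly=True)
    status, data = conn.search(None, "ALL")
    if status != "OK":
        raise RuntimeError(f"search failed for {folder!r}")
    count = 0
    for num in data[0].split():
        status, parts = conn.fetch(num.decode(), "(RFC822)")
        if status != "OK":
            continue
        raw = parts[0][1]  # the full RFC 5322 message as bytes
        (out_dir / f"{num.decode()}.eml").write_bytes(raw)
        count += 1
    return count

# Usage sketch, not executed here; host and credentials are placeholders:
# conn = imaplib.IMAP4_SSL("imap.example.com")
# conn.login("custodian@example.com", "password-goes-here")
# capture_folder(conn, "INBOX", Path("capture/INBOX"))
# conn.logout()
```

Files are named by IMAP sequence number for simplicity. A defensible collection would more likely key on stable identifiers and log what was captured, and folder names containing spaces need quoting before they're passed to `select`.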
So, What’s the native form of that e-mail?
Which answer do you want: the technically correct one or the helpful one? No one is a bigger proponent of native production than I am; but I’m finding that litigants can get so caught up in the quest for native that they lose sight of what truly matters.
Where e-mail is concerned, we should be less captivated by the term “native” and more concerned with specifying the actual form or forms that are best suited to supporting what we need and want to do with the data. That means understanding the differences between the forms (e.g., what information they convey and their compatibility with review tools), not just demanding native like it’s a brand name.
When I seek “native” for a Word document or an Excel spreadsheet, it’s because I recognize that the entire native file—and only the native file—supports the level of completeness and functionality I need, a level that can’t be fairly replicated in any other form. But when I seek native production of e-mail, I don’t expect to receive the entire “true” native file. I understand that responsive and privileged messages must be segregated from the broader collection and that there are a variety of near native forms in which the responsive subset can be produced so as to closely mirror the completeness and functionality of the source.
When it comes to e-mail, what matters most is getting all the important information within and about the message in a fielded form that doesn’t completely destroy its character as an e-mail message.
So let’s not get too literal about native forms when it comes to e-mail. Don’t seek native to prove a point. Seek native to prove your case.
Postscript: When I publish an article extolling the virtues of native production, I usually get a comment or two saying, “TIFF and load files are good enough.” I can’t always tell if the commentator means “good enough to fairly serve the legitimate needs of the case” or “good enough for those sleazy bastards on the other side.” I suspect they mean both. Either way, it might surprise readers to know that, when it comes to e-mail, I agree with the first assessment…with a few provisos.
First, TIFF and load file productions can be good enough for production of e-mail if no one minds paying more than necessary. It generally costs more to extract text and convert messages to images than it does to leave them in a native or near-native form. But that’s only part of the extra expense. TIFF images of messages are MUCH larger files than their native or near-native counterparts. With so many service providers charging for ingestion, processing, hosting and storage of ESI on a per-gigabyte basis, those bigger files continue to chew away at both sides’ bottom lines, month after month.
Second, TIFF and load file productions are good enough for those who only have tools to review TIFF and load file productions. There’s no point in giving light bulbs to those without electricity. On the other hand, just because you don’t pay your light bill, must I sit in the dark?
Third, because e-mails and attachments have the unique ability to be encoded entirely in plain text, a load file can carry the complete contents of a message and its attachments as RFC 5322-compliant text accompanied by MAPI metadata fields. It’s one of the few instances where it’s possible to furnish a load file that simply and genuinely compensates for most of the shortcomings of TIFF productions. Yet, it’s not done.
Finally, TIFF and load file productions are good enough for requesting parties who just don’t care. A lot of requesting parties fall into that category, and they’re not looking to change. They just want to get the e-mail, and they don’t give a flip about cost, completeness, utility, metadata, efficiency, authentication or any of the rest. If both sides and the court are content not to care, TIFF and load files really are good enough.
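To illustrate the third proviso, here is a minimal Python sketch of a load file in which one field holds the entire RFC 5322 text of a message and the remaining fields hold metadata of the kind Outlook adds on receipt. The field names (`BegBates`, `Folder`, `Read`, `ReceivedUTC`, `RFC5322Text`) and the sample values are hypothetical; an actual production protocol would specify its own fields and delimiters.

```python
import csv
import io

# Hypothetical field names, chosen only for this illustration.
FIELDS = ["BegBates", "Folder", "Read", "ReceivedUTC", "RFC5322Text"]

def write_load_file(rows, out):
    """Write one delimited record per message, keeping the complete
    message text (headers and body) in a single quoted field."""
    writer = csv.DictWriter(out, fieldnames=FIELDS, quoting=csv.QUOTE_ALL)
    writer.writeheader()
    writer.writerows(rows)

# A made-up message in RFC 5322 form.
raw = ("From: alice@example.com\r\n"
       "To: bob@example.com\r\n"
       "Date: Mon, 29 Apr 2013 09:30:00 -0500\r\n"
       "Subject: Lunch\r\n"
       "\r\n"
       "Noon works.\r\n")

rows = [{
    "BegBates": "ABC0000123",
    "Folder": "Inbox",
    "Read": "true",
    "ReceivedUTC": "2013-04-29T14:30:05Z",
    "RFC5322Text": raw,
}]

# newline="" keeps the line breaks inside the quoted message field intact.
buf = io.StringIO(newline="")
write_load_file(rows, buf)

# Read the load file back; the message text survives the round trip.
buf.seek(0)
back = list(csv.DictReader(buf))
print(back[0]["Folder"], back[0]["RFC5322Text"] == raw)
```

Because the message field is still valid RFC 5322 text, a recipient could write it back out as an .eml file and carry the other fields along as metadata.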
 There’s even an established format for storing multiple RFC 5322 messages in a container format called mbox. The mbox format was described in 2005 in RFC 4155, and though it reflects a simple, reliable way to group e-mails in a sequence for storage, it lacks the innate ability to memorialize mail features we now take for granted, like message foldering. A common workaround is to create a single mbox file named to correspond to each folder whose contents it holds (e.g., Inbox.mbox).
 With a tip of the hat to Josh Gilliland, the blogger behind Bow Tie Law, who brought the Keaton decision to my attention.
 It was once possible to create complete, offline replications of Gmail using a technology called Gears; however, Google discontinued support of Gears some time ago. Gears’ successor, called “Gmail Offline for Chrome,” limits its offline collection to just a month’s worth of Gmail, making it a complete non-starter for e-discovery. Moreover, neither of these approaches employs true native forms as each was designed to support a different computing environment.
 IMAP (for Internet Message Access Protocol) is another way that e-mail client and server applications can talk to one another. IMAP version 4rev1 is described in RFC 3501. IMAP is not a form of e-mail storage; it is a means by which the structure (i.e., foldering) of webmail collections can be replicated in local mail client applications like Microsoft Outlook. Another way that mail clients communicate with mail servers is the Post Office Protocol or POP; however, POP is limited in important ways, including in its inability to collect messages stored outside a user’s Inbox. Further, POP does not replicate foldering. Outlook “talks” to Exchange servers using MAPI and to other servers and webmail services using IMAP (or via POP, if IMAP is not supported).
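Returning to the mbox workaround mentioned in the note on RFC 4155 above, here is a minimal Python sketch of the one-file-per-folder approach using the standard library's mailbox module. The function name `write_folder_mboxes`, the folder names and the sample message are my own inventions for illustration.

```python
import mailbox
from email.message import EmailMessage
from pathlib import Path

def write_folder_mboxes(folders: dict, out_dir: Path) -> None:
    """Write one mbox file per folder, named after the folder, since
    mbox itself has no way to record foldering."""
    out_dir.mkdir(parents=True, exist_ok=True)
    for folder_name, messages in folders.items():
        box = mailbox.mbox(str(out_dir / f"{folder_name}.mbox"))
        try:
            for msg in messages:
                box.add(msg)
            box.flush()
        finally:
            box.close()

# A made-up message to store.
note = EmailMessage()
note["From"] = "alice@example.com"
note["To"] = "bob@example.com"
note["Subject"] = "Lunch"
note.set_content("Noon works.\n")

# Usage sketch: one mbox per folder, e.g. Inbox.mbox and Sent.mbox.
# write_folder_mboxes({"Inbox": [note], "Sent": []}, Path("production"))
```

The folder name survives only in the file name, which is why this is a workaround and not a substitute for a container like a PST that records foldering itself.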