Surviving a Registration Bomb Attack

It started just after 7:00 last night.  My mailbox swelled with messages confirming I’d subscribed to websites and newsletters around the world.  Within an hour, I’d received over 2,000 such messages, and they kept pouring in until I’d gotten 4,000 registration confirmations by 11:00pm. After that, the flood slowed to a trickle.

I was the victim of a registration bomb attack, a scary experience if you don’t grasp what’s happening or know how to protect yourself.  Fortunately, it wasn’t my first rodeo. 

During a similar attack a couple of years ago, I was like a dog on the Fourth of July–I didn’t know what was happening or how to deal with it.  But this time, my nerves weren’t wracked: I knew what was afoot and where the peril lay.

Cybersecurity is not my principal field of practice, but it’s a forensics-adjacent discipline and one where I try to keep abreast of developments.  So, much as a trial lawyer relishes the rare chance to serve on a jury, I find being the target of a cyberattack as instructive as it is inconvenient.

While a registration bomb attack could be the work of a disgruntled reader (Hey! You can’t please everybody), more often it serves to mask an attack on legitimate accounts by burying notices of password resets, funds transfers or fraudulent credit card charges beneath a mountain of messages.  So, yes, you should treat a registration bomb attack as a call for immediate vigilance over your finances.  Keep a weather eye out for small transfers, especially deposits into a bank account, as these signal efforts to link your account to another as a prelude to theft.  Likewise, look at your credit card transactions to ensure that recent charges are legitimate.  Finally—and the hardest to do amidst a deluge of registration notices—look for efforts to change credentials for e-commerce websites you use like Walmart.com or Amazon.com.

A registration bomb attack is a powerful reminder of the value of always deploying multifactor authentication (MFA) to protect your banking, brokerage and credit card accounts.  Those extra seconds expended on secure logins will spare you hours and days lost to a breach.  With MFA in place, an attacker who succeeds in changing your credentials won’t have the access codes texted to your phone, thwarting efforts to rob you.

The good news is that, if you’re vigilant in the hours a registration bomb is exploding in your email account and you have MFA protecting your accounts, you’re in good shape.

Now for the bad news: a registration bomb is a distributed attack, meaning that it uses a botnet to enlist a legion of unwitting, innocent participants—genuine websites—to do the dirty work of clogging your email account with registration confirmation requests.  Because the websites emailing you are legitimate, there’s nothing about their email to trigger a spam filter until YOU label the message as spam. Unfortunately, that’s what you must do: select the attack messages and label each one as spam.  Don’t bother unsubscribing from the registrations; just label the messages as spam as quickly as you can.

This is a pain. And you must be attuned to the potential to mistakenly blacklist senders whose messages you want at the same time you’re squashing the spam messages you don’t want and scanning for password change notices from your banks, brokers and e-commerce vendors.  It’s easier when you know how to select multiple messages before hitting the “spam” button (in Gmail, holding down the Shift key enables you to select a range of messages by selecting the first and last message in the range).  Happily, the onslaught of registration spam will stop; thousands become hundreds and hundreds become dozens in just hours (though you’ll likely get stragglers for days).

Registration bombing attacks will continue so long as the web is built around websites sending registration confirmation messages—a process ironically designed to protect you from spam.   If you’ve deployed the essential mechanisms to protect yourself online, particularly strong, unique passwords, multifactor authentication and diligent review of accounts for fraudulent transactions, don’t panic; the registration bomb will be no more than a short-lived inconvenience.  This, too, shall pass.

Will AI Summarization Disrupt Discovery?

Reader’s Digest, the century-old magazine with the highest paid circulation, has long published “condensed” books: anthologies of four to five popular novels abridged to fit in a single volume.  Condensed Books were once enormously popular, with tens of millions of copies in circulation.  They were also an abomination to serious readers, a literary Tang for those who preferred fresh-squeezed OJ. I’ve never read a condensed book, so I’m in no position to judge their merit save to say that I believe reading anything is a good thing.  I imagine the condensed versions conveyed the guts of the story well enough to sound like you’d read it over drinks with the neighbors before the Ed Sullivan show.

But I am enough of a purist (okay, “snob”) to worry about the impact of summarization.  As an undergraduate English major, I had to wade through some challenging tomes.  I have no empirical evidence for it, but I’m certain those books are a part of me in ways they never would have been had I sought out the Cliffs Notes instead.  I expect most avid readers feel the same.  Summaries necessarily discard content, and what remains is incapable of conveying the same tone, nuance and detail.

So, I worry when the tech industry touts the value of AI summarization of documents, especially as a means of speeding identification and review of evidence in discovery.  I question whether the “Reader’s Digest Condensed Evidence” will convey the same tone, nuance and detail that characterize responsive productions.  Will distillation be made of distillations until genuine intelligence is lost altogether? 

It’s an inchoate apprehension—an old man’s anxiety perhaps—but litigation is about human behavior, human frailty and failings.  I fear too much humanity will disappear in AI-generated summaries with the underlying communications less likely to see the light of day.  The mandate that discovery be “just, speedy and inexpensive” is now read as “just speedy and inexpensive.”  That discarded comma is tragic.

Technology is my lifelong passion.  So, I am not afraid of new tech as much as put off by the embrace of technology to further speed and economy without due consideration of quality.  LegalWeek 2024 will be a carnival of vendors touting AI features and roadmaps.  How many will have metrics to support the quality of their AI-abetted outcomes?  How many have forgotten the comma while chasing the cash? Per Upton Sinclair, “It is difficult to get a man to understand something, when his salary depends on his not understanding it.”

Unquestionably, we must reduce the cost of discovery to protect the portals of justice.  Justice no one can afford to pursue is no justice at all.  But there are uniquely human characteristics we should continue to esteem in discovery, like curiosity, intuition, suspicion and impression; the “Spidey-sense” we derive from tone, nuance and detail.  Before we use AI to summarize collections then deploy AI to characterize the summaries, can we pause just long enough to see if it’s going to work? Real testing, not just that which supports salaries.

Policy for Student Use of AI

Andy Williams used to croon that this is “The Most Wonderful Time of the Year.” For me, it’s time to update the curriculum for my class on Electronic Discovery and Digital Evidence at the University of Texas in the graduate schools of Law, Computer Science and Information Science. I’ve long built the course around a Workbook I wrote with readings and some two dozen exercises. But, when I last taught the course a year ago, generative AI was hardly a twinkle in Santa’s eye. Now, of course, AI is the topic that’s eaten all others. So, I’ve had to fashion a policy for student use of AI. I elected to embrace student use of AI tools, in part because legal scholarship is artful plagiarism termed “precedent” and–let’s face it–students are going to use LLMs, whatever I say. So, here’s what I’ve come up with. I’ll be grateful for your feedback as comments, most especially if you are an educator facing the same issues and can offer advice born of experience.

Use of Generative Large Language Models to Assist with Exercises

1. Explicit Disclosure Requirement

  • It is a violation of the honor code to misrepresent work by characterizing it as your own if it is not.  Students may use generative LLMs, such as ChatGPT or Bard, for assistance in completing Workbook exercises; however, they must explicitly disclose the use of these tools by providing a brief note or acknowledgment in their submissions. Transparency is mandatory.

2. Verification and Cross-Checking

  • Students may utilize generative LLMs during Workbook exercises but are required to independently verify and cross-check the information generated by these models through additional research using alternate, reliable sources.

3.  Accountability

  • While generative LLMs are permitted tools, students are held accountable for the accuracy and completeness of the information obtained from these models. Any errors or omissions resulting from the use of LLMs are considered the responsibility of the student. This policy underscores the importance of independent verification and personal accountability.

4. Prohibited for Quizzes and Exams

  • Notwithstanding the foregoing, you may not consult any source of information, including AI resources, when completing quizzes or the final exam.

POSTSCRIPT: I add this a day after the foregoing, after reading that the Fifth Circuit has proposed a rule change requiring counsel and pro se litigants to certify, as to any filed document, that “no generative artificial intelligence program was used in drafting the document…or to the extent such a program was used, all generated text, including all citations and legal analysis, has been reviewed for accuracy and approved by a human.” I recall shaking my head at how foolish it was when a grandstanding district court judge made headlines by requiring such certifications following a high-profile gaffe in New York. “Of course a lawyer must verify the accuracy of legal analysis and citations! Lawyers shouldn’t need to certify that we did what we are required to do!”

Yet, here I am requiring my students to do much the same. I feel confident in advising students that, if they use AI, they must verify the information and sink or swim based on what they submit, even if the AI hallucinates or misleads. Back in the day, lawyers knew they had to “Shepardize” citations to verify that the cases cited were still solid. Proffering a made-up citation was beyond comprehension.

So, am I right to require explicit disclosure of generative AI? Or will AI soon be woven into so many sources of information that disclosure will feel as foolish as requiring students to disclose they used a word processor instead of a typewriter would have been forty years ago? I’m struggling with this. What do you think?

Monica Bay, 1949-2023

I’m saddened to share that Monica Bay, the forceful, revered former editor of Law Technology News (now Legaltech News) has died after a long, debilitating illness.  Though a durable resident of New York City and Connecticut, Monica’s life ended in California where it began.  Monica described herself as a “provocateur,” an apt descriptor from one gifted in finding the bon mot.  Monica was a journalist with soaring standards whose writing exemplified the high caliber of work she expected from her writers.  I cannot overstate Monica’s importance to the law technology community in her 17 years at the helm of LTN.  Monica mentored multitudes and by sheer force of her considerable strength and will, Monica transformed LTN from an industry organ purveying press releases to an award-winning journal unafraid to speak truth to power.

In her time as editor, Monica was everywhere and indefatigable.  Monica was my editor for much of her tenure at LTN including nine years where I contributed a monthly column she dubbed “Ball in Your Court” (see what I mean about her mastery of the well-turned phrase?)  We had a complicated relationship and butted heads often, but my submissions were always better for Monica’s merciless blue pencil.  I owe her an irredeemable debt.  She pushed me to the fore.  You wouldn’t be reading this now if it weren’t for Monica Bay’s efforts to elevate me.  The outsize recognition and writing awards I garnered weren’t my doing but Monica’s.  If life were a movie, Monica would be the influential publisher who tells the writer plucked from obscurity, “I made you and I can break you!” And it would be true.

This elegy would have been far better if she’d edited it.

Trying to illuminate Monica, I turned to Gmail to refresh my memory but backed off when I saw we’d shared more than 2,200 conversations since 2005.  I’d forgotten how she once loomed so large in my life.  In some of those exchanges, Monica generously called me, “hands down my best writer,” but I wouldn’t be surprised if she said that to everyone in her stable of “campers.”  Monica knew how to motivate, cajole and stroke the egos of her contributors. She was insightful about ego, too.

In 2010 when I carped that there’s always too much to do, and always somebody unhappy with me, she counseled, “Like me, you are an intense personality, and we can be difficult to live with at times. but that intensity and drive is also what makes you who you are, why you are successful, and why you are a breathtakingly good writer.  My favorite people in the world are ‘difficult.’”

I wince as I write that last paragraph because as much as she was brilliant in managing egos, Monica didn’t love that part of her work. She confided, “I think we have to be mindful that we don’t exercise our egos in a way that constrains — or worse case, cripples — those around us. That’s the hard part.”

Monica observed of a well-known commentator of the era, “he wouldn’t be able to write if he had to excise ‘I’ from his vocabulary… he annoys me more than the Red Sox or Jacobs Fields gnats.” 

That reminds me that Monica had a personal blog called “The Common Scold.”  She named it for a Puritan-era cause of action where opinionated women were punished by a dunk in a pond.  I mostly remember it for its focus on New York Yankees baseball, which became a passion for Monica when she moved east despite a lifelong disinterest in sports.  Monica, who, insofar as I knew, never married, often referred to herself in the Scold as “Mrs. Derek Jeter.”  She was quirky that way and had a few quirky rules for writers.  One was that the word “solution” was banned, BANNED, in LTN.

To her credit, Monica Bay wasn’t afraid to nip at the hand that feeds.  Now, when every outlet has bent to the will of advertisers, Monica’s strict journalistic standards feel at once quaint and noble. Consider this excerpt from her 2009 Editorial Guidelines:

“Plain English: Law Technology News is committed to presenting information in a manner that is easily accessible to our readers. We avoid industry acronyms, jargon, and clichés, because we believe this language obfuscates rather than enhances understanding.

For example, the word “solution” has become meaningless and is banned from LTN unless it’s part of the name of a company.  Other words we edit out: revolutionary, deploy, mission critical, enterprise, strategic, robust, implement, seamless, initiative, -centric, strategic [sic], and form factor! We love plain English!”

Monica was many things more than simply an industry leader, from a wonderful choral singer to the niece of celebrated actress, Elaine Stritch.  She was my champion, mother figure, friend and scold.   I am in her debt.  And you are, too, Dear Reader, for Monica Bay pushed through barriers that fell under her confident stride.

Fifteen years ago, when Monica lost her father, and my mother was dying, we supported each other.  Monica called her dad’s demise the “great gift of dementia from the karma gods. No pain, just a gentle drift to his next destination.”  That beautifully describes her own shuffle off this mortal coil.  As the most loving parting gift I can offer my late, brilliant editor, I cede to her those last lovely words, “just a gentle drift to [her] next destination.”

[I have no information about services or memorials, but I look forward to commemorating Monica’s life and contributions with others who loved and admired her]

A nice tribute from Bob Ambrogi: https://www.lawnext.com/2023/10/i-am-deeply-saddened-to-report-the-death-of-monica-bay-friend-mentor-and-role-model-to-so-many-in-legal-tech.html and a sweet remembrance from Mary Mack: https://edrm.net/2023/10/the-warmest-and-most-uncommon-scold/

Being the Better Expert Witness

I’ll need to dust off the cobwebs as I haven’t been in this space in quite some time! I’ve not had much to say, and honestly, if I didn’t sneak “ChatGPT” into the title, who’d notice? Preparing for a September 20th presentation to an international conclave of forensic examiners in Phoenix, I extensively revised and expanded my guide for testifying experts, now called “Being the Better Expert Witness: A Primer for Forensic Examiners.” I describe it thus:

This paper covers ways to become an effective witness and pitfalls to avoid.  They say lawyers make notoriously poor witnesses and I have no illusions that I’m a great witness.  But after forty years of trial practice and thirty as a forensic examiner, I’ve learned a few lessons I hope might help other examiners build their skills in court.

In the paper, I discuss the difficulty computer forensic examiners face honing their testimonial abilities because it’s rare to be interrogated by a lawyer who truly understands what we are talking about.  Most interrogators work from a script.  They know the first question to ask, but not the next or the one after that.  Pushed from their path, they’re lost.  Computer forensic examiners have it easy on the stand.  Deep fakes notwithstanding, computer-generated evidence still enjoys an aura of accuracy and objectivity, and the hyper-technical nature of digital forensics awes and intimidates the uninitiated.  Thank you, CSI, NCIS and all the rest!  But sooner or later, computer forensic examiners will square off against interrogators able to skillfully undermine ability and credibility. I want them to be ready.

As I’m wont to do, I ambled down memory lane:

“Evidence professor John Henry Wigmore famously called cross-examination “the greatest legal engine ever invented for the discovery of truth.” Apparently, every lawyer who writes about cross-examination is obliged to say that. Likewise, every trial lawyer aspires to do a great cross examination, and every judge and juror aspires to hear one.  Yet, as I observed at the start, they are rare.”

“Forty years ago, my boss was on the trial team of a lawsuit between Pennzoil and Texaco that resulted in the biggest plaintiff’s verdict of the era and a three-billion-dollar settlement—back when that was a lot of money.  The lawyer for Texaco, the big loser, was named Dick Miller, and my boss used to say of him, “Dick Miller has two speeds: OFF and KILL.”  I’ll never forget that because it encapsulates how some lawyers approach cross-examination.  A truly devastating cross examination flows from applying lessons learned from the raptors in Jurassic Park: get the prey to look one way, while the attack comes from another.

“In court, that entails laying a trap and not springing it too early. Skilled cross examiners box witnesses in and seal off points of retreat before the witness recognizes the need to run.  The very best cross examiners don’t spring their traps during the cross; they save that for final argument.”

“The greatest teacher of cross-examination I’ve ever come across was a former prosecutor, judge and law professor named Irving Younger, who died about 35 years ago.  Younger’s famous lecture on the topic was called “The Ten Commandments of Cross-Examination.”  I’ve listened to multiple versions of his talk over the years and all are magnificent. Stirring.  Funny.  Unforgettable.  Younger opined that a lawyer must try about 25 cases to begin to be skilled in cross-examination, but he GUARANTEED that any lawyer strictly adhering to his Ten Commandments would be able to conduct a reasonably effective cross-examination.  Of course, he added, no lawyer is capable of sticking to all his commandments until the lawyer has about 25 trials under his belt!”

“I do not have ten surefire commandments that will guarantee you won’t get in trouble on cross-examination, but I have a lifetime in court (much of one anyway) and many years teaching law to draw on in offering advice on what to expect on cross plus a few suggested techniques that I GUARANTEE will help you become a better witness.”

So, if you’re looking to help an expert witness new to the role or a veteran making the same old mistakes, perhaps you’ll point them to my new primer at http://www.craigball.com/Ball_Expert_Witness_2023.pdf

Introducing the EDRM E-Mail Duplicate Identification Specification and Message Identification Hash (MIH)

I’m proud to be the first to announce that the Electronic Discovery Reference Model (EDRM) has developed a specification for cross-platform identification of duplicate email messages, allowing for ready detection of duplicate messages that waste review time and increase cost. Leading e-discovery service and software providers support the new specification, making it possible for lawyers to improve discovery efficiency by a simple addition to requests for production. If that sounds too good to be true, read on and learn why and how it works.

THE PROBLEM

The triumph of information technology is the ease with which anyone can copy, retrieve and disseminate electronically stored information. Yet, for email in litigation and investigations, that blessing comes with the curse of massive replication, obliging document reviewers to assess and re-assess nearly identical messages for relevance and privilege. Duplicate messages waste time and money and carry a risk of inconsistent characterization. Seeing the same thing over and over again makes a tedious task harder.

Electronic discovery service providers and software tools ameliorate these costs, burdens and risks using algorithms to calculate hash values—essentially digital fingerprints—of segments of email messages, comparing those hash values to flag duplicates. Hash deduplication works well, but stumbles when minor variations prompt inconsistent outcomes for messages reviewers regard as being “the same.” Hash deduplication fails altogether when messages are exchanged in forms other than those native to email communications—a common practice in U.S. electronic discovery where efficient electronic forms are often printed to static page images.

Without the capability to hash identical segments of identical formats across different software platforms, reviewers cannot easily identify duplicates or readily determine what’s new versus what’s been seen before. When identical messages are processed by different tools and vendors or produced in different forms (so-called “cross-platform productions”), identification of duplicate messages becomes an error-prone, manual process or requires reprocessing of all documents.

Astonishingly, no cross-platform method of duplicate identification has emerged despite decades spent producing email in discovery and billions of dollars burned by reviewing duplicates.

Wouldn’t it be great if there were a solution to this delay, expense and tedium?

THE SOLUTION

When parties produce email in discovery and investigations, it’s customary to supply information about the messages called “metadata” in accompanying “load files.” Load files convey Bates numbers/Document IDs, message dates, sender, recipients and the like. Ideally, the composition of load files is specified in a well-crafted request for production or production protocol. Producing metadata is a practice that’s evolved over time to prompt little argument. For service providers, producing one more field of metadata is trivial, rarely requiring more effort than simply ticking a box.

The EDRM has crafted a new load file field called the EDRM Message Identification Hash (MIH), described in the EDRM Email Duplicate Identification Specification.

Gaining the benefit of the EDRM Email Duplicate Identification Specification is as simple as requesting that load files contain an EDRM Message Identification Hash (MIH) for each email message produced. The EDRM Email Duplicate Identification Specification is an open specification, so no fees or permissions are required to use it, and leading e-discovery service and software providers already support the new specification. For others, it’s simple to generate the MIH without redesigning software or impeding workflows. Too, the EDRM has made free tools available supporting the specification.
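
By way of illustration, here is a hypothetical, simplified load file excerpt with an MIH field appended.  Real productions more often use Concordance-style delimited DAT files rather than commas, and field names vary by protocol, so treat this as a sketch of the idea rather than a prescribed format:

DOCID,DATESENT,FROM,TO,SUBJECT,MIH
ABC000001,2023-01-12,jane@example.com,joe@example.com,Q4 forecast,9f2c7d0a…
ABC000002,2023-01-13,joe@example.com,jane@example.com,RE: Q4 forecast,77b04e3c…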

Any party with the MIH of an email message can readily determine if a copy of the message exists in their collection. Armed with MIH values for emails, parties can flag duplicates even when those duplicates take different forms, enabling native message formats to be compared to productions supplied as TIFF or PDF images.

The routine production of the MIH supports duplicate identification across platforms and parties. By requesting the EDRM MIH, parties receiving rolling or supplemental productions will know if they’ve received a message before, allowing reviewers to dedicate resources to new and unique evidence. Email messages produced by different parties in different forms using different service providers can be compared to instantly surface or suppress duplicates. Cross-platform email duplicate identification means that email productions can be compared across matters, too. Parties receiving production can easily tell if the same message was or was not produced in other cases. Cross-platform support also permits a cross-border ability to assess whether a message is a duplicate without the need to share personally-identifiable information restricted from dissemination by privacy laws.
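
To make that concrete, here’s a minimal Python sketch of how a receiving party might compare MIH values across two load files to flag duplicates.  The CSV layout, the field names DOCID and MIH, and the file names are my assumptions for illustration; nothing here is prescribed by the Specification:

import csv

def load_mih_values(load_file, field="MIH"):
    # Map each MIH value to the Bates/Document ID that carries it.
    with open(load_file, newline="", encoding="utf-8") as f:
        return {row[field]: row["DOCID"] for row in csv.DictReader(f) if row.get(field)}

prior = load_mih_values("production_vol1.csv")     # earlier production
incoming = load_mih_values("production_vol2.csv")  # new rolling production
duplicates = {mih: (prior[mih], incoming[mih]) for mih in incoming if mih in prior}
print(f"{len(duplicates)} of {len(incoming)} messages in the new volume were produced before.")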

IS THIS REALLY NEW?

Yes, and unprecedented. As noted, e-discovery service providers and law firm or corporate e-discovery teams have long employed cryptographic hashing internally to identify duplicate messages; but each does so differently dependent upon the process and software platform employed—sometimes in ways they regard as being proprietary—making it infeasible to compare hash values across providers and platforms. Even if competitors could agree to employ a common method, subtle differences in the way each processes and normalizes messages would defeat cross-platform comparison.

The EDRM Email Duplicate Identification Specification doesn’t require software platform and service providers to depart from the proprietary ways they deduplicate email. Instead, the Specification contemplates that e-discovery software providers add the ability to produce the EDRM MIH to their platforms and that service providers supply a simple-to-determine Message Identification Hash (MIH) value that sidesteps the challenges just described by taking advantage of an underutilized feature of email communication standards called the “Message ID” and pairing it with the power of hash deduplication. If it sounds simple, it is, and by design. It’s far less complex than traditional approaches but sacrifices little or no effectiveness or utility. Crucially, it doesn’t require any difficult or expensive departure from the way parties engage in discovery and production of email messages.
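
Conceptually, the mechanism is easy to sketch.  The Python fragment below illustrates the core idea of hashing a message’s Message-ID header.  It is illustrative only: the Specification, not this sketch, governs the required normalization, the hash algorithm and the handling of messages that lack a Message-ID:

import email
import hashlib
from email import policy

def message_identification_hash(raw_message):
    # Parse the raw message and pull its Message-ID header.
    msg = email.message_from_bytes(raw_message, policy=policy.default)
    message_id = msg.get("Message-ID")
    if not message_id:
        return None  # the Specification addresses this case; not handled here
    # Assumed normalization (trim whitespace); consult the Specification for the actual rules.
    normalized = message_id.strip()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()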

WHAT SHOULD YOU DO TO BENEFIT?

All you need to do to begin reaping the benefits of cross-platform message duplicate identification is amend your Requests for Production to include the EDRM Message Identification Hash (MIH) among the metadata values routinely produced in load files. Because the specification is prominently published by the leading standards organization in e-discovery, it’s likely the producing party’s service provider or litigation support staff will know what’s required. But if not, you can refer them to the EDRM Email Duplicate Identification Specification & Guidelines published at https://edrm.net/active-projects/dupeid/.

HOW DO YOU LEARN MORE?

The EDRM publishes a comprehensive set of resources describing and supporting the Specification & Guidelines that can be found at https://edrm.net/active-projects/dupeid/. All persons and firms deploying the EDRM MIH to identify duplicate messages should familiarize themselves with the considerations for its use.

EDRM WANTS YOUR FEEDBACK

The EDRM welcomes any feedback you may have on this new method of identifying cross platform email duplicates or on any of the resources provided. We are interested in further ideas you may have and expect the use of the EDRM MIH to evolve over time. You can post any feedback or questions at https://edrm.net/active-projects/dupeid/.

ChatGPT Proves a Mediocre Law Student

I recently spent a morning testing ChatGPT’s abilities by giving it exercises and quizzes designed for my law and computer science graduate students. Overall, I was impressed with its performance, but also noticed that it’s frequently wrong but never in doubt: a mechanical mansplainer!

If you’re asking, “What is ChatGPT,” I’ll let it explain itself: “ChatGPT is a large language model developed by OpenAI. It is a type of machine learning model called a transformer, which is trained to generate text based on a given prompt. It is particularly well-suited to tasks such as natural language processing, text generation, and language translation. It is capable of understanding human language and generating human-like text, which makes it useful for a wide range of applications, such as chatbots, question-answering systems, and text completion.” I’d add that it’s the first artificial intelligence tool that anyone can play with, and it’s generating enormous excitement.

In terms of its legal knowledge, ChatGPT performed like a mediocre law student, performing well on essay questions and drafting exercises, but struggling with multiple choice questions. I couldn’t assign it a definitive score on the 25-question quiz as some questions required image analysis and I didn’t see a way to include illustrations in the chat interface. However, it performed poorly enough on the quiz for me to conclude that a passing score of 70 out of 100 points was out of reach.

One example of a question it struggled with was:

Which of the following is NOT an essential metadata value (a “metadata must”) to preserve and produce for a file?

  1. Custodian
  2. File Name
  3. Last Modified Date and Time
  4. Last Accessed Date and Time

The correct answer is Last Accessed Date and Time.  Class readings underscore the unreliability of Last Accessed dates and times and I supply a table of “Metadata Musts” for preservation and production.  Presumably trained on other sources, ChatGPT did not seem to understand the importance of a proper chain of custody for electronic evidence and replied:

Custodian is not considered an essential metadata value to preserve and produce for a file. The other options (File Name, Last Modified Date and Time, Last Accessed Date and Time) are all considered essential metadata values that should be preserved and produced for a file in order to provide important information about the file and its history.

Since it wasn’t performing well on my quiz, I asked,

Can you suggest some multiple choice quiz questions for students studying the fundamentals of electronic discovery and electronic data storage?

The first suggestion was:

What is the process of collecting, processing, and producing electronically stored information (ESI) in response to a legal request called?

a. Electronic discovery

b. Digital forensics

c. E-discovery

d. Data retrieval

In ChatGPT’s formulation “Electronic discovery” is the right response and “E-discovery” is wrong! 

Another suggestion:

What is the most common format for ESI production?

a. PDF

b. TIFF

c. Native file format

d. ASCII

That’s a challenging question because, as much as I’ve focused on and written about forms of production, I’ve not come across metrics establishing the most common format.  In supplying an answer key, ChatGPT identified Native file format as the correct answer, though I would have guessed “TIFF.”  I certainly wouldn’t have included ASCII as a choice.

Here are a few other quiz questions suggested by ChatGPT.  I’ve bolded what ChatGPT defined as the correct response and, in each instance, I’d conclude it’s either not the best answer or the question is so poorly constructed as to defy characterizing any single answer as the right one:

What is the process of identifying and eliminating irrelevant or privileged ESI called?

a. Culling

b. Filtering

c. Redaction

d. Processing

What is the process of reviewing ESI to identify relevant information called?

a. Review

b. Analysis

c. Coding

d. Searching

What is the process of converting ESI from its original format to a reviewable format called?

a. Conversion

b. Processing

c. Extraction

d. Production

What is the process of creating a backup copy of ESI for preservation purposes called?

a. Backup

b. Imaging

c. Archiving

d. Duplication

What is the process of destroying or disposing of ESI after it is no longer needed called?

a. Deletion

b. Erasure

c. Destruction

d. Purging

Encouraged that the law faculty might be safe for another semester, I turned ChatGPT to a drafting exercise:

You are to develop the best checklist of potential data sources that are candidates for legal preservation in any engagement—good enough that you’ll want to keep it to use in your own work. It does NOT have to be entirely original, and I encourage you to start with the best existing checklist(s) you can find and then make updates and improvements. Certainly, you should seek to add missing sources, but you may also see a better way to organize and present the content. Remember that this checklist is a list of sources and varieties not methods and, crucially, it should serve to enhance the completeness of the personal data map you will create (though it should NOT be limited to the scope of your personal digital footprint alone). Thus, generic checklist items like “Online sources” or “Local storage” won’t be sufficiently granular. Be sure to credit your sources and format it as a checklist.

ChatGPT responded:

Data Preservation Checklist

  1. Social media accounts (Facebook, Twitter, Instagram, LinkedIn, etc.)
  2. Email accounts (Gmail, Outlook, Yahoo, etc.)
  3. Cloud storage accounts (Google Drive, Dropbox, OneDrive, etc.)
  4. Mobile device data (text messages, call logs, contacts, etc.)
  5. Computer files (documents, photos, videos, etc.)
  6. Internet browsing history and cache
  7. Online marketplace accounts (eBay, Amazon, Etsy, etc.)
  8. Online banking and financial accounts (Paypal, bank accounts, credit cards, etc.)
  9. GPS location data
  10. Smart home device data (Amazon Echo, Google Home, etc.)
  11. Wearable device data (Fitbit, Apple Watch, etc.)
  12. Gaming accounts (Xbox Live, PlayStation Network, Steam, etc.)
  13. Voicemail and answering machine messages
  14. Instant messaging and chat logs (WhatsApp, Messenger, etc.)
  15. Social media direct messages and private messages
  16. Online dating profiles and messages
  17. Online forum and message board posts
  18. Online video and audio recordings (YouTube, Vimeo, etc.)
  19. Online photo and image sharing accounts (Flickr, Imgur, etc.)
  20. Online document collaboration and sharing accounts (Google Docs, Microsoft Office 365, etc.)

Source: Adapted from “Digital Preservation Checklist” by the National Archives and Records Administration (NARA) (https://www.archives.gov/digital-preservation/checklist) and “Digital Preservation Checklist for Individuals” by the Digital Preservation Coalition (https://www.dpconline.org/advice/digital-preservation-checklist-for-individuals)

Not bad, and certainly typical of what I get from students.  Interestingly, both citations are dead links, so unacceptable from my students, and (arguably) ChatGPT simply ignored the requirement to format its answer as a checklist. 

I had ChatGPT answer the question a second time sans the mandate that it credit sources. I was struck by the differences in the lists supplied and the disclaimer now appended:

  1. Social media accounts (e.g. Facebook, Twitter, Instagram, LinkedIn)
  2. Email accounts (e.g. Gmail, Yahoo, Outlook)
  3. Cloud storage accounts (e.g. Google Drive, Dropbox, iCloud)
  4. Online shopping accounts (e.g. Amazon, eBay, Etsy)
  5. Financial accounts (e.g. bank accounts, credit card accounts, PayPal)
  6. Mobile phone data (e.g. call logs, text messages, contacts)
  7. Computer hard drives
  8. USB drives and external hard drives
  9. Social media messaging apps (e.g. WhatsApp, iMessage, Messenger)
  10. Online forums and discussion boards
  11. Online dating profiles
  12. Online gaming accounts
  13. Web browsing history and cookies
  14. Internet service provider records
  15. GPS data (e.g. location data from mobile devices)
  16. Smart home device data (e.g. Amazon Echo, Google Home)
  17. Vehicle tracking data (e.g. OnStar, Tesla)
  18. Medical records (e.g. electronic health records)
  19. Educational records (e.g. transcripts, grades)
  20. Employee records (e.g. HR files, payroll records)

Note: This checklist is not exhaustive and may vary depending on the specific engagement or case. It is important to consult with legal and technical experts to determine the most relevant and appropriate data sources for preservation. Additionally, it is important to consider any privacy laws and regulations when collecting and preserving personal data.

I asked ChatGPT to identify “the most important legal opinions I should assign my students to read that will help them better understand electronic discovery in the United States” and it responded:

  1. Zubulake v. UBS Warburg LLC, 217 F.R.D. 309 (S.D.N.Y. 2003) – This case established the “Zubulake standards,” which are widely used as a guide for electronic discovery in the United States. It lays out the obligations of parties to preserve and produce electronically stored information (ESI) and the sanctions that can be imposed for failure to comply.
  2. Victor Stanley, Inc. v. Creative Pipe, Inc., 250 F.R.D. 251 (D. Md. 2008) – This case discusses the duty to preserve ESI and the consequences of spoliation (destruction or alteration of evidence). It also addresses the standard for imposing sanctions for spoliation.
  3. The Sedona Conference® Commentary on Achieving Quality in the E-Discovery Process, The Sedona Conference (2008) – This commentary provides guidance on best practices for e-discovery, including the preservation and production of ESI, the use of technology-assisted review (TAR) and predictive coding, and the handling of privileged information.
  4. Da Silva Moore v. Publicis Groupe, 287 F.R.D. 182 (S.D.N.Y. 2012) – This case addresses the use of TAR and predictive coding in e-discovery, and emphasizes the importance of cooperation and communication between parties when using these technologies.
  5. Morgan Stanley Smith Barney LLC v. Skowron, No. 11-CV-8335 (S.D.N.Y. Mar. 29, 2013) – This case discusses the use of proportionality in e-discovery, and the court’s authority to limit the scope of e-discovery in order to reduce the burden and expense of producing ESI.
  6. In re: Actos (Pioglitazone) Products Liability Litigation, MDL No. 6:11-md-2299 (W.D. La. 2013) – This case addresses the use of advanced analytics and data visualization tools in e-discovery, and highlights the importance of considering the costs and benefits of using these technologies in the e-discovery process.

Except for mischaracterizing a Sedona Conference Commentary as a “legal opinion,” they’re all fine suggestions, yet all are ten to twenty years old.

I asked ChatGPT a question posed by a student in my last class: “What is the difference between electronic discovery and computer forensics?”  It opined:

Electronic discovery (e-discovery) and computer forensics are both related to the collection and analysis of electronic data, but they serve different purposes.

Electronic discovery is the process of collecting, processing, and producing electronic data in response to a legal request or investigation. It is typically used in civil litigation and is designed to identify and produce relevant electronic evidence that may be used in court. The goal of e-discovery is to find information that is responsive to the legal request and that is not privileged or protected by any other legal exceptions.

Computer forensics, on the other hand, is the process of collecting, preserving, analyzing, and presenting electronic evidence in a manner that is admissible in court. It is typically used in criminal investigations and is designed to identify, recover, and preserve evidence of a crime or other misconduct. The goal of computer forensics is to find and preserve evidence that may be used to prove or disprove a crime or other wrongdoing.

In summary, electronic discovery is focused on finding and producing relevant electronic evidence in response to a legal request, while computer forensics is focused on finding and preserving evidence of a crime or other misconduct. Both processes require the use of specialized tools and techniques to collect and analyze electronic data, but they serve different purposes and are used in different types of legal proceedings.

A good answer.  My response in class focused on the relative accessibility and intelligibility of the ESI we deal with in e-discovery versus digital forensics, and I didn’t tie forensics to criminal investigations because so much of my work in the field has concentrated on civil cases.  But I can’t say I did any better than the AI.  I smiled more.

Finally, I asked who are the most important thought leaders in the field of electronic discovery?  ChatGPT replied:

  1. Ralph Losey: A well-known attorney and author who has written extensively on the topic of electronic discovery and has been recognized as a leading expert in the field.
  2. Craig Ball: A forensic expert and attorney who has played a key role in shaping the laws and practices related to electronic discovery through his work as a consultant, expert witness, and author.
  3. Tom O’Connor: A consultant, speaker, and author who is considered one of the leading experts on the use of technology in the legal field and has been instrumental in developing best practices for electronic discovery.
  4. George Socha: A consultant, speaker, and author who is considered a leading expert in the field of electronic discovery and has been instrumental in the development of industry standards and best practices.
  5. Maura Grossman: A consultant, speaker, and author who is considered one of the leading experts on the use of artificial intelligence and machine learning in electronic discovery and has been instrumental in advancing the field.
  6. E-Discovery Team: A team of experts who have written extensively on the topic of electronic discovery and have been recognized as thought leaders in the field through their work as consultants, speakers, and authors.
  7. Paul J. Lippe: A consultant, speaker, and author who is considered a leading expert in the field of electronic discovery and has been instrumental in the development of industry standards and best practices.
  8. John Simek: A consultant, speaker, and author who is considered one of the leading experts on the use of technology in the legal field and has been instrumental in developing best practices for electronic discovery.
  9. David Horrigan: A consultant, speaker, and author who is considered one of the leading experts on the use of technology in the legal field and has been instrumental in developing best practices for electronic discovery.
  10. Daniel Garrie: A consultant, speaker, and author who is considered a leading expert in the field of electronic discovery and has been instrumental in the development of industry standards and best practices.

Kudos to Ralph Losey for topping the list and garnering a second mention as “E-Discovery Team.”  I wouldn’t have considered Paul J. Lippe a candidate for an e-discovery-specific list, and I can’t very well commend ChatGPT for wokeness.  One woman?  No persons of color?  No jurists?  I’m flattered to be included among the other old, white men, but know well that there are plenty as or more deserving.

Not So Fine Principle Nine

For the second class meeting of my law school courses on E-Discovery and Digital Evidence, I require my students to read the fourteen Sedona Conference Principles from the latest edition of “Best Practices, Recommendations & Principles for Addressing Electronic Document Production.” The Sedona principles are the bedrock of that group’s work on ESI and, notwithstanding my misgivings that the Principles have tilted toward blocking discovery more than guiding it, there’s much to commend in each of the three versions of the Principles released over the last twenty years.  They enjoy a constitutional durability in the eDiscovery community.

When my students read the Principles, I revisit them and each time, something jumps out at me.  This semester, it’s the musty language of Principle 9:

Principle 9: Absent a showing of special need and relevance, a responding party should not be required to preserve, review, or produce deleted, shadowed, fragmented, or residual electronically stored information.

The Sedona Principles, Third Edition: Best Practices, Recommendations & Principles for Addressing Electronic Document Production, 19 SEDONA CONF. J. (2018)

Save for the substitution of “electronically stored information” for the former “data or documents,” Principle 9 hasn’t been touched since its first drafts of 20+ years ago.  One could argue its longevity owes to an abiding wisdom and clarity. Indeed, the goals behind P9 are laudable and sound.  But the language troubles me, particularly the terms, “shadowed” and “fragmented,” which someone must have pulled out of their … I’ll say “hat” … during the Bush administration, and presumably no one said, “Wait, is that really a thing?”  In the ensuing decades, did no one question the wording or endeavor to fix it?

My objection is that both are terms of art used artlessly.  Consider “shadowed” ESI.  Run a search for shadowed ESI or data, and you’ll not hit anything on point but the Principle itself.  Examine the comments to Principle 9 and discover there’s no effort to explain or define shadowed ESI.  Head over to The Sedona Conference Glossary: eDiscovery and Digital Information Management, and you’ll find nary a mention of “shadowed” anything. 

That is not to say that there wasn’t a behind-the-scenes service in Microsoft Windows XP and Windows Server, one facilitating access to locked files during backup, that came to be called the “Volume Shadow Copy Service” or “VSS,” but it wasn’t being used for forensics when the language of Principle 9 was floated.  I was a forensic examiner at the time and can assure you that my colleagues and I didn’t speak of “shadowed” data or documents.

But whether an argument can be made that it was a “thing” or not twenty years ago, it’s never been a term in common use, nor one broadly understood by lawyers and judges.  It’s not defined in the Principles or glossaries.  You’ll get no useful guidance from Google. 

What harm has it done?  None I can point to.  What good has it done?  None.  Yet, it might be time to consign “shadowed” to the dustbin of history and find something less vague.  It’s not gospel, it’s gobbledygook.

“Fragmented” is a term that’s long been used in reference to data storage, but not as a synonym for “residual” or “artifact.”  Fragmented files refer to information stored in non-contiguous clusters on a storage medium.  Many of the files we access and know to be readily accessible are fragmented in this fashion, and no one who understands the term in the context of ESI would confuse “fragmented” data or documents with something burdensome to retrieve.  But don’t take my word for that, Sedona’s own glossary backs me up.  Sedona’s Principle 9 doesn’t use “fragmented” as Sedona defines it.

If the drafters meant “fragments of data,” intending to convey “artifacts recoverable through computer forensics but not readily accessible to or comprehended by users,” then perhaps other words are needed, though I can’t imagine what those words would add that “deleted” or “residual” doesn’t cover.

This is small potatoes. No one need lose a wink of sleep over the sloppy wording, and I’m not the William Safire of e-discovery or digital forensics; but words matter.  When you are writing to guide persons without deep knowledge of the subject matter, your words matter very much.  If you use a term of art, make sure it’s a correct usage, a genuine one; and be certain you’ve either used it as experts do or define the anomalous usage in context.

When I fail to do that, Dear Reader, I hope you’ll call me on it, too.

The Annotated ESI Protocol

Periodically, I strive to pen something practical and compendious on electronic evidence and eDiscovery, drilling into a topic that hasn’t seen prior comprehensive treatment.  I’ve done primers on metadata, forms of production, backup systems, databases, computer forensics, preservation letters, ESI processing, email, digital storage and more, all geared to a Luddite lawyer audience.  I’ve long wanted to write “The Annotated ESI Protocol.” Finally, it’s done.

The notion behind The Annotated ESI Protocol goes back 40 years to when, as a fledgling personal injury lawyer, I found a book of annotated insurance policies.  What a prize!  Any plaintiff’s lawyer will tell you that success is about more than liability, causation and damages; you’ve got to establish coverage to get paid.  Those annotated insurance policies were worth their weight in gold.

As an homage to that treasured resource, I’ve sought to boil down decades of ESI protocols to a representative iteration and annotate the clauses, explaining the “why” and “how” of each.  I’ve yet to come across a perfect ESI protocol, and I don’t kid myself that I’ve crafted one.  My goal is to offer lawyers who are neither tech-savvy nor e-discovery aficionados a practical, contextual breakdown of a basic ESI protocol–more than simply a form to deploy blindly or an abstract discussion.  I’ve seen thirty-thousand-foot discussions of protocols by other commentators, yet none tied to the document or served up with an ESI protocol anyone can understand and accept. 

It pains me to supply the option of a static image (“TIFF+”) production, but battleships turn slowly, and persuading lawyers long wedded to wasteful ways that they should embrace native production is a tough row to hoe. My intent is that the TIFF+ option in the example sands off the roughest edges of those execrable images; so, if parties aren’t ready to do things the best way, at least we can help them do better.

Fingers crossed you’ll like The Annotated ESI Protocol and put it to work. Your comments here are always valued.

Seven Stages of Snakebitten Search

I’ve long been fascinated by electronic search.  I especially love delving into the arcane limitations of lexical search because, awful Grinch that I am, I get a kick out of explaining to lawyers why their hard-fought search queries and protocols are doomed to fail. But, once we work through the Seven Stages of Attorney E-Discovery Grief: Umbrage, Denial, Anger, Angry Denial, Fear, Finger Pointing, Threats and Acceptance, there’s almost always a workaround to get the job done with minimal wailing and gnashing of teeth.

Three consults today afforded three chances to chew over problematic search strategies: 

  • First, the ask was to search for old CAD/CAM drawings in situ on an opponent’s file servers based on words appearing on drawings. 
  • Another lawyer sought to run queries in M365 seeking responsive text in huge attachments.
  • The last lawyer wanted me to search the contents of a third-party’s laptop for subpoenaed documents but without the machine being imaged or its contents processed before search.

Most of my readers are e-discovery professionals so they’ll immediately snap to the reasons why each request is unlikely to work as planned. Before I delve into my concerns, let’s observe that all these requests seemed perfectly reasonable in the minds of the lawyers involved, and why not?  Isn’t that how keyword and Boolean search is supposed to work?  Sadly, our search reach often exceeds our grasp.

Have you got your answers to why they may fail?  Let’s compare notes.

  • When it comes to lexical search, CAD/CAM drawings differ markedly from Word documents and spreadsheets.  Word processed documents and spreadsheets contain text encoded as ASCII or Unicode characters.  That is, text is stored as, um, text.  In contrast, CAD/CAM drawings tend to be vector graphics.  They store instructions describing how to draw the contents of the plans geometrically; essentially how the annotations look rather than what they say. So, the text is an illustration of text, much like a JPG photograph of a road sign or a static TIFF image of a document—both inherently unsearchable for text unless paired with extracted or OCR text in ancillary load files.  Bottom line: Unless the CAD/CAM drawings are subjected to effective optical character recognition before being indexed for search, lexical searches won’t “see” any text on the face of the drawings and will fail.
  • M365 has a host of limits when it comes to indexing Cloud content for search, and of course, if it’s not in the index, it won’t turn up in response to search.  For example, M365 won’t parse and index an email attachment larger than 150MB.  Mind you, few attachments will run afoul of that capacious limit, but some will.  Similarly, M365 will only parse and index the first 2 million characters of any document.  That means only the first 600-1,000 pages of a document will be indexed and searchable.  Here again, that will suffice for the ordinary, but may prove untenable in matters involving long documents and data compilations.  There are other limits on, e.g., how deeply a search will recurse through nested and embedded content and the body text size of a message that will index.  You can find a list of limits here (https://learn.microsoft.com/en-us/microsoft-365/compliance/limits-for-content-search?view=o365-worldwide#indexing-limits-for-email-messages) and a discussion of so-called “partially indexed” files here (https://learn.microsoft.com/en-us/microsoft-365/compliance/partially-indexed-items-in-content-search?view=o365-worldwide).  Remember, all sorts of file types aren’t parsed or indexed at all in M365.  You must tailor lexical search to the data under scrutiny.  It’s part of counsel’s duty of competence to know what their search tools can and cannot do when negotiating search protocols and responding to discovery using lexical search.
  • In their native environments, many documents sought in discovery live inside various container files, ranging from e-mail and attachments in PST and OST mail containers to compressed Zip containers (a short sketch of the problem follows this list).  Encrypted files may be thought of as being sealed inside an impenetrable container that won’t be searched.  The upshot is that much data on a laptop or desktop machine cannot be thoroughly searched by keywords and queries simply by running searches within an operating system environment (e.g., in Windows or MacOS).  Accordingly, forensic examiners and e-discovery service providers collect and “process” data to make it amenable to search.  Moreover, serial search of a computer’s hard drive (versus search of an index) is painfully slow, so unreasonably expensive when charged by the hour.  For more about processing ESI in discovery, here’s my 2019 primer (http://www.craigball.com/Ball_Processing_2019.pdf)
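
Here’s a minimal Python sketch of the container problem described in the last bullet: searching the raw bytes of a Zip file for a keyword usually fails because the stored data is compressed, while “processing” the container (opening it and examining each member) surfaces the hit.  The file path and keyword are hypothetical:

import zipfile

KEYWORD = b"indemnification"

def naive_search(path):
    # Search the container's raw bytes; compressed data rarely contains
    # the literal keyword, so responsive text will typically be missed.
    with open(path, "rb") as f:
        return KEYWORD in f.read()

def processed_search(path):
    # Open the container and search each member's decompressed contents.
    with zipfile.ZipFile(path) as zf:
        return any(KEYWORD in zf.read(name) for name in zf.namelist())

print(naive_search("collection.zip"), processed_search("collection.zip"))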

In case I don’t post before Chanukah, Christmas and the New Year, have a safe and joyous holiday!