
Ball in your Court

~ Musings on e-discovery & forensics.


Category Archives: General Technology Posts

ChatGPT Proves a Mediocre Law Student

27 Friday Jan 2023

Posted by craigball in E-Discovery, General Technology Posts


I recently spent a morning testing ChatGPT’s abilities by giving it exercises and quizzes designed for my law and computer science graduate students. Overall, I was impressed with its performance, but also noticed that it’s frequently wrong but never in doubt: a mechanical mansplainer!

If you’re asking, “What is ChatGPT,” I’ll let it explain itself: “ChatGPT is a large language model developed by OpenAI. It is a type of machine learning model called a transformer, which is trained to generate text based on a given prompt. It is particularly well-suited to tasks such as natural language processing, text generation, and language translation. It is capable of understanding human language and generating human-like text, which makes it useful for a wide range of applications, such as chatbots, question-answering systems, and text completion.” I’d add that it’s the first artificial intelligence tool that anyone can play with, and it’s generating enormous excitement.

In terms of its legal knowledge, ChatGPT performed like a mediocre law student, performing well on essay questions and drafting exercises, but struggling with multiple choice questions. I couldn’t assign it a definitive score on the 25-question quiz as some questions required image analysis and I didn’t see a way to include illustrations in the chat interface. However, it performed poorly enough on the quiz for me to conclude that a passing score of 70 out of 100 points was out of reach.

One example of a question it struggled with was:

Which of the following is NOT an essential metadata value (a “metadata must”) to preserve and produce for a file?

  1. Custodian
  2. File Name
  3. Last Modified Date and Time
  4. Last Accessed Date and Time

The correct answer is Last Accessed Date and Time.  Class readings underscore the unreliability of Last Accessed dates and times and I supply a table of “Metadata Musts” for preservation and production.  Presumably trained on other sources, ChatGPT did not seem to understand the importance of a proper chain of custody for electronic evidence and replied:

Custodian is not considered an essential metadata value to preserve and produce for a file. The other options (File Name, Last Modified Date and Time, Last Accessed Date and Time) are all considered essential metadata values that should be preserved and produced for a file in order to provide important information about the file and its history.

Since it wasn’t performing well on my quiz, I asked,

Can you suggest some multiple choice quiz questions for students studying the fundamentals of electronic discovery and electronic data storage?

The first suggestion was:

What is the process of collecting, processing, and producing electronically stored information (ESI) in response to a legal request called?

a. Electronic discovery

b. Digital forensics

c. E-discovery

d. Data retrieval

In ChatGPT’s formulation “Electronic discovery” is the right response and “E-discovery” is wrong! 

Another suggestion:

What is the most common format for ESI production?

a. PDF

b. TIFF

c. Native file format

d. ASCII

That’s a challenging question because, as much as I’ve focused on and written about forms of production, I’ve not come across metrics establishing the most common format.  In supplying an answer key, ChatGPT identified Native file format as the correct answer, though I would have guessed “TIFF.”  I certainly wouldn’t have included ASCII as a choice.

Here are a few other quiz questions suggested by ChatGPT.  I’ve bolded what ChatGPT defined as the correct response and, in each instance, I’d conclude it’s either not the best answer or the question is so poorly constructed as to defy characterizing any single answer as the right one:

What is the process of identifying and eliminating irrelevant or privileged ESI called?

a. Culling

b. Filtering

c. Redaction

d. Processing

What is the process of reviewing ESI to identify relevant information called?

a. Review

b. Analysis

c. Coding

d. Searching

What is the process of converting ESI from its original format to a reviewable format called?

a. Conversion

b. Processing

c. Extraction

d. Production

What is the process of creating a backup copy of ESI for preservation purposes called?

a. Backup

b. Imaging

c. Archiving

d. Duplication

What is the process of destroying or disposing of ESI after it is no longer needed called?

a. Deletion

b. Erasure

c. Destruction

d. Purging

Encouraged that the law faculty might be safe for another semester, I turned ChatGPT to a drafting exercise:

You are to develop the best checklist of potential data sources that are candidates for legal preservation in any engagement—good enough that you’ll want to keep it to use in your own work. It does NOT have to be entirely original, and I encourage you to start with the best existing checklist(s) you can find and then make updates and improvements. Certainly, you should seek to add missing sources, but you may also see a better way to organize and present the content. Remember that this checklist is a list of sources and varieties not methods and, crucially, it should serve to enhance the completeness of the personal data map you will create (though it should NOT be limited to the scope of your personal digital footprint alone). Thus, generic checklist items like “Online sources” or “Local storage” won’t be sufficiently granular. Be sure to credit your sources and format it as a checklist.

ChatGPT responded:

Data Preservation Checklist

  1. Social media accounts (Facebook, Twitter, Instagram, LinkedIn, etc.)
  2. Email accounts (Gmail, Outlook, Yahoo, etc.)
  3. Cloud storage accounts (Google Drive, Dropbox, OneDrive, etc.)
  4. Mobile device data (text messages, call logs, contacts, etc.)
  5. Computer files (documents, photos, videos, etc.)
  6. Internet browsing history and cache
  7. Online marketplace accounts (eBay, Amazon, Etsy, etc.)
  8. Online banking and financial accounts (Paypal, bank accounts, credit cards, etc.)
  9. GPS location data
  10. Smart home device data (Amazon Echo, Google Home, etc.)
  11. Wearable device data (Fitbit, Apple Watch, etc.)
  12. Gaming accounts (Xbox Live, PlayStation Network, Steam, etc.)
  13. Voicemail and answering machine messages
  14. Instant messaging and chat logs (WhatsApp, Messenger, etc.)
  15. Social media direct messages and private messages
  16. Online dating profiles and messages
  17. Online forum and message board posts
  18. Online video and audio recordings (YouTube, Vimeo, etc.)
  19. Online photo and image sharing accounts (Flickr, Imgur, etc.)
  20. Online document collaboration and sharing accounts (Google Docs, Microsoft Office 365, etc.)

Source: Adapted from “Digital Preservation Checklist” by the National Archives and Records Administration (NARA) (https://www.archives.gov/digital-preservation/checklist) and “Digital Preservation Checklist for Individuals” by the Digital Preservation Coalition (https://www.dpconline.org/advice/digital-preservation-checklist-for-individuals)

Not bad, and certainly typical of what I get from students.  Interestingly, both citations are dead links, which would be unacceptable coming from my students, and ChatGPT arguably ignored the requirement to format its answer as a checklist.

I had ChatGPT answer the question a second time sans the mandate that it credit sources. I was struck by the differences in the lists supplied and the disclaimer now appended:

  1. Social media accounts (e.g. Facebook, Twitter, Instagram, LinkedIn)
  2. Email accounts (e.g. Gmail, Yahoo, Outlook)
  3. Cloud storage accounts (e.g. Google Drive, Dropbox, iCloud)
  4. Online shopping accounts (e.g. Amazon, eBay, Etsy)
  5. Financial accounts (e.g. bank accounts, credit card accounts, PayPal)
  6. Mobile phone data (e.g. call logs, text messages, contacts)
  7. Computer hard drives
  8. USB drives and external hard drives
  9. Social media messaging apps (e.g. WhatsApp, iMessage, Messenger)
  10. Online forums and discussion boards
  11. Online dating profiles
  12. Online gaming accounts
  13. Web browsing history and cookies
  14. Internet service provider records
  15. GPS data (e.g. location data from mobile devices)
  16. Smart home device data (e.g. Amazon Echo, Google Home)
  17. Vehicle tracking data (e.g. OnStar, Tesla)
  18. Medical records (e.g. electronic health records)
  19. Educational records (e.g. transcripts, grades)
  20. Employee records (e.g. HR files, payroll records)

Note: This checklist is not exhaustive and may vary depending on the specific engagement or case. It is important to consult with legal and technical experts to determine the most relevant and appropriate data sources for preservation. Additionally, it is important to consider any privacy laws and regulations when collecting and preserving personal data.

I asked ChatGPT to identify “the most important legal opinions I should assign my students to read that will help them better understand electronic discovery in the United States” and it responded:

  1. Zubulake v. UBS Warburg LLC, 217 F.R.D. 309 (S.D.N.Y. 2003) – This case established the “Zubulake standards,” which are widely used as a guide for electronic discovery in the United States. It lays out the obligations of parties to preserve and produce electronically stored information (ESI) and the sanctions that can be imposed for failure to comply.
  2. Victor Stanley, Inc. v. Creative Pipe, Inc., 250 F.R.D. 251 (D. Md. 2008) – This case discusses the duty to preserve ESI and the consequences of spoliation (destruction or alteration of evidence). It also addresses the standard for imposing sanctions for spoliation.
  3. The Sedona Conference® Commentary on Achieving Quality in the E-Discovery Process, The Sedona Conference (2008) – This commentary provides guidance on best practices for e-discovery, including the preservation and production of ESI, the use of technology-assisted review (TAR) and predictive coding, and the handling of privileged information.
  4. Da Silva Moore v. Publicis Groupe, 287 F.R.D. 182 (S.D.N.Y. 2012) – This case addresses the use of TAR and predictive coding in e-discovery, and emphasizes the importance of cooperation and communication between parties when using these technologies.
  5. Morgan Stanley Smith Barney LLC v. Skowron, No. 11-CV-8335 (S.D.N.Y. Mar. 29, 2013) – This case discusses the use of proportionality in e-discovery, and the court’s authority to limit the scope of e-discovery in order to reduce the burden and expense of producing ESI.
  6. In re: Actos (Pioglitazone) Products Liability Litigation, MDL No. 6:11-md-2299 (W.D. La. 2013) – This case addresses the use of advanced analytics and data visualization tools in e-discovery, and highlights the importance of considering the costs and benefits of using these technologies in the e-discovery process.

Except for mischaracterizing a Sedona Conference Commentary as a “legal opinion,” they’re all fine suggestions, yet all are ten to twenty years old. 

I asked ChatGPT a question posed by a student in my last class: “What is the difference between electronic discovery and computer forensics?”  It opined:

Electronic discovery (e-discovery) and computer forensics are both related to the collection and analysis of electronic data, but they serve different purposes.

Electronic discovery is the process of collecting, processing, and producing electronic data in response to a legal request or investigation. It is typically used in civil litigation and is designed to identify and produce relevant electronic evidence that may be used in court. The goal of e-discovery is to find information that is responsive to the legal request and that is not privileged or protected by any other legal exceptions.

Computer forensics, on the other hand, is the process of collecting, preserving, analyzing, and presenting electronic evidence in a manner that is admissible in court. It is typically used in criminal investigations and is designed to identify, recover, and preserve evidence of a crime or other misconduct. The goal of computer forensics is to find and preserve evidence that may be used to prove or disprove a crime or other wrongdoing.

In summary, electronic discovery is focused on finding and producing relevant electronic evidence in response to a legal request, while computer forensics is focused on finding and preserving evidence of a crime or other misconduct. Both processes require the use of specialized tools and techniques to collect and analyze electronic data, but they serve different purposes and are used in different types of legal proceedings.

A good answer.  My response in class focused on the relative accessibility and intelligibility of the ESI we deal with in e-discovery versus digital forensics, and I didn’t tie forensics to criminal investigations because so much of my work in the field has concentrated on civil cases.  But I can’t say I did any better than the AI.  I smiled more.

Finally, I asked, “Who are the most important thought leaders in the field of electronic discovery?”  ChatGPT replied:

  1. Ralph Losey: A well-known attorney and author who has written extensively on the topic of electronic discovery and has been recognized as a leading expert in the field.
  2. Craig Ball: A forensic expert and attorney who has played a key role in shaping the laws and practices related to electronic discovery through his work as a consultant, expert witness, and author.
  3. Tom O’Connor: A consultant, speaker, and author who is considered one of the leading experts on the use of technology in the legal field and has been instrumental in developing best practices for electronic discovery.
  4. George Socha: A consultant, speaker, and author who is considered a leading expert in the field of electronic discovery and has been instrumental in the development of industry standards and best practices.
  5. Maura Grossman: A consultant, speaker, and author who is considered one of the leading experts on the use of artificial intelligence and machine learning in electronic discovery and has been instrumental in advancing the field.
  6. E-Discovery Team: A team of experts who have written extensively on the topic of electronic discovery and have been recognized as thought leaders in the field through their work as consultants, speakers, and authors.
  7. Paul J. Lippe: A consultant, speaker, and author who is considered a leading expert in the field of electronic discovery and has been instrumental in the development of industry standards and best practices.
  8. John Simek: A consultant, speaker, and author who is considered one of the leading experts on the use of technology in the legal field and has been instrumental in developing best practices for electronic discovery.
  9. David Horrigan: A consultant, speaker, and author who is considered one of the leading experts on the use of technology in the legal field and has been instrumental in developing best practices for electronic discovery.
  10. Daniel Garrie: A consultant, speaker, and author who is considered a leading expert in the field of electronic discovery and has been instrumental in the development of industry standards and best practices.

Kudos to Ralph Losey for topping the list and garnering a second mention as “E-Discovery Team.”  I wouldn’t have considered Paul J. Lippe a candidate for an e-discovery-specific list, and I can’t very well commend ChatGPT for wokeness.  One woman?  No persons of color?  No jurists?  I’m flattered to be included among the other old, white men, but know well that there are plenty as or more deserving.


Seven Stages of Snakebitten Search

13 Tuesday Dec 2022

Posted by craigball in Computer Forensics, E-Discovery, General Technology Posts, Uncategorized


I’ve long been fascinated by electronic search.  I especially love delving into the arcane limitations of lexical search because, awful Grinch that I am, I get a kick out of explaining to lawyers why their hard-fought search queries and protocols are doomed to fail. But, once we work through the Seven Stages of Attorney E-Discovery Grief: Umbrage, Denial, Anger, Angry Denial, Fear, Finger Pointing, Threats and Acceptance, there’s almost always a workaround to get the job done with minimal wailing and gnashing of teeth.

Three consults today afforded three chances to chew over problematic search strategies: 

  • First, the ask was to search for old CAD/CAM drawings in situ on an opponent’s file servers based on words appearing on drawings. 
  • Another lawyer sought to run queries in M365 seeking responsive text in huge attachments.
  • The last lawyer wanted me to search the contents of a third-party’s laptop for subpoenaed documents but without the machine being imaged or its contents processed before search.

Most of my readers are e-discovery professionals so they’ll immediately snap to the reasons why each request is unlikely to work as planned. Before I delve into my concerns, let’s observe that all these requests seemed perfectly reasonable in the minds of the lawyers involved, and why not?  Isn’t that how keyword and Boolean search is supposed to work?  Sadly, our search reach often exceeds our grasp.

Have you got your answers to why they may fail?  Let’s compare notes.

  • When it comes to lexical search, CAD/CAM drawings differ markedly from Word documents and spreadsheets.  Word processed documents and spreadsheets contain text encoded as ASCII or Unicode characters.  That is, text is stored as, um, text.  In contrast, CAD/CAM drawings tend to be vector graphics.  They store instructions describing how to draw the contents of the plans geometrically; essentially how the annotations look rather than what they say. So, the text is an illustration of text, much like a JPG photograph of a road sign or a static TIFF image of a document—both inherently unsearchable for text unless paired with extracted or OCR text in ancillary load files.  Bottom line: Unless the CAD/CAM drawings are subjected to effective optical character recognition before being indexed for search, lexical searches won’t “see” any text on the face of the drawings and will fail.
  • M365 has a host of limits when it comes to indexing Cloud content for search, and of course, if it’s not in the index, it won’t turn up in response to search.  For example, M365 won’t parse and index an email attachment larger than 150MB.  Mind you, few attachments will run afoul of that capacious limit, but some will.  Similarly, M365 will only parse and index the first 2 million characters of any document.  That means only the first 600-1,000 pages of a document will be indexed and searchable.  Here again, that will suffice for the ordinary, but may prove untenable in matters involving long documents and data compilations.  There are other limits on, e.g., how deeply a search will recurse through nested and embedded content and on the size of a message body that will be indexed.  You can find a list of limits here (https://learn.microsoft.com/en-us/microsoft-365/compliance/limits-for-content-search?view=o365-worldwide#indexing-limits-for-email-messages) and a discussion of so-called “partially indexed” files here (https://learn.microsoft.com/en-us/microsoft-365/compliance/partially-indexed-items-in-content-search?view=o365-worldwide).  Remember, all sorts of file types aren’t parsed or indexed at all in M365.  You must tailor lexical search to the data under scrutiny.  It’s part of counsel’s duty of competence to know what their search tools can and cannot do when negotiating search protocols and responding to discovery using lexical search.
  • In their native environments, many documents sought in discovery live inside various container files, ranging from e-mail and attachments in PST and OST mail containers to compressed Zip containers.  Encrypted files may be thought of as being sealed inside an impenetrable container that won’t be searched.  The upshot is that much data on a laptop or desktop machine cannot be thoroughly searched with keywords and queries simply by running searches within the operating system environment (e.g., Windows or macOS).  Accordingly, forensic examiners and e-discovery service providers collect and “process” data to make it amenable to search.  Moreover, serial search of a computer’s hard drive (versus search of an index) is painfully slow, and so unreasonably expensive when charged by the hour.  For more about processing ESI in discovery, here’s my 2019 primer (http://www.craigball.com/Ball_Processing_2019.pdf).
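The gap between serial search and search of an index can be sketched in a few lines of Python.  This toy inverted index (the documents and terms are hypothetical) maps each term to the documents containing it, so a query becomes a single dictionary lookup rather than a pass over every file:

```python
from collections import defaultdict

def build_index(docs):
    """Map each lowercased term to the set of document IDs containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

docs = {
    "memo1.txt": "Quarterly sales forecast attached",
    "memo2.txt": "Forecast revised after the sales call",
}
index = build_index(docs)

# Indexed search: one lookup, no matter how many documents exist.
print(sorted(index["forecast"]))  # → ['memo1.txt', 'memo2.txt']
```

Real processing tools do far more (text extraction from containers, normalization, decryption), but the principle is the same: pay the cost of parsing once, up front, so every later search is fast.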

In case I don’t post before Chanukah, Christmas and the New Year, have a safe and joyous holiday!


Electronic Evidence Workbook 2022

13 Thursday Jan 2022

Posted by craigball in Uncategorized, General Technology Posts, E-Discovery, Computer Forensics


I’ve released a new version of the Electronic Evidence Workbook used in my three credit E-Discovery and Digital Evidence course at the University of Texas Law School, UT Computer Science School and UT School of Information. I prefer this release over any before because it presents the material more accessibly and logically, better tying the technical underpinnings to trial practice.

The chapters on processing are extensively revamped. I’m hell-bent on making encoding understandable, and I’ve incorporated the new Processing Glossary I wrote for the EDRM. Glossaries are no one’s idea of light reading, but I hope this one proves a handy reference as the students cram for the five quizzes and final exam they’ll face.

Recognizing that a crucial component of competence in electronic discovery is mastering the arcane argot of legaltech, I’ve added Vital Vocabulary lists throughout, concluded chapters with Key Takeaway callouts and, for the first time, broken the Workbook into volumes such that this release covers just the first eight classes, almost entirely Information Technology.

Come Spring Break in mid-March, I’ll release the revamped omnibus volume adding new practical exercises in Search, Processing, Production, Review and Meet & Confer and introducing new tools. Because university students use Mac machines more than Windows PCs, the exercises ahead employ Cloud applications so as to be wholly platform-independent. The second half of the course folds in more case law to the relief of law students and chagrin of CS and IS students. The non-law students do a great job on the law but approach it with trepidation; the law students kiss the terra firma of case law like white-knuckled passengers off a turbulent flight.

Though written for grad students, the Workbook is also written for you, Dear Reader. If you’ve longed to learn more about information technology and e-discovery but never knew quite where or how to start, perhaps the 2022 Workbook is your gateway. The law students at UT Austin pay almost $60,000 per year for their educations; I’ll settle for a little feedback from you when you read it.


Then his head exploded!

28 Tuesday Sep 2021

Posted by craigball in Uncategorized, General Technology Posts, E-Discovery, Computer Forensics


In the introduction to my Electronic Evidence Workbook, I note that my goal is to change the way readers think about electronically stored information and digital evidence. I want all who take my courses to see that modern electronic information is just a bunch of numbers and not be daunted by those numbers.

I find numbers reassuring and familiar, so I occasionally forget that some are allergic to numbers and loath to wrap their heads around them.

Recently, one of my bright students identified himself as a “really bad with numbers person.” My lecture was on encoding as prologue to binary storage, and when I shifted too hastily from notating numbers in alternate bases (e.g., Base 2, 10, 16 and 64) and started in on encoding textual information as numbers (ASCII, Unicode), my student’s head exploded.

Boom!

At least that’s what he told me later. I didn’t hear anything when it happened, so I kept nattering on happily until class ended.

As we chatted, I realized that my student expected that encoding and decoding electronically stored information (ESI) would be a one-step process.  He was having trouble distinguishing the many ways that numbers (numeric values) can be notated from the many ways that numbers represent (“encode”) text and symbols like emoji.  Even as I write that sentence I suspect he’s not alone.

Of course, everyone’s first hurdle in understanding encoding is figuring out why to care about it at all.  Students care because they’re graded on their mastery of the material, but why should anyone else care; why should lawyers and litigation professionals like you care?  The best answer I can offer is that you’ll gain insight.  It will change the way you think about ESI in the same way that algebra changes the way you think about problem solving.  If you understand the fundamental nature of electronic evidence, you will be better equipped to preserve, prove and challenge its integrity as accurate and reliable information.

Electronic evidence is just data, and data are just numbers; so, understanding the numbers helps us better understand electronic evidence.

Understanding encoding requires we hearken back to those hazy days when we learned to tally and count by numbers.  Long ago, we understood quantities (numeric values) without knowing the numerals we would later use to symbolize quantities.  When we were three or four, “five” wasn’t yet Arabic 5, Roman V or even a symbolic tally like ||||. 

More likely, five was a picture: an open hand, five fingers raised.

If you’re from the Americas, Europe or Down Under, I’ll wager you were taught to count using the decimal system, a positional notation system with a base of 10.  Base 10 is so deeply ingrained in our psyches that it’s hard to conceive of numeric values being written any other way.  Decimal just feels like the one “true” way to count, but it’s not.  Writing numbers using an alternate base or “radix” is just as genuine, and it’s advantageous when information is stored or transmitted digitally.

Think about it.  Human beings count by tens because we evolved with ten digits on our hands.  Were that not so, old jokes like this one would make no sense: “Did you hear about the Aggie who was arrested for indecent exposure?  He had to count to eleven.”

Had our species evolved with eight fingers or twelve, we would have come to rely upon an octal or duodecimal counting system, and we would regard those systems as the “true” positional notation system for numeric values.  Ten only feels natural because we built everything around ten.

Computers don’t have fingers; instead, computers count using a slew of electronic switches that can be “on” or “off.”  Having just two states (on/off) makes it natural to count using Base 2, a binary counting system.  By convention, computer scientists notate the status of the switches using the numerals one and zero.  So, we tend to say that computers store information as ones and zeroes.  Yet, they don’t.

Computer storage devices like IBM cards, hard drives, tape, thumb drives and optical media store information as physical phenomena that can be reliably distinguished in either of two distinct states, e.g., punched holes, changes in magnetic polar orientation, minute electric potentials or deflection of laser beams.   We symbolize these two states as one or zero, but you could represent the status of binary data by, say, turning a light on or off.  Early computing systems did just that, hence all those flashing lights.

You can express any numeric value in any base without changing its value, just as it doesn’t change the numeric value of “five” to express it as Arabic “5” or Roman “V” or just by holding up five fingers. 

In positional notation systems, the order of numerals determines their contribution to the value of the number; that is, their contribution is the value of the digit multiplied by a factor determined by the position of the digit and the base.

The base/radix describes the number of unique digits, starting from zero, that a positional numeral system uses to represent numbers.  So, there are just two digits in base 2 (binary), ten in base 10 (decimal) and sixteen in base 16 (hexadecimal).  E-mail attachments are encoded using a whopping 64 digits in base 64.

We speak the decimal number 31,415 as “thirty-one thousand, four hundred and fifteen,” but were we faithfully adhering to its base 10 structure, we might say, “three ten thousands, one thousand, four hundreds, one ten and five ones.”  The “base” ten means that there are ten characters used in the notation (0-9) and the value of each position is ten times the value of the position to its right.

The same decimal number 31,415 can be written as a binary number this way: 111101010110111

In base 2, two characters are used in the notation (0 and 1) and each position is twice the value of the position to its right.  If you multiply each digit times its position value and add the products, you’ll get a total equal in value to the decimal number 31,415.
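That digit-by-digit arithmetic is easy to verify in a few lines of Python:

```python
bits = "111101010110111"  # the binary form of 31,415

# Multiply each binary digit by its position value (a power of two)
# and sum the products, exactly as described above.
total = sum(int(digit) * 2**power
            for power, digit in enumerate(reversed(bits)))

print(total)         # → 31415
print(int(bits, 2))  # Python's built-in base conversion agrees: 31415
```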

A value written as five characters in base 10 requires 15 characters in base 2.  That seems inefficient until you recall that computers count using on-off switches and thrive on binary numbers.

The decimal value 31,415 can be written as a base 16 or hexadecimal number this way: 7AB7

In base 16, sixteen characters are used in the notation (0-9 and A-F) and each position is sixteen times the value of the position to its right.  If you multiply each digit times its position value and add the products, you’ll get a total equal in value to the decimal number 31,415.  But how do you multiply letters like A, B, C, D, E and F?  You do it by knowing the letters are used to denote values greater than 9, so A=10, B=11, C=12, D=13, E=14 and F=15.  Zero through nine plus the six values represented as letters comprise the sixteen characters needed to express numeric values in hexadecimal.

Once more, if you multiply each digit/character times its position value and add the products, you’ll get a total equal in value to the decimal number 31,415.
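The same check works in Python for the hexadecimal form, once the letters A through F are mapped to the values 10 through 15:

```python
digits = "0123456789ABCDEF"  # a character's value is its position in this string

def hex_to_decimal(s):
    total = 0
    for ch in s:
        # Each position is worth sixteen times the position to its right.
        total = total * 16 + digits.index(ch)
    return total

print(hex_to_decimal("7AB7"))  # → 31415
print(int("7AB7", 16))         # built-in conversion agrees: 31415
```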

Computers work with binary data in eight-character sequences called bytes.  A binary sequence of eight ones and zeros (“bits”) can be arranged in 256 unique ways.   Long sequences of ones and zeroes are hard for humans to follow, so happily, two hexadecimal characters can also be arranged in 256 unique ways, meaning that just two base-16 characters can replace the eight characters of a binary byte (i.e., a binary value of 11111111 can be written in hex as FF).  Using hexadecimal characters allows programmers to write data in just 25% of the space required to write the same data in binary, and it’s easier for humans to follow.

Let’s take a quick look at why this is so.  A single binary byte can range from 0 to 255 (being 00000000 to 11111111).  Computers count from zero, so that range spans 256 unique values. Accordingly, the largest value of an eight character binary byte (11111111) equals the largest value of just two hexadecimal characters (FF).
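A quick computation confirms the equivalence, and that every one of the 256 byte values fits in exactly two hex characters:

```python
print(int("11111111", 2))  # → 255, the largest eight-bit binary value
print(int("FF", 16))       # → 255, the largest two-character hex value

# Every byte value from 0 through 255 formats as exactly two hex characters.
for value in range(256):
    assert len(format(value, "02X")) == 2

print(format(255, "02X"))  # → FF
```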

Hexadecimal values are everywhere in computing.  Litigation professionals encounter hexadecimal values as MD5 hash values and may run into them as IP addresses, Globally Unique Identifiers (GUIDs) and even color references.

Encoding Text

So far, I’ve described ways to encode the same numeric value in different bases.  Now, let’s shift gears to describe how computers use those numeric values to signify intelligible alphanumeric information like the letters of an alphabet, punctuation marks and emoji.  Again, data are just numbers, and those numbers signify something in the context of the application using that data, just as gesturing with two fingers may signify the number two, a peace sign, the V for Victory or a request that a blackjack dealer split a pair.  What numbers mean depends upon the encoding scheme applied to the values in the application; that is, the encoding scheme supplies the essential context needed to make the data intelligible.  If the number is used to describe an RGB color, then the hex value 7F00FF means violet.  Why?  Because each of the three values that make up the number (7F 00 FF) denote how much of the colors red, green and blue to mix to create the desired RGB color. In other contexts,  the same hex value could mean the decimal number 8,323,327, the binary string 11111110000000011111111 or the characters 缀ÿ.
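That context-dependence is easy to demonstrate in Python; here the same hex value 7F00FF is read two ways (my example, not from the post):

```python
value = int("7F00FF", 16)

# As an RGB color: one byte each of red, green and blue.
red = (value >> 16) & 0xFF
green = (value >> 8) & 0xFF
blue = value & 0xFF
print(red, green, blue)  # 127 0 255 -> half red, no green, full blue: violet

# As a plain number, the very same bytes mean:
print(value)             # 8323327
```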

ASCII

When the context is text, there are a host of standard ways, called Character Encodings or Code Pages, in which the numbers denote letters, punctuation and symbols.  Now nearly sixty years old, the American Standard Code for Information Interchange (ASCII, “ask-key”) is the basis for most modern character encoding schemes (though both Morse code and Baudot code are older).  Born in an era of teletypes and 7-bit bytes, ASCII’s original 128 codes included 33 non-printable codes for controlling machines (e.g., carriage return, ring bell) and 95 printable characters.  The ASCII character set follows:
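The counts above are simple to confirm with a couple of lines of Python (my sketch):

```python
# ASCII spans code points 0-127: codes 0-31 plus 127 (DEL) are control
# codes; 32-126 are printable characters (space through tilde).
control = [c for c in range(128) if c < 32 or c == 127]
printable = [c for c in range(128) if 32 <= c <= 126]
print(len(control), len(printable))  # 33 95

print(ord("A"), hex(ord("A")))  # 65 0x41
print(chr(126))                 # ~
```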

Windows-1252

Later, when the byte standardized from seven to eight bits (recall a bit is a one or zero), 128 additional characters could be added to the character set, prompting the development of extended character encodings. Arguably the most used single-byte character set in the world is the Windows-1252 code page, the characters of which are set out in the following table (red dots signify unassigned values). 

Note that the first 128 control codes and characters (from NUL to DEL) match the ASCII encodings and the 128 characters that follow are the extended set.  Each character and control code has a corresponding fixed byte value, i.e., an upper-case B is hex 42 and the section sign, §, is hex A7.  To see the entire code page character set and the corresponding hexadecimal encodings on Wikipedia, click here.  Again, ASCII and the Windows-1252 code page are single byte encodings so they are limited to a maximum of 256 characters.
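Python ships the Windows-1252 code page under the name "cp1252," so those fixed byte values are easy to confirm:

```python
# Single-byte round trips in Windows-1252 ("cp1252" in Python):
print(bytes([0x42]).decode("cp1252"))      # B
print(bytes([0xA7]).decode("cp1252"))      # §
print("§".encode("cp1252").hex().upper())  # A7

# The first 128 values match ASCII exactly:
for code in range(128):
    assert bytes([code]).decode("cp1252") == chr(code)
```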

Unicode

The Windows-1252 code page works reasonably well so long as you’re writing in English and most European languages; but sporting only 256 characters, it won’t suffice if you’re writing in, say, Greek, Cyrillic, Arabic or Hebrew, and it’s wholly unsuited to Asian languages like Chinese, Japanese and Korean. 

Though programmers developed various ad hoc approaches to foreign language encodings, an increasingly interconnected world needed universal, systematic encoding mechanisms.  These methods would use more than one byte to represent each character, and the most widely adopted such system is Unicode.  In its latest incarnation (version 14.0, effective 9/14/21), Unicode standardizes the encoding of 159 written character sets called “scripts” comprising 144,697 characters, plus multiple symbol sets and emoji characters.

The Unicode Consortium crafted Unicode to co-exist with the longstanding ASCII and ANSI character sets by emulating the ASCII character set in corresponding byte values within the more extensible Unicode counterpart, UTF-8.  UTF-8 can represent all 128 ASCII characters using a single byte and all other Unicode characters using two, three or four bytes.  Because of its backward compatibility and multilingual adaptability, UTF-8 has become the most popular text encoding standard, especially on the Internet and within e-mail systems. 
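A quick Python demonstration of UTF-8’s variable width (my example; the sample characters are arbitrary):

```python
# UTF-8 uses one byte for ASCII and two, three or four bytes otherwise.
for char in ("A", "§", "€", "😀"):
    encoded = char.encode("utf-8")
    print(repr(char), len(encoded), "byte(s):", encoded.hex())

# Backward compatibility: ASCII text is byte-for-byte identical in UTF-8.
assert "Hello".encode("ascii") == "Hello".encode("utf-8")
```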

Exploding Heads and Encoding Challenges

As tempting as it is to regard encoding as a binary backwater never touching lawyers’ lives, encoding issues routinely lie at the root of e-discovery disputes, even when the term “encoding” isn’t mentioned.  “Load file problems” are often encoding issues, as may be “search difficulties,” “processing exceptions” and “corrupted data.”  If an e-discovery processing tool reads Windows-1252 encoded text expecting UTF-8 encoded text or vice-versa, text and load files may be corrupted to the point that data will need to be re-processed and new production sets generated.  That’s costly, time-consuming and might be wholly avoidable, perhaps with just the smattering of knowledge of encoding gained here.
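Here’s a toy reproduction of that failure mode in Python (the text is hypothetical, but the mismatch is the real mechanism):

```python
# Text written under Windows-1252 but read as UTF-8: instant "corruption."
original = "§ 1782 discovery"
stored = original.encode("cp1252")  # the producing side's encoding

garbled = stored.decode("utf-8", errors="replace")
print(garbled)  # the section sign is mangled into a replacement character

# Reading with the correct code page recovers the text intact:
print(stored.decode("cp1252") == original)  # True
```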


Ten Tips for Better ESI Expert Reports

24 Monday May 2021

Posted by craigball in General Technology Posts, E-Discovery, Computer Forensics

≈ 5 Comments

A lawyer I admire asked me to talk to her colleague about expert reports.  I haven’t had that conversation yet, but the request got me thinking about the elements of a competent expert report, especially reports in my areas of computer forensics and digital evidence.  I dashed off ten things I thought contribute to the quality of the best expert reports.  If these were rules, I’d have to concede I’ve learned their value by breaking a few of them.  I’ve left out basic writing tips like “use conversational language and simple declarative sentences.” There are lists of rules for good writing elsewhere and you should seek them out.  Instead, here’s my impromptu list of ten tips for crafting better expert reports on technical issues in electronic discovery and computer forensics:

  1. Answer the questions you were engaged to resolve.
  2. Don’t overreach your expertise.
  3. Define jargon, and share supporting data in useful, accessible ways.
  4. Distinguish factual findings from opinions.
  5. Include language addressing the applicable evidentiary standard.
  6. Eschew advocacy; let your expertise advocate for you.
  7. Challenge yourself and be fair.
  8. Proofread.  Edit.  Proofread again. Sleep on it. Edit again.
  9. Avoid assuming the fact finder’s role in terms of ultimate issues.
  10. Listen to your inner voice.

Most of these are self-explanatory but please permit me a few clarifying comments.

Answer the questions you were engaged to resolve.

My pet peeve with expert reports is that they don’t always address the questions important to the court and counsel.  I’ve seen reports spew hundreds of pages of tables and screenshots without conveying what any of it means to the issues in the case.  Sometimes you can’t answer the questions.  Fine.  Say so.  Other times you must break down or reframe the questions to conform to the evidence.  That’s okay, too, IF it’s not an abdication of the task you were brought in to accomplish.  But, the best, most useful and intelligible expert reports pose and answer specific questions.

Don’t overreach your expertise.

The standard to qualify as an expert witness is undemanding: do you possess specialized knowledge that would assist the trier of fact in understanding the evidence or resolving issues of fact? See, e.g., Federal Rule of Evidence 702.  With the bar so low, it can be tempting to overreach your expertise, particularly when pushed by a client to opine on something you aren’t fully qualified to address.  For example, I’m a certified computer forensic examiner and I studied accounting in college, but I’m not a forensic accountant.  I know a lot about digital forgery, but I’m not a trained questioned document examiner.  These are specialties.  I try to stay in my own lane and commend it to other experts.

Define jargon, and share supporting data in useful, accessible ways.

Can someone with an eighth-grade education and no technical expertise beyond that of the average computer user understand your report?  If not, you’re writing for the wrong audience.  We should write to express, not impress.  I love two-dollar words and the bon mot phrase, but they don’t serve me well when writing reports.  Never assume that a technical term will be universally understood.  If your grandparents wouldn’t know what it means, define it.

Computer forensic tools are prone to generate lengthy “reports” rife with incomprehensible data.  It’s tempting to tack them on as appendices to add heft and underscore how smart one must be to understand it all.  But it’s the expert’s responsibility to act as a guide to the data and ensure its import is clear.  I rarely testify, even by affidavit, without developing annotated demonstrative examples of the supporting data.  Don’t wait for the deposition or hearing to use demonstrative evidence; make points clear in the report.

Too, I’m fond of executive summaries; that is, an up-front, cut-to-the-chase paragraph relating the upshot of the report.

Distinguish factual findings from opinions.

The key distinction between expert and fact witnesses is that expert witnesses are permitted to express opinions that go beyond their personal observation.  A lay witness to a crash may testify to speeds based only upon what they saw with their own eyes.  An accident reconstructionist can express an opinion of how fast the cars were going based upon evidence that customarily informs expert opinions like skid marks and vehicle deformation.  Each type of testimony must satisfy different standards of proof in court; so, to make a clear and defensible record, it’s good practice to distinguish factual findings (“things you saw”) from opinions (“things you’ve concluded based upon what you saw AND your specialized knowledge, training and experience”).  This naturally begets the next tip:

Include language addressing the applicable evidentiary standard.

Modern jurisprudence deploys safeguards like the Daubert standard to combat so-called “junk science.”  Technical expert opinions must be based upon a sound scientific methodology, viz., sufficient facts or data and the product of reliable principles and methods.  While a court acting as gatekeeper can infer the necessary underpinnings from an expert’s report and C.V., expressly stating that opinions are based upon proper and accepted standards makes for a better record.

Eschew advocacy; let your expertise advocate for you.

Mea culpa here.  Because I was a trial lawyer for three+ decades, I labor to restrain myself in my reporting to ensure that I’m not intruding into the lawyer’s realm of advocacy.  I don’t always succeed.  Even if you’re working for a side, be as scrupulously neutral as possible in your reporting.  Strive to act and sound like you don’t care who prevails even if you’re rooting for the home team.  If you do your job well, the facts will advocate the right outcome.

Challenge yourself and be fair.

My worst nightmare as an expert witness is that I will mistakenly opine that someone committed a bad act when they didn’t.  So, I’m always trying to punch holes in my own theories and asking myself, “how would I approach this if I were working for the other side?”  Nowhere is this more important than when working as a court-appointed neutral expert.  Even if you’d enjoy seeing a terrible person fry, be fair.  You stand in the shoes of the Court.

Proofread.  Edit.  Proofread again. Sleep on it. Edit again.

Who has that kind of time, right?  Still, try to find the time.  Few things undermine the credibility of an expert report like a bunch of spelling and grammatical errors.  Stress and fatigue make for poor first drafts.  It often takes a good night’s sleep (or at least a few hours away from the work) to catch the inartful phrase, typo or other careless error.

Avoid assuming the fact finder’s role in terms of ultimate issues.

Serving as a court Special Master a few years back, I opined that the evidence of a certain act was so overwhelming that the Court should only reach one result.  Accordingly, I ceased investigating the loss of certain data that I regarded as out-of-scope.  I was right…but I was also wrong.  The Court has a job to do and, because I elided an issue the Court was obliged to address, the Court had to rule without the benefit of what further inquiry into the missing evidence would have revealed.  The outcome was the same, but by assuming the factfinder’s role on an ultimate issue, I made the Court’s job harder.  Don’t do that.

Listen to your inner voice.

In expressing expert opinions, too much certainty (a/k/a arrogance) is as perilous as too much doubt.  Perfect is not the standard, but you should be reasonably confident of your opinion based on a careful and competent review of the evidence.  If something “feels” off, it may be your inner voice telling you to look again.


The Great Pandemic Leap

22 Thursday Apr 2021

Posted by craigball in General Technology Posts, E-Discovery, Computer Forensics

≈ 4 Comments

Much has been made of the “Great Pandemic Leap” by law firms and courts. The pandemic proved to be, if not the mother of invention, at least the mother****** who FINALLY got techno-tardy lawyers to shuffle forward. The alleged leap had nothing to do with new technology. Zoom and other collaboration tools have been around a long time. In fact, April 21, 2021 was Zoom’s 10th Birthday! Happy Birthday, Zoom! Thanks for being there for us.

No, it wasn’t new technology. The ‘Ten Years in Ten Weeks’ great leap was enabled by compulsion, adoption and support.

“Compulsion” because we couldn’t meet face-to-face, and seeing faces (and slides and white boards) is important.
“Adoption” because so many embraced Zoom and its ilk that we suddenly enjoyed a common meeting place.
“Support” because getting firms and families up and running on Zoom et al. became a transcendent priority.

It didn’t hurt that schools moving to Zoom served to put a support scion in many lawyers’ homes and, let’s face it Atticus, the learning curve wasn’t all that steep. Everyone already had a device with camera and microphone. Zoom made it one-click easy to join a meeting, even if eye-level camera positioning and unmuting of microphones has proven more confounding to lawyers than the Rule Against Perpetuities.

For me, the Great Leap manifested as the near-universal ability to convene on a platform where screen sharing and remote control were simple. I’ve long depended on remote control and screen sharing tools to access machines by Remote Desktop Protocol (RDP) or TeamViewer (not to mention PCAnywhere and legacy applications that made WFH possible in the 90s and aughts). But, that was on my own machines. Linking to somebody else’s machine without a tech-savvy soul on the opposite end was a nightmare. If you’ve ever tried to remotely support a parent, you understand. “No, Mom, please don’t click anything until I tell you. Oh, you already did? What did the error message say? Next time, don’t hit ‘Okay’ until you read the message, please, Mom.”

E-discovery and digital forensics require defensible data identification, preservation and collection. The pandemic made deskside reviews and onsite collection virtually impossible, or more accurately, those tasks became possible only virtually. Suddenly, miraculously, everyone knew how to join a Zoom call, so custodians could share screens and hand over remote control of keyboard and mouse. I could record the sessions to document the work and remotely load software (like iMazing or CoolMuster) to preserve and access mobile devices. Remote control and screen sharing let me target collection efforts based on my judgment and not be left at the mercy of a custodian’s self-interested actions. Custodians could observe, assist and intervene in my work or they could opt to walk away and leave me to do my thing. I was “there,” but less intrusively and spared the expense and hassle of travel. I could meet FRCP 26(g) obligations and make a record to return to if an unforeseen issue arose.

In my role as investigator, there are advantages attendant to being onsite; e.g., I sometimes spot evidence of undisclosed data sources. But, weighed against the convenience and economy of remote identification and collection, I can confidently say I’m never going back to the old normal when I can do the work as well via Zoom.

Working remotely as I’ve described requires a passing familiarity with Zoom screen sharing, if only to be able to talk others through unseen menus. As Zoom host, you will need to extend screen sharing privileges to the remote user. Do this on-the-fly by making the remote user a meeting co-host (click “More” alongside their name in the Participants screen). Alternatively, you can select Advanced Sharing Options from the Share Screen menu. Under “Who can Share?” choose “All Participants.”

To acquire control of the remote user’s mouse and keyboard, have the remote user initiate a screen share then open the View Options dropdown menu alongside the green bar indicating you’re viewing a shared screen. Select “Request Remote Control,” then click “Request” to confirm. The remote user will see a message box seeking authorization to control their screen. Once authorized, click inside the shared screen window to take control of the remote machine.

If you need to inspect a remote user’s iPhone or iPad, Zoom supports sharing those devices using a free plugin that links the mobile device over the same WiFi connection as the Zoom session. To initiate an iPhone/iPad screen share, instruct the remote user to click Screen Share and then select the iPhone/iPad icon at right for further instructions. Simpler still, have the remote user install Zoom on the phone or pad under scrutiny and join the Zoom session from the mobile device. Once in the meeting, the remote user screen shares from the session on the mobile device. Easy-peasy AND it works for Android phones, too!

So Counselor, go ahead and take that victory lap. Whether you made a great leap or were dragged kicking and screaming to a soupçon of technical proficiency, it’s great to see you! Hang onto those gains, and seek new ways to leverage technology in your practice. Your life may no longer depend on it, but your future certainly does.


Understanding the UPC: Because You Can

25 Monday Jan 2021

Posted by craigball in Computer Forensics, General Technology Posts

≈ 2 Comments

Where does the average person encounter binary data?  Though we daily confront a deluge of digital information, it’s all slickly packaged to spare us the bare binary bones of modern information technology.  All, that is, save the humble Universal Product Code, the bar code symbology on every packaged product we purchase from a 70-inch TV to a box of Pop Tarts.  Bar codes and their smarter Japanese cousins, QR Codes, are perhaps the most unvarnished example of binary encoding in our lives. 

Barcodes have an ancient tie to e-discovery as they were once used to Bates label hard copy documents, linking them to “objective coding” databases. A lawyer using barcoded documents was pretty hot stuff back in the day.

Just a dozen numeric characters are encoded by the ninety-five stripes of a UPC-A barcode, but those digits are encoded so ingeniously as to make them error resistant and virtually tamperproof. The black and white stripes of a UPC are the ones and zeroes of binary encoding.  Each number is encoded as seven bars and spaces (12×7=84 bars and spaces) and an additional eleven bars and spaces denote start, middle and end of the UPC.  The start and end markers are each encoded as bar-space-bar and the middle is always space-bar-space-bar-space.  Numbers in a bar code are encoded by the width of the bar or space, from one to four units. 

[Image: UPC-A barcode from the water bottle label (barcode-water.png)]

The bottle of Great Value purified water beside me sports the bar code at right.

Humans can read the numbers along the bottom, but the checkout scanner cannot; the scanner reads the bars. Before we delve into what the numbers signify in the transaction, let’s probe how the barcode embodies the numbers.  Here, I describe a bar code format called UPC-A.  It’s a one-dimensional code because it’s read across.  Other bar codes (e.g., QR codes) are two-dimensional codes and store more information because they use a matrix that’s read side-to-side and top-to-bottom.

The first two black bars on each end of the barcode signal the start and end of the sequence (bar-space-bar).  They also serve to establish the baseline width of a single bar to serve as a touchstone for measurement.  Bar codes must be scalable for different packaging, so the ability to change the size of the codes hinges on the ability to establish the scale of a single bar before reading the code.

Each of the ten decimal digits of the UPC is encoded using seven “bar width” units per the schema in the table at right.

To convey the decimal string 078742, the encoded sequence is 3211 1312 1213 1312 1132 2122 where each number in the encoding is the width of the bars or spaces.  So, for the leading value “zero,” the number is encoded as seven consecutive units divided into bars of varying widths: a bar three units wide, then (denoted by the change in color from white to black or vice-versa), a bar two units wide, then one, then one.  Do you see it? Once more, left-to-right, a white band three units wide, a dark band two units wide, then a single white band and a single dark band (3-2-1-1 encoding the decimal value zero).

You could recast the encoding in ones and zeroes, where a black bar is a one and a white bar a zero. If you did, the first digit would be 0001101, the number seven would be 0111011 and so on; but there’s no need for that, because the bands of light and dark are far easier to read with a beam of light than a string of printed characters.
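The correspondence between those bit patterns and the published widths can be sketched in a few lines of Python; the digit patterns below are the standard left-hand (“L”) UPC-A codes:

```python
# Left-hand (odd parity) UPC-A digit patterns: space=0, bar=1.
LEFT_PATTERNS = {
    0: "0001101", 1: "0011001", 2: "0010011", 3: "0111101", 4: "0100011",
    5: "0110001", 6: "0101111", 7: "0111011", 8: "0110111", 9: "0001011",
}

def widths(pattern):
    """Collapse a 7-unit pattern into run widths, e.g. 0001101 -> 3211."""
    runs, count = [], 1
    for prev, cur in zip(pattern, pattern[1:]):
        if cur == prev:
            count += 1
        else:
            runs.append(count)
            count = 1
    runs.append(count)
    return "".join(str(r) for r in runs)

print(" ".join(widths(LEFT_PATTERNS[int(d)]) for d in "078742"))
# 3211 1312 1213 1312 1132 2122
```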

Taking a closer look at the first six digits of my water bottle’s UPC, I’ve superimposed the widths and corresponding decimal value for each group of seven units. The top is my idealized representation of the encoding and the bottom is taken from a photograph of the label:

Now that you know how the bars encode the numbers, let’s turn to what the twelve digits mean.  The first six digits generally denote the product manufacturer. 078742 is Walmart. 038000 is assigned to Kellogg’s.  Apple is 885909 and Starbucks is 099555.  The first digit can define the operation of the code.  For example, when the first digit is a 5, it signifies a coupon and ties the coupon to the purchase required for its use.  If the first digit is a 2, then the item is something sold by weight, like meats, fruit or vegetables, and the last six digits reflect the weight or price per pound.  If the first digit is a 3, the item is a pharmaceutical.

Following the leftmost six-digit manufacturer code is the middle marker (11111, as space-bar-space-bar-space) followed by five digits identifying the product.  Every size, color and combo demands a unique identifier to obtain accurate pricing and an up-to-date inventory.

The last digit in the UPC serves as an error-correcting check digit to ensure the code has been read correctly.  The check digit derives from a calculation performed on the other digits, such that if any digit is altered the check digit won’t match the changed sequence. Forget about altering a UPC with a black marker: the change wouldn’t work out to the same check digit, so the scanner will reject it.
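The check digit calculation is simple enough to sketch in Python; the twelve-digit UPC used below is a commonly cited textbook example, not my water bottle’s:

```python
def upc_check_digit(first_eleven):
    """UPC-A check digit: triple the digits in odd positions, add the
    even-position digits, and return what's needed to reach a multiple of 10."""
    digits = [int(d) for d in first_eleven]
    total = 3 * sum(digits[0::2]) + sum(digits[1::2])
    return (10 - total % 10) % 10

print(upc_check_digit("03600029145"))  # 2 -> full UPC 036000291452

# Alter any single digit and the check digit no longer matches:
print(upc_check_digit("13600029145"))  # not 2, so a scanner rejects the code
```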

In case you’re wondering, the first product to be scanned at a checkout counter using a bar code was a fifty stick pack of Juicy Fruit gum in Troy, Ohio on June 26, 1974.  It rang up for sixty-seven cents.  Today, 45 sticks will set you back $2.48 (UPC 22000109989).


Advanced Zoom “Weather Map” Technique

02 Thursday Jul 2020

Posted by craigball in General Technology Posts, Uncategorized

≈ 5 Comments

I lately presented a program for the State Bar of Texas Annual Meeting alongside Texas District Court Judge Emily Miskel. Like everything else, the venerable Annual Meeting was recast as a virtual event. Our topic was “Upping your Game in Zoom,” and we spoke of many ways to improve the quality of online video meetings and hearings. Judge Miskel and I covered dead simple ways to avoid common errors and some advanced techniques. One advanced approach I shared was making your presentation visuals serve as your dynamic Zoom background, enabling a presenter to interact with background visuals in the same way that TV meteorologists explain weather patterns using a green screen map.

There are times when a disembodied narration of screen-filling visuals is best; yet, there are times when you don’t want to force viewers to choose between speakers and visuals, as occurs when Zoom attendees lack the screen real estate or mastery of the Zoom interface needed to pin speakers to larger windows. Let’s face it: most Zoom users are overwhelmed by mute/unmute; asking them to pin and resize screens is a bridge too far.

Certainly, anyone can share a PowerPoint presentation in Zoom, bringing slide imagery to the fore and relegating speakers to tiny squares at the perimeter, like the world’s saddest episode of The Brady Bunch. Instead, I wanted to be a more prominent part of the show, akin to the accustomed ways speakers present onstage.

Television news anchors routinely use “OTS” (for over-the-shoulder) graphics as an effective segue between the newsreader and story video. OTS graphics work nicely in Zoom, introducing the topic or bullet points in a background slide, then sharing out the focal graphics. It sounds complicated, but it’s easy to get the hang of going to and returning from shared screens. It takes practice, but isn’t practice always key to improving presentation skill?

PowerPoint does all the heavy lifting of converting your slide visuals to still images (and even to video) suitable for use as Zoom backgrounds. Any PowerPoint slide show can be saved as individual JPG or PNG graphics. The “trick” is to compose the slide to afford room for the presenter’s upper torso without obstructing the visuals.

If you look at the two images below, you can see that I’ve left vacant the lower right quadrant of each slide. This presentation required use of templates, but left to my own aesthetics, I never use templates.

I hate ugly templates!
Reserve part of the screen for your image. Don’t block your bullets!

In practice, I adjust my camera such that my head and shoulders occupy the lower right of the Zoom screen (see below), then I can point at bullets and gesture at graphics. The weathercaster technique really shines when you present standing up. Then, you’d devote one-half to one-third of the slide layout to your graphics and the balance to you. You could even stand between two columns of bullets. Of course, this requires sufficient room between camera and green screen and, ideally, a dedicated camera and studio lighting.

Would it hurt to smile?

By now, you’ve gathered that achieving a true chroma key effect requires a physical green screen backdrop, not the virtual “where’d my ears go?” background effect often seen. A suitable 9-10′ muslin green screen backdrop will cost about twenty dollars on Amazon. I elected to spend more and get the green screen, crossbar, pair of backdrop supports and a bevy of studio lights and stands for $150.00. If you’ve got a way to hang a green sheet behind you (e.g., curtain rod, tacked to a wall, hung from the ceiling), that twenty-dollar backdrop works just fine.

Home Studio Kit

Having created your background visuals and saved each slide as a still JPG or PNG image, you’ll load them into Zoom as Virtual Backgrounds. To do so, start Zoom and go to Virtual Background in the Settings menu. Locate and click the small plus sign (+) (Arrow 1, below), then click on “Add Image” from the menu and navigate to where you’ve saved your background images. Add each image in this manner, keeping them in the order in which you want them displayed when presenting. Next, click the box to tell Zoom you have a green screen (Arrow 2), and finally, be sure the color shown matches your backdrop. Zoom should do this automatically, but you can also set it manually (Arrow 3).

Zoom’s Settings>Virtual Background screen

You’re ready to go, but before starting a presentation, launch Zoom and Virtual Background again. Practice selecting each background much as you might advance them as slides in a PowerPoint show, choosing them in succession while presenting. If you’ve loaded them in your preferred order, they will appear as options in that order. You will need to keep the Virtual Background settings panel open at all times during your presentation, so a second screen helps ensure the settings panel doesn’t disappear behind another window. You don’t want to be fumbling around in search of the Virtual Backgrounds panel while speaking.

The Weather Map Technique is harder to describe than it is to pull off. The key to keeping it smooth and simple calls to mind the out-of-towner visiting Manhattan who asked a local, “How do I get to Carnegie Hall?”

The answer’s the same: “Practice, practice, practice!”

Wish List: I look forward to a day when Zoom natively supports dynamic backgrounds allowing us to feed PowerPoints directly to a background instead of a shared screen. Also, I’d like to be able to folder backgrounds topically. Affording hosts greater control over the layout of Zoom windows would be nice. In Zoom hearings, think how it would help to be able to group lawyers according to their role in the litigation.


It’s About Time!

17 Wednesday Jun 2020

Posted by craigball in Computer Forensics, E-Discovery, General Technology Posts, Uncategorized

≈ 9 Comments

“Time heals all wounds.”  “Time is money.” “Time flies.” 

To these memorable mots, I add one more: “Time is truth.”

A defining feature of electronic evidence is its connection to temporal metadata or timestamps.  Electronically stored information is frequently described by time metadata denoting when ESI was created, modified, accessed, transmitted, or received.  Clues to time are clues to truth because temporal metadata helps establish and refute authenticity, accuracy, and relevancy.

But in the realms of electronic evidence and digital forensics, time is tricky.  It hides in peculiar places, takes freakish forms, and doesn’t always mean what we imagine.  Because time is truth, it’s valuable to know where to find temporal clues and how to interpret them correctly.

Everyone who works with electronic evidence understands that files stored in a Windows (NTFS) environment are paired with so-called “MAC times,” which have nothing to do with Apple Mac computers or even the MAC address identifying a machine on a network.  In the context of time, MAC is an initialism for Modified, Accessed and Created times.

That doesn’t sound tricky.  Modified means changed, accessed means opened and created means authored, right?  Wrong.  A file’s modified time can change due to actions neither discernible to a user nor reflective of user-contributed edits.  Accessed times change from events (like a virus scan) that most wouldn’t regard as accesses. Moreover, Windows stopped reliably updating file access times way back in 2007 when it introduced the Windows Vista operating system.  Created may coincide with the date a file is authored, but it’s as likely to flow from the copying of the file to new locations and storage media (“created” meaning created in that location). Copying a file in Windows produces an object that appears to have been created after it’s been modified!
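You can see MAC times directly from Python’s os.stat; the same caveats apply (the file below is a throwaway created just for the demonstration):

```python
import os
import tempfile
from datetime import datetime, timezone

# Create a scratch file, then read back its timestamps.
with tempfile.NamedTemporaryFile(delete=False, suffix=".txt") as f:
    f.write(b"hello")
    path = f.name

info = os.stat(path)
# Caution: st_ctime is creation time on Windows/NTFS but *inode change*
# time on Unix, and access times may not be reliably maintained at all.
for label, stamp in (("Modified", info.st_mtime),
                     ("Accessed", info.st_atime),
                     ("Created*", info.st_ctime)):
    print(label, datetime.fromtimestamp(stamp, tz=timezone.utc).isoformat())

os.remove(path)
```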

It’s crucial to protect the integrity of metadata in e-discovery, so changing file creation times by copying is a big no-no.  Accordingly, e-discovery collection and processing tools perform the nifty trick of changing MAC times on copies to match the times on the files copied.  Thus, targeted collection initially alters every file collected; but done correctly, original metadata values are restored and hash values don’t change.  Remember: system metadata values aren’t stored within the file they describe, so they aren’t included in the calculation of the file’s hash value.  The upshot is that changing a file’s system metadata values (including its filename and MAC times) doesn’t affect the file’s hash value.
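To make the point concrete, here’s a minimal Python sketch (the filenames, dates and contents are hypothetical) showing that a cryptographic hash is computed over content alone, so two copies with different names and MAC times hash identically:

```python
import hashlib

contents = b"Quarterly report, final version."

# Two "copies" of the same file with different names and MAC times:
original = {"name": "report.docx", "modified": "2020-05-01", "data": contents}
copy = {"name": "Copy of report.docx", "modified": "2020-05-14", "data": contents}

def file_hash(f):
    # Only the stored content is hashed; system metadata plays no part.
    return hashlib.sha256(f["data"]).hexdigest()

assert file_hash(original) == file_hash(copy)
```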

Conversely and ironically, opening a Microsoft Word document without making a change to the file’s contents can change the file’s hash value when the application updates internal metadata like the editing clock.  Yes, there’s even a timekeeping feature in Office applications!

Other tricky aspects of MAC times arise from the fact that time means nothing without place.  When we raise our glasses with the justification, “It’s five o’clock somewhere,” we acknowledge that time depends on place.  “Time” means time in a time zone, adjusted for daylight saving time and expressed as a UTC offset: the number of hours ahead of or behind GMT, the time at the Royal Observatory in Greenwich, England, atop the Prime or “zero” Meridian.

Time values of computer files are typically stored in UTC, for Coordinated Universal Time, essentially Greenwich Mean Time (GMT) and sometimes called Zulu or “Z” time, military shorthand for zero meridian time.  When stored times are displayed, they are adjusted by the computer’s operating system to conform to the user’s local time zone and daylight saving time rules.  So in e-discovery and computer forensics, it’s essential to know whether a time value is a local time value adjusted for the location and settings of the system or a UTC value.  The latter is preferred in e-discovery because it enables time normalization of data and communications, supporting the ability to order data from different locales and sources across a uniform timeline.
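A quick Python sketch of that normalization (the offset and times here are illustrative): a local timestamp recorded at UTC-05:00 resolves to the same instant expressed in UTC:

```python
from datetime import datetime, timezone, timedelta

# A timestamp recorded in local time at UTC-05:00 (illustrative values):
local = datetime(2020, 5, 2, 6, 58, 0, tzinfo=timezone(timedelta(hours=-5)))

# Normalized to UTC for ordering against data from other time zones:
normalized = local.astimezone(timezone.utc)
print(normalized.isoformat())  # 2020-05-02T11:58:00+00:00
```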

Four months of pandemic isolation have me thinking about time.  Lost time. Wasted time. Pondering where the time goes in lockdown.  Recently, I had to testify about time in a case involving discovery malfeasance and corruption of time values stemming from poor evidence handling.  When time values are absent or untrustworthy, forensic examiners draw on hidden (more accurately, encoded) time values to construct timelines or reveal forgeries.

Time values are especially important to the reliable ordering of email communications.  Most e-mails are conversational threads, often a mishmash of “live” messages (with their rich complement of header data, encoded attachments and metadata) and embedded text strings of older messages.  If the senders and receivers occupy different time zones, the timeline suffers: replies precede messages that prompted them, and embedded text strings make it child’s play to alter times and text.  It’s just one more reason I always seek production of e-mail evidence in native and near-native forms, not as static images.  Mail headers hold data that support authenticity and integrity—data you’ll never see produced in a load file.

Underscoring that last point, I’ll close with a wacky, wonderful example of hidden timestamps: time values embedded in Gmail boundaries.  This’ll blow your mind.

If you know where to look in digital evidence, you’ll find time values hidden like Easter eggs. 

E-mail must adhere to structural conventions to traverse the internet and be understood by different e-mail programs. One of these conventions is the use of a Content-Type declaration and setting of content boundaries, enabling systems to distinguish the message header region from the message body and attachment regions.

The next illustration is a snippet of simplified code from a forged Gmail message.  To see the underlying code of a Gmail message, users can select “Show original” from the message options drop-down menu (i.e., the ‘three dots’).

The line partly outlined in red advises that the message will be “multipart/alternative,” indicating that there will be multiple versions of the content supplied; commonly a plain text version followed by an HTML version. To prevent confusion of the boundary designator with message text, a complex sequence of characters is generated to serve as the content boundary. The boundary is declared to be “00000000000063770305a4a90212” and delineates a transition from the header to the plain text version (shown) to the HTML version that follows (not shown).
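For readers who can’t view the illustration, here’s a generic sketch of how such a declaration appears in message source. Only the boundary value comes from the message discussed; the surrounding lines are a typical skeleton, not the actual forged message:

```
Content-Type: multipart/alternative; boundary="00000000000063770305a4a90212"

--00000000000063770305a4a90212
Content-Type: text/plain; charset="UTF-8"

(plain text version of the message body)

--00000000000063770305a4a90212
Content-Type: text/html; charset="UTF-8"

(HTML version of the message body)

--00000000000063770305a4a90212--
```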

Thus, a boundary’s sole raison d’être is to separate parts of an e-mail; but because a boundary must be unique to serve its purpose, programmers insure against collision with message text by integrating time data into the boundary text.  Now, watch how we decode that time data.

Here’s our boundary, and I’ve highlighted fourteen hexadecimal characters in red:

Next, I’ve parsed the highlighted text into six- and eight-character strings, reversed their order and concatenated the strings to create a new hexadecimal number:

A decimal number is Base 10.  A hexadecimal number is Base 16.  They are merely different ways of notating numeric values.  So, 05a4a902637703 is just a really big number. If we convert it to its decimal value, it becomes: 1,588,420,680,054,531.  That’s 1 quadrillion, 588 trillion, 420 billion, 680 million, 54 thousand, 531.  Like I said, a BIG number.

But, a big number…of what?

Here’s where it gets amazing (or borderline insane, depending on your point of view).

It’s the number of microseconds that have elapsed since January 1, 1970 (midnight UTC), not counting leap seconds. A microsecond is a millionth of a second, and 1/1/1970 is the “Epoch Date” for the Unix operating system. An Epoch Date is the date from which a computer measures system time. Some systems resolve the Unix timestamp to seconds (10 digits), milliseconds (13 digits) or microseconds (16 digits).

When you make that curious calculation, the resulting date proves to be Saturday, May 2, 2020 6:58:00.054 AM UTC-05:00 DST.  That’s the genuine date and time the forged message was sent.  It’s not magic; it’s just math.
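The whole decoding can be sketched in a few lines of Python. The slicing below assumes the layout described above, where the fourteen timestamp characters immediately precede the boundary’s final two characters:

```python
from datetime import datetime, timezone

boundary = "00000000000063770305a4a90212"

# The fourteen timestamp characters precede the final two:
low = boundary[-16:-10]    # "637703" (the six-character string)
high = boundary[-10:-2]    # "05a4a902" (the eight-character string)

# Reverse the order, concatenate, and read as hexadecimal:
micros = int(high + low, 16)
print(micros)              # 1588420680054531 microseconds since 1/1/1970

# Convert microseconds since the Unix epoch to a date:
secs, us = divmod(micros, 1_000_000)
sent = datetime.fromtimestamp(secs, tz=timezone.utc).replace(microsecond=us)
print(sent.isoformat())    # 2020-05-02T11:58:00.054531+00:00, i.e., 6:58 AM UTC-05:00
```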

Had the timestamp been created by the Windows operating system, the number would signify the number of 100 nanosecond intervals between midnight (UTC) on January 1, 1601 and the precise time the message was sent.

Why January 1, 1601?  Because that’s the “Epoch Date” for Microsoft Windows.  Again, an Epoch Date is the date from which a computer measures system time.  Unix and POSIX measure time in seconds from January 1, 1970.  Apple used one-second intervals since January 1, 1904, and MS-DOS used seconds since January 1, 1980. Windows went with 1/1/1601 because, when the Windows operating system was being designed, we were in the first full 400-year cycle of the Gregorian calendar (implemented in 1582 to replace the Julian calendar): the cycle that began in 1601. Anchoring system time at the start of that cycle made the math cleaner.
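Here’s a hedged Python sketch of the Windows flavor of the calculation; the FILETIME value below is one I computed for the same instant as the Gmail example, not a value drawn from actual evidence:

```python
from datetime import datetime, timezone, timedelta

WINDOWS_EPOCH = datetime(1601, 1, 1, tzinfo=timezone.utc)

def filetime_to_datetime(filetime: int) -> datetime:
    # A Windows FILETIME counts 100-nanosecond intervals since 1/1/1601 (UTC);
    # integer-dividing by 10 converts the count to microseconds.
    return WINDOWS_EPOCH + timedelta(microseconds=filetime // 10)

# The instant decoded from the Gmail boundary, expressed as a FILETIME:
print(filetime_to_datetime(132328942800545310))
# 2020-05-02 11:58:00.054531+00:00
```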

Timestamps are everywhere in e-mail, hiding in plain sight.  You’ll find them in boundaries, message IDs, DKIM stamps and SMTP IDs.  Each server handoff adds its own timestamp.  It’s the rare e-mail forger who will find every embedded timestamp and correctly modify them all to conceal the forgery. 

When e-mail is produced in its native and near-native forms, there’s more than meets the eye: the means to generate reliable timelines and flush out forgeries and excised threads.  Next time the e-mail you receive in discovery seems “off” and your opponent balks at giving you suspicious e-mail evidence in faithful electronic formats, ask yourself: What are they trying to hide?

The takeaway is this: Time is truth and timestamps are evidence in their own right.  Isn’t it about time we stop letting opponents strip it away?

Tip of the hat to Arman Gungor at Metaspike whose two excellent articles about e-mail timestamp forensics reminded me how much I love this stuff.  https://www.metaspike.com/timestamps-forensic-email-examination/


Don’t Bet the Farm on Slack Space

14 Thursday May 2020

Posted by craigball in Computer Forensics, E-Discovery, General Technology Posts

≈ 7 Comments

A depiction of file slack from Ball, E-Discovery Workbook © 2020

A federal court appointed me Special Master, tasked to, in part, search the file slack space of a party’s computers and storage devices.  The assignment prompted me to reconsider the value of this once-important forensic artifact.

Slack space is the area between the end of a stored file and the end of its concluding cluster: the difference between a file’s logical and physical size. It’s wasted space from the standpoint of the computer’s file system, but it has forensic significance by virtue of its potential to hold remnants of data previously stored there.  Slack space is often confused with unallocated clusters or  free space, terms describing areas of a drive not currently used for file storage (i.e., not allocated to a file) but which retain previously stored, deleted files. 

A key distinction between unallocated clusters and slack space is that unallocated clusters can hold the complete contents of a deleted file whereas slack space cannot.  Data recovered (“carved”) from unallocated clusters can be quite large—spanning thousands of clusters—where data recovered from a stored file’s slack space can never be larger than one cluster minus one byte.  Crucially, unallocated clusters often retain a deleted file’s binary header signature serving to identify the file type and reveal the proper way to decode the data, whereas binary header signatures in slack space are typically overwritten.

A little more background on file storage may prove useful before I describe the dwindling value of slack space in forensics.

Electronic storage media are physically subdivided into millions, billions or trillions of sectors of fixed storage capacity.  Historically, disk sectors on electromagnetic hard drives were 512 bytes in size.  Today, sectors may be much larger (e.g., 4,096 bytes).  A sector is the smallest physical storage unit on a disk drive, but not the smallest accessible storage unit.  That distinction belongs to a larger unit called the cluster, a logical grouping of sectors and the smallest storage unit a computer can read from or write to.  On Windows machines, clusters are 4,096 bytes (4kb) by default for drives up to 16 terabytes.  So, when a computer stores or retrieves data, it must do so in four kilobyte clusters.

File storage entails allocation of enough whole clusters to hold a file.  Thus, a 2kb file will only fill half a 4kb cluster, the balance being slack space.  A 13kb file will tie up four clusters, although just a fraction of the final, fourth cluster is occupied by the file.  The balance is slack space, and it could hold fragments of whatever was stored there before.  Because it’s rare for files to be perfectly divisible by 4 kilobytes and many stored files are tiny, much drive space is lost to slack.  Using smaller clusters would mean less slack space, but any efficiencies gained would come at the cost of unwieldy file tracking and retrieval.
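The arithmetic is easy to sketch in Python, assuming the 4kb default cluster size discussed above:

```python
CLUSTER = 4096  # default NTFS cluster size, in bytes

def slack_bytes(logical_size: int, cluster: int = CLUSTER) -> int:
    # Physical size is the logical size rounded up to whole clusters;
    # the difference between physical and logical size is file slack.
    clusters_used = -(-logical_size // cluster)  # ceiling division
    return clusters_used * cluster - logical_size

print(slack_bytes(2 * 1024))   # 2048: a 2kb file wastes half its cluster
print(slack_bytes(13 * 1024))  # 3072: a 13kb file ties up four clusters
```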

So, slack space holds forensic artifacts and those artifacts tend to hang around a long time.  Unallocated clusters may be called into service at any time and their legacy content overwritten.  But data lodged in slack space endures until the file allocated to the cluster is deleted, on conventional “spinning” hard drives at any rate.

When I started studying computer forensics in the MS-DOS era, slack space loomed large as a source of forensic intelligence.  Yet, apart from training exercises where something was always hidden in slack, I can’t recall a matter I’ve investigated this century which turned on evidence found in slack space.  The potential is there, so when it makes sense to do it, examiners search slack using unique phrases unlikely to throw off countless false positives.

But how often does it make sense to search slack nowadays?

I’ve lately grappled with that question because it seems to me that the shopworn notions respecting slack space must be recalibrated.

Keep in mind that slack space holds just a shard of data with its leading bytes overwritten.  It may be overwritten minimally or extensively, but some part is obliterated, always.  Too, slack space may hold the remnants of multiple deleted files as overlapping artifacts: files written, deleted, overwritten by new data, deleted again, then overwritten again (just less extensively).  Slack can be a real mess.

Fifteen years ago, when programs stored text in ASCII (i.e., encoded using the American Standard Code for Information Interchange or simply “plain text”), you could find intelligible snippets in slack space.  But since 2007, when Microsoft changed the format of Office productivity files like Word, PowerPoint and Excel files to Zip-compressed XML formats, there’s been a sea change in how Office applications and other programs store text.  Today, if a forensic examiner looks at a Microsoft Office file as it’s written on the media, the content is compressed.  You won’t see any plain text.  The file’s contents resemble encrypted data.  The “PK” binary header signature identifying it as compressed content is gone, so how will you recognize zipped content?  What’s more, the parts of the Zip file required to decompress the snippet have likely been obliterated, too. How do you decode fragments if you don’t know the file type or the encoding schema?
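A rough Python illustration of the problem, with zlib standing in for the DEFLATE compression used inside Zip-based Office files (the “smoking gun” text is, of course, made up):

```python
import zlib

text = ("The smoking gun memo went out Friday. " * 10).encode("utf-8")
compressed = zlib.compress(text)

# The phrase is plainly visible in the original bytes...
assert b"smoking gun" in text

# ...but a literal search of the compressed bytes finds nothing,
# even though the phrase really is "in there":
assert b"smoking gun" not in compressed

# And stripped of its leading bytes, as in slack, the fragment
# can't even be decompressed:
# zlib.decompress(compressed[10:])  # raises zlib.error
```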

The best answer I have is you throw common encodings against the slack and hope something matches up with the search terms.  More and more, nothing matches, even when what you seek really is in the slack space. Searches fail because the data’s encoded and invisible to the search tool.  I don’t know how searching slack stacks up against the odds of winning the lottery, but a lottery ticket is cheap; a forensic examiner’s time isn’t.

That’s just the software.  Storage hardware has evolved, too.  Drives are routinely encrypted, and some oddball encryption methods make it difficult or impossible to explore the contents of file slack.  The ultimate nail in the coffin for slack space will be solid state storage devices and features like wear leveling and TRIM, which routinely reposition data and promise to relegate slack space and unallocated clusters to the digital dung heap of history.

Taking a fresh look at file slack persuades me that it still belongs in a forensic examiner’s bag of tricks when it can be accomplished programmatically and with little associated cost.  But, before an expert characterizes it as essential or a requesting party offers it as primary justification for an independent forensic examination, I’d urge the parties and the Court to weigh cost versus benefit; that is, to undertake a proportionality analysis in the argot of electronic discovery.  Where searching slack space was once a go-to for forensic examination, it’s an also-ran now. Do it, when it’s an incidental feature of a thoughtfully composed examination protocol; but don’t bet the farm on finding the smoking gun because the old gray mare, she ain’t what she used to be!
See? I never metaphor I didn’t like.

******************************

Postscript: A question came up elsewhere about solid state drive forensics. Here was my reply:

The paradigm-changing issue with SSD forensic analysis versus conventional magnetic hard drives is the relentless movement of data by wear leveling protocols and a fundamentally different data storage mechanism. Solid state cells have a finite life measured in the number of write-rewrite cycles.

To extend their useful life, solid state drives move data around to ensure that all cells are written with roughly equal frequency. This is called “wear leveling,” and it works. A consequence of wear leveling is that unallocated cells are constantly being overwritten, so SSDs do not retain deleted data as electromagnetic drives do. Wear leveling (and the requisite remapping of data) is handled by an SSD drive’s onboard electronics and isn’t something users or the operating system control or access.

Another technology, an ATA command called TRIM, is controllable by the operating system and serves to optimize drive performance by disposing of the contents of storage cell groups called “pages” that are no longer in use. Oversimplified, it’s faster to write to an empty memory page than to initiate an erasure first; so, TRIM speeds the write process by clearing contents before they are needed, in contrast to an electromagnetic hard drive which overwrites clusters without need to clear contents beforehand.

The upshot is that resurrecting deleted files by identifying their binary file signatures and “carving” their remnant contents from unallocated clusters isn’t feasible on SSD media. Don’t confuse this with forensically-sound preservation and collection. You can still image a solid state drive, but you’re not going to get unallocated clusters. Too, you won’t be interfacing with the physical media when grabbing a bitstream image. Everything is mediated by the drive electronics.

******************************

Dear Reader, Sorry I’ve been remiss in posting here during the COVID crisis. I am healthy, happy and cherishing the peace and quiet of the pause, hunkered down in my circa-1880 double shotgun home in New Orleans, enjoying my own cooking far too much. Thanks to Zoom, I completed my Spring Digital Evidence class at the University of Texas School of Law, so now one day just bubbles into the next, and I’m left wondering, Where did the day go? Every event where I was scheduled to speak or teach cratered, with no face-to-face events sensibly in sight for 2020. One possible exception: I’ve just joined the faculty of the Tulane School of Law ten minutes upriver for the Fall semester, and plan to be back in Austin teaching in the Spring. But, who knows, right? Man plans and gods laugh.

We of a certain age may all be Zooming and distancing for many months. As one who’s bounced around the world peripatetically for decades, not being constantly on airplanes and in hotels is strange…and stress-relieving. While I miss family, friends and colleagues and mourn the suffering others are enduring, I’ve benefited from the reboot, ticking off household projects and kicking the tires on a less-driven day-to-day. It hasn’t hurt that it’s been the best two months of good weather I’ve ever seen, here or anywhere. The prospect of no world travel this summer–and no break from the soon-to-be balmy Big Easy heat–is disheartening, but small potatoes in the larger scheme of things.

Be well, be safe, be kind to yourself. This, too, shall pass and as my personal theme song says, There's a Great Big Beautiful Tomorrow. Just a Dream Away.
