I’ve long been fascinated by electronic search. I especially love delving into the arcane limitations of lexical search because, awful Grinch that I am, I get a kick out of explaining to lawyers why their hard-fought search queries and protocols are doomed to fail. But, once we work through the Seven Stages of Attorney E-Discovery Grief: Umbrage, Denial, Anger, Angry Denial, Fear, Finger Pointing, Threats and Acceptance, there’s almost always a workaround to get the job done with minimal wailing and gnashing of teeth.
Three consults today afforded three chances to chew over problematic search strategies:
- First, the ask was to search for old CAD/CAM drawings in situ on an opponent’s file servers based on words appearing on drawings.
- Another lawyer sought to run queries in M365 seeking responsive text in huge attachments.
- The last lawyer wanted me to search the contents of a third-party’s laptop for subpoenaed documents but without the machine being imaged or its contents processed before search.
Most of my readers are e-discovery professionals so they’ll immediately snap to the reasons why each request is unlikely to work as planned. Before I delve into my concerns, let’s observe that all these requests seemed perfectly reasonable in the minds of the lawyers involved, and why not? Isn’t that how keyword and Boolean search is supposed to work? Sadly, our search reach often exceeds our grasp.
Have you got your answers to why they may fail? Let’s compare notes.
- When it comes to lexical search, CAD/CAM drawings differ markedly from Word documents and spreadsheets. Word processed documents and spreadsheets contain text encoded as ASCII or Unicode characters. That is, text is stored as, um, text. In contrast, CAD/CAM drawings tend to be vector graphics. They store instructions describing how to draw the contents of the plans geometrically; essentially how the annotations look rather than what they say. So, the text is an illustration of text, much like a JPG photograph of a road sign or a static TIFF image of a document—both inherently unsearchable for text unless paired with extracted or OCR text in ancillary load files. Bottom line: Unless the CAD/CAM drawings are subjected to effective optical character recognition before being indexed for search, lexical searches won’t “see” any text on the face of the drawings and will fail.
- M365 has a host of limits when it comes to indexing Cloud content for search, and of course, if it’s not in the index, it won’t turn up in response to search. For example, M365 won’t parse and index an email attachment larger than 150MB. Mind you, few attachments will run afoul of that capacious limit, but some will. Similarly, M365 will only parse and index the first 2 million characters of any document. That means only the first 600-1,000 pages of a document will be indexed and searchable. Here again, that will suffice for the ordinary, but may prove untenable in matters involving long documents and data compilations. There are other limits on, e.g., how deeply a search will recurse through nested- and embedded content and the body text size of a message that will index. You can find a list of limits here (https://learn.microsoft.com/en-us/microsoft-365/compliance/limits-for-content-search?view=o365-worldwide#indexing-limits-for-email-messages) and a discussion of so-called “partially indexed” files here (https://learn.microsoft.com/en-us/microsoft-365/compliance/partially-indexed-items-in-content-search?view=o365-worldwide). Remember, all sorts of file types aren’t parsed or indexed at all in M365. You must tailor lexical search to the data under scrutiny. It’s part of counsels’ duty of competence to know what their search tools can and cannot do when negotiating search protocols and responding to discovery using lexical search.
- In their native environments, many documents sought in discovery live inside various container files ranging from e-mail and attachments in PST and OST mail containers to compressed Zip containers. Encrypted files may be thought of as being sealed inside an impenetrable container that won’t be searched. The upshot is that much data on a laptop or desktop machine cannot be thoroughly searched by keywords and queries by simply running searches within an operating system environment (e.g., in Windows or macOS). Accordingly, forensic examiners and e-discovery service providers collect and “process” data to make it amenable to search. Moreover, serial search of a computer’s hard drive (versus search of an index) is painfully slow, and so unreasonably expensive when charged by the hour. For more about processing ESI in discovery, here’s my 2019 primer (http://www.craigball.com/Ball_Processing_2019.pdf).
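To make the container problem concrete, here is a minimal Python sketch using only the standard library. It is an illustration under stated assumptions, not how any particular processing tool works: the sample memo, the file name `memo.txt` and the keyword are all hypothetical, and real processing must also cope with mail containers, nested containers and encryption, none of which is attempted here. It builds a small Zip in memory, then compares a naive scan of the container's raw bytes with a scan that first opens the container and reads each member.

```python
import io
import zipfile

# Hypothetical keyword, chosen only for illustration.
KEYWORD = b"privileged"


def build_sample_zip():
    """Build an in-memory Zip holding one compressed text file (made-up sample data)."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", compression=zipfile.ZIP_DEFLATED) as archive:
        archive.writestr("memo.txt", "This memo is privileged and confidential. " * 50)
    return buffer.getvalue()


def raw_scan(data, keyword):
    """Naive search: look for the keyword in the container's bytes as stored on disk."""
    return keyword in data


def container_aware_scan(data, keyword):
    """Open the Zip, decompress each member, and return the names of members with a hit."""
    hits = []
    with zipfile.ZipFile(io.BytesIO(data)) as archive:
        for name in archive.namelist():
            if keyword in archive.read(name):
                hits.append(name)
    return hits


zip_bytes = build_sample_zip()
raw_hit = raw_scan(zip_bytes, KEYWORD)
member_hits = container_aware_scan(zip_bytes, KEYWORD)

print("Raw scan found keyword:", raw_hit)
print("Members with hits after opening the container:", member_hits)
```

The keyword is plainly in the memo, yet the raw scan misses it because the text is stored compressed; only the scan that unpacks the container sees it. That unpacking step, done across every container type at scale, is a large part of what "processing" buys you.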
In case I don’t post before Chanukah, Christmas and the New Year, have a safe and joyous holiday!
Jeff Kerr said:
Howdy Craig. Great post. CAD files truly are a nightmare for e-discovery purposes. I’ve recently become a user of Fusion360, a popular CAD application, and the “documents” created in this tool have layers upon layers of data that would be almost impossible to preview in conventional e-discovery software. The layers include such things as user logs and descriptions related to each version of a design. As these tools become more collaborative, more and more kinds of data become embedded in them, including commentary and messages between collaborators. At a certain point I expect the only way to review these files (in a complete way) will be in the original application.
craigball said:
Perhaps, although there were some free, robust viewer tools available for AutoCAD files last time I had to deal with them in earnest. Thanks for weighing in and best regards.
Amy Sellars said:
Happy Holidays, Craig, and thank you for the blog. I always learn something – even if what I learn is how to better explain to others.
Pierre Chamberland said:
Before becoming part of IPRO, we had explored a partnership with https://www.vdr.com/cadnection – they had great tools for indexing and search of multiple CAD formats.