Twenty years ago, I dreamed up a website where you would submit a list of eDiscovery keywords and queries and the site would critique the searches and suggest improvements to make them more efficient and effective. It would flag stop words, propose alternate spellings, and alert the user to pitfalls making searches less effective or noisy. I even envisioned it testing queries against a benign dataset to identify overly broad terms and false hits.
I believed this tool would be invaluable for helping lawyers enhance their search skills and achieve greater efficiency. Over the years, I tried to bring this idea to life, seeking proposals from offshore developers and pitching it to e-discovery software publishers as a value-add. In the end, a pipe dream. Even now, nothing like it exists.
The emergence of AI-powered Large Language Models like ChatGPT made me think that what I’d hoped to bring to life years ago might finally be feasible. I wondered if I could create a prompt for ChatGPT that would achieve much of what I envisioned. So, I dedicated a sunny Sunday morning to playing “prompt engineer,” a newly minted term for those who craft AI prompts to achieve desired outcomes.
The result was promising, a significant step forward for lawyers who struggle with search queries without understanding why some fail. Most search errors I encounter aren’t subtle. I’ve written about ways to improve lexical search, and the techniques aren’t rocket science, though they require some familiarity with how electronically stored information is indexed and how search syntaxes differ across platforms. Okay, maybe a little rocket science. But if you’re using a tool for critical tasks, shouldn’t you know what it can and cannot do?
Some believe refining keywords and queries is a waste of time, casting keyword search as obsolete. Perhaps on your planet, Klaatu, but here on Earth, lawyers continue using keywords with reckless abandon. I’m not defending that but neither will I ignore lawyers’ penchant for lexical search. Until the cost, reliability, and replicability of AI-enabled discovery improve, keywords will remain a tool for sifting through large datasets. However, we can use AI LLMs right now to enhance the performance and efficiency of shopworn approaches.
How Does It Work?
The prompt below was developed and tested on ChatGPT-4o (the “o” is for “omni”), a subscription product that costs $20.00/month. I ran it in the free versions, too, and it seemed to work; but my experience is with 4o, and I commend the latest version to you as twenty bucks well spent.
To use the prompt, log in to ChatGPT and copy and paste the prompt below into the chat window (don’t hit “Enter” yet) then use the paperclip button to upload a discrete list of the keywords and queries for assessment. You can upload them in plain text, rich text, or as a Word document or PDF. Now, hit “Enter.” Depending upon the length of ChatGPT’s response, you may need to click “Continue Generating” or type “continue” into the chat box to force the application to complete its response.
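If you’d rather script the exchange than work in the chat window, the same approach can be run against the API, which also lets you pin the temperature at zero (a setting discussed in the comments below). Here is a minimal sketch, assuming the OpenAI Python SDK; the filenames are hypothetical stand-ins for the prompt saved as one text file and your keyword list as another:

```python
from pathlib import Path

from openai import OpenAI  # pip install openai

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Hypothetical filenames: the prompt below saved locally, plus your keyword list.
prompt = Path("search_audit_prompt.txt").read_text(encoding="utf-8")
keywords = Path("proposed_keywords.txt").read_text(encoding="utf-8")

response = client.chat.completions.create(
    model="gpt-4o",
    temperature=0,  # minimize run-to-run variability in the critique
    messages=[{
        "role": "user",
        "content": f"{prompt}\n\nKeywords and queries to assess:\n{keywords}",
    }],
)
print(response.choices[0].message.content)
```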
-------Start of Prompt to Paste (start on next line)-------
### AI Prompt for Analyzing Keywords and Boolean Queries
Introduction
Purpose of the analysis: to enhance the efficiency and accuracy of keyword and Boolean queries in retrieving relevant documents during discovery in litigation. Highlight the need to balance recall and precision to ensure all relevant documents are identified without disproportionate noise. The analysis aims to optimize search strategies based on a comprehensive set of parameters, offering query-specific feedback as a table and general guidance as a narrative to improve recall and precision for lexical search.
Analysis Framework
1. Stop Word Identification
Objective: Determine if proposed terms are likely to be stop words in e-discovery tools, which may be ignored during indexing and search.
Approach: Review each term against common stop word (also called noise word) lists used by popular e-discovery platforms, as follows:
- Relativity (dtSearch) and standalone dtSearch Default Stop Words: The default noise word list consists of punctuation marks, single letters and numbers, and the following words: a, about, after, all, also, an, and, another, any, are, as, at, be, because, been, before, being, between, both, but, by, came, can, come, could, did, do, each, even, for, from, further, furthermore, get, got, had, has, have, he, her, here, hi, him, himself, his, how, however, i, if, in, indeed, into, is, it, its, just, like, made, many, me, might, more, moreover, most, much, must, my, never, not, now, of, on, only, or, other, our, out, over, said, same, see, she, should, since, some, still, such, take, than, that, the, their, them, then, there, therefore, these, they, this, those, through, thus, to, too, under, up, very, was, way, we, well, were, what, when, where, which, while, who, will, with, would, you, your
- DISCO Default Stop Words: Stop words are words that are not indexed by DISCO search and will not get hits in search results. Matters created after April 29, 2019 will index all words and no longer remove stop words from the search index. Matters created prior to April 29, 2019 do not index the following stop words: a, an, and, are, as, at, be, by, for, if, in, is, it, of, on, or, that, the, their, then, there, these, they, to, was, with
- Everlaw Default Stop Words: There are no stop or noise words; Everlaw indexes all words for content searches.
- Logikcull Default Stop Words: There are no stop or noise words; Logikcull indexes all words for content searches.
- IPRO ZyLAB ONE Default Stop Words: and, exclude, not, number range, or, precedes, quorum, to, within. If a term or combination of terms you are searching for contains a hyphen, that term will be found, even if you did not include a hyphen in your search query. For example, when you search for ‘email’ or ‘e mail’, it will also find ‘e-mail’. However, ‘e-mail’ will only retrieve ‘e-mail’. In addition, ‘e mail’ will not find ‘email’ or the other way around (‘email’ will not find ‘e mail’). It is not possible to search for capitalized letters, since all terms in the dictionary are stored in lower case.
- IBM Discovery Default Stop Words: a, about, above, after, again, am, an, and, any, are, as, at, be, because, been, before, being, below, between, both, but, by, can, did, do, does, doing, don, down, during, each, few, for, from, further, had, has, have, having, he, her, here, hers, herself, him, himself, his, how, i, im, if, in, into, is, it, its, itself, just, me, more, most, my, myself, no, nor, not, now, of, off, on, once, only, or, other, our, ours, ourselves, out, over, own, s, same, she, should, so, some, such, t, than, that, the, their, theirs, them, themselves, then, there, these, they, this, those, through, to, too, under, until, up, very, was, we, were, what, when, where, which, while, who, whom, why, will, with, you, your, yours, yourself, yourselves
- Nuix Discover (formerly Ringtail) Default Stop Words: a, about, after, all, also, an, and, another, any, are, as, at, be, because, been, before, being, both, but, by, came, can, come, could, did, do, each, even, for, further, furthermore, get, got, had, has, have, he, her, here, hi, him, himself, how, however, i, if, in, indeed, into, is, it, its, just, like, made, many, me, might, more, moreover, most, much, must, my, never, not, now, of, on, only, or, other, our, out, over, said, same, see, she, should, since, some, still, such, take, than, that, the, their, them, then, there, therefore, these, they, this, those, through, thus, too, under, up, very, was, way, we, well, were, what, when, where, which, while, who, will, with, would, you, your
- Exterro FTK Default Stop Words: a, able, about, across, after, ain’t, all, almost, also, am, among, an, and, any, are, aren’t, as, at, be, because, been, but, by, can, can’t, cannot, could, could’ve, couldn’t, dear, did, didn’t, do, does, doesn’t, don’t, either, else, ever, every, for, from, get, got, had, hadn’t, has, hasn’t, have, haven’t, he, her, hers, him, his, how, however, i, if, in, into, is, isn’t, it, it’s, its, just, least, let, like, likely, may, me, might, most, must, my, neither, no, nor, not, of, off, often, on, only, or, other, our, own, rather, said, say, says, she, should, shouldn’t, since, so, some, than, that, the, their, them, then, there, these, they, they’re, this, tis, to, too, twas, us, wants, was, wasn’t, we, we’re, we’ve, were, weren’t, what, when, where, which, while, who, whom, why, will, with, would, would’ve, wouldn’t, yet, you, you’d, you’ll, you’re, you’ve, your
2. Synonyms and Variants
Objective: Identify synonyms, spelling variants, British alternative spellings, related terms, common misspellings, and transpositions.
Approach: Use linguistic databases and thesauri to expand each term into potential variants that could capture relevant documents. Supply alternative spellings and common misspellings.
3. Industry-Specific Jargon and Abbreviations
Objective: Incorporate industry-specific language that might be used in relevant documents.
Approach: Consult industry glossaries and articles by experts to identify terms and abbreviations commonly used in the field (as specified or, in the absence of a specification, as may be gleaned from the context of the queries submitted here).
4. Boolean Query Structure and Logic
Objective: Evaluate the logic and structure of each Boolean query to ensure alignment with search objectives.
Approach: Analyze each query for logical consistency, correct operator usage, and alignment with intended search parameters.
5. Search Syntax and Connectors
Objective: Ensure compatibility with the syntax used by specific e-discovery tools and the proper use of connectors and parentheses for logical grouping of operators.
Approach: Adjust query syntax to match the requirements of different platforms (e.g., Relativity, OpenText Insight, DISCO, Nuix Discover, Everlaw, Logikcull).
- Identify common syntactic errors across tools, noting variations like:
  - Relativity/dtSearch: “w/n”
  - OpenText Insight: “NEAR/n”
  - DISCO: “/n” for unordered terms, “+n” for ordered terms.
6. Wildcards and Stemming
Objective: Utilize wildcards and stemming to broaden term inclusion without sacrificing precision.
Approach: Evaluate opportunities to use wildcards or stemming effectively within each query and articulate such uses.
7. Special Characters and Indexing
Objective: Ensure that queries do not include characters that are excluded from indexing or reserved for special purposes.
Approach: Identify and remove or adapt special characters or reserved operators in queries.
8. Spaces and Punctuation
Objective: Understand how spaces and punctuation are treated in the index being searched.
Approach: Analyze the treatment of these elements within the tool’s indexing process and adjust queries accordingly.
9. Numeric Values and Short Words
Objective: Address potential indexing limitations for numeric values and short words.
Approach: Determine whether these elements are indexed and consider alternative search strategies if not.
10. Diacritical Marks
Objective: Address alternative spellings of words incorporating diacritical characters.
Approach: Evaluate whether the tool creates equivalencies for diacritical variations and adjust queries as necessary.
11. Case Sensitivity
Objective: Determine if the search tool supports different letter cases (e.g., SAT vs. sat).
Approach: Test queries for case sensitivity and adjust strategies accordingly.
**Objective:**
Evaluate the effectiveness of each keyword and Boolean query in retrieving relevant documents for litigation-related discovery. The analysis aims to optimize search strategies based on a comprehensive set of parameters, offering query-specific feedback and general guidance to improve recall and precision for lexical search.
**Instructions:**
- **Review Each Term:** Analyze each keyword against the specified Objectives and Approaches above, considering tool compatibility regarding syntax and character handling.
- **Analyze Boolean Queries:** Evaluate the structure and logic of each query for effectiveness and adherence to the tool’s syntax rules in furtherance of the specified Objectives and Approaches above.
- **Presentation of Review and Analysis:** Temperature 0. Present the results in a tabular format with each keyword/query presented as a row and each numbered Objective above addressed in a column.
- **Provide Feedback:** After individual assessments, supply a comprehensive essay with guidance on improving recall and precision in e-discovery lexical searches, incorporating insights from experts like Craig Ball[1] (craigball.com and craigball.net) and The Sedona Conference Working Group 1, with attribution.
---EXAMPLE: ### Guidance Essay: Improving Recall and Precision in Lexical Search for eDiscovery
**1. Pre-Search Preparation:**
- Understand the dataset’s sources, types, and organization. Engage subject matter experts for relevant terminology insights.
**2. Crafting Comprehensive Keyword Lists:**
- Develop exhaustive lists with synonyms, acronyms, and industry-specific jargon. Account for linguistic variations and common misspellings.
**3. Optimizing Boolean Logic and Search Syntax:**
- Refine Boolean logic for precision. Understand tool-specific syntax, like proximity search differences.
**4. Incorporating Wildcards and Stemming:**
- Use wildcards and stemming to broaden parameters without overreach.
**5. Handling Index Exclusions and Special Characters:**
- Recognize special character treatment and indexing criteria to avoid missed documents.
**6. Addressing Diacriticals and Case Sensitivity:**
- Ensure searches accommodate diacriticals and case variations.
**7. Continuous Refinement and Documentation:**
- Iterate and document search strategies for consistency and defensibility.
These strategies enhance eDiscovery processes by improving lexical search precision and recall. As Craig Ball highlights, “The efficacy of eDiscovery lies not just in technology, but in the thoughtful application of that technology to the unique demands of each case” (Ball, 2024).
-------End of Prompt (ends on prior line)-------
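Incidentally, the mechanical heart of Framework item 1, the stop word check, is easy to replicate outside the LLM if you want a deterministic second opinion on the model’s output. Here is a minimal sketch, assuming a hypothetical proposed_keywords.txt with one term or query per line, and using an abridged version of the dtSearch noise word list above:

```python
from pathlib import Path

# Abridged from the dtSearch/Relativity default noise word list above;
# extend with the full list for real use.
DTSEARCH_NOISE = {
    "a", "about", "after", "all", "an", "and", "any", "are", "as", "at",
    "be", "but", "by", "for", "from", "in", "is", "it", "not", "of",
    "on", "or", "the", "their", "to", "was", "were", "with", "would",
}
CONNECTORS = {"and", "or", "not"}  # operators, not search terms, in most tools

queries = Path("proposed_keywords.txt").read_text(encoding="utf-8").splitlines()

for q in queries:
    # Flag any token a dtSearch-style index would ignore, skipping connectors.
    tokens = (t.strip('()"') for t in q.lower().split())
    flagged = [t for t in tokens if t in DTSEARCH_NOISE and t not in CONNECTORS]
    if flagged:
        print(f"{q!r}: likely unindexed noise words -> {', '.join(flagged)}")
```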
What Do You Think? Can You Do Better?
In my experience with AI-powered Large Language Models (LLMs), they often excel at some tasks while underperforming in others. When testing query sets, I was modestly pleased by the results and felt they could be valuable for users looking to avoid common mistakes in search formulation. However, in other trials, issues with list formatting caused ChatGPT to struggle, resulting in less-than-optimal outcomes.[2]
The prompt provided here serves as a starting point for further development. Don’t hesitate to reformat the output, incorporate your own assessment criteria, or include specifics about your e-discovery platform and its unique features. I saw improved results when I uploaded pertinent additional information, such as my own Primer on Processing in E-Discovery found here. Ideally, I’d also supply the search syntax for the discovery platform used for the search. Experiment with different LLMs and tailor the prompt (and uploads) to fit each case. I am confident that my readers can build upon these ideas, and I encourage you to share your findings for the benefit of the legal community.
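For instance, platform specifics can be appended to the prompt before submission. Here is a hypothetical fragment building on the earlier scripted sketch; the platform note is illustrative, so substitute your own tool’s syntax rules:

```python
# Hypothetical platform note; substitute your own tool's syntax rules.
platform_note = (
    "Target platform: Relativity (dtSearch index). Use w/n for proximity, "
    "* for the wildcard, and assume the default noise word list above."
)
content = f"{prompt}\n\n{platform_note}\n\nKeywords and queries to assess:\n{keywords}"
```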
[1] Before you conclude that I egomaniacally injected myself into the mix, I asked ChatGPT to assist with drafting and refining the prompt (because it tends to do a good job refining its own prompts), and the AI sucked me in from whatever dark recesses it explores. Feel free to take me out, coach.
[2] Oddly, the more I tweaked the parameters (or invited ChatGPT to do so), the less useful the output became. It was almost as if the LLM started to get bored with the project. Ultimately, I scrapped the more heavily refined version of the prompt in favor of an early iteration. That encapsulates my frustration with AI LLMs: they seem to reach a point at which improvement is elusive. Take, for example, the AI-generated illustration accompanying this post. No amount of prompting succeeded in cajoling the system to change “lexial” to “lexical” or generate something without robots. I’m getting pretty darn tired of all the robots.

Eric Fookes said:
Your article about leveraging AI for improving keyword searches in eDiscovery resonates with our own experiences at Aid4Mail. We’ve been exploring similar possibilities, particularly in using GPTs (from ChatGPT) to craft email search scripts and lists. While the potential is clear, as you’ve highlighted, our journey has also revealed the current limitations of AI in this domain.
We’ve spent several months developing a GPT model to create email search scripts and search lists tailored for Aid4Mail. The GPT is equipped with a 64-page manual detailing Aid4Mail’s syntax rules, operators, and examples. Despite additional instructions to address recurrent errors, the model still struggles with accuracy.
For instance, our GPT is designed to handle prompts like:
“I need to identify emails that potentially suggest insider trading activities, including vague references or implicit mentions. Please assist in crafting a comprehensive filter script with appropriate keywords, considering spelling variations and related terms. Additionally, if beneficial, develop a linked search list to enhance the script’s effectiveness. The search period spans from September 2021 to April 2024, focusing on emails associated with john.doe@aid4mail.com and jane.doe@aid4mail.com. Ensure the search terms are thorough and optimized to capture all relevant communications.”
Unfortunately, the GPT still makes too many mistakes, so we haven’t released it publicly. However, with the rapid advancement of AI, we’re optimistic that a reliable model will soon be within reach. These are exciting times!
craigball said:
Dear Eric:
Thanks for weighing in with your work. I agree that our reach exceeds our grasp when it comes to AI LLMs just now. They do make too many mistakes to allow them to take the tiller…and the damn things never seem to supply the same answer twice in succession! What’s THAT about!?!?
Nonetheless, if running a set of proposed search terms against an LLM serves to surface just one or two errors or generate even a single useful alternative or improvement, it’s a worthwhile effort. If we could monetize the impact of a single bad search term that slips by, I’d hazard the guess that it equates to thousands of Swiss francs; at any rate, a sufficient sum to justify using a tool costing US$20.00 or less. Be well and please keep up the good work on your splendid tool, Aid4Mail.
John Tredennick said:
Hey Craig:
You made this statement: “They do make too many mistakes to allow them to take the tiller…and the damn things never seem to supply the same answer twice in succession! What’s THAT about!?!?”
I wanted to respond.
First, ChatGPT was configured for a high degree of creativity. They included a temperature setting that represents the level of creativity. I believe the default is 10 (or was when I looked a year ago).
We set creativity in GPT to zero. That reduces variability in answers although there will always be some variability given that the system has no memory of its earlier answer. The same is true if you gave the assignment to a second associate.
We see very few mistakes with our system. I have read about the Lexis and West hallucinations reported in the Stanford study. For reasons longer than this post allows, that doesn’t happen with a properly instructed RAG system, particularly using the top-end LLMs.
craigball said:
Thanks, John. I didn’t appreciate the ability to alter the temperature in the prompt, but found a post by Ralph from a year ago elucidating that capability. I really need to get out more. 😉
John Tredennick said:
Hey Craig. I enjoyed your latest.
I also experimented with using LLMs to refine keywords back in the early months of GPT 3.5. These algorithms are amazing for all kinds of purposes including refining keyword search.
We found that an even better way to do this is to submit a lot of relevant documents to the LLM to read, analyze and suggest keywords from the documents. Essentially, let the relevant documents speak to you about how to find more.
We use different forms of AI to find more relevant documents, combining natural language search, algorithmic keyword search, and even a TAR classifier to analyze and rank documents for likely relevance. Our users find this a lot easier than trying to build complex keyword searches, and our research shows it is more effective.
Keyword search is still important but far less so than in the past.
It is an exciting world these days.
JT
craigball said:
Dear John: Thanks for your comment and sharing your experience. Many, perhaps most, lawyers are understandably leery of pointing an LLM at client documents “in the wild.” They don’t yet have confidence that the contents of the documents won’t become part of the fabric of the LLM or in some manner be compromised. We can scoff at those concerns or esteem them, but how they feel, and what they fear, drives the practice for the moment. My post wasn’t a missive about how to develop keywords but how to assess a proposed set against a narrow set of lexical parameters.
Too, there’s a practical limit for mere mortals like me and other small firm and solo practitioners when it comes to the ability to ingest a significant collection of documents into an LLM for training. As I wrote in a different post last week, “no AI can undertake an assessment of the evidence without facing the data.” With practical limits of, say, twenty files, each smaller than 512 MB, it’s a heavy lift to get a collection to face the model in ChatGPT. Not everything should be viewed through the big firm/big budget lens if we seek to serve all who depend upon the court system.
E-discovery has become a plaything of the rich, like litigation in general. If we do not democratize the tools and techniques of modern discovery, then we might as well slam shut the courthouse doors. They’re closed to most as it is.
John Tredennick said:
Those are all good points Craig.
There is a lot of fear about sending information to a commercial LLM like GPT or Claude. As we have pointed out in several articles and webinars, that fear is misplaced if you access these tools through a commercial license (including the $20 a month subscriptions).
First, an LLM cannot learn from your posting or even remember it. Once training has completed, it can do neither. And training is completed before the model is released.
Most commercial licenses contain provisions that prohibit the company from using prompt information to train later models or from even holding the data longer than required to give your answer. The Microsoft Azure license provides one example.
I point out the irony that most law firms use Office 365, which definitely holds confidential data on servers for months or years at a time. We rely on Microsoft’s license provisions to provide comfort that the data is safe and won’t be shared or used by Microsoft. The data sits, however, on servers that potentially could be hacked or shared.
I understand the concern that only large firms reap the benefits of an LLM. I think the opposite is true. A solo can access these tools and take full advantage of them on a pay-as-you-need-it basis. The big firms, in my opinion, are making a larger mistake trying to build private LLMs for their documents. I listed five reasons why this is true in a recent EDRM article.
These are fun times for legal tech. Thanks for all of your insightful writing.
craigball said:
Agreed on misplaced security concerns, but if all fears were rational, it would be a very different world than the one we inhabit. Thank you for always driving us forward and in good directions.
davidkeithtobin said:
wow! wow! – great stuff – just ran a test with a long list of keywords for a medical case – after generating it gave me a couple of useful prompts – I chose Generate optimized Boolean queries.