I recently published an AI prompt that runs against a set of search terms and has the AI propose improvements. Among the pitfalls I’d hoped to expose was the presence of “stop” or “noise” words: terms routinely excluded from search indices. Searches incorporating stop words fail because terms not in the index won’t be found. Ensuring your searches don’t include stop words is an essential step in framing effective queries.
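For readers who want to see the idea mechanically, here is a minimal Python sketch of checking a query against a stop word list. Everything named in it is an assumption for illustration: the function `flag_stop_words` is hypothetical, and `PLACEHOLDER_STOP_WORDS` is a made-up handful of words, not the default list of any eDiscovery product. Substitute a list you have verified against the vendor's own documentation.

```python
import re

# Placeholder list for illustration only. This is NOT the default stop word
# list of any real tool; replace it with one verified against vendor docs.
PLACEHOLDER_STOP_WORDS = {"a", "and", "between", "just", "me", "the", "you"}


def flag_stop_words(query, stop_words):
    """Return the words in `query` that appear in `stop_words`.

    Matching is case-insensitive. Words are reported once each, in the
    order they first appear. This sketch does not distinguish Boolean
    operators (AND, OR) from ordinary words, so review flagged operators
    by hand.
    """
    flagged = []
    for word in re.findall(r"[a-z0-9']+", query.lower()):
        if word in stop_words and word not in flagged:
            flagged.append(word)
    return flagged


if __name__ == "__main__":
    for term in ["just between you and me", "merger agreement"]:
        print(term, "->", flag_stop_words(term, PLACEHOLDER_STOP_WORDS))
```

A query made up entirely of flagged words is the worst case, because nothing in it would be in the index to match. The sketch is only as good as the list you feed it, which is rather the point of this post.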
To help the AI recognize stop words, the prompt included a list of default stop words for well-known eDiscovery tools. That is, I thought I’d done that, but what I included in error (and have now replaced) was ChatGPT’s rendition of stop words for the major tools. I’d made a mental note to check the lists supplied but (DOH!) I plugged them into the prompt and then forgot to do my due diligence.
I was feeling pretty good about the post and getting some nice feedback. Last night, my dear friend and e-discovery Empress Mary Mack commented on the novelty of seeing the various stop word lists broken out in a ready reference. I think echoes of Mary’s kind comment woke me at 4:00am, my subconscious screaming, “HEY DUMMY! Did you verify those stop words? Tell me you didn’t blindly trust an AI?!?”
So, long before sunrise, I was manually checking each stop word list against product websites and—lo and behold—every list was off: some merely incomplete but others not even close. ChatGPT hallucinated the lists, and I failed to do the crucial thing lawyers must do when using AI as a research assistant: Trust but verify.
No harm done, but I share my chagrin here to underscore that you just cannot trust a generative AI large language model to do your research without careful human assessment of the output. I know this and let it slip my mind. Last time for that. I’ve corrected the prompt on my blog and hope I’ve gotten it right. I post this to remind my readers that AI LLMs are great (USE THEM), but they are no substitute for you. Doveryai, no proveryai! (Trust, but verify!)
Josh Headley said:
Definitely thought those stop word lists seemed a bit… brief. After ~17 years in this game and having used a full half of the tools ever written, my go-to test is usually a search for “just between you and me” or a close variant. That phrase is always on the naughty list and, by default, most tools aren’t going to pick it up within the confines of an index-based query.
Nuking the stop word list irks the IT people because the index size can swell and it may affect reindexing speeds and the like. Sure beats the risk of missing a key communication that should have atty eyes on it, though.
Glad you dove into the AI space! Looking forward to your research and plain-English navigation to guide “the rest of us.”
craigball said:
Thanks for your kind words and sharing that great test phrase. Love it!
Doug Austin said:
Craig,
At least you caught the omissions before anyone else who read the post (including me) pointed them out. You weren’t the only one who trusted, but didn’t verify. When I read it, I thought the dtSearch list was smaller than I remembered, but didn’t go check. Kudos to you for correcting and acknowledging the error. Now I need to go back and re-run it with my test set of terms! 🙂