
Does the LLMperor Have New Clothes? Some Thoughts on the Use of LLMs in eDiscovery

By Maura R. Grossman, Gordon V. Cormack, and Jason R. Baron1

I. Introduction: A Parable

As Hans Christian Andersen's parable goes, an emperor was—above all else—obsessed with showing off his new clothes. Approached by a pair of swindlers, who purported to be weavers of the most magnificent and uncommonly fine fabrics, he was convinced by them that their cloth had the magical quality of being invisible to anyone who was either unfit for office or unusually stupid. Thinking that such an outfit would be just the thing for him to tell wise men from fools, the emperor commissioned the weavers—for a handsome sum—to fashion him an outfit forthwith.

Eager to hear news of the progress of his new costume, the emperor dispatched his most-trusted advisors to see how things were going. While neither was able to see anything on the weavers' empty looms, both returned to report on the unparalleled beauty of the cloth, describing—as the swindlers had done for them—the gorgeous colors and intricate patterns of the woven fabrics. Finally, attended by a retinue of his most ardent followers, the emperor went to try on his new clothes. "Magnificent!" "What beautiful colors!" "What a fabulous design!" they cried, pointing to the empty hangers. "It has my highest approval," proclaimed the emperor, assuming he could not see what the others could, but unwilling to admit it.

The emperor decided to wear his new clothes to a grand procession he was about to lead before his fellow countrymen. Remarking on the wonderful fit, off he went to the procession under his splendid canopy. Everyone in the village exclaimed how amazing the emperor's new clothes were; no one dared admit they couldn't see them, for that would prove them either unfit for their position or a fool. No costume ever worn by the emperor was such a complete success, except for one small issue. "He hasn't got anything on!" exclaimed a small child. But the procession continued, proudly as before, with the emperor's noblemen holding high the train that did not exist.

Note: This article is scheduled to appear in 109 Texas Advocate (forthcoming Winter 2024).


II. Large Language Models and eDiscovery

Artificial intelligence ("AI") in the form of Large Language Models ("LLMs") has recently emerged as the shiny new object for use in a variety of legal settings and operations. LLMs have been touted as a new form of legal "Swiss Army knife," capable of removing much of the need for the human element involved in such varied legal tasks as summarizing or translating documents, performing research, constructing arguments, and reviewing and drafting contracts. While LLMs have shown early promise in performing relatively straightforward, ministerial tasks, it is also evident that overreliance on LLMs—and sloppy lawyering—have led to grave mishaps, where faulty LLM use resulted in the misrepresentation of case law, including hallucinations of fake case citations.2

With respect to the use of LLMs in eDiscovery, on almost a daily basis, claims are being made by lawyers and commercial solution providers that LLMs either can or will soon replace not only traditional methods of identifying responsive electronically stored information ("ESI") using keyword searches, but also newer methods using technology-assisted review ("TAR"). As part of these assertions, suggestions have been made that LLMs eliminate the need to follow sophisticated protocols that have come to be associated with search methods and the complex statistical efforts aimed at validating the results of particular TAR efforts.

But are the tasks involved in eDiscovery sufficiently similar to those for which LLMs have been shown to hold promise; that is, is it reasonable to expect that LLMs can be substituted for current search methods, including what have come to be known as "TAR 1.0" and "TAR 2.0" (discussed further below)? What kind of benchmarking and validation protocols are necessary when using LLMs? And how should trial lawyers go about evaluating the effectiveness of LLMs, as well as the defensibility of using them in eDiscovery to satisfy the legal obligations imposed on counsel by Fed. R. Civ. P. 26(g) (and state-law equivalents), to respond to discovery requests (including RFPs) "to the best of the person's knowledge, information, and belief formed after a reasonable inquiry"?


III. What are Large Language Models?

Some definitions are in order. LLMs are computer programs that train on an enormous corpus of online text to be able to recognize human language. They use what is known as "deep learning," a type of machine learning capable of recognizing patterns in terabytes of unstructured data. Recent breakthroughs have employed particular LLMs known as "transformer models." Transformer models learn the statistical properties of data supplied in a prompt-and-response format and use those learned statistical properties to predict likely responses to new human-supplied prompts. LLMs and transformer models represent a subset of text-based applications within the larger domain of "Generative AI," which encompasses not only the production of text, but also images, audio, video, and other forms of mixed media. For example, one might prompt a transformer model to respond with a poem in a particular style, a picture of a kitten wearing a tutu, or the translation of a phrase into another language. Alternatively, one might prompt a transformer model to categorize a poem as a sonnet or a limerick, a picture as a kitten or a ballerina, or a phrase as English or Spanish.

The examples above illustrate the ability of LLMs to harness and reproduce general knowledge—information whose essence can be found in the training corpus. The format of an eDiscovery task seems, at first blush, to resemble that of the examples above: "Is this document responsive to this request for production ("RFP")?" or "Is this document material to this case?" But the correct response relies on case-specific information, such as names, dates, filings, and a nuanced understanding of the legal issues. Are LLMs able to answer these questions as well as or better than existing practices involving human review, search terms, or TAR? How can this claim be evaluated, both in general and in any particular case? Does the LLMperor have new clothes, or are we all imagining them because we are loath to admit they may not be there?


IV. Machine Learning and Technology-Assisted Review

Over the past 15 years or so, the legal profession has become increasingly aware of the availability of various forms of AI used specifically to find responsive documents in complex litigation. The two most-established methods—commonly dubbed "TAR 1.0" and "TAR 2.0"—employ supervised machine learning to distinguish responsive documents from non-responsive documents. Like all supervised machine-learning methods, both rely on human reviewers to code a certain number of exemplar training documents as responsive or not. The TAR 1.0 method, after being given a sufficient number of training examples, either categorizes the remaining uncoded documents as responsive or not, or scores and ranks them according to their likelihood of responsiveness. The TAR 2.0 method, on the other hand, continuously presents likely-responsive documents for review and coding, until substantially all responsive documents have been identified.
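The TAR 2.0 loop described above (train on the documents coded so far, present the most-likely-responsive uncoded document for review, repeat) can be sketched in a few lines of Python. Everything in the sketch is hypothetical: the six-document corpus, the crude word-overlap scorer standing in for a supervised machine-learning model, and the `truth` dictionary standing in for a human reviewer's coding decisions.

```python
# A minimal, dependency-free sketch of a TAR 2.0-style continuous
# active learning loop. The corpus, the bag-of-words scorer, and the
# reviewer oracle are all hypothetical illustrations, not a real tool.

corpus = {
    "doc1": "price fixing agreement among chicken suppliers",
    "doc2": "coordinated pricing strategy memo for suppliers",
    "doc3": "holiday party planning and catering menu",
    "doc4": "office lease renewal and parking passes",
    "doc5": "supplier agreement on fixing wholesale price",
    "doc6": "IT helpdesk ticket about printer drivers",
}
# Stands in for the human reviewer: True = responsive.
truth = {"doc1": True, "doc2": True, "doc3": False,
         "doc4": False, "doc5": True, "doc6": False}

def score(doc_words, coded):
    """Crude relevance score: word overlap with responsive coded
    examples minus overlap with non-responsive coded examples."""
    s = 0
    for d, responsive in coded.items():
        overlap = len(doc_words & set(corpus[d].split()))
        s += overlap if responsive else -overlap
    return s

coded = {"doc1": True, "doc3": False}   # seed: one example of each class
review_order = []
while len(coded) < len(corpus):
    uncoded = [d for d in corpus if d not in coded]
    # Present the most-likely-responsive uncoded document next.
    nxt = max(uncoded, key=lambda d: score(set(corpus[d].split()), coded))
    coded[nxt] = truth[nxt]             # the reviewer codes it
    review_order.append(nxt)
```

In this toy run, both remaining responsive documents surface in the first two iterations, illustrating how the loop front-loads review of likely-responsive material; a real TAR 2.0 tool would stop once substantially all responsive documents had been found, rather than reviewing the whole collection.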

Supervised machine learning may be contrasted with unsupervised machine learning, which requires no labeled training examples. Common applications of unsupervised machine learning are clustering and latent feature analysis. Clustering partitions documents into groups (i.e., clusters) of similar documents, while latent feature analysis uses statistical techniques to reduce the information in a document to a small number of essential features. Early methods of latent feature analysis were known as latent semantic analysis or indexing ("LSA" or "LSI"), probabilistic latent semantic analysis ("PLSA"), and latent Dirichlet allocation ("LDA"). More recently, deep learning has been employed to create word embeddings, phrase embeddings, and document embeddings that map words, phrases, or documents to their latent features. LLMs are largely unsupervised machine-learning methods, as they are derived from vast quantities of unlabeled data. But they can also be fine-tuned, by adding application-specific data, or prompts and responses, to the unlabeled training data. They can be further improved through Retrieval Augmented Generation ("RAG") and Reinforcement Learning from Human Feedback ("RLHF"). The former involves confirming the LLM's response with information stored in an external database, and perhaps providing links to the external sources, while the latter involves humans providing positive or negative feedback in response to an LLM's output.
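The retrieval step that underlies RAG, described above, can be illustrated with a toy sketch: given a prompt, find the most similar passage in an external store and supply it to the model as grounding context. The three-passage knowledge base is hypothetical, and plain word-overlap (Jaccard) similarity stands in for the learned embeddings a real system would use; the call to an actual LLM is omitted.

```python
# A toy sketch of the retrieval step in Retrieval Augmented Generation
# (RAG): find the stored passage most similar to the query and prepend
# it to the prompt as grounding context. The knowledge base is
# hypothetical, and Jaccard word overlap stands in for real embeddings.

knowledge_base = [
    "The scheduling order sets the discovery cutoff for March 2019.",
    "The complaint alleges price fixing among broiler chicken producers.",
    "Depositions of the named custodians begin next month.",
]

def jaccard(a, b):
    """Word-overlap similarity between two strings (0.0 to 1.0)."""
    a, b = set(a.lower().split()), set(b.lower().split())
    return len(a & b) / len(a | b)

def retrieve(query):
    """Return the stored passage most similar to the query."""
    return max(knowledge_base, key=lambda p: jaccard(p, query))

query = "When is the discovery cutoff under the scheduling order?"
context = retrieve(query)
prompt = f"Context: {context}\nQuestion: {query}\nAnswer:"
```

Grounding the response in a retrieved passage, and citing that passage back to the user, is what allows a RAG system to "confirm" its output against external sources rather than relying solely on what the model absorbed during training.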

The use of TAR in eDiscovery was first recognized by the courts in the seminal Da Silva Moore decision, issued in 2012, where the Court held that "computer-assisted review now can be considered judicially-approved for use in appropriate cases."3 As authorities, the Court relied on two studies, one by Roitblat et al.,4 and the other by Grossman and Cormack,5 indicating that certain TAR methods could be at least as effective as exhaustive manual review, at a fraction of the effort and cost. Importantly, the Court recognized that with any "technological solution" in eDiscovery, "counsel must design an appropriate process" with "appropriate quality control testing" to review and produce relevant ESI.6 In line with this prescription, the judiciary has signaled, in at least two ways, that parties should follow standard search protocols: first, through the adoption of local rules and standing orders in connection with the meet-and-confer process under Fed. R. Civ. P. 26(f), where the specific parameters of proposed searches and their validation are expected to be discussed by the parties;7 and second, through the acceptance of sophisticated protocols, proposed by the parties or by special masters—often either stipulated, or adopted, at least in part, over the objections of one or both parties.


V. Are Large Language Models New Clothes for eDiscovery?

LLM tools and protocols have not yet been demonstrated to be as effective as currently recognized methods for legal research,8 nor for TAR. The first step towards such recognition should be empirical studies akin to those cited in Da Silva Moore, demonstrating the effectiveness of LLMs for eDiscovery tasks on a meaningful number of varied and representative RFPs and datasets. The second step should be to demonstrate, through the use of a statistically sound and well-accepted validation protocol, that each particular eDiscovery effort using a recognized LLM tool and protocol is reasonably effective.9

We consider the second step first, as it is not specific to any one eDiscovery method, be it the use of keyword search, manual review, TAR, or LLMs. For example, in 2018, the In re Broiler Chicken Antitrust Litigation case set forth a validation protocol to be followed, regardless of the review method employed—TAR or manual review.10 An essential aspect of the Broiler Chicken protocol was evaluation of the effectiveness of the review method using an independent review of a stratified statistical sample representing all documents in the collection, whether reviewed or excluded by software, and, if reviewed, whether coded responsive or not by a human. This independent review was to be conducted blind, meaning that the reviewers were to be given no indication of whether any document in the sample was previously reviewed, and, if so, whether it had been coded responsive or not. It is well known that reviewers are influenced by the dearth or abundance of responsive documents,11 as well as by their knowledge of how a document was previously treated. These sources of bias are mitigated by the inclusion of a reasonable number of responsive and non-responsive documents in the validation sample, combined with blind review.
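The arithmetic behind such a stratified validation can be sketched as follows. All counts are hypothetical: the collection is divided into strata (produced; reviewed but coded non-responsive; excluded by software), a blind sample from each stratum is independently re-reviewed, and the per-stratum responsiveness rates are scaled up to estimate how many responsive documents the review found versus missed.

```python
# A sketch of the arithmetic behind a Broiler Chicken-style validation:
# a blind re-review of a stratified sample estimates recall for the
# review effort as a whole. Strata sizes and sample counts below are
# hypothetical, and real protocols also report confidence intervals.

strata = {
    # name: (documents in stratum, blind sample size, sampled documents
    #        coded responsive in the independent re-review)
    "produced":             (50_000, 500, 450),
    "reviewed_nonresp":     (150_000, 500, 10),
    "excluded_by_software": (800_000, 500, 5),
}

# Scale each stratum's sample rate up to the full stratum.
est_responsive = {
    name: docs * resp_in_sample / sample
    for name, (docs, sample, resp_in_sample) in strata.items()
}

found = est_responsive["produced"]          # responsive docs produced
total = sum(est_responsive.values())        # responsive docs overall
recall = found / total
```

Under these made-up counts, the production is estimated to contain 45,000 of roughly 56,000 responsive documents in the collection, for a recall of about 80%; a real protocol would accompany the point estimate with a statistically derived margin of error.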

Returning to the first step towards recognition of the use of LLMs for eDiscovery, we must address the question: How does the use of LLMs in eDiscovery measure up against the proven track record and acceptance of TAR methods? As of the date of this writing, the answer is, at best, unknown.

Many of the articles promoting the use of LLMs for eDiscovery mention uses that are peripheral to the core eDiscovery task of identifying substantially all responsive or material documents. Summarization, translation, and case-law search may be useful, but they do not help to identify substantially all responsive or material documents. As noted above, LLMs might be used to answer questions like "Is this document responsive to this RFP?" This could possibly be accomplished in one of two ways: (1) one could compose a prompt of the form "Is this document [fill in the document] responsive to this RFP [fill in the RFP]?"; or (2) one could first fine-tune the LLM on data of the form "[fill in the document] is [fill in responsive or not]." The question would then be posed to and answered by the LLM, for each document in turn. Method (1) relies heavily on the skill of a "prompt engineer" in much the same way that keyword search relies on the skill of the searcher. Slightly different prompt formats can lead to wildly varying responses, and without fine-tuning, different state-of-the-art LLMs may show very different success rates from each other.12 As far as we are aware, the impact of this phenomenon on eDiscovery search has neither been researched nor reported. Method (2) is in effect supervised machine learning. No study has yet shown either approach to be superior to state-of-the-art TAR methods.
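Method (1) can be sketched as follows. The RFP text, the sample document, and the prompt wording are all hypothetical, and `ask_llm` is deliberately left as an unimplemented stub: as the text notes, slightly different prompt formats and different models can yield very different answers, so no particular model call is assumed here.

```python
# A sketch of method (1): compose a per-document responsiveness prompt,
# to be posed to an LLM for each document in turn. The RFP, document,
# and template are hypothetical; ask_llm is a stub for a model call.

RFP = ("Produce all documents concerning communications with competitors "
       "about broiler chicken pricing or production volumes.")

PROMPT_TEMPLATE = (
    "You are assisting with document review.\n"
    "Request for production: {rfp}\n"
    "Document: {doc}\n"
    "Answer with exactly one word, RESPONSIVE or NON-RESPONSIVE."
)

def build_prompt(doc: str) -> str:
    """Fill the document and RFP into the classification prompt."""
    return PROMPT_TEMPLATE.format(rfp=RFP, doc=doc)

def ask_llm(prompt: str) -> str:
    """Stub standing in for a call to an actual LLM of one's choosing."""
    raise NotImplementedError("wire up a model here; outputs will vary")

doc = "Email to a rival plant manager proposing a cut in chick placements."
prompt = build_prompt(doc)
```

In this approach the prompt template plays the role that a keyword query plays in traditional search: small changes to its wording may change which documents the model deems responsive, which is why validation of the results remains essential.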


Non-specific, conclusory pronouncements of stellar LLM performance abound. But empirical research—particularly that which has been subject to rigorous peer review—has yet to demonstrate a well-defined eDiscovery protocol employing LLMs that improves on current TAR practice in eDiscovery. Pai et al.13 develop a set of prompts to classify a subset of documents from the National Institute of Standards and Technology's ("NIST's") Text Retrieval Evaluation Conference ("TREC") 2011 Legal Track. The subset of documents is work product from TREC, consisting almost exclusively of documents that had already been deemed relevant to the evaluation task. For this reason, the results are not comparable to the findings from the TREC Legal Track, or to those of subsequent experiments on the same data. Omrani et al.14 develop prompts to classify an undescribed, non-public, uncharacteristically high-prevalence dataset according to non-public RFPs. Their results show that, according to a second review, the LLM process yielded greater recall but lesser precision than a first-pass human review; however, no comparison to an established TAR process is provided. Wei et al.15 investigate the use of fine-tuning on another non-public dataset, concluding that fine-tuning provides some benefit in a TAR 1.0 process, with overall results comparable to those of logistic regression—a well-established machine-learning method. While extravagant pronouncements about recall and precision results achieved by LLMs have been made by some lawyers and commercial eDiscovery service providers, the more serious research efforts conducted to date have not shown LLMs to improve on the state of the art for TAR 1.0, or for TAR 2.0.16

Until valid testing demonstrates that LLMs are at least as effective as established practice for concrete eDiscovery tasks,17 they should be treated with caution.


VI. Conclusion

The bottom line is that, at the time of this writing, there is no well-defined protocol for how to employ LLMs to find substantially all documents responsive to matter-specific requirements (e.g., RFPs) in a matter-specific collection of documents. The selection of tools, the engineering of prompts, and protocols for fine-tuning are largely unspecified and inscrutable, and no such selection has been demonstrated to improve on established TAR tools and practice. Further research is necessary to develop and document such protocols, and large-scale evaluations comparable to the TREC Legal Track (2006–2011)18 and TREC Total Recall Track (2015–2016)19 efforts are necessary to establish their effectiveness. As a first step, tools should be compared against the benchmarks established by these TREC evaluation workshops, using the same test methodology. If and when LLMs can be shown to improve on these benchmarks, they can and should be tested—and compared to established methods—on new datasets. The use of new datasets is necessary to avoid the problem that legacy benchmarks are likely to have been included in the online corpus used to train the LLMs in the first place.

Once these studies give us all good reason to believe that a specific LLM tool and protocol will be effective, it can and should be employed subject to the same statistically sound validation as any other eDiscovery protocol. Otherwise, we are all at risk of being convinced by the LLMperor—and his fans—that he is wearing the finest threads imaginable, when they are, in fact, imaginary.

© 2024 The Sedona Conference


  1. Maura R. Grossman, J.D., Ph.D., is Research Professor in the David R. Cheriton School of Computer Science at the University of Waterloo and Adjunct Professor at Osgoode Hall Law School at York University, both in Ontario, Canada. She is also Principal at Maura Grossman Law, in Buffalo, N.Y. Gordon V. Cormack, Ph.D., is Professor Emeritus in the David R. Cheriton School of Computer Science at the University of Waterloo. Jason R. Baron, J.D., is Professor of the Practice in the College of Information at the University of Maryland.
  2. See, e.g., Mata v. Avianca, No. 22-CV-1461 (PKC), 2023 WL 4114965 (S.D.N.Y. June 22, 2023); Ex parte Lee, No. 10-22-00281-CR, 2023 WL 4624777 (Tex. App. Jul. 19, 2023); Thomas v. Pangburn, No. 23-CV-0046, 2023 WL 9425765 (S.D. Ga. Oct. 6, 2023); Morgan v. Community Against Violence, No. 23-CV-0353, 2023 WL 6976510 (D.N.M. Oct. 23, 2023); U.S. v. Cohen, No. 18-CR-602 (JMF), 2023 WL 8635521 (S.D.N.Y. Dec. 12, 2023) and U.S. v. Cohen, No. 18-CR-602 (JMF), 2024 WL 1193604 (S.D.N.Y. Mar. 20, 2024); Kruse v. Karlen, No. ED111172, 2024 WL 559497 (Mo. Ct. App. Feb. 23, 2024); Park v. Kim, No. 22-2057, 2024 WL 332478 (2d Cir. 2024).
  3. Da Silva Moore v. Publicis Groupe, 287 F.R.D. 182, 193 (S.D.N.Y. 2012) (Peck, M.J.), aff'd, 2012 WL 1446534 (S.D.N.Y. Apr. 26, 2012).
  4. Herbert L. Roitblat et al., Document Categorization in Legal Electronic Discovery: Computer Classification v. Manual Review, 61 J. Am. Soc'y for Info. Sci. & Tech. 70 (2010).
  5. Maura R. Grossman & Gordon V. Cormack, Technology-Assisted Review in E-Discovery Can Be More Effective and More Efficient Than Exhaustive Manual Review, 17:3 Rich. J.L. & Tech. art. 5 (2011).
  6. Da Silva Moore, supra n.3 at 193.
  7. For examples of local rules and guidelines that discuss search processes, see, e.g., U.S. District Court for N.D. Cal. Checklist for Rule 26(f) Meet and Confer Regarding ESI (Rev. Dec. 1, 2015); U.S. District Court for D. Kan. Guidelines for Cases Involving Electronically Stored Information [ESI]; Seventh Circuit Council on eDiscovery & Digital Information (formerly the Seventh Circuit Pilot Project) Model Standing Order; U.S. District Court for D. Md. Suggested Protocol for Discovery of Electronically Stored Information; Rule 11-c on Discovery of Electronically Stored Information in § 202.70(g) of the Rules of the Commercial Div. of the N.Y.S. Sup. Ct. For an example of a (now retired) individual district judge's standing order that addressed search processes, see Hon. Paul W. Grimm (D. Md.) Discovery Order (Jan. 29, 2013).
  8. For an example of a study showing high rates of hallucination (17-33%) in commercial legal research tools employing LLMs, see Varun Magesh et al., Hallucination-Free? Assessing the Reliability of Leading Legal Research Tools (Stanford Univ. HAI May 30, 2024).
  9. For good examples of statistically well-grounded search validation protocols, see Bruce Hedin & Samuel Curtis, Model Protocol for Electronically Stored Information (ESI) – Guidelines for Practitioners (The Future Society and IEEE Oct. 2023); Maura R. Grossman & Gordon V. Cormack, Vetting and Validation of AI-Enabled Tools for Electronic Discovery, ch. 13 in Jesse Beatson et al. (eds.), Litigating Artificial Intelligence (Emond Pub. May 2020). An Aug. 2020 review copy of the latter chapter is available at https://grossman.uwaterloo.ca/grossman_cormack_vetting.pdf.
  10. Order Regarding Search Methodology for Electronically Stored Information, In re Broiler Chicken Antitrust Litig., No. 1:16-cv-08637, 2018 WL 1146371 (N.D. Ill. Jan. 3, 2018).
  11. Adam Roegiest & Gordon V. Cormack, Impact of Review Set Selection on Human Assessment for Text Classification, SIGIR '16: Proc. of the 39th Int'l ACM SIGIR Conf. on Rsch. & Dev. in IR 861 (2016).
  12. See Aisha Khatun, Uncovering the Reliability and Consistency of Language Models: A Systematic Study (Univ. of Waterloo Thesis Repository Aug. 22, 2024).
  13. Sumit Pai et al., Exploration of Open Large Language Models for eDiscovery, NLLP '23: Proc. of the Natural Legal Language Processing Workshop 166 (2023).
  14. Roshanak Omrani et al., Beyond the Bar: Generative AI as a Transformative Component in Legal Document Review (Relativity ODA LLC 2024).
  15. Fusheng Wei et al., Empirical Study of LLM Fine-Tuning for Text Classification in Legal Document Review, IEEE Big Data '23: Proc. of the IEEE Int'l Conf. on Big Data 2786 (Dec. 2023).
  16. In fact, research results suggest otherwise for at least TAR 2.0. See Nima Sadri & Gordon V. Cormack, Continuous Active Learning Using Pretrained Transformers, arXiv:2208.06955 [cs.IR] (Aug. 15, 2022).
  17. For an example with respect to TAR 2.0, see Gordon V. Cormack et al., Unbiased Validation of Technology-Assisted Review for eDiscovery, SIGIR '24: Proc. of the 47th Int'l ACM SIGIR Conf. on Rsch. & Dev. in IR 2677 (July 11, 2024) (showing CAL® as at least as effective as manual review in an actual large-scale litigation matter).
  18. TREC Legal Track Website (last modified May 10, 2012).
  19. TREC Total Recall Corpora Website (last modified Apr. 23, 2020).