
Does the LLMperor Have New Clothes? Some Thoughts on the Use of LLMs in eDiscovery

By Maura R. Grossman, Gordon V. Cormack, and Jason R. Baron1

I. Introduction: A Parable

As Hans Christian Andersen's parable goes, an emperor was—above all else—obsessed with showing off his new clothes. Approached by a pair of swindlers, who purported to be weavers of the most magnificent and uncommonly fine fabrics, he was convinced by them that their cloth had the magical quality of being invisible to anyone who was either unfit for office or unusually stupid. Thinking that such an outfit would be just the thing for him to tell wise men from fools, the emperor commissioned the weavers—for a handsome sum—to fashion him an outfit forthwith.

Eager to hear news of the progress of his new costume, the emperor dispatched his most-trusted advisors to see how things were going. While neither was able to see anything on the weavers' empty looms, both returned to report on the unparalleled beauty of the cloth, describing—as the swindlers had done for them—the gorgeous colors and intricate patterns of the woven fabrics. Finally, attended by a retinue of his most ardent followers, the emperor went to try on his new clothes. "Magnificent!" "What beautiful colors!" "What a fabulous design!" they cried, pointing to the empty hangers. "It has my highest approval," proclaimed the emperor, assuming he could not see what the others could, but unwilling to admit it.

The emperor decided to wear his new clothes to a grand procession he was about to lead before his fellow countrymen. Remarking on the wonderful fit, off he went to the procession under his splendid canopy. Everyone in the village exclaimed how amazing the emperor's new clothes were; no one dared admit they couldn't see them, for that would prove them either unfit for their position or a fool. No costume ever worn by the emperor was such a complete success, except for one small issue. "He hasn't got anything on!" exclaimed a small child. But the procession continued, proudly as before, with the emperor's noblemen holding high the train that did not exist.

Note: This article is scheduled to appear in 109 Texas Advocate (forthcoming Winter 2024).


II. Large Language Models and eDiscovery

Artificial intelligence ("AI") in the form of Large Language Models ("LLMs") has recently emerged as the shiny new object for use in a variety of legal settings and operations. LLMs have been touted as a new form of legal "Swiss Army knife," capable of removing much of the need for the human element involved in such varied legal tasks as summarizing or translating documents, performing research, constructing arguments, and reviewing and drafting contracts. While LLMs have shown early promise in performing relatively straightforward, ministerial tasks, it is also evident that overreliance on LLMs—and sloppy lawyering—have led to grave mishaps, where faulty LLM use resulted in the misrepresentation of case law, including hallucinations of fake case citations.2

With respect to the use of LLMs in eDiscovery, on almost a daily basis, claims are being made by lawyers and commercial solution providers that LLMs either can or will soon replace not only traditional methods of identifying responsive electronically stored information ("ESI") using keyword searches, but also newer methods using technology-assisted review ("TAR"). As part of these assertions, suggestions have been made that LLMs eliminate the need to follow sophisticated protocols that have come to be associated with search methods and the complex statistical efforts aimed at validating the results of particular TAR efforts.

But are the tasks involved in eDiscovery sufficiently similar to those for which LLMs have been shown to hold promise; that is, is it reasonable to expect that LLMs can be substituted for current search methods, including what have come to be known as "TAR 1.0" and "TAR 2.0" (discussed further below)? What kind of benchmarking and validation protocols are necessary when using LLMs? And how should trial lawyers go about evaluating the effectiveness of LLMs, as well as the defensibility of using them in eDiscovery to satisfy the legal obligations imposed on counsel by Fed. R. Civ. P. 26(g) (and state-law equivalents), to respond to discovery requests (including RFPs) "to the best of the person's knowledge, information, and belief formed after a reasonable inquiry"?


III. What are Large Language Models?

Some definitions are in order. LLMs are computer programs that train on an enormous corpus of online text to be able to recognize human language. They use what is known as "deep learning," a type of machine learning capable of recognizing patterns in terabytes of unstructured data. Recent breakthroughs have employed particular LLMs known as "transformer models." Transformer models learn the statistical properties of data supplied in a prompt-and-response format and use those learned statistical properties to predict likely responses to new human-supplied prompts. LLMs and transformer models represent a subset of text-based applications within the larger domain of "Generative AI," which encompasses not only the production of text, but also images, audio, video, and other forms of mixed media. For example, one might prompt a transformer model to respond with a poem in a particular style, a picture of a kitten wearing a tutu, or the translation of a phrase into another language. Alternatively, one might prompt a transformer model to categorize a poem as a sonnet or a limerick, a picture as a kitten or a ballerina, or a phrase as English or Spanish.

The examples above illustrate the ability of LLMs to harness and reproduce general knowledge—information whose essence can be found in the training corpus. The format of an eDiscovery task seems, at first blush, to resemble that of the examples above: "Is this document responsive to this request for production ("RFP")?" or "Is this document material to this case?" But the correct response relies on case-specific information, such as names, dates, filings, and a nuanced understanding of the legal issues. Are LLMs able to answer these questions as well as or better than existing practices involving human review, search terms, or TAR? How can this claim be evaluated, both in general and in any particular case? Does the LLMperor have new clothes, or are we all imagining them because we are loath to admit they may not be there?


IV. Machine Learning and Technology-Assisted Review

Over the past 15 years or so, the legal profession has become increasingly aware of the availability of various forms of AI used specifically to find responsive documents in complex litigation. The two most-established methods—commonly dubbed "TAR 1.0" and "TAR 2.0"—employ supervised machine learning to distinguish responsive documents from non-responsive documents. Like all supervised machine-learning methods, both rely on human reviewers to code a certain number of exemplar training documents as responsive or not. The TAR 1.0 method, after being given a sufficient number of training examples, either categorizes the remaining uncoded documents as responsive or not, or scores and ranks them according to their likelihood of responsiveness. The TAR 2.0 method, on the other hand, continuously presents likely-responsive documents for review and coding, until substantially all responsive documents have been identified.
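The TAR 2.0 loop described above (train on the documents coded so far, present the most-likely-responsive uncoded document for review, repeat) can be sketched in a few lines of Python. Everything in the sketch is hypothetical: the six-document corpus, the crude word-overlap scorer standing in for a supervised machine-learning model, and the `truth` dictionary standing in for a human reviewer's coding decisions.

```python
# A minimal, dependency-free sketch of a TAR 2.0-style continuous
# active learning loop. The corpus, the bag-of-words scorer, and the
# reviewer oracle are all hypothetical illustrations, not a real tool.

corpus = {
    "doc1": "price fixing agreement among chicken suppliers",
    "doc2": "coordinated pricing strategy memo for suppliers",
    "doc3": "holiday party planning and catering menu",
    "doc4": "office lease renewal and parking passes",
    "doc5": "supplier agreement on fixing wholesale price",
    "doc6": "IT helpdesk ticket about printer drivers",
}
# Stands in for the human reviewer: True = responsive.
truth = {"doc1": True, "doc2": True, "doc3": False,
         "doc4": False, "doc5": True, "doc6": False}

def score(doc_words, coded):
    """Crude relevance score: word overlap with responsive coded
    examples minus overlap with non-responsive coded examples."""
    s = 0
    for d, responsive in coded.items():
        overlap = len(doc_words & set(corpus[d].split()))
        s += overlap if responsive else -overlap
    return s

coded = {"doc1": True, "doc3": False}   # seed: one example of each class
review_order = []
while len(coded) < len(corpus):
    uncoded = [d for d in corpus if d not in coded]
    # Present the most-likely-responsive uncoded document next.
    nxt = max(uncoded, key=lambda d: score(set(corpus[d].split()), coded))
    coded[nxt] = truth[nxt]             # the reviewer codes it
    review_order.append(nxt)
```

In this toy run, both remaining responsive documents surface in the first two iterations, illustrating how the loop front-loads review of likely-responsive material; a real TAR 2.0 tool would stop once substantially all responsive documents had been found, rather than reviewing the whole collection.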

Supervised machine learning may be contrasted with unsupervised machine learning, which requires no labeled training examples. Common applications of unsupervised machine learning are clustering and latent feature analysis. Clustering partitions documents into groups (i.e., clusters) of similar documents, while latent feature analysis uses statistical techniques to reduce the information in a document to a small number of essential features. Early methods of latent feature analysis were known as latent semantic analysis or indexing ("LSA" or "LSI"), probabilistic latent semantic analysis ("PLSA"), and latent Dirichlet allocation ("LDA"). More recently, deep learning has been employed to create word embeddings, phrase embeddings, and document embeddings that map words, phrases, or documents to their latent features. LLMs are largely unsupervised machine-learning methods, as they are derived from vast quantities of unlabeled data. But they can also be fine-tuned, by adding application-specific data, or prompts and responses, to the unlabeled training data. They can be further improved through Retrieval Augmented Generation ("RAG") and Reinforcement Learning from Human Feedback ("RLHF"). The former involves confirming the LLM's response with information stored in an external database, and perhaps providing links to the external sources, while the latter involves humans providing positive or negative feedback in response to an LLM's output.
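The retrieval step that underlies RAG, described above, can be illustrated with a toy sketch: given a prompt, find the most similar passage in an external store and supply it to the model as grounding context. The three-passage knowledge base is hypothetical, and plain word-overlap (Jaccard) similarity stands in for the learned embeddings a real system would use; the call to an actual LLM is omitted.

```python
# A toy sketch of the retrieval step in Retrieval Augmented Generation
# (RAG): find the stored passage most similar to the query and prepend
# it to the prompt as grounding context. The knowledge base is
# hypothetical, and Jaccard word overlap stands in for real embeddings.

knowledge_base = [
    "The scheduling order sets the discovery cutoff for March 2019.",
    "The complaint alleges price fixing among broiler chicken producers.",
    "Depositions of the named custodians begin next month.",
]

def jaccard(a, b):
    """Word-overlap similarity between two strings (0.0 to 1.0)."""
    a, b = set(a.lower().split()), set(b.lower().split())
    return len(a & b) / len(a | b)

def retrieve(query):
    """Return the stored passage most similar to the query."""
    return max(knowledge_base, key=lambda p: jaccard(p, query))

query = "When is the discovery cutoff under the scheduling order?"
context = retrieve(query)
prompt = f"Context: {context}\nQuestion: {query}\nAnswer:"
```

Grounding the response in a retrieved passage, and citing that passage back to the user, is what allows a RAG system to "confirm" its output against external sources rather than relying solely on what the model absorbed during training.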

The use of TAR in eDiscovery was first recognized by the courts in the seminal Da Silva Moore decision, issued in 2012, where the Court held that "computer-assisted review now can be considered judicially-approved for use in appropriate cases."3 As authorities, the Court relied on two studies, one by Roitblat et al.,4 and the other by Grossman and Cormack,5 indicating that certain TAR methods could be at least as effective as exhaustive manual review, at a fraction of the effort and cost. Importantly, the Court recognized that with any "technological solution" in eDiscovery, "counsel must design an appropriate process" with "appropriate quality control testing" to review and produce relevant ESI.6 In line with this prescription, the judiciary has signaled, in at least two ways, that parties should follow standard search protocols: first, through the adoption of local rules and standing orders in connection with the meet-and-confer process under Fed. R. Civ. P. 26(f), where the specific parameters of proposed searches and their validation are expected to be discussed by the parties;7 and second, through the acceptance of sophisticated protocols, proposed by the parties or by special masters—often either stipulated, or adopted, at least in part, over the objections of one or both parties.


V. Are Large Language Models New Clothes for eDiscovery?

LLM tools and protocols have not yet been demonstrated to be as effective as currently recognized methods for legal research,8 nor for TAR. The first step towards such recognition should be empirical studies akin to those cited in Da Silva Moore, demonstrating the effectiveness of LLMs for eDiscovery tasks on a meaningful number of varied and representative RFPs and datasets. The second step should be to demonstrate, through the use of a statistically sound and well-accepted validation protocol, that each particular eDiscovery effort using a recognized LLM tool and protocol is reasonably effective.9

We consider the second step first, as it is not specific to any one eDiscovery method, be it the use of keyword search, manual review, TAR, or LLMs. For example, in 2018, the In re Broiler Chicken Antitrust Litigation case set forth a validation protocol to be followed, regardless of the review method employed—TAR or manual review.10 An essential aspect of the Broiler Chicken protocol was evaluation of the effectiveness of the review method using an independent review of a stratified statistical sample representing all documents in the collection, whether reviewed or excluded by software, and, if reviewed, whether coded responsive or not by a human. This independent review was to be conducted blind, meaning that the reviewers were to be given no indication of whether any document in the sample was previously reviewed, and, if so, whether it had been coded responsive or not. It is well known that reviewers are influenced by the dearth or abundance of responsive documents,11 as well as by their knowledge of how a document was previously treated. These sources of bias are mitigated by the inclusion of a reasonable number of responsive and non-responsive documents in the validation sample, combined with blind review.
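The arithmetic behind such a stratified validation can be sketched as follows. All counts are hypothetical: the collection is divided into strata (produced; reviewed but coded non-responsive; excluded by software), a blind sample from each stratum is independently re-reviewed, and the per-stratum responsiveness rates are scaled up to estimate how many responsive documents the review found versus missed.

```python
# A sketch of the arithmetic behind a Broiler Chicken-style validation:
# a blind re-review of a stratified sample estimates recall for the
# review effort as a whole. Strata sizes and sample counts below are
# hypothetical, and real protocols also report confidence intervals.

strata = {
    # name: (documents in stratum, blind sample size, sampled documents
    #        coded responsive in the independent re-review)
    "produced":             (50_000, 500, 450),
    "reviewed_nonresp":     (150_000, 500, 10),
    "excluded_by_software": (800_000, 500, 5),
}

# Scale each stratum's sample rate up to the full stratum.
est_responsive = {
    name: docs * resp_in_sample / sample
    for name, (docs, sample, resp_in_sample) in strata.items()
}

found = est_responsive["produced"]          # responsive docs produced
total = sum(est_responsive.values())        # responsive docs overall
recall = found / total
```

Under these made-up counts, the production is estimated to contain 45,000 of roughly 56,000 responsive documents in the collection, for a recall of about 80%; a real protocol would accompany the point estimate with a statistically derived margin of error.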

Returning to the first step towards recognition of the use of LLMs for eDiscovery, we must address the question: How does the use of LLMs in eDiscovery measure up against the proven track record and acceptance of TAR methods? As of the date of this writing, the answer is, at best, unknown.

Many of the articles promoting the use of LLMs for eDiscovery mention uses that are peripheral to the core eDiscovery task of identifying substantially all responsive or material documents. Summarization, translation, and case-law search may be useful, but they do not help to identify substantially all responsive or material documents. As noted above, LLMs might be used to answer questions like "Is this document responsive to this RFP?" This could possibly be accomplished in one of two ways: (1) one could compose a prompt of the form "Is this document [fill in the document] responsive to this RFP [fill in the RFP]?"; or (2) one could first fine-tune the LLM on data of the form "[fill in the document] is [fill in responsive or not]." The question would then be posed to and answered by the LLM, for each document in turn. Method (1) relies heavily on the skill of a "prompt engineer" in much the same way that keyword search relies on the skill of the searcher. Slightly different prompt formats can lead to wildly varying responses, and without fine-tuning, different state-of-the-art LLMs may show very different success rates from each other.12 As far as we are aware, the impact of this phenomenon on eDiscovery search has neither been researched nor reported. Method (2) is in effect supervised machine learning. No study has yet shown either approach to be superior to state-of-the-art TAR methods.
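Method (1) can be sketched as follows. The RFP text, the sample document, and the prompt wording are all hypothetical, and `ask_llm` is deliberately left as an unimplemented stub: as the text notes, slightly different prompt formats and different models can yield very different answers, so no particular model call is assumed here.

```python
# A sketch of method (1): compose a per-document responsiveness prompt,
# to be posed to an LLM for each document in turn. The RFP, document,
# and template are hypothetical; ask_llm is a stub for a model call.

RFP = ("Produce all documents concerning communications with competitors "
       "about broiler chicken pricing or production volumes.")

PROMPT_TEMPLATE = (
    "You are assisting with document review.\n"
    "Request for production: {rfp}\n"
    "Document: {doc}\n"
    "Answer with exactly one word, RESPONSIVE or NON-RESPONSIVE."
)

def build_prompt(doc: str) -> str:
    """Fill the document and RFP into the classification prompt."""
    return PROMPT_TEMPLATE.format(rfp=RFP, doc=doc)

def ask_llm(prompt: str) -> str:
    """Stub standing in for a call to an actual LLM of one's choosing."""
    raise NotImplementedError("wire up a model here; outputs will vary")

doc = "Email to a rival plant manager proposing a cut in chick placements."
prompt = build_prompt(doc)
```

In this approach the prompt template plays the role that a keyword query plays in traditional search: small changes to its wording may change which documents the model deems responsive, which is why validation of the results remains essential.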


Non-specific, conclusory pronouncements of stellar LLM performance abound. But empirical research—particularly that which has been subject to rigorous peer review—has yet to demonstrate a well-defined eDiscovery protocol employing LLMs that improves on current TAR practice in eDiscovery. Pai et al.13 develop a set of prompts to classify a subset of documents from the National Institute of Standards and Technology's ("NIST's") Text Retrieval Evaluation Conference ("TREC") 2011 Legal Track. The subset of documents is work product from TREC, consisting almost exclusively of documents that had already been deemed relevant to the evaluation task. For this reason, the results are not comparable to the findings from the TREC Legal Track, or to those of subsequent experiments on the same data. Omrani et al.14 develop prompts to classify an undescribed, non-public, uncharacteristically high-prevalence dataset according to non-public RFPs. Their results show that, according to a second review, the LLM process yielded greater recall but lesser precision than a first-pass human review; however, no comparison to an established TAR process is provided. Wei et al.15 investigate the use of fine-tuning on another non-public dataset, concluding that fine-tuning provides some benefit in a TAR 1.0 process, with overall results comparable to those of logistic regression—a well-established machine-learning method. While extravagant pronouncements about recall and precision results achieved by LLMs have been made by some lawyers and commercial eDiscovery service providers, the more serious research efforts conducted to date have not shown LLMs to improve on the state of the art for TAR 1.0, or for TAR 2.0.16

Until valid testing demonstrates that LLMs are at least as effective as established practice for concrete eDiscovery tasks,17 they should be treated with caution.


VI. Conclusion

The bottom line is that, at the time of this writing, there is no well-defined protocol for how to employ LLMs to find substantially all documents responsive to matter-specific requirements (e.g., RFPs) in a matter-specific collection of documents. The selection of tools, the engineering of prompts, and protocols for fine-tuning are largely unspecified and inscrutable, and no such selection has been demonstrated to improve on established TAR tools and practice. Further research is necessary to develop and document such protocols, and large-scale evaluations comparable to the TREC Legal Track (2006–2011)18 and TREC Total Recall Track (2015–2016)19 efforts are necessary to establish their effectiveness. As a first step, tools should be compared against the benchmarks established by these TREC evaluation workshops, using the same test methodology. If and when LLMs can be shown to improve on these benchmarks, they can and should be tested—and compared to established methods—on new datasets. The use of new datasets is necessary to avoid the problem that legacy benchmarks are likely to have been included in the online corpus used to train the LLMs in the first place.

Once these studies give us all good reason to believe that a specific LLM tool and protocol will be effective, it can and should be employed subject to the same statistically sound validation as any other eDiscovery protocol. Otherwise, we are all at risk of being convinced by the LLMperor—and his fans—that he is wearing the finest threads imaginable, when they are, in fact, imaginary.

© 2024 The Sedona Conference


  1. Maura R. Grossman, J.D., Ph.D., is Research Professor in the David R. Cheriton School of Computer Science at the University of Waterloo and Adjunct Professor at Osgoode Hall Law School at York University, both in Ontario, Canada. She is also Principal at Maura Grossman Law, in Buffalo, N.Y. Gordon V. Cormack, Ph.D., is Professor Emeritus in the David R. Cheriton School of Computer Science at the University of Waterloo. Jason R. Baron, J.D., is Professor of the Practice in the College of Information at the University of Maryland.
  2. See, e.g., Mata v. Avianca, No. 22-CV-1461 (PKC), 2023 WL 4114965 (S.D.N.Y. June 22, 2023); Ex parte Lee, No. 10-22-00281-CR, 2023 WL 4624777 (Tex. App. Jul. 19, 2023); Thomas v. Pangburn, No. 23-CV-0046, 2023 WL 9425765 (S.D. Ga. Oct. 6, 2023); Morgan v. Community Against Violence, No. 23-CV-0353, 2023 WL 6976510 (D.N.M. Oct. 23, 2023); U.S. v. Cohen, No. 18-CR-602 (JMF), 2023 WL 8635521 (S.D.N.Y. Dec. 12, 2023) and U.S. v. Cohen, No. 18-CR-602 (JMF), 2024 WL 1193604 (S.D.N.Y. Mar. 20, 2024); Kruse v. Karlen, No. ED111172, 2024 WL 559497 (Mo. Ct. App. Feb. 23, 2024); Park v. Kim, No. 22-2057, 2024 WL 332478 (2d Cir. 2024).
  3. Da Silva Moore v. Publicis Groupe, 287 F.R.D. 182, 193 (S.D.N.Y. 2012) (Peck, M.J.), aff'd, 2012 WL 1446534 (S.D.N.Y. Apr. 26, 2012).
  4. Herbert L. Roitblat et al., Document Categorization in Legal Electronic Discovery: Computer Classification v. Manual Review, 61 J. Am. Soc'y for Info. Sci. & Tech. 70 (2010).
  5. Maura R. Grossman & Gordon V. Cormack, Technology-Assisted Review in E-Discovery Can Be More Effective and More Efficient Than Exhaustive Manual Review, 17:3 Rich. J.L. & Tech. art. 5 (2011).
  6. Da Silva Moore, supra n.3 at 193.
  7. For examples of local rules and guidelines that discuss search processes, see, e.g., U.S. District Court for N.D. Cal. Checklist for Rule 26(f) Meet and Confer Regarding ESI (Rev. Dec. 1, 2015); U.S. District Court for D. Kan. Guidelines for Cases Involving Electronically Stored Information [ESI]; Seventh Circuit Council on eDiscovery & Digital Information (formerly the Seventh Circuit Pilot Project) Model Standing Order; U.S. District Court for D. Md. Suggested Protocol for Discovery of Electronically Stored Information; Rule 11-c on Discovery of Electronically Stored Information in § 202.70(g) of the Rules of the Commercial Div. of the N.Y.S. Sup. Ct. For an example of a (now retired) individual district judge's standing order that addressed search processes, see Hon. Paul W. Grimm (D. Md.) Discovery Order (Jan. 29, 2013).
  8. For an example of a study showing high rates of hallucination (17-33%) in commercial legal research tools employing LLMs, see Varun Magesh et al., Hallucination-Free? Assessing the Reliability of Leading Legal Research Tools (Stanford Univ. HAI May 30, 2024).
  9. For good examples of statistically well-grounded search validation protocols, see Bruce Hedin & Samuel Curtis, Model Protocol for Electronically Stored Information (ESI) – Guidelines for Practitioners (The Future Society and IEEE Oct. 2023); Maura R. Grossman & Gordon V. Cormack, Vetting and Validation of AI-Enabled Tools for Electronic Discovery, ch. 13 in Jesse Beatson et al. (eds.), Litigating Artificial Intelligence (Emond Pub. May 2020). An Aug. 2020 review copy of the latter chapter is available at https://grossman.uwaterloo.ca/grossman_cormack_vetting.pdf.
  10. Order Regarding Search Methodology for Electronically Stored Information, In re Broiler Chicken Antitrust Litig., No. 1:16-cv-08637, 2018 WL 1146371 (N.D. Ill. Jan. 3, 2018).
  11. Adam Roegiest & Gordon V. Cormack, Impact of Review Set Selection on Human Assessment for Text Classification, SIGIR '16: Proc. of the 39th Int'l ACM SIGIR Conf. on Rsch. & Dev. in IR 861 (2016).
  12. See Aisha Khatun, Uncovering the Reliability and Consistency of Language Models: A Systematic Study (Univ. of Waterloo Thesis Repository Aug. 22, 2024).
  13. Sumit Pai et al., Exploration of Open Large Language Models for eDiscovery, NLLP '23: Proc. of the Natural Legal Language Processing Workshop 166 (2023).
  14. Roshanak Omrani et al., Beyond the Bar: Generative AI as a Transformative Component in Legal Document Review (Relativity ODA LLC 2024).
  15. Fusheng Wei et al., Empirical Study of LLM Fine-Tuning for Text Classification in Legal Document Review, IEEE Big Data '23: Proc. of the IEEE Int'l Conf. on Big Data 2786 (Dec. 2023).
  16. In fact, research results suggest otherwise for at least TAR 2.0. See Nima Sadri & Gordon V. Cormack, Continuous Active Learning Using Pretrained Transformers, arXiv:2208.06955 [cs.IR] (Aug. 15, 2022).
  17. For an example with respect to TAR 2.0, see Gordon V. Cormack et al., Unbiased Validation of Technology-Assisted Review for eDiscovery, SIGIR '24: Proc. of the 47th Int'l ACM SIGIR Conf. on Rsch. & Dev. in IR 2677 (July 11, 2024) (showing CAL® as at least as effective as manual review in an actual large-scale litigation matter).
  18. TREC Legal Track Website (last modified May 10, 2012).
  19. TREC Total Recall Corpora Website (last modified Apr. 23, 2020).