Go-To Guide:

- In a concurrence to the Eleventh Circuit’s Sept. 5, 2024, opinion in United States v. Deleon, Judge Kevin Newsom penned a “sequel of sorts” to his Snell concurrence proposing that AI-powered large language models (LLMs) can assist in ordinary-meaning analysis.
- Judge Newsom ran a “humble little mini experiment,” asking three LLMs the same question about the phrase “physically restrained” 10 times apiece; the 30 responses varied in phrasing but shared a common substantive core.
- He concluded that the variation has a technical explanation, may actually mirror everyday speech patterns, and does not counsel abandoning traditional interpretive tools such as dictionaries and semantic canons.
Judge Kevin Newsom used a concurrence to the Eleventh Circuit’s opinion in United States v. Deleon (Sept. 5, 2024) to pen a “sequel of sorts” to his separate Snell opinion,1 in which he made a “modest proposal” for the use of AI-powered large language models (LLMs) in legal interpretation.
The underlying issue in Deleon was whether a robbery victim was “physically restrained” during the crime; if so, the U.S. Sentencing Guidelines impose an enhanced sentence on the defendant.2
While researching Deleon, Judge Newsom recognized a wrinkle that made the interpretive question in the case “perhaps a little bit harder than usual”: the court was required to interpret not a single word but “a composite, multi-word phrase,” “physically restrained.” That phrase, Judge Newsom noted, is not defined in any reputable dictionary, the default tool for plain-language interpreters.
Because “physically restrained” does not appear in a dictionary, because LLMs aim to reflect how people ordinarily use multi-word phrases, and “because, well, [he] couldn’t help [himself],” Judge Newsom asked a generative LLM what it thought “physically restrained” meant. Specifically, he asked, “What is the ordinary meaning of ‘physically restrained’?” The LLM responded with a definition that essentially reflected what Judge Newsom had already assumed the phrase meant. Not wanting to rely on the LLM merely because its answer matched his prior assumption, the judge then fed the same query to a second generative LLM, which returned a definition that largely mirrored the first.
Next, for reasons Judge Newsom did not “specifically recall” but that “can presumably be chalked up to a ‘better safe than sorry’ instinct,” he asked the second LLM the exact same question again. To his surprise, the answer was “basically the same,” but “ever so slightly different”: the second response was substantively similar to the first, yet its style, structure, length, and detail all varied.
Concerned by this result, Judge Newsom ran, in his words, “a humble little mini experiment,” asking the first, second, and a third LLM the same question 10 times apiece: “What is the ordinary meaning of ‘physically restrained’?” The 30 responses were, reassuringly to the judge, largely consistent. As Judge Newsom described it, “When defining ‘physically restrained,’ the models all tended to emphasize ‘physical force,’ ‘physical means,’ or ‘physical barriers.’” He further noted that while the responses varied somewhat in structure and phrasing, they coalesced around a “common core”: “the LLMs consistently defined the phrase ‘physically restrained’ to require the application of tangible force, either through direct bodily contact or some other device or instrument.” That common core, Judge Newsom noted, matched the meaning one reaches by breaking the phrase apart, gauging each word’s meaning with conventional interpretive tools (such as dictionaries), and then piecing the meanings back together.
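For readers curious what such a repeated-query test looks like in practice, the sketch below outlines one way to run it. It is not Judge Newsom’s actual method or tooling (the concurrence includes no code); the model names and the query_model helper are hypothetical placeholders that would need to be wired to a real LLM API.

```python
# Illustrative sketch only: the model names and query_model helper are
# hypothetical placeholders, not the tools used in the concurrence.
from collections import defaultdict

PROMPT = "What is the ordinary meaning of 'physically restrained'?"
MODELS = ["model_a", "model_b", "model_c"]  # stand-ins for the three LLMs
RUNS_PER_MODEL = 10                         # 10 queries apiece, 30 responses total


def query_model(model: str, prompt: str) -> str:
    """Hypothetical helper: send the prompt to the named model and return its reply."""
    raise NotImplementedError("Connect this to an LLM API of your choosing.")


def run_experiment() -> dict[str, list[str]]:
    """Ask every model the same question repeatedly and collect the responses."""
    responses: dict[str, list[str]] = defaultdict(list)
    for model in MODELS:
        for _ in range(RUNS_PER_MODEL):
            responses[model].append(query_model(model, PROMPT))
    return responses
```

Comparing the collected responses side by side is how one would look for the pattern Judge Newsom describes: minor stylistic variation around a stable substantive core.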
Performing more research and considering the issue further, Judge Newsom drew two conclusions about the small variations among the LLMs’ answers. First, there is a technical explanation for those variations. LLMs are not designed to provide the same answer each time they are asked the same question. Rather, as Judge Newsom discovered, “an LLM’s response reflects its best statistical, probabilistic prediction about the answer to the user’s query.” The LLMs that Judge Newsom used all have settings that allow users to introduce variation into their responses. The further a model’s “creativity dial” is turned up, the more varied its responses; the further it is turned down, the more repetitive they become. Many LLMs ship with their creativity settings “dialed up” by default, introducing more variation into their responses.
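The “creativity dial” Judge Newsom describes is commonly exposed as a “temperature” parameter. The toy example below, a simplification rather than any vendor’s actual implementation, shows the underlying idea: the model scores candidate next words, and the temperature controls how sharply sampling favors the top-scoring choice.

```python
# Toy illustration of temperature sampling; the candidate words and scores are invented.
import math
import random


def sample_next_word(scores: dict[str, float], temperature: float) -> str:
    """Pick one word from model scores; low temperature is near-deterministic."""
    scaled = {word: s / max(temperature, 1e-6) for word, s in scores.items()}
    total = sum(math.exp(s) for s in scaled.values())
    probabilities = {word: math.exp(s) / total for word, s in scaled.items()}
    draw, cumulative = random.random(), 0.0
    for word, p in probabilities.items():
        cumulative += p
        if draw <= cumulative:
            return word
    return word  # guard against floating-point rounding


# Invented scores for words that might follow "physically restrained means ..."
scores = {"force": 2.0, "barrier": 1.2, "device": 1.0, "contact": 0.8}
print([sample_next_word(scores, temperature=0.1) for _ in range(5)])  # almost always "force"
print([sample_next_word(scores, temperature=1.5) for _ in range(5)])  # noticeably more varied
```

At a low temperature the sketch returns nearly identical outputs on every run; at a higher temperature the outputs vary around the same high-probability core, which mirrors, in miniature, the behavior Judge Newsom observed across full responses.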
Judge Newsom’s second conclusion was that the LLMs’ similar-but-not-identical responses “underscore the models’ utility in the ordinary-meaning analysis” because they mimic everyday speech patterns. As Judge Newsom explained, the aim of plain-language interpreters is to suss out the ordinary meaning of words and phrases. If it were possible to survey every living speaker of American English on the meaning of “physically restrained,” Judge Newsom hypothesized, the interviewer would not get the same verbatim answer over and over. Rather, the responses would likely share a “common core” but differ around the margins, just as the LLMs’ responses did.
Judge Newsom concluded his concurrence with four takeaways. First, he continues to believe that LLMs, while imperfect, can assist in discovering the ordinary meaning of words and phrases. Second, LLMs can decipher and explain the meaning of composite, multi-word phrases like “physically restrained” in a way that standard tools like dictionaries cannot. Third, there are valid technical explanations for LLMs’ sometimes-varying responses to identical queries, and that fluctuation mirrors everyday speech patterns, potentially making the models more accurate discerners of ordinary meaning. Fourth, while Judge Newsom remains convinced that LLMs have a role to play in discerning ordinary meaning, he does not advocate abandoning traditional interpretive tools such as dictionaries and semantic canons.
1 Snell v. United Specialty Ins. Co., 102 F.4th 1208 (11th Cir. 2024).
2 The Eleventh Circuit ultimately found that it was bound by precedent to affirm the application of the physical-restraint enhancement, although it disagreed with the (binding) precedent.