Discovering significant topics from legal decisions with selective inference

We propose and evaluate an automated pipeline for discovering significant topics from legal decision texts by passing features synthesized with topic models through penalized regressions and post-selection significance tests. The method identifies case topics significantly correlated with outcomes, topic-word distributions which can be manually interpreted to gain insights about significant topics, and case-topic weights which can be used to identify representative cases for each topic. We demonstrate the method on a new dataset of domain name disputes and a canonical dataset of European Court of Human Rights violation cases. Topic models based on latent semantic analysis as well as language model embeddings are evaluated. We show that topics derived by the pipeline are consistent with legal doctrines in both areas and can be useful in other related legal analysis tasks. This article is part of the theme issue ‘A complexity science approach to law and governance’.


INTRODUCTION
Most legal information is stored exclusively in natural language texts. The complexity of language means extracting such information is typically a labour-intensive exercise primarily performed by specially-trained persons ("lawyers"). This poses significant barriers to computational representations and analysis of law [50,59]. Researchers have increasingly sought to develop automated processes for converting unstructured legal texts to structured variables [33,2]. Depending on the texts involved and variables required, these have included term frequency counts [16], regular expressions [71], topic models [1,25,10,23,66], word embeddings [20,39], and language models [12,62,48]. Given their centrality in legal analysis, court decisions in particular have attracted significant scholarly attention. Many studies have attempted to identify, categorise, or forecast case outcomes using decision texts, often relying on opaque algorithms such as support vector machines and neural networks [1,49,11,55,54]. Other researchers have prioritised more explainable methods over end-to-end prediction. Typically, algorithms are first developed to automatically extract case attributes and other legally-relevant variables before using these variables to model outcomes [3,25,8,30,31]. The goal is not necessarily predictive accuracy alone, but also to identify and explain what motivates legal decisions.
In this work, we propose and evaluate a new automated pipeline for discovering significant topics from decision texts, a task we define more formally in part 2.1. The pipeline takes decision texts and case outcomes as inputs and returns estimates for statistically significant decision topics as well as the cases, words, and phrases most strongly associated with those topics. This allows researchers to quickly identify potential variables, patterns, and cases of interest in unfamiliar areas of law. The pipeline comprises four steps: pre-processing and masking (part 2.2.1), topic modelling (part 2.2.2), selective regression and inference (part 2.2.3), and topic evaluation (part 2.2.4). We demonstrate and evaluate the pipeline on a new dataset of cases resolved under the Uniform Domain Name Dispute Resolution Policy ("UDRP"). To explore how the pipeline generalises, we further test it on a canonical dataset of European Convention on Human Rights ("ECHR") cases. For both datasets, we experiment with Latent Semantic Analysis ("LSA") [46] as well as two BERTopic ("BTO") models [34] primed with general and legally-finetuned embeddings respectively.
We show that topics discovered by the pipeline contain interpretable and legally-sound information on topics correlated with legal outcomes (part 3). Along the way, we identify several interesting patterns and case archetypes in UDRP and ECHR case law. Thus, our key contributions are as follows. First, we extend prior work analyzing legal outcomes from a topic modelling perspective [1,25,66]. To be sure, the notion that topics synthesized from case decisions could carry meaningful information about legal outcomes is not new. Neither do we propose entirely new algorithms for, say, legal topic modelling. Our incremental contribution lies in integrating several existing techniques (e.g. masking [74], topic modelling [1,66], and selective inference [76,75]) into a pipeline that can be adapted to study other legal areas. Second, we demonstrate the utility of selective inference techniques in the legal domain. This has not, to our best knowledge, been studied in prior work. Finally, we add to legal knowledge on UDRP and ECHR cases.

Discovering Significant Topics
This work relates to existing literature on the automated extraction of legal factors from legal cases [25,30,31]. Legal factors are generally seen as "stereotypical patterns of fact" [25] or more abstract "intermediate concepts" [9] which influence case outcomes. However, as used in that literature, the concept of legal factors has a specific meaning which does not overlap perfectly with our present focus. We thus use the term "predictors" here to refer broadly to variables which predict case outcomes. Drawing inspiration from [14], suppose a legal outcome $Y$ is given by $Y = f(X, W)$, where $X$ is a matrix of legal predictors, $W$ a matrix of non-legal predictors (e.g. political ideologies [65]), and $f$ some adjudication function that maps cases to outcomes. To identify individual predictors, we might collect data on hypothesised variables $\hat{X}, \hat{W}$ (i.e. approximations of $X$ and $W$), and estimate the model $\hat{Y} = f(\hat{X}, \hat{W})$. Weights computed for each $\hat{x}, \hat{w}$ would capture the strength and polarity of their correlation with outcomes. Variables assigned significant, non-zero weights can be understood as potential legal (or non-legal) predictors. They may further be seen as causal predictors, if the model is causally-identified, or correlative predictors otherwise.
The challenge with legal applications is that the xs and ws are not available as structured data but found only in some natural language corpus $D$. Typically these are decision texts written to state and justify outcomes for each case $i$, though other documents including submissions, affidavits, and procedural records may also be relevant. We must apply a "codebook function" $g : D \to Q$, $Q \in \mathbb{R}^{n \times m}$, that maps $n$ texts to $m$ variables [33]. Where the variables desired are known ex ante based on legal domain knowledge, the researcher's aim is to extract observations of $x_i$. But in unfamiliar legal areas where candidate predictors are not already known, the goal shifts from filling observations or estimating coefficients to discovering such predictors to begin with. There are therefore three different tasks related to legal predictors (table 1). Notably, these are not mutually exclusive and must often be performed in tandem to answer the research question. Suppose as in [72] that we want to know if case origin influences the probability of a certiorari grant by the US Supreme Court. The variable of interest is known but potential confounders remain to be identified. We would need to extract observations for case origin, discover (and thereafter extract observations for) potential confounders, and finally analyse coefficients for case origin while controlling for these confounders. This work is chiefly concerned with the discovery task, though extraction and analysis are by-products of the proposed method.
Table 1 Tasks involving legal predictors. This work focuses on the discovery task.

Pre-processing and Masking
We begin with a text corpus $D$ and structured categorical outcomes $Y$ for $n$ cases in some legal area of interest. In theory, any corpus with sufficient case information, such as case briefs and affidavits, could be used. In practice, most legal analysis is based solely on decision texts. Other legal documents are usually not accessible at scale. Thus, we tailor the approach assuming $D$ is a decision corpus. The use of decision texts has important implications for the kind of analyses possible and the pre-processing steps necessary. Specifically, fitting legal outcome models on decisions is problematic because decisions are written by judges, after observing case facts, to justify case outcomes [72]. Extracted features could therefore contain both post-treatment and post-outcome information, making them "bad controls" [17]. Formally, suppose decisions are generated by the process $D = t(Y, X, W, J)$, where $J$ accounts for the judges' individual writing styles and $t$ is some text-generation function. Substituting this into the model $\hat{Y} = g(D)$ gives $\hat{Y} = g(t(Y, X, W, J))$.
Since we are indirectly modelling Y on itself, we should expect the model to produce large, significant estimates for features still containing hints of Y after the transformations g and t instead of unbiased estimates for xs and ws.
As we do not control t, the natural solution, other than switching to some pre-outcome corpus, is to build into g processes for masking information on Y. We follow standard steps from the legal prediction literature in masking outcome-revealing sections of and phrases in the text from the model by deleting them entirely at the start of the pipeline [1,74]. This may over-inclusively remove otherwise informative words, but is taken as a necessary and non-fatal trade-off [74]. It also may not remove all outcome information from D. Since decisions are written to justify case outcomes, even seemingly innocuous sections such as "Case Facts" could be arranged in a way that favours the writer's preferred disposition. Indeed, lawyers are typically taught to present facts persuasively [64]. This pertains especially to case briefs, but we cannot preclude its occurrence in decisions. As such, we emphasise that predictors discovered by our method should be interpreted as correlative.
Where required, we then pre-process the masked corpus in standard fashion by lowercasing, stopping, and lemmatisation. This applies mainly to LSA as BTO is trained on raw texts.¹
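As a rough illustration of the masking step, the Python sketch below deletes outcome-revealing sections and outcome words before any features are built. The section titles, the outcome-word pattern, and the `mask_decision` helper are illustrative assumptions, not the rules actually used in this work.

```python
import re

# Hypothetical titles of sections whose contents reveal the outcome.
MASKED_SECTIONS = {"discussion and findings", "decision"}

# Hypothetical outcome words and simple inflections (e.g. "transferred").
OUTCOME_WORDS = re.compile(
    r"\b(transfer\w*|cancel\w*|den(?:y|ies|ied|ial)\w*)\b", re.IGNORECASE
)


def mask_decision(sections):
    """Return masked text from a {section title: section text} mapping."""
    kept = []
    for title, text in sections.items():
        if title.strip().lower() in MASKED_SECTIONS:
            continue  # delete outcome-revealing sections entirely
        kept.append(OUTCOME_WORDS.sub(" ", text))
    # Collapse the whitespace left behind by deletions.
    return re.sub(r"\s+", " ", " ".join(kept)).strip()


example = {
    "Factual Background": "The Complainant owns the mark.",
    "Parties' Contentions": "The Complainant requests transfer of the domain.",
    "Decision": "The Panel orders that the domain be transferred.",
}
masked = mask_decision(example)
```

Deleting whole words in this way is deliberately over-inclusive, in line with the trade-off described above; lowercasing, stopword removal, and lemmatisation would follow as separate steps.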

Topic Modelling
Topic models are suitable codebooks because of their readability: each q ∈ Q can be manually interpreted based on representative n-grams, and documents with higher q weights can be read as being more heavily or likely 'about' q. Of the numerous topic models in the literature, here we experiment with one hot encoding (i.e. indicators for each n-gram in the corpus' overall vocabulary) ("OHE"), LSA, and BTO to cover a range of traditional and emerging approaches. As topic models are well-documented elsewhere, below we provide a condensed description of those we test.
LSA first computes a term-frequency/inverse-document-frequency ("TFIDF") encoding [40,67,21]. The TFIDF matrix is compressed into m desired topics (explained below) by applying Singular Value Decomposition ("SVD") and keeping only features corresponding to the largest m singular values. The truncated SVD approximates a matrix $W$ as $W_m = U_m S_m \Lambda_m^T$, where $\mathrm{rank}(W_m) = m$ and $m \le \mathrm{rank}(W)$ [5]. When $W$ is a TFIDF matrix, $U_m$ corresponds to n-gram vectors, $S_m$ to singular values of $W$, and $\Lambda_m$ to document vectors [21]. The corpus is thus represented through $\Lambda_m$ as a distribution of m topics across n documents [46]. These "topics" are represented in $U_m$ as distributions across n-grams. For intuition, observe that an optimal compression of term frequency matrices should squeeze co-informative terms together, forming said topics. We use LSA here because of its prominence in [1]'s influential work on legal outcome prediction for ECHR cases as well as subsequent related work.
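The decomposition described above can be sketched in a few lines of Python with NumPy. This is a toy illustration under stated assumptions: the four documents are invented, the TFIDF weighting is one simple variant, and the helper names `tfidf_matrix` and `lsa` are hypothetical. In practice, library classes such as sklearn's `TfidfVectorizer` and `TruncatedSVD` perform the equivalent steps at scale.

```python
import math
from collections import Counter

import numpy as np


def tfidf_matrix(tokenised_docs):
    """Build a terms x documents TFIDF matrix (one simple weighting variant)."""
    vocab = sorted({tok for doc in tokenised_docs for tok in doc})
    n_docs = len(tokenised_docs)
    doc_freq = Counter(tok for doc in tokenised_docs for tok in set(doc))
    W = np.zeros((len(vocab), n_docs))
    for j, doc in enumerate(tokenised_docs):
        counts = Counter(doc)
        for i, term in enumerate(vocab):
            if counts[term]:
                W[i, j] = counts[term] * (math.log(n_docs / doc_freq[term]) + 1.0)
    return vocab, W


def lsa(W, m):
    """Keep the m largest singular values of W = U S Lambda^T."""
    U, s, Lt = np.linalg.svd(W, full_matrices=False)  # s is sorted descending
    term_topics = U[:, :m]             # U_m: n-grams x topics
    doc_topics = Lt[:m, :].T * s[:m]   # Lambda_m scaled by S_m: documents x topics
    return term_topics, s[:m], doc_topics


docs = [
    "complainant owns registered trademark".split(),
    "respondent registered domain bad faith".split(),
    "complainant trademark domain confusingly similar".split(),
    "respondent bad faith registered domain".split(),
]
vocab, W = tfidf_matrix(docs)
term_topics, sing_vals, doc_topics = lsa(W, m=2)
```

The second and fourth toy documents contain the same words, so they receive identical topic weights, which is the sense in which documents with similar weights are "about" the same topic.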
BTO [34] is a modular framework which starts with paragraph embeddings typically derived from a language model ("LM"). Depending on the LM's context window, longer documents may be partitioned into smaller chunks if necessary [70]. Chunk embeddings undergo dimensionality reduction via a standard algorithm such as UMAP [52] (the default) or principal components analysis before clustering via another algorithm such as HDBSCAN [51] or k-Means. Topics are extracted from these clusters using a bag-of-words vectorizer followed by a "class-based" TFIDF implementation given by $\mathrm{cTFIDF}_{w,c} = \lVert tf_{w,c} \rVert \times \log(1 + A / f_w)$, where $tf_{w,c}$ is the frequency of n-gram $w$ in cluster $c$, $f_w$ is $w$'s frequency across all clusters, and $A$ is the average number of tokens per cluster. This produces an arbitrary number of topics which can be iteratively merged based on topic frequency and cTFIDF similarity until a desired number remains. The resulting chunk-topic matrix can then be re-constituted into document-level topics in several ways, for instance by assigning a document to the one topic which contains the largest number of its chunks (i.e. max-pooling). Following [70], we take chunk-topic counts normalised at document level. We test two BTO models primed with chunk embeddings from (1) all-MiniLM-L6-v2 [38], a sentence transformer based on [78] and recommended by [34] ("BTO_M"), and (2) legalBERT [12], a BERT [19] extension fine-tuned on UK, EU, and US legal documents ("BTO_L"). Inspiration for using BTO in the legal context comes from [66], which used a multilingual MiniLM-embedded BTO model to study Canadian housing law court decisions written in French.
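To make the last two steps concrete, the following Python sketch scores n-grams per cluster with the class-based TFIDF formula above and then converts a document's chunk-level topic assignments into normalised document-level weights. It is a simplified illustration, not the bertopic implementation: raw counts stand in for the normalised term frequency, and the function names and toy clusters are hypothetical.

```python
import math
from collections import Counter


def c_tf_idf(cluster_tokens):
    """Score n-grams per cluster as tf_{w,c} * log(1 + A / f_w)."""
    tf = {c: Counter(tokens) for c, tokens in cluster_tokens.items()}
    total_freq = Counter()
    for counts in tf.values():
        total_freq.update(counts)  # f_w: frequency of w across all clusters
    avg_tokens = sum(total_freq.values()) / len(tf)  # A: mean tokens per cluster
    return {
        c: {w: n * math.log(1 + avg_tokens / total_freq[w]) for w, n in counts.items()}
        for c, counts in tf.items()
    }


def document_topic_weights(chunk_topics, n_topics):
    """Normalise a document's chunk-topic counts so that they sum to 1."""
    counts = Counter(chunk_topics)
    return [counts[t] / len(chunk_topics) for t in range(n_topics)]


clusters = {
    0: ["bad", "faith", "bad", "domain"],
    1: ["trademark", "domain", "famous", "trademark"],
}
scores = c_tf_idf(clusters)

# A document split into four chunks, three of which fall in topic 0.
weights = document_topic_weights([0, 0, 1, 0], n_topics=2)
```

Unlike max-pooling, which would assign this document wholly to topic 0, the normalised counts retain the information that a quarter of its chunks belong to topic 1.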
Here we generate topics comprising 1, 2, 3-grams for all topic models. For LSA, we generally take only the 2500 most frequent n-grams at the TFIDF step before reducing the matrix to a desired topic number based on corpus size. As context, [1]'s best predictive models for ECHR cases generally used LSA topics created with the 2000 top 1, 2, 3, 4-grams. However, in our (unreported) exploratory tests, we noted that 4-grams did not add new interpretable information as they usually repeated terms already seen in 1, 2, 3-grams. We also set minimum document frequency cutoffs of 5 or 10 (depending on dataset and topic model) in LSA's TFIDF and BTO's cTFIDF steps to limit computational and memory overheads. Other parameters follow recommendations and defaults from the sklearn [69] and bertopic [34] libraries.

LASSO Regression and Selective Inference
We use a LASSO [76] regression model to associate topics with outcomes. The LASSO uses the coefficient vector's L1-norm as a penalty term when optimising the model, such that the objective function becomes $L(\beta)^* = L(\beta) - \lambda \sum_j |\beta_j|$, where λ is a user-specified "shrinkage parameter" that controls penalisation magnitude, and $j > 0$ (the intercept is not penalised). The LASSO is suitable for legal outcome models in three ways. First, as the goal is to discover interpretable legal topics rather than inexplicably predict legal outcomes, regression models are preferable to more opaque approaches like neural networks. Second, the LASSO overcomes two common, related problems with legal outcome models. One is that, as text feature matrices are typically large and sparse, and legal corpora often yield few observations, legal outcome models are prone to the k ≫ n problem [82,35]: as k approaches and eventually exceeds n, standard regression models relying on maximum likelihood estimation are liable to produce biased estimates or fail to converge entirely. The other is that legal areas often present highly imbalanced response classes, forcing us to estimate "rare events" [81,4,63]. For instance, in our UDRP dataset, ∼90 percent of the cases are decided in the complainant's favour (table 2). Coupled with k ≫ n, legal outcome models could be perfectly separated (that is, outcomes can be perfectly predicted with a subset of features), preventing model convergence.
Penalised regressions are one standard countermeasure to both problems [26,37,81,36,35]. In bioinformatics and chemometrics, LASSO regressions have been successfully deployed in studies involving large feature matrices and rare events [79,80]. Third, the LASSO lets us exploit emerging methods for selective inference. Conventionally, significance tests are not done with penalised regressions since regularisation means estimates are biased toward zero and not consistent [35]. Nonetheless, LASSO regressions were demonstrably capable of selecting the most significant regressors, particularly in a k ≫ n setting (see [35]). More recently, [47] devised a method for conducting valid post-selection significance tests which [75] extend to the LASSO. P-values are computed after de-biasing the model post-selection [47,75]. Coefficient estimates must still be interpreted in light of the penalty, but p-values and standard errors remain valid and have been shown to be more reliable than non-adjusted values from subset-selected models [47,35]. Notably, if as cautioned above we confine ourselves to discovering correlative rather than causal predictors, significance test validity is less of a concern. We use [75]'s R package selectiveInference [77] and, following their documentation, estimate the LASSO with glmnet [27].
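For intuition on how the L1 penalty selects features and copes with perfect separation, the sketch below fits an L1-penalised logit by proximal gradient descent in plain Python. It is a stand-in for illustration only: it is not glmnet's algorithm, it computes no selective-inference p-values, and the function names, step size, and simulated data are assumptions. It minimises the mean negative log-likelihood plus the penalty, which is equivalent to maximising the penalised objective above up to scaling.

```python
import math
import random


def soft_threshold(z, t):
    """Shrink z toward zero by t, returning exactly 0 when |z| <= t."""
    return math.copysign(max(abs(z) - t, 0.0), z)


def fit_l1_logit(X, y, lam, step=0.5, n_iter=300):
    """Minimise mean negative log-likelihood + lam * sum_j |beta_j|.

    The intercept b0 is left unpenalised.
    """
    n, k = len(X), len(X[0])
    b0, beta = 0.0, [0.0] * k
    for _ in range(n_iter):
        # Residuals p_i - y_i under the current coefficients.
        resid = []
        for xi, yi in zip(X, y):
            eta = b0 + sum(b * v for b, v in zip(beta, xi))
            resid.append(1.0 / (1.0 + math.exp(-eta)) - yi)
        b0 -= step * sum(resid) / n  # plain gradient step for the intercept
        for j in range(k):
            grad_j = sum(r * xi[j] for r, xi in zip(resid, X)) / n
            # Gradient step followed by soft-thresholding (the L1 proximal map).
            beta[j] = soft_threshold(beta[j] - step * grad_j, step * lam)
    return b0, beta


random.seed(0)
X = [[random.gauss(0, 1) for _ in range(4)] for _ in range(100)]
y = [1 if row[0] > 0 else 0 for row in X]  # outcome perfectly separated by feature 0

_, beta_moderate = fit_l1_logit(X, y, lam=0.1)
_, beta_heavy = fit_l1_logit(X, y, lam=1000.0)
```

With a moderate penalty the separating feature keeps a finite positive coefficient even though an unpenalised maximum likelihood fit would diverge on these data; with a very heavy penalty every penalised coefficient is set exactly to zero.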

Evaluation
We test several model specifications for the primary UDRP dataset, varying whether topic features are included and the topic model used (see part 2.3). Each specification is also evaluated on standard measures of fit including the area under the receiver operating characteristic curve ("AUROC") and the median deviance ratio ("MDR"). The latter summarises all deviance ratios reported by glmnet along the λ fitting path and can be interpreted as the pseudo-R² [27]. We manually evaluate selected specifications by delving into topics with the largest positive or negative coefficients and the smallest p-values for those specifications. The author, who is legally-trained, then studied the topics' n-gram distributions and the cases most strongly associated with them to see how far they corresponded with topics known to be significant in legal doctrine. Notice that even if they do not, topics discovered this way could point to some yet unknown X or W driving legal outcomes. This step should therefore be informed by legal theory. To be sure, we do not suggest it can be fully-automated, nor that the method is sufficient to identify all legally-significant topics.
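The two fit measures can be illustrated with a short Python sketch. It assumes the deviance ratio is the fraction of null deviance explained, and it uses invented predicted probabilities for three hypothetical points on a λ path; the function names are likewise hypothetical.

```python
import math
from statistics import median


def auroc(y_true, scores):
    """Probability that a random positive case outscores a random negative one."""
    pos = [s for s, y in zip(scores, y_true) if y == 1]
    neg = [s for s, y in zip(scores, y_true) if y == 0]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


def binomial_deviance(y_true, probs):
    """Minus twice the Bernoulli log-likelihood of the predicted probabilities."""
    return -2.0 * sum(
        math.log(p) if y == 1 else math.log(1.0 - p) for y, p in zip(y_true, probs)
    )


def deviance_ratio(y_true, probs):
    """Fraction of the null (intercept-only) deviance explained by the model."""
    base_rate = sum(y_true) / len(y_true)
    null_dev = binomial_deviance(y_true, [base_rate] * len(y_true))
    return 1.0 - binomial_deviance(y_true, probs) / null_dev


y = [1, 1, 1, 0]
# Invented predictions at three penalty strengths, from heaviest to lightest.
path = [
    [0.75, 0.75, 0.75, 0.75],  # intercept-only fit
    [0.80, 0.80, 0.70, 0.40],
    [0.90, 0.90, 0.80, 0.20],
]
mdr = median(deviance_ratio(y, probs) for probs in path)
area = auroc(y, path[-1])
```

The intercept-only fit has a deviance ratio of zero by construction, so the median over the path rewards specifications that explain deviance across a range of penalties rather than at a single λ.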
Other than evaluation, the method requires structured data in only two respects. First, labelled case outcomes are needed. While not considered in this work, existing methods for automated legal outcome extraction (e.g. [1,41,11,54]) could be incorporated at an earlier pipeline step. Second, tailored pre-processing work is necessary to sectionise documents and to mask outcome-leaking information. Other than in these two areas, topics correlated with legal outcomes are automatically synthesized from the corpus, selected by the LASSO, and surfaced by post-selection significance tests. Prior domain knowledge of potential legal predictors within the given legal area is neither assumed nor required, though it would certainly be a bonus. Likewise, while structured case metadata is not strictly needed, any available variables can easily be included as additional covariates at the regression stage.

Domain Name Disputes
The UDRP is a mandatory policy instituted in 1999 by the Internet Corporation for Assigned Names and Numbers ("ICANN") for resolving disputes over generic top-level domains ("GTLD"s). Several countries have adopted similar policies for their country-coded top-level domains ("CCTLD"s) [15]. Disputes are administered by ICANN-appointed Dispute Resolution Providers ("DRP"s). The largest DRP by disputes resolved is the World Intellectual Property Organisation ("WIPO"). Under the Rules for Uniform Domain Name Dispute Resolution Policy, a case begins when a trademark holder files a complaint with a DRP. The DRP will ask the respondent for a written response, and thereafter assemble an adjudication panel of 1 or 3 panellists, depending on the parties' preferences. Under UDRP Article 4a, the complainant must show that (1) the contested domain is "identical or confusingly similar" to the complainant's trade or service mark; (2) the respondent does not have any "rights or legitimate interests" in the contested domain; and (3) the contested domain was "registered and used in bad faith". While parties may be represented by lawyers, all procedures are written and there are no physical hearings. If a complaint succeeds, the panel may order the domain to be transferred to the complainant or be cancelled altogether. Decisions are communicated to and enforced by the relevant domain name registrar [58].
We obtained from WIPO's online database² decision texts for WIPO-administered UDRP disputes decided between 1999 and 2016 (inclusive). Regular expressions were developed, by iterative testing on randomly-sampled decisions, to partition the texts into 8 archetypal sections. Case outcomes are typically stated in a final section titled "The Decision", and occasionally in a preceding section generally titled "Discussion and Findings". The latter details the panel's legal reasoning and analysis. Both sections were masked. Outcome labels "transfer", "cancel", and "deny" and linguistic variants thereof were also removed. This left only sections on case facts, parties involved, procedural history, and arguments presented for downstream processing. Decisions where fewer than all 8 sections could be detected, either because they were not in English or because of exceptional or missing headers, were excluded. This reduced the initially downloaded set of 27,634 raw cases to 22,653 usable observations.³
Labelled outcomes and other structured variables were extracted from case summary tables on WIPO's website. Each table contains case number, decision date, the domains, parties, and panellists involved, and outcome. While only three outcomes (i.e. transfer, cancellation, or complaint denied) are possible per domain, cases with multiple domains could present mixtures (e.g. "Complaint denied, transfer in part with dissenting opinion"). Nonetheless, the vast majority (98.87%) of cases involved singular outcomes. By studying the data we found that outcome statements start with the outcome assigned to a majority of the contested domains (i.e. in the example above most domains would not have been transferred). We thus binarised outcomes by recording 1 when the outcome statement begins with "Transfer" or "Cancellation", and 0 when it begins with "Complaint denied". Basic string methods were used to extract other variables from the tables, including the number of panellists, complainants, respondents, and domain names involved, whether the case involved GTLDs or CCTLDs, and year and month indicators. We also created indicators for repeat complainants (respondents) appearing in >100 (30) cases.
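The binarisation rule just described amounts to a prefix check on the outcome statement, as in the Python sketch below. The function name is hypothetical, and returning `None` for unrecognised statements is an assumption made here for illustration rather than a rule stated in this work.

```python
def binarise_outcome(statement):
    """Map an outcome statement to 1 (complaint succeeded) or 0 (complaint denied)."""
    text = statement.strip().lower()
    if text.startswith(("transfer", "cancellation")):
        return 1
    if text.startswith("complaint denied"):
        return 0
    return None  # assumption: unrecognised statements are left for manual review


mixed = binarise_outcome("Complaint denied, transfer in part with dissenting opinion")
```

Because only the leading outcome is read, the mixed statement above is coded by the outcome assigned to the majority of the contested domains.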
Identity indicators were also created for all panellists. We use this to demonstrate how the method could be instrumental for studying how judge identity influences legal outcomes, a staple in "judicial behaviour" research [24]. Legal scholars have debated the UDRP's merits [7], with critics alleging structural pro-complainant biases in the UDRP procedural rules [57,28,56,42]. Proponents [22,60,45] countered that critics fail to account for specific case attributes. Empirical analyses have offered different explanations for high complainant success rates. [43] argued that case resolution efficiency was as important as apparent bias in determining provider choice, while [44] used an alternative linear regression methodology on [43]'s dataset of 2000-2001 cases.
Table 2 summarises the dataset. It contains information on significantly more cases and variables than an earlier UDRP corpus compiled by [8]. On this data we run a penalised logit regression of $\mathrm{complainantwon}_i$ on $\mathrm{panelistidentity}_i$, $\mathrm{panelsize}_i$, $\mathrm{textfeatures}_i$, and $\mathrm{controls}_i$, where $\mathrm{complainantwon}_i$ is an indicator for complaint success, $\mathrm{panelistidentity}_i$ is an indicator matrix for panellist involvement, $\mathrm{panelsize}_i$ indicates if the case involved three panellists or one, and $\mathrm{textfeatures}_i$ is either an OHE, LSA, or BTO document-topic matrix. $\mathrm{controls}_i$ are indicators for year and month, repeat player involvement, and whether the case involved GTLDs or CCTLDs. As indirect controls for dispute complexity, we also included the raw and processed word counts of the relevant decision, as well as the number of complainants, respondents, and domain names involved.
To investigate the topic models' impact, we estimate regressions with/without topic features across three settings: (A) only 1-panellist GTLD cases, (B) all GTLD cases, and (C) all cases. We partition the data by panel size and domain type because these give rise to qualitatively different case types. To evaluate models in the same regression setting on similar bases, we extract exactly 250 topics with each topic model. We chose 250 after some iterative testing with LSA because it represented a 90% compression of the original TFIDF matrix (recall that the top 2,500 n-grams were used) but, as computed by the SVD, explained about 61% of the variance in the same. Around the 250 mark, reducing (increasing) the number of topics led to more (less) than proportionate losses (gains) in variance explained. We used LSA rather than BTO models to experiment with topic number because re-estimating BTO models requires significantly more compute. There is some inevitable arbitrariness here as identifying the appropriate number of topics is a known challenge in topic modelling [32]. Future work could study how emerging techniques for doing so (e.g. [29,68]) could be incorporated into our method. All topic models are trained using only decisions within the relevant partition. The exception is BTO chunk embeddings (only the first step), which are pre-computed once on the entire corpus and used across all settings, as the embedding process is computationally expensive. We also pre-computed the shrinkage parameter λ on specifications without text features, following the guideline suggested in [61,1] to set $\lambda = 2\,\mathbb{E}[\lVert X^T \epsilon \rVert_\infty]$ where $\epsilon \sim N(0, \sigma^2)$ and $\sigma^2$ is the residual sum of squares from a simple linear regression of y on all regressors. The same λs were then used for mirror specifications with text features. As a further baseline, we also tested specifications with white noise placebos [53].
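The expectation in the λ guideline has no simple closed form for a general X, but it can be approximated by simulation. The Python sketch below is a minimal Monte Carlo version under stated assumptions: the function name, the number of draws, and the seed are hypothetical, σ is passed in directly rather than estimated from a preliminary regression, and any rescaling that a particular LASSO solver expects is left out.

```python
import random


def estimate_lambda(X, sigma, n_draws=1000, seed=0):
    """Monte Carlo estimate of 2 * E[max_j |x_j' eps|] with eps ~ N(0, sigma^2 I)."""
    rng = random.Random(seed)
    n, k = len(X), len(X[0])
    total = 0.0
    for _ in range(n_draws):
        eps = [rng.gauss(0.0, sigma) for _ in range(n)]
        # Sup-norm of X' eps: the largest absolute inner product over columns.
        total += max(abs(sum(X[i][j] * eps[i] for i in range(n))) for j in range(k))
    return 2.0 * total / n_draws


# Sanity check on a 1 x 1 design, where the target is 2 * E|Z| = 2 * sqrt(2 / pi).
lam_unit = estimate_lambda([[1.0]], sigma=1.0)
lam_double = estimate_lambda([[1.0]], sigma=2.0)
```

The estimate scales linearly with σ, so a noisier outcome leads to a proportionally heavier penalty and fewer selected features.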

European Convention on Human Rights violations
The ECHR establishes fundamental human rights for signatory jurisdictions, including the prohibition of torture (Article 3), right to a fair trial (Article 6), and right to respect for private and family life (Article 8). The European Court of Human Rights ("ECtHR") adjudicates complaints. The court publishes decision texts and "case detail" tables on its "HUDOC" database.⁴ ECHR cases have been studied in several prior works [1,55] and included in the benchmark LexGLUE [13].
While LexGLUE provides a large number of processed ECHR texts and outcomes, that dataset is not linked to case identifiers, making topic interpretation challenging. Here we use [55]'s dataset and replicate their pre-processing steps with their published code. We limit the analysis to training set cases with clear violation/non-violation outcomes (i.e. not filed in the dataset as "both"). Below we focus on Articles 3, 6, and 8 which have the largest number of cases in this dataset. Following [55], we use only text from the Procedure, Circumstances, and Relevant Law sections. Table 3 summarises the dataset. As our aim was to demonstrate generalisability, unlike with the UDRP we did not further extract new case variables. The main specification tested is $\mathrm{violation}_i = \mathrm{textfeatures}_i + \epsilon_i$ with $\mathrm{textfeatures}_i$ being 100 topics synthesised using the above topic models.
Table 3 Summary statistics for the ECHR dataset. Mean values presented with standard deviations in brackets. Notice that [] had balanced the dataset by random under-sampling.

UDRP results
Table 4 summarises our primary results on the UDRP dataset. Columns 1-3 report baseline estimates computed without any text features for three main regression settings. Around 50 panellists are significant at α = 0.05 across these baselines even with several controls included (column 3), suggesting an association between their involvement and complaint outcomes. The association is notably weaker in the corresponding topic regressions with OHE, LSA, BTO_M, and BTO_L features added (columns 4, 5-7, 8-10, and 11-13). The topic regressions consistently yield fewer significant panellists, smaller panellist effects, and better model fits. Statistical significance can be observed shifting towards the topics instead. This can already be observed with simple OHE, but is clearest with the LSA regressions, where few panellists remain significant (9, 13, 8 in columns 5-7 versus 53, 50, 49 in columns 1-3). Across all regression settings, LSA consistently produces the largest number of significant topics and the highest fit scores. Column 4.7 in particular yields 32 significant topics but only 8 significant panellists at α = 0.05 and the highest MDR (0.431) and AUROC (0.914). BTO_L and BTO_M yield more significant panellists and fewer significant topics, but are nonetheless superior to the non-text and white noise (column 4.14) baselines, suggesting that these topic models also capture information on case features. The legally-finetuned BTO_L performs slightly better than BTO_M (MDR=0.295, AUROC=0.849 in column 13 versus MDR=0.275, AUROC=0.838 in column 10), suggesting that domain adaptation helps. These results are relevant to legal debates on whether UDRP processes exhibit pro-complainant bias. While our correlative models cannot establish the absence of bias, our findings are consistent with [60,45]'s argument that high complaint success rates are better explained by case facts than structural pro-complainant biases. More importantly, our results suggest that the pipeline can automatically discover correlative legal predictors from decision texts. This becomes clearer when inspecting the discovered topics.
The 5 LSA, BTO_M, and BTO_L topics with smallest p-values in columns 7, 10, and 13 respectively are presented in table 5. Some topics are intuitive. For example, the negative effect associated with LSA 17, a topic populated by n-gram variations on "administratively deficient", suggests logically that "administratively deficient" complaints correlate with worse complainant outcomes. Manual evaluation revealed that cases with the strongest weights for this topic indeed involved deficient complaints.⁵ Three of these complaints were denied. Likewise, the top LSA 3 cases involved situations where the complainant provided incorrect "contact information" for the domain registrant and was asked to amend the complaint accordingly.⁶ Other topics are less readable, but their underlying logic can be identified on closer inspection. For instance, LSA 19 and BTO_M 5 are populated by references to famous trademarks and brands. These topics feature most strongly in complaints filed by large corporations which owned these and other famous marks, and typically against individuals who had registered variations on their brand names.⁷ For example, 3 of the top 5 LSA 19 cases involved the "lego" company suing for domains such as "legosets101.com" and "legowolds.com". Panels typically found evidence of bad faith in how respondents could not have registered these domains without knowing of the complainants' well-known marks. 9 of the top 10 complaints succeeded. The exception was a complaint filed by "Hugo Boss" for "boss-watch.com" and "boss-world.com". This was denied because the respondent had been selling watches under the "BOSS" mark in Hong Kong since the 1970s, before the complainant's mark was established.⁸
Consider also BTO_L 69, which associates references to "reverse domain name hijacking" ("RDNH") with lower complaint success rates. UDRP Rule 1 defines RDNH as "using the Policy in bad faith to attempt to deprive a registered domain name holder of a domain name". When RDNH is found, the complaint fails. Recall however that the masked texts used in topic modelling exclude the "Discussion and Findings" and "Decision" sections, so the model should not have information on whether RDNH occurred. Inspecting the cases here reveals that RDNH n-grams feature strongly in the included "Contentions" section when respondents actively defend the claim and raise the RDNH issue. In the usual case where respondents default, neither panellists nor complainants have incentives or need to discuss it. Thus while RDNH was not ultimately found in any of the top 5 BTO_L 69 cases, all were rare cases involving active respondents. This explains the topic's negative association with complaint success.
Not every topic can be easily understood. For instance, the top cases for BTO M 57, a topic represented by n-grams referencing Middle Eastern countries, indeed involved complainants from this region.9 Of these, 2 also involved Middle Eastern respondents. All 5 complaints were denied, but for differing reasons. In 3 cases the complainant failed to show bad faith because the domain had been registered before the complainant's mark was established. Whether complaints from Middle Eastern parties are properly associated with these facts and with lower success rates is however unclear. Likewise, BTO L 1 is populated by n-grams tracking a typical portion in the "Procedural History" section which states that "the Panel has submitted the Statement of Acceptance and Declaration of Impartiality and Independence, as required by the Center to ensure compliance with the Rules, paragraph 7".10 When this sentence occurs immediately before the next section header, "Factual Background", the topic's n-grams arise (after stopword removal). Why this correlates with better complainant outcomes is not clear. It may signal the lack of other procedural issues, such that panellists can move directly to the next section,11 but more qualitative evaluation is needed to ascertain this.

Table 6
LASSO logit results for ECHR cases. λs are separately derived per model following [61].

ECHR Results
Table 6 presents results for LSA, BTO M, and BTO L regressions fit on ECHR cases. As with the UDRP, LSA tends to produce higher model fits and the largest number of significant topics. This is especially so for Article 3, where 15 LSA topics are significant at α = 0.05 compared to 3 BTO M and 4 BTO L topics. This is notable given that LegalBERT was finetuned on ECHR cases [12]. It is thus not surprising that BTO L again produces higher fit measures than BTO M, especially for Article 8 (AUROC = 0.711 versus 0.638). However, both BTO models produce broadly similar numbers of selected and significant topics. Table 7 presents representative n-grams for significant ECHR topics chosen based on smallest p-value and largest coefficient sizes. For Article 3 (prohibition of torture or inhuman or degrading treatment), LSA 2 and BTO L 1 correctly discover and assign positive effects to what the ECtHR has described as "a whole series of cases concerning allegations of disappearances in the Chechen Republic".12 Applicants were typically Chechen individuals whose close relatives were allegedly abducted by state military servicemen. Despite multiple complaints to and visits from the state's district prosecutor's office, the applicants heard nothing of their relatives for years. The ECtHR has "found on many occasions" that the distress caused by their relatives' disappearance and the state's indifference to their plight violates Article 3.
The top cases for these topics all involved similar fact patterns.13 BTO L 21 captures cases involving rejected asylum seekers who argued that they faced real risks of being subjected to treatment violating Article 3 if they were sent home. This was allegedly because of their previous membership in military organisations that had clashed with their countries' current governments. "December 2010" is a significant n-gram because the United Nations High Commissioner for Refugees had issued updated eligibility guidelines for Afghan asylum seekers then. The top 5 cases were all complaints from ex-Afghan security service personnel. As the negative coefficient suggests, these claims were typically denied because, among other reasons, these guidelines did not include them in their risk profiles for rights violations.14 Notably, there is a similar line of unsuccessful complaints involving failed asylum seekers who previously served in the Sri Lankan Tamil Tigers.15 These were also picked up by LSA 12 (not tabulated in table 7), which is represented by n-grams including "sri lanka", "ltte", and "colombo". LSA 3 also identifies cases involving unsuccessful asylum seekers, but includes more varied claims from individuals originally from Somalia, Iraq, and Libya.16 These complaints tended to fail because the court did not find a sufficiently real risk of treatment contrary to Article 3.

Table 7
Significant ECHR topics across all Articles and topic models tested. We select topics to report here by first identifying the 5 lowest p-value topics within each regression, and then choosing topics with the 3 most positive and negative effects across all 3 regressions within each Article. Article 8 has only 4 significant topics in total. *p < 0.05, **p < 0.01, ***p < 0.001. α We note that this token is derogatory and have preferred the term Romani in the main article. With apologies to the Romani people, it was retained here to reflect tokens actually used in the corpus.

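The selection rule stated in the table 7 caption can be sketched as follows. This is a minimal illustration, not the code used for the experiments: the function name, the row layout (`topic`, `coef`, `p`), the explicit significance filter, and all numbers in the toy input are assumptions made for the example.

```python
def select_topics(regressions, n_lowest_p=5, n_extreme=3, alpha=0.05):
    """Pick topics to report for one Article.

    regressions: dict mapping a model name to a list of rows, each a dict
    with keys "topic", "coef" and "p" (a hypothetical layout).
    Step 1: within each regression, keep the n_lowest_p significant topics
            with the smallest p-values.
    Step 2: across all regressions, keep the n_extreme most positive and
            the n_extreme most negative coefficients.
    """
    shortlist = []
    for rows in regressions.values():
        significant = [r for r in rows if r["p"] < alpha]  # assumed filter
        shortlist.extend(sorted(significant, key=lambda r: r["p"])[:n_lowest_p])

    positive = sorted((r for r in shortlist if r["coef"] > 0),
                      key=lambda r: r["coef"], reverse=True)[:n_extreme]
    negative = sorted((r for r in shortlist if r["coef"] < 0),
                      key=lambda r: r["coef"])[:n_extreme]
    return [r["topic"] for r in positive], [r["topic"] for r in negative]


# Toy input with made-up coefficients and p-values.
toy = {
    "LSA": [
        {"topic": "LSA 2", "coef": 0.9, "p": 0.001},
        {"topic": "LSA 3", "coef": -0.7, "p": 0.002},
        {"topic": "LSA 12", "coef": -0.4, "p": 0.03},
        {"topic": "LSA 40", "coef": 1.5, "p": 0.6},
    ],
    "BTO L": [
        {"topic": "BTO L 1", "coef": 0.5, "p": 0.01},
        {"topic": "BTO L 21", "coef": -0.8, "p": 0.004},
    ],
}

positive_topics, negative_topics = select_topics(toy, n_lowest_p=3, n_extreme=2)
print(positive_topics, negative_topics)
```

Because coefficients are compared across regressions in the second step, the sketch only makes sense where the estimates are on a comparable scale; it is a reporting heuristic rather than an inferential step.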
For Article 6 (right to fair trial), LSA 2 and 6 surface a collection of cases where Ukrainian individuals awarded compensation judgments against certain (often state-linked) companies were forced to wait for years before receiving due payment. They argued that the state bailiff had inordinately delayed enforcement proceedings. Decisions for such cases are worded very similarly, and typically reiterate how the ECtHR has "already" or "frequently found" violations in like cases.17 Notably, a Ukrainian government judicial enforcement reform effort acknowledges Article 6 as its motivation.18 BTO L 79 is diluted by markup n-grams like "level0" but nonetheless identifies several cases involving Cypriot individuals who had participated in a 1989 anti-Turkish demonstration in disputed territory arising out of the 1974 Turkish intervention in North Cyprus.19 They were charged and convicted in the Turkish courts for entering Turkish territory without permission. Typically they argued that their Article 6 rights had been violated because the legal proceedings were generally conducted in Turkish rather than Greek, which they understood. The ECtHR generally rejected these claims because the applicants had reasonable access to interpreters and other legal assistance.
For Article 8 (right to respect for family and private life, home, and correspondence), LSA 2 points to a line of cases filed against the Polish authorities by individuals detained in criminal remand. These cases concerned the authorities' standard practice of reading and re-sealing correspondence sent by these individuals to the courts and stamping the envelopes with a "censored" label. The ECtHR has noted how it has "held on many occasions" that the label forces the court to assume an interference with correspondence that, unless justified, violates Article 8.20 Two of the top 5 BTO M 70 cases have similar facts, but the topic also seems to cover other kinds of interference with correspondence. Relatedly, the top 5 LSA 10 cases all involve complaints filed by Romani persons against the British government and its consistent refusal to grant them planning permission to develop land they owned into caravan sites. After the ECtHR found this to be a violation in 1996,21 several similar and ultimately successful cases were raised in which the ECtHR expressly "recalls that it has already examined" such complaints and found violations.22

DISCUSSION
We proposed and evaluated an automated pipeline for discovering significant topics from cases by performing penalised regression and selective inference on features synthesized from decision texts using topic models. We show that significant topics discovered through this process capture relevant information on factual patterns correlated with case outcomes. On a large (by legal standards) dataset of UDRP cases, legal outcome models fit with decision text topics consistently produce higher fit scores compared with models fit without (table 4). The LASSO also tends to select the topics as significant predictors over other structured case attributes of potential interest, such as judge identities. Coefficients and p-value estimates also change noticeably. This holds across several regression settings and topic modelling approaches. Using a canonical dataset of ECHR cases, we show that the method generalises relatively easily, without the need for additional feature engineering or pre-processing. Only structured outcome information and unstructured decision texts are required, though additional variables can be added at the regression step. Running similar procedures on the existing dataset, albeit with corpus-tailored hyperparameters (such as topic number and λ), yields significant topics consistent with ECHR case law.
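For reference, the penalised regression step can be written in the standard form of an L1-penalised (LASSO) logistic regression. The notation is illustrative and assumed for this sketch rather than taken from the experiments: y_i in {0, 1} is the outcome of case i, x_i collects its document-topic weights together with any additional covariates, n is the number of cases, and λ is the penalty weight.

```latex
(\hat{\beta}_0, \hat{\beta})
  = \arg\min_{\beta_0,\,\beta}\;
    -\frac{1}{n}\sum_{i=1}^{n}
      \Big[\, y_i\,(\beta_0 + x_i^{\top}\beta)
            - \log\!\big(1 + e^{\beta_0 + x_i^{\top}\beta}\big) \Big]
    \;+\; \lambda \lVert \beta \rVert_1
```

Larger values of λ shrink more coefficients exactly to zero, so the topics that remain with non-zero estimates are the ones described above as selected.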
Our experiments show that LSA is a useful, even if dated, codebook for decision texts. Across all experiment settings, LSA produced higher fit scores and a higher number of significant topics than both BTO models. LSA is also computationally cheaper. This may appear counter-intuitive since BTO is a significantly more sophisticated model which exploits recent advances like word embeddings and language modelling. As noted by [73] in the context of legal topic classification, the length of legal decision texts may offer one explanation for LSA out-performing newer approaches in the legal domain. BTO's superiority over traditional topic models has mainly been demonstrated on shorter texts like tweets and news articles [34,18]. For longer documents, our present approach of reconstituting document-level topics by normalising chunk-level topic counts is, while standard in the literature [70], unlikely to be the optimal way to deploy BTO. Using LMs with larger context windows than those tested here could significantly improve BTO's performance. Thus, we do not suggest that LSA is necessarily better-suited for this task. Further, since each topic model yields different topics which may provide different insights on the cases, there is no clear metric for "better" in this context. Moreover, BTO's lower fit scores may be a methodological artefact since we did not conduct hyperparameter optimisation, but chose similar parameters across all topic models to establish a baseline comparison. A fully-optimised BTO model may outperform a fully-optimised LSA. Hyperparameter tuning was not done because, unlike typical machine learning settings, our task focuses on explanation rather than classification and does not yield any clear performance metric (e.g. F1 score) for evaluating a grid search. BTO's performance here should be interpreted in this light.
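The chunk-to-document step mentioned above can be illustrated with a short sketch. It assumes, for illustration only, that each chunk of a long decision has been assigned a single integer topic id and that unassigned chunks carry the label -1; the function name and inputs are hypothetical and do not reproduce the experimental code.

```python
from collections import Counter


def document_topic_weights(chunk_topics, n_topics):
    """Turn per-chunk topic ids for one document into normalised weights.

    chunk_topics: one topic id per chunk; -1 (an assumed outlier label)
    is ignored. Returns a list of length n_topics that sums to 1, or all
    zeros when no chunk received a topic.
    """
    counts = Counter(t for t in chunk_topics if t >= 0)
    total = sum(counts.values())
    if total == 0:
        return [0.0] * n_topics
    return [counts.get(k, 0) / total for k in range(n_topics)]


# A decision split into six chunks: three on topic 2, one each on
# topics 0 and 1, and one chunk left unassigned.
weights = document_topic_weights([2, 2, 0, -1, 2, 1], n_topics=4)
print(weights)
```

Each document then enters the regression as one row of such weights. A limitation of this design is that every chunk counts equally regardless of its length or of how confidently it was assigned.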
More importantly, qualitative evaluation of significant topic n-grams demonstrates that the discovered topics rest on sound and interpretable legal bases. For UDRP cases, the topics shed correlative light on how administratively deficient complaints are less likely to win, how famous trademark owners can be associated with higher success rates, and how respondents who actively defend their domains are less likely to lose. For Arts 3, 6, and 8 ECHR, the topics identify archetypal cases involving abducted Chechen relatives, Afghan asylum seekers, Ukrainian judgment enforcers, Cypriot demonstrators, Polish detainees, and Romani land owners. These are correctly associated with their usual case outcomes. Essentially, unique case features prompt judges into writing decisions with a higher preponderance of correspondingly unique n-grams, producing signals which the topic models are capable of detecting. As legal decisions are written with close reference to case facts and relevant laws, and judges would generally not write about irrelevant matters and non-issues, we theorise that the decision text generation function accords with the standard topic modelling assumption that texts are generated by sampling n-grams from latent topics [6].
To be sure, not every discovered topic made sense. This may point to limitations in our evaluation process, since we only sampled the top 5 cases associated with the most significant topics. We may also have been unable to detect known patterns that the topics were in fact referencing. Further, there is no reason why each topic should capture exactly one predictor. Certain topics may have had n-gram distributions amalgamating several archetypal case features. Individually insignificant topics could have been jointly significant with others. Future work could examine this further by modelling interaction terms and conducting joint tests, though this may make interpreting the topics and coefficients more challenging for evaluators. Besides human limitations, the automated process is also imperfect. It can produce false positives (e.g. significant topics which do not actually capture legal predictors; high document-topic weights for a case not actually on topic) and false negatives (not attaching significance to topics which do; not synthesising a related topic to begin with).
At a more abstract level, the four pipeline steps can be understood as a series of dimensionality reduction steps, starting with a large decision corpus, and resulting in numerical associations between decisions and topics (document-topic weights), topics and outcomes (regression coefficients), and topics and words (n-gram distributions). These mappings can be analysed transitively to assist with the discovery, extraction, and analysis tasks identified in part 2.1. To illustrate, after observing UDRP LSA topic 19 (table 5), researchers could create and extract observations for an indicator variable for whether the complainant owned a famous mark. Decision-topic weights produced by the method could guide the extraction process. After several such variables are extracted, a regression (not necessarily the LASSO) could then be run on the reduced dataset. Notably, given the increasing popularity of large language models, the pipeline's ability to reduce large legal corpora to smaller components could prove useful in fitting legal texts into limited context windows.
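As a rough illustration of how decision-topic weights could guide such an extraction exercise, the sketch below ranks cases by their weight on a chosen topic so that the most representative decisions can be read first. The case identifiers, weights, and function name are all hypothetical and serve only to show the shape of the lookup.

```python
def top_cases(doc_topic_weights, topic, k=5):
    """Return the ids of the k cases weighted most heavily on `topic`.

    doc_topic_weights: dict mapping a case id to a dict of
    {topic label: weight}; topics absent for a case count as weight 0.
    """
    ranked = sorted(
        doc_topic_weights,
        key=lambda case_id: doc_topic_weights[case_id].get(topic, 0.0),
        reverse=True,
    )
    return ranked[:k]


# Hypothetical document-topic weights for four decisions.
weights_by_case = {
    "case_001": {"LSA 19": 0.71, "LSA 17": 0.02},
    "case_002": {"LSA 19": 0.05, "LSA 17": 0.64},
    "case_003": {"LSA 19": 0.48, "LSA 17": 0.10},
    "case_004": {"LSA 19": 0.12},
}

# Candidate decisions to review when coding a "famous mark" indicator.
famous_mark_candidates = top_cases(weights_by_case, "LSA 19", k=2)
print(famous_mark_candidates)
```

The ranking only prioritises which decisions to inspect; whether a complainant in fact owned a famous mark would still need to be confirmed from the decision text.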

CONCLUSION
We proposed and assessed an automated method for discovering significant topics given only decision texts and case outcomes, building on prior work examining how topic models can be used to predict and explain case outcomes [1,25,66]. The task of legal topic discovery was formally defined and distinguished from related identification and analysis tasks. We developed and demonstrated pre-processing, topic modelling, regression and inference steps tailored to this task and its legal context. The method shows promise in its ability to discover archetypal case features and patterns consistent with the jurisprudence of the UDRP and ECHR datasets tested, and could generalise to other areas. It is however not perfect, and should be applied bearing the possibility of false positives and negatives in mind. There are two extensions we hope to pursue in future work. First, we hope to conduct more rigorous experiments and hyper-parameter search with BERTopic and its variations. Notably, BERTopic is a modular framework involving six steps that accept several different algorithms and (optional) parameters. Second, a more robust yet ideally less manual method for evaluating and interpreting topics could be developed.

Table 2
UDRP summary statistics by outcome. Mean values presented. Standard deviations in brackets. Raw word count includes all tokens in the text after removing only the "Decision" section. Processed word count includes only tokens remaining after lower-casing, stopword removal, and lemmatisation were further applied.

Table 4
LASSO logit regression results for UDRP cases. Given the number of panellists and topics input, we report medians and counts instead of individual estimates. Coefficients are direct estimates from glmnet and should not be interpreted cardinally.

Table 5
UDRP topics with smallest p-values across Setting C regressions 4.7, 4.10, and 4.13. Coefficients are scaled estimates from the LASSO and should only be interpreted ordinally within the same regression. Topics are synthesized from masked decision texts that exclude "Discussion and Findings" and later sections, and should not be interpreted as capturing what the panels found.

Table 6
LASSO logit results for ECHR cases. λs are separately derived per model following [61].