LeAK – Automatic anonymisation of court decisions

Transparency and availability of court decisions are important not only for legal scholarship, but also for the public in general, for the whole legal-tech sector, and for the digitalisation of government institutions. However, according to estimates (see Coupette and Fleckner 2018; Sohn 2018), only a very small fraction of all court decisions in Germany is made public in digital form. One reason for this is the need to anonymise each document, as required by current privacy laws. In Germany, this process is carried out manually and is time-consuming: each anonymiser has to read the document and identify by hand the passages that pose a risk to the privacy of the people involved in the court decision.

In the projects LeAK and AnGer, we work together with German legal authorities to develop corpora that can be used to improve methods for the automatic anonymisation of court decisions, so that the process of making these documents available to the general public can be sped up considerably. To take data privacy concerns seriously, each document in our data is annotated from scratch by four different annotators, whose task is to identify which parts of the text need to be anonymised. The annotation records the type of information contained in the anonymised portion of the text and the privacy risk that would arise if that portion were not anonymised. After these four rounds of annotation, the text goes through two further rounds of adjudication, in which two new, independent annotators decide which parts should actually be anonymised in the final document. This process, in which at least six people have seen the same text, ensures that passages containing privacy-related information are anonymised.
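To illustrate one step of such a multi-annotator pipeline: the spans marked by the individual annotators have to be combined into candidate regions before adjudication. The sketch below is a minimal illustration under our own assumptions, not the project's actual tooling, and the helper name `merge_spans` is hypothetical. It takes the union of overlapping character spans, so a region survives if any single annotator flagged it, a deliberately recall-oriented choice for privacy-sensitive data.

```python
# Minimal sketch (hypothetical helper, not the project's tooling):
# merge the character spans marked by several annotators into candidate
# regions for adjudication. Overlapping or touching spans are unioned,
# so a region is kept if any single annotator flagged it.

def merge_spans(annotations):
    """annotations: list of (start, end) character offsets from all annotators."""
    merged = []
    for start, end in sorted(annotations):
        if merged and start <= merged[-1][1]:
            # overlaps or touches the previous region: extend it
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged

# Four annotators marked partially overlapping name mentions:
candidates = merge_spans([(10, 20), (15, 25), (40, 50), (10, 18)])
# candidates == [(10, 25), (40, 50)]
```

The adjudicators would then decide, for each merged region, whether it actually needs to be anonymised in the final document.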

We have so far anonymised and realistically pseudonymised a corpus of court decisions from two areas of law: transport law and landlord–tenant regulations. This took many months of work, given that every document was analysed by at least six annotators, and resulted in a gold standard of 570 documents (247 on landlord–tenant regulations and 323 on transport law), amounting to roughly 1.1 million tokens that can be processed computationally.

Using this dataset, we tested a combination of off-the-shelf and fine-tuned named-entity recognition (NER) tools to identify which model would yield the best automatic anonymisation. For fine-tuning and testing the first batch of models, we used the pseudonymised court decisions on landlord–tenant regulations; the results are shown in the table below. The dataset was split into training and test sets for fine-tuning, and all models were then evaluated on the test set. The scores are based on correctly identifying the full span of each annotation, because it does not help, for instance, to anonymise only part of a person's or a company's name.
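The full-span criterion can be made concrete with a small scoring sketch (hypothetical helper name, not the project's evaluation code): a predicted span counts as a true positive only if both of its boundaries match a gold annotation exactly, so anonymising only the surname inside a full name scores zero for that mention.

```python
# Sketch of strict full-span scoring (hypothetical helper): a predicted
# span is a true positive only if its (start, end) boundaries match a
# gold span exactly. Partial overlaps count as errors on both sides.

def span_f1(gold, predicted):
    gold, predicted = set(gold), set(predicted)
    tp = len(gold & predicted)
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

gold = [(0, 11), (30, 42)]   # e.g. two full names
pred = [(5, 11), (30, 42)]   # first name only partially covered
p, r, f = span_f1(gold, pred)
# p == 0.5, r == 0.5, f == 0.5: the partial match counts as a miss
```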

                                   Anonymised data points       Recall according to risk
                                   Precision  Recall  F1        High risk  Medium risk  Low risk
Off-the-shelf tools
  Standard-NER (Flair)             0.14       0.12    0.13      0.39       0.31         0.01
  Legal-NER (Flair)                0.26       0.16    0.19      0.42       0.28         0.05
  OpenRedact                       0.49       0.81    0.61      0.87       0.82         0.78
Fine-tuned models
  OpenNLP                          0.88       0.80    0.84      0.85       0.45         0.83
  Riedl & Padó (2018)              0.80       0.83    0.82      0.90       0.52         0.85
  GottBERT (Scheible et al. 2020)  0.80       0.90    0.84      0.96       0.80         0.89


Since GottBERT achieved the best recall, we focused our efforts on fine-tuning it on the original, non-pseudonymised data in both domains. This experiment was conducted in a secured environment to prevent any risk of data leaks. The results are shown in the next table: the model reached a recall of up to 0.98 on high-risk data points.

                               Anonymised data points       Recall according to risk
                               Precision  Recall  F1        High risk  Medium risk  Low risk
Landlord–tenant regulations    0.79       0.90    0.84      0.96       0.76         0.89
Transport law                  0.85       0.90    0.87      0.98       0.81         0.87


These results show great potential for automatic anonymisation, but a recall of 98% on high-risk data points still means that, in 100,000 documents, some 2,000 may present privacy issues. That is why we continue working to improve our anonymisation models and push recall even further, given the importance of the task and of preserving the privacy of the people involved.
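How a per-data-point recall translates into per-document risk depends on how many high-risk data points each document contains. A rough, back-of-the-envelope calculation (our own simplifying assumption that misses are independent, which real model errors are not) illustrates why even a high recall leaves residual risk:

```python
# Rough, back-of-the-envelope model (assumes independent misses, which
# real model errors are not): with per-data-point recall r and k
# high-risk data points in a document, the chance that the document
# leaks at least one of them is 1 - r**k.

def doc_leak_probability(recall, points_per_doc):
    return 1 - recall ** points_per_doc

# At 0.98 recall and, say, 5 high-risk points per document:
risk = doc_leak_probability(0.98, 5)
# ~0.096, i.e. roughly 1 in 10 such documents could still need review
```

Under this toy model, a human review pass over the model's output remains necessary; the model's value lies in drastically reducing the manual effort per document rather than eliminating it.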



Coupette, C. and Fleckner, A.M., 2018. Quantitative Rechtswissenschaft (Quantitative Legal Studies). JuristenZeitung (JZ), 73(8), pp. 379–389.

Sohn, G., 2018. „Wir haben die Werkzeuge, aber nicht genügend Daten": Gerichtsurteile müssen nicht veröffentlicht werden ("We have the tools, but not enough data": court decisions do not have to be published). Netzpiloten Magazine. www.netzpiloten.de/werkzeuge-daten-gerichtsurteile

Riedl, M. and Padó, S., 2018. A named entity recognition shootout for German. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pp. 120–125, Melbourne, Australia.

Scheible, R., Thomczyk, F., Tippmann, P., Jaravine, V. and Boeker, M., 2020. GottBERT: a pure German language model. CoRR, abs/2012.02110. https://arxiv.org/abs/2012.02110