Automatic Anonymisation and Pseudonymisation of Court Decisions
German courts are legally required to publish their verdicts, but according to estimates, only a small share (under 3%) of all court decisions is published each year. The main reason for this is the need for a manual, time-consuming anonymisation process. The transparency and availability of such documents not only give legal professionals and the general public access to a crucial source of information, but are also important for the legal-tech industry and the digitalisation of government institutions. Moreover, to the best of our knowledge, there is no high-quality German corpus available for training a fully automatic anonymisation model. Existing tools are still semi-automatic and hence only support the manual anonymisation process. In the two projects LeAK (2020–2022) and AnGer (2023–2025), we fill this gap (1) by creating high-quality annotated training data using verdicts from different law domains, and (2) by developing a prototype for a fully automatic anonymisation pipeline.
AnGer


- Start date: 01.01.2023
- End date: 31.12.2025
- Funding source: Bundesministerium für Bildung und Forschung (BMBF)
Motivated by the promising results of our prior work in LeAK, the main objectives of the AnGer project are to extend and further develop our automatic anonymisation system and to create a new high-quality annotated dataset of court decisions from higher regional courts (Oberlandesgericht, OLG). Moreover, the model’s generalisation across different law domains remains a challenge, which we address in this project. We are working on different data augmentation techniques, as well as on robust neural networks, to improve the robustness of our current prototype. Data augmentation, in particular, can help generate more training samples for domains that lack training data. In line with our findings in LeAK, domain adaptation is still required in certain domains. We thus create learning curves for the local-court (Amtsgericht, AG) and OLG datasets to visualise the learning quality during each training step and to analyse the amount of data needed for robust domain adaptation. We also work on an approach for continuous domain adaptation that does not require retraining on all data. Finally, we carry out legal-tech studies and conduct de-anonymisation experiments with the annotated gold standard.
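As an illustration of the data-size variant of such a learning curve, the following minimal sketch trains on growing prefixes of a training set and scores each model on a fixed development set. Everything in it is an assumption made for illustration: the token-level data format, the names `f1`, `train_lexicon` and `learning_curve`, and above all the toy lexicon "model", which merely memorises sensitive tokens and stands in for the neural tagger used in the project.

```python
from typing import Sequence

# Hypothetical toy format: (token, is_sensitive). Real annotations are spans with categories.
Example = tuple[str, bool]


def f1(gold: Sequence[bool], pred: Sequence[bool]) -> float:
    """Token-level F1 for the 'sensitive' class."""
    tp = sum(g and p for g, p in zip(gold, pred))
    fp = sum((not g) and p for g, p in zip(gold, pred))
    fn = sum(g and (not p) for g, p in zip(gold, pred))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def train_lexicon(train: Sequence[Example]) -> set[str]:
    """Toy stand-in for model training: memorise every token seen as sensitive."""
    return {token for token, sensitive in train if sensitive}


def learning_curve(
    train: Sequence[Example], dev: Sequence[Example], sizes: Sequence[int]
) -> list[tuple[int, float]]:
    """Train on the first n examples for each n in sizes and score on the dev set."""
    gold = [sensitive for _, sensitive in dev]
    points = []
    for n in sizes:
        lexicon = train_lexicon(train[:n])
        pred = [token in lexicon for token, _ in dev]
        points.append((n, f1(gold, pred)))
    return points


train = [("Anna", True), ("sued", False), ("Bernd", True),
         ("in", False), ("Erlangen", True), ("yesterday", False)]
dev = [("Anna", True), ("met", False), ("Bernd", True),
       ("in", False), ("Erlangen", True)]

for n, score in learning_curve(train, dev, sizes=[1, 3, 6]):
    print(f"{n} training examples: F1 = {score:.2f}")
```

Plotting the resulting points for each target domain shows where the curve flattens, which gives a rough estimate of how much annotated data that domain needs.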
LeAK
- Start date: 01.04.2020
- End date: 31.03.2022
- Funding source: Bayerisches Staatsministerium der Justiz (StMJ)
This project explored the feasibility of a fully automatic anonymisation system for German court decisions. One of our key contributions is a high-quality, manually annotated gold standard for verdicts from local courts (Amtsgericht, AG). To ensure the absence of privacy-related information, at least six people worked on each document: each text in our dataset was annotated independently by four student annotators, who identified text spans that need to be anonymised, assigned information categories (among others names, addresses, jobs, and dates), and assessed risk levels (high, medium, and low); the annotations were then adjudicated by two further annotators. We approach text anonymisation as a Named Entity Recognition (NER) task. An essential aspect of the project was the systematic evaluation of different approaches to automatic anonymisation, using existing NER taggers as well as several Large Language Models (LLMs) fine-tuned on our gold standard. We are also working on a multitask architecture to enhance the robustness of the anonymisation system. In the course of these experiments, we evaluated the transferability of our prototype on documents from other law domains, taken from higher regional courts (Oberlandesgericht, OLG). Preliminary results indicated the need for domain adaptation in order to achieve good performance on court decisions from other domains.
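Framing anonymisation as NER means that a tagger produces labelled character spans, and the anonymised text is obtained by replacing those spans. The sketch below shows only this replacement step. It is not the project's pipeline: the `Span` type, the `find_spans` helper (a string lookup standing in for a trained tagger), the `pseudonymise` function and the `[LABEL_n]` placeholder format are all hypothetical names chosen for illustration.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    start: int  # inclusive character offset
    end: int    # exclusive character offset
    label: str  # information category, e.g. "NAME" or "ADDRESS"


def find_spans(text: str, entities: dict[str, str]) -> list[Span]:
    """Hypothetical stand-in for an NER tagger: look up known surface strings."""
    spans = []
    for surface, label in entities.items():
        for match in re.finditer(re.escape(surface), text):
            spans.append(Span(match.start(), match.end(), label))
    return spans


def pseudonymise(text: str, spans: list[Span]) -> str:
    """Replace each span with a placeholder; identical mentions share one placeholder."""
    mapping: dict[tuple[str, str], str] = {}
    counters: dict[str, int] = {}
    pieces = []
    cursor = 0
    for span in sorted(spans, key=lambda s: s.start):
        if span.start < cursor:
            raise ValueError("overlapping spans are not supported in this sketch")
        key = (span.label, text[span.start:span.end])
        if key not in mapping:
            counters[span.label] = counters.get(span.label, 0) + 1
            mapping[key] = f"[{span.label}_{counters[span.label]}]"
        pieces.append(text[cursor:span.start])  # unchanged text before the span
        pieces.append(mapping[key])             # placeholder instead of the mention
        cursor = span.end
    pieces.append(text[cursor:])                # unchanged tail
    return "".join(pieces)


text = "Anna Muster sued Bernd Beispiel. Anna Muster lives in Erlangen."
entities = {"Anna Muster": "NAME", "Bernd Beispiel": "NAME", "Erlangen": "ADDRESS"}
print(pseudonymise(text, find_spans(text, entities)))
# prints: [NAME_1] sued [NAME_2]. [NAME_1] lives in [ADDRESS_1].
```

Reusing one placeholder per distinct mention keeps the anonymised decision readable, since the reader can still tell which party is which. A real system would additionally have to link variant mentions of the same person, which this sketch does not attempt.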