Methodological foundations of corpus research and digital humanities

Corpus research in linguistics, the digital humanities, and the social sciences relies on a wide range of statistical techniques and visualizations. A central goal of our research is to develop sound methodological foundations for corpus linguistics that address key problems, so that quantitative analyses are both reliable and meaningful.

Research activities

Quantitative methodology for literary stylometry (e-Humanities-Zentrum KALLIMACHOS)

Project funding

KALLIMACHOS Centre for Digital Humanities: corpus-linguistic approaches and statistical methodology (phase 1), linguistic complexity in literary stylometry (phase 2)
(10/2014 – 09/2019)
Efficient simulation experiments for large-scale parameter optimisation of machine learning approaches in natural language processing
(10/2016 – 09/2017)

Key publications

Evert, Stefan; Proisl, Thomas; Jannidis, Fotis; Reger, Isabella; Pielström, Steffen; Schöch, Christof; Vitt, Thorsten (2017). Understanding and explaining Delta measures for authorship attribution. Digital Scholarship in the Humanities 32(suppl_2), ii4–ii16.
Evert, Stefan and Neumann, Stella (2017). The impact of translation direction on characteristics of translated texts. A multivariate analysis for English and German. In G. De Sutter, M.-A. Lefer, and I. Delaere (eds.), Empirical Translation Studies. New Theoretical and Methodological Traditions (TiLSM 300), pages 47–80. Mouton de Gruyter, Berlin.
☞ online supplement
Evert, Stefan; Wankerl, Sebastian; Nöth, Elmar (2017). Reliable measures of syntactic and lexical complexity: The case of Iris Murdoch. In Proceedings of the Corpus Linguistics 2017 Conference, Birmingham, UK.
Evert, Stefan and Arppe, Antti (2015). Some theoretical and experimental observations on naïve discriminative learning. In Proceedings of the 6th Conference on Quantitative Investigations in Theoretical Linguistics (QITL-6), Tübingen, Germany.
Baroni, Marco and Evert, Stefan (2007). Words and echoes: Assessing and mitigating the non-randomness problem in word frequency distribution modeling. In Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics (ACL 2007), pages 904–911, Prague, Czech Republic.
Evert, Stefan (2006). How random is a corpus? The library metaphor. Zeitschrift für Anglistik und Amerikanistik 54(2), 177–190.

2025

Frenken, F., Evert, S., Schneider, G., & Neumann, S. (2025). How stable are multivariate findings about register variation across varieties of English? On the replicability of Geometric Multivariate Analysis. ICAME Journal, 49(1), 23–45. https://doi.org/10.2478/icame-2025-0003

2017

Evert, S., & Neumann, S. (2017). The impact of translation direction on characteristics of translated texts. A multivariate analysis for English and German. In G. De Sutter, M.-A. Lefer, & I. Delaere (Eds.), Empirical Translation Studies. New Theoretical and Methodological Traditions (pp. 47–80). Berlin: Mouton de Gruyter.
Evert, S., Wankerl, S., & Nöth, E. (2017). Reliable measures of syntactic and lexical complexity: The case of Iris Murdoch. Paper presented at the Corpus Linguistics 2017 Conference, Birmingham, UK.

2015

Evert, S., & Arppe, A. (2015). Some theoretical and experimental observations on naïve discriminative learning. In Proceedings of the 6th Conference on Quantitative Investigations in Theoretical Linguistics (QITL-6). Tübingen, Germany.
Evert, S., Proisl, T., Jannidis, F., Pielström, S., Schöch, C., & Vitt, T. (2015). Towards a better understanding of Burrows's Delta in literary authorship attribution. In Proceedings of the Fourth Workshop on Computational Linguistics for Literature (pp. 79–88). Denver, CO.

2014

Diwersy, S., Evert, S., & Neumann, S. (2014). A weakly supervised multivariate approach to the study of language variation. In B. Szmrecsanyi & B. Wälchli (Eds.), Aggregating Dialectology, Typology, and Register Analysis. Linguistic Variation in Text and Speech (pp. 174–204). Berlin, Boston: De Gruyter.

2007

Baroni, M., & Evert, S. (2007). Words and Echoes: Assessing and Mitigating the Non-Randomness Problem in Word Frequency Distribution Modeling. In Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics (pp. 904–911). Prague, Czech Republic.

2006

Evert, S. (2006). How Random is a Corpus? The Library Metaphor. Zeitschrift für Anglistik und Amerikanistik, 54(2), 177–190.

Events

Open-source course on Statistical Inference – A Gentle Introduction for (Computational) Linguists (LinC 2018, Birmingham 2016, MaLT 2015, Zürich 2010, EMA 2008, DGfS/CL 2007, …)
Tutorial / course on Type-Token Distributions & Zipf’s Law (LREC 2018, Birmingham 2018, ESSLLI 2006)