Talk: Mathew Gillings (20.12.2023)

As part of the Oberseminar Computerlinguistik (the computational linguistics research seminar), a talk will take place on 20.12.2023, and we cordially invite you to attend.



Mathew Gillings (Wirtschaftsuniversität Wien)



Wednesday, 13.12.2023, 16:15-17:45



Bismarckstr. 12, room 0.320 (in person) / also via Zoom (the link will be circulated via the university's internal mailing lists; external participants are welcome to register at info@linguistik.uni-erlangen.de)



“Human vs. machine: a methodological triangulation”



Exploring discourse (and discursive topics) through linguistic analysis has been of interest not only to linguists, but also to researchers working across the social sciences. Traditionally, this has been conducted through small-scale interpretive analyses of discourse, involving some form of close reading. Naturally, however, close reading is only possible when the dataset is small, and it leaves the analyst open to accusations of bias, cherry-picking and a lack of representativeness. Other methods have emerged, each with some form of quantitative component, which are designed to avoid these issues and to handle larger datasets. Within linguistics, this has typically meant corpus-assisted methods, whilst outside of linguistics, topic modelling is one of the most widely used methods. Where corpus linguistics and topic modelling differ, though, is in the degree of contextualisation available to the researcher. Topic modelling algorithms reduce texts to a simple bag-of-words and completely strip texts of their linguistic structure and context, presenting only a list of co-occurring words to the researcher for analysis. Corpus-assisted methods such as concordance analysis, on the other hand, allow the user to see words within their co-text (typically a few words on either side). Corpus-assisted methods, then, sit somewhere between the completely decontextualised topic modelling and the completely contextualised close reading.
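To make the contrast in contextualisation concrete, the following is a minimal illustrative sketch in Python. It is not the tooling used in the study: the function names, the tokeniser and the sample sentence are all assumptions invented for this example. It shows a bag-of-words view, which keeps only word counts, next to a simple keyword-in-context (concordance) view, which keeps a few words on either side of each hit.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into simple word tokens (a deliberately crude tokeniser)."""
    return re.findall(r"[a-z']+", text.lower())


def bag_of_words(text):
    """Return word counts only; word order and co-text are discarded."""
    return Counter(tokenize(text))


def concordance(text, keyword, window=3):
    """Return (left co-text, keyword, right co-text) for every hit of keyword."""
    tokens = tokenize(text)
    target = keyword.lower()
    lines = []
    for i, token in enumerate(tokens):
        if token == target:
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            lines.append((left, token, right))
    return lines


# Hypothetical sample text, invented for illustration only.
sample = (
    "We reduce emissions across our supply chain. "
    "Lower emissions protect local communities."
)

counts = bag_of_words(sample)
hits = concordance(sample, "emissions", window=2)

print(counts.most_common(3))
for left, keyword, right in hits:
    print(f"{left:>15} [{keyword}] {right}")
```

In the bag-of-words view, the analyst learns only that "emissions" occurs twice; in the concordance view, each occurrence appears with its neighbouring words. Note that this simple window ignores sentence boundaries, which dedicated concordancers typically let the user control.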

This talk reports on a study assessing the effect that analytical method has on the interpretation of texts, specifically in relation to the identification of the main topics. Using a corpus of corporate sustainability reports, totalling 98,277 words, we asked six different researchers to interrogate the corpus and decide on its main ‘topics’ via three different methods. In Method A, two researchers were asked to view a topic model output and assign topic labels based purely on eyeballing the co-occurring words. In Method B, two researchers were asked to assign topic labels based on a concordance analysis of 100 randomised lines of each co-occurring word. In Method C, two researchers were asked to reverse-engineer a topic model output by creating topic labels based on a close reading. The talk explores how the identified topics differed both between researchers in the same condition and between researchers in different conditions. We conclude with a series of tentative observations regarding the benefits and limitations of each method, and with recommendations for researchers deciding which analytical technique to use.