Talk: Philipp Heinrich & Stefan Evert (postponed to 17.02.2021)

The talk scheduled for 27.01.2021 in the Oberseminar CL unfortunately has to be postponed at short notice to 17.02.2021:

Philipp Heinrich & Stefan Evert
(Lehrstuhl für Korpus- und Computerlinguistik, FAU Erlangen-Nürnberg)

News from the Corpus Workbench (CWB):
Embedding CWB in a CL Workflow | Finite State Queries

16:15–17:45, via Zoom (the link remains the same; new external registrations via info@linguistik.uni-erlangen.de)

The talk will be given in English.

Abstract

Many powerful corpus query engines – notably the IMS Open Corpus Workbench (CWB), the (No)Sketch Engine, and several other tools inspired by them – offer a query language based on generalised regular expressions (formulated over complex token descriptions rather than individual characters). This enables researchers to locate lexico-grammatical patterns of interest and collect corpus instances in a concordance. Many applications of corpus linguistics – notably corpus-based discourse analysis and computational lexicography – are furthermore in need of collocations or word sketches, as well as dispersion and keyword analyses (based on metadata annotation included in the corpus).

The first part of the talk gives a practical introduction to cwb-ccc, an open-source Python package that translates CWB query results into pandas dataframes and then performs collocation analyses for different contexts. It also offers keyword analysis for subcorpora defined by metadata constraints.
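To make the idea of a collocation analysis concrete, the following is a minimal, self-contained sketch in plain Python. It does not use cwb-ccc or its API; the function names `collocates` and `mutual_information`, the toy token list, and the choice of a simple mutual-information score are all assumptions made for illustration only.

```python
from collections import Counter
import math


def collocates(tokens, node, window=2):
    """Count words within `window` tokens to the left or right of each `node` hit.

    Simplification (assumption): a token that falls into two overlapping
    windows is counted twice.
    """
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok != node:
            continue
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                counts[tokens[j]] += 1
    return counts


def mutual_information(tokens, node, window=2):
    """Score each collocate by log2(observed / expected).

    The expected frequency assumes the collocate is spread evenly over
    the corpus: corpus frequency * context size / corpus size.
    """
    freq = Counter(tokens)
    cooc = collocates(tokens, node, window)
    context_size = sum(cooc.values())
    n = len(tokens)
    scores = {}
    for word, observed in cooc.items():
        expected = freq[word] * context_size / n
        scores[word] = math.log2(observed / expected)
    return scores


# Toy "corpus" (hypothetical data, for illustration only).
tokens = "the strong tea and the strong coffee were served with weak tea".split()
cooc = collocates(tokens, "strong", window=1)
scores = mutual_information(tokens, "strong", window=1)

for word, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{word}\t{cooc[word]}\t{score:.3f}")
```

In this toy example "coffee" scores higher than "tea", because "tea" also occurs outside the context of "strong". A real analysis would run over query matches from a corpus and would typically use a more robust association measure.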

The second part of the talk gives the first publicly available introduction to the CWB implementation of corpus queries by non-deterministic simulation of finite-state automata. It also addresses pitfalls and limitations of finite-state queries, in particular certain corner cases that may not be evaluated correctly.
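As a rough illustration of what non-deterministic simulation of a finite-state automaton over tokens means, here is a minimal Python sketch. It is not the CWB implementation: the automaton encoding, the names `pos`, `match_from`, and `find_all`, the toy sentence, and the longest-match behaviour are all assumptions chosen for the example. The automaton corresponds to a pattern of the form: optional determiner, any number of adjectives, then a noun.

```python
def pos(tag):
    """Build a token predicate that tests the part-of-speech attribute."""
    return lambda token: token["pos"] == tag


# Hand-built automaton: state -> list of (predicate, target state).
# State 0 is the start state, state 2 is the only final state.
NFA = {
    0: [(pos("DET"), 1), (pos("ADJ"), 1), (pos("NOUN"), 2)],
    1: [(pos("ADJ"), 1), (pos("NOUN"), 2)],
    2: [],
}
START, FINAL = 0, {2}


def match_from(tokens, start, nfa, initial, final):
    """Return the exclusive end index of the longest match at `start`, or None.

    The simulation tracks the set of all states the automaton could be in
    after each token, instead of following a single path and backtracking.
    """
    active = {initial}
    best = None
    for i in range(start, len(tokens)):
        active = {
            target
            for state in active
            for predicate, target in nfa[state]
            if predicate(tokens[i])
        }
        if not active:
            break  # no path can continue, so stop scanning
        if active & final:
            best = i + 1  # remember the longest accepting prefix so far
    return best


def find_all(tokens, nfa, initial, final):
    """Try every corpus position as a potential match start."""
    matches = []
    for start in range(len(tokens)):
        end = match_from(tokens, start, nfa, initial, final)
        if end is not None:
            matches.append((start, end))
    return matches


# Toy annotated sentence (hypothetical data).
sentence = [
    {"word": "the", "pos": "DET"},
    {"word": "old", "pos": "ADJ"},
    {"word": "dog", "pos": "NOUN"},
    {"word": "barked", "pos": "VERB"},
    {"word": "loudly", "pos": "ADV"},
]
matches = find_all(sentence, NFA, START, FINAL)
for start, end in matches:
    print(" ".join(t["word"] for t in sentence[start:end]))
```

Note that this sketch reports a match for every start position, including matches nested inside a longer one ("old dog" and "dog" inside "the old dog"). Decisions of this kind, such as which of several overlapping or alternative matches to report, are one place where corner cases can arise in finite-state queries.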