Software & Data

Python packages

  • SoMaJo – A tokenizer and sentence splitter for German and English web and social media texts.
  • SoMeWeTa – A part-of-speech tagger with support for domain adaptation and external resources.
  • pandas-association-measures – Statistical Association Measures for co-occurrence dataframes in pandas.
  • cwb-ccc – A CWB wrapper to extract concordances and collocates.
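
As a rough illustration of what a statistical association measure computes for a co-occurrence pair, the sketch below derives pointwise mutual information from raw frequencies using only the Python standard library. It does not use or reproduce the pandas-association-measures API; the function names and the counts are hypothetical and chosen only for the example.

```python
import math


def expected_frequency(f1: int, f2: int, n: int) -> float:
    """Expected co-occurrence frequency if the two words were independent."""
    return f1 * f2 / n


def pointwise_mutual_information(f: int, f1: int, f2: int, n: int) -> float:
    """Base-2 logarithm of observed over expected co-occurrence frequency."""
    return math.log2(f / expected_frequency(f1, f2, n))


# Hypothetical counts: the pair co-occurs 20 times, the two words occur
# 100 and 200 times, and the corpus contains 100,000 tokens.
score = pointwise_mutual_information(f=20, f1=100, f2=200, n=100_000)
print(round(score, 3))
```

A positive score means the pair co-occurs more often than independence would predict; a score near zero means the observed frequency is close to the expected one.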

Corpora

  • GeRedE – A corpus of German Reddit exchanges.
  • EmpiriST 2.0 – A manually annotated corpus consisting of German web pages and German computer-mediated communication (CMC).