Teaching — LION Lab

Courses

Natural Language Processing

Lecture · B.Sc. Digital Humanities, B.Sc. Computer Science · Summer Semester

This course introduces students to the field of Natural Language Processing. We start with classic NLP tasks, then cover prerequisites to language models such as preprocessing and tokenisation. We move on to transformers and large language models, and finally cover topics from computational linguistics and their application to LLMs.

Foundations of Machine Learning

Lecture · B.Sc. Digital Humanities, B.Sc. Computer Science · Winter Semester

For a given task and measure of success, a computer program learns when its performance improves with experience. This course introduces machine learning as a guided search through a space of potential hypotheses. Students gain a broad overview of learning paradigms — including linear regression, decision trees, support vector machines, Bayesian learning, and neural networks — and understand the mathematical foundations that determine discrimination power and learning complexity.

Current Topics in Natural Language Processing

Seminar · M.Sc. Digital Humanities, M.Sc. Computer Science, M.Sc. Data Science · Irregular

This seminar covers a different topic from current NLP research each time it is offered. Students each present a paper, and at the end of the semester write up a project proposal for a new research project building on the current state of the topic. The most recent edition focused on Massively Multilingual Language Models.

Open Thesis Positions

We offer B.Sc. and M.Sc. thesis topics to students at Leipzig University in Computer Science, Digital Humanities, or Data Science. Potential topics are detailed below and updated regularly, but we also welcome topic suggestions that fit our general research areas. We generally require that students have successfully completed at least our Natural Language Processing and Machine Learning courses, or equivalents if their bachelor's degree is not from Leipzig University.

To apply, email us with [THESIS] in the subject line, including: the topic you'd like to work on (with a short explanation if it's not from the list below), why you find it interesting, prior NLP coursework and practical experience, full transcripts, and your CV.

B.Sc. / M.Sc.

Expanding and Improving Universal Dependencies for German

Universal Dependencies (UD) is a framework for consistent annotation of grammar across human languages. The largest UD treebank of any language is UD-HDT, created via automatic conversion from the Hamburg Dependency Treebank and not actively maintained since 2017. A thesis could integrate additional pre-conversion data and fix known errors and inconsistencies. An M.Sc. thesis would additionally train UD parsers before and after the fixes and evaluate them on German treebanks to demonstrate the impact of the changes.
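
As a rough illustration of the before/after comparison, the sketch below computes unlabelled and labelled attachment scores (UAS, LAS) from two CoNLL-U strings. It assumes gold and predicted output share the same tokenisation, and it compares the full relation label including subtypes, which is a simplification. The function names and the toy sentence are made up for this example; real experiments would rely on established UD evaluation tooling.

```python
def read_arcs(conllu_text):
    """Return one (head, deprel) pair per syntactic word in a CoNLL-U string."""
    arcs = []
    for line in conllu_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and sentence-level comments
        cols = line.split("\t")
        # Skip multiword token ranges (e.g. "1-2") and empty nodes (e.g. "3.1").
        if "-" in cols[0] or "." in cols[0]:
            continue
        arcs.append((cols[6], cols[7]))  # HEAD and DEPREL columns
    return arcs


def attachment_scores(gold_text, pred_text):
    """Return (UAS, LAS), assuming identical tokenisation in both inputs."""
    gold = read_arcs(gold_text)
    pred = read_arcs(pred_text)
    if not gold or len(gold) != len(pred):
        raise ValueError("inputs must be non-empty and share tokenisation")
    uas = sum(g[0] == p[0] for g, p in zip(gold, pred)) / len(gold)
    las = sum(g == p for g, p in zip(gold, pred)) / len(gold)
    return uas, las


def make_row(idx, form, head, deprel):
    """Build a 10-column CoNLL-U row, leaving unused columns as '_'."""
    return "\t".join([str(idx), form, "_", "_", "_", "_", str(head), deprel, "_", "_"])


# Toy sentence, invented for illustration only.
GOLD = "\n".join([
    make_row(1, "Der", 2, "det"),
    make_row(2, "Hund", 3, "nsubj"),
    make_row(3, "schläft", 0, "root"),
    make_row(4, ".", 3, "punct"),
])
PRED = "\n".join([
    make_row(1, "Der", 2, "det"),
    make_row(2, "Hund", 3, "obj"),    # right head, wrong label
    make_row(3, "schläft", 0, "root"),
    make_row(4, ".", 2, "punct"),     # wrong head
])

uas, las = attachment_scores(GOLD, PRED)
print(f"UAS={uas:.2f} LAS={las:.2f}")
```

Running the same scoring on parser output trained on the original and on the corrected treebank would give the two numbers to compare.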

B.Sc. / M.Sc.

Adopt a UD Treebank

Most UD treebanks have not been actively maintained since creation, accumulating validation errors that limit their usefulness. If you read (natively or with reasonable comprehension) any of the languages in the UD validation report, a thesis could adopt that treebank, fix errors and warnings, and expand feature coverage or cross-treebank consistency. An M.Sc. thesis would validate improvements by training and comparing parsers before and after.

B.Sc. / M.Sc.

Unsupervised Discovery of Unaccusative and Unergative Verbs

Within the broader category of intransitive verbs, unaccusative (e.g., You fall) and unergative (e.g., You resign) verbs differ in their syntactic behaviour, which relates to the semantic role of the subject: patient-like for unaccusative verbs and agentive for unergative verbs. These syntactic behaviours are not trivial to detect because in standard sentences, the distinction is only visible through the semantics of the subject. Previous work has made a first attempt at unsupervised discovery of both categories, and the thesis would attempt to improve on this with more modern methods, or possibly extend it to further languages where unaccusative and unergative verbs might have different properties.

B.Sc. / M.Sc.

Complex Noun Compound Benchmark

Noun compounds provide an interesting test case for LLMs' understanding of complex noun semantics, as well as generalisation to novel compounds. Previous work has investigated this for compounds of two nouns. Compounds of more than two nouns are semantically far more complex and remain unexplored as a test case for LLMs. We might, for example, ask: is a child camel jockey slave a type of (child|camel|jockey|slave)? Data for this could, for example, come from the recently released Compound Branching Resource, or could be automatically collected from corpora and then hand-annotated. The thesis would lead to the creation of a (possibly multilingual) benchmark of LLMs' understanding of complex noun compounds and their internal structure.
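
To make the example question concrete, here is a minimal sketch of how such probe items might be generated from a space-separated English compound. The function name and item fields are hypothetical, and the gold answers are deliberately left empty, since deciding them is exactly the hand-annotation step described above.

```python
def build_probe_items(compound):
    """Create one yes/no probe per constituent noun of a multi-noun compound.

    Gold answers are left as None: they would have to come from
    hand annotation, not from this script.
    """
    nouns = compound.split()
    if len(nouns) < 3:
        raise ValueError("expected a compound of more than two nouns")
    items = []
    for noun in nouns:
        # Pick a plausible indefinite article based on the first letter.
        article = "an" if noun[0].lower() in "aeiou" else "a"
        items.append({
            "compound": compound,
            "constituent": noun,
            "question": f"Is a {compound} a type of {noun}?"
                        if article == "a"
                        else f"Is a {compound} a type of {noun}?",
            "gold": None,  # to be filled in by annotators
        })
    return items


items = build_probe_items("child camel jockey slave")
for item in items:
    print(item["question"])
```

Each compound of n nouns yields n questions, so the model's pattern of answers across constituents can be compared against the annotated internal structure.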