lacoco-lab

Linguistic Interpretability for Neural Models of Language

Software Project

Course Description:

For this course, you will work with one or two other students to propose and implement a software project focused on linguistic interpretability for neural-network-based models of language. This course runs in the summer semester, i.e. April-July 2025.

You will select an interpretability method and use it to analyze how one or more neural models represent and/or process formal linguistic categories (cf. examples below). You can use a pre-trained model, trained on language modeling or some other objective, or, for more complicated methods or tasks, you can train your own smaller model. The model can be trained on any natural language. Many of the technical aspects are flexible and open to discussion, e.g. at the project proposal phase. The key requirement is that your final project analyzes linguistically relevant categories and how they are realized in a given neural network model.

For a more comprehensive technical background, I highly recommend taking the seminar Interpreting and Analyzing Neural Language Models in parallel with this project.

Instructors: Kate McCurdy

Prerequisites: You should be comfortable working with neural networks and various machine learning techniques. You should also have sufficient linguistic background to understand and evaluate the categories of interest. Look through the lists of example methods and tasks — if you would not be capable of implementing one of the listed methods, or evaluating one of the listed tasks, this is probably not the course for you.

Registration: If you want to take part, please send an email to kmccurdy ( at sign ) lst.uni-saarland.de by Friday, April ~~11~~ 18 (i.e. after the kick-off meeting on April 10; feel free to come and ask questions there). N.B. The registration deadline has been extended by one week to accommodate the lack of a CMS page!

In your email, please:

Give your name, semester, and study program
Tell me why you want to take part in this course
Describe your previous experience:
- in deep learning or machine learning in general
- in linguistics and/or natural language processing in particular
Will you also take the concurrent seminar on interpretability? If not, please provide supporting evidence of your familiarity and/or capability with modern interpretability methods.

Project Timeline

Kick-off meeting, in-person
- date Thursday April 10 at 14h, C7 4 Aquarium
- Kick-off meeting slides
April: Develop Project Proposal
- ~~May 1~~ May 5: Submit 3 page project proposal (20% of your grade)
May: Explore Interpretability Methods
- Progress presentation + check-in meeting, remote
  - date ~~TBD~~ May ~~19-23~~ 22
June: Analyze Linguistic Categories
- Progress presentation + check-in meeting, remote
  - date ~~TBD~~ June ~~16-20~~ 13
July: Prepare Final Report
- Additional check-in July 1
- July 31: Submit final project report (6-8 pages; 50% of your grade) + code (30% of your grade)
Final meeting, in person
- date ~~TBD~~ July 14 (Monday), 10 am, C7 4 Aquarium

Example project components

Example methods

Note that some methods can be used on large language models, while others are more suitable for smaller models. Select a method appropriate for the task.

Probing
Disentangling Labels
Logit Lens
Sparse Autoencoders e.g. Gemma Scope
Activation Patching e.g. Patchscope
Replacement Model + Attribution Graph e.g. Circuit Tracing
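
To give a rough sense of the simplest method on this list, here is a minimal sketch of a linear probe. It is only an illustration under strong assumptions: the "activations" are synthetic vectors standing in for hidden states extracted from a real model, the two classes (labelled "noun" and "verb" here) are hypothetical, and all function names are invented for this example. The second probe, trained on shuffled labels, is a simple sanity check that the first probe's accuracy reflects structure in the representations rather than the probe's own capacity.

```python
import numpy as np


def make_toy_activations(n, dim, rng, direction):
    """Synthetic stand-in for hidden states (assumption: real projects
    would extract these from a model). The class label shifts each
    vector along a single direction; everything else is Gaussian noise."""
    labels = rng.integers(0, 2, size=n)      # 0 = "noun", 1 = "verb" (hypothetical)
    signs = 2.0 * labels - 1.0               # map {0, 1} -> {-1, +1}
    noise = rng.normal(size=(n, dim))
    return noise + 2.0 * signs[:, None] * direction, labels


def train_linear_probe(x, y, lr=0.1, epochs=200):
    """Fit a logistic-regression probe with full-batch gradient descent."""
    w = np.zeros(x.shape[1])
    b = 0.0
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(-(x @ w + b)))   # predicted P(label = 1)
        grad = p - y                              # gradient of the log loss w.r.t. the logits
        w -= lr * (x.T @ grad) / len(y)
        b -= lr * grad.mean()
    return w, b


def accuracy(w, b, x, y):
    """Fraction of examples whose predicted class matches the label."""
    preds = ((x @ w + b) > 0).astype(int)
    return float((preds == y).mean())


rng = np.random.default_rng(0)
dim = 16
direction = rng.normal(size=dim)
direction /= np.linalg.norm(direction)           # unit vector encoding the category

x_train, y_train = make_toy_activations(2000, dim, rng, direction)
x_test, y_test = make_toy_activations(500, dim, rng, direction)

# Probe trained on the true labels.
w, b = train_linear_probe(x_train, y_train)
probe_acc = accuracy(w, b, x_test, y_test)

# Sanity check: same activations, but training labels are shuffled,
# so the probe has nothing systematic to learn.
y_shuffled = rng.permutation(y_train)
w_c, b_c = train_linear_probe(x_train, y_shuffled)
control_acc = accuracy(w_c, b_c, x_test, y_test)

print(f"probe accuracy:   {probe_acc:.3f}")
print(f"control accuracy: {control_acc:.3f}")
```

In an actual project, `x_train` and `x_test` would be replaced by activations from a chosen layer of the model under study, and the labels by annotations for the linguistic category of interest. The gap between probe and control accuracy is usually more informative than the probe accuracy alone.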

Example categories

Parts of speech (e.g. nouns, verbs, …)
Morphemes (e.g. morphological segmentation)
Syllables (e.g. in poetic rhyme)
Prosodic constituents
Syntactic heads (e.g. governing subject-verb agreement)
Semantic heads (e.g. Semantic Role Labeling, Universal Dependencies, …)

Case Studies

Research works with a linguistic interpretability focus.

Tenney et al., 2019: BERT Rediscovers the Classical NLP Pipeline
- Linguistic category: multiple - PoS, dep. heads, semantic roles, …
- Method + model: probe + BERT
Dankers et al., 2021: Generalising to German Plural Noun Classes, from the Perspective of a Recurrent Neural Network
- Linguistic category: German plural classes
- Method + model: probe + LSTM
Chen et al., 2024: Sudden Drops in the Loss: Syntax Acquisition, Phase Transitions, and Simplicity Bias in MLMs
- Linguistic category: syntactic dependencies
- Method + model: probe + BERT at different stages in training
Lindsey et al., 2025: Tracing circuits - Planning in Poems
- Method + model: attribution graphs via local replacement model + Claude
Brinkmann et al., 2025: Large Language Models Share Representations of Latent Grammatical Concepts Across Typologically Diverse Languages
- Method + model: SAEs + Llama