lacoco-lab

Aligning Language Models with Human Preferences: Methods and Challenges

Course Description: We will look into the rapidly developing field of aligning language models with human preferences, a central ingredient in today’s LLMs. In the narrow sense, this refers to the finetuning process by which language models , originally trained to predict the next token, are turned into chatbots and other systems that can meaningfully interact with humans. Here, technical ideas such as Reinforcement Learning from Human Feedback are relevant. In a broader sense, this refers to research on how we can ensure LLMs behave in ways that humans desire, e.g. follow social and ethical norms, and are robust to malevolent adversarial prompting.

We will read a diverse set of recent technical papers from this highly dynamic field.

Prerequisites: Many of our readings will be quite technical. You will need a good background in NLP or machine learning in order to thrive in this course.

Registration:

If you are an LST / CoLi student, and want to take this class, you should directly register in the Course Management System (CMS). You may either be directly admitted or waitlisted.
If you are a Computer Science student, you should initially register via the Computer Science department seminar registration system. If you want to take the seminar but were not selected by the assignment system, please apply for the waiting list by emailing mhahn@lst.uni-saarland.de. Only register in Course Management System (CMS) once you were selected by the assignment system or otherwise admitted by us.
In both cases, please email mhahn@lst.uni-saarland.de your top-3 preferences among the items in the syllabus, and a brief explanation why you want to take this course and feel prepared for it. If you want, you are welcome to additionally mention any other topic that you would like to present. If you suggest something interesting, that may boost your chances of being admitted.

Course Management System: CMS

Instructors: Michael Hahn

Time: Tue 12:15–13:45

Room: Building C7.3, Seminarraum 1.12

Format and requirements

This is a seminar course. Starting from the fourth week, one or two students will present in each unit (except for the June 11 session). Every student will present exactly once. We expect all students to read the readings every week. Every student submits one question about the readings by Monday noon.

Preliminary Syllabus

Note: The syllabus is subject to change, both the selection of topics and their order. You are welcome to suggest other topics or papers.

In each session, two students will together present two papers (in the “Readings” column) on a common topic.

Date	Topic	Readings	Slides	Optional Material	Presenter
2024-04-16	no class				Michael
2024-04-23	Introduction to (L)LMs		slides		Michael
2024-04-30	no class
2024-05-07	Reinforcement learning background	PPO: Schulman et al 2017			Robert
		RLHF: Ouyang et al 2022, NeurIPS			Aleksandra
2024-05-14	Further developments	DPO: Rafailov et al, NeurIPS		AlpacaFarm: 2023, NeurIPS	Anthony
		Fine-Grained Human Feedback Gives Better Rewards for Language Model Training 2023, NeurIPS			Ruveyda
2024-05-21	Further developments	constitutional AI Bai et al 2022, arXiV			Nadia
		self-alignment Sun et al 2023, NeurIPS			Viet Anh
2024-05-28	Alignment in Vision and Language Models	InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning 2023, NeurIPS			Nellia
		Cheap and Quick: Efficient Vision-Language Instruction Tuning for Large Language Models 2023, NeurIPS			Monseej
2024-06-04	Alignment beyond language	Language Is Not All You Need: Aligning Perception with Language Models 2023, NeurIPS			Yana
		Language Models Meet World Models: Embodied Experiences Enhance Language Models 2023, NeurIPS			Xin
2024-06-11	Project Ideas			Everyone
2024-06-18	Limitations of Alignment	Failures of Safety Training Wei et al 2023, NeurIPS		Sleeper Agents: arXiV 2024 , Transferable attacks on aligned LMs arXiV 2023	Lucille
		Open Problems and Fundamental Limitations of Reinforcement Learning from Human Feedback 2023, arXiV		Theoretical limitations of Alignment: arXiv 2023	Mark
2024-06-25	Truthfulness I	Language Models Don’t Always Say What They Think: Unfaithful Explanations in Chain-of-Thought Prompting 2023, NeurIPS		arXiV 2023	Sarah
		Honesty as best policy: NeurIPS 2023		TODO	Ritika Basavaraj
2024-07-02	Truthfulness II	Sycophancy arXiV 2023			Nicholas
		Inference-Time Intervention: Eliciting Truthful Answers from a Language Model 2023, NeurIPS		Lie detection arXiV 2023	Yash
2024-07-09	Measuring Alignment	MoCa: Measuring Human-Language Model Alignment on Causal and Moral Judgment Tasks 2023, NeurIPS		ETHICS dataset ICLR 2021	Tyler
		Whose opinions do LMs reflect? ICML 2023			Yan
2024-07-16	LMs modeling humans	In-Context Impersonation Reveals Large Language Models’ Strengths and Biases 2023, NeurIPS		Language Models as Agent Models TODO	Qian
		LMs helping align humans : NeurIPS 2022			Anna

Evaluation

For students taking the seminar for 4 credits:

Presentation: 60%
Questions about readings: 40%

For students taking the seminar for 7 credits:

Presentation: 30%
Questions about readings: 20%
Final paper: 50%

Questions

Please register on the forum on CMS.

Starting from the fourth week, every student submits one question about the readings by Monday noon. Questions are graded on a 3-point scale (0: no question submitted, 1: superficial question, 2: insightful question).

Presentations

We expect that presentations will cover the key points from the readings, such as the main evidence for and against the key claims under consideration in the paper.

We do not expect that presentations will cover all details of the papers. Rather, you should focus on big picture findings and conclusions, and are not expected to include every finding from the paper in your presentation. For instance, instead of a table of numbers, highlight key results. When there are multiple similar results in the paper, synthesize them. If the papers have many studies, you might select a representative subset to explain the paper’s conclusions. On the other hand, if the assigned papers primarily discuss/review other work (as is the case in some weeks), draw on material from the work cited to provide richer content and even details where useful.

Make sure to motivate the papers’ research question(s). Give background on key concepts, and convey to the audience your understanding of why certain research decisions were made.

Select what you consider the key points; you are not expected to cover every part of the paper exhaustively. Include details only to the extent that you believe them to be important.

Critically engage with the reading: contribute your own opinion on the key findings, and on the paper’s motivation and arguments. In what ways do or don’t you agree with arguments made by the authors?

As you’ll be presenting in teams of two, don’t just present the two papers separately, but make sure to also draw connections and compare. Aim for 40-60 minutes of presentation, allowing 30-40 minutes of discussion. Generating and moderating in-class discussion is a key component of your presentation – thinking about what will be interesting to your audience will thus be important. Discussion should happen not just after the presentation, but you should engage the audience and create ample opportunity for discussion during your presentation. Before the presentation, take a look at the questions that have been posted in the forum and refer to these as needed. These may be useful for getting discussion started. Conversely, when attending other students’ talks, reciprocate by participating actively in the discussion.

Final Papers (for the 7CP version)

Note: We will discuss this in the first meeting. Requirements may be changed based on popular demand.

Term papers will be about a small independent project.

You will investigate a question about LLMs’ alignment with humans. For instance, you might test their robustness to malign prompts, or probe their moral views.

Option 1: You may develop your own question and prompts. In this case, you will be expected to design at least 25 (not more than 50) prompts.

Option 2: You may draw on a larger existing benchmark. In this case, you will be expected to find some new angle on the benchmark, e.g., by tweaking the stimuli or by evaluating the LLM’s behavior in a different way.

This list is not exhaustive: you may also draw on other approaches, not necessarily based on prompting.

The report is expected to contain a brief literature review, motivation of your question, a description of your prompts, and evaluation of the LLM’s behavior. The report is expected to include quantitative evaluation of the LLM’s behavior (e.g., using measures such as accuracy). Additionally including qualitative evaluation can also be beneficial.

The report should have 8 pages of main report, plus unlimited appendix, in the NeurIPS style format. The main report should be self-contained, but you can use the appendix to report prompts, further analyses, or other material.

The report should be uploaded via CMS. The due date is October 13, 2024, 23:59.

Everyone is expected to report on their project idea in the June 12, 2024, session, and to participate in discussion to give feedback to other students’ ideas. Students may prepare a short slide deck on their idea. This will not be graded; the June 12 session is intended to help improve and finetune project ideas.

Contact

Please contact Michael (mhahn@lst.uni-saarland.de) for any questions.

Accommodations

If you need any accommodations due to a disability or chronic illness, please either contact Michael at mhahn@lst.uni-saarland.de or the Equal Opportunities and Diversity Management Unit of the university.