Post · bonfire.cafe

Post

⚠️ 𝗖𝗮𝗻 𝗳𝗶𝗻𝗲-𝘁𝘂𝗻𝗶𝗻𝗴 𝗹𝗲𝗮𝗱 𝘁𝗼 𝘂𝗻𝘄𝗮𝗻𝘁𝗲𝗱 𝗯𝗲𝗵𝗮𝘃𝗶𝗼𝘂𝗿𝘀 𝗶𝗻 𝗹𝗮𝗻𝗴𝘂𝗮𝗴𝗲 𝗺𝗼𝗱𝗲𝗹𝘀?

A recent paper in Nature suggests that even small amounts of targeted fine-tuning data can trigger unexpected and problematic behaviour that generalises beyond the original pre-training task.

(1/🧵)

UKP Lab

@UKPLab@sigmoid.social · last month

In a new briefing by the Science Media Center Germany, Prof. Dr. Iryna Gurevych (Ubiquitous Knowledge Processing Lab, Technische Universität Darmstadt) notes that the study’s methodology is well aligned with its claims: It extends earlier work by the same lab showing that fine-tuning can lead to broader misalignment.

(2/🧵 )

UKP Lab

@UKPLab@sigmoid.social · last month

Most strikingly, she emphasises that just a few examples can cause far-reaching behavioural shifts in LLMs, potentially affecting current models as well. For practitioners, the takeaway is clear: careful training data curation and thorough testing after fine-tuning are essential.

(3/🧵 )

UKP Lab

@UKPLab@sigmoid.social · last month

The briefing also features perspectives from:
👤 Prof. Dr. Hinrich Schütze, Ludwig-Maximilians-Universität München (LMU)
👤 Prof. Dr. Dorothea Kolossa, Technische Universität Berlin
👤 Dr. Paul Röttger, Oxford Internet Institute
👤 Dr. Jonas Geiping, Max Planck Institute for Intelligent Systems

📄 Read the full German briefing here:
https://sciencemediacenter.de/angebote/sprachmodelle-entwickeln-unerwuenschte-verhaltensweisen-26006

🧾 Nature paper:
https://www.nature.com/articles/s41586-025-09937-5

(4/4)

#AI #NLP #NLProc #LLM #AIResearch #ResponsibleAI #UKPLab

Science Media Center Germany

Sprachmodelle entwickeln unerwünschte Verhaltensweisen

Studie: Chatbots übertragen erlerntes schädliches Verhalten auf alle Anfragen; emergent. Ursachen unklar, bestimmtes Training könnte bösartige Anteile verstärken.

bonfire.cafe

A space for Bonfire maintainers and contributors to communicate

bonfire.cafe: About · Code of conduct · Privacy · Users · Instances

Bonfire social · 1.0.2-alpha.34 no JS en

Automatic federation enabled