Discussion
Loading...

Post

Log in
  • About
  • Code of conduct
  • Privacy
  • Users
  • Instances
  • About Bonfire
UKP Lab
UKP Lab
@UKPLab@sigmoid.social  ·  activity timestamp 2 weeks ago

⚠️ 𝗖𝗮𝗻 𝗳𝗶𝗻𝗲-𝘁𝘂𝗻𝗶𝗻𝗴 𝗹𝗲𝗮𝗱 𝘁𝗼 𝘂𝗻𝘄𝗮𝗻𝘁𝗲𝗱 𝗯𝗲𝗵𝗮𝘃𝗶𝗼𝘂𝗿𝘀 𝗶𝗻 𝗹𝗮𝗻𝗴𝘂𝗮𝗴𝗲 𝗺𝗼𝗱𝗲𝗹𝘀?

A recent paper in Nature suggests that even small amounts of targeted fine-tuning data can trigger unexpected and problematic behaviour that generalises beyond the original pre-training task.

(1/🧵)

Graphic combining logos and a portrait. At the top left is the Science Media Center Germany logo in a geometric gold network style. Below appears the Ubiquitous Knowledge Processing logo. On the right is a circular portrait of a woman with shoulder-length curly hair and glasses, wearing a patterned blouse and looking toward the camera. A small photo credit is visible in the corner.
Graphic combining logos and a portrait. At the top left is the Science Media Center Germany logo in a geometric gold network style. Below appears the Ubiquitous Knowledge Processing logo. On the right is a circular portrait of a woman with shoulder-length curly hair and glasses, wearing a patterned blouse and looking toward the camera. A small photo credit is visible in the corner.
Graphic combining logos and a portrait. At the top left is the Science Media Center Germany logo in a geometric gold network style. Below appears the Ubiquitous Knowledge Processing logo. On the right is a circular portrait of a woman with shoulder-length curly hair and glasses, wearing a patterned blouse and looking toward the camera. A small photo credit is visible in the corner.
  • Copy link
  • Flag this post
  • Block
UKP Lab
UKP Lab
@UKPLab@sigmoid.social replied  ·  activity timestamp 2 weeks ago

In a new briefing by the Science Media Center Germany, Prof. Dr. Iryna Gurevych (Ubiquitous Knowledge Processing Lab, Technische Universität Darmstadt) notes that the study’s methodology is well aligned with its claims: It extends earlier work by the same lab showing that fine-tuning can lead to broader misalignment.

(2/🧵 )

  • Copy link
  • Flag this comment
  • Block
UKP Lab
UKP Lab
@UKPLab@sigmoid.social replied  ·  activity timestamp 2 weeks ago

Most strikingly, she emphasises that just a few examples can cause far-reaching behavioural shifts in LLMs, potentially affecting current models as well. For practitioners, the takeaway is clear: careful training data curation and thorough testing after fine-tuning are essential.

(3/🧵 )

  • Copy link
  • Flag this comment
  • Block
UKP Lab
UKP Lab
@UKPLab@sigmoid.social replied  ·  activity timestamp 2 weeks ago

The briefing also features perspectives from:
👤 Prof. Dr. Hinrich Schütze, Ludwig-Maximilians-Universität München (LMU)
👤 Prof. Dr. Dorothea Kolossa, Technische Universität Berlin
👤 Dr. Paul Röttger, Oxford Internet Institute
👤 Dr. Jonas Geiping, Max Planck Institute for Intelligent Systems

📄 Read the full German briefing here:
https://sciencemediacenter.de/angebote/sprachmodelle-entwickeln-unerwuenschte-verhaltensweisen-26006

🧾 Nature paper:
https://www.nature.com/articles/s41586-025-09937-5

(4/4)

#AI #NLP #NLProc #LLM #AIResearch #ResponsibleAI #UKPLab

Science Media Center Germany

Sprachmodelle entwickeln unerwünschte Verhaltensweisen

Studie: Chatbots übertragen erlerntes schädliches Verhalten auf alle Anfragen; emergent. Ursachen unklar, bestimmtes Training könnte bösartige Anteile verstärken.
  • Copy link
  • Flag this comment
  • Block

bonfire.cafe

A space for Bonfire maintainers and contributors to communicate

bonfire.cafe: About · Code of conduct · Privacy · Users · Instances
Bonfire social · 1.0.2-alpha.7 no JS en
Automatic federation enabled
Log in
  • Explore
  • About
  • Members
  • Code of Conduct