UKP Lab
@UKPLab@sigmoid.social · 2 days ago

๐—ž๐—ฒ๐˜† ๐˜๐—ฎ๐—ธ๐—ฒ๐—ฎ๐˜„๐—ฎ๐˜†๐˜€:
โšก ๐Ÿฑ๐Ÿณ% ๐—ฎ๐˜๐˜๐—ฎ๐—ฐ๐—ธ ๐˜€๐˜‚๐—ฐ๐—ฐ๐—ฒ๐˜€๐˜€ ๐—ฟ๐—ฎ๐˜๐—ฒ: Outperforms SOTA attacks across GPT-4o, LLama, Gemma, and Phi models
๐Ÿง  ๐—ฆ๐—บ๐—ฎ๐—ฟ๐˜๐—ฒ๐—ฟ โ‰  ๐—ฆ๐—ฎ๐—ณ๐—ฒ๐—ฟ: Larger, more capable models are MORE vulnerable to contrastive reasoning attacks
๐Ÿšจ ๐——๐—ฒ๐—ณ๐—ฒ๐—ป๐˜€๐—ฒ ๐—ด๐—ฎ๐—ฝ ๐—ฒ๐˜…๐—ฝ๐—ผ๐˜€๐—ฒ๐—ฑ: Current safety measures can't detect subtle, logic-driven jailbreaks
โœ… ๐—ฆ๐—ผ๐—น๐˜‚๐˜๐—ถ๐—ผ๐—ป ๐—ฒ๐˜…๐—ถ๐˜€๐˜๐˜€: Our Chain-of-Thought defenses reduce attack success by 95%

(2/🧵)

UKP Lab
@UKPLab@sigmoid.social replied · 2 days ago

📜 Paper → https://arxiv.org/pdf/2501.01872
🌐 Project → https://ukplab.github.io/emnlp2025-poate-attack/
💾 Code + data → https://github.com/UKPLab/emnlp2025-poate-attack

Consider following the authors, Rachneet Sachdeva, Rima Hazra, and Iryna Gurevych (UKP Lab/TU Darmstadt), if you would like more information or want to exchange ideas.

(3/3)

#NLProc #LLMSafety #AIsecurity #Jailbreak #LLM

GitHub

GitHub - UKPLab/emnlp2025-poate-attack: Code associated with "Turning Logic Against Itself: Probing Model Defenses Through Contrastive Questions".

Turning Logic Against Itself: Probing Model Defenses Through Contrastive Questions

POATE achieves a 44% attack success rate on major LLMs by harnessing contrastive reasoning to provoke unethical responses.
https://arxiv.org/pdf/2501.01872