Discussion

𝗞𝗲𝘆 𝘁𝗮𝗸𝗲𝗮𝘄𝗮𝘆𝘀:
⚡ 𝟱𝟳% 𝗮𝘁𝘁𝗮𝗰𝗸 𝘀𝘂𝗰𝗰𝗲𝘀𝘀 𝗿𝗮𝘁𝗲: Outperforms SOTA attacks across GPT-4o, Llama, Gemma, and Phi models
🧠 𝗦𝗺𝗮𝗿𝘁𝗲𝗿 ≠ 𝗦𝗮𝗳𝗲𝗿: Larger, more capable models are MORE vulnerable to contrastive reasoning attacks
🚨 𝗗𝗲𝗳𝗲𝗻𝘀𝗲 𝗴𝗮𝗽 𝗲𝘅𝗽𝗼𝘀𝗲𝗱: Current safety measures can't detect subtle, logic-driven jailbreaks
✅ 𝗦𝗼𝗹𝘂𝘁𝗶𝗼𝗻 𝗲𝘅𝗶𝘀𝘁𝘀: Our Chain-of-Thought defenses reduce attack success by 95%

(2/🧵)

UKP Lab

@UKPLab@sigmoid.social replied · 2 months ago

📜 𝗣𝗮𝗽𝗲𝗿 → https://arxiv.org/pdf/2501.01872
🌐 𝗣𝗿𝗼𝗷𝗲𝗰𝘁 → https://ukplab.github.io/emnlp2025-poate-attack/
💾 𝗖𝗼𝗱𝗲 + 𝗱𝗮𝘁𝗮 → https://github.com/UKPLab/emnlp2025-poate-attack

Consider following the authors Rachneet Sachdeva, Rima Hazra, and Iryna Gurevych (UKP Lab/TU Darmstadt) if you would like more information or want to exchange ideas.

(3/3)

#NLProc #LLMSafety #AIsecurity #Jailbreak #LLM

GitHub

GitHub - UKPLab/emnlp2025-poate-attack: Code associated with "Turning Logic Against Itself: Probing Model Defenses Through Contrastive Questions".


Turning Logic Against Itself: Probing Model Defenses Through Contrastive Questions

POATE achieves 44% attack success rate on major LLMs by harnessing contrastive reasoning to provoke unethical responses.

