UKP Lab
@UKPLab@sigmoid.social · 3 weeks ago

Tired of your AI agent making many errors?
➡️ We’ve got the solution!

Meet SEEED 🌱, a framework for Automated Error Discovery in conversational AI.

🧩 SEEED detects both known and previously unseen error types, and even generates definitions for newly discovered ones.

⚙️ By combining lightweight encoders with a novel sampling strategy for contrastive learning, it improves representation learning and uncovers coherent error categories.

(1/🧵)

Diagram illustrating a chatbot correction process. A human says, “I really like indie music! Do you have a favorite artist?” The chatbot replies, “I’m a huge fan of indie music too! The Beatles are my absolute favorite!” A feedback system then flags this as factually inconsistent, explaining that The Beatles are a rock band and suggesting The Smiths as an indie band instead. The corrected response becomes, “I’m a huge fan of indie music too! The Smiths are my absolute favorite!” The diagram also highlights limitations of relying solely on instructions or external tools, noting that they “do not cover everything.”
UKP Lab
@UKPLab@sigmoid.social replied · 3 weeks ago

📊 SEEED outperforms #GPT-4o and #Phi-4 by up to +8 pp across multiple datasets.

📄 Paper: https://www.arxiv.org/abs/2509.10833
💻 Code: https://github.com/UKPLab/emnlp2025-automatic-error-discovery
🔗 Project: https://ukplab.github.io/emnlp2025-automatic-error-discovery/

Be sure to follow the authors: Dominic Petrak, Thy Thy Tran, and Iryna Gurevych from Ubiquitous Knowledge Processing (UKP) Lab/Technische Universität Darmstadt.

See you at #EMNLP in Suzhou!

(2/2)

#NLProc #ConversationalAI #Agents #EMNLP2025

UKP Lab
@UKPLab@sigmoid.social · 3 weeks ago

Key takeaways:
⚡ 57% attack success rate: Outperforms SOTA attacks across GPT-4o, Llama, Gemma, and Phi models
🧠 Smarter ≠ Safer: Larger, more capable models are MORE vulnerable to contrastive reasoning attacks
🚨 Defense gap exposed: Current safety measures can't detect subtle, logic-driven jailbreaks
✅ Solution exists: Our Chain-of-Thought defenses reduce attack success by 95%

(2/🧵)

UKP Lab
@UKPLab@sigmoid.social replied · 3 weeks ago

📜 Paper → https://arxiv.org/pdf/2501.01872
🌐 Project → https://ukplab.github.io/emnlp2025-poate-attack/
💾 Code + data → https://github.com/UKPLab/emnlp2025-poate-attack

And consider following the authors Rachneet Sachdeva, Rima Hazra, and Iryna Gurevych (UKP Lab/TU Darmstadt) if you are interested in more information or an exchange of ideas.

(3/3)

#NLProc #LLMSafety #AIsecurity #Jailbreak #LLM

GitHub

GitHub - UKPLab/emnlp2025-poate-attack: Code associated with "Turning Logic Against Itself : Probing Model Defenses Through Contrastive Questions".

Turning Logic Against Itself: Probing Model Defenses Through Contrastive Questions

POATE achieves 44% attack success rate on major LLMs by harnessing contrastive reasoning to provoke unethical responses.
https://arxiv.org/pdf/2501.01872
UKP Lab
@UKPLab@sigmoid.social · 3 months ago

9️⃣ Leaky Thoughts: Large Reasoning Models Are Not Private Thinkers
Tommaso Green, Martin Gubri, Haritz Puerto, Sangdoo Yun, Seong Joon Oh

🔟 Identifying Aspects in Peer Reviews
Sheng Lu, Ilia Kuznetsov, Iryna Gurevych

UKP Lab
@UKPLab@sigmoid.social replied · 3 months ago

👏 Congratulations to all authors and collaborators for their excellent work! We are looking forward to presenting these results at EMNLP 2025 in #Suzhou this November.

Stay tuned for more details!

#NLProc #MachineLearning #UKPLab #Research #EMNLP2025

UKP Lab
@UKPLab@sigmoid.social · 4 months ago

Also consider following the authors Aniket Pramanick (Ubiquitous Knowledge Processing (UKP) Lab), Yufang Hou (IT:U Interdisciplinary Transformation University Austria, IBM Research), Saif M Mohammad (National Research Council Canada / Conseil national de recherches Canada), and Iryna Gurevych (Ubiquitous Knowledge Processing (UKP) Lab).

🗺️ See you at #ACL2025 in Vienna

(2/2)

#NLProc #ACL2025 #AI4Science
