Nick Byrd, Ph.D.
@ByrdNick@nerdculture.de · 5 days ago

How well can #AI infer the relationships between a pair of answers?

Are both plausible? (this AND that)
Is one more plausible? (this OR that)
Are both implausible? (NEITHER this NOR that)

LogicalCommonSenseQA benchmarks #LLMs on such #logic inference.
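The three relations above can be sketched as a simple mapping from the plausibility of each statement in a pair to the operator that governs it. This is an illustrative reconstruction, not the paper's code; the function name and label strings are assumptions.

```python
# A minimal sketch of how LOGICAL-COMMONSENSEQA-style pairs could be
# labeled, assuming each atomic statement is judged plausible or not.
# Names and labels here are illustrative, not from the paper's release.

def governing_relation(a_plausible: bool, b_plausible: bool) -> str:
    """Map the plausibility of two statements to the governing operator."""
    if a_plausible and b_plausible:
        return "AND"            # both statements are plausible
    if a_plausible or b_plausible:
        return "OR"             # exactly one statement is plausible
    return "NEITHER/NOR"        # both statements are implausible

# Example pair: "Ice is cold" (plausible) vs. "Ice is hot" (implausible)
print(governing_relation(True, False))  # OR
```

The MIXED condition the authors describe corresponds to the hard case where different pairs in one question fall under different operators, so the model must infer which relation applies rather than being told.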

https://doi.org/10.48550/arXiv.2601.16504

"(1) all models perform reasonably well on AND and moderately on OR, and (2) performance collapses on NEITHER/NOR in zero-and few-shot settings, even for the strongest LLMs. ...weaknesses are amplified in the MIXED condition, where different operators appear across options and models must implicitly infer the governing relation. F1 drops to 43–56% for all few-shot models, [but] fine-tuned models achieve 83–93% F1 across all operators, suggesting that the task is learnable with supervision...."

"While LLaMA-3.1-8B achieves 72.2% accuracy on CommonSenseQA, its performance drops sharply on Logical-CommonSenseQA: 72% on AND, 62.2% on OR, 42.7% on MIXED, and only 13.9% on NEITHER/NOR. This discrepancy shows that benchmarks like CommonSenseQA substantially overestimate models’ commonsense reasoning ability by failing to test relational and compositional plausibility judgments."
"(1) all models perform reasonably well on AND and moderately on OR, and (2) performance collapses on NEITHER/NOR in zero-and few-shot settings, even for the strongest LLMs. ...weaknesses are amplified in the MIXED condition, where different operators appear across options and models must implicitly infer the governing relation. F1 drops to 43–56% for all few-shot models, [but] fine-tuned models achieve 83–93% F1 across all operators, suggesting that the task is learnable with supervision...." "While LLaMA-3.1-8B achieves 72.2% accuracy on CommonSenseQA, its performance drops sharply on Logical-CommonSenseQA: 72% on AND, 62.2% on OR, 42.7% on MIXED, and only 13.9% on NEITHER/NOR. This discrepancy shows that benchmarks like CommonSenseQA substantially overestimate models’ commonsense reasoning ability by failing to test relational and compositional plausibility judgments."
"(1) all models perform reasonably well on AND and moderately on OR, and (2) performance collapses on NEITHER/NOR in zero-and few-shot settings, even for the strongest LLMs. ...weaknesses are amplified in the MIXED condition, where different operators appear across options and models must implicitly infer the governing relation. F1 drops to 43–56% for all few-shot models, [but] fine-tuned models achieve 83–93% F1 across all operators, suggesting that the task is learnable with supervision...." "While LLaMA-3.1-8B achieves 72.2% accuracy on CommonSenseQA, its performance drops sharply on Logical-CommonSenseQA: 72% on AND, 62.2% on OR, 42.7% on MIXED, and only 13.9% on NEITHER/NOR. This discrepancy shows that benchmarks like CommonSenseQA substantially overestimate models’ commonsense reasoning ability by failing to test relational and compositional plausibility judgments."
arXiv.org

LOGICAL-COMMONSENSEQA: A Benchmark for Logical Commonsense Reasoning

Commonsense reasoning often involves evaluating multiple plausible interpretations rather than selecting a single atomic answer, yet most benchmarks rely on single-label evaluation, obscuring whether statements are jointly plausible, mutually exclusive, or jointly implausible. We introduce LOGICAL-COMMONSENSEQA, a benchmark that re-frames commonsense reasoning as logical composition over pairs of atomic statements using plausibility-level operators (AND, OR, NEITHER/NOR). Evaluating instruction-tuned, reasoning-specialized, and fine-tuned models under zero-shot, few-shot, and chain-of-thought prompting, we find that while models perform reasonably on conjunctive and moderately on disjunctive reasoning, performance degrades sharply on negation-based questions. LOGICAL-COMMONSENSEQA exposes fundamental reasoning limitations and provides a controlled framework for advancing compositional commonsense reasoning.