"(1) all models perform reasonably well on AND and moderately on OR, and (2) performance collapses on NEITHER/NOR in zero- and few-shot settings, even for the strongest LLMs. ...weaknesses are amplified in the MIXED condition, where different operators appear across options and models must implicitly infer the governing relation. F1 drops to 43–56% for all few-shot models, [but] fine-tuned models achieve 83–93% F1 across all operators, suggesting that the task is learnable with supervision...."
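The per-operator F1 scores quoted above imply an evaluation that buckets items by their logical operator and scores each bucket separately. A minimal sketch of that protocol, using entirely hypothetical item records (the real dataset format is not shown in the excerpt):

```python
from collections import defaultdict

# Hypothetical per-item records: (operator, gold_label, predicted_label),
# with label 1 meaning "plausible". Operators follow the conditions named
# in the excerpt: AND, OR, NEITHER/NOR, MIXED.
records = [
    ("AND", 1, 1), ("AND", 0, 0), ("AND", 1, 0),
    ("OR", 1, 1), ("OR", 0, 1),
    ("NEITHER/NOR", 0, 1), ("NEITHER/NOR", 1, 0),
    ("MIXED", 1, 1), ("MIXED", 0, 1),
]

def binary_f1(pairs):
    """Binary F1 over (gold, pred) pairs, treating label 1 as positive."""
    tp = sum(1 for g, p in pairs if g == 1 and p == 1)
    fp = sum(1 for g, p in pairs if g == 0 and p == 1)
    fn = sum(1 for g, p in pairs if g == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Group predictions by operator, then score each group independently.
by_op = defaultdict(list)
for op, gold, pred in records:
    by_op[op].append((gold, pred))

for op, pairs in by_op.items():
    print(f"{op}: F1 = {binary_f1(pairs):.2f}")
```

This per-bucket scoring is what makes the collapse visible: an aggregate F1 over all items would average the strong AND performance against the near-zero NEITHER/NOR performance and hide the gap.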
"While LLaMA-3.1-8B achieves 72.2% accuracy on CommonSenseQA, its performance drops sharply on Logical-CommonSenseQA: 72% on AND, 62.2% on OR, 42.7% on MIXED, and only 13.9% on NEITHER/NOR. This discrepancy shows that benchmarks like CommonSenseQA substantially overestimate models’ commonsense reasoning ability by failing to test relational and compositional plausibility judgments."