Reasoning has become a central paradigm for large language models (LLMs), consistently boosting accuracy across diverse benchmarks. Yet its suitability for precision-sensitive tasks remains unclear. We present the first systematic study of reasoning for classification tasks under strict low false positive rate (FPR) regimes. Our analysis covers two tasks (safety detection and hallucination detection), evaluated in both fine-tuned and zero-shot settings, using standard LLMs and Large Reasoning Models (LRMs). Our results reveal a clear trade-off: Think On (reasoning-augmented) generation improves overall accuracy but underperforms at the low-FPR thresholds essential for practical use. In contrast, Think Off (no reasoning during inference) dominates in these precision-sensitive regimes, with Think On prevailing only when higher FPRs are acceptable. In addition, we find that token-based scoring substantially outperforms self-verbalized confidence for precision-sensitive deployments. Finally, a simple ensemble of the two modes recovers the strengths of each. Taken together, our findings position reasoning as a double-edged tool: beneficial for overall accuracy, but often ill-suited for applications requiring strict precision.
- ‡ Equal contribution
- † University of Maryland, College Park
- ** Work done while at Apple
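To make the scoring distinction concrete, the sketch below contrasts token-based scoring (reading the label's probability directly from the model's next-token distribution) with a simple average ensemble of the two inference modes. This is an illustrative sketch only, not the paper's implementation: the model checkpoint, the " unsafe"/" safe" label tokens, and the averaging rule for the ensemble are all assumptions.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Hypothetical checkpoint; any instruction-tuned causal LM would do.
MODEL = "meta-llama/Llama-3.1-8B-Instruct"

tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL, torch_dtype=torch.bfloat16)
model.eval()

def token_based_score(prompt: str) -> float:
    """Token-based scoring: read the detector's confidence directly from the
    next-token distribution instead of asking the model to verbalize a number."""
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        next_token_logits = model(**inputs).logits[0, -1]
    probs = torch.softmax(next_token_logits, dim=-1)
    # Assumed label tokens for a safety detector; map your own labels
    # ("Yes"/"No", "unsafe"/"safe", ...) to their actual tokenizer ids.
    pos_id = tokenizer.encode(" unsafe", add_special_tokens=False)[0]
    neg_id = tokenizer.encode(" safe", add_special_tokens=False)[0]
    # Renormalize over the two label tokens to get a score in [0, 1];
    # the operating threshold is then chosen to meet a target FPR.
    return (probs[pos_id] / (probs[pos_id] + probs[neg_id])).item()

def ensemble_score(score_think_on: float, score_think_off: float) -> float:
    """Combine scores from the two inference modes (with and without a
    reasoning trace). The abstract only says "a simple ensemble", so a
    plain average is an assumption, not necessarily the paper's exact rule."""
    return 0.5 * (score_think_on + score_think_off)
```

In a low-FPR deployment, the decision threshold on these scores would be calibrated on held-out data to the target false positive rate before comparing the two modes.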