@iScienceLuvr
New paper from OpenAI: Training LLMs for Honesty via Confessions

"In this work we propose a method for eliciting an honest expression of an LLM’s shortcomings via a self-reported confession."

"we train GPT-5-Thinking to produce confessions, and we evaluate its honesty in out-of-distribution scenarios measuring hallucination, instruction following, scheming, and reward hacking. We find that when the model lies or omits shortcomings in its “main” answer, it often confesses to these behaviors honestly, and this confession honesty modestly improves with training."