@omarsar0
Overview: The paper argues that common QA hallucination detectors look better than they are because their evaluations lean on ROUGE. Under human-aligned evaluation, many detectors drop sharply in performance, and simple response-length heuristics rival complex methods, which points to a core flaw in how these detectors are evaluated. https://t.co/StydKhOFqE
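To make the response-length point concrete, here is a rough sketch of what such a baseline could look like. It is not the paper's code: `length_score`, `auroc`, the whitespace tokenization, the assumption that longer answers are treated as more likely hallucinated, and the toy examples are all made up for illustration.

```python
from typing import List, Sequence, Tuple


def length_score(response: str) -> int:
    """Hypothetical baseline: score a response by its whitespace token count.

    Assumption: a higher score means "more likely hallucinated".
    """
    return len(response.split())


def auroc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """AUROC as the chance a random positive outscores a random negative.

    Ties count as half a win. Labels: 1 = hallucinated, 0 = not.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one example of each class")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy, invented examples (text, label). They say nothing about real detectors.
examples: List[Tuple[str, int]] = [
    ("Paris", 0),
    ("It was signed in 1919", 0),
    ("The treaty was probably signed around 1925 in Geneva by several nations", 1),
    ("Mount Everest", 0),
    ("I believe the tallest peak is K2, which stands in the Andes range", 1),
]

scores = [length_score(text) for text, _ in examples]
labels = [label for _, label in examples]
print(f"length-baseline AUROC on toy data: {auroc(scores, labels):.2f}")
```

Running the same `auroc` over a real detector's scores, once with ROUGE-derived labels and once with human-aligned labels, is one way to check how much of the reported gap over this baseline survives.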