@omarsar0
Final Words

Overall, overlap‑based and even several embedding metrics inflate detector performance by rewarding surface similarity and verbosity. The authors call for semantically aware, human‑aligned evaluation frameworks before claiming progress on hallucination detection. Hallucination detection remains a very hard problem for LLMs.

Paper: https://t.co/zirIpeB5nC
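
To make the surface-similarity point concrete, here is a toy sketch of a unigram-overlap F1 score. It is not the paper's evaluation setup: the `token_f1` helper and the example sentences are hypothetical, chosen only to show how an answer with a wrong fact can outscore a correct paraphrase.

```python
from collections import Counter


def token_f1(prediction: str, reference: str) -> float:
    """Unigram-overlap F1 on lowercased, whitespace-split tokens (illustrative only)."""
    pred = prediction.lower().split()
    ref = reference.lower().split()
    # Multiset intersection counts each shared token at most min(count) times.
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


# Hypothetical examples, not taken from the paper.
reference = "the eiffel tower is in paris"
hallucinated = "the eiffel tower is in rome"          # wrong fact, near-identical wording
paraphrase = "you can find it in france's capital"    # correct, but different wording

wrong_score = token_f1(hallucinated, reference)
right_score = token_f1(paraphrase, reference)

print(f"hallucinated answer: {wrong_score:.3f}")
print(f"correct paraphrase:  {right_score:.3f}")
```

The hallucinated answer shares five of six tokens with the reference, so it scores far higher than the correct paraphrase, which shares only one. An evaluation built on this kind of score would rank the wrong answer as the better one.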