@llama_index
Let's talk content faithfulness.

Four days ago, we launched ParseBench, the first document OCR benchmark for AI agents. Its most fundamental metric asks: did the parser capture all the text, in order, without making things up?

We grade three failure modes with 167K+ rule-based tests:
- Omissions (word, sentence, digit)
- Hallucinations
- Reading order violations

The bar has shifted from "good enough for a human to read" to "reliable enough for an agent to act on."

Deep dive in the video. Full write-up: https://t.co/2sq5ncGiel
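To make the three failure modes concrete, here is a rough Python sketch of what a rule-based faithfulness check could look like. It is not ParseBench's actual implementation: the function names `tokenize` and `check_faithfulness`, the word-level tokenization, and the adjacent-pair ordering rule are all assumptions chosen for illustration.

```python
import re
from collections import Counter


def tokenize(text):
    # Hypothetical tokenizer: lowercase word and digit runs only.
    return re.findall(r"\w+", text.lower())


def check_faithfulness(ground_truth, parsed):
    """Illustrative check for omissions, hallucinations, and reading order."""
    gt = tokenize(ground_truth)
    out = tokenize(parsed)
    gt_counts, out_counts = Counter(gt), Counter(out)

    # Omission: a ground-truth token that never appears in the parser output.
    omissions = [t for t in gt if t not in out_counts]

    # Hallucination: an output token that never appears in the ground truth.
    hallucinations = [t for t in out if t not in gt_counts]

    # Reading order: among tokens that occur exactly once on both sides,
    # flag adjacent ground-truth pairs that come out reversed in the output.
    shared = [t for t in gt if gt_counts[t] == 1 and out_counts[t] == 1]
    position = {t: i for i, t in enumerate(out)}
    order_violations = [
        (a, b) for a, b in zip(shared, shared[1:]) if position[a] > position[b]
    ]

    return {
        "omissions": omissions,
        "hallucinations": hallucinations,
        "order_violations": order_violations,
    }


if __name__ == "__main__":
    report = check_faithfulness(
        "Total due: 42 dollars by Friday",
        "by Friday total due dollars approximately",
    )
    print(report)
```

In this toy case the parser drops the digit `42`, invents `approximately`, and moves `by Friday` ahead of the amount, so each failure mode fires once. A real benchmark would need far more careful matching (repeated words, layout, tables); this sketch only shows the shape of the three checks.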