@emollick
Another example of a persistent problem with LLMs. They do very well on standard medical questions, but when the right answer is replaced with “none of the above” performance drops. More recent models generally have lower drops in performance. https://t.co/X1WiQlmzAQ https://t.co/OrVewtellP