@omarsar0
Stress Testing Large Reasoning Models

This looks like a more interesting way to evaluate large reasoning models.

Presents multiple reasoning problems in a single prompt to better represent real-world scenarios.

Which are the best models at this?

Here are my notes: https://t.co/I5wlnb0k3w