@dair_ai
Don't sleep on using "code-as-tool" with your AI agents. Here is a great example of how it applies to vision.

State-of-the-art vision models are surprisingly brittle. The default assumption is that models like GPT-4o and Gemini 2.5 Pro can robustly understand images: they score well on benchmarks and handle complex visual reasoning. But rotate an image 90 degrees and performance collapses.

The researchers ran a simple diagnostic: take 200 images, apply basic transformations like rotation or flipping, and ask models to identify what changed. Humans get 100% accuracy; GPT-5 and Gemini 2.5 Pro perform poorly. On OCRBench, simple rotations can reduce model performance by up to 80%.

This new research introduces CodeVision, a framework where models generate code as a universal interface to invoke any image operation. Instead of relying on a fixed set of predefined tools, the model writes Python code to call whatever transformations are needed.

Treating code as a tool unlocks three capabilities:

- Emergence of new tools the model was never trained on.
- Efficiency through chaining multiple operations in a single execution.
- Robustness from leveraging runtime error messages to revise and retry.

Training uses a two-stage approach. First, supervised fine-tuning on 5,000 examples covering multi-tool sequences, error handling, and coarse-to-fine localization. Second, reinforcement learning with a dense reward function that encourages strategic tool use while penalizing reward-hacking behaviors like exhaustively trying every rotation.

Results:

- CodeVision-7B achieves a 73.4 average score on transformed OCRBench, a +17.4 improvement over its base model.
- On MVToolBench, their new multi-tool benchmark, CodeVision-7B scores 60.1, nearly doubling Gemini 2.5 Pro's 32.6.
- The model learns to use tools like contrast enhancement, brightness adjustment, and edge detection that never appeared in its training data.
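The rotation/flip probe in the diagnostic is trivial to reproduce. A minimal sketch using nested lists as a stand-in for pixel arrays (the helper names are mine, not the paper's):

```python
def rotate90(image):
    """Rotate a row-major 'image' (list of rows) 90 degrees clockwise."""
    return [list(row) for row in zip(*image[::-1])]

def flip_horizontal(image):
    """Mirror each row left-to-right."""
    return [row[::-1] for row in image]

original = [[1, 2, 3],
            [4, 5, 6]]
rotated = rotate90(original)         # [[4, 1], [5, 2], [6, 3]]
flipped = flip_horizontal(original)  # [[3, 2, 1], [6, 5, 4]]
```

These are exactly the kind of lossless transformations a human spots instantly, which is what makes the reported accuracy collapse so striking.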
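The chaining and revise-and-retry capabilities can be sketched as a small execution loop. This is a hypothetical illustration, not the paper's actual API: in the real framework the traceback would go back to the model for self-correction, while here a `revise` callback stands in for that step:

```python
import traceback

def run_tool_code(code, env, max_retries=1, revise=None):
    """Execute model-generated 'tool code' in a scratch namespace.

    On an exception, the traceback is handed to a `revise` callback
    (a stand-in for the model's self-correction step) and the revised
    code is retried. Hypothetical sketch of the code-as-tool loop.
    """
    for _ in range(max_retries + 1):
        ns = dict(env)  # fresh namespace per attempt
        try:
            exec(code, ns)
            return ns.get("result")
        except Exception:
            if revise is None:
                raise
            code = revise(code, traceback.format_exc())
    return None

# Chaining two operations (rotate, then mirror) in a single execution:
toy_image = [[1, 2], [3, 4]]
chained = (
    "rot = [list(r) for r in zip(*image[::-1])]\n"  # rotate 90° clockwise
    "result = [row[::-1] for row in rot]\n"         # then mirror each row
)
out = run_tool_code(chained, {"image": toy_image})  # [[1, 3], [2, 4]]
```

Because the interface is "any Python the model can write," nothing limits it to a predefined tool list, which is how operations never seen in training can emerge.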
Vision models that seem robust on standard benchmarks can fail catastrophically under trivial real-world perturbations. Code-as-tool frameworks offer a path to genuine robustness by letting models compose arbitrary operations dynamically.

Bookmark it. Paper: https://t.co/BG2AgRUey3

Learn to build effective AI agents in our academy: https://t.co/zQXQt0PMbG