@dair_ai
Don't sleep on using "code-as-tool" with your AI agents. Here is a great example of how it applies to vision.

State-of-the-art vision models are surprisingly brittle. The default assumption is that models like GPT-4o and Gemini 2.5 Pro can robustly understand images: they score well on benchmarks and handle complex visual reasoning. But rotate an image 90 degrees and performance collapses.

The researchers ran a simple diagnostic: take 200 images, apply basic transformations like rotation or flipping, and ask models to identify what changed. Humans get 100% accuracy; GPT-5 and Gemini 2.5 Pro perform poorly. On OCRBench, simple rotations can reduce model performance by up to 80%.

This new research introduces CodeVision, a framework where models generate code as a universal interface to invoke any image operation. Instead of relying on a fixed set of predefined tools, the model writes Python code to call whatever transformations are needed.

Treating code as a tool unlocks three capabilities:

- Emergence of new tools the model was never trained on.
- Efficiency through chaining multiple operations in a single execution.
- Robustness from leveraging runtime error messages to revise and retry.

Training uses a two-stage approach. First, supervised fine-tuning on 5,000 examples covering multi-tool sequences, error handling, and coarse-to-fine localization. Second, reinforcement learning with a dense reward function that encourages strategic tool use while penalizing reward-hacking behaviors like exhaustively trying every rotation.

Results:

- CodeVision-7B achieves a 73.4 average score on transformed OCRBench, a +17.4 improvement over its base model.
- On MVToolBench, their new multi-tool benchmark, CodeVision-7B scores 60.1, nearly doubling Gemini 2.5 Pro's 32.6.
- The model learns to use tools like contrast enhancement, brightness adjustment, and edge detection that never appeared in its training data.
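The rotation/flip probe in the diagnostic is trivial to reproduce. A minimal sketch using nested lists as a stand-in for pixel arrays (the helper names are mine, not the paper's):

```python
def rotate90(image):
    """Rotate a row-major 'image' (list of rows) 90 degrees clockwise."""
    return [list(row) for row in zip(*image[::-1])]

def flip_horizontal(image):
    """Mirror each row left-to-right."""
    return [row[::-1] for row in image]

original = [[1, 2, 3],
            [4, 5, 6]]
rotated = rotate90(original)         # [[4, 1], [5, 2], [6, 3]]
flipped = flip_horizontal(original)  # [[3, 2, 1], [6, 5, 4]]
```

These are exactly the kind of lossless transformations a human spots instantly, which is what makes the reported accuracy collapse so striking.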
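The chaining and revise-and-retry capabilities can be sketched as a small execution loop. This is a hypothetical illustration, not the paper's actual API: in the real framework the traceback would go back to the model for self-correction, while here a `revise` callback stands in for that step:

```python
import traceback

def run_tool_code(code, env, max_retries=1, revise=None):
    """Execute model-generated 'tool code' in a scratch namespace.

    On an exception, the traceback is handed to a `revise` callback
    (a stand-in for the model's self-correction step) and the revised
    code is retried. Hypothetical sketch of the code-as-tool loop.
    """
    for _ in range(max_retries + 1):
        ns = dict(env)  # fresh namespace per attempt
        try:
            exec(code, ns)
            return ns.get("result")
        except Exception:
            if revise is None:
                raise
            code = revise(code, traceback.format_exc())
    return None

# Chaining two operations (rotate, then mirror) in a single execution:
toy_image = [[1, 2], [3, 4]]
chained = (
    "rot = [list(r) for r in zip(*image[::-1])]\n"  # rotate 90° clockwise
    "result = [row[::-1] for row in rot]\n"         # then mirror each row
)
out = run_tool_code(chained, {"image": toy_image})  # [[1, 3], [2, 4]]
```

Because the interface is "any Python the model can write," nothing limits it to a predefined tool list, which is how operations never seen in training can emerge.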
Vision models that seem robust on standard benchmarks can fail catastrophically under trivial real-world perturbations. Code-as-tool frameworks offer a path to genuine robustness by letting models compose arbitrary operations dynamically.

Bookmark it. Paper: https://t.co/BG2AgRUey3

Learn to build effective AI agents in our academy: https://t.co/zQXQt0PMbG