🐦 Twitter Post Details


@DrJimFan

The power of the Claw, in the palm of a robot hand. Agentic robotics is here! Today, we open-source CaP-X: vibe agents, alive in the physical world. They incarnate as robot arms and humanoids with a rich set of perception APIs, actuation APIs, and auto synthesize skill libraries as they go. CaP-X is a strict superset of our old stack, because policies like VLAs are “just” API calls as well. It solves many tasks zero-shot that a learned policy would struggle with.

And we are doing much more than vibing. CaP-X is our most systematic, scientific study on agentic robotics so far:

- We build a comprehensive agentic toolkit: perception (SAM3 segmentation, Molmo pointing, depth, point cloud), control (IK solvers, grasp planner, navigation), and visualization (EEF, mask overlays) that work across different robots.
- CaP-Gym: LLM’s first Physical Exam! 187 manipulation tasks across RoboSuite, LIBERO-PRO, and BEHAVIOR. Tabletop, bimanual, mobile manipulation. Sim and real. Can’t wait to see the gradients flow from CaP-Gym to the next wave of frontier LLM releases.
- CaP-Bench: we benchmark 12 frontier LLMs/VLMs (Gemini, GPT, Opus, Qwen, DeepSeek, Kimi, and more) across 8 evaluation tiers. We systematically vary API abstraction level, agentic harness, and visual grounding methods. Lots of insights in our paper.
- CaP-Agent0: a training-free agentic harness that matches or exceeds human expert code on 4 out of 7 tasks without task-specific tuning.
- CaP-RL: if you get a gym, you get RL ;). A 7B OSS model jumps from 20% to 72% success after only 50 training iterations. The synthesized programs transfer to real robots with minimal sim-to-real gap.

3 years ago, our team created Voyager, one of the earliest agentic AI that plays and learns in Minecraft continuously. Its key ideas — skill libraries, self-reflection loops, and in-context planning — have since influenced many modern agentic designs.

Today, the agent graduates from Minecraft and gets a real job. It’s April Fool’s, but this Claw is getting its hands dirty for real! Link in thread:
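
The post describes Voyager-style skill libraries carried over to robotics: an agent synthesizes a program from perception/actuation API calls, and programs that succeed are banked as reusable skills. The sketch below illustrates only that pattern; every name in it (`Skill`, `SkillLibrary`, `pick_and_place`, the substring-based retrieval) is a hypothetical assumption for illustration, not the actual CaP-X API.

```python
# Illustrative sketch of a skill library: successful synthesized programs are
# stored and retrieved for reuse. All identifiers here are hypothetical.
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Skill:
    name: str
    description: str
    program: Callable[..., bool]  # returns True when the task succeeds


@dataclass
class SkillLibrary:
    skills: Dict[str, Skill] = field(default_factory=dict)

    def add(self, skill: Skill) -> None:
        self.skills[skill.name] = skill

    def lookup(self, query: str) -> List[Skill]:
        # Toy retrieval via substring match; a real system would likely
        # use embedding similarity over skill descriptions.
        q = query.lower()
        return [s for s in self.skills.values()
                if q in s.name.lower() or q in s.description.lower()]


def pick_and_place(obj: str, target: str) -> bool:
    # Placeholder for a synthesized program that would chain perception calls
    # (segmentation, pointing) with actuation calls (IK, grasping).
    print(f"grasp {obj} -> place on {target}")
    return True


library = SkillLibrary()
if pick_and_place("mug", "shelf"):  # only bank programs that succeeded
    library.add(Skill("pick_and_place",
                      "pick up an object and place it on a target",
                      pick_and_place))

print([s.name for s in library.lookup("place")])  # → ['pick_and_place']
```

The key design choice this mirrors is that a skill, once banked, becomes just another API call available to the agent on the next task.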

📊 Media Metadata

{
  "media": [
    {
      "type": "video",
      "url": "https://crmoxkoizveukayfjuyo.supabase.co/storage/v1/object/public/media/posts/2039358115318243352/media_0.mp4",
      "filename": "media_0.mp4"
    }
  ],
  "processed_at": "2026-04-01T15:17:56.504104",
  "pipeline_version": "2.0"
}

🔧 Raw API Response

{
  "type": "tweet",
  "id": "2039358115318243352",
  "url": "https://x.com/DrJimFan/status/2039358115318243352",
  "twitterUrl": "https://twitter.com/DrJimFan/status/2039358115318243352",
  "text": "The power of the Claw, in the palm of a robot hand. Agentic robotics is here! Today, we open-source CaP-X: vibe agents, alive in the physical world. They incarnate as robot arms and humanoids with a rich set of perception APIs, actuation APIs, and auto synthesize skill libraries as they go. CaP-X is a strict superset of our old stack, because policies like VLAs are “just” API calls as well. It solves many tasks zero-shot that a learned policy would struggle with.\n\nAnd we are doing much more than vibing. CaP-X is our most systematic, scientific study on agentic robotics so far:\n\n- We build a comprehensive agentic toolkit: perception (SAM3 segmentation, Molmo pointing, depth, point cloud), control (IK solvers, grasp planner, navigation), and visualization (EEF, mask overlays) that work across different robots. \n- CaP-Gym: LLM’s first Physical Exam! 187 manipulation tasks across RoboSuite, LIBERO-PRO, and BEHAVIOR. Tabletop, bimanual, mobile manipulation. Sim and real. Can’t wait to see the gradients flow from CaP-Gym to the next wave of frontier LLM releases. \n- CaP-Bench: we benchmark 12 frontier LLMs/VLMs (Gemini, GPT, Opus, Qwen, DeepSeek, Kimi, and more) across 8 evaluation tiers. We systematically vary API abstraction level, agentic harness, and visual grounding methods. Lots of insights in our paper.\n- CaP-Agent0: a training-free agentic harness that matches or exceeds human expert code on 4 out of 7 tasks without task-specific tuning. \n- CaP-RL: if you get a gym, you get RL ;). A 7B OSS model jumps from 20% to 72% success after only 50 training iterations. The synthesized programs transfer to real robots with minimal sim-to-real gap.\n\n3 years ago, our team created Voyager, one of the earliest agentic AI that plays and learns in Minecraft continuously. 
Its key ideas — skill libraries, self-reflection loops, and in-context planning — have since influenced many modern agentic designs.\n\nToday, the agent graduates from Minecraft and gets a real job. It’s April Fool’s, but this Claw is getting its hands dirty for real! \n\nLink in thread:",
  "source": "Twitter for iPhone",
  "retweetCount": 8,
  "replyCount": 7,
  "likeCount": 35,
  "quoteCount": 2,
  "viewCount": 2007,
  "createdAt": "Wed Apr 01 15:03:58 +0000 2026",
  "lang": "en",
  "bookmarkCount": 16,
  "isReply": false,
  "inReplyToId": null,
  "conversationId": "2039358115318243352",
  "displayTextRange": [
    0,
    279
  ],
  "inReplyToUserId": null,
  "inReplyToUsername": null,
  "author": {
    "type": "user",
    "userName": "DrJimFan",
    "url": "https://x.com/DrJimFan",
    "twitterUrl": "https://twitter.com/DrJimFan",
    "id": "1007413134",
    "name": "Jim Fan",
    "isVerified": false,
    "isBlueVerified": true,
    "verifiedType": null,
    "profilePicture": "https://pbs.twimg.com/profile_images/1554922493101559808/SYSZhbcd_normal.jpg",
    "coverPicture": "https://pbs.twimg.com/profile_banners/1007413134/1672408318",
    "description": "",
    "location": "Views my own. Contact →",
    "followers": 379118,
    "following": 3122,
    "status": "",
    "canDm": true,
    "canMediaTag": true,
    "createdAt": "Wed Dec 12 22:11:27 +0000 2012",
    "entities": {
      "description": {
        "urls": []
      },
      "url": {}
    },
    "fastFollowersCount": 0,
    "favouritesCount": 8736,
    "hasCustomTimelines": true,
    "isTranslator": false,
    "mediaCount": 847,
    "statusesCount": 4118,
    "withheldInCountries": [],
    "affiliatesHighlightedLabel": {},
    "possiblySensitive": false,
    "pinnedTweetIds": [
      "2018754323141054786"
    ],
    "profile_bio": {
      "description": "NVIDIA Director of Robotics & Distinguished Scientist. Co-Lead of GEAR lab. Solving Physical AGI, one motor at a time. Stanford Ph.D. OpenAI's 1st intern.",
      "entities": {
        "description": {
          "hashtags": [],
          "symbols": [],
          "urls": [],
          "user_mentions": []
        },
        "url": {
          "urls": [
            {
              "display_url": "jimfan.me",
              "expanded_url": "https://jimfan.me",
              "indices": [
                0,
                23
              ],
              "url": "https://t.co/H4rXo4Ei8X"
            }
          ]
        }
      }
    },
    "isAutomated": false,
    "automatedBy": null
  },
  "extendedEntities": {
    "media": [
      {
        "additional_media_info": {
          "monetizable": true
        },
        "allow_download_status": {
          "allow_download": true
        },
        "display_url": "pic.twitter.com/cBeZx3x8cL",
        "expanded_url": "https://twitter.com/DrJimFan/status/2039358115318243352/video/1",
        "ext_media_availability": {
          "status": "Available"
        },
        "id_str": "2039345467038081024",
        "indices": [
          280,
          303
        ],
        "media_key": "13_2039345467038081024",
        "media_results": {
          "id": "QXBpTWVkaWFSZXN1bHRzOgwABAoAARxNNeW5G6AAAAA=",
          "result": {
            "__typename": "ApiMedia",
            "id": "QXBpTWVkaWE6DAAECgABHE015bkboAAAAA==",
            "media_key": "13_2039345467038081024"
          }
        },
        "media_url_https": "https://pbs.twimg.com/amplify_video_thumb/2039345467038081024/img/0d_LMqHb3PiFLReP.jpg",
        "original_info": {
          "focus_rects": [],
          "height": 1080,
          "width": 1920
        },
        "sizes": {
          "large": {
            "h": 1080,
            "w": 1920
          }
        },
        "type": "video",
        "url": "https://t.co/cBeZx3x8cL",
        "video_info": {
          "aspect_ratio": [
            16,
            9
          ],
          "duration_millis": 91221,
          "variants": [
            {
              "content_type": "application/x-mpegURL",
              "url": "https://video.twimg.com/amplify_video/2039345467038081024/pl/ukMEypZRoRiNvfji.m3u8?tag=21&v=cfc"
            },
            {
              "bitrate": 256000,
              "content_type": "video/mp4",
              "url": "https://video.twimg.com/amplify_video/2039345467038081024/vid/avc1/480x270/mTbNJxOSPjsnrTDV.mp4?tag=21"
            },
            {
              "bitrate": 832000,
              "content_type": "video/mp4",
              "url": "https://video.twimg.com/amplify_video/2039345467038081024/vid/avc1/640x360/gZTMcEb6KwWJceF6.mp4?tag=21"
            },
            {
              "bitrate": 2176000,
              "content_type": "video/mp4",
              "url": "https://video.twimg.com/amplify_video/2039345467038081024/vid/avc1/1280x720/JFhbSTqYLupTc4Vw.mp4?tag=21"
            },
            {
              "bitrate": 10368000,
              "content_type": "video/mp4",
              "url": "https://video.twimg.com/amplify_video/2039345467038081024/vid/avc1/1920x1080/cHS_cnP8qUAu4leQ.mp4?tag=21"
            }
          ]
        }
      }
    ]
  },
  "card": null,
  "place": {},
  "entities": {
    "hashtags": [],
    "symbols": [],
    "timestamps": [],
    "urls": [],
    "user_mentions": []
  },
  "quoted_tweet": null,
  "retweeted_tweet": null,
  "isLimitedReply": false,
  "article": null
}
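
The `video_info.variants` array in the raw response above lists one HLS playlist plus several mp4 encodings at different bitrates; a common way to download the best quality is to filter to `video/mp4` and take the maximum `bitrate`. A minimal sketch, using a subset of the variant entries shown above:

```python
# Pick the highest-bitrate mp4 variant from a tweet's video_info.variants.
# The list below reproduces a subset of the variants in the response above.
variants = [
    {"content_type": "application/x-mpegURL",
     "url": "https://video.twimg.com/amplify_video/2039345467038081024/pl/ukMEypZRoRiNvfji.m3u8?tag=21&v=cfc"},
    {"bitrate": 256000, "content_type": "video/mp4",
     "url": "https://video.twimg.com/amplify_video/2039345467038081024/vid/avc1/480x270/mTbNJxOSPjsnrTDV.mp4?tag=21"},
    {"bitrate": 2176000, "content_type": "video/mp4",
     "url": "https://video.twimg.com/amplify_video/2039345467038081024/vid/avc1/1280x720/JFhbSTqYLupTc4Vw.mp4?tag=21"},
    {"bitrate": 10368000, "content_type": "video/mp4",
     "url": "https://video.twimg.com/amplify_video/2039345467038081024/vid/avc1/1920x1080/cHS_cnP8qUAu4leQ.mp4?tag=21"},
]

# HLS playlist entries carry no "bitrate" key, so filter to mp4 first;
# .get(..., 0) guards against any mp4 entry missing the key.
best = max(
    (v for v in variants if v["content_type"] == "video/mp4"),
    key=lambda v: v.get("bitrate", 0),
)
print(best["url"])  # → the 1920x1080, 10368000 bps variant
```

Note that the `application/x-mpegURL` entry is an adaptive-streaming playlist, not a single file, which is why it carries no `bitrate` field.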