🐦 Twitter Post Details

Viewing enriched Twitter post

@DrJimFan

Everyone's freaking out about vibe coding. In the holiday spirit, allow me to share my anxiety on the wild west of robotics. 3 lessons I learned in 2025.

1. Hardware is ahead of software, but hardware reliability severely limits software iteration speed.

We've seen exquisite engineering arts like Optimus, e-Atlas, Figure, Neo, G1, etc. Our best AI has not squeezed all the juice out of these frontier hardware. The body is more capable than what the brain can command. Yet babysitting these robots demands an entire operation team. Unlike humans, robots don't heal from bruises. Overheating, broken motors, bizarre firmware issues haunt us daily. Mistakes are irreversible and unforgiving.

My patience was the only thing that scaled.

2. Benchmarking is still an epic disaster in robotics.

LLM normies thought MMLU & SWE-Bench are common sense. Hold your 🍺 for robotics. No one agrees on anything: hardware platform, task definition, scoring rubrics, simulator, or real world setups. Everyone is SOTA, by definition, on the benchmark they define on the fly for each news announcement. Everyone cherry-picks the nicest looking demo out of 100 retries.

We gotta do better as a field in 2026 and stop treating reproducibility and scientific discipline as second-class citizens.

3. VLM-based VLA feels wrong.

VLA stands for "vision-language-action" model and has been the dominant approach for robot brains. Recipe is simple: take a pretrained VLM checkpoint and graft an action module on top. But if you think about it, VLMs are hyper-optimized to hill-climb benchmarks like visual question answering. This implies two problems: (1) most parameters in VLMs are for language & knowledge, not for physics; (2) visual encoders are actively tuned to *discard* low-level details, because Q&A only requires high-level understanding. But minute details matter a lot for dexterity.

There's no reason for VLA's performance to scale as VLM parameters scale. Pretraining is misaligned.
Video world model seems to be a much better pretraining objective for robot policy. I'm betting big on it.
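The "take a pretrained VLM checkpoint and graft an action module on top" recipe the post critiques can be sketched as a toy model. Everything here is invented for illustration (module names, dimensions, the frozen linear "backbone"); it is not any real VLA implementation, only the shape of the idea: a fixed pretrained encoder feeding a small trainable action head.

```python
import random


def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]


class FrozenVLMEncoder:
    """Stand-in for a pretrained VLM backbone: weights are fixed after init."""

    def __init__(self, obs_dim, feat_dim, seed=0):
        rng = random.Random(seed)
        self.W = [[rng.uniform(-0.1, 0.1) for _ in range(obs_dim)]
                  for _ in range(feat_dim)]

    def encode(self, obs):
        # A real VLM would tokenize images/text; here it is just a projection.
        return matvec(self.W, obs)


class ActionHead:
    """The small module 'grafted on top': maps VLM features to an action vector."""

    def __init__(self, feat_dim, action_dim, seed=1):
        rng = random.Random(seed)
        self.W = [[rng.uniform(-0.1, 0.1) for _ in range(feat_dim)]
                  for _ in range(action_dim)]

    def forward(self, feat):
        return matvec(self.W, feat)


class ToyVLA:
    """VLA recipe: frozen pretrained backbone + trainable action head."""

    def __init__(self, obs_dim=8, feat_dim=4, action_dim=2):
        self.backbone = FrozenVLMEncoder(obs_dim, feat_dim)
        self.head = ActionHead(feat_dim, action_dim)

    def act(self, obs):
        return self.head.forward(self.backbone.encode(obs))


vla = ToyVLA()
action = vla.act([0.5] * 8)
print(len(action))  # 2-dimensional action vector
```

The post's objection maps directly onto this sketch: whatever `FrozenVLMEncoder` discards (low-level visual detail) is unrecoverable by the action head, no matter how large the backbone grows.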

Media 1

📊 Media Metadata

{
  "media": [
    {
      "type": "photo",
      "url": "https://crmoxkoizveukayfjuyo.supabase.co/storage/v1/object/public/media/posts/2005340845055340558/media_0.jpg?",
      "filename": "media_0.jpg"
    }
  ],
  "processed_at": "2025-12-31T02:47:20.606367",
  "pipeline_version": "2.0"
}
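Downstream code consuming this enriched payload could read the media list with the standard `json` module. The payload literal below is an abbreviated copy of the metadata above; field names (`media`, `type`, `filename`, `pipeline_version`) are taken from it.

```python
import json

# Abbreviated copy of the media-metadata payload shown above.
payload = """
{
  "media": [
    {"type": "photo", "filename": "media_0.jpg"}
  ],
  "pipeline_version": "2.0"
}
"""

meta = json.loads(payload)
photos = [m for m in meta["media"] if m["type"] == "photo"]
print(photos[0]["filename"])  # media_0.jpg
```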

🔧 Raw API Response

{
  "type": "tweet",
  "id": "2005340845055340558",
  "url": "https://x.com/DrJimFan/status/2005340845055340558",
  "twitterUrl": "https://twitter.com/DrJimFan/status/2005340845055340558",
  "text": "Everyone's freaking out about vibe coding. In the holiday spirit, allow me to share my anxiety on the wild west of robotics. 3 lessons I learned in 2025.\n\n1. Hardware is ahead of software, but hardware reliability severely limits software iteration speed. \n\nWe've seen exquisite engineering arts like Optimus, e-Atlas, Figure, Neo, G1, etc. Our best AI has not squeezed all the juice out of these frontier hardware. The body is more capable than what the brain can command. Yet babysitting these robots demands an entire operation team. Unlike humans, robots don't heal from bruises. Overheating, broken motors, bizarre firmware issues haunt us daily. Mistakes are irreversible and unforgiving.\n\nMy patience was the only thing that scaled. \n\n2. Benchmarking is still an epic disaster in robotics. \n\nLLM normies thought MMLU & SWE-Bench are common sense. Hold your 🍺 for robotics. No one agrees on anything: hardware platform, task definition, scoring rubrics, simulator, or real world setups. Everyone is SOTA, by definition, on the benchmark they define on the fly for each news announcement. Everyone cherry-picks the nicest looking demo out of 100 retries.\n\nWe gotta do better as a field in 2026 and stop treating reproducibility and scientific discipline as second-class citizens.\n\n3. VLM-based VLA feels wrong. \n\nVLA stands for \"vision-language-action\" model and has been the dominant approach for robot brains. Recipe is simple: take a pretrained VLM checkpoint and graft an action module on top. But if you think about it, VLMs are hyper-optimized to hill-climb benchmarks like visual question answering. This implies two problems: (1) most parameters in VLMs are for language & knowledge, not for physics; (2) visual encoders are actively tuned to *discard* low-level details, because Q&A only requires high-level understanding. But minute details matter a lot for dexterity.\n\nThere's no reason for VLA's performance to scale as VLM parameters scale. Pretraining is misaligned. Video world model seems to be a much better pretraining objective for robot policy. I'm betting big on it.",
  "source": "Twitter for iPhone",
  "retweetCount": 241,
  "replyCount": 119,
  "likeCount": 1461,
  "quoteCount": 67,
  "viewCount": 226285,
  "createdAt": "Sun Dec 28 18:11:29 +0000 2025",
  "lang": "en",
  "bookmarkCount": 755,
  "isReply": false,
  "inReplyToId": null,
  "conversationId": "2005340845055340558",
  "displayTextRange": [
    0,
    303
  ],
  "inReplyToUserId": null,
  "inReplyToUsername": null,
  "author": {
    "type": "user",
    "userName": "DrJimFan",
    "url": "https://x.com/DrJimFan",
    "twitterUrl": "https://twitter.com/DrJimFan",
    "id": "1007413134",
    "name": "Jim Fan",
    "isVerified": false,
    "isBlueVerified": true,
    "verifiedType": null,
    "profilePicture": "https://pbs.twimg.com/profile_images/1554922493101559808/SYSZhbcd_normal.jpg",
    "coverPicture": "https://pbs.twimg.com/profile_banners/1007413134/1672408318",
    "description": "",
    "location": "Views my own. Contact →",
    "followers": 343305,
    "following": 3097,
    "status": "",
    "canDm": true,
    "canMediaTag": true,
    "createdAt": "Wed Dec 12 22:11:27 +0000 2012",
    "entities": {
      "description": {
        "urls": []
      },
      "url": {}
    },
    "fastFollowersCount": 0,
    "favouritesCount": 8580,
    "hasCustomTimelines": true,
    "isTranslator": false,
    "mediaCount": 839,
    "statusesCount": 4045,
    "withheldInCountries": [],
    "affiliatesHighlightedLabel": {},
    "possiblySensitive": false,
    "pinnedTweetIds": [
      "2003879965369290797"
    ],
    "profile_bio": {
      "description": "NVIDIA Director of Robotics & Distinguished Scientist. Co-Lead of GEAR lab. Solving Physical AGI, one motor at a time. Stanford Ph.D. OpenAI's 1st intern.",
      "entities": {
        "description": {},
        "url": {
          "urls": [
            {
              "display_url": "jimfan.me",
              "expanded_url": "https://jimfan.me",
              "indices": [
                0,
                23
              ],
              "url": "https://t.co/H4rXo4Ei8X"
            }
          ]
        }
      }
    },
    "isAutomated": false,
    "automatedBy": null
  },
  "extendedEntities": {
    "media": [
      {
        "allow_download_status": {
          "allow_download": true
        },
        "display_url": "pic.twitter.com/kZObrX4tBz",
        "expanded_url": "https://twitter.com/DrJimFan/status/2005340845055340558/photo/1",
        "ext_media_availability": {
          "status": "Available"
        },
        "features": {
          "large": {},
          "orig": {}
        },
        "id_str": "2005338601140465665",
        "indices": [
          304,
          327
        ],
        "media_key": "3_2005338601140465665",
        "media_results": {
          "id": "QXBpTWVkaWFSZXN1bHRzOgwAAQoAARvUZNU/WzABCgACG9Rm37Ma8A4AAA==",
          "result": {
            "__typename": "ApiMedia",
            "id": "QXBpTWVkaWE6DAABCgABG9Rk1T9bMAEKAAIb1GbfsxrwDgAA",
            "media_key": "3_2005338601140465665"
          }
        },
        "media_url_https": "https://pbs.twimg.com/media/G9Rk1T9bMAE_l5V.jpg",
        "original_info": {
          "focus_rects": [
            {
              "h": 573,
              "w": 1024,
              "x": 0,
              "y": 0
            },
            {
              "h": 1024,
              "w": 1024,
              "x": 0,
              "y": 0
            },
            {
              "h": 1024,
              "w": 898,
              "x": 88,
              "y": 0
            },
            {
              "h": 1024,
              "w": 512,
              "x": 281,
              "y": 0
            },
            {
              "h": 1024,
              "w": 1024,
              "x": 0,
              "y": 0
            }
          ],
          "height": 1024,
          "width": 1024
        },
        "sizes": {
          "large": {
            "h": 1024,
            "w": 1024
          }
        },
        "type": "photo",
        "url": "https://t.co/kZObrX4tBz"
      }
    ]
  },
  "card": null,
  "place": {},
  "entities": {},
  "quoted_tweet": null,
  "retweeted_tweet": null,
  "isLimitedReply": false,
  "article": null
}
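The engagement counts in the raw response above can be combined into a simple interaction rate (interactions divided by views). The rate definition here is one common convention, not part of the API; the counts are copied from this post's payload, field names as returned by the API.

```python
# Engagement counts copied from the raw API response above.
post = {
    "retweetCount": 241,
    "replyCount": 119,
    "likeCount": 1461,
    "quoteCount": 67,
    "bookmarkCount": 755,
    "viewCount": 226285,
}

# One common convention: public interactions over views.
interactions = (post["retweetCount"] + post["replyCount"]
                + post["likeCount"] + post["quoteCount"])
rate = interactions / post["viewCount"]
print(f"{rate:.2%}")  # roughly 0.83% of viewers interacted
```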