🐦 Twitter Post Details


@JunweiLiangCMU

Meet DiT4DiT, the FIRST video generation architecture for humanoid robot control. 🤖✨

By treating video generation as a world model, we give robots real "physical intuition." 🔥
The Results:
🚀 >10x better sample efficiency & up to 7x faster convergence!
🏆 SOTA on LIBERO (98.6%) & RoboCasa-GR1 (50.8%).
🦾 Zero-shot generalization on the Unitree G1 humanoid using just monocular vision (1x speed, fully autonomous).
🧠 How it works: We couple a Video DiT with an Action DiT via a dual flow-matching objective. Instead of relying on fully reconstructed future frames, we extract "intermediate denoising features" to guide action prediction—simple but highly effective!

Check out the paper, real-world videos, and project page here: https://t.co/Ml0AA8PKqA

#EmbodiedAI #Robotics #MachineLearning #WorldModels
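The coupling described in the tweet can be sketched in miniature. This is a hypothetical toy version, not the paper's implementation: the DiT blocks are stand-in one-layer networks, all dimensions are made up, and only the shape of the dual flow-matching objective is illustrated — both branches regress a straight-line velocity, and the action branch is conditioned on the video branch's intermediate features rather than on reconstructed frames.

```python
import numpy as np

rng = np.random.default_rng(0)

def block(x, w):
    """Stand-in for a DiT block: a single linear layer + tanh."""
    return np.tanh(x @ w)

# Hypothetical dimensions (not from the paper).
D_VIDEO, D_FEAT, D_ACT = 32, 16, 8
w_video = rng.normal(size=(D_VIDEO + 1, D_FEAT)) * 0.1       # "Video DiT": noised video + time t -> features
w_head  = rng.normal(size=(D_FEAT, D_VIDEO)) * 0.1           # head predicting the video velocity
w_act   = rng.normal(size=(D_FEAT + D_ACT + 1, D_ACT)) * 0.1 # "Action DiT", conditioned on video features

def dual_flow_matching_loss(video1, act1):
    """One forward pass of a toy dual flow-matching objective.

    Both branches regress the linear-path velocity (x1 - x0) at a random
    time t. The action branch sees the video branch's intermediate
    denoising features, never a fully denoised frame.
    """
    t = rng.uniform()
    video0, act0 = rng.normal(size=video1.shape), rng.normal(size=act1.shape)
    video_t = (1 - t) * video0 + t * video1   # point on the linear interpolation path
    act_t   = (1 - t) * act0 + t * act1

    feat = block(np.concatenate([video_t, [t]]), w_video)  # intermediate denoising features
    v_video_pred = feat @ w_head
    v_act_pred = block(np.concatenate([feat, act_t, [t]]), w_act)

    loss_video = np.mean((v_video_pred - (video1 - video0)) ** 2)
    loss_act   = np.mean((v_act_pred - (act1 - act0)) ** 2)
    return loss_video + loss_act

loss = dual_flow_matching_loss(rng.normal(size=D_VIDEO), rng.normal(size=D_ACT))
print(float(loss))
```

In a real system each `block` would be a full transformer and the features would come from a chosen intermediate layer; the point here is only the joint objective and the feature-level (not pixel-level) conditioning.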

📊 Media Metadata

{
  "media": [
    {
      "url": "https://crmoxkoizveukayfjuyo.supabase.co/storage/v1/object/public/media/posts/2036754438803824967/media_0.mp4",
      "media_url": "https://crmoxkoizveukayfjuyo.supabase.co/storage/v1/object/public/media/posts/2036754438803824967/media_0.mp4",
      "type": "video",
      "filename": "media_0.mp4"
    }
  ],
  "processed_at": "2026-03-25T21:23:57.918316",
  "pipeline_version": "2.0"
}
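Consuming the metadata above is straightforward: the `media` list carries one entry per attachment, each with `url`, `type`, and `filename` fields as shown. A minimal sketch, using the fields exactly as they appear:

```python
import json

# Abridged copy of the media-metadata block above.
metadata = json.loads("""
{
  "media": [
    {
      "url": "https://crmoxkoizveukayfjuyo.supabase.co/storage/v1/object/public/media/posts/2036754438803824967/media_0.mp4",
      "type": "video",
      "filename": "media_0.mp4"
    }
  ],
  "pipeline_version": "2.0"
}
""")

# Collect the URLs of all video attachments.
video_urls = [m["url"] for m in metadata["media"] if m["type"] == "video"]
print(video_urls[0])
```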

🔧 Raw API Response

{
  "type": "tweet",
  "id": "2036754438803824967",
  "url": "https://x.com/JunweiLiangCMU/status/2036754438803824967",
  "twitterUrl": "https://twitter.com/JunweiLiangCMU/status/2036754438803824967",
  "text": "Meet DiT4DiT, the FIRST video generation architecture for humanoid robot control. 🤖✨ \n\nBy treating video generation as a world model, we give robots real \"physical intuition.\" 🔥 \nThe Results:\n🚀 >10x better sample efficiency & up to 7x faster convergence! \n🏆 SOTA on LIBERO (98.6%) & RoboCasa-GR1 (50.8%). \n🦾 Zero-shot generalization on the Unitree G1 humanoid using just monocular vision (1x speed, fully autonomous). \n🧠 How it works: We couple a Video DiT with an Action DiT via a dual flow-matching objective. Instead of relying on fully reconstructed future frames, we extract \"intermediate denoising features\" to guide action prediction—simple but highly effective! \n\nCheck out the paper, real-world videos, and project page here: https://t.co/Ml0AA8PKqA \n\n#EmbodiedAI #Robotics #MachineLearning #WorldModels",
  "source": "Twitter for iPhone",
  "retweetCount": 4,
  "replyCount": 0,
  "likeCount": 16,
  "quoteCount": 0,
  "viewCount": 1558,
  "createdAt": "Wed Mar 25 10:37:53 +0000 2026",
  "lang": "en",
  "bookmarkCount": 14,
  "isReply": false,
  "inReplyToId": null,
  "conversationId": "2036754438803824967",
  "displayTextRange": [
    0,
    279
  ],
  "inReplyToUserId": null,
  "inReplyToUsername": null,
  "author": {
    "type": "user",
    "userName": "JunweiLiangCMU",
    "url": "https://x.com/JunweiLiangCMU",
    "twitterUrl": "https://twitter.com/JunweiLiangCMU",
    "id": "3820486455",
    "name": "Junwei Liang 梁俊卫",
    "isVerified": false,
    "isBlueVerified": false,
    "verifiedType": null,
    "profilePicture": "https://pbs.twimg.com/profile_images/1939862616098123776/F1kqWlz4_normal.jpg",
    "coverPicture": "https://pbs.twimg.com/profile_banners/3820486455/1751334274",
    "description": "Assistant Professor @HKUST (GZ) // Ph.D. @CarnegieMellon // NeurIPS Area Chair",
    "location": "Guangzhou, China",
    "followers": 293,
    "following": 130,
    "status": "",
    "canDm": false,
    "canMediaTag": true,
    "createdAt": "Wed Sep 30 03:22:19 +0000 2015",
    "entities": {
      "description": {
        "urls": []
      },
      "url": {
        "urls": [
          {
            "display_url": "junweiliang.me",
            "expanded_url": "https://junweiliang.me/",
            "indices": [
              0,
              23
            ],
            "url": "https://t.co/S6iRZlxWj6"
          }
        ]
      }
    },
    "fastFollowersCount": 0,
    "favouritesCount": 169,
    "hasCustomTimelines": false,
    "isTranslator": false,
    "mediaCount": 35,
    "statusesCount": 72,
    "withheldInCountries": [],
    "affiliatesHighlightedLabel": {},
    "possiblySensitive": false,
    "pinnedTweetIds": [],
    "profile_bio": {},
    "isAutomated": false,
    "automatedBy": null
  },
  "extendedEntities": {
    "media": [
      {
        "additional_media_info": {
          "monetizable": true
        },
        "display_url": "pic.x.com/MwBBuWf8Wi",
        "expanded_url": "https://x.com/JunweiLiangCMU/status/2036754438803824967/video/1",
        "ext_media_availability": {
          "status": "Available"
        },
        "id_str": "2036753538655903744",
        "indices": [
          280,
          303
        ],
        "media_key": "13_2036753538655903744",
        "media_results": {
          "result": {
            "media_key": "13_2036753538655903744"
          }
        },
        "media_url_https": "https://pbs.twimg.com/amplify_video_thumb/2036753538655903744/img/Fz339fAYxVPiUlEd.jpg",
        "original_info": {
          "focus_rects": [],
          "height": 744,
          "width": 1920
        },
        "sizes": {
          "large": {
            "h": 744,
            "resize": "fit",
            "w": 1920
          },
          "medium": {
            "h": 465,
            "resize": "fit",
            "w": 1200
          },
          "small": {
            "h": 264,
            "resize": "fit",
            "w": 680
          },
          "thumb": {
            "h": 150,
            "resize": "crop",
            "w": 150
          }
        },
        "type": "video",
        "url": "https://t.co/MwBBuWf8Wi",
        "video_info": {
          "aspect_ratio": [
            80,
            31
          ],
          "duration_millis": 53543,
          "variants": [
            {
              "content_type": "application/x-mpegURL",
              "url": "https://video.twimg.com/amplify_video/2036753538655903744/pl/KDYMQGVVJAhLyBaJ.m3u8?tag=21"
            },
            {
              "bitrate": 256000,
              "content_type": "video/mp4",
              "url": "https://video.twimg.com/amplify_video/2036753538655903744/vid/avc1/696x270/SKonGMe3ktS3Jm9G.mp4?tag=21"
            },
            {
              "bitrate": 832000,
              "content_type": "video/mp4",
              "url": "https://video.twimg.com/amplify_video/2036753538655903744/vid/avc1/928x360/Qcz2y97PLvylO7ai.mp4?tag=21"
            },
            {
              "bitrate": 2176000,
              "content_type": "video/mp4",
              "url": "https://video.twimg.com/amplify_video/2036753538655903744/vid/avc1/1920x744/2Oq-fFmOlVs1sfNF.mp4?tag=21"
            }
          ]
        }
      }
    ]
  },
  "card": null,
  "place": {},
  "entities": {
    "hashtags": [
      {
        "indices": [
          761,
          772
        ],
        "text": "EmbodiedAI"
      },
      {
        "indices": [
          773,
          782
        ],
        "text": "Robotics"
      },
      {
        "indices": [
          783,
          799
        ],
        "text": "MachineLearning"
      },
      {
        "indices": [
          800,
          812
        ],
        "text": "WorldModels"
      }
    ],
    "symbols": [],
    "timestamps": [],
    "urls": [
      {
        "display_url": "dit4dit.github.io",
        "expanded_url": "https://dit4dit.github.io/",
        "indices": [
          735,
          758
        ],
        "url": "https://t.co/Ml0AA8PKqA"
      }
    ],
    "user_mentions": []
  },
  "quoted_tweet": null,
  "retweeted_tweet": null,
  "article": null
}
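The `video_info.variants` list in the raw response mixes an HLS playlist (`application/x-mpegURL`, no `bitrate` field) with progressive MP4s at several bitrates. A small helper to pick the best MP4, assuming the v1.1-style shape shown above (abridged to two MP4 variants here):

```python
def best_mp4(video_info):
    """Return the URL of the highest-bitrate progressive MP4 variant.

    Skips non-MP4 entries such as the application/x-mpegURL HLS
    playlist, which carries no 'bitrate' field.
    """
    mp4s = [v for v in video_info["variants"] if v["content_type"] == "video/mp4"]
    return max(mp4s, key=lambda v: v.get("bitrate", 0))["url"]

video_info = {
    "duration_millis": 53543,
    "variants": [
        {"content_type": "application/x-mpegURL",
         "url": "https://video.twimg.com/amplify_video/2036753538655903744/pl/KDYMQGVVJAhLyBaJ.m3u8?tag=21"},
        {"bitrate": 256000, "content_type": "video/mp4",
         "url": "https://video.twimg.com/amplify_video/2036753538655903744/vid/avc1/696x270/SKonGMe3ktS3Jm9G.mp4?tag=21"},
        {"bitrate": 2176000, "content_type": "video/mp4",
         "url": "https://video.twimg.com/amplify_video/2036753538655903744/vid/avc1/1920x744/2Oq-fFmOlVs1sfNF.mp4?tag=21"},
    ],
}
print(best_mp4(video_info))  # the 1920x744 variant
```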