@DrJimFan
Let's reverse engineer the phenomenal Tesla Optimus. No insider info, just my own analysis. Long read:

1. The smooth hand movements are almost certainly trained by imitation learning ("behavior cloning") from human operators. The alternative is reinforcement learning in simulation, but that typically leads to jittery motion and unnatural hand poses. There are at least 4 ways to collect human demonstrations (a minimal behavior-cloning sketch is at the end of the thread):

(1) A custom-built teleoperation system - I believe this is the most likely approach used by the Tesla team. Open-source example: ALOHA, a low-cost bimanual robot arm and teleoperation system from the Stanford AI Lab (https://t.co/8iXpiHVEjS). It enables very precise, dexterous motions, such as inserting AAA batteries into a remote or manipulating a contact lens.

(2) Motion capture (MoCap): apply the MoCap systems used for Hollywood movies to capture the fine-grained motions of hand joints. Optimus' 5-finger hand is a great design decision that enables a direct mapping - there is no "embodiment gap" from human operators. For instance, a demonstrator can wear a CyberGlove (https://t.co/S8hxErsEuU) and grasp the cubes on the table (as shown in the video). The CyberGlove captures the motion signals & haptic feedback in real time, which can be retargeted onto Optimus.

(3) Wearing gloves & markers can be clumsy. An alternative way to do MoCap is through computer vision. DexPilot from NVIDIA enables marker-less, glove-free data collection: the human operator simply uses their bare hands to perform the tasks. 4 Intel RealSense depth cameras and 2 NVIDIA Titan XP GPUs (yeah, 2019 work) translate the pixels into precise motion signals for robot learning.

(4) VR headset: turn the training room into a VR game, and let humans "role play" Optimus. Use the native VR controller or a CyberGlove to control the virtual Optimus hands. This has the advantage of scalable remote data collection - annotators from around the world can contribute without coming onsite. The VR demonstration technique appeared in research projects like the iGibson home robot simulator, an initiative I participated in at Stanford: https://t.co/eyI4ORkH6G

The above 4 are not mutually exclusive. Optimus could use a combination of them, trading off their different pros & cons.

2. Neural architecture. Optimus is trained end-to-end: videos in, actions out. I'm quite sure it's implemented as a multimodal Transformer with the following components:

(1) Image: some variant of an efficient ViT, or simply an old ResNet/EfficientNet backbone (https://t.co/L6PLTQJnGA). The block pick-and-place demo doesn't require sophisticated vision. The spatial feature map from the image backbone can be tokenized easily (sketch at the end of the thread).

(2) Video: two ways. Either flatten the video into a sequence of images and produce tokens independently, or use a video-level tokenizer. There are numerous ways to efficiently process video pixel volumes. You don't necessarily need Transformer backbones, e.g. SlowFast networks (https://t.co/qDdXzqwJQp) and RubiksNet (https://t.co/CQU8D7TZgx, my paper at ECCV 2020, efficient CUDA shift primitives).

(3) Language: it's not clear if Optimus is language-prompted. If it is, there needs to be a way to "fuse" the language representation into perception. FiLM is a very lightweight neural network module that serves this purpose (https://t.co/VI4TpgQ22V). You can think of it intuitively as "cross attention" of the language embedding into the image-processing neural pathway (sketch at the end of the thread).
(4) Action tokenization: Optimus needs to convert the continuous motion signals into discrete tokens for the autoregressive Transformer to work. A few ways (sketch at the end of the thread):

- Directly bin the continuous values for each hand joint control: [0, 0.01) -> token #0, [0.01, 0.02) -> token #1, etc. This is straightforward but can be inefficient due to the long sequence length.

- The joint movements are highly dependent on each other, which means they occupy a low-dimensional "state space". Apply a VQ-VAE to the motion data to obtain a shorter, compressed token set.

(5) Putting the above pieces together, we have a Transformer controller that consumes video tokens (optionally with language modulation) and outputs action tokens, one step at a time. The next frame from the table is fed back to the Transformer, so it knows the consequences of its actions. That gives the self-corrective ability shown in the demo (closed-loop sketch at the end of the thread). I believe the architecture is most similar to:

- Google RT-1: https://t.co/dpuon1bqU6
- NVIDIA VIMA: https://t.co/Tn3L63uGrv

3. Lastly, I'm genuinely impressed by the hardware quality. The motions are fluid, and the aesthetics are amazing as well. As I mentioned above, it's such a great decision to follow human morphology closely, so that there is no gap in imitating humans. Atlas from Boston Dynamics only has simple gripper-style hands. In the long run, Optimus' bi-dexterous, 5-finger hands will prove far superior for daily tasks. Congrats to the @Tesla_Optimus team & @elonmusk 🎉! I look forward to seeing the bots roam Mars some day 🦾
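Bonus for the curious - a few minimal code sketches to make the above concrete. First, behavior cloning from point 1 is just supervised learning on (observation, action) pairs collected from teleoperation. Everything below (network, dimensions, dummy data) is my own illustrative assumption, not Tesla's actual stack:

```python
import torch
import torch.nn as nn

obs_dim, act_dim = 512, 26   # hypothetical: visual feature size, # of joint commands

# Dummy stand-in for a dataset of teleoperated (observation, expert_action) pairs.
demos = torch.utils.data.TensorDataset(torch.randn(1000, obs_dim),
                                       torch.randn(1000, act_dim))
loader = torch.utils.data.DataLoader(demos, batch_size=64, shuffle=True)

policy = nn.Sequential(      # stand-in for the real video->action Transformer
    nn.Linear(obs_dim, 256), nn.ReLU(), nn.Linear(256, act_dim))
opt = torch.optim.Adam(policy.parameters(), lr=1e-4)

for obs, expert_act in loader:
    loss = nn.functional.mse_loss(policy(obs), expert_act)  # imitate the human
    opt.zero_grad()
    loss.backward()
    opt.step()
```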
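Tokenizing the backbone's spatial feature map, from point 2(1): each spatial cell becomes one token. I use torchvision's resnet18 and a 256-dim token width purely as assumptions; for video, run this per frame and concatenate the token sequences:

```python
import torch
import torch.nn as nn
from torchvision.models import resnet18

backbone = nn.Sequential(*list(resnet18(weights=None).children())[:-2])  # drop pool+fc
project = nn.Linear(512, 256)      # 512 = resnet18 channels; 256 = assumed token width

frame = torch.randn(1, 3, 224, 224)       # one RGB frame
fmap = backbone(frame)                    # (1, 512, 7, 7) spatial feature map
tokens = fmap.flatten(2).transpose(1, 2)  # (1, 49, 512): one token per spatial cell
tokens = project(tokens)                  # (1, 49, 256), ready for the Transformer
```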
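FiLM, from point 2(3), is only a few lines: the language embedding predicts a per-channel scale and shift that modulates the visual features. Dimensions are assumed:

```python
import torch
import torch.nn as nn

class FiLM(nn.Module):
    """Feature-wise Linear Modulation: language embedding -> per-channel (gamma, beta)."""
    def __init__(self, lang_dim, num_channels):
        super().__init__()
        self.to_gamma_beta = nn.Linear(lang_dim, 2 * num_channels)

    def forward(self, visual_feats, lang_emb):
        # visual_feats: (B, C, H, W); lang_emb: (B, lang_dim)
        gamma, beta = self.to_gamma_beta(lang_emb).chunk(2, dim=-1)
        return gamma[:, :, None, None] * visual_feats + beta[:, :, None, None]

film = FiLM(lang_dim=768, num_channels=512)   # 768: assumed text-encoder width
out = film(torch.randn(2, 512, 7, 7), torch.randn(2, 768))  # modulated features
```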
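The binning tokenizer from point 2(4). Bin count and value range are assumptions (RT-1, for reference, uses 256 bins per action dimension):

```python
import numpy as np

NUM_BINS = 256           # assumed; RT-1 uses 256 bins per action dimension
LOW, HIGH = -1.0, 1.0    # assumed normalized range of joint commands

def actions_to_tokens(actions):
    """Uniformly bin continuous joint commands into discrete token ids."""
    clipped = np.clip(actions, LOW, HIGH)
    return ((clipped - LOW) / (HIGH - LOW) * (NUM_BINS - 1)).round().astype(int)

def tokens_to_actions(ids):
    """Invert: token id -> bin value, for executing a predicted action."""
    return LOW + ids.astype(float) / (NUM_BINS - 1) * (HIGH - LOW)

joint_cmd = np.array([0.03, -0.52, 0.99])  # e.g. 3 joints at one timestep
toks = actions_to_tokens(joint_cmd)        # one token per joint -> long sequences,
print(toks, tokens_to_actions(toks))       # which is why VQ-VAE compression helps
```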
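And the closed loop from point 2(5): observe, tokenize, autoregressively predict action tokens, execute, observe again. Every component below is a hypothetical stub, so only the control pattern itself is the point:

```python
import numpy as np

# All components are hypothetical stand-ins, not Tesla's actual stack.
class DummyCamera:
    def read(self):
        return np.zeros((224, 224, 3))      # fake RGB frame

class DummyRobot:
    def execute(self, joint_cmd):
        pass                                 # would send commands to the motors

class DummyModel:
    def generate(self, context):
        return [0, 1, 2]                     # fake autoregressive action tokens

def image_to_tokens(frame):
    return [0] * 49                          # fake perception tokens

def tokens_to_actions(ids):
    return np.asarray(ids, dtype=float)      # fake de-binning

def run_episode(camera, robot, model, max_steps=500):
    context = []                             # running token history
    for _ in range(max_steps):
        frame = camera.read()                # observe the table
        context += image_to_tokens(frame)    # perception tokens in
        action_ids = model.generate(context) # action tokens out, one step at a time
        context += action_ids                # executed actions stay in the history
        robot.execute(tokens_to_actions(action_ids))
        # the next frame reflects this action's consequence,
        # which is what enables the self-corrective behavior in the demo

run_episode(DummyCamera(), DummyRobot(), DummyModel(), max_steps=10)
```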