🐦 Twitter Post Details

@sainingxie

Here's my take on the Sora technical report, with a good dose of speculation that could be totally off. First of all, really appreciate the team for sharing helpful insights and design decisions – Sora is incredible and is set to transform the video generation community.

What we have learned so far:

- Architecture: Sora is built on our diffusion transformer (DiT) model (published in ICCV 2023); it's a diffusion model with a transformer backbone, in short: DiT = [VAE encoder + ViT + DDPM + VAE decoder]. According to the report, it seems there are not many additional bells and whistles.
- "Video compressor network": Looks like it's just a VAE but trained on raw video data. Tokenization probably plays a significant role in getting good temporal consistency. By the way, VAE is a ConvNet, so DiT technically is a hybrid model ;)

(1/n)
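The formula DiT = [VAE encoder + ViT + DDPM + VAE decoder] can be made a little more concrete with a rough sketch of two of its pieces: how a compressed latent turns into a sequence of transformer tokens, and how the DDPM forward process noises that latent during training. This is only an illustration under stated assumptions, not Sora's or DiT's actual code. The function names are hypothetical, the noise schedule is the linear one from the DDPM paper, and the temporal downsampling factor used for the video case is a made-up value, since the Sora report does not disclose it.

```python
import math
import random


def token_count(frames, height, width, t_down, s_down, t_patch, s_patch):
    """Tokens for one clip after VAE compression and patchification.

    t_down / s_down are the (assumed) temporal and spatial downsampling
    factors of the VAE; t_patch / s_patch are the patch sizes on the latent.
    """
    lt, lh, lw = frames // t_down, height // s_down, width // s_down
    return (lt // t_patch) * (lh // s_patch) * (lw // s_patch)


def linear_alpha_bars(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Cumulative products of (1 - beta_t) for a linear DDPM schedule."""
    alpha_bars, running = [], 1.0
    for i in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * i / (num_steps - 1)
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars


def add_noise(latent, t, alpha_bars, rng):
    """DDPM forward step: x_t = sqrt(a_bar_t) * x_0 + sqrt(1 - a_bar_t) * eps.

    `latent` is a flat list of floats standing in for the VAE latent.
    Returns the noised latent and the noise the transformer would be
    trained to predict.
    """
    a = alpha_bars[t]
    noise = [rng.gauss(0.0, 1.0) for _ in latent]
    noisy = [math.sqrt(a) * x + math.sqrt(1.0 - a) * e
             for x, e in zip(latent, noise)]
    return noisy, noise


if __name__ == "__main__":
    # Image case: 256x256 input, 8x spatial VAE, patch size 2.
    print(token_count(1, 256, 256, 1, 8, 1, 2))   # 256
    # Video case: 16 frames with a hypothetical 4x temporal downsampling.
    print(token_count(16, 256, 256, 4, 8, 1, 2))  # 1024
    bars = linear_alpha_bars()
    noisy, eps = add_noise([1.0, -2.0, 0.5], 500, bars, random.Random(0))
    print(len(noisy))                             # 3
```

The image case matches the setup described in the DiT paper (a 32x32 latent split into 2x2 patches gives 256 tokens). The video case only shows how quickly the sequence grows once a time axis is added, which is one reason the choice of compressor and tokenization matters.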

📊 Media Metadata

{
  "media": [
    {
      "url": "https://crmoxkoizveukayfjuyo.supabase.co/storage/v1/object/public/media/posts/1758433676105310543/media_0.jpg",
      "type": "photo",
      "original_url": "https://pbs.twimg.com/media/GGcpNl6XYAAkV8V.jpg",
      "download_date": "2025-08-13T06:02:54.453857",
      "stored_in_supabase": true
    }
  ],
  "conversion_date": "2025-08-13T00:41:44.335272",
  "format_converted": true,
  "original_structure": "had_media_only"
}

🔧 Raw API Response

{
  "user": {
    "created_at": "2020-07-14T16:51:59.000Z",
    "default_profile_image": false,
    "description": "researcher in #deeplearning & #computervision | assistant professor at @NYU_Courant CS @nyuniversity | previous: research scientist @metaai (FAIR) @UCSanDiego",
    "fast_followers_count": 0,
    "favourites_count": 2440,
    "followers_count": 9868,
    "friends_count": 1026,
    "has_custom_timelines": true,
    "is_translator": false,
    "listed_count": 118,
    "location": "",
    "media_count": 29,
    "name": "Saining Xie",
    "normal_followers_count": 9868,
    "possibly_sensitive": false,
    "profile_image_url_https": "https://pbs.twimg.com/profile_images/1464323007078285312/dR9-qSsT_normal.jpg",
    "screen_name": "sainingxie",
    "statuses_count": 238,
    "translator_type": "none",
    "url": "https://t.co/z7UjEiSqGu",
    "verified": false,
    "withheld_in_countries": [],
    "id_str": "1283081795890626560"
  },
  "id": "1758433676105310543",
  "conversation_id": "1758433676105310543",
  "full_text": "Here's my take on the Sora technical report, with a good dose of speculation that could be totally off. First of all, really appreciate the team for sharing helpful insights and design decisions – Sora is incredible and is set to transform the video generation community.\n\nWhat we have learned so far:\n- Architecture: Sora is built on our diffusion transformer (DiT) model (published in ICCV 2023) — it's a diffusion model with a transformer backbone, in short:\nDiT = [VAE encoder + ViT + DDPM + VAE decoder]. \nAccording to the report, it seems there are not much additional bells and whistles.\n\n- \"Video compressor network\": Looks like it's just a VAE but trained on raw video data. Tokenization probably plays a significant role in getting good temporal consistency. By the way, VAE is a ConvNet, so DiT technically is a hybrid model ;) (1/n)",
  "reply_count": 23,
  "retweet_count": 403,
  "favorite_count": 2139,
  "hashtags": [],
  "symbols": [],
  "user_mentions": [],
  "urls": [],
  "media": [
    {
      "media_url": "https://pbs.twimg.com/media/GGcpNl6XYAAkV8V.jpg",
      "type": "photo"
    }
  ],
  "url": "https://twitter.com/sainingxie/status/1758433676105310543",
  "created_at": "2024-02-16T10:10:33.000Z",
  "#sort_index": "1758433676105310543",
  "view_count": 928931,
  "quote_count": 47,
  "is_quote_tweet": false,
  "is_retweet": false,
  "is_pinned": false,
  "is_truncated": true,
  "startUrl": "https://twitter.com/sainingxie/status/1758433676105310543"
}