@tri_dao
The FA4 paper is finally out after a year of work. On Blackwell GPUs, attention now goes about as fast as matmul even though the bottlenecks are so different! Tensor cores are now so crazy fast that attn fwd is bottlenecked by the exponential, and attn bwd is bottlenecked by shared memory bandwidth. Some fun stuff in the redesigned algorithm to overcome these bottlenecks: exponential emulation with polynomials, a new online softmax that avoids 90% of softmax rescaling, and 2-CTA MMA instructions that let two thread blocks share operands to reduce smem traffic.
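To see what "avoiding softmax rescaling" means, here is a minimal sketch of the standard online (streaming) softmax that FlashAttention builds on. The running sum of exponentials normally gets rescaled whenever the running max changes; skipping the rescale when the max is unchanged is the kind of saving the post refers to. This is an illustrative scalar version, not FA4's actual kernel, and the function name is mine:

```python
import math

def online_softmax(scores):
    """Streaming softmax in one pass, keeping a running max m and a
    running sum s of exp(x - m). Illustrative sketch only: FA4's new
    online softmax (per the post) is a redesigned variant that skips
    ~90% of these rescaling steps."""
    m = float("-inf")   # running max of scores seen so far
    s = 0.0             # running sum of exp(x - m)
    for x in scores:
        m_new = max(m, x)
        if m_new != m:
            # Rescale the accumulated sum to the new max. This is the
            # "softmax rescaling" step that is worth avoiding when the
            # max doesn't change (the common case for long sequences).
            s *= math.exp(m - m_new)
            m = m_new
        s += math.exp(x - m)
    return [math.exp(x - m) / s for x in scores]
```

Note that once the running max stops changing (attention scores rarely keep setting new maxima), the rescale branch is never taken, which is why most rescaling work can be elided.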