🐦 Twitter Post Details

@omarsar0

Great paper on why RL actually works for LLM reasoning.

Apparently, "aha moments" during training aren't random. They're markers of something deeper.

Researchers analyzed RL training dynamics across eight models, including Qwen, LLaMA, and vision-language models. The findings challenge how we think about training reasoning capabilities.

RL training follows a two-phase dynamic that mirrors human cognition: first, the model masters low-level execution (calculations, formulas), then the learning bottleneck shifts to high-level strategic planning (logical maneuvers, backtracking, branching).

It turns out that current algorithms like GRPO apply optimization pressure uniformly across all tokens. This dilutes the learning signal: most tokens are procedural execution, while the real gains come from strategic planning tokens.

This new research introduces HICRA (Hierarchy-Aware Credit Assignment), an algorithm that concentrates optimization specifically on planning tokens rather than treating all tokens equally.

How do they identify planning tokens? Through "Strategic Grams," n-grams that function as logical scaffolding: phrases like "let's try a different approach" or "but the problem mentions that." Human annotation confirmed that 86% of the identified Strategic Grams genuinely guide reasoning flow.

On Qwen3-4B-Instruct, HICRA achieves 73.1% on AIME24 versus GRPO's 68.5%, and 65.1% versus 60.0% on AIME25. On Qwen2.5-7B-Base, it gains +8.4 points on AMC23 and +4.0 on Olympiad benchmarks.

Error analysis reveals the mechanism: during RL training, strategic errors decrease far more than procedural errors. A perfectly executed incorrect plan still fails. RL preferentially fixes high-level strategic faults because that's where the leverage is.

HICRA sustains higher semantic entropy than GRPO while maintaining lower token entropy. The difference matters because entropy regularization that promotes token-level diversity actually hurts performance. Only targeted strategic exploration improves reasoning.

Overall, the paper provides a mechanistic explanation for mysterious RL phenomena like "aha moments" and length-scaling, and demonstrates that focusing optimization on the right tokens substantially improves training efficiency.

(bookmark it)

Paper: https://t.co/mpLvne0gGk

Learn to build with AI Agents in my academy: https://t.co/JBU5beIoD0
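
The post describes HICRA only at a high level, so the Python sketch below is an illustration under stated assumptions, not the paper's implementation. It tags tokens that fall inside a Strategic Gram, computes a GRPO-style group-relative advantage per rollout, and then gives planning tokens extra weight. The phrase list holds just the two examples quoted above. The function names, the whitespace tokenization, the amplification rule (A + alpha * |A|) and the value alpha = 0.2 are all assumptions made for this example.

```python
from typing import List, Sequence

# Hypothetical phrase list: only the two examples quoted in the post.
STRATEGIC_GRAMS = [
    "let's try a different approach",
    "but the problem mentions that",
]

_PUNCT = ".,;:!?\"'"


def _normalize(token: str) -> str:
    """Lowercase a token and strip punctuation from its ends."""
    return token.lower().strip(_PUNCT)


def planning_mask(tokens: Sequence[str],
                  grams: Sequence[str] = STRATEGIC_GRAMS) -> List[bool]:
    """Return True for every token that lies inside a strategic gram."""
    words = [_normalize(t) for t in tokens]
    mask = [False] * len(words)
    for gram in grams:
        target = [_normalize(w) for w in gram.split()]
        n = len(target)
        for start in range(len(words) - n + 1):
            if words[start:start + n] == target:
                for j in range(start, start + n):
                    mask[j] = True
    return mask


def group_relative_advantages(rewards: Sequence[float]) -> List[float]:
    """GRPO-style advantage: each rollout's reward normalized within its group."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = var ** 0.5
    return [(r - mean) / (std + 1e-6) for r in rewards]


def reweight_advantages(advantage: float, mask: Sequence[bool],
                        alpha: float = 0.2) -> List[float]:
    """Spread one rollout-level advantage over its tokens.

    Procedural tokens keep the uniform advantage (plain GRPO behavior).
    Planning tokens get advantage + alpha * |advantage| (assumed rule).
    """
    return [
        advantage + alpha * abs(advantage) if is_planning else advantage
        for is_planning in mask
    ]


if __name__ == "__main__":
    tokens = "Hmm, let's try a different approach. 3 + 4 = 7".split()
    mask = planning_mask(tokens)
    advantages = group_relative_advantages([1.0, 0.0])
    for tok, adv in zip(tokens, reweight_advantages(advantages[0], mask)):
        print(f"{tok:12s} {adv:+.3f}")
```

With uniform credit assignment, every token in a rollout would receive the same advantage. In this sketch only the tokens inside a matched phrase are shifted, which is one simple way to concentrate the learning signal on planning tokens as the post describes.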

📊 Media Metadata

{
  "media": [
    {
      "url": "https://crmoxkoizveukayfjuyo.supabase.co/storage/v1/object/public/media/posts/1999483394963701911/media_0.jpg?",
      "media_url": "https://crmoxkoizveukayfjuyo.supabase.co/storage/v1/object/public/media/posts/1999483394963701911/media_0.jpg?",
      "type": "photo",
      "filename": "media_0.jpg"
    }
  ],
  "processed_at": "2025-12-12T14:48:30.918223",
  "pipeline_version": "2.0"
}

🔧 Raw API Response

{
  "type": "tweet",
  "id": "1999483394963701911",
  "url": "https://x.com/omarsar0/status/1999483394963701911",
  "twitterUrl": "https://twitter.com/omarsar0/status/1999483394963701911",
  "text": "Great paper on why RL actually works for LLM reasoning.\n\nApparently, \"aha moments\" during training aren't random. They're markers of something deeper.\n\nResearchers analyzed RL training dynamics across eight models, including Qwen, LLaMA, and vision-language models. The findings challenge how we think about training reasoning capabilities.\n\nRL training follows a two-phase dynamic that mirrors human cognition: first, the model masters low-level execution (calculations, formulas), then the learning bottleneck shifts to high-level strategic planning (logical maneuvers, backtracing, branching).\n\nIt turns out that current algorithms like GRPO apply optimization pressure uniformly across all tokens. This dilutes the learning signal. Most tokens are procedural execution. The real gains come from strategic planning tokens.\n\nThis new research introduces HICRA (Hierarchy-Aware Credit Assignment), an algorithm that concentrates optimization specifically on planning tokens rather than treating all tokens equally.\n\nHow do they identify planning tokens? Through \"Strategic Grams,\" n-grams that function as logical scaffolding: phrases like \"let's try a different approach\" or \"but the problem mentions that.\" Human annotation validated 86% of identified Strategic Grams genuinely guide reasoning flow.\n\nOn Qwen3-4B-Instruct, HICRA achieves 73.1% on AIME24 versus GRPO's 68.5%. On AIME25, 65.1% versus 60.0%. On Qwen2.5-7B-Base, gains of +8.4 points on AMC23 and +4.0 on Olympiad benchmarks.\n\nError analysis reveals the mechanism: during RL training, strategic errors decrease far more than procedural errors. A perfectly executed incorrect plan still fails. RL preferentially fixes high-level strategic faults because that's where the leverage is.\n\nHICRA sustains higher semantic entropy than GRPO while maintaining lower token entropy. The difference matters because entropy regularization that promotes token-level diversity actually hurts performance. 
Only targeted strategic exploration improves reasoning.\n\nOverall, the paper provides a mechanistic explanation for mysterious RL phenomena like \"aha moments\" and length-scaling, and demonstrates that focusing optimization on the right tokens substantially improves training efficiency.\n\n(bookmark it)\n\nPaper: https://t.co/mpLvne0gGk\n\nLearn to build with AI Agents in my academy: https://t.co/JBU5beIoD0",
  "source": "Twitter for iPhone",
  "retweetCount": 4,
  "replyCount": 3,
  "likeCount": 24,
  "quoteCount": 0,
  "viewCount": 1503,
  "createdAt": "Fri Dec 12 14:16:04 +0000 2025",
  "lang": "en",
  "bookmarkCount": 23,
  "isReply": false,
  "inReplyToId": null,
  "conversationId": "1999483394963701911",
  "displayTextRange": [
    0,
    279
  ],
  "inReplyToUserId": null,
  "inReplyToUsername": null,
  "author": {
    "type": "user",
    "userName": "omarsar0",
    "url": "https://x.com/omarsar0",
    "twitterUrl": "https://twitter.com/omarsar0",
    "id": "3448284313",
    "name": "elvis",
    "isVerified": false,
    "isBlueVerified": true,
    "verifiedType": null,
    "profilePicture": "https://pbs.twimg.com/profile_images/939313677647282181/vZjFWtAn_normal.jpg",
    "coverPicture": "https://pbs.twimg.com/profile_banners/3448284313/1565974901",
    "description": "Building @dair_ai • Ex Meta AI, Elastic, PhD • New cohort: https://t.co/xw2XQ0z8up",
    "location": "DAIR.AI Academy",
    "followers": 278977,
    "following": 733,
    "status": "",
    "canDm": true,
    "canMediaTag": true,
    "createdAt": "Fri Sep 04 12:59:26 +0000 2015",
    "entities": {
      "description": {
        "urls": [
          {
            "display_url": "dair-ai.thinkific.com/courses/buildi…",
            "expanded_url": "https://dair-ai.thinkific.com/courses/building-effective-ai-agents-2",
            "url": "https://t.co/xw2XQ0z8up",
            "indices": [
              59,
              82
            ]
          }
        ]
      },
      "url": {
        "urls": [
          {
            "display_url": "dair.ai",
            "expanded_url": "https://www.dair.ai/",
            "url": "https://t.co/XQto5ypkSM",
            "indices": [
              0,
              23
            ]
          }
        ]
      }
    },
    "fastFollowersCount": 0,
    "favouritesCount": 33895,
    "hasCustomTimelines": true,
    "isTranslator": false,
    "mediaCount": 4375,
    "statusesCount": 16725,
    "withheldInCountries": [],
    "affiliatesHighlightedLabel": {},
    "possiblySensitive": false,
    "pinnedTweetIds": [
      "1999135611392053586"
    ],
    "profile_bio": {},
    "isAutomated": false,
    "automatedBy": null
  },
  "extendedEntities": {
    "media": [
      {
        "display_url": "pic.x.com/wlXGsc3Yfz",
        "expanded_url": "https://x.com/omarsar0/status/1999483394963701911/photo/1",
        "id_str": "1999483390245109761",
        "indices": [
          280,
          303
        ],
        "media_key": "3_1999483390245109761",
        "media_url_https": "https://pbs.twimg.com/media/G7-XjLnagAEgcbu.jpg",
        "type": "photo",
        "url": "https://t.co/wlXGsc3Yfz",
        "ext_media_availability": {
          "status": "Available"
        },
        "features": {
          "large": {
            "faces": [
              {
                "x": 278,
                "y": 962,
                "h": 81,
                "w": 81
              }
            ]
          },
          "medium": {
            "faces": [
              {
                "x": 191,
                "y": 664,
                "h": 55,
                "w": 55
              }
            ]
          },
          "small": {
            "faces": [
              {
                "x": 108,
                "y": 376,
                "h": 31,
                "w": 31
              }
            ]
          },
          "orig": {
            "faces": [
              {
                "x": 278,
                "y": 962,
                "h": 81,
                "w": 81
              }
            ]
          }
        },
        "sizes": {
          "large": {
            "h": 1738,
            "w": 1532,
            "resize": "fit"
          },
          "medium": {
            "h": 1200,
            "w": 1058,
            "resize": "fit"
          },
          "small": {
            "h": 680,
            "w": 599,
            "resize": "fit"
          },
          "thumb": {
            "h": 150,
            "w": 150,
            "resize": "crop"
          }
        },
        "original_info": {
          "height": 1738,
          "width": 1532,
          "focus_rects": [
            {
              "x": 0,
              "y": 0,
              "w": 1532,
              "h": 858
            },
            {
              "x": 0,
              "y": 0,
              "w": 1532,
              "h": 1532
            },
            {
              "x": 0,
              "y": 0,
              "w": 1525,
              "h": 1738
            },
            {
              "x": 0,
              "y": 0,
              "w": 869,
              "h": 1738
            },
            {
              "x": 0,
              "y": 0,
              "w": 1532,
              "h": 1738
            }
          ]
        },
        "media_results": {
          "result": {
            "media_key": "3_1999483390245109761"
          }
        }
      }
    ]
  },
  "card": null,
  "place": {},
  "entities": {
    "hashtags": [],
    "symbols": [],
    "urls": [
      {
        "display_url": "openreview.net/pdf?id=NlkykTq…",
        "expanded_url": "https://openreview.net/pdf?id=NlkykTqAId",
        "url": "https://t.co/mpLvne0gGk",
        "indices": [
          2265,
          2288
        ]
      },
      {
        "display_url": "dair-ai.thinkific.com",
        "expanded_url": "https://dair-ai.thinkific.com/",
        "url": "https://t.co/JBU5beIoD0",
        "indices": [
          2335,
          2358
        ]
      }
    ],
    "user_mentions": []
  },
  "quoted_tweet": null,
  "retweeted_tweet": null,
  "article": null
}