@omarsar0
This paper is a big deal! It's well known that RL works great for math and code. But RL for training agents is a different story. The default approach to training LLM agents today is based on methods like ReAct-style reasoning loops, human-designed workflows, and fixed tool-calling patterns. The issue is that these methods treat the environment as passive rather than interactive. But in the real world, agents must make sequential decisions, maintain memory across turns, and adapt to stochastic environmental feedback. That's fundamentally an RL problem.

This new research introduces Agent-R1, a framework for training LLM agents with end-to-end reinforcement learning across multi-turn interactions. As agents move from predefined workflows to autonomous interaction, end-to-end RL becomes the natural training paradigm. Agent-R1 provides a modular foundation for scaling RL to complex, tool-using LLM agents.

Standard RL for LLMs assumes deterministic state transitions: you generate a token, append it to the sequence, done. But agents trigger external tools with uncertain outcomes. The environment responds unpredictably, and state transitions become stochastic. The researchers therefore extend the Markov Decision Process framework to capture this:

- The state space expands to include the full interaction history and environmental feedback.
- Actions can trigger external tools, not just generate text.
- Rewards become dense, with process rewards for intermediate steps alongside final outcome rewards.

Two core mechanisms make this work. An Action Mask distinguishes agent-generated tokens from environmental feedback, ensuring credit assignment targets only the agent's actual decisions. A ToolEnv module manages the interaction loop, handling state transitions and reward calculation when tools are invoked.

On multi-hop question answering, RL-trained agents dramatically outperform baselines. The weakest RL algorithm (REINFORCE++) still beat naive RAG by 2.5x on average exact match (EM).
GRPO achieved 0.3877 average EM, compared to 0.1328 for RAG.

Ablation results also confirm that the design matters. Disabling the advantage mask dropped PPO performance from 0.3719 to 0.3136, and disabling the loss mask caused further degradation to 0.3022. Precise credit assignment is essential for multi-turn learning.

Paper: https://t.co/BrIBT3AAxC

Learn to build effective AI agents in my academy: https://t.co/JBU5beIoD0
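The multi-turn loop with dense rewards described above can be sketched in a few lines. This is a toy illustration under my own assumptions — the names (`ToyToolEnv`, `rollout`, `scripted_policy`) are hypothetical, not Agent-R1's actual API:

```python
# Minimal sketch of the interaction loop a ToolEnv-style module manages.
# All names here are hypothetical, not Agent-R1's actual API.

class ToyToolEnv:
    """Toy environment: a lookup 'tool' plus a terminal answer step."""

    def __init__(self, facts):
        self.facts = facts

    def reset(self):
        self.history = ["question: What is the capital of France?"]
        return self.history

    def step(self, action):
        if action.startswith("search:"):
            key = action.split(":", 1)[1].strip()
            # Tool output is outside the agent's control -- in general this
            # transition is stochastic; it's deterministic here for the sketch.
            obs = self.facts.get(key, "no result")
            self.history += [action, obs]
            return self.history, 0.1, False          # small process reward
        self.history.append(action)
        reward = 1.0 if "Paris" in action else 0.0   # final outcome reward
        return self.history, reward, True

def rollout(policy, env, max_turns=8):
    """Collect one trajectory of (action, reward) pairs."""
    state = env.reset()
    trajectory = []
    for _ in range(max_turns):
        action = policy(state)                  # text or a tool call
        state, reward, done = env.step(action)  # tool output joins the state
        trajectory.append((action, reward))
        if done:
            break
    return trajectory

# Scripted stand-in for the LLM policy: search first, then answer.
def scripted_policy(state):
    return "search: France" if len(state) == 1 else "answer: Paris"

traj = rollout(scripted_policy, ToyToolEnv({"France": "Paris is the capital of France."}))
```

Note how tool output enters the state but is never an agent decision — that's the distinction the training loop has to respect.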
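The credit-assignment point behind those ablations can be sketched as a masked policy-gradient loss. All numbers and names here are illustrative assumptions, and this is a plain REINFORCE-style loss rather than the PPO/GRPO objectives used in the paper:

```python
# Sketch of token-level action masking: gradient flows only through
# agent-generated tokens, never through tool/environment output the
# agent did not choose. Illustrative values, not from the paper.

def masked_policy_loss(logprobs, advantages, action_mask):
    """Policy-gradient loss averaged over agent-generated tokens only."""
    num = sum(lp * adv * m for lp, adv, m in zip(logprobs, advantages, action_mask))
    den = sum(action_mask)
    return -num / max(den, 1.0)

# A 6-token turn: tokens 0-2 are agent reasoning, 3-4 are a tool response
# appended to the context, and token 5 is the agent's answer.
logprobs   = [-0.5, -0.2, -0.1, -2.0, -1.5, -0.3]
advantages = [ 1.0,  1.0,  1.0,  1.0,  1.0,  1.0]
mask       = [ 1.0,  1.0,  1.0,  0.0,  0.0,  1.0]  # 0.0 = environment token

loss = masked_policy_loss(logprobs, advantages, mask)  # only 4 tokens count
```

Without the mask, the large negative log-probs of the tool tokens would dominate the loss and the agent would be credited (or blamed) for text it never generated.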