🐦 Twitter Post Details

@omarsar0

Designing reward functions for RL agents is kind of broken.

The default approach remains manual engineering: domain experts iteratively craft reward signals through trial-and-error. This requires significant expertise, takes enormous human effort, and often fails when task complexity increases.

But what if agents could discover their own optimal reward functions?

This new research introduces a bilevel optimization framework that automatically discovers optimal reward functions for embodied RL agents through regret minimization.

The optimal reward function can be defined as one that minimizes the gap between the learned policy and the true optimal policy. No expert demonstrations needed. No human feedback required.

How it works: Two optimization levels run simultaneously. The lower level trains the RL agent to maximize rewards as usual. The upper level continuously updates the reward function itself, guided by a meta-gradient that minimizes policy regret. The reward function learns to assign high values to critical states like success or failure, while providing dense feedback throughout the state space.

The framework works across both value-based agents (DQN) and policy-based agents (PPO, SAC, TD3) without task-specific tuning.

In data center energy management, all RL agents using discovered rewards achieved energy reductions exceeding 60%, compared to 21-52% for baseline RL. In UAV trajectory tracking, the approach enabled PPO agents to successfully complete tasks where hand-designed rewards failed entirely. In sparse-reward OpenAI benchmarks, agents using discovered rewards outperformed baselines in both convergence speed and final performance.

The discovered reward functions also reveal interpretable structure: they automatically identify critical states and encode latent relationships between states and rewards that match physics-based reward designs, despite having no explicit mathematical model.

Paper: https://t.co/W9fRH6sbDq

Learn to build effective AI Agents in our academy: https://t.co/JBU5beIoD0
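
The two-level loop described in the post can be illustrated with a toy sketch. This is not the paper's algorithm or code. It assumes a 4-armed softmax bandit, a tabular learned reward (one value per action), exact gradients instead of sampled ones, and an upper level that can observe a sparse task outcome. All function names (`inner_step`, `meta_gradient`, `discover_reward`) and the learning rates are hypothetical choices made for this example.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    top = max(logits)
    exps = [math.exp(x - top) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def expected(policy, values):
    """Expected value of `values` under the action distribution `policy`."""
    return sum(p * v for p, v in zip(policy, values))

def inner_step(theta, reward, lr):
    """Lower level: one exact policy-gradient ascent step on the learned reward."""
    pi = softmax(theta)
    baseline = expected(pi, reward)
    return [t + lr * p * (r - baseline) for t, p, r in zip(theta, pi, reward)]

def meta_gradient(theta, reward, outcome, lr):
    """Upper level: d(true return after the inner step) / d(reward values)."""
    pi = softmax(theta)
    pi_next = softmax(inner_step(theta, reward, lr))
    j_next = expected(pi_next, outcome)
    # Gradient of the true return w.r.t. the updated logits (softmax policy).
    dj = [p * (g - j_next) for p, g in zip(pi_next, outcome)]
    # Chain rule through theta'_i = theta_i + lr * pi_i * (r_i - sum_a pi_a * r_a).
    weighted = [d * p for d, p in zip(dj, pi)]
    total = sum(weighted)
    return [lr * (w - p * total) for w, p in zip(weighted, pi)]

def discover_reward(outcome, steps=500, inner_lr=1.0, outer_lr=5.0):
    """Alternate upper-level reward updates with lower-level policy updates."""
    theta = [0.0] * len(outcome)   # policy logits
    reward = [0.0] * len(outcome)  # learned reward, one value per action
    for _ in range(steps):
        grad = meta_gradient(theta, reward, outcome, inner_lr)
        # Ascent on the true return is descent on regret: the best return is constant.
        reward = [r + outer_lr * g for r, g in zip(reward, grad)]
        theta = inner_step(theta, reward, inner_lr)
    regret = max(outcome) - expected(softmax(theta), outcome)
    return reward, theta, regret

# Sparse task outcome (assumed for the toy): only action 2 succeeds.
reward, theta, regret = discover_reward([0.0, 0.0, 1.0, 0.0])
print("learned reward:", [round(r, 3) for r in reward])
print("final regret:", regret)
```

In this toy, regret is the best achievable outcome minus the expected outcome under the current policy, so the upper level lowers regret by pushing the reward in the direction that most improves the policy after its next update. It only illustrates the structure of a meta-gradient update and says nothing about the results reported in the post.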

📊 Media Metadata

{
  "media": [
    {
      "url": "https://crmoxkoizveukayfjuyo.supabase.co/storage/v1/object/public/media/posts/2002008280030404647/media_0.jpg?",
      "media_url": "https://crmoxkoizveukayfjuyo.supabase.co/storage/v1/object/public/media/posts/2002008280030404647/media_0.jpg?",
      "type": "photo",
      "filename": "media_0.jpg"
    },
    {
      "url": "https://crmoxkoizveukayfjuyo.supabase.co/storage/v1/object/public/media/posts/2002008280030404647/media_1.jpg?",
      "media_url": "https://crmoxkoizveukayfjuyo.supabase.co/storage/v1/object/public/media/posts/2002008280030404647/media_1.jpg?",
      "type": "photo",
      "filename": "media_1.jpg"
    }
  ],
  "processed_at": "2025-12-19T14:35:06.370213",
  "pipeline_version": "2.0"
}

🔧 Raw API Response

{
  "type": "tweet",
  "id": "2002008280030404647",
  "url": "https://x.com/omarsar0/status/2002008280030404647",
  "twitterUrl": "https://twitter.com/omarsar0/status/2002008280030404647",
  "text": "Designing reward functions for RL agents is kind of broken.\n\nThe default approach remains manual engineering: domain experts iteratively craft reward signals through trial-and-error. This requires significant expertise, takes enormous human effort, and often fails when task complexity increases.\n\nBut what if agents could discover their own optimal reward functions?\n\nThis new research introduces a bilevel optimization framework that automatically discovers optimal reward functions for embodied RL agents through regret minimization.\n\nThe optimal reward function can be defined as one that minimizes the gap between the learned policy and the true optimal policy. No expert demonstrations needed. No human feedback required.\n\nHow it works: Two optimization levels run simultaneously. The lower level trains the RL agent to maximize rewards as usual. The upper level continuously updates the reward function itself, guided by a meta-gradient that minimizes policy regret. The reward function learns to assign high values to critical states like success or failure, while providing dense feedback throughout the state space.\n\nThe framework works across both value-based agents (DQN) and policy-based agents (PPO, SAC, TD3) without task-specific tuning.\n\nIn data center energy management, all RL agents using discovered rewards achieved energy reductions exceeding 60%, compared to 21-52% for baseline RL. In UAV trajectory tracking, the approach enabled PPO agents to successfully complete tasks where hand-designed rewards failed entirely. In sparse-reward OpenAI benchmarks, agents using discovered rewards outperformed baselines in both convergence speed and final performance.\n\nThe discovered reward functions also reveal interpretable structure: they automatically identify critical states and encode latent relationships between states and rewards that match physics-based reward designs, despite having no explicit mathematical model.\n\nPaper: https://t.co/W9fRH6sbDq\n\nLearn to build effective AI Agents in our academy: https://t.co/JBU5beIoD0",
  "source": "Twitter for iPhone",
  "retweetCount": 5,
  "replyCount": 3,
  "likeCount": 19,
  "quoteCount": 0,
  "viewCount": 1643,
  "createdAt": "Fri Dec 19 13:29:04 +0000 2025",
  "lang": "en",
  "bookmarkCount": 14,
  "isReply": false,
  "inReplyToId": null,
  "conversationId": "2002008280030404647",
  "displayTextRange": [
    0,
    275
  ],
  "inReplyToUserId": null,
  "inReplyToUsername": null,
  "author": {
    "type": "user",
    "userName": "omarsar0",
    "url": "https://x.com/omarsar0",
    "twitterUrl": "https://twitter.com/omarsar0",
    "id": "3448284313",
    "name": "elvis",
    "isVerified": false,
    "isBlueVerified": true,
    "verifiedType": null,
    "profilePicture": "https://pbs.twimg.com/profile_images/939313677647282181/vZjFWtAn_normal.jpg",
    "coverPicture": "https://pbs.twimg.com/profile_banners/3448284313/1565974901",
    "description": "Building @dair_ai • Prev: Meta AI, Elastic, PhD • New cohort: https://t.co/GZMhf39NRs",
    "location": "DAIR.AI Academy",
    "followers": 280091,
    "following": 739,
    "status": "",
    "canDm": true,
    "canMediaTag": true,
    "createdAt": "Fri Sep 04 12:59:26 +0000 2015",
    "entities": {
      "description": {
        "urls": [
          {
            "display_url": "dair-ai.thinkific.com/courses/claude…",
            "expanded_url": "https://dair-ai.thinkific.com/courses/claude-code-for-everyone-2",
            "url": "https://t.co/GZMhf39NRs",
            "indices": [
              62,
              85
            ]
          }
        ]
      },
      "url": {
        "urls": [
          {
            "display_url": "dair.ai",
            "expanded_url": "https://www.dair.ai/",
            "url": "https://t.co/XQto5ypkSM",
            "indices": [
              0,
              23
            ]
          }
        ]
      }
    },
    "fastFollowersCount": 0,
    "favouritesCount": 34044,
    "hasCustomTimelines": true,
    "isTranslator": false,
    "mediaCount": 4399,
    "statusesCount": 16791,
    "withheldInCountries": [],
    "affiliatesHighlightedLabel": {},
    "possiblySensitive": false,
    "pinnedTweetIds": [
      "2001714322817368472"
    ],
    "profile_bio": {
      "description": "Building @dair_ai • Prev: Meta AI, Elastic, PhD • New cohort: https://t.co/GZMhf39NRs"
    },
    "isAutomated": false,
    "automatedBy": null
  },
  "extendedEntities": {
    "media": [
      {
        "display_url": "pic.x.com/UmefspXbSA",
        "expanded_url": "https://x.com/omarsar0/status/2002008280030404647/photo/1",
        "id_str": "2002008276180082688",
        "indices": [
          276,
          299
        ],
        "media_key": "3_2002008276180082688",
        "media_url_https": "https://pbs.twimg.com/media/G8iP64sbIAAaYDv.jpg",
        "type": "photo",
        "url": "https://t.co/UmefspXbSA",
        "ext_media_availability": {
          "status": "Available"
        },
        "features": {
          "large": {
            "faces": []
          },
          "medium": {
            "faces": []
          },
          "small": {
            "faces": []
          },
          "orig": {
            "faces": []
          }
        },
        "sizes": {
          "large": {
            "h": 1762,
            "w": 1454,
            "resize": "fit"
          },
          "medium": {
            "h": 1200,
            "w": 990,
            "resize": "fit"
          },
          "small": {
            "h": 680,
            "w": 561,
            "resize": "fit"
          },
          "thumb": {
            "h": 150,
            "w": 150,
            "resize": "crop"
          }
        },
        "original_info": {
          "height": 1762,
          "width": 1454,
          "focus_rects": [
            {
              "x": 0,
              "y": 0,
              "w": 1454,
              "h": 814
            },
            {
              "x": 0,
              "y": 0,
              "w": 1454,
              "h": 1454
            },
            {
              "x": 0,
              "y": 0,
              "w": 1454,
              "h": 1658
            },
            {
              "x": 132,
              "y": 0,
              "w": 881,
              "h": 1762
            },
            {
              "x": 0,
              "y": 0,
              "w": 1454,
              "h": 1762
            }
          ]
        },
        "media_results": {
          "result": {
            "media_key": "3_2002008276180082688"
          }
        }
      }
    ]
  },
  "card": null,
  "place": {},
  "entities": {
    "hashtags": [],
    "symbols": [],
    "urls": [
      {
        "display_url": "nature.com/articles/s4146…",
        "expanded_url": "https://www.nature.com/articles/s41467-025-66009-y",
        "url": "https://t.co/W9fRH6sbDq",
        "indices": [
          1951,
          1974
        ]
      },
      {
        "display_url": "dair-ai.thinkific.com",
        "expanded_url": "https://dair-ai.thinkific.com/",
        "url": "https://t.co/JBU5beIoD0",
        "indices": [
          2027,
          2050
        ]
      }
    ],
    "user_mentions": []
  },
  "quoted_tweet": null,
  "retweeted_tweet": null,
  "article": null
}
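
A minimal sketch of reading a few fields from the response above. The dictionary is a hand-copied subset of the values shown, and the helper names (`parse_created_at`, `engagement_rate`) are hypothetical. The "engagement rate" here is an assumed definition (interactions divided by views), not a metric defined by the API. Note that `%a` and `%b` are locale-dependent, so the format string assumes an English or C locale.

```python
from datetime import datetime, timezone

# Subset of the fields shown in the raw response, copied by hand for illustration.
tweet = {
    "retweetCount": 5,
    "replyCount": 3,
    "likeCount": 19,
    "quoteCount": 0,
    "bookmarkCount": 14,
    "viewCount": 1643,
    "createdAt": "Fri Dec 19 13:29:04 +0000 2025",
}

def parse_created_at(value):
    """Parse the `createdAt` string into a timezone-aware datetime."""
    return datetime.strptime(value, "%a %b %d %H:%M:%S %z %Y")

def engagement_rate(t):
    """Assumed definition: all counted interactions divided by views."""
    keys = ("retweetCount", "replyCount", "likeCount", "quoteCount", "bookmarkCount")
    interactions = sum(t[k] for k in keys)
    return interactions / t["viewCount"] if t["viewCount"] else 0.0

created = parse_created_at(tweet["createdAt"])
rate = engagement_rate(tweet)
print(created.isoformat())
print(f"{rate:.4f}")
```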