🐦 Twitter Post Details

@iScienceLuvr

Reward Model Ensembles Help Mitigate Overoptimization

abs: https://t.co/JEqdksPnT5

RLHF can struggle with overoptimization, where the policy gets better according to the learned reward model but its true reward is actually worse. Building off Gao et al. 2023, here it is demonstrated that utilizing ensembles of reward models both mitigate overoptimization and also improve overall performance.
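The post does not say how the ensemble members' scores are combined, so the following is only a minimal sketch of the general idea, under stated assumptions. It compares three plausible aggregation rules (plain mean, worst case, and a variance-penalized mean) on hand-written scores. The function names, the `penalty` parameter, and the toy numbers are all hypothetical and are not taken from the post or the linked paper.

```python
from statistics import mean, pvariance
from typing import Sequence


def mean_reward(scores: Sequence[float]) -> float:
    """Average the ensemble members' scores (no conservatism)."""
    return mean(scores)


def worst_case_reward(scores: Sequence[float]) -> float:
    """Use the most pessimistic member, so one exploited model cannot inflate the reward."""
    return min(scores)


def uncertainty_penalized_reward(scores: Sequence[float], penalty: float = 1.0) -> float:
    """Mean score minus a penalty on member disagreement (population variance).

    `penalty` is a hypothetical coefficient chosen for illustration.
    """
    return mean(scores) - penalty * pvariance(scores)


# Toy scores from three hypothetical reward models for two candidate responses.
agreed = [1.0, 1.0, 1.0]     # all members agree the response is decent
exploited = [3.0, 0.0, 0.0]  # one member is fooled, the other two are not

for name, scores in [("agreed", agreed), ("exploited", exploited)]:
    print(
        name,
        mean_reward(scores),
        worst_case_reward(scores),
        uncertainty_penalized_reward(scores),
    )
```

In this toy setup the plain mean scores both responses identically, while the two conservative rules rank the response that fools only one member lower. That is one way an ensemble could discourage a policy from exploiting a single reward model's errors.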

📊 Media Metadata

{
  "media": [
    {
      "url": "https://crmoxkoizveukayfjuyo.supabase.co/storage/v1/object/public/media/posts/1709738453829877844/media_0.png",
      "type": "photo",
      "original_url": "https://pbs.twimg.com/media/F7o17YgbAAAh3FJ.png",
      "download_date": "2025-08-13T06:01:57.345574",
      "stored_in_supabase": true
    },
    {
      "url": "https://crmoxkoizveukayfjuyo.supabase.co/storage/v1/object/public/media/posts/1709738453829877844/media_0.jpg",
      "type": "photo",
      "original_url": "https://pbs.twimg.com/media/F7o2ATIbcAE8qrw.jpg",
      "download_date": "2025-08-13T06:01:58.716261",
      "stored_in_supabase": true
    }
  ],
  "conversion_date": "2025-08-13T00:40:04.360270",
  "format_converted": true,
  "original_structure": "had_media_only"
}

🔧 Raw API Response

{
  "user": {
    "created_at": "2011-12-20T03:45:50.000Z",
    "default_profile_image": false,
    "description": "PhD at 19 |\nFounder and CEO at @MedARC_AI |\nResearch Director at @StabilityAI | \n@kaggle Notebooks GM |\nBiomed. engineer @ 14 |\nTEDx talk➡https://t.co/DwMkst4bnG",
    "fast_followers_count": 0,
    "favourites_count": 59837,
    "followers_count": 44964,
    "friends_count": 994,
    "has_custom_timelines": true,
    "is_translator": false,
    "listed_count": 691,
    "location": "",
    "media_count": 1186,
    "name": "Tanishq Mathew Abraham, PhD",
    "normal_followers_count": 44964,
    "possibly_sensitive": false,
    "profile_banner_url": "https://pbs.twimg.com/profile_banners/441465751/1675968078",
    "profile_image_url_https": "https://pbs.twimg.com/profile_images/1553508977735962624/nnlSwBmu_normal.jpg",
    "screen_name": "iScienceLuvr",
    "statuses_count": 12033,
    "translator_type": "none",
    "url": "https://t.co/nNzCz2VVd1",
    "verified": false,
    "withheld_in_countries": [],
    "id_str": "441465751"
  },
  "id": "1709738453829877844",
  "conversation_id": "1709738453829877844",
  "full_text": "Reward Model Ensembles Help Mitigate Overoptimization\n\nabs: https://t.co/JEqdksPnT5\n\nRLHF can struggle with overoptimization, where the policy gets better according to the learned reward model but its true reward is actually worse. Building off Gao et al. 2023, here it is demonstrated that utilizing ensembles of reward models both mitigate overoptimization and also improve overall performance.",
  "reply_count": 0,
  "retweet_count": 13,
  "favorite_count": 75,
  "hashtags": [],
  "symbols": [],
  "user_mentions": [],
  "urls": [
    {
      "url": "https://t.co/JHqI4Eh1D8",
      "expanded_url": "https://arxiv.org/abs/2310.02743",
      "display_url": "arxiv.org/abs/2310.02743"
    }
  ],
  "media": [
    {
      "media_url": "https://pbs.twimg.com/media/F7o17YgbAAAh3FJ.png",
      "type": "photo"
    },
    {
      "media_url": "https://pbs.twimg.com/media/F7o2ATIbcAE8qrw.jpg",
      "type": "photo"
    }
  ],
  "url": "https://twitter.com/iScienceLuvr/status/1709738453829877844",
  "created_at": "2023-10-05T01:13:07.000Z",
  "#sort_index": "1709738453829877844",
  "view_count": 11897,
  "quote_count": 0,
  "is_quote_tweet": false,
  "is_retweet": false,
  "is_pinned": false,
  "is_truncated": true,
  "startUrl": "https://twitter.com/iscienceluvr/status/1709738453829877844"
}