@QuanquanGu
1/n 🚀 Introducing the General Preference Representation Model (GPM) and General Preference Optimization (GPO) for RLHF!

🎯 Reward modeling plays a central role in RLHF. Most existing reward models are based on the classical Bradley-Terry (BT) model. However, the BT model struggles with intransitive (e.g., cyclic) and otherwise complex human preferences.

💡 We introduce GPM, which lifts the BT model from a scalar-valued space to a vector-valued space via preference embeddings, retaining the training simplicity of the BT model while adding far greater expressiveness. Notably, GPM evaluates preferences among K responses with O(K) queries (one embedding per response), a significant improvement over the O(K^2) queries of traditional supervised preference models that rely on pairwise inputs. See the sketch below.

💡 Building on GPM, we propose GPO, which takes Self-Play Preference Optimization (SPPO) to new heights!

Paper: https://t.co/eDlRoc1LAp
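To make the embedding idea concrete, here is a minimal sketch of a skew-symmetric preference-representation score, why it can encode intransitive preferences, and why evaluation is O(K). The names (`skew_operator`, `preference_score`) and the random embeddings are illustrative assumptions, not the paper's actual API; the exact operator and embedding head in the paper may differ.

```python
import torch

torch.manual_seed(0)

def skew_operator(dim: int) -> torch.Tensor:
    """Block-diagonal skew-symmetric operator built from 2x2 rotation blocks.
    R^T = -R forces s(i > j) = -s(j > i), so cyclic (intransitive)
    preferences such as A > B > C > A become representable."""
    assert dim % 2 == 0, "embedding dimension must be even"
    block = torch.tensor([[0.0, -1.0], [1.0, 0.0]])
    return torch.block_diag(*[block] * (dim // 2))

def preference_score(v_i: torch.Tensor, v_j: torch.Tensor, R: torch.Tensor) -> torch.Tensor:
    """Preference score s(y_i > y_j) = <R v_i, v_j>; P(y_i > y_j) = sigmoid(s)."""
    return (R @ v_i) @ v_j

# BT as a special case: embed each response as (reward, 1) in 2D,
# and the score reduces to the familiar BT margin r_i - r_j.
R2 = skew_operator(2)
r_i, r_j = 1.3, 0.4
s = preference_score(torch.tensor([r_i, 1.0]), torch.tensor([r_j, 1.0]), R2)
assert torch.isclose(s, torch.tensor(r_i - r_j))

# O(K) evaluation: one embedding query per response (V stands in for
# the K cached embeddings), then all K^2 pairwise probabilities fall
# out of cheap dot products -- no K^2 forward passes through the model.
K, dim = 5, 8
R = skew_operator(dim)
V = torch.randn(K, dim)               # placeholder per-response embeddings
probs = torch.sigmoid(V @ R.T @ V.T)  # probs[i, j] = P(y_i > y_j)
print(probs)
```

The key design point is the skew-symmetric pairing: a scalar BT reward can only induce transitive rankings, whereas the bilinear form above can assign consistent scores to preference cycles while still training like a BT model.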