🐦 Twitter Post Details

@rasbt

I just read the "Thinking LLMs: General Instruction Following With Thought Generation" paper (I), which offers a simple yet effective way to improve the response quality of instruction-finetuned LLMs. Thinking of it as a very simple alternative to OpenAI's o1 model, which produces better answers via internal "thinking" yet only shows you the final response, not the thinking process. The idea of the proposed Thought Preference Optimization (TPO) is to incorporate a Chain-of-Thought-style prompting/reasoning into the training. However, a) just asking the model to "think" via Chain-of-Thought prompting can reduce response accuracy b) training on Chain-of-Thought data would be hard because human thought processes are usually not included in instruction datasets So, their idea is this (see figure below): 1) Modify the prompt with a Chain-of-Thought style: "think before responding." 2) Use an LLM judge to evaluate the responses (excluding the thoughts generated by the LLM) 3) Form preference pairs for DPO based on the rejected and preferred responses (these responses include the thoughts) This way, the LLM implicitly learns to optimize its thinking process to produce better responses. (Note that the thinking process doesn't need to be shown to the user in a way similar to how it's not shown to the judge LLM.) The results, based on Llama 3 8B Instruct, show that this TPO approach works quite well: i) Interestingly, if the thought prompt is prepended but the Llama 3 8B Instruct base model doesn't undergo DPO finetuning on the preference pairs, this base model performs much worse than without the thought prompt ii) Finetuning the model on the instruction data (direct response baseline) without thought prompt improves the base model performance already by a lot, about 27.6% points on AlpacaEval and 17% on Arena-Hard; this shows how important finetuning is in general iii) Now, adding the thought preference optimization further boosts the performance by 4% Note that this method is applied to general instruction-response answering and is not specific to logic or math tasks.

🔧 Raw API Response

{
  "user": {
    "created_at": "2012-10-07T02:06:16.000Z",
    "default_profile_image": false,
    "description": "AI & ML researcher. Author of the \"Build a Large Language Model From Scratch\" book (https://t.co/O8LAAMRzzW). LLM research engineer @LightningAI.",
    "fast_followers_count": 0,
    "favourites_count": 20732,
    "followers_count": 301488,
    "friends_count": 952,
    "has_custom_timelines": true,
    "is_translator": false,
    "listed_count": 3896,
    "location": "United States",
    "media_count": 1825,
    "name": "Sebastian Raschka",
    "normal_followers_count": 301488,
    "possibly_sensitive": false,
    "profile_banner_url": "https://pbs.twimg.com/profile_banners/865622395/1726358218",
    "profile_image_url_https": "https://pbs.twimg.com/profile_images/1661187442043486209/a3E4t1eV_normal.jpg",
    "screen_name": "rasbt",
    "statuses_count": 16774,
    "translator_type": "none",
    "url": "https://t.co/ZRGQFNZXOv",
    "verified": true,
    "withheld_in_countries": [],
    "id_str": "865622395"
  },
  "id": "1850177459930497118",
  "conversation_id": "1850177459930497118",
  "full_text": "I just read the \"Thinking LLMs: General Instruction Following With Thought Generation\" paper (I), which offers a simple yet effective way to improve the response quality of instruction-finetuned LLMs. \n\nThinking of it as a very simple alternative to OpenAI's o1 model, which produces better answers via internal \"thinking\" yet only shows you the final response, not the thinking process.\n\nThe idea of the proposed Thought Preference Optimization (TPO) is to incorporate a Chain-of-Thought-style prompting/reasoning into the training. However, \n\na) just asking the model to \"think\" via Chain-of-Thought prompting can reduce response accuracy\n\nb) training on Chain-of-Thought data would be hard because human thought processes are usually not included in instruction datasets\n\nSo, their idea is this (see figure below): \n\n1) Modify the prompt with a Chain-of-Thought style: \"think before responding.\"\n\n2) Use an LLM judge to evaluate the responses (excluding the thoughts generated by the LLM)\n\n3) Form preference pairs for DPO based on the rejected and preferred responses (these responses include the thoughts)\n\nThis way, the LLM implicitly learns to optimize its thinking process to produce better responses. (Note that the thinking process doesn't need to be shown to the user in a way similar to how it's not shown to the judge LLM.)\n\nThe results, based on Llama 3 8B Instruct, show that this TPO approach works quite well:\n\ni) Interestingly, if the thought prompt is prepended but the Llama 3 8B Instruct base model doesn't undergo DPO finetuning on the preference pairs, this base model performs much worse than without the thought prompt\n\nii) Finetuning the model on the instruction data (direct response baseline) without thought prompt improves the base model performance already by a lot, about 27.6% points on AlpacaEval and 17% on Arena-Hard; this shows how important finetuning is in general\n\niii) Now, adding the thought preference optimization further boosts the performance by 4%\n\nNote that this method is applied to general instruction-response answering and is not specific to logic or math tasks.",
  "reply_count": 21,
  "retweet_count": 199,
  "favorite_count": 1016,
  "hashtags": [],
  "symbols": [],
  "user_mentions": [],
  "urls": [],
  "media": [
    {
      "media_url": "https://pbs.twimg.com/media/Ga0mcQkaUAAhMii.jpg",
      "type": "photo"
    }
  ],
  "url": "https://twitter.com/rasbt/status/1850177459930497118",
  "created_at": "2024-10-26T14:07:35.000Z",
  "#sort_index": "1850177459930497118",
  "view_count": 80134,
  "quote_count": 6,
  "is_quote_tweet": false,
  "is_retweet": false,
  "is_pinned": false,
  "is_truncated": true,
  "startUrl": "https://x.com/rasbt/status/1850177459930497118"
}