🐦 Twitter Post Details

@rasbt

While everyone is talking about Sora, there's a potential successor to LoRA (low-rank adaptation) called DoRA. Here's a closer look at the "DoRA: Weight-Decomposed Low-Rank Adaptation" paper: https://t.co/nDHYeoSUPf

LoRA is probably the most widely used parameter-efficient finetuning method for LLMs and vision transformers, and DoRA can be seen as an extension of LoRA that builds on top of it.

A brief LoRA recap: Assuming we have pretrained model weights W, LoRA uses low-rank matrices to approximate the weight changes ΔW. I.e., in regular finetuning we have W' = W + ΔW, and in LoRA, we approximate ΔW with BA.

Now, the DoRA method first decomposes the pretrained weight matrix into a magnitude vector (m) and a directional matrix (V). Then, it takes the directional matrix V and applies standard LoRA to it, i.e., W' = m (V + ΔV)/norm = m (W + BA)/norm.

The motivation for developing this method comes from analyzing and comparing the learning patterns of LoRA and full finetuning. The researchers found that LoRA increases or decreases magnitude and direction updates proportionally, but seems to lack the capability of making only subtle directional changes, as found in full finetuning. Hence, they propose decoupling the magnitude and directional components. In other words, their DoRA method applies LoRA only to the directional component, while also allowing the magnitude component to be trained separately.

Note that introducing the magnitude vector m in DoRA adds only 0.01% more parameters than standard LoRA. However, across both LLM and vision transformer benchmarks, they found that DoRA even outperforms LoRA if the DoRA rank is halved, i.e., when DoRA only uses half the parameters of regular LoRA.

Overall, I am actually quite impressed by the results and need to toy with this method in practice. It should not be too big of a lift to upgrade a LoRA implementation to DoRA.
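The update rules above can be sketched in a few lines of numpy. This is a minimal illustration of the math, not the paper's implementation: the matrix shapes, rank, and zero-initialization of B are illustrative assumptions, and the magnitude/norm are taken column-wise over the weight matrix (as in the decomposition the tweet describes, with V initialized to W).

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, r = 6, 4, 2          # toy dimensions; rank r << min(d_out, d_in)

W = rng.normal(size=(d_out, d_in))  # pretrained weight (frozen)

# LoRA: approximate the weight change dW with a low-rank product BA.
# B is initialized to zero so that BA = 0 and W' = W before any training.
B = np.zeros((d_out, r))
A = rng.normal(size=(r, d_in))
W_lora = W + B @ A                  # W' = W + BA

# DoRA: decompose W into a magnitude vector m (per-column norms) and a
# direction, apply LoRA to the directional part, then renormalize:
# W' = m * (W + BA) / ||W + BA||   (column-wise norms)
m = np.linalg.norm(W, axis=0, keepdims=True)   # magnitude vector, shape (1, d_in)
V_adapted = W + B @ A
W_dora = m * V_adapted / np.linalg.norm(V_adapted, axis=0, keepdims=True)

# With B zero-initialized, both LoRA and DoRA reproduce W exactly at the start.
print(np.allclose(W_lora, W), np.allclose(W_dora, W))
```

In training, m, B, and A would be the only trainable parameters while W stays frozen; m contributes just one scalar per weight column, which is where the ~0.01% parameter overhead comes from.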

🔧 Raw API Response

{
  "user": {
    "created_at": "2012-10-07T02:06:16.000Z",
    "default_profile_image": false,
    "description": "Machine learning & AI researcher writing at https://t.co/A0tXWzG1p5 • LLM research engineer @LightningAI • ex-stats professor at UW-Madison",
    "fast_followers_count": 0,
    "favourites_count": 18987,
    "followers_count": 254809,
    "friends_count": 858,
    "has_custom_timelines": true,
    "is_translator": false,
    "listed_count": 3577,
    "location": "United States",
    "media_count": 1632,
    "name": "Sebastian Raschka",
    "normal_followers_count": 254809,
    "possibly_sensitive": false,
    "profile_banner_url": "https://pbs.twimg.com/profile_banners/865622395/1663716396",
    "profile_image_url_https": "https://pbs.twimg.com/profile_images/1661187442043486209/a3E4t1eV_normal.jpg",
    "screen_name": "rasbt",
    "statuses_count": 15078,
    "translator_type": "none",
    "url": "https://t.co/HrtQQ5tgJl",
    "verified": true,
    "withheld_in_countries": [],
    "id_str": "865622395"
  },
  "id": "1758502685995589698",
  "conversation_id": "1758502685995589698",
  "full_text": "While everyone is talking about Sora, there's a potential successor to LoRA (low-rank adaptation) called DoRA. Here's a closer look at the \"DoRA: Weight-Decomposed Low-Rank Adaptation\" paper: https://t.co/nDHYeoSUPf\n\nLoRA is probably the most widely used parameter-efficient finetuning method for LLMs and vision transformers, and DoRA can be seen as an improvement or extension of LoRA that is built on top of it. \n\nA brief LoRA recap: Assuming we have pretrained model weights W, LoRA uses low-rank matrices to approximate weight changes ΔW. I.e., in regular finetuning we have W' = W + ΔW, and in LoRA, we approximate ΔW with BA.\n\nNow, the DoRA method first decomposes the pretrained weight matrix into a magnitude vector (m) and a directional matrix (V). Then, it takes the directional matrix V and applies standard LoRA to it, i.e., W' = m (V + ΔV)/norm = m (W + BA)/norm.\n\nThe motivation for developing this method is based on analyzing and comparing the LoRA and full finetuning learning patterns. They found that LoRA either increases or decreases magnitude and direction updates proportionally but seems to lack the capability of making only subtle directional changes as found in full finetuning. Hence, the researchers propose the decoupling of magnitude and directional components. In other words, their DoRA method aims to apply LoRA only to the directional component (while also allowing the magnitude component to be trained separably.)\n\nNote that introducing the magnitude vector m in DoRA adds 0.01% more parameters than standard LoRA. However, across both LLM and vision transformer benchmarks, they found that DoRA even outperforms LoRA if the DoRA rank is halved, i.e., when DoRA only uses half the parameters of regular LoRA.\n\nOverall, I am actually quite impressed by the results and need to toy with this method in practice. It should not be too big of a lift to upgrade a LoRA implementation to DoRA.",
  "reply_count": 30,
  "retweet_count": 322,
  "favorite_count": 1681,
  "hashtags": [],
  "symbols": [],
  "user_mentions": [],
  "urls": [
    {
      "url": "https://t.co/Mmjhy3xTpd",
      "expanded_url": "https://arxiv.org/abs/2402.09353",
      "display_url": "arxiv.org/abs/2402.09353"
    }
  ],
  "media": [
    {
      "media_url": "https://pbs.twimg.com/media/GGd015TW0AEnOlI.jpg",
      "type": "photo"
    }
  ],
  "url": "https://twitter.com/rasbt/status/1758502685995589698",
  "created_at": "2024-02-16T14:44:46.000Z",
  "#sort_index": "1758502685995589698",
  "view_count": 169914,
  "quote_count": 35,
  "is_quote_tweet": false,
  "is_retweet": false,
  "is_pinned": false,
  "is_truncated": true,
  "startUrl": "https://twitter.com/rasbt/status/1758502685995589698"
}