@rohanpaul_ai
Run the Qwen 2.5 72B model on just a 4GB GPU 👨‍🔧 Without quantization, distillation, pruning, or other model compression techniques.

🔥 Normally this model would require around 276GB of VRAM at full precision. But just for the sake of it, you can run it on a 4GB GPU. Similarly, you can run the 405B Llama 3.1 on 8GB of VRAM. All of this is done with the airllm library and layered inference.

📌 The secret is layer-wise inference, which is essentially a "divide and conquer" approach.

💡 Note: it will not be usable for any serious use case, but this amazing repo shows that it's possible.

----

📌 Large language models occupy so much memory mainly because of their structure: they contain many "layers." An LLM starts with an embedding projection layer, followed by numerous transformer layers, all identical in structure. A 70B-class model has as many as 80 layers. But during inference, each layer is independent, relying only on the output of the previous layer. Therefore, after running a layer, its memory can be released, keeping only that layer's output. Based on this concept, AirLLM implements layered inference.

How 👉 During inference in a Transformer-based LLM, layers execute sequentially: the output of the previous layer is the input to the next, and only one layer runs at a time. So it is completely unnecessary to keep all layers in GPU memory. We can load whichever layer is needed from disk when executing that layer, do all the calculations, and then completely free its memory afterwards. This way, the GPU memory required is only about the parameter size of a single transformer layer, 1/80 of the full model, around 1.6GB. (A rough sketch of this loop is at the end of this post.)

📌 Flash attention is then used to deeply optimize CUDA memory access, achieving multi-fold speedups.

📌 Model files are sharded by layer, so each layer can be loaded from disk individually.

📌 The meta device feature provided by HuggingFace Accelerate is used: when you load a model via the meta device, the model data is not actually read in, only the code is loaded, so memory usage is 0.

📌 Quantization is available as an option via the `compression` param. Supported options: `4bit`, `8bit`, for 4-bit or 8-bit block-wise quantization.
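To make the layer-wise idea concrete, here is a minimal sketch of what such a loop could look like. This is not AirLLM's actual implementation; the one-file-per-layer layout and the shape of `hidden_states` are assumptions for illustration.

```python
import torch

def layerwise_forward(layer_files, hidden_states, device="cuda"):
    """Run a stack of transformer layers one at a time, loading each from disk.

    layer_files: list of paths, one serialized layer module per file (hypothetical layout).
    hidden_states: embedded input tokens, shape (batch, seq_len, hidden_dim).
    """
    for path in layer_files:
        # Load only this layer's weights onto the GPU (~1/80 of the full model).
        layer = torch.load(path, map_location=device)
        layer.eval()

        with torch.no_grad():
            # The previous layer's output is the only state the next layer needs.
            # (Real decoder layers also take attention masks / position ids; omitted here.)
            hidden_states = layer(hidden_states)

        # Free this layer's parameters before loading the next one.
        del layer
        torch.cuda.empty_cache()

    return hidden_states
```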
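The meta-device trick can be seen with plain HuggingFace Accelerate. A minimal example (the model name is just for illustration):

```python
from accelerate import init_empty_weights
from transformers import AutoConfig, AutoModelForCausalLM

# Build the model's structure without reading any weights from disk.
config = AutoConfig.from_pretrained("Qwen/Qwen2.5-72B-Instruct")
with init_empty_weights():
    model = AutoModelForCausalLM.from_config(config)

# Every parameter lives on the "meta" device: the module structure exists,
# but no tensor data has been allocated, so memory usage is ~0.
print(next(model.parameters()).device)  # -> meta
```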
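And a short usage sketch with the `compression` param, roughly following the usage pattern shown in the airllm README (check the repo for the exact current API; the model name and prompt are placeholders):

```python
from airllm import AutoModel

MAX_LENGTH = 128

# compression='4bit' enables 4-bit block-wise quantization; use '8bit' for 8-bit.
model = AutoModel.from_pretrained("Qwen/Qwen2.5-72B-Instruct", compression='4bit')

input_text = ["What is the capital of France?"]
input_tokens = model.tokenizer(input_text,
                               return_tensors="pt",
                               truncation=True,
                               max_length=MAX_LENGTH)

generation_output = model.generate(input_tokens['input_ids'].cuda(),
                                   max_new_tokens=20,
                                   use_cache=True,
                                   return_dict_in_generate=True)

print(model.tokenizer.decode(generation_output.sequences[0]))
```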