@TheAhmadOsman
Hugging Face has released a 214-page MASTERCLASS on how to train LLMs

> it’s called The Smol Training Playbook
> and if you want to learn how to train LLMs,
> this GIFT is for you

> this training bible walks you through the ENTIRE pipeline
> covers every concept that matters from why you train,
> to what you train, to how you actually pull it off
> from pre-training, to mid-training, to post-training

> it turns vague buzzwords into step-by-step decisions
> architecture, tokenization, data strategy, and infra
> highlights the real-world gotchas
> instabilities, scaling headaches, debugging nightmares
> distills lessons from building actual
> state-of-the-art LLMs, not just toy models

how modern transformer models are actually built

> tokenization: the secret foundation of every LLM
> tokenizer fundamentals
> vocabulary size
> byte pair encoding
> custom vs existing tokenizers

> all the modern attention mechanisms are here
> multi-head attention
> multi-query attention
> grouped-query attention
> multi-head latent attention

> every positional encoding trick in the book
> absolute position embedding
> rotary position embedding
> YaRN (yet another RoPE extension)
> ablate-by-frequency positional encoding
> no position embedding
> randomized no position embedding

> stability hacks that actually work
> z-loss regularization
> query-key normalization
> removing weight decay from embedding layers
> (tiny pytorch sketches of these at the end of this thread)

> sparse scaling, handled
> mixture-of-experts scaling
> activation ratio tuning
> choosing the right granularity
> sharing experts between layers
> load balancing across experts

> long-context handling via ssm
> hybrid models: transformer plus state space models

data curation = most of your real model quality

> data curation is the main driver of your model’s actual quality
> architecture alone won’t save you
> building the right data mixture is an art,
> not just dumping in more web scrapes

> curriculum learning, adaptive mixes, ablate everything
> you need curriculum learning:
> design data mixes that evolve as training progresses
> use adaptive mixtures that shift emphasis
> based on model stage and performance
> ablate everything: run experiments to systematically
> test how each data source or filter impacts results

> smollm3 data
> the smollm3 recipe: balanced english web data,
> broad multilingual sources, high-quality code, and diverse math datasets
> without the right data pipeline,
> even the best architecture will underperform

the training marathon

> do your preflight checklist or die
> check your infrastructure,
> validate your evaluation pipelines,
> set up logging, and configure alerts
> so you don’t miss silent failures

> scaling surprises are inevitable
> things will break at scale in ways they never did in testing

> vanishing throughput? that usually means
> you’ve got a hidden shape mismatch or
> batch dimension bug killing your GPU utilization

> sudden drops in throughput?
> check your software stack for inefficiencies,
> resource leaks, or bad dataloader code

> seeing noisy, spiky loss values?
> your data shuffling is probably broken,
> and the model is seeing repeated or ordered data

> performance worse than expected?
> look for subtle parallelism bugs
> tensor parallel, data parallel,
> or pipeline parallel gone rogue

> monitor like your GPUs depend on it (because they do)
> watch every metric, track utilization, spot anomalies fast

> mid-training is not autopilot
> swap in higher-quality data to improve learning,
> extend the context window if you want bigger inputs,
> and use multi-stage training curricula to maximize gains

> the difference between a good model and a failed run is
> almost always vigilance and relentless debugging during this marathon

post-training

> post-training is where your raw base model
> actually becomes a useful assistant

> always start with supervised fine-tuning (sft)
> use high-quality, well-structured chat data and
> pick a solid template for consistent turns
> sft gives you a stable, cost-effective baseline
> don’t skip it, even if you plan to go deeper

> next, optimize for user preferences
> direct preference optimization (dpo),
> or its variants like kto (kahneman-tversky),
> orpo (odds-ratio), or apo (anchored)
> these methods actually teach the model
> what “better” looks like beyond simple mimicry
> (a toy dpo loss sketch is at the end of this thread)

> once you’ve got preference alignment, go on-policy:
> reinforcement learning from human feedback (rlhf)
> or on-policy distillation, which lets your model learn
> from real interactions or stronger models
> this is how you get reliability and sharper behaviors

> the post-training pipeline is where
> assistants are truly sculpted;
> skipping steps means leaving performance,
> safety, and steerability on the table

infra is the boss fight

> this is where most teams lose time,
> money, and sanity if they’re not careful

> inside every gpu
> you’ve got tensor cores and cuda cores for the heavy math,
> plus a memory hierarchy (registers, shared memory, hbm)
> that decides how fast you can feed data to the compute units

> outside the gpu, your interconnects matter
> pcie for gpu-to-cpu,
> nvlink for ultra-fast gpu-to-gpu within a node,
> infiniband or roce for communication between nodes,
> and gpudirect storage for feeding massive datasets
> straight from disk to gpu memory

> make your infra resilient:
> checkpoint your training constantly,
> because something will crash;
> monitor node health so you can kill or restart
> sick nodes before they poison your run

> scaling isn’t just “add more gpus”
> you have to pick and tune the right parallelism:
> data parallelism (dp), pipeline parallelism (pp), tensor parallelism (tp),
> or fully sharded data parallel (fsdp);
> the right combo can double your throughput,
> the wrong one can bottleneck you instantly

to recap

> always start with WHY
> define the core reason you’re training a model
> is it research, a custom production need, or to fill an open-source gap?
> spec what you need: architecture, model size, data mix, assistant type
> transformer or hybrid
> set your model size
> design the right data mixture
> decide what kind of assistant or
> use case you’re targeting

> build infra for the job, plan for chaos, pick your stability tricks
> build infrastructure that matches your goals
> choose the right GPUs
> set up reliable storage
> and plan for network bottlenecks
> expect failures, weird bugs,
> and sudden bottlenecks at scale
> select your stability tricks in advance:
> know which techniques you’ll use to fight loss spikes,
> unstable gradients, and hardware hiccups

closing notes

> the pace of LLM development is relentless,
> but the underlying principles never go out of style
> and this PDF covers what actually matters
> no matter how fast the field changes

> systematic experimentation is everything
> run controlled tests, change one variable at a time, and document every step

> sharp debugging instincts will save you
> more time (and compute budget) than any paper or library

> deep knowledge of both your software stack
> and your hardware is the ultimate unfair advantage;
> know your code, know your chips

> in the end, success comes from relentless curiosity,
> tight feedback loops, and a willingness to question everything
> even your own assumptions

if i had this two years ago, it would have saved me so much time

> if you’re building llms,
> read this before you burn gpu months

happy hacking
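bonus: the stability tricks called out above (z-loss, query-key norm, no weight decay on embeddings), as a tiny pytorch sketch. this is my own illustration of the published recipes, not code from the playbook, and the coefficients and the “embed” name match are made-up examples; q_norm / k_norm would typically be RMSNorm modules over the head dimension.

```python
import torch
import torch.nn.functional as F

def cross_entropy_with_z_loss(logits, targets, z_coeff=1e-4):
    # z-loss: penalize a large softmax normalizer log Z so logits stay bounded
    # and the loss curve stays smooth (coefficient is illustrative)
    ce = F.cross_entropy(logits.reshape(-1, logits.size(-1)), targets.reshape(-1))
    log_z = torch.logsumexp(logits, dim=-1)
    return ce + z_coeff * (log_z ** 2).mean()

def qk_norm_attention(q, k, v, q_norm, k_norm):
    # query-key normalization: normalize q and k per head before the dot product
    # so attention logits can't blow up late in training
    return F.scaled_dot_product_attention(q_norm(q), k_norm(k), v, is_causal=True)

def build_optimizer(model, lr=3e-4, wd=0.1):
    # keep weight decay off embedding layers (and other 1-D params like norms/biases)
    decay, no_decay = [], []
    for name, p in model.named_parameters():
        (no_decay if "embed" in name or p.ndim == 1 else decay).append(p)
    return torch.optim.AdamW(
        [{"params": decay, "weight_decay": wd},
         {"params": no_decay, "weight_decay": 0.0}],
        lr=lr,
    )
```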
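and the dpo step from the post-training section, reduced to its loss function. again a sketch under my own assumptions (sequence-level log-probs already computed, beta picked arbitrarily); in practice you’d reach for a maintained trainer rather than hand-rolling this.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    # direct preference optimization: make the policy prefer the chosen response
    # over the rejected one by a wider margin than a frozen reference model does
    policy_margin = policy_chosen_logps - policy_rejected_logps
    ref_margin = ref_chosen_logps - ref_rejected_logps
    return -F.logsigmoid(beta * (policy_margin - ref_margin)).mean()

# toy batch of two preference pairs with made-up sequence log-probs
loss = dpo_loss(torch.tensor([-12.0, -9.5]), torch.tensor([-14.0, -11.0]),
                torch.tensor([-13.0, -10.0]), torch.tensor([-13.5, -10.5]))
print(float(loss))
```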