🐦 Twitter Post Details


@dair_ai

A 3B model outperforms models 10x its size on reasoning benchmarks.

Small language models (SLMs) are often dismissed as fundamentally limited: the assumption is that more parameters mean more capability, full stop. Recent research suggests the real ceiling isn't parameter count but training methodology.

This technical report introduces Nanbeige4-3B, a family of SLMs pretrained on 23 trillion high-quality tokens and fine-tuned on over 30 million diverse instructions.

The results challenge assumptions about model scaling. On AIME 2024, Nanbeige4-3B-Thinking scores 90.4% versus Qwen3-32B's 81.4%. On GPQA-Diamond, it achieves 82.2% versus Qwen3-14B's 64.0%. The 3B model consistently outperforms models 4-10x larger.

Here's how they did it:

Fine-Grained WSD scheduler: rather than uniform data sampling, training is split into stages with progressively refined data mixtures, concentrating high-quality data in the later stages. On a 1B test model, this improved GSM8K from 27.1% to 34.3% over vanilla scheduling.

Solution refinement with CoT reconstruction: answer quality is refined through iterative critique cycles, then a chain-of-thought is reconstructed so that it logically leads to the improved solution. This yields SFT examples far better than rejection sampling produces.

Dual Preference Distillation: the student simultaneously learns to mimic the teacher's output distributions and to distinguish high-quality from low-quality responses, combining token-level distillation with sequence-level preference optimization.

Multi-stage RL: rather than mixed-corpus training, each RL stage targets a specific domain: STEM reasoning with agentic verifiers, coding with synthetic test functions, and human preference alignment with pairwise reward models.

On the WritingBench leaderboard, Nanbeige4-3B-Thinking (79.03) approaches GPT-5 (83.87) and outperforms DeepSeek-R1 (78.92), Grok-4 (74.65), and O4-mini (72.90).
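The fine-grained WSD (warmup-stable-decay) idea described above can be sketched in a few lines. This is a minimal illustration, not the paper's actual configuration: the stage boundaries, mixture shares, and learning-rate hyperparameters below are assumed values chosen for clarity.

```python
import math

# Illustrative stage table: (fraction of total steps, share of
# high-quality data in the mix). Later stages concentrate the
# high-quality data, as the post describes. Values are assumptions.
STAGES = [
    (0.50, 0.10),  # early: mostly broad web-scale data
    (0.30, 0.40),  # middle: refined mixture
    (0.20, 0.80),  # late: high-quality data during the LR decay phase
]

def mixture_at(step: int, total_steps: int) -> float:
    """Return the high-quality data share for the stage containing `step`."""
    frac = step / total_steps
    cum = 0.0
    for length, hq_share in STAGES:
        cum += length
        if frac < cum:
            return hq_share
    return STAGES[-1][1]

def wsd_lr(step: int, total_steps: int, peak_lr: float = 3e-4,
           warmup: float = 0.05, decay: float = 0.20) -> float:
    """Warmup-stable-decay learning rate: linear warmup, a flat plateau,
    then cosine decay to zero over the final `decay` fraction of steps."""
    frac = step / total_steps
    if frac < warmup:
        return peak_lr * frac / warmup
    if frac < 1.0 - decay:
        return peak_lr
    t = (frac - (1.0 - decay)) / decay  # 0 -> 1 inside the decay phase
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * t))
```

The key design point is the coupling: the highest-quality data is sampled most heavily exactly where the schedule spends its decay phase, which is when the model consolidates.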
The report demonstrates that carefully engineered small models can match or exceed much larger ones when the training methodology is optimized at every stage.

Paper: https://t.co/bFPJOZycji

Learn to build with LLMs and AI Agents in our academy: https://t.co/zQXQt0PMbG
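As a rough illustration of the dual preference objective the post describes, here is a toy sketch combining a token-level KL term (mimic the teacher's distribution) with a DPO-style sequence-level preference term (prefer the chosen response over the rejected one). The function names and the `alpha`/`beta` weights are assumptions for illustration, not the paper's formulation.

```python
import math

def token_kl(teacher_probs, student_probs):
    """KL(teacher || student) over the vocabulary for one token position."""
    return sum(t * math.log(t / s)
               for t, s in zip(teacher_probs, student_probs) if t > 0)

def preference_loss(logp_chosen, logp_rejected,
                    ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO-style sequence-level loss: -log sigmoid(beta * margin), where the
    margin compares the student's chosen/rejected log-ratio against a
    frozen reference model's."""
    margin = (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))

def dual_loss(teacher_probs, student_probs,
              logp_chosen, logp_rejected,
              ref_logp_chosen, ref_logp_rejected, alpha=0.5):
    """Weighted sum of both objectives; alpha is an assumed mixing weight."""
    return (alpha * token_kl(teacher_probs, student_probs)
            + (1 - alpha) * preference_loss(logp_chosen, logp_rejected,
                                            ref_logp_chosen, ref_logp_rejected))
```

The token-level term transfers the teacher's fine-grained distribution; the sequence-level term adds a discriminative signal that pure distillation lacks, which is presumably why the two are combined.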

Media 1
Media 2

📊 Media Metadata

{
  "media": [
    {
      "type": "photo",
      "url": "https://crmoxkoizveukayfjuyo.supabase.co/storage/v1/object/public/media/posts/1999488933412110456/media_0.jpg?",
      "filename": "media_0.jpg"
    },
    {
      "type": "photo",
      "url": "https://crmoxkoizveukayfjuyo.supabase.co/storage/v1/object/public/media/posts/1999488933412110456/media_1.png?",
      "filename": "media_1.png"
    }
  ],
  "processed_at": "2025-12-12T14:48:26.537888",
  "pipeline_version": "2.0"
}

🔧 Raw API Response

{
  "type": "tweet",
  "id": "1999488933412110456",
  "url": "https://x.com/dair_ai/status/1999488933412110456",
  "twitterUrl": "https://twitter.com/dair_ai/status/1999488933412110456",
  "text": "A 3B model outperforms models 10x its size on reasoning benchmarks.\n\nSmall language models (SLMs) are often dismissed as fundamentally limited. The belief is that more parameters mean more capability, and that's it.\n\nMore recent research indicates that the real ceiling isn't parameter count. It's the training methodology.\n\nThis technical report introduces Nanbeige4-3B, a family of SLMs trained on 23 trillion high-quality tokens and finetuned on over 30 million diverse instructions.\n\nThe results challenge assumptions about model scaling. On AIME 2024, Nanbeige4-3B-Thinking scores 90.4% versus Qwen3-32B's 81.4%. On GPQA-Diamond, it achieves 82.2% versus Qwen3-14B's 64.0%.\n\nThis shows that the 3B model consistently outperforms models 4-10x larger.\n\nHere's how they did it:\n\nFine-Grained WSD scheduler: Rather than uniform data sampling, they split training into stages with progressively refined data mixtures. High-quality data is concentrated in later stages. On a 1B test model, this improved GSM8K from 27.1% to 34.3% versus vanilla scheduling.\n\nSolution refinement with CoT reconstruction: They refine answer quality through iterative critique cycles, then reconstruct a chain-of-thought that logically leads to the improved solution. This yields SFT examples far better than rejection sampling.\n\nDual Preference Distillation: The student model simultaneously learns to mimic teacher output distributions while distinguishing high-quality from low-quality responses. Token-level distillation combined with sequence-level preference optimization.\n\nMulti-stage RL: Rather than mixed-corpus training, each RL stage targets a specific domain. STEM reasoning with agentic verifiers. Coding with synthetic test functions. Human preference alignment with pairwise reward models.\n\nOn the WritingBench leaderboard, Nanbeige4-3B-Thinking (79.03) approaches GPT-5 (83.87) and outperforms DeepSeek-R1 (78.92), Grok-4 (74.65), and O4-mini (72.90).\n\nThe report demonstrates that carefully engineered small models can match or exceed much larger models when training methodology is optimized at every stage.\n\nPaper: https://t.co/bFPJOZycji\n\nLearn to build with LLMs and AI Agents in our academy: https://t.co/zQXQt0PMbG",
  "source": "Twitter for iPhone",
  "retweetCount": 2,
  "replyCount": 0,
  "likeCount": 5,
  "quoteCount": 0,
  "viewCount": 378,
  "createdAt": "Fri Dec 12 14:38:05 +0000 2025",
  "lang": "en",
  "bookmarkCount": 9,
  "isReply": false,
  "inReplyToId": null,
  "conversationId": "1999488933412110456",
  "displayTextRange": [
    0,
    276
  ],
  "inReplyToUserId": null,
  "inReplyToUsername": null,
  "author": {
    "type": "user",
    "userName": "dair_ai",
    "url": "https://x.com/dair_ai",
    "twitterUrl": "https://twitter.com/dair_ai",
    "id": "889050642903293953",
    "name": "DAIR.AI",
    "isVerified": false,
    "isBlueVerified": true,
    "verifiedType": null,
    "profilePicture": "https://pbs.twimg.com/profile_images/1643277398522187778/31dedbLo_normal.jpg",
    "coverPicture": "https://pbs.twimg.com/profile_banners/889050642903293953/1742055232",
    "description": "Democratizing AI research, education, and technologies.",
    "location": "",
    "followers": 83037,
    "following": 1,
    "status": "",
    "canDm": true,
    "canMediaTag": true,
    "createdAt": "Sun Jul 23 09:12:45 +0000 2017",
    "entities": {
      "description": {
        "urls": []
      },
      "url": {
        "urls": [
          {
            "display_url": "dair.ai",
            "expanded_url": "https://www.dair.ai/",
            "url": "https://t.co/lkqPZtMmfU",
            "indices": [
              0,
              23
            ]
          }
        ]
      }
    },
    "fastFollowersCount": 0,
    "favouritesCount": 3877,
    "hasCustomTimelines": true,
    "isTranslator": false,
    "mediaCount": 85,
    "statusesCount": 2641,
    "withheldInCountries": [],
    "affiliatesHighlightedLabel": {},
    "possiblySensitive": false,
    "pinnedTweetIds": [
      "1999117070576058415"
    ],
    "profile_bio": {},
    "isAutomated": false,
    "automatedBy": null
  },
  "extendedEntities": {
    "media": [
      {
        "display_url": "pic.x.com/eBZ5eqFQ0b",
        "expanded_url": "https://x.com/dair_ai/status/1999488933412110456/photo/1",
        "id_str": "1999488930081832966",
        "indices": [
          277,
          300
        ],
        "media_key": "3_1999488930081832966",
        "media_url_https": "https://pbs.twimg.com/media/G7-clpHagAY4Qg7.jpg",
        "type": "photo",
        "url": "https://t.co/eBZ5eqFQ0b",
        "ext_media_availability": {
          "status": "Available"
        },
        "features": {
          "large": {
            "faces": []
          },
          "medium": {
            "faces": []
          },
          "small": {
            "faces": []
          },
          "orig": {
            "faces": []
          }
        },
        "sizes": {
          "large": {
            "h": 1772,
            "w": 1520,
            "resize": "fit"
          },
          "medium": {
            "h": 1200,
            "w": 1029,
            "resize": "fit"
          },
          "small": {
            "h": 680,
            "w": 583,
            "resize": "fit"
          },
          "thumb": {
            "h": 150,
            "w": 150,
            "resize": "crop"
          }
        },
        "original_info": {
          "height": 1772,
          "width": 1520,
          "focus_rects": [
            {
              "x": 0,
              "y": 0,
              "w": 1520,
              "h": 851
            },
            {
              "x": 0,
              "y": 0,
              "w": 1520,
              "h": 1520
            },
            {
              "x": 0,
              "y": 0,
              "w": 1520,
              "h": 1733
            },
            {
              "x": 398,
              "y": 0,
              "w": 886,
              "h": 1772
            },
            {
              "x": 0,
              "y": 0,
              "w": 1520,
              "h": 1772
            }
          ]
        },
        "media_results": {
          "result": {
            "media_key": "3_1999488930081832966"
          }
        }
      }
    ]
  },
  "card": null,
  "place": {},
  "entities": {
    "hashtags": [],
    "symbols": [],
    "urls": [
      {
        "display_url": "arxiv.org/abs/2512.06266",
        "expanded_url": "https://arxiv.org/abs/2512.06266",
        "url": "https://t.co/bFPJOZycji",
        "indices": [
          2113,
          2136
        ]
      },
      {
        "display_url": "dair-ai.thinkific.com",
        "expanded_url": "https://dair-ai.thinkific.com/",
        "url": "https://t.co/zQXQt0PMbG",
        "indices": [
          2193,
          2216
        ]
      }
    ],
    "user_mentions": []
  },
  "quoted_tweet": null,
  "retweeted_tweet": null,
  "article": null
}