@LiorOnAI
A 35 billion parameter model just beat a 235 billion parameter model. That's not supposed to happen.

Qwen3.5-35B-A3B now outperforms its predecessor that had 6x more total parameters, and it does so while using 7x fewer active parameters per token.

The breakthrough isn't efficiency for efficiency's sake. It's proof that three specific techniques can compress intelligence better than brute-force scaling:

1. Hybrid attention layers that mix linear attention (fast, scales to long contexts) with standard attention (accurate, catches nuance) in a 3:1 ratio

2. Ultra-sparse experts where only 3 billion of 35 billion parameters activate per token, but those 3 billion are chosen by a router trained on higher-quality data

3. Reinforcement learning scaled across millions of simulated agent environments, not just text prediction

The result is an architecture where intelligence comes from better routing decisions, not bigger weight matrices.

This unlocks four things that weren't practical before:

1. Running frontier-class reasoning on a single GPU node instead of a cluster

2. Serving 1-million-token contexts in production without exploding costs

3. Building agents that handle complex tool use without the latency penalty of dense models

4. Fine-tuning on domain data without updating 200+ billion parameters

If this pattern holds, the next 18 months will belong to teams optimizing routing and data quality, not teams with the biggest GPU budgets.
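The 3:1 hybrid attention mix boils down to a layer schedule. Here's a minimal sketch of what that interleaving could look like; the function name and the exact placement of the full-attention layers are illustrative assumptions, not Qwen's actual implementation:

```python
# Hypothetical sketch: three linear-attention layers for every
# one full-attention layer, repeating down the network depth.
def layer_schedule(num_layers, ratio=3):
    """Return the attention type for each layer index."""
    return [
        "full" if (i + 1) % (ratio + 1) == 0 else "linear"
        for i in range(num_layers)
    ]

schedule = layer_schedule(8)
# → ['linear', 'linear', 'linear', 'full',
#    'linear', 'linear', 'linear', 'full']
```

The intuition: linear attention keeps per-token cost constant as context grows, while the periodic full-attention layers let the model recover exact long-range dependencies the linear layers blur.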
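"Only 3B of 35B parameters activate" is standard sparse mixture-of-experts routing: a small router scores every expert per token and only the top-k are run. A minimal numpy sketch of the idea, with all names, shapes, and the expert count chosen for illustration (not Qwen's real config):

```python
import numpy as np

def route(token_hidden, router_weights, k=2):
    """Score every expert, keep the top-k, renormalize their gates.
    Only the selected experts' weights are ever touched for this token."""
    logits = token_hidden @ router_weights            # (num_experts,)
    top = np.argsort(logits)[-k:]                     # indices of the k best experts
    gates = np.exp(logits[top] - logits[top].max())   # stable softmax over
    gates /= gates.sum()                              # the selected experts only
    return top, gates

rng = np.random.default_rng(0)
hidden = rng.standard_normal(64)                      # one token's hidden state
W = rng.standard_normal((64, 16))                     # router for 16 experts
experts, gates = route(hidden, W)                     # 2 of 16 experts activate
```

The post's claim is that the quality of these routing decisions, driven by better training data, is now doing work that used to require simply adding more parameters.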