@awnihannun
The obvious reasons intelligence-per-watt is going up so fast: more efficient architectures, more efficient hardware, and higher-quality data. The less obvious reason: finding the right balance between what should be stored in the model's weights and what can be computed through tool use, reasoning, and potentially other types of in-context learning.

A simple example: in the early LLM days, for simple arithmetic (e.g. adding two numbers), the model quite likely had to essentially memorize tuples of (inputs, op, output). You can imagine this took up a lot of room in the weights. With reasoning, the model can compute this in its chain-of-thought. With tool calling, the model can compute this with a tool call. Either way it saves a lot of space in the weights.

I'm sure there is a floor on the size of the smallest LLM that can reach, say, GPT 5.x quality. But that floor could be 5B, or it could be 100B, and I don't think anyone really knows, because of the above effects. In other words, we can probably go much further with a 5B-15B model that has exceptional tool calling and reasoning.
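To make the arithmetic example concrete, here is a minimal sketch of the tool-call path: instead of the weights storing memorized (inputs, op, output) tuples, the model emits a small structured call and a trivial external tool does the computation. The `ToolCall` and `run_tool` names are hypothetical illustrations, not any real tool-calling API.

```python
# Hypothetical sketch of "compute via tool call" instead of "recall from weights".
from dataclasses import dataclass
import operator

@dataclass
class ToolCall:
    name: str
    args: dict

# The tool itself is tiny and exact; none of this lives in the model's weights.
OPS = {"add": operator.add, "sub": operator.sub, "mul": operator.mul}

def run_tool(call: ToolCall) -> str:
    """Execute a calculator tool call and return the result as text,
    which would be appended to the model's context."""
    if call.name == "calculator":
        fn = OPS[call.args["op"]]
        return str(fn(call.args["a"], call.args["b"]))
    raise ValueError(f"unknown tool: {call.name}")

# A model with good tool calling emits something like this for "48151 + 62342"
# rather than relying on a memorized sum:
call = ToolCall(name="calculator", args={"op": "add", "a": 48151, "b": 62342})
print(run_tool(call))  # -> 110493
```

The same division of labor applies beyond arithmetic: anything a cheap, exact tool can compute is a fact the weights no longer need to store.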