@dair_ai
New research from Google: "Nested Learning: The Illusion of Deep Learning Architectures". For those following continual learning research, you may want to bookmark this one.

Instead of stacking more layers, what if we give neural networks more levels of learning?

The default approach to building more capable AI systems today is still adding depth: more layers, more parameters, more pre-training data. That design philosophy has driven progress from CNNs to Transformers to LLMs. But there's a ceiling that's rarely discussed.

Current models suffer from what the authors call "computational anterograde amnesia." Their knowledge is frozen after pre-training: they can't continually learn, and they can't acquire new skills beyond what fits in their immediate context window.

This paper introduces Nested Learning (NL), a paradigm that reframes ML models as interconnected systems of multi-level optimization problems, each with its own "context flow" and update frequency.

The key insight: optimizers and architectures are fundamentally the same thing. Both are associative memories that compress their own context. Adam and SGD are memory modules that compress gradients. Transformers are memory modules that compress tokens. Pre-training itself is just in-context learning where the context is the entire training dataset.

Why does this matter? NL adds a new design axis beyond depth and width. Instead of deeper networks, you build systems with more levels of nested optimization, each updating at a different frequency. This mirrors the brain, where gamma waves (30-150 Hz) handle sensory information while theta waves (0.5-8 Hz) handle memory consolidation.

Building on this framework, the researchers present Hope, an architecture that combines self-modifying memory with a continuum memory system, replacing the traditional long-term/short-term memory dichotomy with a spectrum of update frequencies.
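The multi-frequency idea can be sketched in a few lines. This is a toy illustration of the general concept, not the paper's actual method: a fast level updates parameters every step, while a slow level (a momentum-style memory that compresses gradient history) refreshes only once every several steps. All names and constants here are our own.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy linear-regression data: recover true_w from noisy observations.
X = rng.normal(size=(256, 4))
true_w = np.array([1.0, -2.0, 0.5, 3.0])
y = X @ true_w + 0.1 * rng.normal(size=256)

w = np.zeros(4)          # fast level: updated every step
memory = np.zeros(4)     # slow level: an EMA that "compresses" gradient history
lr, beta = 0.05, 0.9
slow_every = 8           # the slow level only refreshes every 8 steps

for step in range(200):
    i = rng.integers(0, 256, size=32)
    grad = X[i].T @ (X[i] @ w - y[i]) / 32
    if step % slow_every == 0:
        # low-frequency level: consolidate gradient statistics
        memory = beta * memory + (1 - beta) * grad
    # high-frequency level: per-step update steered by the slow memory
    w -= lr * (grad + memory)
```

After the loop, `w` lands near `true_w`. The point is only the structure: two learning processes over the same context flow, distinguished by update frequency rather than by depth.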
The results:

> Hope achieves 100% accuracy on needle-in-a-haystack tasks up to 16K context, where Transformers score 79.8%.
> On BABILong, Hope maintains performance at 10M context length, where GPT-4 fails around 128K.
> In continual learning, Hope outperforms in-context learning, EWC, and external-learner methods on class-incremental classification.
> On language modeling at 1.3B parameters, Hope reaches 14.39 perplexity on WikiText versus 17.92 for Transformer++.

Instead of asking "how do we make networks deeper," NL asks "how do we give networks more levels of learning." The path to continual learning may not be bigger models, but models that learn at multiple timescales simultaneously.

Paper: https://t.co/ArKfAZUCLu

Learn to build with AI agents in our academy: https://t.co/zQXQt0PMbG