@arankomatsuzaki
SWAX: short windows, long memory

• Hybrid of sliding-window attn + xLSTM RNN
• Counter-intuitive: shorter windows → better long-term recall
• Fix: stochastic window sizes = strong short + long context performance
• Outperforms fixed window attention

https://t.co/Lg44gM7S8E
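
A rough sketch of the stochastic-window idea, for intuition only: build a causal sliding-window attention mask and draw the window size at random for each training step. The function names, the candidate sizes (4 and 16), and the uniform sampling are illustrative assumptions, not details from the paper. The xLSTM branch is left out entirely.

```python
import random

def sliding_window_mask(seq_len, window):
    """Causal mask: position i may attend to j only if i - window < j <= i."""
    return [[(j <= i) and (i - j < window) for j in range(seq_len)]
            for i in range(seq_len)]

def sample_window(rng, choices=(4, 16)):
    """Pick one window size per training step (hypothetical sizes, uniform choice)."""
    return rng.choice(choices)

if __name__ == "__main__":
    rng = random.Random(0)
    for step in range(3):
        w = sample_window(rng)
        mask = sliding_window_mask(seq_len=8, window=w)
        visible = sum(mask[-1])  # number of positions the last token can see
        print(f"step {step}: window={w}, last token attends to {visible} positions")
```

On steps with a short window, anything older than the window is invisible to attention, so it can only be recovered through the recurrent path. That is one plausible reading of why shorter windows help long-term recall.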