@YichenJiang9
We show that Transformers generalize on complex data by reusing shared attention patterns across similar structures. But how do we avoid overfitting on low-complexity data? 🚨SQ-Transformer explicitly quantizes embeddings by structure & learns systematic attention https://t.co/eaeG5gBo0d 🧵 https://t.co/B8XcHs6lfG
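Rough idea in code: a minimal sketch of vector-quantizing token embeddings with a straight-through estimator, in the spirit of "structurally quantized embeddings." This is a generic VQ-embedding layer under assumed names and sizes (`QuantizedEmbedding`, `codebook_size`, the VQ-VAE-style losses), not the paper's exact SoVQ objective or its systematic attention layers:

```python
# Hedged sketch: quantize each token's embedding to its nearest entry
# in a small shared codebook, so words playing similar structural
# roles collapse onto the same code vector. All names/sizes are
# illustrative assumptions, not the paper's implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F

class QuantizedEmbedding(nn.Module):
    def __init__(self, vocab_size=1000, dim=64, codebook_size=32, beta=0.25):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)        # continuous word embeddings
        self.codebook = nn.Embedding(codebook_size, dim)  # shared "class" vectors
        self.beta = beta                                  # commitment-loss weight

    def forward(self, token_ids):
        z = self.embed(token_ids)                         # (B, T, D)
        # Nearest codebook entry per token.
        dists = torch.cdist(
            z, self.codebook.weight.unsqueeze(0).expand(z.size(0), -1, -1)
        )                                                 # (B, T, K)
        codes = dists.argmin(-1)                          # (B, T)
        q = self.codebook(codes)                          # quantized embeddings
        # VQ-VAE-style losses: move codes toward embeddings, and
        # commit embeddings to their assigned code.
        vq_loss = F.mse_loss(q, z.detach()) + self.beta * F.mse_loss(z, q.detach())
        # Straight-through estimator: forward pass uses q,
        # gradients flow back to the continuous embeddings z.
        q = z + (q - z).detach()
        return q, codes, vq_loss

# Usage: q feeds the Transformer; vq_loss joins the training objective.
tokens = torch.randint(0, 1000, (2, 8))
q, codes, vq_loss = QuantizedEmbedding()(tokens)
```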