- Scan and Snap: Understanding Training Dynamics and Token Composition in 1-layer Transformer · 4 authors
- MixCE: Training Autoregressive Language Models by Mixing Forward and Reverse Cross-Entropies · 7 authors
- Impossible Distillation: from Low-Quality Model to High-Quality Dataset & Model for Summarization and Paraphrasing · 8 authors