LEGO-Puzzles: How Good Are MLLMs at Multi-Step Spatial Reasoning? Paper • 2503.19990 • Published Mar 25, 2025 • 34
Understanding Incremental Learning of Gradient Descent: A Fine-grained Analysis of Matrix Sensing Paper • 2301.11500 • Published Jan 27, 2023
DistillSpec: Improving Speculative Decoding via Knowledge Distillation Paper • 2310.08461 • Published Oct 12, 2023 • 1
On the SDEs and Scaling Rules for Adaptive Gradient Algorithms Paper • 2205.10287 • Published May 20, 2022
Dichotomy of Early and Late Phase Implicit Biases Can Provably Induce Grokking Paper • 2311.18817 • Published Nov 30, 2023
RNNs are not Transformers (Yet): The Key Bottleneck on In-context Retrieval Paper • 2402.18510 • Published Feb 28, 2024
A Quadratic Synchronization Rule for Distributed Deep Learning Paper • 2310.14423 • Published Oct 22, 2023
The Marginal Value of Momentum for Small Learning Rate SGD Paper • 2307.15196 • Published Jul 27, 2023