Unraveling the Gradient Descent Dynamics of Transformers • Paper • 2411.07538 • Published Nov 12, 2024
No More Adam: Learning Rate Scaling at Initialization is All You Need • Paper • 2412.11768 • Published 20 days ago