Abstract
Muon$^p$ is a novel optimizer that uses fractional spectral-power updates to interpolate between gradient descent and full flattening of the singular spectrum, enabling efficient fine-tuning of large-scale models while retaining theoretical guarantees and practical computational efficiency.
Muon is an increasingly widely used optimizer that replaces a gradient $G = U S V^\top$ with its polar factor $U V^\top$, thereby flattening the singular spectrum. However, full flattening discards singular-value information that may matter for adaptation. We introduce Muon$^p$, a Muon-style optimizer that instead uses fractional spectral-power updates $U S^p V^\top$ for rational $p \in (0,1)$, interpolating between Muon and gradient descent. To make it practical, we prove that fractional spectral powers cannot be computed by any fixed univariate polynomial iteration, and furthermore derive low-degree odd bivariate recurrences that approximate $U S^p V^\top$ using only matrix multiplications, preserving Muon's matrix-multiplication-only structure and compute complexity. We show that Muon$^p$ maximizes the linear improvement in loss under the Schatten $q$-norm for $q = 1 + \frac{1}{p}$. Empirically, Muon$^p$ is especially effective for finetuning: on billion-scale models, Muon$^p$ improves validation perplexity and downstream task performance. We further analyze when Muon$^p$ is less suitable, through the lens of spectral geometry. Our results reveal important insights on when preserving the singular spectrum can bring significant gains, and introduce a principled way to achieve them.
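To make the update $U S^p V^\top$ concrete, the following is a minimal NumPy sketch that computes it through an explicit SVD. This is only a reference construction for illustration: it is not the matrix-multiplication-only bivariate recurrence derived in the paper, and the helper names `spectral_power` and `schatten_norm` are hypothetical, chosen here for the example.

```python
import numpy as np

def spectral_power(G, p):
    """Reference U S^p V^T via explicit SVD.

    Assumption: this is an illustrative stand-in, NOT the paper's
    matmul-only recurrence. p = 1 returns G itself (gradient descent
    direction); p = 0 returns the polar factor U V^T (Muon), provided
    G has no zero singular values.
    """
    U, s, Vt = np.linalg.svd(G, full_matrices=False)
    # Scale the columns of U by s**p, then multiply by V^T.
    return (U * s**p) @ Vt

def schatten_norm(A, q):
    """Schatten q-norm: the l_q norm of the singular values of A."""
    s = np.linalg.svd(A, compute_uv=False)
    return np.sum(s**q) ** (1.0 / q)

rng = np.random.default_rng(0)
G = rng.standard_normal((6, 4))   # stand-in for a gradient matrix
p = 0.5
q = 1.0 + 1.0 / p                 # Schatten exponent from the abstract

X = spectral_power(G, p)
# Linear improvement <G, X> per unit Schatten q-norm of the update.
ratio = np.sum(G * X) / schatten_norm(X, q)
# By Hoelder duality for Schatten norms this equals the dual norm of G,
# whose exponent is q / (q - 1) = 1 + p.
dual = schatten_norm(G, 1.0 + p)
print(ratio, dual)
```

The two printed values agree up to floating-point error, which illustrates the abstract's claim: among updates of a given Schatten $q$-norm with $q = 1 + \frac{1}{p}$, the direction $U S^p V^\top$ attains the largest inner product with $G$, namely the dual Schatten $(1+p)$-norm of $G$.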