Controlled LLM Training on Spectral Sphere

Tian Xie; Haoming Luo; Haoyu Tang; Yiwen Hu; Jason Klein Liu; Qingnan Ren; Yang Wang; Wayne Xin Zhao; Rui Yan; Bing Su; Chong Luo; Baining Guo

ICML 2026 | January 2026

Scaling large models requires optimization strategies that deliver rapid convergence without sacrificing stability. Maximal Update Parametrization (\(\mu\)P) provides a theoretical safeguard for width-invariant \(\Theta(1)\) activation control, whereas emerging optimizers such as Muon are only “half-aligned” with these constraints: they control the updates but allow the weights to drift. To address this limitation, we introduce the Spectral Sphere Optimizer (SSO), which enforces strict module-wise spectral constraints on both the weights and their updates. By deriving the steepest descent direction on the spectral sphere, SSO realizes a fully \(\mu\)P-aligned optimization process. To enable large-scale training, we implement SSO as an efficient parallel algorithm within Megatron. Through extensive pretraining on diverse architectures, including Dense 1.7B, MoE 8B-A1B, and 200-layer DeepNet models, SSO consistently outperforms AdamW and Muon. Furthermore, we observe significant practical stability benefits, including improved MoE router load balancing, suppressed outliers, and strictly bounded activations.
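To make the idea of constraining both the weights and their updates concrete, the following is a minimal sketch in plain Python. It is not the SSO algorithm: the abstract describes a steepest descent direction derived on the spectral sphere, whereas this sketch simply rescales the update to a fixed spectral norm and then rescales the new weight matrix back to the target spectral norm. The function names (`spectral_norm`, `scale_to_spectral_norm`, `constrained_step`), the power-iteration estimator, and the radius \(\sqrt{d_{out}/d_{in}}\) are all assumptions made for illustration.

```python
import math
import random


def matvec(A, x):
    """Return A @ x for a matrix stored as a list of rows."""
    return [sum(a * b for a, b in zip(row, x)) for row in A]


def rmatvec(A, y):
    """Return A^T @ y."""
    return [sum(A[i][j] * y[i] for i in range(len(A))) for j in range(len(A[0]))]


def norm(v):
    return math.sqrt(sum(x * x for x in v))


def spectral_norm(A, iters=200, seed=0):
    """Estimate the largest singular value of A by power iteration on A^T A."""
    rng = random.Random(seed)
    v = [rng.gauss(0.0, 1.0) for _ in A[0]]
    n = norm(v)
    v = [x / n for x in v]
    for _ in range(iters):
        w = rmatvec(A, matvec(A, v))
        n = norm(w)
        if n == 0.0:  # A is (numerically) the zero matrix
            return 0.0
        v = [x / n for x in w]
    return norm(matvec(A, v))


def scale_to_spectral_norm(A, target):
    """Rescale A so that its estimated spectral norm equals `target`."""
    s = spectral_norm(A)
    c = target / s if s > 0.0 else 1.0
    return [[c * a for a in row] for row in A]


def constrained_step(W, G, lr, radius):
    """One illustrative step (hypothetical, NOT the paper's update rule):
    1. give the update a spectral norm of lr * radius,
    2. apply it,
    3. rescale the weights back to spectral norm `radius`.
    """
    D = scale_to_spectral_norm(G, lr * radius)
    W_new = [[w - d for w, d in zip(rw, rd)] for rw, rd in zip(W, D)]
    return scale_to_spectral_norm(W_new, radius)


if __name__ == "__main__":
    rng = random.Random(42)
    d_out, d_in = 8, 4
    # Assumed target radius: sqrt(d_out / d_in), a common spectral scaling choice.
    radius = math.sqrt(d_out / d_in)
    W = [[rng.gauss(0.0, 1.0) for _ in range(d_in)] for _ in range(d_out)]
    W = scale_to_spectral_norm(W, radius)
    for step in range(3):
        # Random matrix standing in for a gradient.
        G = [[rng.gauss(0.0, 1.0) for _ in range(d_in)] for _ in range(d_out)]
        W = constrained_step(W, G, lr=0.1, radius=radius)
        print(step, spectral_norm(W))
```

In this sketch the weight matrix has the same estimated spectral norm after every step, which is the "weights do not drift" property the abstract contrasts with update-only control. A practical implementation would presumably avoid the global rescaling used here and would need the efficient, parallel formulation the abstract refers to.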
