Controlled LLM Training on Spectral Sphere

  • Tian Xie
  • Haoming Luo
  • Haoyu Tang
  • Yiwen Hu
  • Jason Klein Liu
  • Qingnan Ren
  • Yang Wang
  • Wayne Xin Zhao
  • Rui Yan
  • Bing Su

ICML 2026

Scaling large models requires optimization strategies that ensure rapid convergence grounded in stability. Maximal Update Parametrization (\(\mu\)P) provides a theoretical safeguard for width-invariant \(\Theta(1)\) activation control, whereas emerging optimizers like Muon are only “half-aligned” with these constraints: they control updates but allow weights to drift. To address this limitation, we introduce the Spectral Sphere Optimizer (SSO), which enforces strict module-wise spectral constraints on both weights and their updates. By deriving the steepest descent direction on the spectral sphere, SSO realizes a fully \(\mu\)P-aligned optimization process. To enable large-scale training, we implement SSO as an efficient parallel algorithm within Megatron. In extensive pretraining experiments on diverse architectures, including Dense 1.7B, MoE 8B-A1B, and 200-layer DeepNet models, SSO consistently outperforms AdamW and Muon. We also observe significant practical stability benefits, including improved MoE router load balancing, suppressed outliers, and strictly bounded activations.
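To make the distinction between controlling only updates and controlling both weights and updates concrete, the following is a minimal NumPy sketch. It is not the SSO algorithm from the paper: it pairs a Muon-style orthogonalized update (unit spectral norm) with a plain rescaling back onto a sphere of fixed spectral norm, whereas SSO derives the steepest descent direction on the sphere itself. The function names, the learning rate, and the target radius \(\sqrt{d_\text{out}/d_\text{in}}\) are illustrative assumptions.

```python
import numpy as np

def spectral_norm(W):
    # Largest singular value of W.
    return np.linalg.norm(W, ord=2)

def msign(G):
    # Orthogonalized gradient U @ V^T from the reduced SVD; all of its
    # singular values are 1, so the update has unit spectral norm.
    U, _, Vt = np.linalg.svd(G, full_matrices=False)
    return U @ Vt

def retract_to_sphere(W, radius):
    # Rescale W so that its spectral norm equals `radius`.
    # Assumption: simple rescaling, not the paper's derivation.
    return W * (radius / spectral_norm(W))

def constrained_step(W, G, lr, radius):
    # Update-side control: step along a unit-spectral-norm direction.
    W_new = W - lr * radius * msign(G)
    # Weight-side control: return to the spectral sphere so the
    # weight norm cannot drift over many steps.
    return retract_to_sphere(W_new, radius)

rng = np.random.default_rng(0)
d_out, d_in = 8, 4
radius = np.sqrt(d_out / d_in)        # assumed target spectral norm
W = retract_to_sphere(rng.standard_normal((d_out, d_in)), radius)
G = rng.standard_normal((d_out, d_in))  # stand-in for a gradient

W_next = constrained_step(W, G, lr=0.1, radius=radius)
```

Dropping the final `retract_to_sphere` call leaves a step whose update is bounded but whose weight norm is free to drift, which is the “half-aligned” behavior the abstract attributes to Muon.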

GitHub