MetroRLHF: Enabling Memory-Effective Training for On-Policy RLHF via Adaptive Sequence Streaming

Efficient Reasoning @ NeurIPS 2025

Reinforcement learning from human feedback (RLHF) has become the standard post-training technique for endowing large language models (LLMs) with helpful, harmless, and intent-consistent behavior. In practice, however, its adoption is hampered by prohibitive memory consumption during the policy-model update phase, especially when training on long-form generation tasks. In this paper, we propose MetroRLHF, a memory-efficient, on-policy RLHF approach that exploits inference-time computation to reduce the training-time memory budget and to skip unnecessary work. By reusing the K,V context materialized during the inference phase, MetroRLHF removes, at no extra cost, the inter-token dependencies that normally force the entire sequence to be trained in parallel. Building on fine-grained subsequence streaming, RLHF can then train the productive tokens efficiently. This yields a training pipeline that exactly matches the behavior of conventional full-sequence RLHF while using less memory and incurring no arithmetic recomputation. Experiments on the Qwen-3 models demonstrate that MetroRLHF's rescheduled algorithm reduces peak training memory usage to 1/3.8 ∼ 1/5.9, enabling fine-tuning of LLMs that is not only memory-effective but also semantically reliable.
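The dependency-removal idea can be illustrated with a minimal, forward-only sketch. It assumes a single attention head and a K,V context that is already materialized for the whole sequence. Under those assumptions, the attention output for any block of query positions can be computed on its own and still agrees with the full-sequence result. The function names `causal_attention` and `streamed_attention`, the toy sizes, and the chunk length are hypothetical choices for illustration. This is not the MetroRLHF implementation, and it does not cover the backward pass or how the method schedules subsequences.

```python
import math
import random


def causal_attention(q_rows, keys, values, first_pos):
    """Single-head causal attention for a block of query rows.

    q_rows[i] is the query at absolute position first_pos + i and may
    attend to cached keys/values at positions 0 .. first_pos + i.
    """
    d = len(keys[0])
    dv = len(values[0])
    out = []
    for i, q in enumerate(q_rows):
        limit = first_pos + i + 1  # causal mask: no access to later positions
        scores = [
            sum(a * b for a, b in zip(q, keys[j])) / math.sqrt(d)
            for j in range(limit)
        ]
        top = max(scores)  # subtract the max for a numerically stable softmax
        weights = [math.exp(s - top) for s in scores]
        norm = sum(weights)
        out.append(
            [sum(weights[j] * values[j][c] for j in range(limit)) / norm
             for c in range(dv)]
        )
    return out


def streamed_attention(queries, keys, values, chunk):
    """Process the queries chunk by chunk against the shared K,V context."""
    out = []
    for start in range(0, len(queries), chunk):
        block = queries[start:start + chunk]
        out.extend(causal_attention(block, keys, values, start))
    return out


# Toy data standing in for one layer's cached projections (assumed sizes).
random.seed(0)
T, D = 8, 4
Q = [[random.gauss(0, 1) for _ in range(D)] for _ in range(T)]
K = [[random.gauss(0, 1) for _ in range(D)] for _ in range(T)]
V = [[random.gauss(0, 1) for _ in range(D)] for _ in range(T)]

full = causal_attention(Q, K, V, 0)               # whole sequence at once
streamed = streamed_attention(Q, K, V, chunk=3)   # 3 + 3 + 2 positions

max_diff = max(
    abs(a - b)
    for row_full, row_stream in zip(full, streamed)
    for a, b in zip(row_full, row_stream)
)
print("max abs difference:", max_diff)
```

In this sketch only one chunk of query rows is live at a time, while the keys and values are read from the shared context rather than recomputed. That mirrors, in a simplified form, why streaming subsequences can lower peak memory without changing the result.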