Overview of RLVR

Author

Updated

Jun, 08, 2026

Training Inference Mismatch

What is TIM

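Modern RL frameworks such as verl use different frameworks for the training engine and the inference engine, for example Megatron (Shoeybi et al., 2020) or FSDP (Zhao et al., 2023) for training and a separate engine such as vLLM or SGLang for inference. Although both engines run the same model, their kernels and other implementation details differ, so their outputs are not exactly identical. This is the training inference mismatch (TIM) problem.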

The RL objective is

$$\mathcal{J}(\theta) = \mathbb{E}_{x\sim\mathcal{D},\, y\sim\pi_{\theta}(\cdot\mid x)}[R(x,y)]$$

The corresponding gradient is

$$\nabla_{\theta}\mathcal{J}(\theta) = \mathbb{E}_{x\sim\mathcal{D},\, y\sim\pi_{\theta}(\cdot\mid x)}[R(x,y)\nabla_{\theta}\log\pi_{\theta}(y\mid x)]$$

Because the training engine and the inference engine differ, the sampling distribution and the differentiated distribution in this gradient are not actually the same:

$$\nabla_{\theta}\mathcal{J}(\theta) = \mathbb{E}_{x\sim\mathcal{D},\, \textcolor{red}{y\sim\mu_{\theta}(\cdot\mid x)}}[R(x,y)\nabla_{\theta}\log\pi_{\theta}(y\mid x)]$$

Here $\mu_{\theta}$ is the inference engine's implementation of the policy $\pi_{\theta}$. As soon as the outputs of the training engine and the inference engine disagree even slightly, the algorithm changes from on-policy to off-policy. A concrete example: TODO
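To make the bias tangible, here is a minimal sketch on a three-action toy policy, using only the Python standard library. It computes the expected policy-gradient estimate exactly (by enumeration) once with samples from $\pi_{\theta}$ and once with samples from a slightly perturbed $\mu_{\theta}$. The logits, rewards, and perturbation size are all hypothetical and chosen only for illustration; real engine mismatch is typically much smaller per token.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def expected_score_gradient(sampler, policy, rewards):
    """Exact E_{y ~ sampler}[R(y) * d log policy(y) / d logits]."""
    n = len(policy)
    grad = [0.0] * n
    for y in range(n):
        for k in range(n):
            # For a softmax policy: d log p(y) / d logit_k = 1[y == k] - p(k)
            score = (1.0 if y == k else 0.0) - policy[k]
            grad[k] += sampler[y] * rewards[y] * score
    return grad

# Hypothetical toy setup: one prompt, three possible responses.
logits = [2.0, 1.0, 0.0]
rewards = [1.0, 0.0, 0.0]
pi = softmax(logits)  # training-engine policy

# Assumed perturbation standing in for kernel-level differences.
mu = softmax([z + d for z, d in zip(logits, [0.0, 0.05, -0.05])])

on_policy = expected_score_gradient(pi, pi, rewards)   # y ~ pi
mismatched = expected_score_gradient(mu, pi, rewards)  # y ~ mu
gap = max(abs(a - b) for a, b in zip(on_policy, mismatched))

print("on-policy gradient: ", on_policy)
print("mismatched gradient:", mismatched)
print("max abs difference: ", gap)
```

The two gradients differ even though both differentiate the same $\log\pi_{\theta}$, because only the sampling distribution changed.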

Why TIM

Thinking Machines Lab: non-deterministic inference, numerical instability.

  1. Training and inference engines use different kernels to reach peak performance.
  2. Even within a single engine, batch-invariant kernels are disabled to maximize throughput.
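Both points come down to floating-point addition not being associative: a kernel that reduces in a different order, or splits the reduction differently depending on batch size, can return a slightly different result for the same inputs. The sketch below illustrates this in plain Python; the two helper functions are hypothetical stand-ins for two reduction strategies, not real kernels.

```python
def reduce_left_to_right(values):
    """Sum strictly in order, like a single sequential reduction."""
    total = 0.0
    for v in values:
        total += v
    return total

def reduce_in_two_parts(values, split):
    """Sum two chunks independently, then combine the partial sums."""
    left = reduce_left_to_right(values[:split])
    right = reduce_left_to_right(values[split:])
    return left + right

values = [0.1, 0.2, 0.3]
sequential = reduce_left_to_right(values)
chunked = reduce_in_two_parts(values, split=1)

print(sequential, chunked, sequential == chunked)
```

The difference is on the order of one unit in the last place, but it propagates through every layer and ends up in the token log-probabilities.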

Analysis

Rewriting the objective with importance sampling gives

$$\mathcal{J}(\theta) = \mathbb{E}_{x\sim\mathcal{D},\, y\sim\mu_{\theta_{\mathrm{old}}}(\cdot\mid x)}\left[\frac{\pi_{\theta}(y\mid x)}{\mu_{\theta_{\mathrm{old}}}(y\mid x)}R(x,y)\right]$$
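As a rough sketch of how this estimator could be computed in practice, the snippet below forms the sequence-level ratio $\pi_{\theta}(y\mid x)/\mu_{\theta_{\mathrm{old}}}(y\mid x)$ from per-token log-probabilities and averages the weighted rewards. The function names and the `(train_logprobs, rollout_logprobs, reward)` sample layout are assumptions made for illustration, not the API of any particular framework.

```python
import math

def sequence_importance_ratio(train_logprobs, rollout_logprobs):
    """pi_theta(y|x) / mu_theta_old(y|x) from per-token log-probs of the same y."""
    if len(train_logprobs) != len(rollout_logprobs):
        raise ValueError("both engines must score the same token sequence")
    # A sequence probability is a product of token probabilities,
    # so the log-ratio is a sum of per-token log-prob differences.
    log_ratio = sum(t - r for t, r in zip(train_logprobs, rollout_logprobs))
    return math.exp(log_ratio)

def importance_weighted_objective(samples):
    """Monte Carlo estimate of J(theta) from responses sampled by mu_theta_old."""
    weighted = [
        sequence_importance_ratio(train_lp, rollout_lp) * reward
        for train_lp, rollout_lp, reward in samples
    ]
    return sum(weighted) / len(weighted)

# Hypothetical batch of two responses with slightly mismatched log-probs.
batch = [
    ([-0.20, -1.10, -0.35], [-0.21, -1.08, -0.35], 1.0),
    ([-0.90, -0.40], [-0.90, -0.40], 0.0),
]
estimate = importance_weighted_objective(batch)
print(estimate)
```

When the two engines agree exactly, every ratio is 1 and the estimate reduces to the plain mean reward.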

[yao2025offpolicy] proposes TODO

MiniRL (Zheng et al., 2025) proposes a method for analyzing the equivalence between the token-level and sequence-level objectives. TODO

Solutions

System solutions

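  1. Obtain the true sampling probabilities
  2. Eliminate numerical inconsistencies

However, these improvements still cannot resolve the off-policy problem.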

MoE models

Addressing instability in MoE RL training:

Router Replay

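During training, reuse the MoE routing decisions made by the inference engine, so that routing differences at sampling time do not inject noise into the policy gradient.

Related work: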
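The idea can be sketched for a single token as follows. The sketch assumes a top-k router and assumes that the replayed expert indices are combined with gate weights recomputed from the training-side router logits; that detail, the helper names, and the logits are illustrative assumptions rather than a description of any specific implementation.

```python
import math

def top_k_experts(router_logits, k):
    """Indices of the k experts with the largest router logits."""
    order = sorted(range(len(router_logits)),
                   key=lambda i: router_logits[i], reverse=True)
    return order[:k]

def gate_weights(router_logits, expert_ids):
    """Softmax over the selected experts only."""
    m = max(router_logits[i] for i in expert_ids)
    exps = {i: math.exp(router_logits[i] - m) for i in expert_ids}
    total = sum(exps.values())
    return {i: e / total for i, e in exps.items()}

def route(train_logits, k, replayed_expert_ids=None):
    """Choose experts from training logits, or replay the rollout's choice."""
    if replayed_expert_ids is None:
        expert_ids = top_k_experts(train_logits, k)
    else:
        expert_ids = replayed_expert_ids
    return gate_weights(train_logits, expert_ids)

# Hypothetical router logits for one token; experts 1 and 2 are nearly tied.
rollout_logits = [2.0, 0.51, 0.50, -1.0]  # inference engine
train_logits = [2.0, 0.50, 0.51, -1.0]    # training engine

recorded = top_k_experts(rollout_logits, k=2)  # saved during rollout
without_replay = route(train_logits, k=2)
with_replay = route(train_logits, k=2, replayed_expert_ids=recorded)

print(sorted(without_replay), sorted(with_replay))
```

Without replay, a tiny logit difference sends the token to a different expert during training than during sampling; with replay, both phases activate the same experts.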

TODO: https://openreview.net/pdf?id=8MHqvb4lK9

  1. DeepSeek-AI, Liu, A., Feng, B., Xue, B., Wang, B., Wu, B., Lu, C., Zhao, C., Deng, C., Zhang, C., Ruan, C., Dai, D., Guo, D., Yang, D., Chen, D., Ji, D., Li, E., Lin, F., Dai, F., … Pan, Z. (2025). DeepSeek-V3 Technical Report. https://arxiv.org/abs/2412.19437
  2. Guo, J., Sun, Y., Huang, Z., Wang, Z., Wen, Z., Zhang, Z., Zhou, J., & Kok, S. (2026). KPop: Taming Training–Inference Mismatch in Reinforcement Learning with Adaptive Masking Regions. https://ringtech.notion.site/kpop
  3. Ma, W., Zhang, H., Zhao, L., Song, Y., Wang, Y., Sui, Z., & Luo, F. (2025). Stabilizing MoE Reinforcement Learning by Aligning Training and Inference Routers. https://arxiv.org/abs/2510.11370
  4. Shoeybi, M., Patwary, M., Puri, R., LeGresley, P., Casper, J., & Catanzaro, B. (2020). Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism. https://arxiv.org/abs/1909.08053
  5. Zhao, Y., Gu, A., Varma, R., Luo, L., Huang, C.-C., Xu, M., Wright, L., Shojanazeri, H., Ott, M., Shleifer, S., Desmaison, A., Balioglu, C., Damania, P., Nguyen, B., Chauhan, G., Hao, Y., Mathews, A., & Li, S. (2023). PyTorch FSDP: Experiences on Scaling Fully Sharded Data Parallel. https://arxiv.org/abs/2304.11277
  6. Zheng, C., Dang, K., Yu, B., Li, M., Jiang, H., Lin, J., Liu, Y., Lin, H., Wu, C., Hu, F., Yang, A., Zhou, J., & Lin, J. (2025). Stabilizing Reinforcement Learning with LLMs: Formulation and Practices. https://arxiv.org/abs/2512.01374
  7. Zheng, C., Liu, S., Li, M., Chen, X.-H., Yu, B., Gao, C., Dang, K., Liu, Y., Men, R., Yang, A., Zhou, J., & Lin, J. (2025). Group Sequence Policy Optimization. https://arxiv.org/abs/2507.18071