Training Inference Mismatch
What is TIM
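Modern RL frameworks such as verl use different frameworks for the inference engine and the training engine, for example Megatron (Shoeybi et al., 2020) or FSDP (Zhao et al., 2023) for training and a separate engine such as vLLM or SGLang for inference. Although the model is the same, differences in kernels and other implementation details mean that the outputs of the inference engine and the training engine are not exactly identical. This is the training-inference mismatch (TIM) problem.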
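The RL objective is

$$
J(\theta) = \mathbb{E}_{x \sim \mathcal{D},\; y \sim \pi_\theta(\cdot \mid x)}\big[R(x, y)\big]
$$

and the corresponding gradient is

$$
\nabla_\theta J(\theta) = \mathbb{E}_{x \sim \mathcal{D},\; y \sim \pi_\theta(\cdot \mid x)}\big[R(x, y)\, \nabla_\theta \log \pi_\theta(y \mid x)\big].
$$

Because the training engine and the inference engine differ, the two distributions that appear in this gradient are in fact not the same, i.e. what is actually computed is

$$
\mathbb{E}_{x \sim \mathcal{D},\; y \sim \pi_\theta^{\text{infer}}(\cdot \mid x)}\big[R(x, y)\, \nabla_\theta \log \pi_\theta^{\text{train}}(y \mid x)\big].
$$

Here $\pi_\theta^{\text{infer}}$ is the inference engine's implementation of the policy and $\pi_\theta^{\text{train}}$ is the training engine's. As soon as the outputs of the two engines differ even slightly, the algorithm turns from on-policy into off-policy. A concrete example: TODO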
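As a rough illustration of why small per-token disagreements matter, the toy sketch below compounds a per-token probability gap over a long response. All numbers are hypothetical and chosen only for illustration. `sequence_log_ratio` is a helper name introduced here, not part of any framework.

```python
import math

def sequence_log_ratio(train_logprobs, infer_logprobs):
    """Sum of per-token gaps: log p_train(token) - log p_infer(token)."""
    return sum(t - i for t, i in zip(train_logprobs, infer_logprobs))

# Hypothetical numbers: for every sampled token, the training engine reports a
# log-probability 0.01 higher than the inference engine (a gap of about 1%).
num_tokens = 1000
infer_logprobs = [-2.00] * num_tokens
train_logprobs = [-1.99] * num_tokens

# Ratio between the two engines for a single token.
token_ratio = math.exp(train_logprobs[0] - infer_logprobs[0])

# Ratio between the two engines for the whole sequence: the per-token
# gaps add up in log space, i.e. multiply in probability space.
sequence_ratio = math.exp(sequence_log_ratio(train_logprobs, infer_logprobs))

print(f"per-token ratio: {token_ratio:.4f}")
print(f"sequence ratio:  {sequence_ratio:.1f}")
```

Under these assumed numbers, a gap of roughly 1% per token compounds into a sequence-level ratio in the tens of thousands, so the sampled responses are far from being draws from the distribution the training engine believes it is optimizing.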
Why TIM
Thinking Machines Lab: non-deterministic inference, numerical instability.
- Training and inference engines use different kernels to reach peak performance.
- Even within a single engine, batch-invariant kernels are disabled to achieve maximum throughput.
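One reason different kernels can disagree is that floating-point addition is not associative, so the order in which a reduction is carried out can change the result. The sketch below sums the same values in two different orders. It is pure Python with hypothetical helper names and is not taken from any engine.

```python
def sum_left_to_right(values):
    """Accumulate strictly in sequence, like a simple serial loop."""
    total = 0.0
    for v in values:
        total += v
    return total

def sum_pairwise(values):
    """Accumulate in a tree order, like a parallel reduction."""
    if len(values) == 1:
        return values[0]
    mid = len(values) // 2
    return sum_pairwise(values[:mid]) + sum_pairwise(values[mid:])

# Values with very different magnitudes make the rounding visible.
values = [1e16, 1.0, -1e16, 1.0]

serial = sum_left_to_right(values)
tree = sum_pairwise(values)

print("left-to-right:", serial)
print("pairwise:     ", tree)
print("identical:    ", serial == tree)
```

Both functions compute the same mathematical sum, yet the two results differ. A kernel whose reduction order depends on batch size or tiling can therefore return slightly different logits for the same input.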
Analysis
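Rewriting the objective with importance sampling gives

$$
J(\theta) = \mathbb{E}_{x \sim \mathcal{D},\; y \sim \pi_\theta^{\text{infer}}(\cdot \mid x)}\left[\frac{\pi_\theta^{\text{train}}(y \mid x)}{\pi_\theta^{\text{infer}}(y \mid x)}\, R(x, y)\right],
$$

where $\pi_\theta^{\text{train}}$ and $\pi_\theta^{\text{infer}}$ are the training-engine and inference-engine implementations of the policy.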
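The importance-sampling rewrite can be checked on a toy problem. The sketch below uses three possible responses with made-up probabilities and rewards and computes each expectation exactly by enumeration. The distributions, rewards, and function names are all assumptions for illustration.

```python
def expected_reward(dist, reward):
    """Exact expectation of the reward under a discrete distribution."""
    return sum(p * reward[y] for y, p in dist.items())

def importance_weighted_reward(p_train, p_infer, reward):
    """Expectation under p_infer of (p_train / p_infer) * reward."""
    return sum(q * (p_train[y] / q) * reward[y] for y, q in p_infer.items())

# Hypothetical policy probabilities as computed by each engine.
p_train = {"a": 0.50, "b": 0.30, "c": 0.20}
p_infer = {"a": 0.48, "b": 0.33, "c": 0.19}
reward = {"a": 1.0, "b": 0.0, "c": -1.0}

on_policy = expected_reward(p_train, reward)   # what we want to estimate
uncorrected = expected_reward(p_infer, reward) # what sampling from p_infer gives
corrected = importance_weighted_reward(p_train, p_infer, reward)

print("on-policy:  ", on_policy)
print("uncorrected:", uncorrected)
print("corrected:  ", corrected)
```

In this toy setting the uncorrected value is biased, while the importance-weighted value matches the on-policy expectation up to floating-point error. In practice the expectation is estimated from samples, and the ratio can have high variance.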
[yao2025offpolicy] proposes TODO
MiniRL (Zheng et al., 2025) proposes a method for analyzing the equivalence between the token-level and sequence-level objectives. TODO
Solutions
System solutions
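- Obtain the true sampling probabilities.
- Resolve the numerical inconsistency.
However, these improvements still do not solve the off-policy problem.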
MoE models
Addressing RL training instability for MoE models:
Router Replay
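That is, during training, reuse the MoE router decisions made at inference time, so that differences in routing between sampling and training do not inject noise into the policy gradient.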
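A minimal sketch of the idea, assuming a top-k router: record the selected expert indices during rollout and pass them back in during training. The function names, the toy experts, and the choice to recompute gate weights from the training-side logits over the replayed experts are all assumptions made for illustration, not the implementation of any of the cited works.

```python
import math

def top_k_indices(logits, k):
    """Indices of the k largest router logits, largest first."""
    return sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]

def gate_weights(logits, indices):
    """Softmax restricted to the selected experts' logits."""
    m = max(logits[i] for i in indices)
    exps = [math.exp(logits[i] - m) for i in indices]
    total = sum(exps)
    return [e / total for e in exps]

def moe_layer(x, router_logits, experts, k, replay_indices=None):
    """Toy MoE layer; if replay_indices is given, skip top-k and reuse them."""
    if replay_indices is not None:
        indices = list(replay_indices)
    else:
        indices = top_k_indices(router_logits, k)
    weights = gate_weights(router_logits, indices)
    output = sum(w * experts[i](x) for w, i in zip(weights, indices))
    return output, indices

# Four toy "experts" that just scale their input.
experts = [lambda x, s=s: s * x for s in (1.0, 2.0, 3.0, 4.0)]

# Hypothetical router logits for one token; the two engines disagree slightly.
infer_logits = [0.50, 1.00, 0.99, 0.98]
train_logits = [0.50, 1.00, 0.98, 0.99]

_, infer_idx = moe_layer(1.0, infer_logits, experts, k=2)   # recorded at rollout
_, free_idx = moe_layer(1.0, train_logits, experts, k=2)    # training, no replay
_, replay_idx = moe_layer(1.0, train_logits, experts, k=2, replay_indices=infer_idx)

print("inference routing:      ", infer_idx)
print("training without replay:", free_idx)
print("training with replay:   ", replay_idx)
```

With these assumed logits, a difference of 0.01 is enough to change which experts are selected when training routes independently, whereas replaying the recorded indices keeps the training pass on the same experts that produced the sample.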
Related work includes:
- DeepSeek-V3 (DeepSeek-AI et al., 2025)
- Ma et al. (2025)
- GSPO (Zheng, Liu, et al., 2025)
- IcePop TODO
- KPop (Guo et al., 2026)
TODO: https://openreview.net/pdf?id=8MHqvb4lK9
- DeepSeek-AI, Liu, A., Feng, B., Xue, B., Wang, B., Wu, B., Lu, C., Zhao, C., Deng, C., Zhang, C., Ruan, C., Dai, D., Guo, D., Yang, D., Chen, D., Ji, D., Li, E., Lin, F., Dai, F., … Pan, Z. (2025). DeepSeek-V3 Technical Report. https://arxiv.org/abs/2412.19437
- Guo, J., Sun, Y., Huang, Z., Wang, Z., Wen, Z., Zhang, Z., Zhou, J., & Kok, S. (2026). KPop: Taming Training–Inference Mismatch in Reinforcement Learning with Adaptive Masking Regions. https://ringtech.notion.site/kpop
- Ma, W., Zhang, H., Zhao, L., Song, Y., Wang, Y., Sui, Z., & Luo, F. (2025). Stabilizing MoE Reinforcement Learning by Aligning Training and Inference Routers. https://arxiv.org/abs/2510.11370
- Shoeybi, M., Patwary, M., Puri, R., LeGresley, P., Casper, J., & Catanzaro, B. (2020). Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism. https://arxiv.org/abs/1909.08053
- Zhao, Y., Gu, A., Varma, R., Luo, L., Huang, C.-C., Xu, M., Wright, L., Shojanazeri, H., Ott, M., Shleifer, S., Desmaison, A., Balioglu, C., Damania, P., Nguyen, B., Chauhan, G., Hao, Y., Mathews, A., & Li, S. (2023). PyTorch FSDP: Experiences on Scaling Fully Sharded Data Parallel. https://arxiv.org/abs/2304.11277
- Zheng, C., Dang, K., Yu, B., Li, M., Jiang, H., Lin, J., Liu, Y., Lin, H., Wu, C., Hu, F., Yang, A., Zhou, J., & Lin, J. (2025). Stabilizing Reinforcement Learning with LLMs: Formulation and Practices. https://arxiv.org/abs/2512.01374
- Zheng, C., Liu, S., Li, M., Chen, X.-H., Yu, B., Gao, C., Dang, K., Liu, Y., Men, R., Yang, A., Zhou, J., & Lin, J. (2025). Group Sequence Policy Optimization. https://arxiv.org/abs/2507.18071