Training Inference Mismatch

Introduction

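Modern RL frameworks such as verl use different frameworks for the training engine and the inference engine, for example Megatron (Shoeybi et al., 2020) or FSDP (Zhao et al., 2023) for training, and vllm or SGLang for inference. Although both engines run the same model, they are optimized for different goals: the training engine needs numerical stability and precision, while the inference engine maximizes throughput. Because of this, the two engines differ in implementation and do not produce exactly the same outputs. As training proceeds, this shows up as a sudden explosion of the gradient norm, a sudden drop in reward, and finally a training collapse. We call this the training-inference mismatch (TIM) problem.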

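Earlier work mostly studied dense models. A dense model is close to continuous in its input, i.e. $f_{\theta}(x+\delta x)\approx f_{\theta}(x)$, so the TIM phenomenon is mild.

Most current models, however, are MoE models. The MLP of an MoE model is a discrete architecture: a router selects the top-K experts, and their outputs are combined by a weighted sum. Because the router is discontinuous, $f_{\theta}(x+\delta x)$ and $f_{\theta}(x)$ can differ substantially, and this error keeps affecting training until it collapses.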
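The discontinuity of top-K routing can be illustrated with a minimal sketch. The router logits and the helper `top_k_experts` below are made up for illustration and do not come from any framework. A perturbation of about 1e-3 on one logit, standing in for a numerical difference between two engines, is enough to change the selected experts.

```python
def top_k_experts(router_logits, k):
    """Return the indices of the k largest router logits, sorted by index."""
    ranked = sorted(range(len(router_logits)), key=lambda i: router_logits[i], reverse=True)
    return sorted(ranked[:k])

# Hypothetical router logits for 4 experts; experts 1 and 2 are nearly tied.
logits_train = [2.0, 1.0005, 1.0000, -1.0]
# The same logits with a tiny perturbation on expert 1 (illustrative numerics gap).
logits_infer = [2.0, 0.9995, 1.0000, -1.0]

experts_train = top_k_experts(logits_train, k=2)
experts_infer = top_k_experts(logits_infer, k=2)
print(experts_train, experts_infer)
```

The two calls select different second experts, so every later layer sees a different hidden state even though the inputs are almost identical.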

This post systematically reviews the papers and blog posts on TIM, and uses implementations to verify and reproduce the reported phenomena and fixes.

Shoeybi, M., Patwary, M., Puri, R., LeGresley, P., Casper, J., & Catanzaro, B. (2020). Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism. https://arxiv.org/abs/1909.08053
Zhao, Y., Gu, A., Varma, R., Luo, L., Huang, C.-C., Xu, M., Wright, L., Shojanazeri, H., Ott, M., Shleifer, S., Desmaison, A., Balioglu, C., Damania, P., Nguyen, B., Chauhan, G., Hao, Y., Mathews, A., & Li, S. (2023). PyTorch FSDP: Experiences on Scaling Fully Sharded Data Parallel. https://arxiv.org/abs/2304.11277

Method

Problem definition

The RL objective is

\mathcal{J}(\theta) = \mathbb{E}_{x\sim\mathcal{D},y\sim\pi_{\theta}(\cdot\mid x)}[R(x,y)]

Its gradient is

\nabla_{\theta}\mathcal{J}(\theta) =\mathbb{E}_{x\sim\mathcal{D},y\sim\pi_{\theta}(\cdot\mid x)}[R(x,y)\nabla_{\theta}\log\pi_{\theta}(y\mid x)]

Because the training engine and the inference engine differ, the sampled output $y$ actually comes from $\mu_{\theta_{\mathrm{old}}}$, where $\mu_{\theta_{\mathrm{old}}}$ is the inference engine's implementation of the policy $\pi_{\theta_{\mathrm{old}}}$ (e.g. vllm or SGLang).

Rewriting the objective with importance sampling gives

\htmlId{tim_objective}{\begin{equation} \mathcal{J}(\theta) = \mathbb{E}_{x\sim\mathcal{D}, y\sim\mu_{\theta_{\mathrm{old}}}(y\mid x)}\left[\frac{\pi_{\theta}(y\mid x)}{\mu_{\theta_{\mathrm{old}}}(y\mid x) }R(x,y)\right] \end{equation}}

Its gradient is

\nabla_{\theta}\mathcal{J}(\theta) =\mathbb{E}_{x\sim\mathcal{D}, y\sim\mu_{\theta_{\mathrm{old}}}(y\mid x)}\left[\frac{\pi_{\theta}(y\mid x)}{\mu_{\theta_{\mathrm{old}}}(y\mid x) }R(x,y)\nabla_{\theta}\log \pi_{\theta}(y\mid x)\right]
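The importance-sampling rewrite can be checked numerically on a toy problem. The sketch below assumes a single prompt with three possible outputs; the probabilities and rewards are invented for illustration. It compares the on-policy expectation, the expectation obtained by sampling from $\mu$ while ignoring the mismatch, and the importance-weighted expectation.

```python
# Toy check of E_{y~pi}[R(y)] == E_{y~mu}[pi(y)/mu(y) * R(y)] on three outcomes.
pi = [0.70, 0.20, 0.10]      # hypothetical training-engine policy
mu = [0.68, 0.21, 0.11]      # hypothetical inference-engine policy (slightly different)
reward = [1.0, 0.0, -1.0]    # hypothetical rewards R(y)

# Expectation under the policy we actually want to optimize.
on_policy = sum(p * r for p, r in zip(pi, reward))
# Ignoring the mismatch: samples come from mu but are treated as if they came from pi.
uncorrected = sum(m * r for m, r in zip(mu, reward))
# Importance-weighted expectation under mu.
corrected = sum(m * (p / m) * r for p, m, r in zip(pi, mu, reward))

print(on_policy, uncorrected, corrected)
```

The uncorrected estimate is biased, while the weighted one recovers the on-policy value up to floating-point error.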

Quantification

This section looks at how to diagnose the training-inference mismatch and how it evolves during training.

An intuitive metric is the difference between the two distributions $\mu_{\theta_{\mathrm{old}}}$ and $\pi_{\theta}$, i.e. the KL divergence

\mathrm{KL}_{TIM} = \mathbb{E}_{s\sim d_{\mu_{\theta_{\mathrm{old}}}}}\left[\mathrm{KL}(\mu_{\theta_{\mathrm{old}}}\mid\mid \pi_{\theta})\right]

Here $d_{\pi}$ is the stationary distribution of policy $\pi$, and $s$ is the state of the context.

$\mathrm{KL}_{TIM}$ is computed as follows

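import torch
from verl.utils.torch_functional import masked_mean  # assumed location of verl's helper

# Per-token log-probs of the sampled tokens: inference engine (mu) and training engine (pi_old)
rollout_log_probs = batch.batch["rollout_log_probs"]
actor_old_log_probs = batch.batch["old_log_probs"]
response_mask = batch.batch["response_mask"]
# d = log(pi_old / mu), evaluated on tokens sampled from mu
log_ratio = actor_old_log_probs - rollout_log_probs
# k3 estimator of KL(mu || pi_old): exp(d) - d - 1, averaged over response tokens
kl_tim_k3 = torch.exp(log_ratio) - log_ratio - 1
kl_tim_k3 = masked_mean(kl_tim_k3, response_mask)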
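A self-contained version of the same estimator is sketched below, assuming plain Python lists instead of the `batch` object. The helper `masked_mean` here is a local stand-in written for this sketch, and all log-probabilities are invented. The last part enumerates a small distribution to show that the expectation of the k3 term under $\mu$ equals $\mathrm{KL}(\mu\mid\mid\pi)$, which is why it is a reasonable estimator.

```python
import math

def k3(log_ratio):
    """k3 term: exp(d) - d - 1 with d = log pi - log mu (always >= 0)."""
    return math.exp(log_ratio) - log_ratio - 1.0

def masked_mean(values, mask):
    """Mean of `values` over positions where `mask` is 1 (local stand-in helper)."""
    kept = [v for v, m in zip(values, mask) if m]
    return sum(kept) / len(kept)

# Hypothetical per-token log-probs of the sampled tokens; the last position is padding.
rollout_log_probs = [-0.10, -2.30, -0.05, 0.0]
actor_old_log_probs = [-0.11, -2.90, -0.05, 0.0]
response_mask = [1, 1, 1, 0]

per_token = [k3(a - r) for a, r in zip(actor_old_log_probs, rollout_log_probs)]
kl_tim_k3 = masked_mean(per_token, response_mask)

# Sanity check on a fully enumerated distribution: E_mu[k3] == KL(mu || pi).
mu = [0.5, 0.3, 0.2]
pi = [0.4, 0.4, 0.2]
expected_k3 = sum(m * k3(math.log(p / m)) for p, m in zip(pi, mu))
exact_kl = sum(m * math.log(m / p) for p, m in zip(pi, mu))
```

In this toy batch the second token, which the rollout sampled with low probability, contributes almost all of the estimate.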

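We find that:

$\mathrm{KL}_{TIM}$ correlates strongly with entropy spikes, and less strongly with reward
The training engine assigns extremely low probability to some tokens sampled by the inference policy, which makes the gradient explode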

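The pattern of TIM: the KL divergence is larger on tokens to which the inference engine $\mu_{\theta_{\mathrm{old}}}$ assigns low sampling probability, especially as $\mu_{\theta_{\mathrm{old}}}(y_t\mid y_{<t})$ approaches $0$.

TIM is more severe in the non-first turns of multi-turn tool calling. The tool result that the model receives is an OOD input; on an OOD context the model is more likely to emit low-probability tokens, and these tokens are more likely to trigger TIM.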

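Result:

The vllm log ppl is larger in non-first rounds, which shows that the model samples low-probability tokens more often there

Causes:

The RL optimizer pushes the model weights beyond what bf16 precision can resolve
Kernel-level optimizations amplify this error

These two steps form a loop: the mismatch produces biased and noisy gradients, which push the parameters toward more extreme regions, which in turn makes TIM worse.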

MoE models

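In MoE models the training-inference inconsistency is amplified further. The router's expert selection is discrete, so even a tiny perturbation can change which experts are selected, and that change propagates to the outputs of all later layers. As training proceeds, policy updates change the routing decisions further and widen the gap.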

Analysis

Starting from this gradient, MiniRL (Zheng et al., 2025) proposes a framework for analyzing the training-inference inconsistency. The token-level importance sampling (IS) weight can be rewritten as

\frac{\pi_{\theta}(y_t\mid x,y_{<t})}{\mu_{\theta_{\mathrm{old}}}(y_t\mid x,y_{<t}) } = \frac{\pi_{\theta_{\mathrm{old}}}(y_t\mid x,y_{<t})}{\mu_{\theta_{\mathrm{old}}}(y_t\mid x,y_{<t})}\cdot \frac{\pi_{\theta}(y_t\mid x,y_{<t})}{\pi_{\theta_{\mathrm{old}}}(y_t\mid x,y_{<t})}

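The first factor is the training-inference discrepancy, and the second is the policy staleness.

Training-inference discrepancy mainly comes from numerical differences between training and inference, e.g. different kernels and implementations.
Policy staleness mainly comes from splitting a rollout batch into mini-batches and taking several gradient updates on it to improve training efficiency.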
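The decomposition is easy to verify in log space for a single token. The three probabilities below are invented for illustration; in practice they would be the log-probabilities returned by the inference engine and by two forward passes of the training engine.

```python
import math

# Hypothetical log-probs of one sampled token under the three policies.
logp_mu_old = math.log(0.020)   # inference engine, old weights (used for sampling)
logp_pi_old = math.log(0.012)   # training engine, old weights
logp_pi_new = math.log(0.015)   # training engine, current weights

discrepancy = math.exp(logp_pi_old - logp_mu_old)  # training-inference discrepancy
staleness = math.exp(logp_pi_new - logp_pi_old)    # policy staleness
is_weight = math.exp(logp_pi_new - logp_mu_old)    # full importance weight

print(discrepancy, staleness, is_weight)
```

Logging the two factors separately shows which source dominates at a given training step.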

For MoE models, the IS weight can be further rewritten as

\frac{\pi_{\theta}(y_t\mid x,y_{<t})}{\mu_{\theta_{\mathrm{old}}}(y_t\mid x,y_{<t}) } = \frac{\pi_{\theta}(y_t\mid x,y_{<t},\textcolor{red}{e_t^{\pi}})}{\mu_{\theta_{\mathrm{old}}}(y_t\mid x,y_{<t},\textcolor{red}{e_{\mathrm{old},t}^{\mu}}) } =\frac{\pi_{\theta_{\mathrm{old}}}(y_t\mid x,y_{<t},\textcolor{red}{e_{\mathrm{old},t}^{\pi}})}{\mu_{\theta_{\mathrm{old}}}(y_t\mid x,y_{<t},\textcolor{red}{e_{\mathrm{old},t}^{\mu}})}\cdot \frac{\pi_{\theta}(y_t\mid x,y_{<t},\textcolor{red}{e_t^{\pi}})}{\pi_{\theta_{\mathrm{old}}}(y_t\mid x,y_{<t},\textcolor{red}{e_{\mathrm{old},t}^{\pi}})}

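Here $e^\pi$ and $e^\mu$ are the routed experts in the training and inference engines. Compared with the dense case, the importance weight now carries two extra sources of error:

inconsistent expert routing between the training and inference engines
inconsistent expert routing between the policy model $\pi_{\theta}$ and the old policy model $\pi_{\theta_{\mathrm{old}}}$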

(Yao et al., 2025) proposed TODO

Zheng, C., Dang, K., Yu, B., Li, M., Jiang, H., Lin, J., Liu, Y., Lin, H., Wu, C., Hu, F., Yang, A., Zhou, J., & Lin, J. (2025). Stabilizing Reinforcement Learning with LLMs: Formulation and Practices. https://arxiv.org/abs/2512.01374

Solutions

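Approaches to TIM fall into two categories:

infra optimizations, which remove the implementation differences between the training engine and the inference engine
algorithm optimizations, which design the algorithm so that training still converges under mismatch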

KAT coder v1 pro
GEPO
DeepSeek-V3 (DeepSeek-AI et al., 2025)
GSPO (Zheng et al., 2025)
MiniRL

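Thinking Machines Lab: non-deterministic inference, numerical instability. The problems with this approach are:

training and inference engines use different kernels to reach peak performance
even within a single engine, batch-invariant kernels are disabled to reach maximum throughput

Ring-Flash-Linear-2.0 manually aligned the different kernels.

(Qi et al., 2025) propose training in FP16. The authors argue that BF16 precision is too low for the post-training stage and hurts the final result. Training in FP16 requires a dynamic scaling factor.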

Infra Optimizations

MiniMax-M1 (MiniMax et al., 2025) proposed using an FP32 lm head, but this alone still struggles to prevent training collapse.

Thinking Machines Lab proposed batch-invariant inference kernels, but they reduce training efficiency.
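The reason batch invariance matters is that floating-point addition is not associative, so a kernel that changes its reduction order with the batch shape can return different results for the same row. The sketch below only shows the arithmetic effect in float64 with hand-picked values; it is not a model of any particular kernel.

```python
# Floating-point addition is not associative, so the reduction order matters.
values = [1e16, 1.0, -1e16]

# Left-to-right: 1.0 is absorbed when added to 1e16, then the large terms cancel.
left_to_right = (values[0] + values[1]) + values[2]
# Reordered: the large terms cancel first, so 1.0 survives.
reordered = (values[0] + values[2]) + values[1]

print(left_to_right, reordered)
```

A batch-invariant kernel fixes the reduction order regardless of batch size, which costs some of the flexibility that fast kernels use to stay efficient.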

(Yao et al., 2025) note that we can

obtain the true sampling probabilities from the inference engine
remove the numerical inconsistency between the inference engine and the training engine
use deterministic kernel implementations

AReaL (Fu et al., 2026) uses decoupled PPO (Hilton et al., 2022) to relate rollout generation and gradient computation, but AReaL directly discards samples whose discrepancy is too large.

DeepSeek-V3.2 (DeepSeek-AI, Liu, Mei, et al., 2025) proposed sampling mask replay to fix the action-space inconsistency caused by top-p and top-k sampling.
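One reading of sampling mask replay is sketched below: the rollout records which tokens survived truncation, and the training side renormalizes its own logits over the same set, so both policies share one action space. The helper names and logits are hypothetical, and this is not the DeepSeek-V3.2 implementation.

```python
import math

def top_k_mask(logits, k):
    """1 for the k highest-logit tokens (the set the sampler can pick from), else 0."""
    kept = set(sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k])
    return [1 if i in kept else 0 for i in range(len(logits))]

def masked_log_softmax(logits, mask):
    """Log-softmax renormalized over the tokens allowed by `mask`."""
    log_z = math.log(sum(math.exp(l) for l, m in zip(logits, mask) if m))
    return [l - log_z if m else float("-inf") for l, m in zip(logits, mask)]

# Rollout: the inference engine samples with top-k and records the mask it used.
rollout_logits = [2.0, 1.0, 0.5, -1.0]
recorded_mask = top_k_mask(rollout_logits, k=2)

# Training: replay the recorded mask on the training-engine logits.
train_logits = [1.9, 1.1, 0.6, -1.0]
train_log_probs = masked_log_softmax(train_logits, recorded_mask)
print(recorded_mask, train_log_probs)
```

Without the replay, the training policy would spread probability over tokens that the sampler could never have produced, which biases the importance ratio.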

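(Qi et al., 2025) proposed using FP16 for RL training, because FP16 has more mantissa bits than BF16 (10 vs. 7) and therefore represents values with higher precision.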
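The precision gap can be seen with a small rounding experiment using only the standard library. The helpers below are written for this sketch: `round_to_fp16` relies on the half-precision `e` format of `struct`, and `round_to_bf16` emulates bfloat16 by rounding a float32 bit pattern to its top 16 bits (it does not handle NaN or infinity).

```python
import struct

def round_to_fp16(x):
    """Round to IEEE half precision (5 exponent bits, 10 mantissa bits)."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

def round_to_bf16(x):
    """Round to bfloat16 (8 exponent bits, 7 mantissa bits) via float32 bits."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    # Round to nearest even on the 16 low bits that bfloat16 drops.
    bits += 0x7FFF + ((bits >> 16) & 1)
    bits &= 0xFFFF0000
    return struct.unpack("<f", struct.pack("<I", bits))[0]

x = 1.0 + 2.0 ** -9   # needs 9 mantissa bits to be represented exactly
fp16_value = round_to_fp16(x)
bf16_value = round_to_bf16(x)
print(x, fp16_value, bf16_value)
```

FP16 keeps the small increment while BF16 rounds it away. The price is a much smaller exponent range, which is why FP16 training needs loss scaling.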

Router Replay

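During training, reuse the MoE routing decisions recorded earlier, so that routing differences do not inject noise into the policy gradient.

vanilla router replay (R2): during gradient updates, replay the experts selected by the old policy $\pi_{\theta_{\mathrm{old}}}$ in the training engine, which reduces policy staleness
rollout router replay (R3): replay in the training engine the experts selected by the inference engine, which reduces the training-inference discrepancy. Related work: DeepSeek-V3.2 (DeepSeek-AI, Liu, Mei, et al., 2025), (Ma et al., 2025)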
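A toy sketch of rollout router replay follows. It assumes a single token, scalar "experts" that just scale the input, and gate weights taken as a softmax over the selected experts; all names and numbers are hypothetical. The point is only that the expert indices come from the rollout, while the gate weights still come from the training engine's own router logits.

```python
import math

def select_experts(router_logits, k):
    """Top-k expert indices for one token, sorted by index."""
    order = sorted(range(len(router_logits)), key=lambda i: router_logits[i], reverse=True)
    return sorted(order[:k])

def moe_forward(x, router_logits, expert_scales, k, replay_experts=None):
    """Toy MoE layer; `replay_experts` overrides the router's own top-k choice."""
    experts = replay_experts if replay_experts is not None else select_experts(router_logits, k)
    # Gate weights are still computed from this engine's router logits.
    z = sum(math.exp(router_logits[e]) for e in experts)
    out = sum(math.exp(router_logits[e]) / z * expert_scales[e] * x for e in experts)
    return out, experts

expert_scales = [1.0, 2.0, 3.0, 4.0]
infer_logits = [2.0, 0.9995, 1.0000, -1.0]   # inference engine picks experts 0 and 2
train_logits = [2.0, 1.0005, 1.0000, -1.0]   # training engine alone would pick 0 and 1

_, rollout_experts = moe_forward(1.0, infer_logits, expert_scales, k=2)
_, free_experts = moe_forward(1.0, train_logits, expert_scales, k=2)
_, replayed_experts = moe_forward(1.0, train_logits, expert_scales, k=2,
                                  replay_experts=rollout_experts)
print(rollout_experts, free_experts, replayed_experts)
```

With replay, the training forward pass uses the same experts as the rollout even though its own logits rank them differently.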

Kernel Optimization

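use torch.compile to pin down the RoPE implementation and remove operator-level differences
use enable_batch_invariant_mode to force the training side to use the same operators as SGLang, so that the batch size no longer affects the result
use FlashAttention3 as the backend for both training and inference, to get bitwise-equal results
use DeepGEMM for matrix multiplication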

Algorithm Optimizations

Truncated Importance Sampling

(Yao et al., 2025) proposed truncated importance sampling (TIS) for the gradient of the objective. The new gradient is

\htmlId{tis_reinforce_gradient}{\begin{equation} \nabla_{\theta}\mathcal{J}(\theta) = \mathbb{E}_{x\sim\mathcal{D}, y\sim\mu_{\theta_{\mathrm{old}}}(y\mid x)}\left[\min\left(\frac{\pi_{\theta}(y\mid x)}{\mu_{\theta_{\mathrm{old}}}(y\mid x) }, C\right)R(x,y)\nabla_{\theta}\log \pi_{\theta}(y\mid x)\right] \end{equation}}

where $C>0$ is a hyperparameter.

For PPO (Schulman et al., 2017), the objective is

\mathcal{J}_{PPO}(\theta) = \mathbb{E}_{x\sim\mathcal{D}, y\sim\mu_{\theta_{\mathrm{old}}}(y\mid x)}\left[\min\left(\frac{\pi_{\theta}(y\mid x)}{\pi_{\theta_{\mathrm{old}}}(y\mid x) }\hat{A}_t,\mathrm{clip}\left(\frac{\pi_{\theta}(y\mid x)}{\pi_{\theta_{\mathrm{old}}}(y\mid x) },1-\epsilon,1+\epsilon\right)\hat{A}_t\right)\right]

With TIS, the gradient becomes

\htmlId{tis_ppo_gradient}{\begin{equation} \nabla_{\theta}\mathcal{J}_{PPO}(\theta) = \mathbb{E}_{x\sim\mathcal{D}, y\sim\mu_{\theta_{\mathrm{old}}}(y\mid x)}\left[\min\left(\frac{\pi_{\theta}(y\mid x)}{\mu_{\theta_{\mathrm{old}}}(y\mid x) }, C\right)\nabla_{\theta}\min\left(\frac{\pi_{\theta}(y\mid x)}{\pi_{\theta_{\mathrm{old}}}(y\mid x) }\hat{A}_t,\mathrm{clip}\left(\frac{\pi_{\theta}(y\mid x)}{\pi_{\theta_{\mathrm{old}}}(y\mid x) },1-\epsilon,1+\epsilon\right)\hat{A}_t\right)\right] \end{equation}}
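A minimal per-token sketch of the TIS-weighted PPO term is given below, written in plain Python so that only values are computed. The function name and the defaults `clip_eps=0.2` and `tis_cap=2.0` are illustrative assumptions. In an autograd framework the truncated weight would be treated as a constant, since it sits outside the gradient in the formula above.

```python
import math

def tis_ppo_token_loss(logp_new, logp_old, logp_rollout, advantage,
                       clip_eps=0.2, tis_cap=2.0):
    """Return (loss, weight): loss = -min(pi/mu, C) * PPO-clip surrogate."""
    # Truncated importance weight between the current policy and the rollout policy.
    tis_weight = min(math.exp(logp_new - logp_rollout), tis_cap)
    # Standard PPO ratio between the current and old training-engine policies.
    ratio = math.exp(logp_new - logp_old)
    clipped = max(1.0 - clip_eps, min(ratio, 1.0 + clip_eps))
    surrogate = min(ratio * advantage, clipped * advantage)
    return -tis_weight * surrogate, tis_weight

# Token that the rollout sampled with probability 0.001 but training scores at 0.01.
loss, weight = tis_ppo_token_loss(math.log(0.01), math.log(0.01), math.log(0.001), 1.0)
print(loss, weight)
```

The raw ratio for this token is about 10, and the cap limits its contribution to the update to a factor of 2.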

Masked Importance Sampling

(Liu et al., 2025) proposed sequence-level masked importance sampling (MIS)

\mathcal{J}_{seq-mis}(\theta) = \mathbb{E}_{x\sim\mathcal{D}, y\sim\mu_{\theta_{\mathrm{old}}}(y\mid x)}\left[\rho\cdot\mathbf{1}\left(\rho \leq C\right)\cdot\min\left(\frac{\pi_{\theta}(y\mid x)}{\pi_{\theta_{\mathrm{old}}}(y\mid x) }\hat{A}_t,\mathrm{clip}\left(\frac{\pi_{\theta}(y\mid x)}{\pi_{\theta_{\mathrm{old}}}(y\mid x) },1-\epsilon,1+\epsilon\right)\hat{A}_t\right)\right]

where

\rho = \frac{\pi_{\theta}(y\mid x)}{\mu_{\theta_{\mathrm{old}}}(y\mid x)}
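The sequence-level weight $\rho\cdot\mathbf{1}(\rho\leq C)$ can be sketched as follows, assuming per-token log-probabilities are available from both engines. The function name, the cap, and the numbers are hypothetical. The sequence ratio is the product of token ratios, so it is accumulated in log space.

```python
import math

def sequence_mis_weight(logp_new_tokens, logp_rollout_tokens, cap):
    """rho * 1(rho <= cap), with rho the sequence-level ratio pi(y|x) / mu(y|x)."""
    log_rho = sum(n - r for n, r in zip(logp_new_tokens, logp_rollout_tokens))
    rho = math.exp(log_rho)
    return rho if rho <= cap else 0.0

# A sequence with a small mismatch is kept with its weight.
kept = sequence_mis_weight([-0.10, -0.20], [-0.10, -0.25], cap=2.0)
# A sequence whose ratio exceeds the cap is masked out entirely.
dropped = sequence_mis_weight([-0.10, -1.00], [-0.10, -3.00], cap=2.0)
print(kept, dropped)
```

Unlike truncation, masking removes the whole sequence from the update instead of clipping its weight to $C$.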

DeepSeek-V3.2 (DeepSeek-AI, Liu, Mei, et al., 2025) proposed a binary mask on rollouts

\mathcal{J}_{GRPO}(\theta) = \mathbb{E}_{x\sim\mathcal{D}, y_i\sim\mu_{\theta_{\mathrm{old}}}(y\mid x),i=1,\dots,G}\left[\frac{1}{G}\sum_{i=1}^G\frac{1}{|y_i|}\sum_{t=1}^{|y_i|}\mathcal{M}_{i,t}\cdot\min\left(r_{i,t}(\theta)\hat{A}_{i,t},\mathrm{clip}\left(r_{i,t}(\theta),1-\epsilon,1+\epsilon\right)\hat{A}_{i,t}\right)\right]

where

r_{i,t}(\theta) = \frac{\pi_{\theta}(y_{i,t}\mid y_{i,<t},x)}{\pi_{\theta_{\mathrm{old}}}(y_{i,t}\mid y_{i,<t},x)}

and $\mathcal{M}_{i,t}$ masks sequences with negative advantage

\mathcal{M}_{i,t} = \begin{cases} 0, &\text{ if } \hat{A}_{i,t}<0,\frac{1}{|y_i|}\sum_{t=1}^{|y_i|}\log r_{i,t}>\delta\\ 1, &\text{ otherwise} \end{cases}

IcePop (Zhao et al., 2025) proposed a mask-based importance sampling method, with the objective

\mathcal{J}_{IcePop}(\theta) = \mathbb{E}_{x\sim\mathcal{D}, y\sim\mu_{\theta_{\mathrm{old}}}(y\mid x)}\left[\mathcal{M}\left(\frac{\pi_{\theta}(y\mid x)}{\mu_{\theta_{\mathrm{old}}}(y\mid x)},\alpha,\beta\right)\cdot\min\left(\frac{\pi_{\theta}(y\mid x)}{\pi_{\theta_{\mathrm{old}}}(y\mid x) }\hat{A}_t,\mathrm{clip}\left(\frac{\pi_{\theta}(y\mid x)}{\pi_{\theta_{\mathrm{old}}}(y\mid x) },1-\epsilon,1+\epsilon\right)\hat{A}_t\right)\right]

where

\mathcal{M}\left(x,\alpha,\beta\right) = \begin{cases} x, &\text{ if } x\in [\alpha,\beta]\\ 0, &\text{ otherwise} \end{cases}
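The mask function translates directly into code. In the sketch below the bounds `alpha=0.5` and `beta=5.0` and the ratios are illustrative values, not settings taken from the IcePop post.

```python
def icepop_mask(ratio, alpha, beta):
    """Keep the ratio pi/mu when it lies in [alpha, beta], otherwise zero the token out."""
    return ratio if alpha <= ratio <= beta else 0.0

# Hypothetical per-token ratios pi_theta / mu_theta_old.
ratios = [0.3, 0.9, 1.0, 1.4, 6.0]
masked = [icepop_mask(r, alpha=0.5, beta=5.0) for r in ratios]
print(masked)
```

Tokens outside the band contribute nothing to the gradient, on both the low side and the high side.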

However, as training proceeds, IcePop masks fewer and fewer tokens, so TIM persists. To address this, KPop (Guo et al., 2026) keeps the same objective but changes the mask region to

\mathcal{M}\left(y_t;\phi\right) = \mathbf{1}\left(\mathrm{KL}_B(\pi_{\theta}(y_t\mid y_{<t},x)\mid\mid \mu_{\theta_{\mathrm{old}}}(y_t\mid y_{<t},x))\leq \phi\right)\cdot \mathbf{1}\left(\mathrm{KL}_B(\mu_{\theta_{\mathrm{old}}}(y_t\mid y_{<t},x)\mid\mid \pi_{\theta}(y_t\mid y_{<t},x))\leq \phi\right)

where $\mathrm{KL}_B(P(x)\mid\mid Q(x))$ is the binary KL divergence, which the mask applies in both directions:

\mathrm{KL}_B(P(x)\mid\mid Q(x)) = P(x)\log \frac{P(x)}{Q(x)} + (1-P(x))\log\frac{1-P(x)}{1-Q(x)}
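A small sketch of the two-sided binary KL mask follows, assuming token probabilities strictly between 0 and 1. The function names, the threshold `phi=0.01`, and the probabilities are hypothetical.

```python
import math

def binary_kl(p, q):
    """KL divergence between Bernoulli(p) and Bernoulli(q), for 0 < p, q < 1."""
    return p * math.log(p / q) + (1.0 - p) * math.log((1.0 - p) / (1.0 - q))

def kpop_mask(p_train, p_rollout, phi):
    """1 if the binary KL is at most phi in both directions, else 0."""
    forward = binary_kl(p_train, p_rollout)
    backward = binary_kl(p_rollout, p_train)
    return 1 if (forward <= phi and backward <= phi) else 0

# Nearly matching probabilities: the token is kept.
keep = kpop_mask(p_train=0.50, p_rollout=0.52, phi=0.01)
# The rollout sampled a token that the training engine considers very unlikely: masked.
drop = kpop_mask(p_train=0.001, p_rollout=0.05, phi=0.01)
print(keep, drop)
```

Because the test is on a divergence and not on the raw ratio, the same absolute gap is treated more strictly near probabilities of 0 or 1 than near 0.5.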

DeepSeek-AI, Liu, A., Feng, B., Xue, B., Wang, B., Wu, B., Lu, C., Zhao, C., Deng, C., Zhang, C., Ruan, C., Dai, D., Guo, D., Yang, D., Chen, D., Ji, D., Li, E., Lin, F., Dai, F., … Pan, Z. (2025). DeepSeek-V3 Technical Report. https://arxiv.org/abs/2412.19437
DeepSeek-AI, Liu, A., Mei, A., Lin, B., Xue, B., Wang, B., Xu, B., Wu, B., Zhang, B., Lin, C., Dong, C., Lu, C., Zhao, C., Deng, C., Xu, C., Ruan, C., Dai, D., Guo, D., Yang, D., … Qu, Z. (2025). DeepSeek-V3.2: Pushing the Frontier of Open Large Language Models. https://arxiv.org/abs/2512.02556
Fu, W., Gao, J., Shen, X., Zhu, C., Mei, Z., He, C., Xu, S., Wei, G., Mei, J., Wang, J., Yang, T., Yuan, B., & Wu, Y. (2026). AReaL: A Large-Scale Asynchronous Reinforcement Learning System for Language Reasoning. https://arxiv.org/abs/2505.24298
Guo, J., Sun, Y., Huang, Z., Wang, Z., Wen, Z., Zhang, Z., Zhou, J., & Kok, S. (2026). KPop: Taming Training–Inference Mismatch in Reinforcement Learning with Adaptive Masking Regions. https://ringtech.notion.site/kpop
Hilton, J., Cobbe, K., & Schulman, J. (2022). Batch size-invariance for policy optimization. https://arxiv.org/abs/2110.00641
Liu, J., Li, Y., Fu, Y., Wang, J., Liu, Q., & Jiang, Z. (2025, September). When Speed Kills Stability: Demystifying RL Collapse from the Training-Inference Mismatch. https://richardli.xyz/rl-collapse
Ma, W., Zhang, H., Zhao, L., Song, Y., Wang, Y., Sui, Z., & Luo, F. (2025). Stabilizing MoE Reinforcement Learning by Aligning Training and Inference Routers. https://arxiv.org/abs/2510.11370
MiniMax, Chen, A., Li, A., Gong, B., Jiang, B., Fei, B., Yang, B., Shan, B., Yu, C., Wang, C., Zhu, C., Xiao, C., Du, C., Zhang, C., Qiao, C., Zhang, C., Du, C., Guo, C., … Sun, Z. (2025). MiniMax-M1: Scaling Test-Time Compute Efficiently with Lightning Attention. https://arxiv.org/abs/2506.13585
Qi, P., Liu, Z., Zhou, X., Pang, T., Du, C., Lee, W. S., & Lin, M. (2025). Defeating the Training-Inference Mismatch via FP16. https://arxiv.org/abs/2510.26788
Schulman, J., Wolski, F., Dhariwal, P., Radford, A., & Klimov, O. (2017). Proximal Policy Optimization Algorithms. https://arxiv.org/abs/1707.06347
Yao, F., Liu, L., Zhang, D., Dong, C., Shang, J., & Gao, J. (2025). Your Efficient RL Framework Secretly Brings You Off-Policy RL Training. In Feng Yao’s Notion. https://fengyao.notion.site/off-policy-rl
Zhao, X., Liu, Y., Xu, K., Guo, J., Wang, Z., Sun, Y., Kong, X., Cao, Q., Jiang, L., Wen, Z., Zhang, Z., & Zhou, J. (2025). Small Leak Can Sink a Great Ship–Boost RL Training on MoE with IcePop! https://ringtech.notion.site/icepop
Zheng, C., Liu, S., Li, M., Chen, X.-H., Yu, B., Gao, C., Dang, K., Liu, Y., Men, R., Yang, A., Zhou, J., & Lin, J. (2025). Group Sequence Policy Optimization. https://arxiv.org/abs/2507.18071