The migration from the vLLM V0 engine to V1 for reinforcement-learning rollouts turned out to hinge on trainer assumptions rather than raw speed: V1 changes what the returned logprobs mean by default and which runtime paths requests take, and an online RL trainer that treats rollout logprobs as part of its optimization target inherits every one of those changes. The fixes below restore parity with the V0 reference first, and only then ask whether the objective itself needs any correction.
Overview
vLLM V1 is a substantial rewrite of the inference engine relative to V0. The objective of this migration was deliberately narrow: verify that V1 returns rollout logprobs in the form the trainer expects, rerun the same workload against the V0 reference, and evaluate objective-level changes only after backend parity was restored. The reference run used vLLM 0.8.5; the V1 runs used vLLM 0.18.1.
What was fixed
Four specific issues had to be resolved before V1 matched V0 behavior:
- Logprob semantics: vLLM V1 returns logprobs from raw model outputs by default, before logits post-processing such as temperature scaling, penalties, and top-k/top-p filtering. PipelineRL expected logprobs from the processed distribution the sampler actually used. The required setting was logprobs-mode=processed_logprobs (see the configuration sketch after this list).
- Runtime defaults: The early V1 run picked up V1 runtime defaults along with the new engine: prefix caching, async scheduling, and an ad-hoc disable-cascade-attn override. For the parity run, these were made explicit: enable-prefix-caching: false, async-scheduling: false. Prefix caching introduced a V1-only difference in cache lifetime and reuse relative to the V0 reference path, and disabling it removed one V1-only degree of freedom from the parity comparison.
- Inflight weight updates: Weight synchronization had to match the online-RL update model. The V1 analogue used mode="keep" and clear_cache=False to match the V0 wrapper behavior, which left cached state intact on update.
- fp32 lm_head: The trainer used an fp32 lm_head for the final projection, and the rollout backend had to match that behavior. A closely related issue appears in the MiniMax-M1 technical report: their RL run showed a training/inference token-probability mismatch that they traced to the LM output head and fixed by computing the head in fp32.
Failure modes
The team separated possible causes into three layers:
- Semantic mismatch: the backend returns logprobs with different meaning relative to what the trainer expects.
- Inference-path mismatch: the backend uses different runtime defaults for caching, scheduling, or request handling, so the same prompts follow a different execution path.
- Objective mismatch: the RL objective needs correction for the amount of staleness or backend mismatch that remains.
The useful diagnosis came from treating the first two as backend behavior problems and ruling them out first.
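A direct way to act on that ordering is a parity check that replays a rollout through the trainer-side model and compares recomputed per-token logprobs with what the backend reported; a large gap implicates the first two layers, not the objective. The sketch below assumes an HF-style causal LM on the trainer side and an illustrative rollout record with prompt_ids, response_ids, and the backend's per-token logprobs.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def rollout_parity_gap(model, rollout, device="cuda"):
    """Compare backend-reported logprobs with trainer-recomputed ones.

    `rollout` is an illustrative record carrying prompt_ids, response_ids,
    and the per-token logprobs returned by the inference backend.
    A large max gap points at a semantic or inference-path mismatch,
    not at the RL objective.
    """
    input_ids = torch.tensor(
        [rollout.prompt_ids + rollout.response_ids], device=device
    )
    logits = model(input_ids).logits  # [1, seq_len, vocab], HF-style causal LM assumed

    # logits at position t score token t+1, so the response tokens are
    # predicted by positions len(prompt)-1 .. len(prompt)+len(response)-2
    start = len(rollout.prompt_ids) - 1
    end = start + len(rollout.response_ids)
    resp_logits = logits[0, start:end, :]

    # If the backend returns processed logprobs, apply the same post-processing
    # (temperature, penalties, top-k/top-p) here before comparing.
    recomputed = F.log_softmax(resp_logits.float(), dim=-1)  # fp32 softmax for the comparison

    resp_ids = torch.tensor(rollout.response_ids, device=device).unsqueeze(-1)
    recomputed = recomputed.gather(-1, resp_ids).squeeze(-1)

    backend = torch.tensor(rollout.logprobs, device=device, dtype=recomputed.dtype)
    return (recomputed - backend).abs().max().item()
```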
Why backend correctness first
Objective-side corrections such as truncated importance sampling, importance-ratio reweighting, and related methods are useful tools. If rollouts are intentionally stale, generated asynchronously, or produced by a backend that cannot be made equivalent to the trainer-side policy, some form of correction is often the right thing to add. The first problem here, though, was inference correctness: after the move to V1, the rollout backend returned logprobs and exhibited runtime behavior that broke the trainer's assumptions. Adding an objective-side correction at that point would have mixed two questions: is the inference backend producing the right logprobs, and, given correct logprobs, does the objective still need an off-policy or async correction? Those questions need to be kept separate; otherwise the objective-side correction can end up compensating for broken inference-backend behavior, which makes the training curve harder to interpret.
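For contrast, here is roughly what the objective-side tool looks like once the backend is trusted: a token-level policy-gradient surrogate weighted by a truncated importance ratio between the trainer policy and the rollout policy. This is a generic sketch of truncated importance sampling, not PipelineRL's implementation; the cap rho_max is an illustrative hyperparameter. The last comment in the function is the point of this section.

```python
import torch

def truncated_is_surrogate(trainer_logprobs, rollout_logprobs, advantages, rho_max=2.0):
    """Token-level policy-gradient surrogate with a truncated importance ratio.

    trainer_logprobs: log pi_theta(token | context), recomputed by the trainer (requires grad)
    rollout_logprobs: log pi_rollout(token | context), reported by the inference backend
    advantages:       per-token advantage estimates
    rho_max:          truncation cap on the importance ratio (illustrative value)
    """
    # importance ratio between the current policy and the policy that generated the rollout
    ratio = torch.exp(trainer_logprobs - rollout_logprobs.detach())
    # truncation: tokens whose ratio exceeds rho_max contribute a capped weight
    # (and no gradient through the ratio beyond the cap)
    truncated = torch.clamp(ratio, max=rho_max)
    # If rollout_logprobs have the wrong semantics (raw instead of processed, or a
    # different inference path), this term silently compensates for backend error,
    # which is exactly why backend correctness has to be established first.
    return -(truncated * advantages.detach()).mean()
```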
Bottom line
The main lesson from this migration is a narrow one: fix backend correctness first, then add corrections for the mismatch that remains. The same class of mismatch can surface in PPO, GRPO, or any online RL system that treats rollout-side logprobs as part of the optimization target.