PPO and GRPO both need to compute KL(π ‖ π_ref) = E_{x∼π}[log(π(x)/π_ref(x))] at every training step, as the reference-policy anchor that prevents the model from drifting too far during training. Computing this exactly by summing over the entire vocabulary isn’t feasible — the logits tensor for a single step already eats a substantial fraction of VRAM, and a full-vocab KL summation pushes it over. So the community estimates it from samples.

John Schulman’s blog post lays out three estimators, conventionally called K1, K2, and K3. Papers almost never document which one they used, but swapping K1 for K3 inside the same RL recipe moves the final reward by single-digit percentage points. This post walks through the derivations, the bias / variance trade-offs, and which one to actually use.

Notation: x ∼ π is a sample, r = π_ref(x) / π(x).
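In code, r never needs the probabilities themselves; the two per-token logprobs the training loop already has are enough. A minimal PyTorch sketch, where logp and logp_ref are placeholder names rather than anything from a specific codebase:

import torch

# Per-token logprobs of the sampled tokens under each policy. In a real
# loop these come from gathering the sampled token's entry out of each
# model's log_softmax output.
logp = torch.randn(8)        # log π(x), one entry per sampled token
logp_ref = torch.randn(8)    # log π_ref(x) for the same tokens

log_r = logp_ref - logp      # log r = log(π_ref(x) / π(x))
r = log_r.exp()              # r itself, the only input the estimators need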

K1: the plug-in estimator

K1 = −log r = log(π / π_ref)

Unbiased by definition: E_{x∼π}[−log r] = E_{x∼π}[log(π/π_ref)] = KL(π ‖ π_ref).

K1 has no sign constraint at the single-sample level. When r > 1 (the reference assigns this sample more mass than the current policy does), −log r is negative; when r < 1, it’s positive. Under any real policy drift, r has heavy tails on both sides, and the per-sample estimate can be a large positive or large negative number. K1 has the highest variance of the three.

A practical consequence: KL estimated with K1 on a small batch can come out negative on average — mathematically impossible for the true KL. In training, that shows up as a negative KL-penalty contribution to the loss, briefly rewarding the model for moving away from the reference. It looks absurd, but on a small enough batch K1 actually allows it.
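The failure is easy to reproduce on a toy vocabulary. The two categorical distributions and the batch size below are arbitrary, chosen only so the true KL is small:

import torch

torch.manual_seed(0)
pi = torch.distributions.Categorical(probs=torch.tensor([0.5, 0.3, 0.2]))      # π
pi_ref = torch.distributions.Categorical(probs=torch.tensor([0.4, 0.4, 0.2]))  # π_ref

x = pi.sample((16,))                          # a small batch, x ∼ π
log_r = pi_ref.log_prob(x) - pi.log_prob(x)   # log(π_ref/π) per sample
k1 = -log_r

# The true KL here is ≈ 0.025, yet across reruns with different seeds
# the 16-sample mean comes out negative a noticeable fraction of the time.
print(k1.mean())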

K2: lower variance, at the cost of bias

K2 = 0.5 · (log r)²

K2 is always non-negative (squared), and its single-sample variance is about an order of magnitude lower than K1’s. Intuitively, K2 throws away the sign of log r and keeps only magnitude.

The cost is bias. K2 estimates 0.5 · E[(log r)²], not KL. The two coincide only when π ≈ π_ref, via Taylor: log r = (r − 1) − 0.5·(r − 1)² + O((r − 1)³), so (log r)² ≈ (r − 1)² to leading order, while KL ≈ 0.5·E[(r − 1)²] in the same limit. Once the policy drifts meaningfully, K2 can sit well below the true KL: the monitored number stays small and well behaved while the quantity it measures has detached from the KL the penalty was supposed to enforce.

The practical risk is that the training curve looks healthy — non-negative KL, low variance — while the actual policy is drifting more than the loss reports. The penalty stops doing its job before you notice.
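The undershoot can be exhibited exactly on a toy vocabulary by computing both expectations in closed form, so no sampling noise muddies the comparison. The distributions are arbitrary, picked only to make the drift large:

import torch

pi = torch.tensor([0.98, 0.01, 0.01])      # π, collapsed hard onto token 0
pi_ref = torch.tensor([0.20, 0.40, 0.40])  # π_ref, still spread out

log_r = pi_ref.log() - pi.log()           # log(π_ref/π) per token
kl_true = (pi * -log_r).sum()             # exact KL(π ‖ π_ref), ≈ 1.48
k2_mean = (pi * 0.5 * log_r**2).sum()     # exact E[K2] under π, ≈ 1.37

print(kl_true.item(), k2_mean.item())     # K2 under-reports the drift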

K3: a control variate gets you unbiasedness and low variance

K3 = (r − 1) − log r

K3 is K1 plus a zero-mean control variate (r − 1). The expectation of (r − 1) under π is zero:

E_{x∼π}[r − 1] = E_{x∼π}[π_ref/π] − 1
              = Σ_x π(x) · (π_ref(x)/π(x)) − 1
              = Σ_x π_ref(x) − 1
              = 1 − 1 = 0

Adding it to K1 leaves the expectation untouched (still KL), but the variance drops because (r − 1) and −log r are negatively correlated: when r is large, (r − 1) swings positive while −log r swings negative, and when r is small the signs flip. Summing them cancels much of the shared noise, so K3 is less noisy than −log r alone.
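The cancellation is visible directly by putting K1 and K3 side by side on the same draws, reusing the toy pair from the K1 snippet:

import torch

torch.manual_seed(0)
pi = torch.distributions.Categorical(probs=torch.tensor([0.5, 0.3, 0.2]))      # π
pi_ref = torch.distributions.Categorical(probs=torch.tensor([0.4, 0.4, 0.2]))  # π_ref

x = pi.sample((100_000,))
log_r = pi_ref.log_prob(x) - pi.log_prob(x)
r = log_r.exp()

k1 = -log_r
k3 = (r - 1) - log_r

# Both means sit near the true KL ≈ 0.025; on this pair K3's per-sample
# spread is more than an order of magnitude tighter than K1's.
print(k1.mean().item(), k1.std().item())
print(k3.mean().item(), k3.std().item())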

K3 is also always non-negative. Taylor-expanding −log r around r = 1:

−log r = −(r − 1) + 0.5·(r − 1)² − (r − 1)³/3 + ...
K3 = (r − 1) + (−log r) = 0.5·(r − 1)² + O((r − 1)³)

The leading term is 0.5·(r − 1)² ≥ 0. More rigorously, K3 vanishes at r = 1, its first derivative K3'(r) = 1 − 1/r also vanishes there, and its second derivative K3''(r) = 1/r² > 0 is strictly positive — K3 is convex in r with global minimum 0 at r = 1.

So K3 collects all three properties at once: unbiased (like K1), non-negative (like K2), and lower variance than K1. Schulman recommends K3; DeepSeek and the R1-lineage code use it by default.
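In a training loop the estimator itself is a few lines. A sketch of the shape it takes in a GRPO-style step, with every tensor name and shape hypothetical:

import torch

def k3_kl(logp: torch.Tensor, logp_ref: torch.Tensor) -> torch.Tensor:
    # Per-token K3 estimate of KL(π ‖ π_ref) from sampled-token logprobs.
    log_r = logp_ref - logp             # log(π_ref/π) at the sampled tokens
    return log_r.exp() - 1.0 - log_r    # (r - 1) - log r, elementwise ≥ 0

# Hypothetical [batch, seq] logprobs gathered at the sampled tokens.
logp = torch.randn(4, 128, requires_grad=True)
logp_ref = (logp + 0.1 * torch.randn(4, 128)).detach()   # frozen reference

kl_penalty = k3_kl(logp, logp_ref).mean()
# loss = policy_loss + beta * kl_penalty, with beta the usual KL coefficient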

What the difference actually looks like

The gap between estimators matters more than the literature suggests. Across the same GRPO-plus-verifier recipe, switching from K1 to K3 moves the final reward by single-digit percentage points; the exact size depends on the task (math is less sensitive, code more so), but it is well above noise. Most papers don't state which estimator they used, so part of a published “algorithm improvement” can in fact be an estimator difference rather than an algorithmic one.

The practical default is K3. K2 has its users, but anyone reaching for it should know that under real policy drift the quantity being measured isn't the KL anymore. K1 mostly shows up in pedagogical code or sanity checks; its variance makes training too unstable to rely on.

The control-variate trick isn’t specific to KL estimation. REINFORCE’s baseline subtraction (advantage = reward − baseline) is the same idea: subtract a zero-mean quantity to reduce variance without moving the expectation. K3 is that pattern applied very neatly to KL.
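For comparison, the REINFORCE version of the same move, in equally schematic form (all names illustrative):

import torch

def reinforce_loss(logp: torch.Tensor, reward: torch.Tensor) -> torch.Tensor:
    # E_{x∼π}[∇ log π(x)] = 0, so subtracting an action-independent
    # baseline leaves the expected gradient unchanged and cuts variance.
    advantage = reward - reward.mean()   # batch mean as a crude baseline
    return -(advantage * logp).mean()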