A practitioner's tour of RL for LLMs — DPO, GRPO, GSPO, AReaL
Every six weeks the field invents a new acronym for “PPO with one of the components rearranged.” Here’s the alphabet soup I’m actually running today, why we picked each, and what I’d tell someone starting from scratch.
The lineage in 90 seconds
Classic RLHF (PPO). SFT → train a reward model → run PPO against it. Four models in memory (policy, ref, critic, RM), brittle hyperparameters, expensive rollouts. This is what InstructGPT used and what most people still picture when they hear “RLHF.” It works; it’s just heavy.
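For orientation, every method below is chasing the same KL-regularized objective; in the usual notation (reward r, reference policy pi_ref, KL weight beta), roughly:

```latex
\max_{\pi_\theta}\;
\mathbb{E}_{x \sim \mathcal{D},\; y \sim \pi_\theta(\cdot \mid x)}\big[\, r(x, y) \,\big]
\;-\; \beta\, \mathbb{D}_{\mathrm{KL}}\!\big[\, \pi_\theta(\cdot \mid x) \;\|\; \pi_{\mathrm{ref}}(\cdot \mid x) \,\big]
```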
DPO. Direct Preference Optimization showed that, under the standard Bradley-Terry preference model, the reward-model-then-PPO loop can be collapsed into a classification-style objective on preference pairs. So you skip RL entirely and do supervised learning on (prompt, preferred, dispreferred). Two models, no critic, no rollouts. We used DPO for early preference tuning — stylistic stuff like tone, refusals, persona. It works for that. It does not give you reasoning. You cannot DPO your way to math.
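The practical upshot: once you have summed sequence log-probs under the policy and the frozen reference, the whole loss is a few lines. A minimal PyTorch-style sketch (the function name and the beta default are illustrative, not from the paper):

```python
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Inputs are summed log-probs per completion (one scalar per pair member)."""
    chosen_margin = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_margin = beta * (policy_rejected_logps - ref_rejected_logps)
    # -log sigmoid of the margin: push the preferred completion's implicit
    # reward above the dispreferred one's.
    return -F.logsigmoid(chosen_margin - rejected_margin).mean()
```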
GRPO. DeepSeek’s contribution. Drop the critic from PPO and replace it with a group-relative advantage: sample K answers per prompt, then normalize each answer’s reward against the group’s mean (and usually its std); the group statistics are the baseline. Cuts memory ~50% versus PPO. The bigger deal is that it pairs naturally with verifiable rewards — math correctness, code execution, format match — instead of a learned reward model. If your task has a verifier, GRPO is the right default. R1 made this famous, and rightly so.
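The critic-free advantage really is just group statistics; a sketch, assuming one scalar verifier reward per sampled completion:

```python
import torch

def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-6):
    """rewards: (num_prompts, K) verifier scores for K samples per prompt.
    The group mean stands in for the critic; dividing by the group std is
    the common normalization choice."""
    mean = rewards.mean(dim=1, keepdim=True)
    std = rewards.std(dim=1, keepdim=True)
    return (rewards - mean) / (std + eps)
```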
“Clip GRPO” you’ll see in the wild is just GRPO with PPO-style importance-ratio clipping, which most implementations do anyway. Not a separate algorithm.
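The clipping is the familiar PPO surrogate, just fed group-relative advantages; a token-level sketch (pad masking omitted, names illustrative):

```python
import torch

def clipped_token_loss(logp_new, logp_old, advantages, clip_eps=0.2):
    """logp_new / logp_old: (batch, seq_len) token log-probs under the current
    and rollout policies. advantages: (batch, 1) group-relative advantage,
    broadcast over each completion's tokens."""
    ratio = torch.exp(logp_new - logp_old)                    # token-level ratio
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps)
    return -torch.minimum(ratio * advantages, clipped * advantages).mean()
```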
GSPO. From the Qwen team, used to train Qwen3. GRPO’s token-level importance ratios are noisy, especially on MoE, where expert routing can shift between rollout and update, so the gradient for a token can land on different experts than the ones that generated it. GSPO redefines the importance ratio at the sequence level with length normalization. It trains MoE without the stack of stabilization hacks GRPO needs. If you’re touching MoE, just use GSPO.
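The core change fits in one function: the importance ratio is defined once per sequence with a 1/|y| length normalization, and clipping then happens at the sequence level too. A sketch of that ratio (my reading of the definition, not the paper's code):

```python
import torch

def gspo_sequence_ratio(logp_new, logp_old, mask):
    """Length-normalized sequence-level ratio:
    s_i = (pi_theta(y_i | x) / pi_old(y_i | x)) ** (1 / |y_i|),
    computed as exp(mean per-token log-ratio). mask marks non-pad tokens."""
    log_ratio = (logp_new - logp_old) * mask
    lengths = mask.sum(dim=1).clamp(min=1)
    return torch.exp(log_ratio.sum(dim=1) / lengths)   # one ratio per sequence
```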
AReaL. The Ant / Tsinghua async RL infrastructure. Not an algorithm — a system. The old loop was synchronous: generate a batch, then train on it, GPUs idle while the longest generation finishes. AReaL decouples generation workers and training workers entirely. Throughput up ~2.7×. The paper does it on math reasoning but the design generalizes anywhere you’re rollout-bound.
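The shape of the decoupling, reduced to a toy (this is not AReaL's code or API, just a producer/consumer skeleton with dummy generation and training steps):

```python
import queue
import random
import threading
import time

rollouts: "queue.Queue[list[float]]" = queue.Queue(maxsize=64)

def generation_worker():
    # Stands in for the inference fleet: keeps producing rollouts of uneven
    # latency, never waits for the trainer.
    while True:
        time.sleep(random.uniform(0.1, 1.0))
        rollouts.put([random.random() for _ in range(8)])

def training_worker(steps: int = 50):
    # Stands in for the training fleet: consumes whatever rollouts are ready.
    for step in range(steps):
        batch = rollouts.get()
        time.sleep(0.2)  # one optimizer step
        if step % 10 == 0:
            # In a real system you'd also push fresh weights to the generators
            # here and enforce a staleness window on what you accept.
            print(f"step {step}: trained on {len(batch)} samples")

threading.Thread(target=generation_worker, daemon=True).start()
training_worker()
```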
What I’d actually do today
- Verifiable rewards (math, code, format) → GRPO with clipping. The verifier matters more than the algorithm. Spend your time on the reward signal.
- MoE → GSPO. Don’t fight GRPO’s instability when there’s a better tool.
- Only have preferences → DPO. Don’t pretend you have a reward function when you don’t.
- Rollout-bound on big models → go async. AReaL or roll your own. Synchronous RL is hard to justify on anything large.
The things that actually bit us
Reward hacking is not a theoretical concern. Our first attempt to reward “structured math output” produced solutions wrapped in beautiful LaTeX that contained no actual math. The model learned the format perfectly and ignored the substance. Whatever you reward, hold out an independent sanity check the model can’t game.
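A toy version of that failure mode: a format-only reward is trivially gameable, so keep a held-out check on the actual answer. The regex and exact-match check here are deliberately crude stand-ins for a real math checker:

```python
import re

def format_reward(completion: str) -> float:
    # A format-only reward like the one that got gamed: "looks like structured math".
    return 1.0 if re.search(r"\\boxed\{.+\}", completion) else 0.0

def answer_check(completion: str, target: str) -> float:
    # The held-out sanity check: the boxed content must actually match the target.
    match = re.search(r"\\boxed\{([^}]*)\}", completion)
    if match is None:
        return 0.0
    return 1.0 if match.group(1).strip() == target.strip() else 0.0
```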
Off-policy data goes stale fast. After a few hundred steps, samples from the original policy are basically noise. Either regenerate rollouts often or keep a very small staleness window. AReaL handles this with a staleness-aware PPO variant; if you’re DIY, at least track the divergence.
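If you do reuse rollouts, at least measure how far the current policy has drifted from the one that generated them; a rough per-token KL estimate over the rollout tokens is enough to trigger regeneration. A sketch (the threshold is up to you):

```python
import torch

def rollout_staleness_kl(logp_current, logp_behavior, mask):
    """Monte Carlo estimate of KL(behavior || current) per token, evaluated on
    the rollout tokens themselves. logp_*: (batch, seq_len); mask marks
    non-pad tokens. Large values mean the rollouts have gone stale."""
    per_token = (logp_behavior - logp_current) * mask
    return per_token.sum() / mask.sum().clamp(min=1)
```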
Hyperparameter sensitivity is the dirty secret. The same algorithm with slightly different kl_coef, group size K, batch size, or learning rate moves between “modest improvement” and “catastrophic forgetting.” Budget for a real sweep on small models before scaling up. The cost is much lower than discovering at 70B that your kl_coef was off by a factor of three.
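The sweep doesn't need to be clever; a plain grid over the few parameters that actually bite, run on a small proxy model, is enough. Ranges here are illustrative, not recommendations:

```python
from itertools import product

grid = {
    "kl_coef":       [0.01, 0.03, 0.1],
    "group_size_k":  [4, 8, 16],
    "learning_rate": [5e-7, 1e-6, 3e-6],
}

for values in product(*grid.values()):
    cfg = dict(zip(grid, values))
    print(cfg)  # replace with: launch a small-model run using this config
```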
Verifier quality dominates everything. The best optimizer with a noisy verifier is worse than the worst optimizer with a clean one. Most of our wins didn’t come from algorithm changes — they came from improving the reward signal: tightening the math checker, adding execution timeouts, catching format-only cheats.
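The execution-timeout piece, as a toy sketch for a code reward (a real verifier also needs sandboxing and resource limits; exact stdout match is a stand-in for a proper checker):

```python
import subprocess
import sys

def code_execution_reward(program: str, expected_stdout: str,
                          timeout_s: float = 5.0) -> float:
    """Run a candidate program in a subprocess with a hard timeout; reward 1.0
    only on a clean exit and matching output."""
    try:
        result = subprocess.run(
            [sys.executable, "-c", program],
            capture_output=True, text=True, timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        return 0.0
    if result.returncode != 0:
        return 0.0
    return 1.0 if result.stdout.strip() == expected_stdout.strip() else 0.0
```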
TL;DR
If you’re starting today: use a clean verifier, use GRPO, run it async, and don’t blame the algorithm before you’ve audited your reward.