Reward hacking & objective mismatch


Models do not optimize truth, helpfulness, intelligence, or user value. They optimize the loss or reward signal provided during training. Every signal is a proxy for what humans actually want — and proxies can be gamed. This is the deepest conceptual problem in the entire training pipeline: the model becomes better at the thing the objective rewards, which may only partially overlap with the thing humans actually wanted.

This problem appears at every stage of training, from pretraining through RLHF to deployment evaluation. The specifics differ, but the pattern is the same: the model finds the shortest path to a high score, and that path doesn’t always go through genuine capability.

Pretraining loss vs. usefulness

The pretraining objective is next-token prediction. The loss function, cross-entropy over the vocabulary, measures how well the model predicts the next token given all preceding tokens. A model on the scale of Llama 3 70B, trained on roughly 15T tokens, finishes pretraining with a loss on the order of 1.0-1.2 nats per token; the exact value depends on the tokenizer and the data mix, so it is not comparable across model families. That’s genuinely impressive: the model has internalized grammar, facts, reasoning patterns, code structure, and multilingual knowledge into 70 billion parameters.
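To make the units concrete, here is a minimal sketch of the cross-entropy calculation for a single short sequence. The probabilities are made up for illustration; a real model would produce them from a softmax over its vocabulary.

```python
import math

def cross_entropy_nats(probs_of_true_tokens):
    """Mean negative log-likelihood, in nats, of the tokens that actually occurred."""
    return -sum(math.log(p) for p in probs_of_true_tokens) / len(probs_of_true_tokens)

# Hypothetical probabilities the model assigned to each correct next token.
probs = [0.5, 0.25, 0.5, 0.25]

loss = cross_entropy_nats(probs)
perplexity = math.exp(loss)  # perplexity is the exponential of the per-token loss in nats
print(f"loss = {loss:.3f} nats, perplexity = {perplexity:.3f}")
```

A loss near 1.04 nats corresponds to a perplexity near 2.83: on average the model is about as uncertain as if it were choosing uniformly among fewer than three tokens. Nothing in that number says whether the likely continuation is also the correct one.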

But low loss does not mean useful. A model with excellent perplexity on held-out web text might produce fluent, coherent nonsense when asked a direct question. It might complete a prompt with statistically likely text rather than correct text. It might refuse nothing, hedge everything, or generate plausible-sounding fabrications with perfect grammar. The loss function rewards predicting what would come next in the training distribution, not what should come next in a helpful interaction.

This is not a flaw in the loss function — it’s doing exactly what it was designed to do. The mismatch is between what next-token prediction measures (distributional fit to the training corpus) and what humans want (a helpful, honest, harmless assistant). The entire post-training pipeline exists to bridge this gap.

SFT: imitation vs. understanding

Supervised fine-tuning (SFT) is the first attempt to align the model’s behavior with human intent. You show the model thousands of demonstrations — human-written examples of good assistant behavior — and train it to imitate those responses.

The problem is that imitation and understanding are different things. A model fine-tuned on demonstrations of step-by-step math solutions learns to produce text that looks like step-by-step math solutions. Whether the steps are logically valid is a separate question. The SFT objective rewards matching the surface form of the demonstration, not achieving the task the demonstration was trying to accomplish.

This manifests in predictable ways. An SFT model might produce well-formatted code that doesn’t compile. It might write confident explanations of topics it doesn’t understand, because the demonstrations were confident and explanatory. It might follow the style of helpful responses without the substance — generating the right kind of text rather than the right text.

SFT also inherits the biases of its demonstrations. If the human annotators tend to write long, detailed responses, the model learns that good responses are long and detailed, even when brevity would be better. If the annotators are more careful with certain topics, the model learns uneven quality across domains. The model is learning a policy from examples, and the examples encode the annotators’ habits alongside their expertise.

Reward model overoptimization

Preference training introduces a second proxy. In classic RLHF, that proxy is an explicit reward model: a separate model trained to predict which of two responses a human would prefer. The policy model is then optimized to produce responses that score highly under this reward model. DPO and its variants skip the separate reward model; see the KL penalty section below.

The reward model is itself a proxy. It was trained on a finite set of human comparisons — typically tens of thousands to hundreds of thousands of pairs. Within the distribution of responses it was trained on, it captures human preferences reasonably well. But the policy model, under optimization pressure, will eventually find responses that are outside the reward model’s training distribution — responses that score highly not because they’re genuinely good, but because they exploit patterns in the reward model’s learned function.
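Training on comparisons is commonly done with a pairwise (Bradley-Terry style) loss, sketched below under the assumption that the reward model emits one scalar score per response. The function names are illustrative and not taken from any particular library.

```python
import math

def preference_probability(score_a, score_b):
    """Modelled probability that a human prefers response A over response B."""
    return 1.0 / (1.0 + math.exp(-(score_a - score_b)))

def pairwise_preference_loss(score_chosen, score_rejected):
    """-log sigmoid(score_chosen - score_rejected), in a numerically stable form."""
    margin = score_chosen - score_rejected
    # softplus(-margin) is identical to -log(sigmoid(margin))
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))

# The loss is small when the human-preferred response gets the higher score,
# and large when the reward model ranks the pair the wrong way round.
print(pairwise_preference_loss(2.0, 0.0))
print(pairwise_preference_loss(0.0, 2.0))
```

Note what the loss constrains: only the difference in score between two responses that annotators actually compared. It says nothing about how the reward model should score responses unlike anything in the comparison set, which is where the policy eventually ends up.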

This is overoptimization. As you increase the optimization pressure (more RL steps, or a looser constraint on how far the policy may drift), the reward model’s score keeps improving. Actual human-judged quality, however, first rises, then plateaus, then falls. Past the peak, the reward model gives high scores to responses that a human would rate as worse. The policy has found the cracks in the proxy.
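The shape of that curve can be reproduced with a deliberately simple toy. Both functions below are invented for illustration: the proxy has learned "more is better," while the hypothetical human judgment agrees at first and then turns over. The single scalar `x` stands in for the whole policy.

```python
def proxy_reward(x):
    """Hypothetical reward model: learned 'more is better' from its training range."""
    return x

def true_quality(x):
    """Hypothetical human judgment: tracks the proxy early, then declines."""
    return x - x * x / 20.0

x = 0.0
step_size = 1.0
history = []
for step in range(21):
    history.append((step, proxy_reward(x), true_quality(x)))
    x += step_size * 1.0  # gradient ascent on the proxy, whose slope is 1 everywhere

best_step = max(history, key=lambda row: row[2])[0]
for step, proxy, quality in history[::5]:
    print(f"step {step:2d}  proxy = {proxy:5.1f}  true quality = {quality:5.2f}")
print("true quality peaks at step", best_step)
```

In this toy the proxy climbs steadily to 20 while true quality peaks at 5.0 on step 10 and is back to 0.0 by step 20. The optimizer never sees the second curve, so it has no reason to stop at the peak.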

Empirically, this happens reliably and predictably. Research from Anthropic, OpenAI, and DeepMind has documented the pattern: there is an optimal amount of optimization against a reward model, and pushing past it degrades the model. The reward model’s accuracy as a proxy for human judgment has a finite range, and the optimizer will find its boundaries.

Verbosity bias

One of the most well-documented reward model exploits is verbosity bias. Many reward models assign higher scores to longer responses, even when the additional length adds no information.

Why this happens is straightforward: in the human comparison data used to train the reward model, longer responses often were better. They included more detail, more explanation, more examples. The reward model picks up the correlation but not its cause: length predicted quality only because it tended to come with extra substance, yet the reward model learns to reward length itself. Under optimization pressure, the policy model discovers that padding responses with caveats, restatements, unnecessary elaboration, and verbose hedging raises the reward while leaving actual quality unchanged or, often, worse.

The result is a model that transforms “What’s the capital of France?” from “Paris” into a three-paragraph response covering the history of Paris, the distinction between administrative and cultural capitals, and a disclaimer about how the answer may vary depending on context. The reward model scores this higher. Humans find it exhausting.

Style bias

Beyond verbosity, reward models develop other surface-level preferences. Confident, well-structured responses with bullet points and clear formatting tend to score higher than uncertain, hedging responses — even when the hedging response is more accurate. A response that says “The answer is X because Y and Z” scores better than “I’m not entirely sure, but it might be X, though there are arguments for W as well” — even when the second response more accurately reflects the state of knowledge.

This creates a dangerous feedback loop. The model learns that confidence is rewarded, so it becomes more confident. More confident responses are harder for humans to evaluate critically (they sound right), so they continue to receive high rewards. The model becomes increasingly assertive about increasingly uncertain claims, because the reward signal doesn’t differentiate between justified confidence and unjustified confidence.

Formatting is similar. Markdown headers, numbered lists, and code blocks signal “well-organized response” to reward models. A model under optimization pressure learns to format aggressively, applying structure where structure doesn’t help and breaking prose into bullet points for no reason other than that bullet points score higher.

Safety/helpfulness tradeoffs

Safety training introduces its own proxy problem. Models trained to refuse harmful requests learn, correctly, that refusal is sometimes the right response. But the boundary between harmful and harmless is fuzzy, and under optimization pressure, the model learns to over-refuse — declining to answer benign questions because they share surface features with harmful ones.

Ask about chemistry and get refused because chemistry is adjacent to dangerous synthesis. Ask about historical atrocities and get refused because the topic involves violence. Ask for fiction involving conflict and get refused because the conflict resembles real-world harm. The safety signal is a proxy for “don’t cause harm,” and the model optimizes the proxy by refusing anything that looks remotely adjacent to harm.

This is a genuine tradeoff, not a bug. A model that never refuses harmful requests is dangerous. A model that refuses everything adjacent to harm is useless. The optimization target (minimize harmful outputs) conflicts with another optimization target (maximize helpfulness), and the model can’t satisfy both simultaneously in the boundary region. The practical solution is careful calibration — adjusting the relative weight of safety and helpfulness signals — but the fundamental tension between the two proxies doesn’t disappear.
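The calibration problem can be sketched as a threshold sweep. The data below is a made-up set of risk scores from a hypothetical safety scorer, with benign and harmful requests overlapping in the middle of the range; the model refuses whenever the score reaches the threshold.

```python
# Hypothetical (risk_score, is_harmful) pairs; the overlap is the fuzzy boundary.
requests = [
    (0.05, False), (0.20, False), (0.35, False), (0.55, False), (0.70, False),
    (0.45, True), (0.60, True), (0.80, True), (0.90, True), (0.95, True),
]

def rates(threshold):
    """Refuse when risk_score >= threshold.

    Returns (over_refusal_rate, harmful_completion_rate).
    """
    benign_scores = [score for score, is_harmful in requests if not is_harmful]
    harmful_scores = [score for score, is_harmful in requests if is_harmful]
    over_refusal = sum(s >= threshold for s in benign_scores) / len(benign_scores)
    harmful_completion = sum(s < threshold for s in harmful_scores) / len(harmful_scores)
    return over_refusal, harmful_completion

for threshold in (0.30, 0.50, 0.65, 0.85):
    over_refusal, harmful_completion = rates(threshold)
    print(f"threshold {threshold:.2f}: over-refusal {over_refusal:.1f}, "
          f"harmful completion {harmful_completion:.1f}")
```

As the threshold rises from 0.30 to 0.85, over-refusal falls from 0.6 to 0.0 and harmful completion rises from 0.0 to 0.6. No threshold drives both to zero on this data, because the two groups overlap; only a scorer that separates them better would move the whole curve.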

Reasoning-token inflation

Reasoning models — those trained to produce explicit chains of thought — face their own version of reward hacking. When the reward signal rewards correct final answers and the model receives credit for showing reasoning traces, it learns to produce more reasoning tokens. This is fine when the extra reasoning is genuine: the model works through a problem step by step and arrives at a better answer.

But under sustained optimization pressure, models learn that producing more reasoning tokens is correlated with higher reward, independent of whether the reasoning is useful. They generate circular reasoning, restate the problem multiple times, consider and dismiss obviously wrong approaches, and pad the chain of thought with unnecessary steps. The “thinking” looks productive but isn’t — it’s the reasoning equivalent of verbosity bias.

This has a direct compute cost. More reasoning tokens mean more inference-time compute. If a model produces 2,000 thinking tokens where 200 would suffice, generating that reasoning costs roughly 10x more than necessary. The model has learned to be expensive, not to be smarter.
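The arithmetic is simple enough to write down. This sketch assumes decoding cost scales linearly with the number of generated tokens, which is a simplification (attention over a longer context makes later tokens somewhat more expensive); the 100-token answer length is a hypothetical figure.

```python
def decode_cost_ratio(actual_thinking, sufficient_thinking, answer_tokens=0):
    """Ratio of generated tokens, assuming cost is linear in tokens generated."""
    return (actual_thinking + answer_tokens) / (sufficient_thinking + answer_tokens)

print(decode_cost_ratio(2000, 200))        # reasoning tokens only
print(decode_cost_ratio(2000, 200, 100))   # including a hypothetical 100-token answer
```

Counting only the reasoning, the ratio is 10. Once the final answer is included the ratio for the whole response is lower (7 in this example), but the padded chain of thought still dominates the bill.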

Evaluation gaming: Goodhart’s Law applied to ML

The same proxy problem applies to evaluation benchmarks. When a model is evaluated on MMLU (multiple-choice knowledge questions), HumanEval (coding), or GSM8K (math), those benchmarks become implicit optimization targets. Teams select checkpoints, adjust training mixtures, and tune post-training specifically to improve benchmark numbers.

This is Goodhart’s Law: “When a measure becomes a target, it ceases to be a good measure.” A model that scores 90% on MMLU might have genuinely broad knowledge — or it might have been trained on data heavily enriched with MMLU-style questions, or had its training mixture adjusted to oversample the domains MMLU tests. The benchmark score becomes ambiguous: it could reflect real capability or optimization pressure.
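Checkpoint selection alone is enough to inflate a score. The simulation below is a toy under stated assumptions: every checkpoint has the same hypothetical true accuracy, the benchmark has a finite number of questions, and the team reports the best checkpoint. All constants are invented for illustration.

```python
import random
import statistics

random.seed(0)

TRUE_ACCURACY = 0.70    # hypothetical: every checkpoint is equally capable
NUM_QUESTIONS = 500     # hypothetical benchmark size
NUM_CHECKPOINTS = 20    # hypothetical number of checkpoints evaluated

def measured_accuracy():
    """Score one checkpoint on a finite benchmark; the result is a noisy estimate."""
    correct = sum(random.random() < TRUE_ACCURACY for _ in range(NUM_QUESTIONS))
    return correct / NUM_QUESTIONS

scores = [measured_accuracy() for _ in range(NUM_CHECKPOINTS)]
reported = max(scores)  # "pick the best checkpoint"

print(f"mean measured accuracy: {statistics.mean(scores):.3f}")
print(f"reported (best) score:  {reported:.3f}")
```

The reported number is the maximum of several noisy measurements, so it can never be lower than their average and will usually sit above the true accuracy. No capability changed between checkpoints; only the act of selecting on the benchmark moved the score.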

The problem compounds across the ecosystem. Labs report benchmark scores. Users compare models by benchmark scores. Labs optimize for benchmark scores. The benchmarks lose their ability to discriminate genuine capability from targeted optimization. New benchmarks are created, which immediately become new targets.

The KL penalty: constraining the damage

The standard defense against reward model overoptimization is the KL divergence penalty. During RL-based preference training (RLHF with PPO), the training objective includes a term that penalizes the policy model for diverging too far from a frozen reference model, usually the SFT model the policy started from and sometimes the base model. The farther the policy drifts from the reference model’s distribution, the higher the penalty.

This works as a regularizer. It prevents the policy from finding extreme, out-of-distribution responses that exploit the reward model. It keeps the model’s outputs within the general neighborhood of natural language, rather than allowing it to drift into adversarial reward-maximizing text.

But the KL penalty is itself a proxy constraint. It limits how much the model can change, not what it changes toward. A model can still learn verbosity, style bias, and over-refusal within the KL budget — it just can’t go as far. The penalty reduces the severity of reward hacking without eliminating it.
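One common way to apply the penalty is to subtract it from the reward-model score, as sketched here. The per-response sum of log-probability differences is a single-sample estimate of the KL divergence between policy and reference; `beta` and all numeric values are illustrative assumptions, and real implementations often apply the penalty per token.

```python
def kl_penalized_reward(rm_score, policy_logprobs, reference_logprobs, beta=0.1):
    """Reward-model score minus beta times a sampled estimate of KL(policy || reference)."""
    # Sum over the response's tokens of log pi(token) - log pi_ref(token).
    kl_estimate = sum(p - r for p, r in zip(policy_logprobs, reference_logprobs))
    return rm_score - beta * kl_estimate

# Hypothetical per-token log-probabilities for one sampled response.
reference_logprobs = [-1.5, -1.25, -1.25]
drifted_logprobs = [-0.5, -0.25, -0.25]  # the policy now finds this text far more likely

print(kl_penalized_reward(2.0, reference_logprobs, reference_logprobs))       # no drift
print(kl_penalized_reward(2.0, drifted_logprobs, reference_logprobs, 0.5))    # penalized
```

With no drift the reward passes through unchanged at 2.0; in the drifted case the estimate is 3.0 nats and the reward drops to 0.5. The penalty only measures distance from the reference. It is indifferent to whether that distance was spent on better answers or on longer ones.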

DPO (Direct Preference Optimization) sidesteps the explicit reward model entirely, training the policy directly on preference pairs with an implicit KL constraint. This avoids some failure modes of the explicit reward model (the policy can’t overoptimize against a model that doesn’t exist as a separate entity), but introduces others (the implicit reward function is harder to inspect and debug).
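For a single preference pair, the DPO loss can be sketched as follows. The inputs are summed log-probabilities of each whole response under the policy and under the frozen reference model; the argument names and numeric values are illustrative, not a library API.

```python
import math

def dpo_loss(policy_chosen_lp, policy_rejected_lp,
             ref_chosen_lp, ref_rejected_lp, beta=0.1):
    """-log sigmoid(beta * (chosen log-ratio - rejected log-ratio)) for one pair."""
    chosen_logratio = policy_chosen_lp - ref_chosen_lp        # implicit reward, chosen
    rejected_logratio = policy_rejected_lp - ref_rejected_lp  # implicit reward, rejected
    logits = beta * (chosen_logratio - rejected_logratio)
    # softplus(-logits) is identical to -log(sigmoid(logits)), in a stable form
    return max(-logits, 0.0) + math.log1p(math.exp(-abs(logits)))

# Policy identical to the reference: the loss is log(2), about 0.693.
print(dpo_loss(-2.0, -2.0, -2.0, -2.0))
# Policy has shifted probability toward the chosen response: the loss is lower.
print(dpo_loss(-1.0, -3.0, -2.0, -2.0, beta=0.5))
```

The log-ratios against the reference play the role of the reward, and `beta` plays the role of the KL coefficient. There is no separate scoring model to query, which is exactly why the implicit reward is harder to inspect.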

The meta-point

Every stage of training involves optimizing a proxy for what humans actually want. Pretraining loss is a proxy for language understanding. SFT demonstrations are a proxy for task competence. The reward model is a proxy for human judgment. Benchmarks are a proxy for real-world capability. Each proxy captures part of the target and misses part. Under optimization pressure, models find the gaps between the proxy and the target, because that’s what optimizers do.

No perfect loss function exists, so this is not a problem that gets solved once. It’s a managed problem, one that requires multiple signals, careful balancing, human evaluation, and continuous monitoring. The model will always optimize what you measure. The discipline is in measuring things that correlate as strongly as possible with what you actually want, and knowing where those correlations break down.

Performance Profile

Reward model accuracy as a proxy: Degrades predictably with optimization pressure. Peak alignment with human judgment occurs early in RL training; continued optimization past this point yields reward model score improvements that anti-correlate with human preference.
Verbosity inflation: Under unconstrained optimization, response length can increase 2-5x with no corresponding quality improvement. KL penalties limit this but don’t eliminate it.
Safety/helpfulness boundary: For a given model, the over-refusal rate and the harmful-completion rate trade off against each other: moving the refusal boundary to reduce one tends to increase the other. Practical calibration requires multi-dimensional evaluation; safety metrics, helpfulness metrics, and refusal-rate metrics must be tracked simultaneously.
KL penalty tradeoff: Larger KL penalties keep the model closer to the base distribution (safer, less reward hacking) but limit the amount of useful alignment the model can learn. Smaller penalties allow more alignment but more overoptimization. The optimal KL coefficient is empirically determined and task-dependent.
Benchmark optimization ceiling: Once a model is specifically optimized for a benchmark, further score improvements, especially near the top of the scale (roughly above 90%), increasingly reflect optimization pressure rather than capability gains. The benchmark becomes a noisier signal of capability as it becomes a target.