Curious Now


Alignment Makes Language Models Normative, Not Descriptive


Key takeaway

Language models are now optimized to match human preferences rather than to describe human behavior; this shift could change how they are used for tasks like recommendation and decision-making.


Quick Explainer

This work examines how aligning large language models (LLMs) to human preferences affects their ability to predict strategic behavior in multi-round games. The key finding: while alignment improves LLMs' performance on one-shot decisions, it degrades their ability to forecast complex, history-dependent human behavior in strategic interactions. The authors attribute this to alignment narrowing the behavioral distribution encoded in the model, concentrating it on normative, socially approved patterns rather than the full range of human strategic dynamics. In settings where normative predictions suffice, this bias helps; in multi-round games, it misses key descriptive factors such as reciprocity and retaliation that shape human choices.

Deep Dive

Technical Deep Dive: Alignment Makes Language Models Normative, Not Descriptive

Overview

This work examines how post-training alignment of large language models (LLMs) affects their ability to predict human behavior in strategic, multi-round games. The key finding is that while alignment improves LLMs' performance on one-shot decisions and non-strategic tasks, it degrades their ability to predict complex, history-dependent human behavior in multi-round strategic games. The authors attribute this to alignment's tendency to narrow the behavioral distribution encoded in the model, collapsing diversity toward normative, socially approved patterns rather than the full range of human strategic dynamics.

Methodology

  • The authors evaluated 120 same-provider base-aligned model pairs on over 10,000 human decisions across four families of multi-round strategic games: bargaining, persuasion, negotiation, and repeated matrix games.
  • They used a token probability extraction task to predict human choices, normalizing the model's internal logits to obtain a predicted probability of the "accept" action.
  • Pairs were filtered to exclude models that did not concentrate sufficient probability mass on the decision tokens.
  • The authors also tested boundary conditions with one-shot matrix games and non-strategic binary lotteries.
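The token-probability extraction step above can be sketched as follows. This is an illustrative reconstruction, not the authors' code: the function name, the "accept"/"reject" token labels, and the logit values are assumptions for demonstration. The core idea is a softmax restricted to the decision tokens only, so their probabilities sum to one regardless of how much mass the model places elsewhere in the vocabulary.

```python
import math

def predict_accept_probability(decision_logits):
    """Softmax over the decision tokens only, returning P("accept").

    decision_logits: mapping from decision token (e.g. "accept", "reject")
    to the model's raw logit for that token.
    """
    max_logit = max(decision_logits.values())  # subtract max for numerical stability
    exps = {tok: math.exp(l - max_logit) for tok, l in decision_logits.items()}
    total = sum(exps.values())
    return exps["accept"] / total

# Hypothetical logits (not values from the paper):
p = predict_accept_probability({"accept": 2.1, "reject": 0.4})
```

In practice, the logits would come from the model's final-layer output at the decision position; the paper's filtering step would first check that these tokens carry enough of the total probability mass for the restricted softmax to be meaningful.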

Results

  • In the multi-round strategic games, base models outperformed their aligned counterparts by a ratio of 9.7:1 in terms of Pearson correlation with human decisions ($p < 10^{-40}$).
  • This base-model advantage held across all 23 model families, 10 prompt formulations, and the game-configuration parameters tested.
  • However, the advantage reversed in the one-shot matrix games (aligned models won 4.1:1, $p < 10^{-6}$) and binary lotteries (aligned models won 2.2:1, $p < 10^{-3}$).
  • Within the multi-round games, aligned models had an advantage at round 1 before interaction history developed, but this advantage disappeared as history accumulated.
  • The authors found that aligned models' predictions were systematically more correlated with game-theoretic equilibrium solutions, suggesting alignment induces a "normative bias" toward socially approved patterns.
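The headline comparison above reduces to computing Pearson correlation between each model's predicted accept probabilities and the observed human decisions, then counting which member of each base-aligned pair scores higher. A minimal sketch, with entirely hypothetical prediction and choice values (the paper's actual data and win-rate machinery are not reproduced here):

```python
import math
import statistics

def pearson_r(xs, ys):
    """Pearson correlation between predicted accept probabilities (xs)
    and observed human decisions (ys; 1 = accept, 0 = reject)."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical predictions vs. human choices:
base_preds    = [0.8, 0.3, 0.6, 0.2, 0.9]  # tracks the history-dependent swings
aligned_preds = [0.7, 0.7, 0.6, 0.6, 0.7]  # nearly flat, close to one "normative" answer
human_choices = [1, 0, 1, 0, 1]

# A "base wins" pair is one where the base model's correlation with the
# human choices exceeds the aligned model's.
base_wins = pearson_r(base_preds, human_choices) > pearson_r(aligned_preds, human_choices)
```

The toy values are chosen to illustrate the reported pattern: the flat, mode-seeking predictions correlate poorly with choices that swing with interaction history.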

Interpretation

The authors argue that the distributional narrowing documented in prior work on the alignment tax is the key mechanism behind these results. Alignment optimizes models to match annotator preferences, collapsing diversity toward dominant modes and suppressing the "tails" where complex, history-dependent strategic behavior resides.

In settings where normative predictions suffice (one-shot games, lotteries), this normative bias helps. But in multi-round strategic interactions, it degrades predictive accuracy by failing to capture reciprocity, retaliation, and other descriptive dynamics that shape human behavior.
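One way to make the narrowing argument concrete is to compare the entropy of the action distributions a model encodes. The sketch below uses made-up distributions over four candidate strategic moves; nothing here comes from the paper's measurements, it only illustrates what "collapsing toward the dominant mode" looks like numerically.

```python
import math

def entropy(probs):
    """Shannon entropy (in nats) of a distribution over candidate actions."""
    return -sum(p * math.log(p) for p in probs if p > 0)

# Hypothetical action distributions over the same four strategic moves:
base_dist    = [0.40, 0.30, 0.20, 0.10]  # broad: keeps the behavioral "tails"
aligned_dist = [0.90, 0.06, 0.03, 0.01]  # narrowed: mass collapsed onto one mode

# Narrowing shows up as lower entropy: the aligned distribution assigns
# almost no mass to minority strategies such as retaliation, so it cannot
# predict the humans who play them.
```

A one-shot prediction only needs the mode, which both distributions agree on; a multi-round prediction needs the minority strategies too, which is exactly the mass the narrowed distribution has lost.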

Limitations

  • The GLEE multi-round game data comes from human-LLM interactions, not human-human.
  • The analysis is restricted to binary/ternary decisions; continuous action spaces remain untested.
  • All pairs are open-weight; closed-source models could not be evaluated.
  • The one-shot boundary condition uses a separate dataset, though within-game round 1 provides convergent evidence.
  • Alternative explanations beyond distributional narrowing cannot be fully ruled out.

What Comes Next

  • Isolating the specific multi-round dynamics (opponent modeling, history integration, novelty) that drive the base advantage.
  • Extending to continuous negotiations, auctions, or coalition formation tasks.
  • Developing alignment methods that preserve behavioral diversity while adding helpfulness.
  • Testing whether the normative shift persists at extreme model scales.
