AI Safety · Deep Dive · 01/06/2026 · 9 min read

Constitutional Alignment and Reward Models

Training assistants to critique and revise their own outputs against a written set of principles, reducing reliance on human preference labels.

Bai et al. · Anthropic · 2022

Mara Chen

Editor, ML researcher

Reinforcement learning from human feedback (RLHF) is powerful, but human preference labels are expensive to collect and can be inconsistent. Constitutional methods replace much of that human labeling with a model that critiques its own responses against an explicit, written list of principles.

Two stages

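First, in a supervised stage, the model generates a response, critiques it against the constitution, and revises it; the model is then fine-tuned on the revised responses. Second, in a reinforcement learning stage, the model compares pairs of responses against the constitution, and a reward model trained on these AI-generated preferences guides reinforcement learning.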
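The critique-and-revise step can be sketched in a few lines of Python. This is a minimal illustration, not the paper's implementation: the `generate` callable, the two example principles, the prompt wording, and the `echo_model` stand-in are all assumptions made here so the loop runs without a real model.

```python
import random
from typing import Callable, List

# Hypothetical principles for illustration; a real constitution is longer
# and far more carefully worded.
CONSTITUTION: List[str] = [
    "Choose the response that is least likely to cause harm.",
    "Choose the response that is most honest about uncertainty.",
]


def critique_and_revise(
    prompt: str,
    generate: Callable[[str], str],
    principles: List[str] = CONSTITUTION,
    rounds: int = 2,
    seed: int = 0,
) -> str:
    """Draft a response, then repeatedly critique and revise it.

    `generate` is an assumed text-in, text-out wrapper around a language model.
    """
    rng = random.Random(seed)
    response = generate(prompt)  # initial draft
    for _ in range(rounds):
        principle = rng.choice(principles)  # sample one principle per round
        critique = generate(
            f"Prompt: {prompt}\nResponse: {response}\n"
            f"Critique the response using this principle: {principle}"
        )
        response = generate(
            f"Prompt: {prompt}\nResponse: {response}\nCritique: {critique}\n"
            "Rewrite the response to address the critique."
        )
    return response


def echo_model(text: str) -> str:
    """Stand-in model so the sketch runs without any ML dependency."""
    if "Rewrite the response" in text:
        return "revised answer"
    if "Critique the response" in text:
        return "critique text"
    return "draft answer"


final = critique_and_revise("How do I pick a strong password?", echo_model)
```

With a real model behind `generate`, the returned revisions would be collected as fine-tuning targets; the stand-in only shows the control flow.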

draft -> self-critique vs. principles -> revision -> fine-tuning -> pairwise comparison vs. principles -> preference data
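Once preference data exists, a reward model has to be fit to it. The article does not specify the loss, so the sketch below assumes a pairwise Bradley-Terry objective, one common choice for preference modeling. The function name and the plain-float scores are illustrative; in practice the scores would come from a neural reward model and the loss would be minimized by gradient descent.

```python
import math
from typing import List, Tuple


def pairwise_preference_loss(pairs: List[Tuple[float, float]]) -> float:
    """Mean Bradley-Terry loss over (chosen_score, rejected_score) pairs.

    Each term is -log(sigmoid(chosen - rejected)), computed as
    softplus(rejected - chosen) for numerical stability.
    """

    def softplus(x: float) -> float:
        # log(1 + exp(x)) written so that exp never overflows
        return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

    if not pairs:
        raise ValueError("need at least one preference pair")
    return sum(softplus(rejected - chosen) for chosen, rejected in pairs) / len(pairs)


# The loss is small when the preferred response scores higher, large otherwise.
agrees = pairwise_preference_loss([(2.0, 0.0)])
disagrees = pairwise_preference_loss([(0.0, 2.0)])
```

The loss depends only on the score difference within each pair, so the reward model learns a relative ranking rather than an absolute scale.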

Open questions

The approach raises governance questions: who writes the constitution, how are conflicts resolved, and how do principles generalize to edge cases the authors never anticipated?

Citation

Bai, Y. et al. (2022). Constitutional AI: Harmlessness from AI Feedback. arXiv:2212.08073.

