Reinforcement learning from human feedback is powerful but expensive and inconsistent. Constitutional methods replace much of the human labeling with a model that critiques its own responses against an explicit list of principles.

Two stages

First, the model generates a response, critiques it, and revises it using the constitution. Second, a reward model trained on these AI-generated preferences guides reinforcement learning.

draft -> self-critique vs. principles -> revision -> preference data

Open questions

The approach raises governance questions: who writes the constitution, how are conflicts resolved, and how do principles generalize to edge cases the authors never anticipated?