Skip to main content
ModelTerms

Safety & Alignment · intermediate

Alignment

Alignment is the problem of making an AI system pursue what humans actually want rather than the literal letter of its training objective. RLHF and Constitutional AI are alignment techniques.

Explanation

A model trained only on next-token prediction has no inherent preference for being helpful, honest, or harmless. Alignment is the process of shaping it to actually act in line with human values — and the open research question of how to do that reliably as models get more capable.

In practice today: instruction tuning + RLHF (or DPO) + safety fine-tuning + constitutional AI + system prompts. Alignment research also studies failure modes (sycophancy, reward hacking, deceptive alignment) that are not caught by these techniques.

Examples

  • Tuning a model to refuse to help with bioweapon synthesis.
  • Catching when a reward model is gaming the metric instead of being genuinely helpful.

Frequently asked

What is Alignment?

Alignment is the problem of making an AI system pursue what humans actually want rather than the literal letter of its training objective. RLHF and Constitutional AI are alignment techniques.

What is an example of alignment?

Tuning a model to refuse to help with bioweapon synthesis.

How is Alignment related to Reinforcement Learning from Human Feedback?

Alignment and Reinforcement Learning from Human Feedback are both safety & alignment concepts. RLHF fine-tunes an LLM to maximize a reward model that was itself trained on human preference judgments between candidate responses.

Is Alignment considered intermediate?

Alignment is generally considered intermediate-level material in the AI and LLM space.

Reinforcement Learning from Human FeedbackTraining

RLHF fine-tunes an LLM to maximize a reward model that was itself trained on human preference judgments between candidate responses.

Constitutional AISafety & Alignment

Constitutional AI is Anthropic's alignment technique that uses a written set of principles ("constitution") plus AI feedback to shape model behavior instead of relying entirely on human labels.

Red-TeamingSafety & Alignment

Red-teaming is the practice of deliberately trying to elicit dangerous, biased, or otherwise undesired behavior from an AI system, to surface problems before deployment.

JailbreakSafety & Alignment

A jailbreak is a prompt that bypasses an LLM's safety training, getting it to produce content it would normally refuse. A perennial cat-and-mouse game with model providers.

GuardrailsSafety & Alignment

Guardrails are runtime checks that filter or modify LLM inputs and outputs to enforce policy — blocking PII leaks, detecting prompt injection, enforcing output formats, or moderating content.

Side-by-side comparisons

Sources