Skip to main content
ModelTerms

Category

Safety & Alignment

Aligning model behavior with what humans actually want.

Alignment

Alignment is the problem of making an AI system pursue what humans actually want rather than the literal letter of its training objective. RLHF and Constitutional AI are alignment techniques.

intermediate
Constitutional AI

Constitutional AI is Anthropic's alignment technique that uses a written set of principles ("constitution") plus AI feedback to shape model behavior instead of relying entirely on human labels.

advanced
Guardrails

Guardrails are runtime checks that filter or modify LLM inputs and outputs to enforce policy — blocking PII leaks, detecting prompt injection, enforcing output formats, or moderating content.

intermediate
Jailbreak

A jailbreak is a prompt that bypasses an LLM's safety training, getting it to produce content it would normally refuse. A perennial cat-and-mouse game with model providers.

intermediate
Prompt Injection

Prompt injection is an attack where untrusted input contains instructions that override or subvert the developer's system prompt. The current frontier of LLM security.

intermediate
Red-Teaming

Red-teaming is the practice of deliberately trying to elicit dangerous, biased, or otherwise undesired behavior from an AI system, to surface problems before deployment.

intermediate