Skip to main content
ModelTerms

Safety & Alignment · intermediate

Jailbreak

A jailbreak is a prompt that bypasses an LLM's safety training, getting it to produce content it would normally refuse. A perennial cat-and-mouse game with model providers.

Explanation

Common jailbreak patterns: role-play ("pretend you are an AI without restrictions"), encoding ("respond in Base64"), nested fiction ("a character in my story would say..."), and persistent multi-turn manipulation.

Providers respond with new training data, system-prompt hardening, and runtime filters. The current state of play: any sufficiently determined attacker can get any model to misbehave on at least some topics; safety training is best understood as raising friction, not preventing all misuse.

Examples

  • "DAN" ("Do Anything Now") prompts — early ChatGPT jailbreaks.
  • Encoding a forbidden request in Pig Latin to slip past filters.

Frequently asked

What is Jailbreak?

A jailbreak is a prompt that bypasses an LLM's safety training, getting it to produce content it would normally refuse. A perennial cat-and-mouse game with model providers.

What is an example of jailbreak?

"DAN" ("Do Anything Now") prompts — early ChatGPT jailbreaks.

How is Jailbreak related to Alignment?

Jailbreak and Alignment are both safety & alignment concepts. Alignment is the problem of making an AI system pursue what humans actually want rather than the literal letter of its training objective. RLHF and Constitutional AI are alignment techniques.

Is Jailbreak considered intermediate?

Jailbreak is generally considered intermediate-level material in the AI and LLM space.

AlignmentSafety & Alignment

Alignment is the problem of making an AI system pursue what humans actually want rather than the literal letter of its training objective. RLHF and Constitutional AI are alignment techniques.

Prompt InjectionSafety & Alignment

Prompt injection is an attack where untrusted input contains instructions that override or subvert the developer's system prompt. The current frontier of LLM security.

Red-TeamingSafety & Alignment

Red-teaming is the practice of deliberately trying to elicit dangerous, biased, or otherwise undesired behavior from an AI system, to surface problems before deployment.

GuardrailsSafety & Alignment

Guardrails are runtime checks that filter or modify LLM inputs and outputs to enforce policy — blocking PII leaks, detecting prompt injection, enforcing output formats, or moderating content.

System PromptPrompting

The system prompt is the first message in a chat that sets the model's persona, rules, and overall behavior. It is treated by most providers as higher-trust than user input.

Side-by-side comparisons

Sources