Skip to main content
ModelTerms

Safety & Alignment · intermediate

Red-Teaming

Red-teaming is the practice of deliberately trying to elicit dangerous, biased, or otherwise undesired behavior from an AI system, to surface problems before deployment.

Explanation

Originally a security term (a red team plays attacker against the defenders), adapted for AI. Red teamers — humans, automated systems, or both — generate adversarial prompts spanning bioweapons, CSAM, fraud, manipulation, self-harm, and edge-case ethical dilemmas. The findings drive new safety data and patches before public release.

Frontier labs run red-teaming internally and engage external red teams (and sometimes the public via bug-bounty programs). Outputs feed both safety training and disclosed model cards.

Examples

  • OpenAI's pre-release red team for GPT-4.
  • Anthropic's adversarial robustness work.

Frequently asked

What is Red-Teaming?

Red-teaming is the practice of deliberately trying to elicit dangerous, biased, or otherwise undesired behavior from an AI system, to surface problems before deployment.

What is an example of red-teaming?

OpenAI's pre-release red team for GPT-4.

How is Red-Teaming related to Alignment?

Red-Teaming and Alignment are both safety & alignment concepts. Alignment is the problem of making an AI system pursue what humans actually want rather than the literal letter of its training objective. RLHF and Constitutional AI are alignment techniques.

Is Red-Teaming considered intermediate?

Red-Teaming is generally considered intermediate-level material in the AI and LLM space.

AlignmentSafety & Alignment

Alignment is the problem of making an AI system pursue what humans actually want rather than the literal letter of its training objective. RLHF and Constitutional AI are alignment techniques.

JailbreakSafety & Alignment

A jailbreak is a prompt that bypasses an LLM's safety training, getting it to produce content it would normally refuse. A perennial cat-and-mouse game with model providers.

GuardrailsSafety & Alignment

Guardrails are runtime checks that filter or modify LLM inputs and outputs to enforce policy — blocking PII leaks, detecting prompt injection, enforcing output formats, or moderating content.

Prompt InjectionSafety & Alignment

Prompt injection is an attack where untrusted input contains instructions that override or subvert the developer's system prompt. The current frontier of LLM security.

Side-by-side comparisons

Sources