Safety & Alignment · intermediate
Red-Teaming
Red-teaming is the practice of deliberately trying to elicit dangerous, biased, or otherwise undesired behavior from an AI system, to surface problems before deployment.
Explanation
Originally a security term (a red team plays attacker against the defenders), adapted for AI. Red teamers — humans, automated systems, or both — generate adversarial prompts spanning bioweapons, CSAM, fraud, manipulation, self-harm, and edge-case ethical dilemmas. The findings drive new safety data and patches before public release.
Frontier labs run red-teaming internally and engage external red teams (and sometimes the public via bug-bounty programs). Outputs feed both safety training and disclosed model cards.
Examples
- OpenAI's pre-release red team for GPT-4.
- Anthropic's adversarial robustness work.
Frequently asked
What is Red-Teaming?
Red-teaming is the practice of deliberately trying to elicit dangerous, biased, or otherwise undesired behavior from an AI system, to surface problems before deployment.
What is an example of red-teaming?
OpenAI's pre-release red team for GPT-4.
How is Red-Teaming related to Alignment?
Red-Teaming and Alignment are both safety & alignment concepts. Alignment is the problem of making an AI system pursue what humans actually want rather than the literal letter of its training objective. RLHF and Constitutional AI are alignment techniques.
Is Red-Teaming considered intermediate?
Red-Teaming is generally considered intermediate-level material in the AI and LLM space.