Skip to main content
ModelTerms

Safety & Alignment · intermediate

Guardrails

Guardrails are runtime checks that filter or modify LLM inputs and outputs to enforce policy — blocking PII leaks, detecting prompt injection, enforcing output formats, or moderating content.

Explanation

Guardrails sit outside the model. The model might be willing to do something you do not want; the guardrail catches it before the user sees it (output filter) or before it reaches the model (input filter).

Common patterns: regex/PII scrubbing, classifier-based moderation (OpenAI Moderation API, Llama Guard), schema validation, and policy classifiers. Guardrails are complementary to safety training, not a replacement.

Examples

  • Llama Guard checking every model response for unsafe categories.
  • JSON schema validation rejecting malformed outputs and re-prompting.

Frequently asked

What is Guardrails?

Guardrails are runtime checks that filter or modify LLM inputs and outputs to enforce policy — blocking PII leaks, detecting prompt injection, enforcing output formats, or moderating content.

What is an example of guardrails?

Llama Guard checking every model response for unsafe categories.

How is Guardrails related to Alignment?

Guardrails and Alignment are both safety & alignment concepts. Alignment is the problem of making an AI system pursue what humans actually want rather than the literal letter of its training objective. RLHF and Constitutional AI are alignment techniques.

Is Guardrails considered intermediate?

Guardrails is generally considered intermediate-level material in the AI and LLM space.

AlignmentSafety & Alignment

Alignment is the problem of making an AI system pursue what humans actually want rather than the literal letter of its training objective. RLHF and Constitutional AI are alignment techniques.

Prompt InjectionSafety & Alignment

Prompt injection is an attack where untrusted input contains instructions that override or subvert the developer's system prompt. The current frontier of LLM security.

JailbreakSafety & Alignment

A jailbreak is a prompt that bypasses an LLM's safety training, getting it to produce content it would normally refuse. A perennial cat-and-mouse game with model providers.

System PromptPrompting

The system prompt is the first message in a chat that sets the model's persona, rules, and overall behavior. It is treated by most providers as higher-trust than user input.

Side-by-side comparisons

Sources