Safety & Alignment · intermediate
Guardrails
Guardrails are runtime checks that filter or modify LLM inputs and outputs to enforce policy — blocking PII leaks, detecting prompt injection, enforcing output formats, or moderating content.
Explanation
Guardrails sit outside the model. The model might be willing to do something you do not want, so the checks run around it: an input filter screens a request before it reaches the model, and an output filter screens a response before the user sees it.
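The input-filter and output-filter placement can be sketched as a thin wrapper around the model call. This is a minimal illustration, not a reference implementation: `guarded_call`, the two toy checks, and the refusal message are all hypothetical names chosen for the example, and `model` stands in for whatever function actually calls your LLM.

```python
from typing import Callable, List, Optional

# A check returns None when the text passes, or a short reason string when it is blocked.
Check = Callable[[str], Optional[str]]

REFUSAL = "Sorry, I can't help with that."  # hypothetical refusal message


def guarded_call(
    prompt: str,
    model: Callable[[str], str],
    input_checks: List[Check],
    output_checks: List[Check],
) -> str:
    # Input filter: runs before the prompt reaches the model.
    for check in input_checks:
        if check(prompt) is not None:
            return REFUSAL
    response = model(prompt)
    # Output filter: runs before the user sees the response.
    for check in output_checks:
        if check(response) is not None:
            return REFUSAL
    return response


# Toy checks for illustration only; real systems would use far more robust detection.
def no_override_phrase(text: str) -> Optional[str]:
    if "ignore previous instructions" in text.lower():
        return "possible prompt injection"
    return None


def no_secret_marker(text: str) -> Optional[str]:
    return "secret leaked" if "SECRET" in text else None
```

Because the checks are plain functions, the same wrapper can host regex scrubbers, classifier calls, or schema validators without changing the model call itself.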
Common patterns: regex-based PII scrubbing, classifier-based moderation (OpenAI Moderation API, Llama Guard), schema validation, and policy classifiers. Guardrails complement safety training; they do not replace it.
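As one concrete instance of the regex-based pattern, the sketch below redacts email addresses and US-style phone numbers from a piece of text. The two patterns and the `scrub_pii` name are assumptions made for illustration; they are deliberately narrow and would miss many real-world formats.

```python
import re

# Illustrative patterns only; production PII detection needs much broader coverage.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE_RE = re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b")


def scrub_pii(text: str) -> str:
    """Replace matched emails and phone numbers with placeholder tokens."""
    text = EMAIL_RE.sub("[EMAIL]", text)
    text = PHONE_RE.sub("[PHONE]", text)
    return text
```

A scrubber like this modifies the text rather than blocking it, so it can run on inputs, outputs, or both.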
Examples
- Llama Guard checking every model response for unsafe categories.
- JSON schema validation rejecting malformed outputs and re-prompting.
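The validate-and-re-prompt example can be sketched with the standard library alone. Everything here is hypothetical: the `REQUIRED_FIELDS` schema, the `validate` and `generate_validated` names, the retry wording, and the `model` callable, which stands in for a real LLM call. A full JSON Schema validator would replace the hand-written checks in practice.

```python
import json
from typing import Callable, Dict, Optional

# Hypothetical schema: required key -> expected Python type.
# Note: bool is a subclass of int in Python, so True would pass an int check.
REQUIRED_FIELDS: Dict[str, type] = {"title": str, "score": int}


def validate(raw: str) -> Optional[str]:
    """Return None if raw satisfies the schema, otherwise an error message."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        return f"not valid JSON: {exc.msg}"
    if not isinstance(data, dict):
        return "top-level value must be an object"
    for key, expected in REQUIRED_FIELDS.items():
        if key not in data:
            return f"missing field '{key}'"
        if not isinstance(data[key], expected):
            return f"field '{key}' must be {expected.__name__}"
    return None


def generate_validated(prompt: str, model: Callable[[str], str], max_attempts: int = 3) -> dict:
    """Call the model, rejecting malformed output and re-prompting with the error."""
    current = prompt
    for _ in range(max_attempts):
        raw = model(current)
        error = validate(raw)
        if error is None:
            return json.loads(raw)
        # Feed the rejection reason back so the next attempt can correct it.
        current = f"{prompt}\n\nYour previous reply was rejected ({error}). Reply with JSON only."
    raise ValueError(f"no valid output after {max_attempts} attempts")
```

Capping the number of attempts keeps a persistently malformed model from looping forever; the caller decides what to do when the `ValueError` is raised.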
Frequently asked
What are guardrails?
Guardrails are runtime checks that filter or modify LLM inputs and outputs to enforce policy — blocking PII leaks, detecting prompt injection, enforcing output formats, or moderating content.
What is an example of guardrails?
Llama Guard checking every model response for unsafe categories.
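How are guardrails related to alignment?
Guardrails and alignment are both safety & alignment concepts, but they act at different stages. Alignment is the problem of making an AI system pursue what humans actually want rather than the literal letter of its training objective; RLHF and Constitutional AI are alignment techniques applied during training. Guardrails are runtime checks that sit outside the model and complement that training rather than replace it.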
Are guardrails considered intermediate?
Guardrails are generally considered intermediate-level material in the AI and LLM space.