Safety / Implementation

Guardrails

Foundational [2/5]
Safety guardrails · LLM guardrails · Content filters

Definition

Guardrails are programmatic constraints and checks placed around LLM systems to keep outputs within acceptable boundaries. They filter inputs and outputs, validate responses, and prevent misuse, acting as safety barriers around the core model.

Guardrails provide defense-in-depth, catching issues that model training alone may not prevent.

Key Concepts

  • Input guardrails: Filter/validate user prompts before processing
  • Output guardrails: Validate/filter model responses before returning
  • Topic rails: Keep conversation on allowed topics
  • Format rails: Ensure outputs match expected structure (see the sketch after this list)
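
A format rail can be as small as a schema check plus a retry. Below is a minimal sketch assuming pydantic v2; the Answer schema, the hypothetical llm_call function, and the retry budget are illustrative assumptions, not a prescribed API:

from pydantic import BaseModel, ValidationError

class Answer(BaseModel):
    summary: str
    confidence: float

def format_rail(llm_call, prompt, max_retries=2):
    # Re-prompt until the model returns JSON matching the expected schema
    for _ in range(max_retries + 1):
        raw = llm_call(prompt)
        try:
            return Answer.model_validate_json(raw)
        except ValidationError:
            prompt += '\nRespond ONLY with JSON: {"summary": "...", "confidence": 0.0}'
    raise ValueError("No schema-conformant output after retries")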

Examples

Architecture
Guardrail System Design
GUARDRAIL ARCHITECTURE:

┌─────────────────────────────────────────────┐
│                 User Input                  │
└─────────────────────┬───────────────────────┘
                      ↓
┌─────────────────────────────────────────────┐
│              INPUT GUARDRAILS               │
│   ┌─────────────────────────────────────┐   │
│   │ • PII Detection                     │   │
│   │ • Toxic Content Filter              │   │
│   │ • Jailbreak Detection               │   │
│   │ • Topic Classification              │   │
│   │ • Input Length Limits               │   │
│   └─────────────────────────────────────┘   │
│       ↓ PASS              ↓ BLOCK           │
└───────┬───────────────────┬─────────────────┘
        ↓                   ↓
┌─────────────────┐  ┌──────────────────┐
│       LLM       │  │  Error Response  │
└────────┬────────┘  └──────────────────┘
         ↓
┌─────────────────────────────────────────────┐
│              OUTPUT GUARDRAILS              │
│   ┌─────────────────────────────────────┐   │
│   │ • Fact Verification                 │   │
│   │ • Harmful Content Check             │   │
│   │ • Format Validation                 │   │
│   │ • Competitor Mention Check          │   │
│   │ • Confidence Threshold              │   │
│   └─────────────────────────────────────┘   │
│       ↓ PASS              ↓ BLOCK           │
└───────┬───────────────────┬─────────────────┘
        ↓                   ↓
┌─────────────────┐  ┌──────────────────┐
│  Final Output   │  │  Fallback/Retry  │
└─────────────────┘  └──────────────────┘
Implementation
Guardrails in Code
NEMO GUARDRAILS (NVIDIA):

# config/rails.co -- rails are defined in Colang and live in the config
# directory, separate from the Python application code
define user ask about competitors
  "What do you think of [competitor]?"
  "Is [competitor] better?"

define bot refuse competitor discussion
  "I'm focused on helping with our products. I can't comment on competitors."

define flow
  user ask about competitors
  bot refuse competitor discussion

# app.py -- load the config and wrap generation with the rails
from nemoguardrails import LLMRails, RailsConfig

config = RailsConfig.from_path("./config")
rails = LLMRails(config)

response = await rails.generate(
    messages=[{"role": "user", "content": user_input}]
)

GUARDRAILS AI LIBRARY:

from guardrails import Guard
from guardrails.hub import ToxicLanguage, DetectPII  # hub validators must be installed separately

guard = Guard().use_many(
    ToxicLanguage(on_fail="exception"),  # raise if toxic language is detected
    DetectPII(pii_entities=["EMAIL_ADDRESS", "PHONE_NUMBER"], on_fail="fix"),  # redact PII
)

validated_output = guard.validate(llm_output)

CUSTOM IMPLEMENTATION:

# Input checks run before the LLM call, output checks after it.
# detect_jailbreak, contains_pii, redact_pii, is_off_topic, etc. are
# placeholders for whatever classifiers or heuristics your stack provides.

def apply_input_guardrails(user_input):
    if detect_jailbreak(user_input):
        return None, "I can't help with that request."
    if contains_pii(user_input):
        user_input = redact_pii(user_input)
    return user_input, None

def apply_output_guardrails(llm_output):
    if is_off_topic(llm_output, allowed_topics):
        return retry_with_topic_constraint()
    if contains_harmful_content(llm_output):
        return "I apologize, let me try again."
    if not matches_format(llm_output, expected_schema):
        return retry_with_format_instruction()
    return llm_output
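
The is_off_topic helper above can be built many ways; one common approach is embedding similarity against the allowed topics. A rough sketch assuming sentence-transformers is available; the model name, topic list, and 0.3 threshold are tunable assumptions:

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

def is_off_topic(text, allowed_topics, threshold=0.3):
    # Off-topic if the text's best cosine similarity to any allowed topic is too low
    text_emb = model.encode(text)
    topic_embs = model.encode(allowed_topics)
    return float(util.cos_sim(text_emb, topic_embs).max()) < threshold

# Example: is_off_topic("How do I overclock my GPU?", ["billing", "account setup"]) -> likely True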

Interactive Exercise

Design Guardrails

For a medical information chatbot, list 3 critical guardrails you would implement and explain why.

Pro Tips
  • Layer guardrails: cheap/fast checks first, expensive checks later (see the sketch after this list)
  • Log guardrail triggers to understand failure modes
  • Test guardrails adversarially before deployment
  • Have fallback responses ready when guardrails trigger
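
One way to realize the layering and logging tips above: run checks in cost order, short-circuit on the first failure, and record every trigger. A minimal sketch; passes_blocklist, passes_toxicity_classifier, and passes_llm_judge are hypothetical helpers standing in for your own implementations:

import logging

logger = logging.getLogger("guardrails")

# Checks ordered cheapest-first; each returns True if the text passes.
CHECKS = [
    ("length_limit", lambda text: len(text) < 4000),   # string op: near-free
    ("blocklist", passes_blocklist),                   # regex scan: cheap
    ("toxicity", passes_toxicity_classifier),          # small model: slower
    ("llm_judge", passes_llm_judge),                   # extra LLM call: slowest
]

def run_guardrails(text):
    for name, check in CHECKS:
        if not check(text):
            # Log every trigger so failure modes can be analyzed later,
            # and stop before paying for the more expensive checks
            logger.warning("guardrail triggered: %s", name)
            return False
    return True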
