Safety / Implementation

Guardrails

Foundational [2/5]
Safety guardrails · LLM guardrails · Content filters

Definition

Guardrails are programmatic constraints and checks placed around LLM systems to keep outputs within acceptable boundaries. They filter inputs and outputs, validate responses, and prevent misuse, acting as safety barriers around the core model.

Guardrails provide defense-in-depth, catching issues that model training alone may not prevent.

Key Concepts

  • Input guardrails: Filter/validate user prompts before processing
  • Output guardrails: Validate/filter model responses before returning
  • Topic rails: Keep conversation on allowed topics
  • Format rails: Ensure outputs match expected structure (see the sketch after this list)
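
A format rail can be as small as a schema check plus a retry. Below is a minimal sketch assuming pydantic v2; the Answer schema, the hypothetical llm_call function, and the retry budget are illustrative assumptions, not a prescribed API:

from pydantic import BaseModel, ValidationError

class Answer(BaseModel):
    summary: str
    confidence: float

def format_rail(llm_call, prompt, max_retries=2):
    # Re-prompt until the model returns JSON matching the expected schema
    for _ in range(max_retries + 1):
        raw = llm_call(prompt)
        try:
            return Answer.model_validate_json(raw)
        except ValidationError:
            prompt += '\nRespond ONLY with JSON: {"summary": "...", "confidence": 0.0}'
    raise ValueError("No schema-conformant output after retries")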

Examples

Architecture
Guardrail System Design
GUARDRAIL ARCHITECTURE:

┌─────────────────────────────────────────────┐
│                 User Input                  │
└─────────────────────┬───────────────────────┘
                      ↓
┌─────────────────────────────────────────────┐
│              INPUT GUARDRAILS               │
│   ┌─────────────────────────────────────┐   │
│   │ • PII Detection                     │   │
│   │ • Toxic Content Filter              │   │
│   │ • Jailbreak Detection               │   │
│   │ • Topic Classification              │   │
│   │ • Input Length Limits               │   │
│   └─────────────────────────────────────┘   │
│       ↓ PASS              ↓ BLOCK           │
└───────┬───────────────────┬─────────────────┘
        ↓                   ↓
┌─────────────────┐  ┌──────────────────┐
│       LLM       │  │  Error Response  │
└────────┬────────┘  └──────────────────┘
         ↓
┌─────────────────────────────────────────────┐
│              OUTPUT GUARDRAILS              │
│   ┌─────────────────────────────────────┐   │
│   │ • Fact Verification                 │   │
│   │ • Harmful Content Check             │   │
│   │ • Format Validation                 │   │
│   │ • Competitor Mention Check          │   │
│   │ • Confidence Threshold              │   │
│   └─────────────────────────────────────┘   │
│       ↓ PASS              ↓ BLOCK           │
└───────┬───────────────────┬─────────────────┘
        ↓                   ↓
┌─────────────────┐  ┌──────────────────┐
│  Final Output   │  │  Fallback/Retry  │
└─────────────────┘  └──────────────────┘
Implementation
Guardrails in Code
NEMO GUARDRAILS (NVIDIA):

# config/rails.co -- rails are defined in Colang and live in the config
# directory, separate from the Python application code
define user ask about competitors
  "What do you think of [competitor]?"
  "Is [competitor] better?"

define bot refuse competitor discussion
  "I'm focused on helping with our products. I can't comment on competitors."

define flow
  user ask about competitors
  bot refuse competitor discussion

# app.py -- load the config and wrap generation with the rails
from nemoguardrails import LLMRails, RailsConfig

config = RailsConfig.from_path("./config")
rails = LLMRails(config)

response = await rails.generate(
    messages=[{"role": "user", "content": user_input}]
)

GUARDRAILS AI LIBRARY:

from guardrails import Guard
from guardrails.hub import ToxicLanguage, DetectPII  # hub validators must be installed separately

guard = Guard().use_many(
    ToxicLanguage(on_fail="exception"),  # raise if toxic language is detected
    DetectPII(pii_entities=["EMAIL_ADDRESS", "PHONE_NUMBER"], on_fail="fix"),  # redact PII
)

validated_output = guard.validate(llm_output)

CUSTOM IMPLEMENTATION:

# Input checks run before the LLM call, output checks after it.
# detect_jailbreak, contains_pii, redact_pii, is_off_topic, etc. are
# placeholders for whatever classifiers or heuristics your stack provides.

def apply_input_guardrails(user_input):
    if detect_jailbreak(user_input):
        return None, "I can't help with that request."
    if contains_pii(user_input):
        user_input = redact_pii(user_input)
    return user_input, None

def apply_output_guardrails(llm_output):
    if is_off_topic(llm_output, allowed_topics):
        return retry_with_topic_constraint()
    if contains_harmful_content(llm_output):
        return "I apologize, let me try again."
    if not matches_format(llm_output, expected_schema):
        return retry_with_format_instruction()
    return llm_output
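
The is_off_topic helper above can be built many ways; one common approach is embedding similarity against the allowed topics. A rough sketch assuming sentence-transformers is available; the model name, topic list, and 0.3 threshold are tunable assumptions:

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

def is_off_topic(text, allowed_topics, threshold=0.3):
    # Off-topic if the text's best cosine similarity to any allowed topic is too low
    text_emb = model.encode(text)
    topic_embs = model.encode(allowed_topics)
    return float(util.cos_sim(text_emb, topic_embs).max()) < threshold

# Example: is_off_topic("How do I overclock my GPU?", ["billing", "account setup"]) -> likely True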

Interactive Exercise

Design Guardrails

For a medical information chatbot, list 3 critical guardrails you would implement and explain why.

Pro Tips
  • Layer guardrails: cheap/fast checks first, expensive checks later (see the sketch after this list)
  • Log guardrail triggers to understand failure modes
  • Test guardrails adversarially before deployment
  • Have fallback responses ready when guardrails trigger
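
One way to realize the layering and logging tips above: run checks in cost order, short-circuit on the first failure, and record every trigger. A minimal sketch; passes_blocklist, passes_toxicity_classifier, and passes_llm_judge are hypothetical helpers standing in for your own implementations:

import logging

logger = logging.getLogger("guardrails")

# Checks ordered cheapest-first; each returns True if the text passes.
CHECKS = [
    ("length_limit", lambda text: len(text) < 4000),   # string op: near-free
    ("blocklist", passes_blocklist),                   # regex scan: cheap
    ("toxicity", passes_toxicity_classifier),          # small model: slower
    ("llm_judge", passes_llm_judge),                   # extra LLM call: slowest
]

def run_guardrails(text):
    for name, check in CHECKS:
        if not check(text):
            # Log every trigger so failure modes can be analyzed later,
            # and stop before paying for the more expensive checks
            logger.warning("guardrail triggered: %s", name)
            return False
    return True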
